ArghPC

Steve Loughran: "Hadoop uses RPC to chat between nodes; everything has custom serialization and the heartbeats include data -tricks that work on a LAN. But they have a hard time keeping clients and clusters in Sync: what makes a workable and efficient protocol for the gigabit LAN in the datacentre is not appropriate for client-cluster comms, not when the clients aren't under control, or when they are a distance away. I'd like to work on a good REST api there -something like S3FS for storage, a pure REST model for Jobs too."

Me too. This issue (RPC) comes up for me from time to time - how to have a tight call model for what are essentially control messages, across a cluster of servers, without getting into 

  • Binary lockstep (eg fine grained RMI or shunting Python pickles around) and the risk of a cyclic dependency so large you can barely see it.   
  • The inability to examine problems with standard tools like wireshark without a decodec. This applies to any binary wire protocol, including base 64'd XML.
  • Designing some cross language type system

I'm down to 3 options:

  • Forget RPC, use HTTP calls to post state
  • Use XMPP messaging and pub/sub
  • Use RPC but with JSON as your wire format.

All of which abide by the notion that if you must ship non-documents, then ship using a handful of data structures (list, dict) and a limited number of scalar types (unicode strings, numbers, iso-dates, booleans). In other words JSON is the sweet spot of type driven interop.

Steve Vinoski: "For years we’ve known RPC and its descendants to be fundamentally flawed, yet many still willingly use the approach. Why? I believe the reason is simply convenience. Regardless of RPC’s well-understood problems, many developers continue to go down the RPC-oriented path because it conveniently fits the abstractions of the popular general-purpose programming languages they limit themselves to using. Making a function or method call to a remote or distributed function, object, or service appear just like any other function or method call allows such developers to stay within the comfortable confines of their language. Those who choose this approach essentially decide that developer convenience and comfort is more important than dealing with hard distribution issues like latency, concurrency, reliability, scalability, and partial failure."

All important, but binary on the wire messages and lockstepped upgrades are a massive problem as well. IOW a core practical issue with RPC is sending non-text around.

It's interesting then, that Facebook thrift has gone into the Apache Incubator. It looks sort of like JSON but has so-90s stuff like signed integer types.

Tags:

    tags:

11 Comments


    So what you're saying is, for binary communications to work the following needs to be solved:

    1. Binary Lockstep. Versioning of binary data structures is big and complex and anytime you want to change you've got to upgrade every participating system.
    2. Binary tools. To debug binary data requires tools to read and write binary data using editors that understand the format.
    3. Binary type system. Ability to describe binary data that is independant of languages.
    4. Binary type system language bindings. Ability to bind a binary type system with various languages simply.
    5. Tight binding. Ability to tightly bind a request/response pair for a set service.

    Is there anything else required to make binary encodings to work in non-trivial distributed systems?


    The same difficulties apply just as much to text formats. Binary vs. text is just a bit-stream format decision. What makes JSON flexible is it represents trees of dictionaries and uses manifest typing for primitive types. They leave it up to the receiver to extract what they want from those data structures. You can represent trees, dictionaries and manifest typing of primitive values in binary too. A binary representation is much easier to parse (unless syntax of your interpreted language happens to correspond exactly to the on-the-wire format).


    One big thing about binary stuff (and often SOAP) is that the tools try to make RPC look just like local PC, then you get unrealistic expectations about what the far end can see. If you print stuff into a json form, you can be fairly sure that the far end will not interpret it as a serialized object.


    Just yesterday I started a wireshark sniff to debug a weird HTTP problem, only to realize I had a HTTPS problem. IOW binary transfer again. Sigh.


    http://code.google.com/p/httpmr/ ?


    json-rpc is ok, if all you need is a json protocol and http transport for doing rpc.


    "The same difficulties apply just as much to text formats. Binary vs. text is just a bit-stream format decision."

    MM - no; because when it comes to dealing with field issues, text can be read by operators, cut and pasted into emails and IM. The whole human toolchain is made available by cleartext.

    "You can represent trees, dictionaries and manifest typing of primitive values in binary too."

    Yes you can, but this misses the point I made above in the way Turing-equivalence arguments miss the point of Ruby, or Lisp.


    "json-rpc is ok, if all you need is a json protocol and http transport for doing rpc."

    rektide - what else might you need?


    "Is there anything else required to make binary encodings to work in non-trivial distributed systems?"

    David, yes - the ability for humans to read binary off the wire.


    Google Protocol Buffers?


    I would second James in saying "Protocol Buffers" I'm currently in the process of evaluating this for our RabbitMQ driven distributed system, and it's nice that one template will generate to multiple languages.


Comments are closed for this entry.