
Under the hood at PubSub

I got into an email conversation with Bob Wyman a while back about the PubSub feed aggregator. With his permission I'm blogging about the PubSub architecture and internal processing model.

Bob asked that I not paint a picture of him as being anti-XML, and I hope I haven't done that - PubSub doesn't strike me as anything other than a great service. For those of you who aren't XML obsessives, Bob has taken some heat in the XML community over the last year for promoting binary infoset approaches. So when I asked if he was using binfosets, he responded:

It's not binfoset exactly. What we've got is a set of machines that talk to the outside world and convert XML to and from the ASN.1 PER binary encodings that we use internally. (We use OSS Nokalva's ASN.1 tools.) The result is great compression as well as extremely fast parsing. In an application like ours, we have to do everything we can to optimize throughput, and while XML is really easy for people to generate, it just takes too much resource to push around and parse. Currently, we're monitoring over 1 million blogs. Since we're still pretty new, we've still got fewer than 10,000 subscriptions, so there is no real load on the system. We're usually matching at a rate of about 2.5 to 3 billion matches per day and the CPU on our matching engine is basically idling (i.e. 3-5% most of the time). This is, of course, in part due to the work we put into optimizing the real-time matching algorithm (we need to match *every* subscription against every new blog entry). However, it is also in part because the matching engine never needs to do the string parsing that XML would require.

It's worth noting that all this is internal to PubSub; the public server I/O is XML.
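Bob doesn't go into how the matching engine itself works beyond saying it avoids string parsing, but one common way to match every subscription against every incoming entry without scanning the whole subscription list is an inverted index over subscription terms. A rough Python sketch of that idea (the structure and names here are my own guesswork, not PubSub's code):

```python
from collections import defaultdict

class Matcher:
    """Toy subscription matcher: maps each keyword to the subscriptions that
    contain it, so an incoming entry only touches subscriptions it shares a term with."""

    def __init__(self):
        self.index = defaultdict(set)   # term -> set of subscription ids
        self.subs = {}                  # subscription id -> required terms

    def subscribe(self, sub_id, terms):
        self.subs[sub_id] = set(terms)
        for term in terms:
            self.index[term].add(sub_id)

    def match(self, entry_text):
        words = set(entry_text.lower().split())
        # Candidate subscriptions share at least one term with the entry.
        candidates = set()
        for word in words:
            candidates |= self.index.get(word, set())
        # A subscription fires only if every one of its terms is present.
        return [s for s in candidates if self.subs[s] <= words]

m = Matcher()
m.subscribe("s1", ["xml", "asn.1"])
m.subscribe("s2", ["python"])
print(m.match("parsing xml with asn.1 per at the edges"))  # -> ['s1']
```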

On XML v Binfosets and the processing model:

My comments should not be read as "anti-XML". I'm simply pointing out a method of working with XML in a high volume environment. Just as people will often convert XML to DOM trees or SAX event streams when processing within a single process or box, what we do is convert to ASN.1 PER when processing within our "box." The fact that our "box" is made up of multiple boxes is, architecturally, no different from what would be the case if we had one thread parsing XML and another working with the DOM or binfoset that resulted from the parse. Our "threads" are running on different machines connected via an internal high-speed network and we pass data between the "threads" as ASN.1 PER-encoded PDUs -- not DOM trees or SAX events.
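To make the shape of that pattern concrete, here's a toy sketch of parsing the XML once at the edge and shipping a compact, length-prefixed binary record to the downstream "thread", which can then unpack it without touching any markup. It's hand-rolled Python rather than ASN.1 PER (the real thing goes through OSS Nokalva's tools), so treat the record layout as illustrative only:

```python
import struct
import xml.etree.ElementTree as ET

def edge_encode(entry_xml):
    """Parse an <entry> once at the boundary and emit a compact binary PDU."""
    root = ET.fromstring(entry_xml)
    title = (root.findtext("title") or "").encode("utf-8")
    link = (root.findtext("link") or "").encode("utf-8")
    # Length-prefixed fields; a stand-in for a proper ASN.1 PER encoding.
    return struct.pack(f"!H{len(title)}sH{len(link)}s",
                       len(title), title, len(link), link)

def engine_decode(pdu):
    """Downstream 'matching engine' side: no XML or string parsing needed."""
    (tlen,) = struct.unpack_from("!H", pdu, 0)
    title = pdu[2:2 + tlen].decode("utf-8")
    (llen,) = struct.unpack_from("!H", pdu, 2 + tlen)
    link = pdu[4 + tlen:4 + tlen + llen].decode("utf-8")
    return title, link

pdu = edge_encode(b"<entry><title>Hello</title><link>http://example.org/1</link></entry>")
print(engine_decode(pdu))  # ('Hello', 'http://example.org/1')
```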

On PubSub metrics:

As it turns out, the problem of monitoring blog traffic is much easier than it might look. Imagine, if you will, that every one of 1 million blogs was updated twice a day -- giving 2 million updates (much more than what really happens). That is still only an average of 23 updates per second. 23 updates per second isn't a tremendous amount of traffic to handle. It is likely that even an "all XML" service could handle such load although such a system would have much less "headroom" than our system does and would need to scale to multiple matching engines sooner than we will. But, hardware is cheap... For most people, buying more hardware will be more cost effective than going through all the complexity and algorithm tuning that we've had to do. We spend a great deal of time working on the throughput since we expect to be getting much higher volumes of traffic from non-blog sources in the future.
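The back-of-envelope arithmetic is easy to check:

```python
blogs = 1_000_000
updates_per_blog_per_day = 2      # Bob's deliberately generous assumption
seconds_per_day = 24 * 60 * 60    # 86,400

print(blogs * updates_per_blog_per_day / seconds_per_day)  # ~23.1 updates/second
```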

The hardware statement is interesting; it seems to align with the Google view of using commodity boxes while keeping the smarts in software.

On scalability:

There are certainly many examples of XML based systems that handle reasonable amounts of traffic with no problem. Thus, it is likely that there aren't going to be many applications that require the kind of optimization effort that we're forced to make. Nonetheless, it should be recognized that there comes a point where it becomes wise to do something other than process XML directly at all points in a system.

On the value of XML:

I'd also like to make sure you know that there is no question about my appreciation of the strengths of XML. There is no question that if we required all our inputs to be in anything other than XML, we would have virtually no input to work with. XML is so easy for people to generate that the net is literally overflowing with the stuff and there is still much more to come. It may be malformed, filled with namespaced additions (which are often no more than noise...), etc., but we can still manage to make sense of most of what we receive. Things would be cleaner if all data came to us in more strictly defined formats, but it is better to get messy data than no data.
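This is exactly the territory that ultra-liberal feed parsers live in. As an illustration (feedparser is my choice of library here, not something Bob mentions), a malformed, namespace-cluttered feed can still yield usable entries:

```python
import feedparser  # Mark Pilgrim's "ultra-liberal" feed parser; tolerates broken XML

# A deliberately messy feed: an unclosed tag plus a stray namespaced element.
messy = """<?xml version="1.0"?>
<rss version="2.0" xmlns:x="http://example.org/noise">
  <channel>
    <title>Messy blog</title>
    <item>
      <title>Still readable</title>
      <x:noise>ignored</x:noise>
      <link>http://example.org/post/1
    </item>
  </channel>
</rss>"""

d = feedparser.parse(messy)
if d.bozo:
    print("feed was malformed, but we can still work with it")
for entry in d.entries:
    # Print whatever titles and links could be recovered from the mess.
    print(entry.get("title"), entry.get("link"))
```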

On future interfaces into PubSub:

We will, in fact, be asking some high volume publishers to send us their data using ASN.1 encodings. However, the encodings we ask for will be directly mappable to XML schemas and XML will always be considered a completely interchangeable encoding format. In this we try to stay encoding-neutral. Also, we are already seeing that more compact encodings may be appropriate when delivering data to devices that are on the end of low-bandwidth connections or that have resource requirements that demand ease of parsing. Also, we'll be sending ASN.1 encoded stuff to and from clients that we write ourselves (while allowing XML to be used if one of those clients talks to someone else's XML based server). Thus, anyone who wants to view our system as "XML only" will be able to do so and anyone who wishes to treat it like an ASN.1 based system will also be able to do so. We will be, as I said before, encoding-neutral.
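As a sketch of what encoding-neutral might look like at the HTTP edge (the media type and record layout are my inventions, not PubSub's interface), the same logical entry can be served as XML or as a compact binary record depending on what the client asked for:

```python
import struct
from xml.sax.saxutils import escape

def serialize_entry(title, link, accept):
    """Return (content_type, body) for one entry, honouring the client's Accept header.
    'application/x-compact-entry' is an invented media type standing in for an
    ASN.1 PER encoding; XML remains the default, fully interchangeable form."""
    if "application/x-compact-entry" in accept:
        t, l = title.encode("utf-8"), link.encode("utf-8")
        body = struct.pack(f"!H{len(t)}sH{len(l)}s", len(t), t, len(l), l)
        return "application/x-compact-entry", body
    xml = f"<entry><title>{escape(title)}</title><link>{escape(link)}</link></entry>"
    return "application/xml", xml.encode("utf-8")

# A low-bandwidth device asks for the compact form; everyone else gets XML.
print(serialize_entry("Hello", "http://example.org/1", "application/x-compact-entry"))
print(serialize_entry("Hello", "http://example.org/1", "application/xml"))
```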

The main thing I take from Bob's explanations is that PubSub, along with being a fine service, is doing a good job of separating interoperability issues from performance ones, by sticking to XML at the system/web boundary and leveraging ASN.1 PER internally. That helps reduce the XML-Binfoset controversy to a kerfuffle. PubSub is not the only one working along these lines - Antarctica (Tim Bray is on the board) also consumes and produces XML, but internally converts the markup to structures optimized for the algorithms required for generating visual maps. Similarity Systems' Athanor lets you describe data matching plans in XML, but again converts to optimized data structures when it comes to making matches. The key mistake in interop terms seems to be wanting to distort XML to fit the binary/API worldview, or to replace it wholesale at the system edges.


February 28, 2004 11:55 AM

Comments

Bob Wyman
(February 28, 2004 07:22 PM)

Bill, Your trackback links seem to be broken, so I'm using a comment. I've expanded a bit on what I said at my blog.

See: http://bobwyman.pubsub.com/main/2004/02/xml_asn1_and_th.html

Note: We've expanded beyond simply providing subscriptions to content from the over 1 million weblogs that we monitor. Currently, we also support quite a number of newsgroups as well as the SEC Edgar Financial Filings information. We'll be adding more sources in the future. Thus, the number of messages we handle has gone up significantly since our email exchange. In fact, blog content is now less than 50% of what we carry...

I remain confident that if we had been relying on a "pure" XML system, we would require significantly more hardware than we currently do. (We still only use one Intel-based Linux box for all subscription "matching" at this point, and its CPU utilization never goes above 5-10%... Given that we're typically running at a rate of between 4 and 8 billion matches per day, we're happy that parsing and encoding overhead is virtually non-existent in our internal system. All XML handling is done at the "edges" -- where it belongs.) The marriage of XML and ASN.1 is working beautifully for us.

bob wyman
