Scaling XMPP and Pub/Sub

Jack Moffit: "Sorry, Twitter. Until we see some answers, you don’t have data, just a big mouth."

I think Jack Moffit, always excellent, is being hard on Alex Payne and the Twitter gang. He is criticising Twitter for restricting access to the firehose - the XMPP stream of events, "tweets" in Twitter parlance. Jack alludes to a strategic reason for this: Twitter 'own' the data and therefore should own the derivative value obtained from analysing or reorganising it.

"I don’t know the exact time that they started pruning the list of consumers of the firehose, but to me it seemed like this starting happening after Summize was acquired or around that time. The logical conclusion from this is that Twitter does not want more interesting things being built on top of its data."

I'm guessing the reason is not just that, although we do know that Twitter will be announcing a business plan a few quarters out. The other reason might be scale.

Scale? It's received wisdom that heavy HTTP polling is stupid and wrong, whereas push is more efficient. The problem is there isn't much science or shared field experience on what it means to run a public XMPP data and notification endpoint with a lot of subscribers. When I say a lot, I mean 250,000 to 1M clients holding open connections to your server(s). Issues I've seen are that load balancing becomes a problem, db access costs dominate login times for clients, and XMPP server clustering isn't as far along as I'd thought it was. Scaling XMPP does not appear to be a commodity problem the way HTTP scaling is - you are back down to looking at whether the servers are using epoll/nio; whether load balancing should be done by clients (the load balancers themselves actually get in the way); how long it takes to log a user in, set up presence, rosters and so on; and what the cluster topology's graph connectivity measure is (S2S doesn't seem to be the answer). It's like being back in 2000, wistfully reading Dan Kegel's c10k page.
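On the client-side load balancing point, the idea is that each client deterministically picks its own front-end server, so no balancer sits in the path of a long-lived connection. A minimal sketch, assuming a static, hypothetical server list (nothing here reflects how any real XMPP deployment does it):

```python
import hashlib

# Hypothetical static list of XMPP front-end servers known to the client.
SERVERS = ["xmpp1.example.org", "xmpp2.example.org", "xmpp3.example.org"]

def pick_server(jid: str) -> str:
    """Deterministically map a user JID to a server, so clients spread
    themselves across the cluster without a load balancer in the path."""
    digest = hashlib.sha1(jid.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SERVERS)
    return SERVERS[index]

print(pick_server("alice@example.org"))
```

Because the choice is a pure function of the JID, the same client always lands on the same server across reconnects, which matters when sessions and rosters are cached per node.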

My suspicion is that pushing notifications out to a number of subscribers (Sn) where that number is large is not yet a panacea for web poll scaling issues, because there is a latent asymmetry in the cost of pushing events to ever-larger audiences, even though push is more performant and lower-latency for smaller values of Sn. Service providers will need to look carefully at graph theory, flooding and gossip/propagation models to get pub/sub notifications to web-scale delivery - and at that point we'll be half-way to either a peer-to-peer model, or Usenet - take your pick ;)
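The asymmetry can be seen with back-of-envelope arithmetic: polling load is capped by the poll interval no matter how busy the feed is, while push load grows with the event rate, since every event fans out to every subscriber. A toy model (all rates are made-up illustrative numbers, not measurements):

```python
def poll_msgs_per_min(subs: int, poll_rate: float = 6.0) -> float:
    # Polling: each subscriber polls at a fixed rate (e.g. every 10s),
    # regardless of how many events actually occurred.
    return subs * poll_rate

def push_msgs_per_min(subs: int, event_rate: float) -> float:
    # Push: every event is delivered to every subscriber, so load
    # scales with the event rate, not the poll interval.
    return subs * event_rate

subs = 100_000
for event_rate in (1.0, 6.0, 60.0):
    print(event_rate, poll_msgs_per_min(subs), push_msgs_per_min(subs, event_rate))
```

For a quiet feed (1 event/min) push wins easily; for a firehose-style feed (60 events/min) push delivers an order of magnitude more messages than polling ever would, before counting the cost of holding all those connections open. That crossover is the asymmetry.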



    For what it's worth, Twitter wasn't and isn't pushing the firehose to 250k clients. Guessing, since I don't know exactly how many clients there were when they shut it down, they were pushing to fewer than 25 clients. They shouldn't be planning to push the full firehose to more than 1000 clients, because in order to truly make use of the firehose, a service would need to "otherwise receive" some very large percentage of updates; someone tracking every occurrence of "obama" does not need to see every update. n remains small for the firehose.

    Also, a small clarification: Twitter themselves don't need to keep open all those client connections. XMPP's use of s2s means that the connection characteristics look closer to Usenet than p2p. Further, I would never imagine that Twitter (nor any other large web service) would provide client connections (c2s), and thus they are not subject to the various roster / login / database costs.

    Ultimately, when I built the firehose, it was an experiment. I honestly don't see a lot of value in it for Twitter, beyond a low-cost way to provide data to large providers who truly care about *all* the data, along the lines of LiveJournal's Atom feed.

    I've always been much more interested in the real-time track (i.e., faster than what Summize provides) and further, the federated web services that are made possible by XMPP. The characteristics of those services are, as you say, closer to P2P or Usenet. I'd actually venture that they're the same as SMTP, just with more, smaller messages running over it. Thankfully, scaling SMTP is a well-understood problem at this point. ;-)

    Probably what will have to happen is minimizing what needs to be stored on each server, and using a layer of fan-out machines (that speak to the clients) in front of source machines.
    Depending upon total traffic, multicast might be useful.
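That two-tier fan-out idea can be sketched as a toy model: a source machine never touches clients directly, it only hands each event to a handful of fan-out workers, and each worker delivers to its own slice of subscribers. All class and variable names here are hypothetical, for illustration only:

```python
from collections import defaultdict

class FanoutWorker:
    """Owns a slice of the client connections and delivers events to them."""
    def __init__(self):
        self.subscribers = []               # clients assigned to this worker
        self.delivered = defaultdict(list)  # client -> events received

    def deliver(self, event):
        for sub in self.subscribers:
            self.delivered[sub].append(event)

class Source:
    """Publishes events; fans out to workers, never to clients directly."""
    def __init__(self, workers):
        self.workers = workers

    def subscribe(self, client):
        # Assign each client to one worker by simple hashing.
        self.workers[hash(client) % len(self.workers)].subscribers.append(client)

    def publish(self, event):
        # The source's fan-out cost is bounded by the worker count,
        # not the subscriber count.
        for w in self.workers:
            w.deliver(event)

workers = [FanoutWorker() for _ in range(4)]
source = Source(workers)
for c in ("alice", "bob", "carol"):
    source.subscribe(c)
source.publish("tweet-1")
```

The design point is the one in the comment: the source's per-event work stays constant as subscribers grow, and only the fan-out tier needs to scale horizontally with the client count.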