WWW cubed: syndication and scale
The rise of RSS reminds us once again that the web doesn't scale, but it's not time to throw the towel in yet.
The boy who cried RSS
Recently Robert Scoble of Microsoft announced that RSS, the syndication format used for news and weblog feeds, doesn't scale. He said this based on costs incurred by blogs.msdn.com, a hugely popular aggregator of weblogs and news feeds oriented towards Microsoft technology. The problem was sufficiently severe that the sponsors of blogs.msdn.com took drastic measures to reduce the size of the content being accessed in order to lower the bandwidth costs, a step that has annoyed some users. Scoble concluded that RSS is "not scalable when 10's of thousands of people start subscribing to thousands of separate RSS feeds and start pulling down those feeds every few minutes (default aggregator behavior is to pull down a feed every hour)".
However, if there are problems with scale, they do not lie at the level of the RSS or (the more recent) Atom formats. The IETF Atom working group has been discussing this recently, and in that discussion both the issue and the potential solutions clearly lie with HTTP, the protocol used to serve RSS. The characteristics of RSS use are only a manifestation of a deeper problem with the Web. We need to look to the Web's infrastructure and design for the causes.
It doesn't scale
The argument, "it doesn't scale", in its worst form is an invitation not to think, and can be something of a dangerous and loaded accusation in technical communities, not unlike the way "you are in league with the devil" used to be in village communities. It's certainly not an accusation to be throwing around casually. The suggestion is that "Doesn't scale" == "Bad", and that scale is an inherent goodness of some systems and not of others - a kind of zero sum game for technology. The truth is not so simple - some technologies do have better scaling characteristics than others, but most technologies can be made sufficiently scalable with some work. Werner Vogels of Amazon has said repeatedly that the implicit polling style of the Web doesn't scale. But this doesn't mean he's lazy or unthoughtful - far from it. Before joining Amazon, Vogels did research on massively distributed systems; we can thus imagine he probably has a different idea of scale from most of us, in the same way Michael Schumacher has a different idea of what a fast car is.
The truth is most of us don't need the scale that Amazon, eBay or blogs.msdn.com do. It would be silly and wasteful to buy all that bandwidth and computational horsepower and then watch it idle for 99.9% of the time. Yet this is what many people do or are told to do. They buy a lot of expensive hardware, typically to scale (and also to have physical redundancy), but that hardware is doing nothing most of the time and represents in networking terms "overcapacity", or in financial terms, sunk capital costs (if you think depreciation on a car fleet is bad, server infrastructure will have you crying into your spreadsheet).
There are precedents for this kind of problem. The electrical industry in its early days had difficulty in catering for demand spikes rather than the average - power plants had to be designed to provide for maximum demand - but most of the time demand was minimal and the power plants were losing money. Electricity was not something easily or efficiently stockpiled like oil or coal, so the use of battery storage wasn't a viable option. Early workarounds included the invention of the electrical consumer goods industry, so that we would have a reason to consume electricity around the clock and smooth out demand. The real breakthrough lay in the development of national grids that allowed excess to flow to wherever it was needed. That is why today in some places you can get your meter to run backwards if you feed a surplus of electricity into the grid. Today, a grid for computing is a very popular, well-researched and well-funded idea. But it's not clear yet that Grid Computing, as it's known, will allow applications to function untethered from the limitations of bandwidth and computation, if only because application data is more localized and biased than electricity, and as such is less interchangeable - information is not yet a currency. It's also not in everyone's commercial interest to decouple applications to that level from the infrastructure they run on - the evolution of a computing grid can be expected to be fractious.
This is not news
Back to HTTP. Anyone who has worked with HTTP for a while will know it doesn't react well to traffic spikes. On the average the HTTP Web has scaled very well in terms of its reach (it's a global network phenomenon). On the individual level of sites and site owners, it has not proven to scale as well. The problem has at least two names: the Curse of Popularity and the Slashdot Effect.
For the Web to scale to its current levels has required both significant individual investment in servers and a massive investment and deployment of an almost invisible system of server caches and storage networks. To avail of this other network you pay handsomely. The result is that what most people think of as the web (web sites, out there), is in fact a logical and abstract architecture. Physically, due to caching networks and any number of tricks to keep things running, it works rather differently.
Even so, the characteristics of news and weblog aggregation have the potential to overwhelm what has been done so far. This is because what has been done so far was done to cater for human use of the web, not machines. Humans are very slow at accessing the web, but have always had the advantage of being able to read semi-structured HTML markup. Machine reading of HTML, commonly known as "scraping", has in the past been the province of specialist tools and search engines such as Google. The advance of RSS and Atom markup has made reading content much easier for machines and as a result has seen a rise in automated applications that can and do download content at far greater frequencies than before (indeed the author of this piece has claimed in the past that the web and organisational intranets would come under increasing pressure due to the order of magnitude increases in traffic resulting from further automation). The impression users of RSS aggregators are left with is of a push medium, or a semi-realtime update of news and content delivered direct to their computer. But that's the swan above the water line. Below the water line the aggregator is paddling furiously, frequently connecting and downloading content from dozens or even hundreds of sites, doing many times a day what would take a human days to do. It's as if the number of web users has started to grow exponentially again as it did in the mid-Nineties. However, much of the time this results in the same content being downloaded repeatedly on the off chance that something has changed; a case of busy work.
Solutions beyond eminently clever and expensive caching techniques have been varied. Web servers based on different programming approaches to the popular servers (Apache, IIS) can scale to huge numbers of users, but these are not widely used, and can end up making matters complicated for application developers. In the syndication case, all a more capable web server means is that you will be hit for even greater bandwidth usage charges. This is because the problem is not so much supporting the number of visitors, but the number of times they are visiting.
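To see why visit frequency matters more than audience size, a back-of-the-envelope calculation helps. The sketch below is illustrative only: the subscriber count, feed size and polling intervals are hypothetical figures, not measurements from blogs.msdn.com or any other site, and the function name is made up for the example.

```python
def daily_feed_bandwidth_bytes(subscribers: int, polls_per_day: int, feed_bytes: int) -> int:
    """Bytes served per day if every poll downloads the whole feed."""
    return subscribers * polls_per_day * feed_bytes


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
SUBSCRIBERS = 10_000
FEED_BYTES = 50_000  # a 50 kB feed document

hourly = daily_feed_bandwidth_bytes(SUBSCRIBERS, 24, FEED_BYTES)
every_five_minutes = daily_feed_bandwidth_bytes(SUBSCRIBERS, 288, FEED_BYTES)

print(hourly / 1e9)                  # gigabytes per day, one poll per hour
print(every_five_minutes / 1e9)      # gigabytes per day, one poll every five minutes
print(every_five_minutes // hourly)  # traffic multiplier for the same audience
```

Under these assumptions the audience is identical in both cases; only the polling interval changes, and the bill changes with it.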
In theory it would be much more efficient if the server could tell the aggregator what has changed. The thinking then tends to focus on dropping the Web and using alternative network protocols, such as the much maligned peer to peer (P2P) file sharing systems. It has sometimes been claimed of such systems that their ability to scale increases as more users (peers) join the network. Of these, BitTorrent represents perhaps the most viable candidate for integration into RSS usage - indeed BitTorrent was created to solve the problem of the Curse of Popularity. Another possibility is the use of instant messaging technologies such as XMPP, as pioneered by the PubSub aggregator service. Yet another is the old NNTP system on which Usenet runs. However, the key attraction of HTTP is its ubiquity and vast reach - people love using it and administrators let it past their firewalls, something that can't always be said for IM and P2P protocols.
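The difference between an aggregator asking repeatedly and a server telling it what has changed can be shown with a toy model. This is a sketch under stated assumptions, not a model of any of the protocols named above: the `Feed` class, the `simulate` function, the three clients and the two publishing times are all invented for illustration, and it ignores the hard parts of push such as keeping connections open and handling failed deliveries.

```python
from typing import Callable, List, Tuple


class Feed:
    """A toy feed that clients can poll, or that can push to subscribers."""

    def __init__(self) -> None:
        self.entries: List[str] = []
        self.requests_served = 0     # full downloads triggered by polling
        self.notifications_sent = 0  # messages sent by pushing
        self._subscribers: List[Callable[[str], None]] = []

    def poll(self) -> List[str]:
        # Every poll costs the server a response, changed or not.
        self.requests_served += 1
        return list(self.entries)

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, entry: str) -> None:
        self.entries.append(entry)
        # Push only the new entry, and only when there is one.
        for callback in self._subscribers:
            self.notifications_sent += 1
            callback(entry)


def simulate(clients: int = 3, hours: int = 24,
             publish_hours: Tuple[int, ...] = (5, 17)) -> Tuple[int, int]:
    """Return (responses served by polling, notifications sent by pushing)."""
    polled, pushed = Feed(), Feed()
    inboxes: List[List[str]] = [[] for _ in range(clients)]
    for inbox in inboxes:
        pushed.subscribe(inbox.append)
    for hour in range(hours):
        if hour in publish_hours:
            polled.publish(f"entry at hour {hour}")
            pushed.publish(f"entry at hour {hour}")
        for _ in range(clients):
            polled.poll()  # each client polls once an hour
    return polled.requests_served, pushed.notifications_sent


print(simulate())
```

In the polling case the server's work grows with the number of clients multiplied by how often they ask; in the push case it grows with the number of clients multiplied by how often something actually changes.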
The most advanced thinking that doesn't involve throwing out the Web is probably Rohit Khare's PhD thesis [pdf], which suggests an "eventing", or push style, extension to the Web model. An early example of this approach, called mod_pubsub, in which the server calls back to the connected client instead of the client initiating each time, is available as open source. One of HTTP's designers, Roy Fielding, is rumoured to be working on a new protocol that could feature support for easing the load on servers.
It's common to hear an argument along the lines that you should expect to pay for popularity on the Web. This is specious and self-serving, for two reasons. First, you only pay for popularity because the Web is architected in a way that massively favours consumers of content over producers. It's a series of design decisions that has things teed up this way for the Web, not anything inherent to the Internet itself. Other protocols such as JXTA and BitTorrent have more even-handed characteristics. Second, it implicitly assumes producers in some way should have to pay for the right to be popular, as if popularity were due a levy, or a tax.
This aside, given the way the Web is today, you will pay for popularity whether you like it or not. There are many arcane sounding things you can do to stave off the inevitable - gzip compression, delta-encoding, ETags, Last-Modified headers, conditional GETs - indeed, blogs.msdn has received some sharp criticism (inside and outside Microsoft) for not doing some of these things. But these do not address the fundamental problem - on the Web the burden of cost is borne by the producer.
It's notable that such costs will tend to squeeze out the smaller, poorer voices. This alone should be sufficient to concern anyone interested in a democratic and globally accessible medium. Often these are just the voices one wants to hear. Yet it's been like this since the web began; people who have something to say will stop saying it when it costs too much. The medium seems almost designed to manoeuvre a site into displaying advertisements to pay its way (ironically, disliked as they are by many web-savvy technologists). But the advent of RSS feeds has upped the stakes enough that even the biggest content producers on the planet are concerned about the costs. It shouldn't surprise anyone that if this problem is addressed, it will be because those who can afford to pay will refuse to.
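Two of those arcane sounding things, conditional GETs with ETags and gzip compression, are simple enough to sketch. The Python below is a minimal illustration, not how blogs.msdn.com or any particular server does it: `make_etag` and `respond` are hypothetical names, headers are modelled as a plain dictionary, and the `If-None-Match` check is a bare equality test. A real server would also have to treat header names case-insensitively, accept lists of validators, and give differently encoded responses distinct validators.

```python
import gzip
import hashlib
from typing import Dict, Tuple


def make_etag(body: bytes) -> str:
    """Derive a validator from the content, so it changes only when the feed does."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(request_headers: Dict[str, str], body: bytes) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, response headers, payload) for a feed request."""
    etag = make_etag(body)
    headers = {"ETag": etag}

    # Conditional GET: the client already holds this version, so send no body.
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""

    # Otherwise send the feed, compressed if the client says it can cope.
    if "gzip" in request_headers.get("Accept-Encoding", ""):
        headers["Content-Encoding"] = "gzip"
        headers["Vary"] = "Accept-Encoding"
        body = gzip.compress(body)
    return 200, headers, body


# A made-up feed body, repeated so that compression has something to work on.
feed = b"<item><title>Hello</title></item>" * 200

# First visit: full (compressed) download, and the client remembers the ETag.
status, headers, payload = respond({"Accept-Encoding": "gzip"}, feed)

# Later visit with nothing changed: the client presents the ETag and gets 304.
status_again, _, payload_again = respond({"If-None-Match": headers["ETag"]}, feed)

print(status, len(feed), len(payload))
print(status_again, len(payload_again))
```

Note what this does and doesn't buy you: the unchanged feed is no longer shipped over and over, but the producer's server still has to answer every single poll.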
Responsibility
In all the talk about HTTP scalability, it's easy to forget another 'ility' - responsibility. Sean McGrath in his work on eGovernment systems has highlighted an interesting consequence of the kind of client server architecture that HTTP is predicated on - that the responsibility of accessing, sending and downloading content is borne by the client application and not the server*. If the server is not up, that's bad, but the job of getting the data moving around is squarely the client's. When you switch things around to a push based medium, the responsibility of delivery is now borne in part by the server owner: "The question of responsibility – especially in the event of operational issues arising – becomes complex. With a pull delivery model on the other hand, organisational boundaries are crisp and clear." This may not matter for consumer applications, but a surprising number of important business systems and services are now based on HTTP data transfers. And many people believe that syndication technology like RSS and Atom will also be used for commercially consequential exchanges in the b2b, or "business to business", arena. Switching from a polling to a pushing mode also means a switching of responsibilities, and this might in time have far-reaching consequences where cost-efficiency is traded for risks, legal and financial. One day, your online bank might be morally and technically culpable for getting your bank statements to your computer. In that case, expect to sign even more of your rights away in the fine print.
* Disclosure: Sean McGrath is the CTO of the author's current employer, Propylon.



