
WWW cubed: syndication and scale

The rise of RSS reminds us once again that the web doesn't scale, but it's not time to throw the towel in yet.

The boy who cried RSS

Recently Robert Scoble of Microsoft announced that RSS, the syndication format used for news and weblog feeds, doesn't scale. He said this based on costs incurred by blogs.msdn.com, a hugely popular aggregator of Microsoft technology-oriented weblogs and news feeds. The problem was sufficiently severe that the sponsors of blogs.msdn.com took drastic measures to reduce the size of the content being accessed in order to lower the bandwidth costs, a step that has annoyed some users. Scoble concluded that RSS is "not scalable when 10's of thousands of people start subscribing to thousands of separate RSS feeds and start pulling down those feeds every few minutes (default aggregator behavior is to pull down a feed every hour)".

However, if there are problems with scale, they do not lie at the level of the RSS or (the more recent) Atom formats. The Atom IETF working group has been discussing this recently, and in that discussion both the issue and the potential solutions clearly lie with the protocol used to serve RSS: HTTP. The characteristics of RSS use are only a manifestation of a deeper problem with the Web. We need to look at the Web's infrastructure and design for the causes.

It doesn't scale

The argument, "it doesn't scale", in its worst form is an invitation not to think, and can be something of a dangerous and loaded accusation in technical communities, not unlike the way "you are in league with the devil" used to be in village communities. It's certainly not an accusation to be throwing around casually. The suggestion is that "Doesn't scale" == "Bad", and that scale is an inherent goodness of some systems and not of others - a kind of zero-sum game for technology. The truth is not so simple - some technologies do have better scaling characteristics than others, but most technologies can be made sufficiently scalable with some work. Werner Vogels of Amazon has said repeatedly that the implicit polling style of the Web doesn't scale. But this doesn't mean he's lazy or unthoughtful - far from it. Before joining Amazon, Vogels did research on massively distributed systems; we can thus imagine he has a different idea of scale from most of us, in the same way Michael Schumacher has a different idea of what a fast car is.

The truth is most of us don't need the scale that Amazon, Ebay or blogs.msdn.com do. It would be silly and wasteful to buy all that bandwidth and computational horsepower and then watch it idle for 99.9% of the time. Yet this is what many people do or are told to do. They buy a lot of expensive hardware, typically to scale (and also to have physical redundancy), but that hardware is doing nothing most of the time and represents, in networking terms, "overcapacity", or in financial terms, sunk capital costs (if you think depreciation on a car fleet is bad, server infrastructure will have you crying into your spreadsheet).

There are precedents for this kind of problem. The electrical industry in its early days had difficulty catering for demand spikes rather than the average - power plants had to be designed to provide for maximum demand, but most of the time demand was minimal and the power plants were losing money. Electricity was not something easily or efficiently stockpiled like oil or coal, so battery storage wasn't a viable option. Early workarounds included the invention of the electrical consumer goods industry, so that we would have a reason to consume electricity around the clock and smooth out demand. The real breakthrough lay in the development of national grids that allowed excess to flow to wherever it was needed. That is why today in some places you can get your meter to run backwards if you feed a surplus of electricity into the grid. Today, a grid for computing is a very popular, well-researched and well-funded idea. But it's not clear yet that Grid Computing, as it's known, will allow applications to function untethered from the limitations of bandwidth and computation, if only because application data is more localized and biased than electricity, and as such is less interchangeable - information is not yet a currency. It's also not in everyone's commercial interest to decouple applications to that level from the infrastructure they run on - the evolution of a computing grid can be expected to be fractious.

This is not news

Back to HTTP. Anyone who has worked with HTTP for a while will know it doesn't react well to traffic spikes. On average the HTTP Web has scaled very well in terms of its reach (it's a global network phenomenon). At the individual level of sites and site owners, it has not proven to scale as well. The problem has at least two names: the Curse of Popularity, and the Slashdot Effect.

For the Web to scale to its current levels has required both significant individual investment in servers and a massive investment in, and deployment of, an almost invisible system of server caches and storage networks. To avail of this other network you pay handsomely. The result is that what most people think of as the web (web sites, out there) is in fact a logical and abstract architecture. Physically, due to caching networks and any number of tricks to keep things running, it works rather differently.

Even so, the characteristics of news and weblog aggregation have the potential to overwhelm what has been done so far. This is because what has been done so far was done to cater for human use of the web, not machine use. Humans are very slow at accessing the web, but have always had the advantage of being able to read semi-structured HTML markup. Machine reading of HTML, commonly known as "scraping", has in the past been the province of specialist tools and search engines such as Google. The advance of RSS and Atom markup has made reading content much easier for machines, and as a result has seen a rise in automated applications that can and do download content at far greater frequencies than before (indeed the author of this piece has claimed in the past that the web and organisational intranets would come under increasing pressure due to the order-of-magnitude increases in traffic resulting from further automation). The impression users of RSS aggregators are left with is of a push medium, or semi-realtime updates of news and content delivered direct to their computer. But that's the swan above the water line. Below the water line the aggregator is paddling furiously, frequently connecting to and downloading content from dozens or even hundreds of sites, doing many times a day what would take a human days to do. It's as if the number of web users has started to grow exponentially again, as it did in the mid-Nineties. However, much of the time this results in the same content being downloaded repeatedly on the off-chance that anything has changed; a case of busy work.
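
To make the busy work concrete, here is a minimal sketch in Python of what a naive aggregator loop amounts to (the feed URLs and the process() step are invented for illustration): every feed is fetched in full on every pass, whether or not anything has changed since the last one.

    import time
    import urllib.request

    # Hypothetical subscription list; a real aggregator might carry hundreds of these.
    FEEDS = [
        "http://example.org/index.rss",
        "http://example.net/atom.xml",
    ]

    POLL_INTERVAL = 30 * 60  # seconds between passes

    def process(feed_bytes):
        # Placeholder: parse the feed and diff it against what we already have.
        pass

    def poll_once():
        for url in FEEDS:
            # An unconditional GET: the full feed document comes down the wire
            # every time, even when nothing has changed.
            with urllib.request.urlopen(url) as response:
                feed_bytes = response.read()
            process(feed_bytes)

    if __name__ == "__main__":
        while True:
            poll_once()
            time.sleep(POLL_INTERVAL)

Multiply that loop by the number of subscribers and the number of feeds they watch, and the producer's bandwidth bill follows directly.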

Solutions beyond eminently clever and expensive caching techniques have been varied. Web servers based on different programming approaches from the popular servers (Apache, IIS) can scale to huge numbers of users, but these are not widely used, and can end up making matters complicated for application developers. In the syndication case, all a more capable web server means is that you will be hit for even greater bandwidth usage charges. This is because the problem is not so much supporting the number of visitors, but the number of times they are visiting.

In theory it would be much more efficient if the server could tell the aggregator what has changed. The thinking then tends to focus on dropping the Web and using alternative network protocols, such as the much-maligned peer-to-peer (P2P) file sharing systems. It has sometimes been claimed of such systems that their ability to scale increases as more users (peers) join the network. Of these, Bittorrent represents perhaps the most viable candidate for integration into RSS usage - indeed Bittorrent was created to solve the problem of the Curse of Popularity. Another possibility is the use of instant messaging technologies such as XMPP, as pioneered by the PubSub aggregator service. Yet another is the old NNTP system on which Usenet runs. However, the key attraction of HTTP is its ubiquity and vast reach - people love using it and administrators let it past their firewalls, something that can't always be said for IM and P2P protocols.

The most advanced thinking that doesn't involve throwing out the Web is probably Rohit Khare's PhD thesis [pdf], which suggests an "eventing", or push-style, extension to the Web model. An early example of this approach, where the server calls back to the connected client instead of the client initiating each time, is available as open source under the name mod_pubsub. One of HTTP's designers, Roy Fielding, is rumoured to be working on a new protocol that could feature support for easing the load on servers.
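
To illustrate the shape of that inversion - this is not mod_pubsub's actual interface, and the event endpoint below is invented - here is a sketch in Python: the client opens one long-lived HTTP connection and acts only when the server writes a notification onto it, instead of repeatedly asking whether anything has changed.

    import urllib.request

    # Hypothetical event endpoint; real eventing designs differ in detail.
    EVENT_URL = "http://example.org/events?topic=weblog-updates"

    def handle_update(feed_url):
        # Only now does the client fetch the one feed that actually changed.
        print("changed:", feed_url)

    def listen():
        # One long-lived connection replaces the polling loop.
        with urllib.request.urlopen(EVENT_URL) as stream:
            for raw_line in stream:
                line = raw_line.decode("utf-8").strip()
                if line:
                    # Each line is assumed to name a feed that has changed.
                    handle_update(line)

    if __name__ == "__main__":
        listen()

The point of the sketch is the change of direction: the server, not the client, decides when data moves.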

It's common to hear an argument along the lines that you should expect to pay for popularity on the Web. This is specious and self-serving, for two reasons. First, you only pay for popularity because the Web is architected in a way that massively favours consumers of content over producers. It's a series of design decisions that has things teed up this way for the Web, not anything inherent to the Internet itself. Other protocols such as JXTA and Bittorrent have more even-handed characteristics. Second, it implicitly assumes that producers should in some way have to pay for the right to be popular, as if popularity were due a levy, or a tax.

This aside, given the way the Web is today, you will pay for popularity whether you like it or not. There are many arcane-sounding things you can do to stave off the inevitable - gzip compression, delta encoding, ETags, Last-Modified headers, conditional GETs - indeed, blogs.msdn has received some sharp criticism (inside and outside Microsoft) for not doing some of these things. But these do not address the fundamental problem - on the Web the burden of cost is borne by the producer.
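
As a concrete illustration of the conditional GET and compression techniques (a sketch in Python, against an invented feed URL): the client remembers the ETag and Last-Modified values from its previous fetch and sends them back, so an unchanged feed costs the producer a 304 status and a few headers rather than the whole document.

    import gzip
    import urllib.error
    import urllib.request

    FEED_URL = "http://example.org/index.rss"  # hypothetical feed

    # Validators and content remembered from the previous successful fetch.
    cached_etag = None
    cached_last_modified = None
    cached_body = b""

    def fetch_feed():
        global cached_etag, cached_last_modified, cached_body
        request = urllib.request.Request(FEED_URL)
        request.add_header("Accept-Encoding", "gzip")  # ask for a compressed transfer
        if cached_etag:
            request.add_header("If-None-Match", cached_etag)
        if cached_last_modified:
            request.add_header("If-Modified-Since", cached_last_modified)
        try:
            with urllib.request.urlopen(request) as response:
                body = response.read()
                if response.headers.get("Content-Encoding") == "gzip":
                    body = gzip.decompress(body)
                # Remember the validators for the next conditional request.
                cached_etag = response.headers.get("ETag")
                cached_last_modified = response.headers.get("Last-Modified")
                cached_body = body
        except urllib.error.HTTPError as err:
            if err.code != 304:
                raise
            # 304 Not Modified: nothing new, so reuse the cached copy and spare
            # the producer the cost of re-sending the feed.
        return cached_body

Note that the aggregator still polls; this only trims the cost of each poll, it doesn't change who bears it.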

It's notable that such costs will tend to squeeze out the smaller, poorer voices. This alone should be enough to concern anyone interested in a democratic and globally accessible medium. Often these are just the voices one wants to hear. Yet it's been like this since the web began; people who have something to say will stop saying it when it costs too much. The medium seems almost designed to manoeuvre a site into displaying advertising to pay its way (ironically, disliked as ads are by many web-savvy technologists). But the advent of RSS feeds has upped the stakes enough that even the biggest content producers on the planet are concerned about the costs. It shouldn't surprise anyone that if this problem is addressed, it will be because those who can afford to pay will refuse to.

Responsibility

In all the talk about HTTP scalability, it's easy to forget another 'ility' - responsibility. Sean McGrath, in his work on eGovernment systems, has highlighted an interesting consequence of the kind of client-server architecture that HTTP is predicated on - that the responsibility for accessing, sending and downloading content is borne by the client application and not the server*. If the server is not up, that's bad, but the job of getting the data moving around is squarely the client's. When you switch things around to a push-based medium, the responsibility for delivery is now borne in part by the server owner: "The question of responsibility especially in the event of operational issues arising becomes complex. With a pull delivery model on the other hand, organisational boundaries are crisp and clear." This may not matter for consumer applications, but a surprising number of important business systems and services are now based on HTTP data transfers. And many people believe that syndication technology like RSS and Atom will also be used for commercially consequential exchanges in the b2b, or "business to business", arena. Switching from a polling to a pushing mode also confers a switching of responsibilities, and this might in time have far-reaching consequences where cost-efficiency is traded for risks, legal and financial. One day, your online bank might be morally and technically culpable for getting your bank statements to your computer. In that case, expect to sign even more of your rights away in the fine print.


* Disclosure: Sean McGrath is the CTO of the author's current employer, Propylon.


September 12, 2004 02:41 PM

Comments

Randy Charles Morin
(September 12, 2004 05:03 PM #)

Another problem that exist only in the mind of bad programmers.

Bill de hÓra
(September 12, 2004 05:52 PM #)

"Another problem that exist only in the mind of bad programmers."

Randy, I don't understand; can you explain?

Randy Charles Morin
(September 12, 2004 06:52 PM #)

i8 w2ill but my ke3ybo9ar4d i8s so9 f'e3d r4i8ght no9w2 that I* can't typ0e3 mo9r4e3 than sho9r4t se3nte3nce3s :P)

Alan Little
(September 13, 2004 11:18 AM #)

I wonder if the "bad programmers" in the mind of Randy might, in this case, be those who write RSS aggregators that disregard HTTP 304? That's certainly a major issue with aggregators that poll frequently for content that changes infrequently (bloglines, for example, polls me every fifteen minutes for a feed that changes at most once a day). That particular problem would be easily fixable, although it only addresses the issue at a rather coarse-grained level. "Only get the whole thing if anything in it has changed" is better than "always get the whole thing regardless", although not as good as "only get the changed bits".

Also reminds me of a conversation I had with the developers of the client end of a web app where I was designing the server side API:

Them: we can't see the API for getting a dump of the entire database to refresh our cache.

Me: there isn't one.

Them: could you add one?

Me: No. Only about 10% of the entries in the database are actually active, and less than 1% of them change on any given day.

Them: But we could issue update queries every day for every entry individually, say at three o'clock in the morning?

Me: (Sigh) If doing that would make you happy, we probably wouldn't block it.

Randy Charles Morin
(September 16, 2004 04:39 PM #)

Sara has admitted that I was right all along. There was never a problem w/ RSS.

http://www.kbcafe.com/rss/?guid=20040916082939

So let's put this to rest once and for all. No more wolf crying.

Jon Hanna
(September 16, 2004 11:19 PM #)

I'm not sure abusing ETags counts as something good programmers would do. The general approach is sound, but expecting all compliant clients to work well with the specifics isn't. Given that it breaks public caching, it isn't even a particularly attractive long-term approach to the solution.

The "Boy Who Cried Wolf" is a good analogy I think, the 3 or 4 false alarms which were particularly damaging because there really was a wolf.

Randy Charles Morin
(September 17, 2004 01:21 AM #)

The vary-ETag technique is not for everybody, as Sam explained. It's only for people whose RSS has such extremely high turnover that there is little benefit in using regular ETags. This same high turnover would result in zero benefit from public caches, so to say it breaks public caches is misleading. There would be little public caching anyhow, so there's nothing to break.

Trackback Pings


Listed below are links to weblogs that reference WWW cubed: syndication and scale:

» Responsibility is Complex in the Now Economy from The Now Economy
Courtesy of Mike Dierken we found Bill de hÓra's "WWW cubed: syndication and scale", in which he writes: The most advanced thinking that doesn't involve throwing out the Web is probably Rohit Khare's PhD thesis, which suggests an 'eventing', or... [Read More]

Tracked on September 15, 2004 10:24 PM

» Strawman Argument? from Radovan Janecek: Nothing Impersonal
Mark comments right: software architecture as a CS discipline providing sufficient means for comparisons. Why the hindustry is not doing it? I don't know. Personally, I'm not doing it because I don't have time ;-) However, I had spent a lot of time wit... [Read More]

Tracked on September 22, 2004 11:07 PM

» Bill de hÓra: WWW cubed: syndication and scale from Shy's Linkblog
Bill de hÓra: WWW cubed: syndication and scale... [Read More]

Tracked on October 4, 2004 09:43 PM