
Pushing tin: the slashdot effect, web services and web infrastructure.

[ t e c h n o \ c u l t u r e ]

Karlin Lillington links to a Kuro5hin article on how linking can get a site slashdotted:

The ethics of linkage. "If you read "meta" sites like Slashdot, Kuro5hin, Fark, Met4filter (natch), and Memepool you've probably encountered links to stories that you can't reach -- namely because the act of linking to a server not prepared for massive traffic has brought down the server, or worse, put the hapless soul over their bandwidth cap denying any use to anyone for the rest of the month or day or whatever time period the ISP or hosting provider uses to allocate bandwidth."

Many will know this as the 'Slashdot effect': a thundering herd of readers follows a link on slashdot.org to another site, that site's server melts down, and the majority can't actually reach it. The article suggests the answer is to think before you link. A thoughtful idea, but not really workable.

Interestingly, Alan Mather blogged recently on the UK Environment Agency's website going down under load, and what we (particularly e-Government, since that's Alan's field of expertise) can learn from it:

The heavy rain over the last few days has meant that the Environment Agency's website that gives details on which areas are likely to be flooded has been overwhelmed with demand and is presently down.

Alan offers three options, none of which are explicitly ethical:

1. Robust Design

If you know this kind of thing is going to happen, you design your site to take that into account. [...]

2. Centralisation

If the economics at a local level or departmental level don't justify the kind of spend on resilience that's required, then you move the content and the applications somewhere that does.

3. Syndication

The science of syndication is not well understood for things like this, but it's certainly feasible that the main pieces of content could be offered up to a variety of major sites so that no single site is hit heavily.

These are good suggestions, but remain problematic in that they treat symptoms rather than causes. And syndication is better understood than Alan suggests (we'll talk about content delivery networks below), but it is not yet part of the web and internet protocols.

Ironically, most organizations, as likely as not, have more than enough computing capacity to cater for events like a website being slashdotted. It just happens that the capacity is not in the right place at the right time. Limited bandwidth, by the way, is not a major concern. Bandwidth economics are interesting in their own right, but a lack of bandwidth is not the issue when it comes to sites falling over (albeit bandwidth is not as well distributed as we might like). Instead, the problem here is very much one of deployed computing infrastructure, not necessarily best solved by linking ethics or buying more kit.

The architects estimate (guess) the maximum traffic a website can expect and buy for that case, or as much as can be afforded. You don't dare buy for the average or median cases. Worst of all, the common case for the majority of sites is typically a trickle of hits that could be handled by a six year old computer pulled out of a skip. And there's a reasonable chance that when you do get heavily hit, your maximum estimates will be too low, perhaps by an order of magnitude. Good news, if your business is pushing tin. The end result is that organizations and individuals are paying too much for running applications on the web, organizations for server infrastructure and development costs, individuals for bandwidth.

The rest of this piece looks at two of the usual suspects for the Slashdot effect, and one that will come to town, soon enough. The point is that Slashdot itself is not a suspect; at worst it's a messenger.

The protocol

HTTP cognoscenti are quick to point out that HTTP 1.1 has a lot to say about caching web resources and that the web has scaled fantastically well. Both are true, but only at a macro level, and even that requires a specific interpretation of 'scalability', closer to 'ubiquity' than any ability to ramp up efficiently against demand. HTTP was simply not designed to distribute load at the speed a site can get stampeded today.

The bitter truth is this: at the micro level, any individual site is punished in direct proportion to the perceived value of its content. The web does not offer an economy of scale; quite the opposite. Being popular is expensive, and being very popular may prove to be a website's undoing. The deployed HTTP infrastructure is clearly not able to deal with the Slashdot effect, where the burden of cost is levied on the supplier of content. Given that the web is meant to be a democratic, enabling medium, there's surely an impedance mismatch here.

One of the reasons the web scales as it does at all is not the implicit nature of the internet but an underground sea of machinery known as Content Delivery Networks (CDNs). CDNs live in a twilight zone between the TCP/IP transport layer and the HTTP application layer, caching and moving static content around the web, nearer to where the demand is. You have to pay to place content on these networks; they're not part of the web as designed. Efforts are under way to standardize CDN protocols, but ultimately this is renting tin rather than buying it, and may not be the best long-term approach.

The servers

Claims about the web's scalability or the caching facilities in the HTTP protocol are irrelevant when your servers melt down. Beyond protocol design, the immediate technical problems with sites falling over are to do with how web servers are designed. We're only beginning to properly understand the characteristics of web topology and the nature of web traffic. Most web server software was designed and deployed before this understanding arrived - its principles often go back to operating systems research that is twenty years old.

The vast majority of deployed web servers are built around what is known as the 'thread per request' model. In essence, each request is given slices of computing resources and will time-share with others for the CPU (this model is also the basis for CORBA and J2EE server architectures, which may help explain why it can be so expensive to make them highly performant). The single most interesting characteristic of the model is that the computing resources required are directly proportional to the number of incoming requests. When enough requests come in, the server must either generate new threads or quickly turn around in-use threads for new requests. This model made sense once upon a time for time-shared operating systems and mainframes, but much less so now for the web.
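To make the model concrete, here's a minimal thread-per-request sketch in Python (my illustration, not code from any server discussed here). The point to notice is in the accept loop: one thread is spawned per connection, so threads, memory and context-switching overhead all grow in lockstep with the herd.

```python
# Thread-per-request in miniature: every connection gets its own
# thread, so resource use is directly proportional to request volume.
import socket
import threading

def handle(conn):
    # One "personal chef" per diner: this thread does everything for
    # a single client, then exits.
    try:
        conn.recv(4096)  # read the request (ignored in this sketch)
        conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok")
    finally:
        conn.close()

def serve(port=8080):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(64)
    while True:
        conn, _ = srv.accept()
        # A fresh thread per request: under a stampede the thread
        # count climbs with the herd until the machine logjams.
        threading.Thread(target=handle, args=(conn,)).start()
```

Production servers of this style usually pool and recycle threads rather than spawning them raw, but the proportionality - and the logjam under a stampede - is the same.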

To make an analogy: imagine a restaurant gets an excellent review in the paper. Everyone wants to go there and eat. The thing about this restaurant is that here, each diner gets a personal chef. The chef takes the order, brings the drinks, cooks the food, serves it, runs the bill, and washes up afterwards. With enough chefs on the go, they'll start banging into each other on the floor, slowing each other down, spilling drinks, arguing in the kitchen over who gets what pot, fighting over the next clean plate, not putting out the garbage as it piles up because no-one has the time. When enough diners enter the restaurant, it will either keep taking new diners until the chefs logjam each other and service grinds to a halt, or it will close its doors to new customers until someone pays the check and leaves. When one or the other happens, the restaurant has been slashdotted. Everybody who gets service gets poor service, some people leave without telling their chef, whose time has then been entirely wasted making an uneaten meal, most people don't get in at all, and the restaurant's reputation is in tatters. Sending people to another eaterie helps, and represents a clustering of servers to balance the load. But if 95% of the time you only have a half-full restaurant, you have to wonder whether it makes sense to pay for two or more restaurants, and so many chefs, just because every now and then it gets seriously busy.

There is another, more complicated, but vastly more scalable approach to server architecture for processing web requests. The model is called 'event driven' and is most commonly seen today in desktop GUIs, but it is growing in popularity as a way to build web servers and, I hope, in the future, application servers.

It works much the way a real restaurant works: not by assigning a chef to each person eating, but by breaking the job of serving across a group of specialists who each work on one part of the meal. Each specialist has an in tray and an out tray of things to do. If the head chef gets too busy, the not-so-busy sous chef can pitch in for a while. The end result is a better quality of service for the happy eaters and an economical basis for running a restaurant.

In other words, the event model works the way we design restaurants, factories, shipping ports, or almost any real-world production system where resources need to be used efficiently. What makes it scalable is that the work is done by specialists who a) don't get in each other's way and b) are optimized to do a particular job - you can't logjam the system early the way you can with thread-per-request architectures. If you're concerned about avoiding the Slashdot effect in an economic way, the first thing to do before you run out and buy that clustering solution is consider whether your web server is up to the task. And if you can't change it, at least consider improving your existing servers' policies.
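For contrast with the thread-per-request shape, here's a minimal event-driven sketch, again my own illustration (using Python's standard selectors module; the servers pioneering this style were written in C against select/poll). One loop watches every socket at once and dispatches "ready" events to small handlers - no chef per diner.

```python
# Event-driven in miniature: a single loop multiplexes all
# connections, so resource use no longer grows per request.
import selectors
import socket

sel = selectors.DefaultSelector()

def accept(srv):
    # Specialist #1: the greeter. Seats new connections and hands
    # them to the responder via the selector.
    conn, _ = srv.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, respond)

def respond(conn):
    # Specialist #2: the responder. Runs only when a request has
    # actually arrived, answers, and moves on to the next event.
    if conn.recv(4096):
        conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok")
    sel.unregister(conn)
    conn.close()

def serve(port=8080):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(64)
    srv.setblocking(False)
    sel.register(srv, selectors.EVENT_READ, accept)
    while True:
        # Wake only when some socket has work; dispatch its handler.
        for key, _ in sel.select():
            key.data(key.fileobj)
```

The handlers are the specialists of the restaurant analogy: each does one job, none blocks waiting on a diner, and a stampede queues up as pending events rather than as a pile of contending threads.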

The webservices

Web services as currently designed will make the Slashdot effect worse for two reasons.

First, the speed at which links are followed will increase. Today sites go down at the rate of people's ability to click on a link. That's quite a low rate, compared to the speed at which a computer can click links. Machine-to-machine web services will greatly raise the overall clickthrough rate. We've already seen this happen: Google, among others, has had to tune its spiders to prevent them swarming on a site, even taking it down. The spiders turned into locusts. Another case in point is the periodic harvesting of RSS feeds. As we continue to automate the web we can expect to see an explosion of web traffic, orders of magnitude greater than today's.
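The tuning Google had to do amounts to per-host throttling. As a hypothetical sketch (the class and delay value here are my own invention, not any real spider's), an automated client can enforce a minimum gap between hits to the same host:

```python
# A polite-client throttle: never hit the same host more often than
# once per min_delay seconds, so an automated fetcher can't stampede
# a server the way an untuned spider (or feed harvester) can.
import time
from urllib.parse import urlparse

class PoliteFetcher:
    def __init__(self, min_delay=5.0):
        self.min_delay = min_delay  # seconds between hits per host
        self.last_hit = {}          # host -> time of previous request

    def wait_time(self, url):
        """Seconds this client should still wait before fetching url."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_hit.get(host, 0.0)
        return max(0.0, self.min_delay - elapsed)

    def fetch(self, url):
        delay = self.wait_time(url)
        if delay > 0:
            time.sleep(delay)       # back off instead of swarming
        self.last_hit[urlparse(url).netloc] = time.monotonic()
        # ... the actual HTTP GET would go here ...
```

The weakness, of course, is that this only works if every client volunteers to be polite; nothing in the protocol enforces it, which is the point of this section.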

The second is that the basis on which HTTP can facilitate caching is violated by web services, particularly the RPC variety. Web caching depends architecturally on intermediaries (or proxies) understanding what they can and cannot cache from entities they know nothing about. In HTTP this is possible since it has a handful of standard request methods whose meaning and implications with respect to caching are quite clear, standard header metadata for those methods, and standard responses whose meaning is also clear to any intermediary coded to HTTP. In other words, while it is not fully adequate, HTTP is designed with caching in mind, and documents can be cached. Web services, particularly SOAP messages, have no such facility. Method names are arbitrary, as are their responses; there's no basis on which an intermediary could begin to cache web service requests and responses from arbitrary sources unless the web service methods are mapped directly onto HTTP. Not only that: since web services usually tunnel through HTTP, they'll affect the overall quality of service on the web if they become a significant fraction of web traffic. For some this will be a disaster; for others it will be a lucrative opportunity for pushing tin. Either way, it represents something I've mentioned before: you have to think differently about programming at web scale; it's not just an extension of middleware. Arbitrary names with arbitrary semantics make sense on the LAN, not on the web. Designing web services under the same principles as a J2EE middleware solution is just asking for performance and availability trouble.
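The intermediary's dilemma can be shown in a few lines. This toy cacheability check is my own sketch (the header names are standard HTTP; the decision logic is deliberately simplified): the standard methods tell a shared cache what is safe, while an RPC-style SOAP call tunnelled in a POST body is opaque to it.

```python
# A toy shared-cache decision, as an HTTP intermediary might make it:
# standard methods carry standard caching semantics; a SOAP call
# tunnelled through POST hides its real method name in the body,
# where no generic proxy can see or trust it.
def cacheable(method, headers):
    """Can a shared cache store the response to this request?"""
    if method not in ("GET", "HEAD"):
        # POST (how RPC-style SOAP travels) is not cacheable: the
        # operation and its semantics live in the opaque payload.
        return False
    cc = headers.get("Cache-Control", "").lower()
    # Standard header metadata lets the origin opt out explicitly.
    return "no-store" not in cc and "private" not in cc
```

A GET for a flood-warning page can be replicated across every proxy between the Environment Agency and its readers; a `getFloodStatus` SOAP call, arriving as just another POST, hits the origin server every single time.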

Technically this state of affairs might not look much different from denial of service (DoS) attacks today, except that it will be the order of things rather than the exception. DoS is one class of attack that is of great concern to security analysts; it's extremely hard to prevent and not hugely difficult to mount - the best known preventative measure is to fail fast: let servers fall over until the attack has subsided. In any case, that's what most of them do.

January 5, 2003 04:23 PM

