
The Web scalability myth

O'Reilly Network: The PHP Scalability Myth

The ideal multi-server model is a pod architecture, where the router round-robins each of the machines and there is only a minimal session store in the database. Transient user interface information is stored in hidden variables on the web page. This allows for the user to run multiple web sessions against the server simultaneously, and alleviates the "back button issue" in web user interfaces.

Well said. Let's take it a step further. The ideal multi-server model is one where state is managed on the client. It's the one place in a client-server or service-oriented network topology where a single point of failure is OK (think about it).
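To put a concrete face on the hidden-variable approach described above, here's a minimal sketch (the servlet and field names are made up for illustration): the transient UI state rides in the form itself, so any machine behind the round robin can answer the next request without a session lookup.

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Transient UI state (a sort order and a page number) travels in hidden
    // form fields rather than an HttpSession, so any pod member can answer.
    public class ProductListServlet extends HttpServlet {
        protected void doPost(HttpServletRequest req, HttpServletResponse res)
                throws ServletException, IOException {
            // State arrives with the request, not from server-side session storage.
            String sortOrder = req.getParameter("sortOrder") != null
                ? req.getParameter("sortOrder") : "name";
            int page = req.getParameter("page") != null
                ? Integer.parseInt(req.getParameter("page")) : 1;

            res.setContentType("text/html");
            PrintWriter out = res.getWriter();
            out.println("<form method='post' action='/products'>");
            // Echo the state back out as hidden variables for the next round trip
            // (real code would HTML-escape the values).
            out.println("<input type='hidden' name='sortOrder' value='" + sortOrder + "'/>");
            out.println("<input type='hidden' name='page' value='" + (page + 1) + "'/>");
            out.println("<input type='submit' value='Next page'/>");
            out.println("</form>");
        }
    }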

Now, this idea, this management of state on the server, is also where the deployed web (HTTP + browsers) breaks with REST architecture. REST advocates (including me) are happy to point out that the web is the epitome of a scalable, flexible system. But this has been helped along by billions of dollars spent scaling sites to manage session state, an invisible web of content delivery networks and geographic caches, some questionable ideas such as DNS round robin, session-based URL rewriting, and cookies, and any number of hacks and compromises in millions and millions of lines of application software, all to cater for primarily one thing: sessions on the server.
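To be concrete about what I mean by "sessions on the server", here's the bog-standard servlet idiom (sketched from memory, names made up): per-user state lives on the server, and the container glues requests back to it with a JSESSIONID cookie or rewritten URLs.

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;

    // Server-held session state: the container issues a JSESSIONID cookie, and
    // for cookie-less clients rewrites the session id into every link.
    public class CartServlet extends HttpServlet {
        protected void doGet(HttpServletRequest req, HttpServletResponse res)
                throws ServletException, IOException {
            // Creates state on this server that every later request must find again.
            HttpSession session = req.getSession(true);
            session.setAttribute("cart", new java.util.ArrayList());

            // URL rewriting fallback: ";jsessionid=..." is appended when needed,
            // tying the user to wherever this session object lives.
            String nextLink = res.encodeURL("/checkout");
            res.getWriter().println("<a href='" + nextLink + "'>Checkout</a>");
        }
    }

Multiply that by every clustered deployment that then has to replicate or pin those session objects, and you get the scaling bill described above.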

If you're running a web site or service, this is costing you a lot of money - very possibly the bulk of your development and running costs are sunk in making sessions scale up and out. It costs you more as you get popular - for a site there is no economy of scale on the current web (which is why web-based business plans derived from the economics of broadcast media often go to the wall). Sometimes we call this the curse of the popular, denial of service, or in the vulgar tongue, the slashdot effect. All those servers you shelled out for are idle almost all the time, yet the day you do get slashdotted, you won't have enough computational horsepower to hand (cue business models for P2P and utility computing).

And as far as I know, REST advocates (including me) have no good answer for changing this state of affairs on the deployed web, other than to encourage people to avoid state where it's not needed (it often isn't). We get enough grief from WS middleware types as it is, and wouldn't want to goad them by making insane arguments - for example, fixing every browser on the planet to store user sessions, so that it became your shopping cart, not Amazon's *.

Honestly, in the long run this problem may only go away as lessons learned from P2P, Grid, telco and utility architectures are absorbed into mainstream web development. I imagine this will happen via SOA projects. There is already considerable interest in Grid and utility computing as complements to the web for SOAs, and P2P can't be far behind.

In the meantime there are web servers like Matt Welsh's SEDA and Zeus that can help alleviate the slashdot effect. We built a blisteringly fast web server at my previous job, architected by Miles Sabin (one of the java.nio architects along with Matt Welsh), but it never made it to market - today it would make a fantastic basis for a SOAP router or XML content firewall.
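For the curious, the core of that style of server is small enough to sketch (this is just the bare java.nio selector loop, not the product itself): one thread multiplexes every connection, so a traffic spike queues up as readiness events instead of thousands of blocked threads.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.Charset;
    import java.util.Iterator;

    // A single selector thread services every connection; nothing blocks on I/O.
    public class TinyNioServer {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.configureBlocking(false);
            server.socket().bind(new InetSocketAddress(8080));
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer reply = Charset.forName("US-ASCII").encode(
                "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok");

            while (true) {
                selector.select();
                Iterator it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = (SelectionKey) it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = ((ServerSocketChannel) key.channel()).accept();
                        if (client != null) {
                            client.configureBlocking(false);
                            client.register(selector, SelectionKey.OP_READ);
                        }
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        client.read(ByteBuffer.allocate(1024)); // drain the request
                        client.write(reply.duplicate());        // canned response
                        client.close();
                    }
                }
            }
        }
    }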



[*] This issue of the deployed client base also relates to a debate that occurred in Atom - whether to use PUT and DELETE, or just POST. In my mind there is zero technical justification for using only POST. But there is a key practical one, which is brutally simple: the HTML spec, and therefore browsers, don't support form upload with PUT and DELETE, so what's the point of specifying a technology almost no-one can use? My answer is that blogging and RSS represent green fields in web development and don't have to be considered in terms of legacy browsers and bad decisions in web specs, but not everyone agrees with that.
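(For what it's worth, one workaround people reach for is to tunnel the intended method through POST via a hidden field - called "_method" here purely for the sake of the sketch, it's not anything the HTML or Atom specs define - and dispatch on it server-side. It works, but it's exactly the kind of compromise I mean: the uniform interface gets smuggled inside POST because deployed forms can't express it.)

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // HTML forms can only submit GET or POST, so the intended method is carried
    // in a hidden field, e.g. <input type="hidden" name="_method" value="DELETE"/>.
    public class EntryServlet extends HttpServlet {
        protected void doPost(HttpServletRequest req, HttpServletResponse res)
                throws ServletException, IOException {
            String method = req.getParameter("_method");
            if ("PUT".equalsIgnoreCase(method)) {
                updateEntry(req, res);      // what PUT would have done
            } else if ("DELETE".equalsIgnoreCase(method)) {
                deleteEntry(req, res);      // what DELETE would have done
            } else {
                createEntry(req, res);      // a plain POST
            }
        }

        private void updateEntry(HttpServletRequest req, HttpServletResponse res) { /* ... */ }
        private void deleteEntry(HttpServletRequest req, HttpServletResponse res) { /* ... */ }
        private void createEntry(HttpServletRequest req, HttpServletResponse res) { /* ... */ }
    }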


October 18, 2003 02:41 PM

Comments

Geert Bevin
(October 18, 2003 05:43 PM #)

I can't agree with you more about the importance of keeping state outside of the server. Indeed, RIFE (the web framework I'm developing: http://rife.dev.java.net ) currently prohibits any storage on the server side. It works around this limitation by allowing developers to declare the data flow together with the logic flow of the site. Every parameter is linked or globally declared, and the links are automatically generated correctly to pass data around while respecting the declaration scope. Elements that process the requests are only aware of the data inputs they declare and can only set the data outputs they declare. This could, for example, be envisioned like this: http://rifers.org/docs/usersguide/numberguess_site.png
As REST states, this makes each URL or request exactly define where a user is in the application. As you say, the back-button problem is no longer present, and complicated issues like load balancing are solved by a simple round-robin approach.

While this has worked well for almost all our projects, there are two major limitations to pure client-side state preservation: the data is visible to the client and carried around, and it's difficult to handle progressive states (wizards) since everything needs to be validated at each submission. We think we have solved these problems too, and are in the process of implementing the solution that's described here: http://www.uwyn.com/pipermail/rife-devel/2003-August/000012.html
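One way to take the sting out of the first limitation - at least the tampering half of it; hiding the data itself would additionally need encryption - is to sign the serialized state before it goes into the hidden field and verify it on the way back in. A rough sketch of the general idea (not RIFE's actual mechanism, class and field names made up):

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    // Sign client-held state so tampering is detectable on the return trip.
    public class StateSigner {
        private final SecretKeySpec key;

        public StateSigner(byte[] secret) {
            this.key = new SecretKeySpec(secret, "HmacSHA1");
        }

        // Returns "state|hexdigest", suitable for a hidden form field.
        public String sign(String state) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(key);
            return state + "|" + toHex(mac.doFinal(state.getBytes("UTF-8")));
        }

        // Recompute the digest on the way back in; reject anything that doesn't match.
        public String verify(String signed) throws Exception {
            int sep = signed.lastIndexOf('|');
            String state = signed.substring(0, sep);
            if (!sign(state).equals(signed)) {
                throw new IllegalStateException("client-side state was tampered with");
            }
            return state;
        }

        private static String toHex(byte[] bytes) {
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < bytes.length; i++) {
                sb.append(Integer.toHexString((bytes[i] & 0xff) | 0x100).substring(1));
            }
            return sb.toString();
        }
    }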

Anu
(October 21, 2003 01:00 AM #)

No state on the server? Hmm, could be a problem when you generate huge amounts of data in response to a user request. How would Google searching work, for example? Run the entire search on each request and throw away the non-visible portion?

Baz
(October 27, 2003 11:34 AM #)

Anu - why not? I've built a search engine for a large commercial site, and user testing showed that there was an exponential decay in the number of users clicking on the "next page" button - i.e. a roughly constant fraction (much less than half) clicked it on page 1, page 2, page 3...

Stats reported for other search engines show this fraction is typically 23% (Google's is probably lower as their relevance is very good). So 77% of the time, no pages are served from session data. 95% of the time, at most 1 page is used. 99% of the time, at most 2 pages are used; and so on. It ends up that something like 98% of the memory consumed by your sessions would never be accessed.

The devil is in the details - how long does a search take, how much memory is consumed by results - but it just demonstrates that it's not /at all/ obvious that storing search results is a sensible thing to do - recalculating them may be better.
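A quick back-of-the-envelope check of those numbers, assuming the 23% next-page rate holds independently at each step:

    // If 23% of users who see page n go on to page n+1, the chance a session
    // ever needs more than k pages of stored results is 0.23^(k+1).
    public class NextPageDecay {
        public static void main(String[] args) {
            double p = 0.23; // fraction clicking "next page" at each step
            for (int k = 0; k <= 3; k++) {
                double atMostK = 1.0 - Math.pow(p, k + 1);
                System.out.println("at most " + k + " stored page(s) needed: "
                    + Math.round(atMostK * 1000) / 10.0 + "%");
            }
        }
    }

which gives roughly 77%, 94.7%, 98.8% and 99.7% - the figures quoted above.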
