Julian Hyde: "You would think that something called a 'feed' would push content
to subscribers as soon as it arrives, but in fact RSS and the
other feed types in the prototype use a pull protocol. With a pull
protocol, the subscriber needs to continually poll the feed to get the content (typically an XML document a few kilobytes
long), parse the content, and figure out what, if anything, is new
since the last time we polled.
This process soaks up a lot of
network bandwidth and resources for both the provider and the
subscriber, and the cost goes up the more regularly we poll. Typically
the provider has to throttle the feed to prevent their servers from
being overwhelmed. For example, Twitter updates its feed only once per
minute and limits the number of tweets on the page. At times of high
volume, only a small percentage of tweets make it into the feed.
This
may not sound that serious if the content is a twitter conversation
between friends, or a blog with one or two posts a week. But web feed
protocols are becoming part of the IT infrastructure, and business
users require lower latency, higher throughput and higher availability.
(The existence of services like Gnip is evidence of the need to control the web content chaos.)"
I would like to know how to scale this so that the origin server does not melt down under query load. Let me explain, assuming the origin server is backed by a relational database.
Most people who want efficient real-time feeds are concerned about bandwidth overhead or the apparent technical stupidity of polling the same data over and over. They would just like to get what has changed since the last time they asked. It's clearly more efficient and better. Let's call this a "bespoke" feed model.
What tends to get forgotten about with bespoke feeds is that each client request forces a subselect on the database. This model is not likely to scale nearly as well on the server as resending redundant information and letting the client sort it out locally, however dumb that approach might seem. The Atom format, for example, is designed so that the client can sort it out locally by virtue of the atom:id and atom:updated values.
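To make "letting the client sort it out locally" concrete, here is a minimal sketch of what a polling client might do with atom:id and atom:updated. The function name `new_or_changed`, the `seen` dictionary and the sample feed are made up for illustration; a real client would also persist `seen` between runs.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in ElementTree notation


def new_or_changed(feed_xml: str, seen: dict) -> list:
    """Return the ids of entries that are new or updated since the last poll.

    `seen` maps atom:id -> last atom:updated value and is updated in place.
    """
    changed = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        if seen.get(entry_id) != updated:
            seen[entry_id] = updated
            changed.append(entry_id)
    return changed


# Hypothetical two-entry feed, the same document every client would receive.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>urn:example:1</id><updated>2008-12-04T18:00:46Z</updated></entry>
  <entry><id>urn:example:2</id><updated>2008-12-04T18:05:00Z</updated></entry>
</feed>"""
```

The server does no per-client work here: every client gets the same bytes and discards what it has already seen.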
The alternative polling option people arrive at is to not support bespoke queries but to serve the same redundant data to all clients. Let's call this the "one size fits all" (osfa) feed model. It is the standard approach on the Web for scalable, high availability feed serving. The osfa approach "works" insofar as it assumes a lot of clients are accessing data and makes a tradeoff preferring bandwidth overhead to database load. This tradeoff makes a lot of sense as the number of clients goes up - anyone who builds database backed websites quickly learns to reduce the number of calls on the database, be it through query caches, L2 object caches, caching proxies, and so on. An osfa approach allows the data to be served off disk directly, making it a pure file serving problem, which is far easier to scale than hitting a relational database.
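As a rough sketch of the osfa idea, assume the feed is rendered once whenever content changes and written to a file that the web server hands to everyone. The helper name `publish_feed` is hypothetical; the point is only that the database is touched on write, not on every client request.

```python
import os
import tempfile


def publish_feed(feed_bytes: bytes, path: str) -> None:
    """Atomically replace the static feed file that every client fetches."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the rename below
    # stays on one filesystem and readers never see a half-written feed.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(feed_bytes)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Called from whatever code path saves a new post, this turns feed serving into plain file serving that a static web server, reverse proxy or CDN can handle.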
So, where does that leave us? Well, I think if you must allow per client querying for a lot of clients, you need to be sure the server can handle the database load at scale. If you are really worried about bandwidth then compression is the first obvious thing to do. Another is caching, but that leads to data latency, and if you are asking for "just" the changed data there is a chance you want that data "right now" as well (more on that in a minute). You might also think that sending down less data will be a win - but this really depends on your use case. Replacing one coarse grained fetch with 4 fine grained queries isn't necessarily going to lead to a better user experience or sane usage of the data server, though a client developer might find it convenient to not have to om nom through a larger dataset. If you are familiar, from enterprise development, with the .NET/JEE data access antipattern that leads to the use of DTOs, well, fine grained feeds present similar issues.
Julian has a suggestion: "I would like to see the emergence of a genuine 'push' protocol for web-based content. It doesn't have to be particularly complicated. To illustrate what I have in mind, here is an example of a simple, stateless protocol, built using XML over HTTP, like the current feed formats. A subscriber sends a request
<readRequest>
<minimumRowtime>2008-12-04 18:00:46.000</minimumRowtime>
<maximumCount>1000</maximumCount>
<maximumWait>10s</maximumWait>
</readRequest>
over HTTP"
I would like to see such a thing as well. But.
"According to the protocol, the provider sends the results after 10
seconds, or when there are 1000 records to return, whichever occurs
sooner. After it has received a result, the subscriber will typically
ask for the next set of rows with a higher rowtime threshold.
Even
though it is simple, the protocol ensures that data flows efficiently
for feeds of all data rates. For a high volume feed, the 1000 record
limit will be reached before the 10 second timeout, so latency
naturally decreases. For a low volume feed, many requests may time out
and return an empty result; but the 10 second wait limits the number of
requests per minute that the server has to handle."
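To see what the provider has to do under this protocol, here is a minimal in-memory sketch of the read side. The class `RowBuffer` and its method names are made up for illustration; the three `read` parameters mirror minimumRowtime, maximumCount and maximumWait from the request above, with the wait given in seconds.

```python
import threading
import time


class RowBuffer:
    """In-memory rows ordered by rowtime; a stand-in for the provider's store."""

    def __init__(self):
        self._rows = []  # (rowtime, payload) tuples, appended in rowtime order
        self._cond = threading.Condition()

    def append(self, rowtime, payload):
        """Add a row and wake any readers blocked in read()."""
        with self._cond:
            self._rows.append((rowtime, payload))
            self._cond.notify_all()

    def read(self, minimum_rowtime, maximum_count, maximum_wait):
        """Return when maximum_count rows are ready or maximum_wait seconds pass."""
        deadline = time.monotonic() + maximum_wait
        with self._cond:
            while True:
                rows = [r for r in self._rows if r[0] >= minimum_rowtime]
                remaining = deadline - time.monotonic()
                if len(rows) >= maximum_count or remaining <= 0:
                    return rows[:maximum_count]
                # Sleep until a new row arrives or the deadline passes.
                self._cond.wait(remaining)
```

Note that every outstanding `read` occupies a thread (or at least a connection) on the provider for up to `maximum_wait` seconds, whether or not there is anything to send.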
It is simple, but only by virtue of assuming the data server can handle the load of pushing out the data and managing subscription state; the protocol does nothing to manage that part of the architecture. Good client-server protocol designs (where good means they scale to large numbers of both) try to avoid or mitigate these kinds of asymmetries.
Back to latency. Many web sites scale on the basis of the data being latent - even a few minutes can make a huge engineering and operational difference, especially as your application grows beyond a single cluster (or geographic location). IMO the mapreduce pattern scales not just on parallelisation but on the data latency the results are allowed to have (which is why it gets used a lot for log/warehouse analytics and post-hoc querying). So if you demand real time precision in the data, be aware that this can put stress on your server.
"Real time" requirements in turn might lead you towards a push model, but I think it's reasonable to say that we don't know how to do internet scale push yet, at least not without creating asymmetries - it's hard to have a lot of clients to send data to, and the problem gets harder as you add in things like filtering and long held client connections that will have you ripping out those loadbalancers.
For push, I think XEP-0060 is worth looking at, even though we (imo) still have work to do to learn how to manage mass subscriptions, and if you are interested in systems architecture, so is Rohit Khare's ARRESTED model.
19 Comments
Conditional requests and something like RFC3229+feed will address any concerns with polling more than adequately. You pay a very small bandwidth overhead in exchange for not having to manage 10,000 permanently open connections. Better yet you can do the delta processing on a reverse proxy and/or conditionally disable it as load permits; plus a mechanism for catch-up by intermittently disconnected clients naturally falls out.
Polling = low coupling, and low coupling is better in the vast majority of cases.
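For readers who haven't used conditional requests, here is a minimal sketch of the server-side decision, assuming a single strong ETag derived from the feed bytes. The function name `conditional_get` is hypothetical, and a real server would also need to handle weak validators, lists of tags in If-None-Match, and If-Modified-Since.

```python
import hashlib
from typing import Optional


def conditional_get(body: bytes, if_none_match: Optional[str]):
    """Return (status, headers, payload) for a GET with an optional If-None-Match."""
    # Derive a validator from the content; any stable fingerprint would do.
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    if if_none_match == etag:
        # The client's copy is current: send headers only, no feed body.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

A poll that finds nothing new then costs a request and a few response headers rather than the whole feed.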
On the global internet, 125ms latencies are achievable; 30ms between data centers in a single country. (Except in case of a network failure.) This is cheap with push. Achieving those with polling requires 8Hz and 33Hz polling rates respectively; if the average update frequency is 1/3600Hz, that's 4–5 orders of magnitude overhead for polling.
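The arithmetic behind those figures can be checked with a few lines. This is only a back-of-the-envelope sketch: it assumes that matching a target latency by polling means polling once per latency interval, and it takes the one-update-per-hour (1/3600Hz) rate from the comment above. The function name `polling_overhead` is made up.

```python
import math


def polling_overhead(target_latency_s: float, update_interval_s: float):
    """Return (poll rate in Hz, polls per update, orders of magnitude)."""
    poll_rate_hz = 1.0 / target_latency_s
    polls_per_update = poll_rate_hz * update_interval_s
    return poll_rate_hz, polls_per_update, math.log10(polls_per_update)


for latency in (0.125, 0.030):
    rate, polls, orders = polling_overhead(latency, 3600.0)
    print(f"{latency * 1000:.0f} ms: {rate:.0f} Hz, "
          f"{polls:.0f} polls per update, ~{orders:.1f} orders of magnitude")
```

With one update per hour, 125ms works out to 28,800 polls per update (about 4.5 orders of magnitude) and 30ms to roughly 120,000 (about 5.1), which is where the "4–5 orders of magnitude" comes from.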
If you have more than a few tens of thousands of users per front-end machine (such as a reverse proxy), then yeah, you might want to use a push mechanism that doesn't require one open TCP connection per user. For example, you could send updates over IRC, Jabber, SIP, email, or HTTP POST.
Email remains important, 18 years into the WWW era, because it does support push and unsolicited communication (despite the unavoidable burden of spam that comes with that). Other protocols that support these models are also on the rise: AIM, Jabber, MSN Messenger, SIP, Skype.
A push protocol that fit naturally into the web would be very valuable, and wouldn't necessarily impose tighter coupling than the pull protocols we currently have. For example, an HTTP header requesting the server to send a UDP packet containing a URL to a particular IP:port combination the next time the underlying resource changes (essentially, a cache-invalidation protocol) would allow us to use polling only to recover from failures, and could be handled in a natural way by reverse proxies.
Right on the money, Bill.
Inevitably, people overestimate their need for "real-time" data and claim to be intolerant to latency. While there are always applications that need this (e.g., the trading floor) -- and are willing to pay the corresponding price -- the vast majority don't, and shouldn't.
Polling is expensive if you transfer the bytes every time, but conditional requests, compression and small feeds help mitigate this. RFC5005 can help keep it reliable, if that's what's required. The fact is that serving a static/cacheable file to all comers is orders of magnitude more efficient on the server than creating bespoke responses, and easier to scale with CDNs and the like as well.
People have been talking about this for a long, long time. It's interesting that cache invalidation was mentioned, because at about the same time that Rohit et al were talking about internet-scale eventing, the caching industry was getting very interested in putting together standard protocols. We learned, as it was pointed out by Bill, that it's really hard to scale push with any kind of guarantees, and the asymmetries are really unattractive.
It's not a coincidence really; people are using HTCP as a push protocol today, between machines in the same administrative domain.
I don't think the problem with guarantees in push is scaling. The problem is who cares that the information gets delivered: the publisher or the receiver? Obviously, in the context of the web, it's the receiver that cares; the publisher doesn't even know who they are. Really I think this is mostly socially optimal: people should not receive information they don't want, and should not be prevented from receiving information they do want, except in extreme cases. By their nature, pull subscriptions live on the subscriber, while push subscriptions live on the publisher.
But putting the responsibility for maintaining the subscription at the end that doesn't care about its reliability, or even its existence, is going to make it really hard to provide guarantees.
But push as an optimization to pull, with fallback to polling in case of failures (instead of trying to provide guarantees), should work fine with none of these drawbacks.