Efficient API Paging: count down, not up


Hueniverse: "Making an API call is like asking a question and if the question is simple enough, the answer is easy to come by. For example, asking to describe a single user is simple. A single database lookup using a single unique key is fast and generally easy to scale. This is the most optimized use case in relational databases. However, asking for the last 20 messages of all the friends I am following is not a simple question. It is complex because it is really a list of questions batched together. To find the last 20 messages, we must ask:

    * Who do I follow?
    * What are their last 20 messages?
    * Of all the messages found, which are the most recent 20?
    * What is the content and other metadata for each of these 20 messages?

This list of questions becomes pretty inefficient with a very large set of friends being followed. However, the data set can be reduced by first finding the top 20 friends with the most recent updates, and only requesting their 20 latest messages. Even with this optimization, we are looking at fetching up to 400 messages in order to display 20. This gets much worse when the request is for the second page of messages (messages 21-40) where we can fetch up to 1600 messages from up to 40 friends in order to display 20 messages. By the 5th page we are fetching up to 10,000 messages to display 20 (which might explain why Twitter temporarily disabled the paging feature of their API)."
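
To make the fan-out arithmetic in that quote concrete, here's a quick Python sketch. The numbers mirror the 20-per-page example above; the function name is mine, not from the quote:

    # Worst case for page k under the "top friends first" optimization:
    # to fill k*20 slots we may need the k*20 friends with the freshest
    # updates, each of whom may contribute up to k*20 candidate messages.
    PAGE_SIZE = 20

    def worst_case_fetch(page, page_size=PAGE_SIZE):
        needed = page * page_size
        return needed * needed

    for page in (1, 2, 5):
        print(page, worst_case_fetch(page))   # 1 400, 2 1600, 5 10000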


Let's ignore the notion of an inbox for now (which avoids the who-do-I-follow-means-a-SQL-join problem) and look at paging over the network. The way some services expose paging physically ensures that each page has to be recomputed every time there's a new addition - new content keeps getting added to page=1, pushing everything else down off the main page and causing a ripple effect across the set of pages. As a result, pages can't easily be archived and are tricky to cache - the content of "page=5" changes every time there's new content on the homepage (and in Twitter's case new content on the homepage is extremely frequent), because the paging model is a pushdown stack. Here's that stack:

 home page --> page=2 --> ...
    msgN-1      msgN-10      
    msgN-2      msgN-11      
    msgN-3      msgN-12      
    msgN-4      msgN-13      
    msgN-5      msgN-14      
    msgN-6      msgN-15      
    msgN-7      msgN-16      
    msgN-8      msgN-17      
    msgN-9      msgN-18      

Now watch what happens when we add a new message, msgN-0:

 home page  -->  page=2  -->   page=3
    msgN-0        msgN-9      msgN-18  
    msgN-1        msgN-10 
    msgN-2        msgN-11 
    msgN-3        msgN-12 
    msgN-4        msgN-13 
    msgN-5        msgN-14 
    msgN-6        msgN-15 
    msgN-7        msgN-16 
    msgN-8        msgN-17 

All the messages are pushed down and, logically, every page in the stack gets touched. As a result nothing is long-term cacheable and nothing can be physically archived to disk. It's inherently non-scalable.
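
A minimal sketch of why, assuming a simple newest-first offset scheme (names and sizes are illustrative): page k is just a slice at a fixed offset from the top, so one insert shifts every slice.

    # Newest-first offset paging: page 1 always starts at the newest message,
    # so every new message shifts the contents of every page by one.
    PAGE_SIZE = 9

    messages = [f"msgN-{i}" for i in range(1, 19)]   # newest first: msgN-1 .. msgN-18

    def page(msgs, number, size=PAGE_SIZE):
        start = (number - 1) * size
        return msgs[start:start + size]

    page2_before = page(messages, 2)
    messages.insert(0, "msgN-0")                     # a new message arrives
    page2_after = page(messages, 2)

    print(page2_before[0], "->", page2_after[0])     # msgN-10 -> msgN-9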

Another approach is to work the paging backwards by counting down. In that model page=1 is the oldest page, not the newest one. Like this:

 home page --> page=N --> ... page=1
    msgN-1      msgN-10        msg9
    msgN-2      msgN-11        msg8
    msgN-3      msgN-12        msg7
    msgN-4      msgN-13        msg6
    msgN-5      msgN-14        msg5
    msgN-6      msgN-15        msg4
    msgN-7      msgN-16        msg3
    msgN-8      msgN-17        msg2
    msgN-9      msgN-18        msg1
   
The homepage is linked to page=N, not page=1. So great, let's add a new msg:

 home page  -->  page=N  -->   page=1
    msgN-0        msgN-10       msg9  
    msgN-1        msgN-11       msg8
    msgN-2        msgN-12       msg7
    msgN-3        msgN-13       msg6
    msgN-4        msgN-14       msg5
    msgN-5        msgN-15       msg4
    msgN-6        msgN-16       msg3
    msgN-7        msgN-17       msg2
    msgN-8        msgN-18       msg1
    msgN-9
   
Note that we let the homepage grow by one, and that stopped the propagation to the other pages. When the homepage gets too big we can push some messages back onto a new page and splice it into the list:

 home page  -->  page=N+1  -->  page=N  -->   page=1
    msgA          msgN-1         msgN-10      msg9  
    msgB          msgN-2         msgN-11      msg8
    msgC          msgN-3         msgN-12      msg7
    msgD          msgN-4         msgN-13      msg6
    msgE          msgN-5         msgN-14      msg5
    msgF          msgN-6         msgN-15      msg4
    msgG          msgN-7         msgN-16      msg3
    msgN-0        msgN-8         msgN-17      msg2
                  msgN-9         msgN-18      msg1


Upside? N-2 pages can be archived - even taken off your active database and served as flat files or out of a memcached layer (if a database is what you're using). If you don't mind a long homepage, you can manage the resizing just on the homepage and splice in a new page when the homepage reaches twice your archive page size. Downside? Somebody will file a bug that the main pages vary in size.
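
Here's a minimal in-memory sketch of that count-down model; the page size of 9 and the 2x splice threshold come from the description above, everything else (names, the message loop) is illustrative:

    ARCHIVE_PAGE_SIZE = 9

    archive_pages = []   # archive_pages[0] is page=1 (the oldest); append-only
    homepage = []        # newest messages, newest first

    def add_message(msg):
        homepage.insert(0, msg)
        # Splice only when the homepage reaches twice the archive page size:
        # the oldest ARCHIVE_PAGE_SIZE messages become an immutable page=N+1.
        if len(homepage) >= 2 * ARCHIVE_PAGE_SIZE:
            archive_pages.append(homepage[-ARCHIVE_PAGE_SIZE:])
            del homepage[-ARCHIVE_PAGE_SIZE:]

    for i in range(1, 31):
        add_message(f"msg{i}")

    # Archived pages never change again, so they can be cached indefinitely,
    # served as flat files, or pushed into memcached.
    print(len(archive_pages), [p[-1] for p in archive_pages])   # 2 ['msg1', 'msg10']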

By the way, if you designed your Data API around Atom and RFC5005 - thus telling everyone "follow the next link" instead of "compute the next link's URL, then follow it" - you could switch physical paging models after the fact: the user agent doesn't care about your notion of a linked list, it keys off the "rel" paging attributes. Whereas if you exposed the counting model directly via URL parameters, every client would need to be upgraded. It's basic Data API practice to tell clients to follow typed links instead of asking them to compute parameterised URLs.
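
For example, a client walking RFC5005 archive links might look like the sketch below. It uses the real feedparser library, but the feed URL is made up and the page limit is arbitrary:

    import feedparser

    def walk_archives(url, max_pages=10):
        """Follow rel='prev-archive' links back in time, RFC 5005 style."""
        for _ in range(max_pages):
            feed = feedparser.parse(url)
            yield feed
            prev = [l for l in feed.feed.get("links", [])
                    if l.get("rel") == "prev-archive"]
            if not prev:
                break                      # oldest archive reached
            url = prev[0]["href"]          # the server names the next page, not the client

    for doc in walk_archives("https://example.com/messages.atom"):
        for entry in doc.entries:
            print(entry.get("title"))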

4 Comments


    "Downside? Somebody will file a bug that the main pages vary in size."

    Or rather, "the main pages vary in size when JavaScript is disabled."


    Assaf, this does not have to be exposed as-is to the end user. Since you know the total page count, you could display the page number in the traditional manner. For example, if the collection has a total of 10 pages, real page 6 could be displayed as page 10-6=4.

    Moreover, using numbers to enumerate pages is just a convention. You could enumerate them as 'newest', 'newer', 'older', 'oldest'. The page indicator could be circles of decreasing size (a smaller circle is older). Using numbers is a convention, not a rule.


    With time-sensitive data such as micro-blogging content, paging by index position in a result-set seems like a category-mistake. Why even set the expectation that there is a complete list or that a fragment of this list has a fixed-size? Who's requesting this data and what is the goal? If it is to find the most recent posts (i.e. time-sensitive), then segment results as "the next 10 messages you haven't seen and that are more recent than any of the ones you have seen" and "the N messages that are MORE recent than the ones you've seen and that you probably care about more". In this model, paging is a query operation: at time t1 (now), return N items beyond time t2 (time of last item in current page), and any items before time t1.


    Duh.. check that. The first clause should read "the next 10 messages you haven't seen and that are LESS recent than any of the ones you have seen"
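
A minimal sketch of the time-cursor query this commenter is describing (the names and data layout are illustrative, not from the post): the client remembers timestamps rather than page numbers.

    from datetime import datetime

    # messages: list of (timestamp, text) pairs, assumed sorted newest first
    def older_than(messages, cursor: datetime, limit=10):
        """The next `limit` messages the client hasn't seen: strictly older than `cursor`."""
        return [m for m in messages if m[0] < cursor][:limit]

    def newer_than(messages, cursor: datetime):
        """Everything that has arrived since the client last looked."""
        return [m for m in messages if m[0] > cursor]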