MongoDB

MongoDB Paging using Ranged Queries (Avoiding Skip())

Paging in MongoDB has always raised a lot of questions from developers and DBAdmins alike. In this article I will attempt to deal with one of life’s greatest dilemmas and talk of pagination in MongoDB.

What is wrong with skip()?

Nothing. Skip is an awesome and necessary function for 90% of paging cases, however, there are certain times where skip becomes slow and cumbersome for the server to work on.

The first consideration with using skip is to understand exactly what is required to commit a skip of x documents. In order to do a skip MongoDB must effectively do the work server-side that you would normally do client side of reading each record out and going over it until it reaches the right one. It normally does this via the _id or $natural index.

That being said, skip will work wonders on small result sets so you must be aware at all times whether reading this post for your queries is actually micro-optimisation or not. I personally cannot tell you whether you are micro-optimising your queries. Skip deprecation should be done with massive testing prior to reading this post, however, I personally have found that skip can work effectively into the 100’s of thousands of rows, maybe up to a million.

Believing that avoiding skip will solve all your problems is bullshit.

I can no longer live with Skip() what are my choices?

The solution to using skip is, instead, to range on the index. There are two operators in MongoDB which help with this:

Essentially, these operators tell MongoDB to only take a portion of the index into account when you go to query. As an example:- imagine I wish to range 20 million records and my index is {_id: 1}:

var c = db.c.find().limit(page_size).sort({_id: 1});
c.min(the_id_of_the_previous_document_from_the_last_page);

As you can see I range the index without using range operators. This means that range operators, such as $gte are separate from the pagination allowing this method to cover all scenarios.

The hard part is making the index. There are many considerations to take into account with this type of querying, sharding being an obvious one. You want an index that covers your queries speed but is not so complex that it cannot be split off into chunks. The ideal solution is where you page from 1,000,000 to 1,000,020 using only a single node in the shard set.

Let’s take an example; say I am paginating a Facebook friends list. I might have an index of {user_id: 1, mutual: 1, mutual_friends_count: 1, date_created: 1}.

There are two ways querying this index can happen:

  1. Without any complex operators to show all friends
  2. With a complex $gte to accomodate showing only recent friends

In this case it might be better to make the index {user_id: 1, date_created: 1, mutual: 1, mutual_friends_count: 1} so you can house a user’s friend list per shard while having the detail data sorted by time so you can easily range on the data within that shard while confining your query to that shard only. This would then be a fast query and should return almost as fast using single server.

So, the biggest problem with range pagination is the index. If you can perfect the index then you can easily make these queries scale.

But… What about skipping huge sections of my data?

Now you have come to the nasty bit. If skipping huge amounts of data were that easy don’t you think MongoDB would already do it for you?

To skip huge sections you need to understand that it isn’t just the DB that has to do the work, the application has to do something as well. Just as many sites who deal with log data or other people’s drunk nights out like Cloudflare and Facebook does.

Facebook timelines and many other large paging systems now accommodate a large skip via the interface. Notice how you have that year filter on the right hand side of your new timeline? Well, that’s how they do it! They simply narrow down the index via a year you pick. Simple, clean, and more effective and user friendly than putting in a random page number into the address bar.

Cloudflare, on their statistics page, gives you a huge summary graph allowing you pick out specific ranges using your mouse to highlight parts of the graph. Again, this works a lot like Facebook’s own, whittling it down to a pre-defined set, or range, of documents.

In fact if you look at 99% of all applications, whether web based or not, that have to deal with huge amounts of data (like Apache logs) you will realise they do not implement normal paging.

So the first and most important thing to understand about how to do large scale paging is that you can no longer solve your problems with just a bit of HTML and a skip(), you need an interface.

Just look at how other programs/sites do it and you will realise how you should do it.

What if I need to pick random records out for paging or just in a normal query?

This is a little off topic but can be used for paging as well.

sometimes people like to pick random records out from their collection. One option is to use skip() like:

$cursor = $collection->find();
$cursor->skip(3);
$cursor->next();
$doc = $cursor->current();

And again, this is good for small skips. But what if you need to do this continuously and at random points?

If you were to do this often and in random ways it might be better to use an increment ID, combining another collection with findAndModify to produce an accurate document number (Doc Link).

This, however, induces problems, it is slow (findAndModify adds locks) and you must maintain this ID, especially when deletes occur. Here is a solution using $gte to get the next nearest number. If that row still exists it will get that row, if not then it will get the next nearest random number:

$cur = $db->collection->find(array('ai_id' => array('$gte' => 403454)))->order(array('ai_id' => SORT_ASC))->limit(1);
$cursor->next();
$doc = $cursor->current();

Post a comment if you would like further info or see errors.

Advertisements

23 thoughts on “MongoDB Paging using Ranged Queries (Avoiding Skip())

  1. This only works if your time stamps are unique. If you have a post with identical time stamps and the last item in a paging range end on the first of these, the second one won’t be displayed because your next query shows items greater than the first of the duplicate time stamps which doesn’t select the second.

    Admittedly this may occur in a small number of cases but for some apps where showing all records are important ie logging or financial systems this solution may not be acceptable.

    1. Indeed, I actually wrote this post too quickly to answer a common scenario on mongodb-user. In that case (which has happened to me before) I would use the _id. I might rewrite this to be comprehensive, provided I can find the time

  2. Well done on the update, what you describe is currently the best solution to this type of request but as you have hinted to when sharding gets involved its not ideal.

    What if you sharded on the _id and timestamp to enable fast activity streams but then need to be able to show the last 20 posts by a particular user? Would this query need to hit all the machines to get a dataset?

    And what if the shard key isn’t used in the query? Is it processed by all shards and merge back?

    Can you create extra shard lookup indexes on non sharded keys as in the example above (id, user, timestamp)?

    1. It depends on the activity stream. I personally go by user_id and _id (as you show for the last question) and I get all subscriptions for that a user. It hits about 5 machines at any given time which as you said isn’t ideal but it is stil dead fast fortunately and very scalable. In this case I have actually sharded on _id, user_id and ts which seems to work.

      There are ways to optimise this. For example you can, everytime a users subscription writes a new activity, copy that activity to the users stream. This way you only query by one user_id and that should hit only one machine.

      If the shard key isn’t used then it will be forced to do a scatter and gather query: http://www.mongodb.org/display/DOCS/Sharding+Limits#ShardingLimits-Queryspeed which is ok if your dataset is small but if it is larger…watch out

      That last one is a good question I wouldn’t mind looking into that. I know that indexes work differently and even though the sharding may not occur on that index I believe MongoDB can do something special with the index (even if it isn’t the shard lookup) that will shortcut, but I would post on mongodb-user about that.

      It should be very fast since even though it would be a global scatter and gather op (http://www.mongodb.org/display/DOCS/Sharding+Introduction#ShardingIntroduction-OperationTypes) MongoDB should be able to whittle down index entries very quickly. I personally have never had to query by another index over an extensive shard range with the user activity stream being the biggest range which I can query by either _id or the compound index of all three fields to identify the users posts (remember MongoDB, like SQL can use partial indexes too so even though you have _id,user_id and ts a query by _id will still hit that index since the shard index should take priority over the default _id index).

      MongoDB docs do have a protion on index optimisation in sharding: http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key#ChoosingaShardKey-Indexoptimization but it is more about making sure only small portions are queried.

  3. Very nice and illuminating post!

    I was wondering how could I get pagination without duplicated items in a very dynamic collection. I am developing an application where people will post votes for each other, but while an user load the next page, the votes could had change the order as the list is ordered from who get more votes to who get less. Since I don’t need to show live data, I could update it every 5 minutes or more, my solution was to load the entire data set and cache it, then paginate in the cached items. Is there a better way to do this?

    1. Hmm that is a tricky one and score boards can be very difficult here. The way that I see other sites do it such as stackoverflow or game sites is that they cache, as you mentioned.

      Normally for about an hour or so; other sites like stackoverflow cache for upto the entire day (due to user base) and then when the cache refreshes they just don’t care if page 2 has the same user or two as page 1.

      I think most sites consider it too much work to do this kind of stuff truly dynamically.

      One way to do this in a way that would not be confusing is to solve a little bit of an interface problem in your app. You could show how many “places” the user has fallen. This way if the same user on page 1 appears on page 2 due to score recalc the user will know and the cache will stop the list from constantly moving. You could even make a message be sent out to your app to display to everyone when the list has been refreshed.

    2. I did notice that I actually made a mistake last time I posted here, I did not take into account _ids in comparison to order of the inde which means that you might lose rows on $gt operators (should update this post what that), in which case you should actually be using http://docs.mongodb.org/manual/reference/operator/meta/min/ to restrict the inde to a lower bound and then limit by 10, otherwise you will miss results.

  4. hi sammaye, right now I have a web site, that has 100.000 products and, products need to be filtered by their price, city etc., after I get the results, there should be a pagination, and I have no idea other than using skip for pagination, do you have any idea or any schema so I do not use skip ?, thank you

  5. What if your ‘ts’ field is really the time you want to sort on but you don’t want to miss records. IE _id is insertion time but my ‘ts’ may be days old when it finally gets inserted. I would like to page and have the sort based on something other than _id, ie in this case a ‘ts’ field (which is a timestamp).

    1. So long as your sort is based on something you should be able to range by _id of the document, just say you want everything over that _id ($gt) since sorting should denote that you won’t miss or return duplicate records, that is the key thing is sorting.

      As for ranging on the ts, if you pick $lte on your first query from highest and on your second $gte while using the _id it should not pick miss records with the same visibility.

    2. I did notice that I actually made a mistake last time I posted here, I did not take into account _ids in comparison to order of the inde which means that you might lose rows on $gt operators (should update this post what that), in which case you should actually be using http://docs.mongodb.org/manual/reference/operator/meta/min/ to restrict the inde to a lower bound and then limit by 10, otherwise you will miss results.

  6. I want to use paging but my sorting does inot requires timestamop and i dont have any other field that would be unique to implement. So how can i use this in my application???any suggestions

  7. This works if you want to sort by a field with unique values (like _id) but this technique will not work well if you want to be able to sort by various different fields (say you have a paged grid of data and you can click on the columns to change the sort order). You would basically have to replace the ts in the above examples with whatever field the user is sorting on. If the values for that field aren’t unique (e.g. city or first name), you are going to get incorrect results from page to page.

    1. Indeed, I have since found ways around this. I should have updated this post but I rarely find time to backdate my posts these days. A quick way of sorting this is using the $max and $min to actually range the index itself which means that you still range but without the problem of the sort omitting documents.

  8. Hey Sammaye, I’m currently facing an issue where I don’t have control over indexes (can’t create compound indexes) and I need to sort a collection based on a field that has repeating values (male / female). Ranging on _id works, but doing so with the gender field keeps returning the same set of data.
    The query looks like this : {gender: { $gte: “male” }}. Any idea how to overcome that problem?

    1. What you can do is use $min and $max to actually factor down the index on MongoDB’s side. These are special operators that specifically tell MongoDB where in the index to perform the query you are passing to it.

      In this case you can pass the _id into $min and use $limit on it with the sort an you should be able to range on that bad boy.

  9. Nice article. But the part with non unique range ordering is too short. An example could be sorting posts by coment count. I think you would need a compount index with the comment_count and the _id. The you would need to remember both values as your cursor and during query, you have to make a query like (javasyntax: or(lte(“comment_count”, lastCommentCount)), and(eq(“comment_count”, lastCommentCount), lte(“_id”, lastID)))
    If you even know how to generalize that or know any frameworks which help you with this, it would also be nice

  10. hi sammaye need you help on paging result of fulltext search, since it sort based on score => score: { $meta : “textScore”} we will display highest score on top. any idea use ranged query since the score is not unique

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s