MongoDB

Realtime Statistics in MongoDB

There is now an example of this sort of aggregation here: Realtime Stats in MongoDB Example (PHP)

As I troll the MongoDB user group (mongodb-user) I notice a LOT of questions about realtime statisitics handling in Mongo. As I have repeated myself about 100 times over I will write a blog post about it.

Many individuals try and treat Mongo like SQL and say “Ok how do I do big massive aggregated JOINs between massive tables?” The answer, my friends, is that you don’t. Yes that’s right Mongo of course has no JOINs so there is no real way to do aggregated JOINs across many pieces of data, instead Mongo has two methods:

  • Pre-aggregated
  • Map Reduce

As a disclaimer I personally do not advertise the use of MRs at all. I don’t like them to be honest with you so I will not be talking anymore about MRs (I will do some MR articles another day but not right now).

The method which instead I will discuss is pre-aggregation. As we all know when picking data straight from a large collection MongoDB can tend to be much faster than SQL it is specifically designed to be when used this way and is very fast at doing 10’s of queries a second, so why try to use it like a RDB and shove as much as possible into one single call (or god forbid should you fall into the trap of trying to use…stored procedures)?

Lets start with an example of pre-aggregation. Imagine I have a video site with a load of videos on. I want to record all hits on videos within certain time ranges but I also want to store repeated views and unique views so as to understand how many returning viewers a user has over a period of time and how many unique viewers they drew in. I also would like to pre-aggregate stuff like browser type and referrer and find the best referrers to that video over a period of time.

Now this period of time I keep talking about could exist in the form of:

  • all time
  • day
  • month
  • Year

I decide these “time buckets” by saying how I would like to query, the words in brackets explain the time bucket that can handle these queries (I’ll explain that part later):

  • I would like to get hourly stats (all time)
  • I would like to get daily stats (all time)
  • I would like to get stats over days, possibly weeks (day)
  • I would like to get stats over months (month)
  • I would like to get stats over years (Year)

Now Lets think about the collections I would need. To make querying as easy as possible I will break down the time buckets into collections like so:

  • stats.all_time
  • stats.day
  • stats.month
  • stats.year

Note: I have combined the unique and recurring views collections using a “is_unique” boolean field in the document. If your collections grow upto trillions of rows you can split the two.

Ok so you got your time bucket collections in what the time sections you would to query on.

Now lets think about what we would like to store:

  • Unique hit
  • Recurring hit
  • Browser total

So here is an example document I thought up:

{
  _id: {},
  video: {},
  session_id: {},
  hits: 0,
  unique_hit: true // This denotes of the hit is unique, if false then it is recurring
  browsers: { // This array holds the browsers seen over the period of time, we will $inc this according to what we see
    chrome: 0,
    ie: 0,
    firefox: 0,
    other: 0
  },
  ts: number() // This holds a UNIX TIMESTAMP of the time step of the document
}

So we have our collections and a sample of our document. Lets mark updates.

So a new user comes in and gets assigned a session_id for the lifetime of their browser session (cookie lifetime of 0). We need to find out if this session_id has been here before so lets search for this video and session_id in the all_time collections:

var res = db.stats.all_time.find({ video: 2394785, session_id: 34857 });

if(res){ // This user is not unique so we update the not unique documents for day, month and year
// We upsert the documents just incase they are not already there.
// For the purpose of this test we will say they are using the chrome browser to access our lovely pages.

     db.stats.day.update({unique_hit: false, ts: //ts of 0 hour of day}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
     db.stats.month.update({unique_hit: false, ts: //ts of 0 day of month}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
     db.stats.year.update({unique_hit: false, ts: //ts of year only}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
}else{

     // Insert a new row into the all time so that we cna undersand if this is unique or not next time.
     db.stats.all_time.insert({ session_id: session_id, video: 2394785, ts: new Date() });

     db.stats.day.update({unique_hit: true, ts: //ts of 0 hour of day}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
     db.stats.month.update({unique_hit: true, ts: //ts of 0 day of month}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
     db.stats.year.update({unique_hit: true, ts: //ts of year only}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
}

And there ypou have it! You can now track a hits on a video and browser types, so for example you wish to show (on a graph) all hits on a video for the last 7 days:

db.stats.day.find({ video: 2394785, ts: {$lt: new Date(), $gt: new Date(//ts of 7 days ago)} })

I know what many will say, “urgh but this means I gotta de-normalise and that’s bad!!!1111” What’s so different in SQL? Yes the query would be one with aggregation etc but then because the query would be so heavy on your server you’ll find that actually you’ll be using memcached or big table designs to actually make it go fast anyway and memcached ain’t realtime bro. So quit your whining and get on it!

Now we got that sorted you will notice that I missed out referrer, why don’t you try this yourself?

If you need anymore explanations etc just post a comment!

Happy coding,

Advertisements

One thought on “Realtime Statistics in MongoDB

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s