MongoDB

Debug MongoDB Map Reduce with the Mongo MapReduce Web browser

I am in the process of rewriting this, if you want to see what I am excited about then checkout the repo here: https://github.com/angelozerr/mongo-mapreduce-webbrowser

You may, or may not, remember (depending on whether you are new or returning to this blog) that I wrote a post some time ago about a new Map Reduce debug tool within your very own browser – using jQuery. My original sentiment on the subject was that it was still too rough to test, however, the authors @angelozerr and @pascalleclercq have worked hard on the release of v1.0.0 with the intention of making it stable and boy, have they made it stable.

Here is the live demo. I recommend you try it out; I am going to blabber on down here about some of the things you can do with it now.

First off, you are greeted by an interface with the folders containing your map reduces on one side and a preview of your files on the other. If you double click (I tried single clicking at first too) on any of the example map reduces then you will see it load up in the right hand pane.

Example Mongo MapReduce WebBrowser

Feel free to click on the image to zoom in; it should be 1080 compatible. From the screen shot you can see what I mean about the layout. As an added note there is also a tabbed interface for map reduces (look to the top grey bar) which lets you load multiple files into memory at once.

One of the key changes from the previous version I looked at is the editor (the right pane in the screen shot). It boasts:

  • Code Completion
  • Syntax Highlighting
  • Code validation (with hint popups for telling you the exact error)

All of this works for each step of the map reduce. It will even underline errors for you so you can quickly and easily get to them without having to check what the icon in the side bar says. To add a cherry on top, it also has a live preview where your edits change the output immediately. All in all this is one slick application.

I had none of the problems I experienced previously, and quickly and easily built my own map reduce and saw its output.

There is still one problem that might catch people out: the input documents cannot take BSON objects such as ObjectId() yet, which means you cannot currently input:

[{'_id':ObjectId("455556676678"), "name":"sammaye"}]

That being said, you can input documents in their extended JSON format like:

[{"_id":{"$oid":"50d8fdedadd222278ba9090f"},"date":{"$date":"2012-12-25T01:14:21.312Z"},"url":"http;//iiiiiiiiiii"}]

However, it currently cannot convert console-copied documents down to this format. @angelozerr has some ideas on how he wants to solve this, and I am certain it is a feature that will soon be added in one form or another. Even without it the tool is almost ready for real use; I tested it with around 1,000 documents on the count_tags.js example and it worked. It almost killed Firefox for a while, but it did its job eventually.
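In the meantime, if you want documents in that strict format, one workaround I can suggest (my own idea, nothing the tool itself provides, and the database and collection names here are made up) is to let mongoexport produce it for you, since strict extended JSON is exactly what it writes:

mongoexport -d mydb -c posts -q '{"name":"sammaye"}' -o posts.json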

I really do recommend now taking a look at mongo-mapreduce-webbrowser and trying it for yourself. It is definitely a tool that no MongoDB (or map reduce) user should go without.

MongoDB

MongoDB Paging using Ranged Queries (Avoiding Skip())

This post has been rewritten

Originally I gave only one scenario and only one form of paging, and that paging was open to loopholes. I also did not take sharding into account for certain paging scenarios.

So what is wrong with skip()?

Nothing. Skip is an awesome and necessary function for 90% of paging cases, however, there are certain times where skip becomes slow and cumbersome for the server to work on.

The first consideration with using skip is to understand exactly what is required to commit a skip of x documents. In order to do a skip, MongoDB must effectively do server-side the work you would normally do client-side: reading each record and stepping over it until it reaches the right one.

A skip does not make very effective use of indexes, so it is mostly pointless to add indexes to a collection purely to improve skipping speed.
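You can see the cost for yourself in the shell; a quick sketch (the collection and numbers are made up, and this is the old explain() output format):

var exp = db.posts.find().skip(100000).limit(20).explain();
print(exp.nscanned); // roughly 100020 - the server still walked every skipped document to reach your 20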

That being said, skip will work wonders on small result sets, so you must be aware at all times whether avoiding it for your queries is actually micro-optimisation or not. I personally cannot tell you whether you are micro-optimising; deprecating skip in your app should only be done after massive testing. Personally I have found that skip can work effectively into the 100’s of thousands of rows, maybe up to a million.

Believing that avoiding skip will de facto solve all your problems is like believing stored procedures, triggers, or certain crazy querying patterns in SQL will. It’s all bullshit.

So I can no longer live with Skip() what are my choices?

Well, my furry little friend, that all depends. It depends on what application you are making, what your data is like and what your indexes are like; however, I am going to outline the solution: RANGED QUERIES. Yep, I said it, ranged queries. The idea is to filter out everything before the last document you saw, using a range operator like $gt or $lt, and then use limit() to create the page you want.

Let’s look at an example. We want to make a Facebook feed for what would normally be the users homepage when they login. To begin with we want to get our first page:

var posts = db.posts.find({}).sort({ ts: -1 }).limit(20).toArray();

Here I am using the documents timestamp (ISODate) to sort and range my documents.

Note: the ISODate in MongoDB only goes down to the millisecond, so if you are likely to get documents created at the same instant then use something else. We will discuss this later.

So now we have the first 20 stories for our feed. When the user scrolls to the bottom of the page we want to fire off another query to get the next page. In order to do the range we need to know the timestamp of the last document (story) on the page:

var last_ts = posts[posts.length-1]['ts'];

With this last timestamp we then make the query that will get the next lot of (older) stories; note that because we sort descending, the range operator is $lt:

var posts = db.posts.find({ ts: {$lt: last_ts} }).sort({ ts: -1 }).limit(20).toArray();

And bingo! Keep repeating steps 1 and 2 and you will have successful paging for large data sets.

But…what about skipping huge sections of my data?

Now you have come to the gritty nasty bit. If skipping huge amounts of data were that easy don’t you think MongoDB would already do it for you?

To skip huge sections you need to understand that it isn’t just the DB that has to do the work; the application has to do something as well, just as sites that deal with log data, or other people’s drunk nights out, like Cloudflare and Facebook do.

Facebook timelines and many other large paging systems now accommodate a large skip via the interface. Notice how you have that year filter on the right hand side of the new timeline? Well, that’s how they do it! They simply narrow down the index via the year you pick. Good, isn’t it? Simple, clean, and more effective and user friendly than putting a random page number into the address bar.

Cloudflare, on their statistics page, gives you a huge summary graph and lets you pick out specific ranges by highlighting parts of the graph with your mouse. Again this works a lot like Facebook’s own, whittling things down to a pre-defined set, or range, of documents.
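To make that concrete, here is a hedged sketch of what such an interface-driven jump boils down to (the collection and field names are my own):

// The user picks 2012 from the timeline filter; we narrow to that year first
var from = new Date(2012, 0, 1);
var to = new Date(2013, 0, 1);

// ...then page within the narrowed range exactly as before
var posts = db.posts.find({ ts: { $gte: from, $lt: to } }).sort({ ts: -1 }).limit(20).toArray();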

In fact if you look at 99% of all applications, whether web based or not, that have to deal with huge amounts of data (like Apache logs) you will realise they do not implement normal paging.

So the first and most important thing to understand about how to do large scale paging is that you can no longer solve your problems with just a bit of HTML and a skip(), you need an interface.

Just look at how other programs/sites do it and you will realise how you should do it.

What if I need to pick random records out for paging or just in a normal query?

This is a little off topic but can be used for paging as well.

Sometimes people like to pick random records out of their collection. One option is to use skip() like:

$cursor = $collection->find();
$cursor->skip(3);
$cursor->next();
$doc = $cursor->current();

And again, this is good for small skips. But what if you need to do this continuously and at random points?

If you were to do this often and at random points it might be better to use an incrementing ID, combining another collection with findAndModify to produce an accurate document number ( Doc Link ).
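A sketch of that counter pattern in the shell (the collection names here are my own invention):

// Atomically grab the next number in the sequence
var counter = db.counters.findAndModify({
    query: { _id: 'post_id' },
    update: { $inc: { seq: 1 } },
    new: true,
    upsert: true
});

// Stamp the new document with it
db.posts.insert({ ai_id: counter.seq, deleted: 0, name: 'sammaye' });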

This, however, introduces problems: you must maintain this ID, especially when deletes occur. One method around this is to mark documents as deleted instead of actually deleting them. When you query for the exact document you omit deleted documents and limit() by one, allowing you to get the next closest document to that position, like so:

$cursor = $db->collection->find(array('ai_id' => array('$gte' => 403454), 'deleted' => 0))->limit(1);
$cursor->next();
$doc = $cursor->current();

How do I choose my range key?

As I said earlier, a timestamp may not always work for you; choosing the right cardinality and granularity for your range key is very important. A timestamp is not always unique, especially on a Facebook wall, where you might have two posts that occur in the same second; when getting the next page you would actually miss one of them. The better key to use here is _id with a sort on the timestamp and a compound index on _id and your timestamp field “ts”.
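As a sketch of what that looks like (assuming the same posts collection as before):

// Compound index so the sort and range are both covered
db.posts.ensureIndex({ ts: -1, _id: -1 });

var page = db.posts.find({}).sort({ ts: -1, _id: -1 }).limit(20).toArray();
var last = page[page.length - 1];

// Next page: strictly older posts, or posts in the same instant with a smaller _id
var next = db.posts.find({
    $or: [
        { ts: { $lt: last.ts } },
        { ts: last.ts, _id: { $lt: last._id } }
    ]
}).sort({ ts: -1, _id: -1 }).limit(20).toArray();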

So what is the best key in your scenario? Again I cannot answer that from this post, however, I will try and give a few points and hints on how to choose a good one. I will also try and throw some scenarios in.

The first thing is to get a combination THAT WILL be unique to each document. So say you are doing some rated items and you want to sort by specific ratings; you would normally create a compound index, range over it and use it as your sort as well, like:

$db->find(array('type' => 'video', 'liked' => 1, 'reaction' => 'love'))->sort(array('type' => 1, 'liked' => 1, 'reaction' => 1))

Of course I do not recommend that schema, but this is just an example of how a unique range key would work.

The second thing to consider is your shard key. This is important because you don’t want to scan all computers for a small range that you might be able to find on one computer, right? So make sure to take your shard key into consideration. For example, when doing the Facebook feed over an _id range key and a ts sort key, I would make a shard key of _id and ts to make my ranging as fast as possible.
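Setting that up would look something like this (database and collection names assumed):

sh.enableSharding("mydb");
sh.shardCollection("mydb.posts", { _id: 1, ts: 1 });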

The problem that can be experienced here, as stated in the comments below, is that you do not hit the shard key when ranging. This is still ok, since MongoDB houses index ranges and will still be able to hit specific areas of the collection, however it will force a global (scatter-gather) operation across the shards.

You can find a very nice (as @Matt pointed out) pictographic presentation of how queries work in a sharding environment here: http://www.mongodb.org/download/attachments/2097354/how+queries+work+with+sharding.pdf. This will show you how your queries might react on certain shards and under certain conditions.

Now we have covered the absolute basics of how to page without skip() in MongoDB. I am sure there are more parts I have forgotten to add and will probably update this post as time goes on.

Post a comment if you would like further info or see errors.

MongoDB, PHP

Realtime Stats in MongoDB Example (PHP)

Ok so you’ll remember some time ago I talked about realtime statistics in MongoDB, right?

Well here is a working version. This script basically records the basic statistics on videos.

So this time I am going to skip why and how I made my document structure and just show it to you:

	/**
	 * Document structure:
	 *
	 * {
	 *   _id: {},
	 *   hits: 0,
	 *   u_hits: 0,
	 *   hours: {
	 *   	1: {v: 5, u: 0},
	 *   	2: {v: 3, u: 9},
	 *   },
	 *   browser: {
	 *   	chrome: 1,
	 *   	ie: 2
	 *   },
	 *   age: {
	 *   	13_16: 0,
	 *   	17_25: 0
	 *   },
	 *   video_comments: 0,
	 *   text_comments: 0,
	 *   video_likes: 0,
	 *   video_dislikes: 0,
	 *   age_total: 0,
	 *   browser_total: 0,
	 *   male: 0,
	 *   female: 0,
	 *   day: 0
	 * }
	 */

This may look very overwhelming at first but if you just take a minute to sit down and comprehend it you’ll realise it is very simple:

  • Every document works only on a day level
  • _id – The unique _id of the document
  • hits – The amount of hits (recurring) for that day
  • u_hits – The unique hits for that day
  • hours – This is a sub-document of hour values (0-23, as PHP’s date('G') produces them) with a further level of nesting inside each hour. The sub-document within each hour contains a “v” and a “u”; the v is recurring views and the u is unique views.
  • browser – This is basically an array of browser types with the amount used beside each one. This variable is for unique hits only.
  • age – This denotes the age of logged in users who view this video; it is separated into age groups which are then $inc’d for each individual unique visit
  • video_comments – This denotes the amount of video comments for that day
  • text_comments – This denotes the amount of text comments for that day
  • video_likes – The amount of likes for that day on the video
  • video_dislikes – The amount of dislikes for that day on the video
  • age_total – This is an aggregation field of all age groups summed up in amount
  • browser_total – Same as above but for browsers
  • male – The amount of males who visited this video
  • female – The amount of females who visited this video
  • day – A timestamp of the zero hour of the day

Now for some of the meaty stuff. Let’s make a function which can manipulate this schema.

        // This function would be part of, say, a bigger class called "Video", which is the object we are collecting stats about; so if you see $this->blah + 1 or similar it means I am pre-aggregating the parent variables to make querying stats much easier on things like user video pages.
	function recordHit(){

		if(!is_search_bot($_SERVER['HTTP_USER_AGENT'])){ // Is the user a search bot? If so we don't want to count them

			$user = glue::session()->user;
			$u_brows_key = get_ua_browser();

	        $u_age_key = 'u'; // Unknown age
	        if($_SESSION['logged']){ // If the user is a logged in user lets get some details
		        // Other, 13-16, 17-25, 26-35, 36-50, 50+
		        $u_age = isset($user->birth_day) && isset($user->birth_month) && isset($user->birth_year) ?
		        	mktime(0, 0, 0, $user->birth_month, $user->birth_day, $user->birth_year) : 0; // Lets get their birthday
				$u_age_diff = (time() - $u_age)/(60*60*24*365); // Lets find the difference between today and then

				switch(true){ // Let's slap them into the right age group
					case $u_age_diff >= 13 && $u_age_diff < 17:
						$u_age_key = '13_16';
						break;
					case $u_age_diff >= 17 && $u_age_diff < 26:
						$u_age_key = '17_25';
						break;
					case $u_age_diff >= 26 && $u_age_diff < 36:
						$u_age_key = '26_35';
						break;
					case $u_age_diff >= 36 && $u_age_diff < 51:
						$u_age_key = '36_50';
						break;
					case $u_age_diff > 50:
						$u_age_key = '50_plus';
						break;
				}
	        }

			$is_unique = glue::db()->video_stats_all->find(array(
				"sid" => glue::session()->user->_id instanceof MongoId ? glue::session()->user->_id : session_id(),
				"vid" => $this->_id
			))->count() <= 0; // Is this view unique? A count of 0 means we have never seen this session on this video

			$update_doc = array( '$inc' => array( 'hits' => 1 ) ); // Lets always add a recurring hit
			if($is_unique){
				$update_doc['$inc']['u_hits'] = 1; // Lets $inc all the document (atomic ops)
				$update_doc['$inc']['age.'.$u_age_key] = 1;
				$update_doc['$inc']['browser.'.$u_brows_key] = 1;

				// These are used to make my life a little easier
				$update_doc['$inc']['age_total'] = 1;
				$update_doc['$inc']['browser_total'] = 1;

				if(glue::session()->user->gender == 'm'){
					$update_doc['$inc']['male'] = 1;
				}elseif(glue::session()->user->gender == 'f'){
					$update_doc['$inc']['female'] = 1;
				}
			}

			$day_update_doc = $update_doc;
			$day_update_doc['$inc']['hours.'.date('G').'.v'] = 1; // Let's $inc the right hour in the day for this guy
			if($is_unique): $day_update_doc['$inc']['hours.'.date('G').'.u'] = 1; endif; // Unique version of $inc'ing the right hour

			glue::db()->video_stats_day->update(array("day"=>new MongoDate(mktime(0, 0, 0, date("m"), date("d"), date("Y"))), "vid"=>$this->_id), $day_update_doc, array("upsert"=>true)); // Let's send down the query!

			if($is_unique){ // If this view is unique let's place it within an all-views collection so we can check whether later views are unique
				$this->unique_views = $this->unique_views+1;

				glue::db()->video_stats_all->insert(array(
					'sid' => glue::session()->user->_id instanceof MongoId ? glue::session()->user->_id : session_id(),
					'vid' => $this->_id,
					'ts' => new MongoDate()
				));
			}
			$this->views = $this->views+1; // Let's add pre-aggregation to the parent video object to which this function belongs (in reality you would just add 1 onto whatever parent object you have)
			$this->save();

			// Now let's do some referrers
			$referer = glue::url()->getNormalisedReferer();
			if($referer){ // The normaliser turns something like https://facebook.com/i.php?534986465mmde040 into just facebook, so we won't fill our table with useless referrer URLs
				glue::db()->video_referers->update(array('video_id' => $this->_id, 'referer' => $referer), array('$inc' => array('c' => 1),
					'$set' => array('ts' => new MongoDate())), array('upsert' => true));
			}
		}
	}

Now let’s write the code to actually get something out of this schema. I have written this with highcharts in mind. I am not going to pretend this function is easy, because it was a bastard to get working exactly the way I needed, but here goes.

	function getStatistics_dateRange($fromTs /* timestamp */, $toTs /* timestamp */){

                // Lets set some defaults up for the data to be returned
		$unique_views_range = array();
		$non_unique_views_range = array();

		// These totals make my percentage calcing life easier
		$total_browsers = 0;
		$total_ages = 0;

		$sum_browser = array();
		$sum_ages = array();
		$sum_video_comments = 0;
		$sum_text_comments = 0;
		$sum_video_likes = 0;
		$sum_video_dislikes = 0;

		$sum_males = 0;
		$sum_females = 0;

                // If the time range is invalid let's default to the last 7 days
		if($fromTs > $toTs){
			$fromTs = mktime(0, 0, 0, date("m"), date("d")-7, date("Y"));
			$toTs = time();
		}

		if($fromTs < strtotime('-4 days', $toTs)){ // If this is more than 4 days ago then lets aggregate on day

		  	$newts = $fromTs;
			while ($newts <= $toTs) { // This prefills the graph axis' with the notation we require
				$unique_views_range[$newts] = 0;
				$non_unique_views_range[$newts] = 0;
				$newts = mktime(0,0,0, date('m', $newts), date('d', $newts)+1, date('Y', $newts));
			}

			foreach(glue::db()->video_stats_day->find(array(
				"vid"=>$this->_id,
				"day"=> array("\$gte" => new MongoDate($fromTs), "\$lte" => new MongoDate($toTs) ),
			)) as $day){ // Lets get our days!
				// Lets pre-aggregate our stuff
				$non_unique_views_range[$day['day']->sec] = $day['hits'] > 0 ? $day['hits'] : 0;
				$unique_views_range[$day['day']->sec] = $day['u_hits'] > 0 ? $day['u_hits'] : 0;

				$sum_browser = summarise_array_row($day['browser'], $sum_browser);
				$sum_ages = summarise_array_row($day['age'], $sum_ages);

				$total_browsers += (int)$day['browser_total'];
				$total_ages += (int)$day['age_total'];

				$sum_video_comments += (int)$day['video_comments'];
				$sum_text_comments += (int)$day['text_comments'];
				$sum_video_likes += (int)$day['video_likes'];
				$sum_video_dislikes += (int)$day['video_dislikes'];

				$sum_males += (int)$day['male'];
				$sum_females += (int)$day['female'];
			}

		}else{ // else with less than 4 days lets aggregate on hours of those days

			$newts = $fromTs;
			while($newts < $toTs){ // Lets pre-fill our graph axis' to the notation we require
				$newts = $newts+(60*60);
				$unique_views_range[$newts] = 0;
				$non_unique_views_range[$newts] = 0;
			}

			foreach(glue::db()->video_stats_day->find(array(
				"vid"=>$this->_id,
				"day"=> array("\$gte" => new MongoDate($fromTs), "\$lte" => new MongoDate($toTs) ),
			)) as $day){ // Lets get the days
				foreach($day['hours'] as $k => $v){ // Now for each of the hours in those days mark the views as an hour of that day
					$k = $k+1;
					$non_unique_views_range[mktime($k, 0, 0, date('m', $day['day']->sec), date('d', $day['day']->sec), date('Y', $day['day']->sec))] = $v['v'] > 0 ? $v['v'] : 0;
					$unique_views_range[mktime($k, 0, 0, date('m', $day['day']->sec), date('d', $day['day']->sec), date('Y', $day['day']->sec))] = $v['u'] > 0 ? $v['u'] : 0;
				}

                                // Lets aggregate all the other stats too
				$sum_browser = summarise_array_row($day['browser'], $sum_browser);
				$sum_ages = summarise_array_row($day['age'], $sum_ages);

				$total_browsers += (int)$day['browser_total'];
				$total_ages += (int)$day['age_total'];

				$sum_video_comments += (int)$day['video_comments'];
				$sum_text_comments += (int)$day['text_comments'];
				$sum_video_likes += (int)$day['video_likes'];
				$sum_video_dislikes += (int)$day['video_dislikes'];

				$sum_males += (int)$day['male'];
				$sum_females += (int)$day['female'];
			}
		}

		// Now lets get the browser crap
		$browsers_highCharts_array = array();
		$u_brows_capt = 'Other';
		foreach($sum_browser as $k => $sum){
			if($k =='ie'){
				$u_brows_capt = "IE";
			}elseif($k == 'ff'){
				$u_brows_capt = "Firefox";
			}elseif($k == 'chrome'){
				$u_brows_capt = "Chrome";
			}elseif($k == 'safari'){
				$u_brows_capt = "Safari";
			}elseif($k == 'opera'){
				$u_brows_capt = "Opera";
			}elseif($k == 'netscape'){
				$u_brows_capt = "Netscape";
			}
			$browsers_highCharts_array[] = array($u_brows_capt, ($sum/$total_browsers)*100);
		}

                // And lets understand our age density
		$ages_highCharts_array = array();
		$u_age_capt = 'Unknown';
		foreach($sum_ages as $k => $sum){
			if($k == '13_16'){
				$u_age_capt = '13-16';
			}elseif($k == '17_25'){
				$u_age_capt = '17-25';
			}elseif($k == '26_35'){
				$u_age_capt = '26-35';
			}elseif($k == '36_50'){
				$u_age_capt = '36-50';
			}elseif($k == '50_plus'){
				$u_age_capt = '50+';
			}
			$ages_highCharts_array[] = array($u_age_capt, ($sum/$total_ages)*100);
		}

		if(sizeof($ages_highCharts_array) <= 0){ // Some defaults to stop broken graphs
			$ages_highCharts_array = array(array('None', 100));
		}

		if(sizeof($browsers_highCharts_array) <= 0){ // Some defaults to stop broken graphs
			$browsers_highCharts_array = array(array('None', 100));
		}

		$total_males_females = $sum_males+$sum_females; // Sum of both males and females who visited (vital for pie chart, mm pie)

                // Now lets form the returning array in the format we require!
		return array(
			'hits' => $this->formatHighChart(array(
				"Views"=>$non_unique_views_range,
				"Unique Views"=>$unique_views_range
			)),
			'browsers' => $browsers_highCharts_array,
			'ages' => $ages_highCharts_array,
			'video_comments' => $sum_video_comments,
			'text_comments' => $sum_text_comments,
			'video_likes' => $sum_video_likes,
			'video_dislikes' => $sum_video_dislikes,
			'males' => $sum_males > 0 ? number_format(($sum_males/$total_males_females)*100, 0) : 0,
			'females' => $sum_females > 0 ? number_format(($sum_females/$total_males_females)*100, 0) : 0
		);
	}

And there you have it, a very basic aggregation example in MongoDB.

If you have any questions or need clarification or notice errors etc just post a comment below.

Enjoy and happy MongoDB’ing

MongoDB, Web Development

MongoDB Single Key Index or Compound Index

Just got a question on the MongoDB user group asking which type of index over multiple fields would be best and why. The user basically had three fields:

  • Type
  • CompanyID
  • Date

And was wondering what the difference would be between using a compound index on all three fields and using a single key index per field. The main difference is that MongoDB can only use ONE index per query, so the query optimiser would only be able to use one of your three indexes, rendering the other two useless: dead weight in the DB, slowing inserts and using valuable index space.

As such, with that in mind, it is of course best to use a compound index covering all three fields rather than an index on each field.

Also note that queries will be parsed and optimised better if you index in the order you query, so for example:

find({ type: '', companyID: {} }).sort({ Date: -1 })

Would mean a version of an optimal index would be:

ensureIndex({ type: 1, companyID: 1, Date: 1 })
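You can sanity check which index the optimiser actually picks with explain(); on the versions around at the time of writing the winning index shows up in the cursor field (the collection name and values here are made up):

db.things.find({ type: 'video', companyID: 45 }).sort({ Date: -1 }).explain().cursor;
// e.g. "BtreeCursor type_1_companyID_1_Date_1 reverse"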

Hope this helps,

MongoDB

Realtime Statistics in MongoDB

There is now an example of this sort of aggregation here: Realtime Stats in MongoDB Example (PHP)

As I troll the MongoDB user group (mongodb-user) I notice a LOT of questions about realtime statistics handling in Mongo. As I have repeated myself about 100 times over, I will write a blog post about it.

Many individuals try to treat Mongo like SQL and say “Ok how do I do big massive aggregated JOINs between massive tables?” The answer, my friends, is that you don’t. Yes, that’s right, Mongo has no JOINs, so there is no real way to do aggregated JOINs across many pieces of data; instead Mongo has two methods:

  • Pre-aggregated
  • Map Reduce

As a disclaimer I personally do not advertise the use of MRs at all. I don’t like them to be honest with you so I will not be talking anymore about MRs (I will do some MR articles another day but not right now).

The method I will discuss instead is pre-aggregation. As we all know, when picking data straight from a large collection MongoDB tends to be much faster than SQL; it is specifically designed to be fast when used this way and will happily serve tens of queries a second. So why try to use it like a RDB and shove as much as possible into one single call (or, god forbid, fall into the trap of trying to use…stored procedures)?

Lets start with an example of pre-aggregation. Imagine I have a video site with a load of videos on. I want to record all hits on videos within certain time ranges but I also want to store repeated views and unique views so as to understand how many returning viewers a user has over a period of time and how many unique viewers they drew in. I also would like to pre-aggregate stuff like browser type and referrer and find the best referrers to that video over a period of time.

Now this period of time I keep talking about could exist in the form of:

  • all time
  • day
  • month
  • year

I decide these “time buckets” by looking at how I would like to query; in the list below, the words in brackets name the time bucket that can handle each query (I’ll explain that part later):

  • I would like to get hourly stats (all time)
  • I would like to get daily stats (all time)
  • I would like to get stats over days, possibly weeks (day)
  • I would like to get stats over months (month)
  • I would like to get stats over years (year)

Now let’s think about the collections I would need. To make querying as easy as possible I will break down the time buckets into collections like so:

  • stats.all_time
  • stats.day
  • stats.month
  • stats.year

Note: I have combined the unique and recurring views collections using an “is_unique” boolean field in the document. If your collections grow up to trillions of rows you can split the two.

Ok so you have your time bucket collections for the time sections you would like to query on.

Now lets think about what we would like to store:

  • Unique hit
  • Recurring hit
  • Browser total

So here is an example document I thought up:

{
  _id: {},
  video: {},
  session_id: {},
  hits: 0,
  unique_hit: true, // This denotes if the hit is unique; if false then it is recurring
  browsers: { // This sub-document holds the browsers seen over the period of time; we will $inc the right one according to what we see
    chrome: 0,
    ie: 0,
    firefox: 0,
    other: 0
  },
  ts: number() // This holds a UNIX TIMESTAMP of the time step of the document
}

So we have our collections and a sample of our document. Let’s make some updates.

So a new user comes in and gets assigned a session_id for the lifetime of their browser session (cookie lifetime of 0). We need to find out if this session_id has been here before, so let’s search for this video and session_id in the all_time collection:

var res = db.stats.all_time.findOne({ video: 2394785, session_id: 34857 });

if(res){ // This user has been here before, so we update the non-unique documents for day, month and year
// We upsert the documents just in case they are not already there.
// For the purpose of this test we will say they are using the chrome browser to access our lovely pages.

     db.stats.day.update({video: 2394785, unique_hit: false, ts: //ts of 0 hour of day}, {$inc: {hits: 1, 'browsers.chrome': 1}}, {upsert: true});
     db.stats.month.update({video: 2394785, unique_hit: false, ts: //ts of 0 day of month}, {$inc: {hits: 1, 'browsers.chrome': 1}}, {upsert: true});
     db.stats.year.update({video: 2394785, unique_hit: false, ts: //ts of year only}, {$inc: {hits: 1, 'browsers.chrome': 1}}, {upsert: true});
}else{

     // Insert a new row into the all_time collection so that we can understand this view is not unique next time
     db.stats.all_time.insert({ session_id: session_id, video: 2394785, ts: new Date() });

     db.stats.day.update({video: 2394785, unique_hit: true, ts: //ts of 0 hour of day}, {$inc: {hits: 1, 'browsers.chrome': 1}}, {upsert: true});
     db.stats.month.update({video: 2394785, unique_hit: true, ts: //ts of 0 day of month}, {$inc: {hits: 1, 'browsers.chrome': 1}}, {upsert: true});
     db.stats.year.update({video: 2394785, unique_hit: true, ts: //ts of year only}, {$inc: {hits: 1, 'browsers.chrome': 1}}, {upsert: true});
}
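One thing I would add (my suggestion, it is not shown above): since that uniqueness lookup runs on every single hit, give it an index:

db.stats.all_time.ensureIndex({ video: 1, session_id: 1 });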

And there you have it! You can now track hits on a video and browser types. So, for example, if you wish to show (on a graph) all hits on a video for the last 7 days:

db.stats.day.find({ video: 2394785, ts: {$lt: new Date(), $gt: new Date(//ts of 7 days ago)} })

I know what many will say: “urgh, but this means I gotta de-normalise and that’s bad!!!1111” What’s so different in SQL? Yes, the query would be a single one with aggregation etc, but because that query would be so heavy on your server you’ll find you end up using memcached or big-table designs to make it go fast anyway, and memcached ain’t realtime, bro. So quit your whining and get on it!

Now that we have got that sorted, you will notice that I missed out the referrer; why don’t you try this yourself?

If you need anymore explanations etc just post a comment!

Happy coding,