MongoDB, PHP

Real-time Statistics in MongoDB

So, how do we do aggregation in a technology with no JOINs? How do companies like Digg replace their RDBMS with MongoDB for aggregation and analytics? It is quite simple, really, there are two main approaches:

  • Pre-aggregation (includes bucket aggregation and other methods)
  • Map Reduce (MR)

Disclaimer: I personally do not advertise the use of Map Reduce except for background processing between aggregating database nodes. Mostly due to speed, but there are other considerations.

The method which I will discuss is pre-aggregation. When picking data straight from a large collection MongoDB can tend to be much faster than SQL. MongoDB is specifically designed to be very fast at doing 100s of queries a second. DO NOT TRY AND USE MONGODB LIKE A RDB AND LOOK INTO STORED PROCEDURES!

Let’s start, imagine I have a video site with a load of videos on. I want to record all hits on videos within certain time ranges but I also want to store repeated views and unique views so as to understand how many returning viewers a user has over a period of time and how many unique viewers they drew in. I also would like to pre-aggregate stuff like browser type and referrer and find the best referrers to that video over a period of time – “All Time”, “Day”, “Month”, “Year”.

I decide these “time buckets” by asking myself how I would like to query, the words in brackets explain the time bucket that can handle these queries (I’ll explain that part later):

  • I would like to get hourly stats (all time)
  • I would like to get daily stats (all time)
  • I would like to get stats over days, possibly weeks (day)
  • I would like to get stats over months (month)
  • I would like to get stats over years (Year)

So, we are sure of our time buckets. Let’s make our collections:

  • stats.all_time
  • stats.day
  • stats.month
  • stats.year

Note: I have combined the unique and recurring views collections using a is_unique field in the document. If your collections grow up to trillions of rows you can split the two.

Now, let’s think about what we would like to store:

  • Unique hit
  • Recurring hit
  • Browser total

Here is an example document I thought up of earlier:

{
  _id: {},
  video: {},
  session_id: {},
  hits: 0,
  unique_hit: true // This denotes of the hit is unique, if false then it is recurring
  browsers: { // This array holds the browsers seen over the period of time, we will $inc this according to what we see
    chrome: 0,
    ie: 0,
    firefox: 0,
    other: 0
  },
  ts: number() // This holds a UNIX TIMESTAMP of the time step of the document
}

Now for the code, a new user comes in and gets assigned a session_id for the lifetime of their browser session (cookie lifetime of 0). We need to find out if this session_id has been here before so lets search for this video and session_id in the all_time collections:

var res = db.stats.all_time.find({ video: 2394785, session_id: 34857 });

if (res) { 
    // This user is not unique so we update the not unique documents for day, month and year
    // We upsert the documents just incase they are not already there.
    // For the purpose of this test we will say they are using the chrome browser to access our lovely pages.

     db.stats.day.update({unique_hit: false, ts: //ts of 0 hour of day}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
     db.stats.month.update({unique_hit: false, ts: //ts of 0 day of month}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
     db.stats.year.update({unique_hit: false, ts: //ts of year only}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
} else {

     // Insert a new row into the all time so that we cna undersand if this is unique or not next time.
     db.stats.all_time.insert({ session_id: session_id, video: 2394785, ts: new Date() });

     db.stats.day.update({unique_hit: true, ts: //ts of 0 hour of day}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
     db.stats.month.update({unique_hit: true, ts: //ts of 0 day of month}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
     db.stats.year.update({unique_hit: true, ts: //ts of year only}, {$inc: {hits: 1, chrome: 1}}, {upsert: true});
}

And there you have it! You can now track hits on a video and the browser types, so, for example; you wish to show all hits on a video for the last 7 days:

db.stats.day.find({ video: 2394785, ts: {$lt: new Date(), $gt: new Date(//ts of 7 days ago)} })

I know what many will say, “urgh but this means I gotta de-normalise and that’s bad!!!1111” What’s so different in SQL? Yes, the query would be one with aggregation etc but then because the query would be so heavy on your server you’ll find that actually you’ll be using memcached or big table designs (no JOINs) to actually make it go fast anyway and memcached isn’t real-time, bro’. So, quit your whining and get on it!

Example

I have taken a working version straight out of one of my projects. It will be missing some dependencies but the core logic should be intact:

	/**
	 * Document structure:
	 *
	 * {
	 *   _id: {},
	 *   hits: 0,
	 *   u_hits: 0,
	 *   hours: {
	 *   	1: {v: 5, u: 0},
	 *   	2: {v: 3, u: 9},
	 *   },
	 *   browser: {
	 *   	chrome: 1,
	 *   	ie: 2
	 *   },
	 *   age: {
	 *   	13_16: 0,
	 *   	17_25: 0
	 *   },
	 *   video_comments: 0,
	 *   text_comments: 0,
	 *   video_likes: 0,
	 *   video_dislikes: 0,
	 *   age_total: 0,
	 *   browser_total: 0,
	 *   male: 0,
	 *   female: 0,
	 *   day: 0
	 * }
	 */

And these are the field definitions for the day level document:

  • _id – The unique _id of the document
  • hits – The amount of hits (recurring) for that day
  • u_hits – The unique hits for that day
  • hours – This is an array of hour values (1-24) with another level of complexity within each hour weighting. The sub array within each hour contains a “v” and a “u”. The v is recurring views and the u is unique views.
  • browser – This is basically an array of browser types with the amount used beside each one. This variable is for unique hits only.
  • age – This denotes the age of logged in users who view this video and it separated into age groups which is then $inc’d for each individual unique visit
  • video_comments – This denotes the amount of video comments for that day
  • text_comments – This denotes the amount of text comments for that day
  • video_likes – The amount of likes for that day on the video
  • video_dislikes – The amount of dislikes for that day on the video
  • age_total – This is an aggregation field of all age groups summed up in amount
  • browser_title – Same as above but for browsers
  • male – The amount of males who visited this video
  • female – The amount of females who visited this video
  • day – A time stamp of the zero hour of the day

Now, for some of the meaty stuff. Lets make a function which can manipulate this schema:

// This function would be part of say a bigger class called "Video" is whic the object we are collecting stats about, so if you see $this->blah + 1 or whatever it means I am preaggregating the parent variables to make querying stats much easier on things like user video pages.
function recordHit()
{

    if (!is_search_bot($_SERVER['HTTP_USER_AGENT'])) { // Is the user a search bot? is so we don't want to add them

        $user = glue::session()->user;
        $u_brows_key = get_ua_browser();

        $u_age_key = 'u'; // Uknown age
        if ($_SESSION['logged']) { // If the user is a logged in user lets get some details
            ///Other, 13-16, 17-25, 26-35, 36-50, 50+
            $u_age = isset($user->birth_day) && isset($user->birth_month) && isset($user->birth_year) ?
                mktime(0, 0, 0, $user->birth_month, $user->birth_day, $user->birth_year) : 0; // Lets get their birthday
            $u_age_diff = (time() - $u_age) / (60 * 60 * 24 * 365); // Lets find the difference between today and then

            switch (true) { // Lets slap them into the right age group
                case $u_age_diff > 12 && $u_age_diff < 17:
                    $u_age_key = '13_16';
                    break;
                case $u_age_diff > 17 && $u_age_diff < 26:
                    $u_age_key = '17_25';
                    break;
                case $u_age_diff > 25 && $u_age_diff < 36:
                    $u_age_key = '26_35';
                    break;
                case $u_age_diff > 35 && $u_age_diff < 51:
                    $u_age_key = '36_50';
                    break;
                case $u_age_diff > 50:
                    $u_age_key = '50_plus';
                    break;
            }
        }

        $is_unique = glue::db()->{video_stats_all}->find(array(
                "sid" => glue::session()->user->_id instanceof MongoId ? glue::session()->user->_id : session_id(),
                "vid" => $this->_id
            ))->count() <= 0; // Is this view unique?

        $update_doc = array('$inc' => array('hits' => 1)); // Lets always add a recurring hit
        if ($is_unique) {
            $update_doc['$inc']['u_hits'] = 1; // Lets $inc all the document (atomic ops)
            $update_doc['$inc']['age.' . $u_age_key] = 1;
            $update_doc['$inc']['browser.' . $u_brows_key] = 1;

            // These are used to make my life a little easier
            $update_doc['$inc']['age_total'] = 1;
            $update_doc['$inc']['browser_total'] = 1;

            if (glue::session()->user->gender == 'm') {
                $update_doc['$inc']['male'] = 1;
            } elseif (glue::session()->user->gender == 'f') {
                $update_doc['$inc']['female'] = 1;
            }
        }

        $day_update_doc = $update_doc;
        $day_update_doc['$inc']['hours.' . date('G') . '.v'] = 1; // Lets $inc the right hour in the day for this guy
        if ($is_unique): $day_update_doc['$inc']['hours.' . date('G') . '.u'] = 1; endif; // Unique edition of $inc'ing the right hour

        glue::db()->{video_stats_day}->update(array("day" => new MongoDate(mktime(0, 0, 0, date("m"), date("d"), date("Y"))), "vid" => $this->_id), $day_update_doc, array("upsert" => true)); // Lets send down the query!

        if ($is_unique) { // If this view is unqiue lets place it witin a all views collection so we can check if the view is unqiue later
            $this->unique_views = $this->unique_views + 1;

            glue::db()->video_stats_all->insert(array(
                'sid' => glue::session()->user->_id instanceof MongoId ? glue::session()->user->_id : session_id(),
                'vid' => $this->_id,
                'ts' => new MongoDate()
            ));
        }
        $this->views = $this->views + 1; // Lets add pre-aggregation to the parent video object for which this function belongs (in reality you would just add 1 onto whatever parent object you have
        $this->save();

        // Now lets do some referers
        $referer = glue::url()->getNormalisedReferer();
        if ($referer) { // We have a referrer who has gone from https://facebook.com/i.php?534986465mmde040 to facebook so we won't fill our table with useless referrers.
            glue::db()->{video_referers}->update(array('video_id' => $this->_id, 'referer' => $referer), array('$inc' => array('c' => 1),
                '$set' => array('ts' => new MongoDate())), array('upsert' => true));
        }
    }
}

Now, let’s write the code to actually get something out of this schema. I have written this with highcharts in mind:

function getStatistics_dateRange($fromTs /* timestamp */, $toTs /* timestamp */)
{
    // Lets set some defaults up for the data to be returned
    $unique_views_range = array();
    $non_unique_views_range = array();

    // These totals make my percentage calcing life easier
    $total_browsers = 0;
    $total_ages = 0;

    $sum_browser = array();
    $sum_ages = array();
    $sum_video_comments = 0;
    $sum_text_comments = 0;
    $sum_video_likes = 0;
    $sum_video_dislikes = 0;

    $sum_males = 0;
    $sum_females = 0;

    // If an invalid time range lets default to 7 days ago
    if ($fromTs > $toTs) {
        $dateFrom = mktime(0, 0, 0, date("m"), date("d") - 7, date("Y"));
        $dateTo = time();
    }

    if ($fromTs < strtotime('-4 days', $toTs)) { // If this is more than 4 days ago then lets aggregate on day

        $newts = $fromTs;
        while ($newts <= $toTs) { // This prefills the graph axis' with the notation we require
            $unique_views_range[$newts] = 0;
            $non_unique_views_range[$newts] = 0;
            $newts = mktime(0, 0, 0, date('m', $newts), date('d', $newts) + 1, date('Y', $newts));
        }

        foreach (glue::db()->{video_stats_day}->find(array(
            "vid" => $this->_id,
            "day" => array("\$gte" => new MongoDate($fromTs), "\$lte" => new MongoDate($toTs)),
        )) as $day) { // Lets get our days!
            // Lets pre-aggregate our stuff
            $non_unique_views_range[$day['day']->sec] = $day['hits'] > 0 ? $day['hits'] : 0;
            $unique_views_range[$day['day']->sec] = $day['u_hits'] > 0 ? $day['u_hits'] : 0;

            $sum_browser = summarise_array_row($day['browser'], $sum_browser);
            $sum_ages = summarise_array_row($day['age'], $sum_ages);

            $total_browsers += (int)$day['browser_total'];
            $total_ages += (int)$day['age_total'];

            $sum_video_comments += (int)$day['video_comments'];
            $sum_text_comments += (int)$day['text_comments'];
            $sum_video_likes += (int)$day['video_likes'];
            $sum_video_dislikes += (int)$day['video_dislikes'];

            $sum_males += (int)$day['males'];
            $sum_females += (int)$day['females'];
        }

    } else { // else with less than 4 days lets aggregate on hours of those days

        $newts = $fromTs;
        while ($newts < $toTs) { // Lets pre-fill our graph axis' to the notation we require
            $newts = $newts + (60 * 60);
            $unique_views_range[$newts] = 0;
            $non_unique_views_range[$newts] = 0;
        }

        foreach (glue::db()->{video_stats_day}->find(array(
            "vid" => $this->_id,
            "day" => array("\$gte" => new MongoDate($fromTs), "\$lte" => new MongoDate($toTs)),
        )) as $day) { // Lets get the days
            foreach ($day['hours'] as $k => $v) { // Now for each of the hours in those days mark the views as an hour of that day
                $k = $k + 1;
                $non_unique_views_range[mktime($k, 0, 0, date('m', $day['day']->sec), date('d', $day['day']->sec), date('Y', $day['day']->sec))] = $v['v'] > 0 ? $v['v'] : 0;
                $unique_views_range[mktime($k, 0, 0, date('m', $day['day']->sec), date('d', $day['day']->sec), date('Y', $day['day']->sec))] = $v['u'] > 0 ? $v['u'] : 0;
            }

            // Lets aggregate all the other stats too
            $sum_browser = summarise_array_row($day['browser'], $sum_browser);
            $sum_ages = summarise_array_row($day['age'], $sum_ages);

            $total_browsers += (int)$day['browser_total'];
            $total_ages += (int)$day['age_total'];

            $sum_video_comments += (int)$day['video_comments'];
            $sum_text_comments += (int)$day['text_comments'];
            $sum_video_likes += (int)$day['video_likes'];
            $sum_video_dislikes += (int)$day['video_dislikes'];

            $sum_males += (int)$day['male'];
            $sum_females += (int)$day['female'];
        }
    }

    // Now lets get the browser crap
    $browsers_highCharts_array = array();
    $u_brows_capt = 'Other';
    foreach ($sum_browser as $k => $sum) {
        if ($k == 'ie') {
            $u_brows_capt = "IE";
        } elseif ($k == 'ff') {
            $u_brows_capt = "Firefox";
        } elseif ($k == 'chrome') {
            $u_brows_capt = "Chrome";
        } elseif ($k == 'safari') {
            $u_brows_capt = "Safari";
        } elseif ($k == 'opera') {
            $u_brows_capt = "Opera";
        } elseif ($k == 'netscape') {
            $u_brows_capt = "Netscape";
        }
        $browsers_highCharts_array[] = array($u_brows_capt, ($sum / $total_browsers) * 100);
    }

    // And lets understand our age density
    $ages_highCharts_array = array();
    $u_age_capt = 'Unknown';
    foreach ($sum_ages as $k => $sum) {
        if ($k == '13_16') {
            $u_age_capt = '13-16';
        } elseif ($k == '17_25') {
            $u_age_capt = '17-25';
        } elseif ($k == '26_35') {
            $u_age_capt = '26-35';
        } elseif ($k == '36_50') {
            $u_age_capt = '36-50';
        } elseif ($k == '50_plus') {
            $u_age_capt = '50+';
        }
        $ages_highCharts_array[] = array($u_age_capt, ($sum / $total_ages) * 100);
    }

    if (sizeof($ages_highCharts_array) <= 0) { // Some defaults to stop broken graphs
        $ages_highCharts_array = array(array('None', 100));
    }

    if (sizeof($browsers_highCharts_array) <= 0) { // Some defaults to stop broken graphs
        $browsers_highCharts_array = array(array('None', 100));
    }

    $total_males_females = $sum_males + $sum_females; // Sum of both males and females who visited (vital for pie chart, mm pie)

    // Now lets form the returning array in the format we require!
    return array(
        'hits' => $this->formatHighChart(array(
            "Views" => $non_unique_views_range,
            "Unique Views" => $unique_views_range
        )),
        'browsers' => $browsers_highCharts_array,
        'ages' => $ages_highCharts_array,
        'video_comments' => $sum_video_comments,
        'text_comments' => $sum_text_comments,
        'video_likes' => $sum_video_likes,
        'video_dislikes' => $sum_video_dislikes,
        'males' => $sum_males > 0 ? number_format(($total_males_females / $sum_males) * 100, 0) : 0,
        'females' => $sum_females > 0 ? number_format(($total_males_females / $sum_females) * 100, 0) : 0
    );
}

And there you have it a very basic aggregation example in MongoDB.

If you need more explanation just post a comment!

Advertisements

One thought on “Real-time Statistics in MongoDB

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s