Ok so you’ll remember some time ago I talked about realtime statistics in MongoDB, right?
Well here is a working version. This script basically records the basic statistics on videos.
So this time I am going to skip why and how I made my document structure and just show you my document structure:
/**
* Document structure:
*
* {
* _id: {},
* hits: 0,
* u_hits: 0,
* hours: {
* 1: {v: 5, u: 0},
* 2: {v: 3, u: 9},
* },
* browser: {
* chrome: 1,
* ie: 2
* },
* age: {
* 13_16: 0,
* 17_25: 0
* },
* video_comments: 0,
* text_comments: 0,
* video_likes: 0,
* video_dislikes: 0,
* age_total: 0,
* browser_total: 0,
* male: 0,
* female: 0,
* day: 0
* }
*/
This may look very overwhelming at first but if you just take a minute to sit down a comprehend it you’ll realise it is very simple:
- Every document works only on a day level
- _id – The unique _id of the document
- hits – The amount of hits (recurring) for that day
- u_hits – The unique hits for that day
- hours – This is an array of hour values (1-24) with another level of complexity within each hour weighting. The sub array within each hour contains a “v” and a “u”. The v is recurring views and the u is unique views.
- browser – This is basically an array of browser types with the amount used beside each one. This variable is for unique hits only.
- age – This denotes the age of logged in users who view this video and it separated into age groups which is then $inc’d for each individual unique visit
- video_comments – This denotes the amount of video comments for that day
- text_comments – This denote the amount of text comments for that day
- video_likes – The amount of likes for that day on the video
- video_dislikes – The amount of dislikes for that day on the video
- age_total – This is an aggregation field of all age groups summed up in amount
- browser_title – Same as above but for browsers
- male – The amount of males who visited this video
- female – The amount of females who visited this video
- day – A timestamp of the zero hour of the day
Now for some of the meaty stuff. Lets make a function which can manipulate this schema.
// This function would be part of say a bigger class called "Video" is whic the object we are collecting stats about, so if you see $this->blah + 1 or whatever it means I am preaggregating the parent variables to make querying stats much easier on things like user video pages.
function recordHit(){
if(!is_search_bot($_SERVER['HTTP_USER_AGENT'])){ // Is the user a search bot? is so we don't want to add them
$user = glue::session()->user;
$u_brows_key = get_ua_browser();
$u_age_key = 'u'; // Uknown age
if($_SESSION['logged']){ // If the user is a logged in user lets get some details
///Other, 13-16, 17-25, 26-35, 36-50, 50+
$u_age = isset($user->birth_day) && isset($user->birth_month) && isset($user->birth_year) ?
mktime(0, 0, 0, $user->birth_month, $user->birth_day, $user->birth_year) : 0; // Lets get their birthday
$u_age_diff = (time() - $u_age)/(60*60*24*365); // Lets find the difference between today and then
switch(true){ // Lets slap them into the right age group
case $u_age_diff > 12 && $u_age_diff < 17:
$u_age_key = '13_16';
break;
case $u_age_diff > 17 && $u_age_diff < 26:
$u_age_key = '17_25';
break;
case $u_age_diff > 25 && $u_age_diff < 36:
$u_age_key = '26_35';
break;
case $u_age_diff > 35 && $u_age_diff < 51:
$u_age_key = '36_50';
break;
case $u_age_diff > 50:
$u_age_key = '50_plus';
break;
}
}
$is_unique = glue::db()->{video_stats_all}->find(array(
"sid" => glue::session()->user->_id instanceof MongoId ? glue::session()->user->_id : session_id(),
"vid" => $this->_id
))->count() <= 0; // Is this view unique?
$update_doc = array( '$inc' => array( 'hits' => 1 ) ); // Lets always add a recurring hit
if($is_unique){
$update_doc['$inc']['u_hits'] = 1; // Lets $inc all the document (atomic ops)
$update_doc['$inc']['age.'.$u_age_key] = 1;
$update_doc['$inc']['browser.'.$u_brows_key] = 1;
// These are used to make my life a little easier
$update_doc['$inc']['age_total'] = 1;
$update_doc['$inc']['browser_total'] = 1;
if(glue::session()->user->gender == 'm'){
$update_doc['$inc']['male'] = 1;
}elseif(glue::session()->user->gender == 'f'){
$update_doc['$inc']['female'] = 1;
}
}
$day_update_doc = $update_doc;
$day_update_doc['$inc']['hours.'.date('G').'.v'] = 1; // Lets $inc the right hour in the day for this guy
if($is_unique): $day_update_doc['$inc']['hours.'.date('G').'.u'] = 1; endif; // Unique edition of $inc'ing the right hour
glue::db()->{video_stats_day}->update(array("day"=>new MongoDate(mktime(0, 0, 0, date("m"), date("d"), date("Y"))), "vid"=>$this->_id), $day_update_doc, array("upsert"=>true)); // Lets send down the query!
if($is_unique){ // If this view is unqiue lets place it witin a all views collection so we can check if the view is unqiue later
$this->unique_views = $this->unique_views+1;
glue::db()->video_stats_all->insert(array(
'sid' => glue::session()->user->_id instanceof MongoId ? glue::session()->user->_id : session_id(),
'vid' => $this->_id,
'ts' => new MongoDate()
));
}
$this->views = $this->views+1; // Lets add pre-aggregation to the parent video object for which this function belongs (in reality you would just add 1 onto whatever parent object you have
$this->save();
// Now lets do some referers
$referer = glue::url()->getNormalisedReferer();
if($referer){ // We have a referrer who has gone from https://facebook.com/i.php?534986465mmde040 to facebook so we won't fill our table with useless referrers.
glue::db()->{video_referers}->update(array('video_id' => $this->_id, 'referer' => $referer), array('$inc' => array('c' => 1),
'$set' => array('ts' => new MongoDate())), array('upsert' => true));
}
}
}
Now let’s write the code to actually get something out of this schema. I have written this with highcharts in mind. I am not going to say this function is easy and what not because it is a bastard and was one to get it working exactly the way I needed it but here goes.
function getStatistics_dateRange($fromTs /* timestamp */, $toTs /* timestamp */){
// Lets set some defaults up for the data to be returned
$unique_views_range = array();
$non_unique_views_range = array();
// These totals make my percentage calcing life easier
$total_browsers = 0;
$total_ages = 0;
$sum_browser = array();
$sum_ages = array();
$sum_video_comments = 0;
$sum_text_comments = 0;
$sum_video_likes = 0;
$sum_video_dislikes = 0;
$sum_males = 0;
$sum_females = 0;
// If an invalid time range lets default to 7 days ago
if($fromTs > $toTs){
$dateFrom = mktime(0, 0, 0, date("m"), date("d")-7, date("Y"));
$dateTo = time();
}
if($fromTs < strtotime('-4 days', $toTs)){ // If this is more than 4 days ago then lets aggregate on day
$newts = $fromTs;
while ($newts <= $toTs) { // This prefills the graph axis' with the notation we require
$unique_views_range[$newts] = 0;
$non_unique_views_range[$newts] = 0;
$newts = mktime(0,0,0, date('m', $newts), date('d', $newts)+1, date('Y', $newts));
}
foreach(glue::db()->{video_stats_day}->find(array(
"vid"=>$this->_id,
"day"=> array("\$gte" => new MongoDate($fromTs), "\$lte" => new MongoDate($toTs) ),
)) as $day){ // Lets get our days!
// Lets pre-aggregate our stuff
$non_unique_views_range[$day['day']->sec] = $day['hits'] > 0 ? $day['hits'] : 0;
$unique_views_range[$day['day']->sec] = $day['u_hits'] > 0 ? $day['u_hits'] : 0;
$sum_browser = summarise_array_row($day['browser'], $sum_browser);
$sum_ages = summarise_array_row($day['age'], $sum_ages);
$total_browsers += (int)$day['browser_total'];
$total_ages += (int)$day['age_total'];
$sum_video_comments += (int)$day['video_comments'];
$sum_text_comments += (int)$day['text_comments'];
$sum_video_likes += (int)$day['video_likes'];
$sum_video_dislikes += (int)$day['video_dislikes'];
$sum_males += (int)$day['males'];
$sum_females += (int)$day['females'];
}
}else{ // else with less than 4 days lets aggregate on hours of those days
$newts = $fromTs;
while($newts < $toTs){ // Lets pre-fill our graph axis' to the notation we require
$newts = $newts+(60*60);
$unique_views_range[$newts] = 0;
$non_unique_views_range[$newts] = 0;
}
foreach(glue::db()->{video_stats_day}->find(array(
"vid"=>$this->_id,
"day"=> array("\$gte" => new MongoDate($fromTs), "\$lte" => new MongoDate($toTs) ),
)) as $day){ // Lets get the days
foreach($day['hours'] as $k => $v){ // Now for each of the hours in those days mark the views as an hour of that day
$k = $k+1;
$non_unique_views_range[mktime($k, 0, 0, date('m', $day['day']->sec), date('d', $day['day']->sec), date('Y', $day['day']->sec))] = $v['v'] > 0 ? $v['v'] : 0;
$unique_views_range[mktime($k, 0, 0, date('m', $day['day']->sec), date('d', $day['day']->sec), date('Y', $day['day']->sec))] = $v['u'] > 0 ? $v['u'] : 0;
}
// Lets aggregate all the other stats too
$sum_browser = summarise_array_row($day['browser'], $sum_browser);
$sum_ages = summarise_array_row($day['age'], $sum_ages);
$total_browsers += (int)$day['browser_total'];
$total_ages += (int)$day['age_total'];
$sum_video_comments += (int)$day['video_comments'];
$sum_text_comments += (int)$day['text_comments'];
$sum_video_likes += (int)$day['video_likes'];
$sum_video_dislikes += (int)$day['video_dislikes'];
$sum_males += (int)$day['male'];
$sum_females += (int)$day['female'];
}
}
// Now lets get the browser crap
$browsers_highCharts_array = array();
$u_brows_capt = 'Other';
foreach($sum_browser as $k => $sum){
if($k =='ie'){
$u_brows_capt = "IE";
}elseif($k == 'ff'){
$u_brows_capt = "Firefox";
}elseif($k == 'chrome'){
$u_brows_capt = "Chrome";
}elseif($k == 'safari'){
$u_brows_capt = "Safari";
}elseif($k == 'opera'){
$u_brows_capt = "Opera";
}elseif($k == 'netscape'){
$u_brows_capt = "Netscape";
}
$browsers_highCharts_array[] = array($u_brows_capt, ($sum/$total_browsers)*100);
}
// And lets understand our age density
$ages_highCharts_array = array();
$u_age_capt = 'Unknown';
foreach($sum_ages as $k => $sum){
if($k == '13_16'){
$u_age_capt = '13-16';
}elseif($k == '17_25'){
$u_age_capt = '17-25';
}elseif($k == '26_35'){
$u_age_capt = '26-35';
}elseif($k == '36_50'){
$u_age_capt = '36-50';
}elseif($k == '50_plus'){
$u_age_capt = '50+';
}
$ages_highCharts_array[] = array($u_age_capt, ($sum/$total_ages)*100);
}
if(sizeof($ages_highCharts_array) <= 0){ // Some defaults to stop broken graphs
$ages_highCharts_array = array(array('None', 100));
}
if(sizeof($browsers_highCharts_array) <= 0){ // Some defaults to stop broken graphs
$browsers_highCharts_array = array(array('None', 100));
}
$total_males_females = $sum_males+$sum_females; // Sum of both males and females who visited (vital for pie chart, mm pie)
// Now lets form the returning array in the format we require!
return array(
'hits' => $this->formatHighChart(array(
"Views"=>$non_unique_views_range,
"Unique Views"=>$unique_views_range
)),
'browsers' => $browsers_highCharts_array,
'ages' => $ages_highCharts_array,
'video_comments' => $sum_video_comments,
'text_comments' => $sum_text_comments,
'video_likes' => $sum_video_likes,
'video_dislikes' => $sum_video_dislikes,
'males' => $sum_males > 0 ? number_format(($total_males_females/$sum_males)*100, 0) : 0,
'females' => $sum_females > 0 ? number_format(($total_males_females/$sum_females)*100, 0) : 0
);
}
And there you have it a very basic aggregation example in MongoDB.
If you have any questions or need clarification or notice errors etc just post a comment below.
Enjoy and happy MongoDB’ing