Should I aggregate counts, if and so, when?
As I build this thing I’m constantly thinking about performance.
I need registering a page view to be as fast and efficient as possible because on a busy site you don’t want to clog the server up with logging views and you want to avoid race conditions where two page views are updating the same bit of data at the same time. (Note, none of MY sites are busy, but I’m thinking this should be a general purpose tool so it needs to be efficient)
I also need reporting to be efficient because we could be dealing with large volumes of data.
There is a scale of performance here: performing some data aggregation when logging a hit would make reporting more efficient, but will slow down view recording and risk race conditions.
Data aggregation would do things like increment a daily/weekly/monthly page count for a site/page/referrer/browser when a page hit is made. This is technically data duplication: you can calculate this information from the raw page-view data, so storing the counts is technically duplicating data and has overheads in both storage and processing. The engineering decision here is are these overheads acceptable ones?
At one end of the scale I can just log a page hit as a row in a table, and put all of the aggregation calculation burden on the reporting end: calculate everything from raw page hits when you’re viewing data.
At the other end I could increment a whole stack of aggregated data when recording a page hit.
I probably won’t have huge amounts of data to deal with when live testing so I may not know which of these works best for a long time.
I could write some massive database seeding in testing, and I probably SHOULD do, but that’s for another post. But I would also need to load-test the system somehow to see how it performs when busy.
And there are half-way houses. I could push aggregation jobs to a queue and defer them to prevent overloading when the system is busy. Or I could run a job every day to aggregate the day’s data. Or every hour. Or every five miutes. But the questions is then: how long does the aggregation take? And that could vary wildly.
For these early iterations of the app I can do things either way and change it later (by deleting aggregated data, or by creating it from the raw data using a batch job).
But this is something that will need to be tested for and decided upon at some point. It feels like a critical engineering decision for large-scale analytics!
Note: I’m assuming a single server installation or a simple database server + app server setup. At scale you could probably have one or more logging servers and a reporting server connected to a database but I’m not considering that just yet.
Also note: one clever trick I’ve come across is to push the server response to the client before doing any processing. I’m not sure this helps with the issues of aggregating when logging a page hit but it prevents a delay in sending the response and is probably something I want to do. Example here.