scaling – Kownter
https://blog.kownter.com
Simple, private, self-hosted, cookie-free website analytics... one day... maybe

Should I aggregate counts, and if so, when?
https://blog.kownter.com/2018/03/11/should-i-aggregate-counts-if-and-so-when/
Sun, 11 Mar 2018 21:06:48 +0000

As I build this thing I’m constantly thinking about performance.

I need registering a page view to be as fast and efficient as possible: on a busy site you don’t want logging to clog up the server, and you want to avoid race conditions where two page views update the same bit of data at the same time. (Note: none of MY sites are busy, but I’m thinking this should be a general-purpose tool, so it needs to be efficient.)

I also need reporting to be efficient because we could be dealing with large volumes of data.

There is a trade-off here: performing some data aggregation when logging a hit would make reporting more efficient, but it would slow down view recording and risk race conditions.

Data aggregation would mean things like incrementing a daily/weekly/monthly page count for a site/page/referrer/browser whenever a hit is recorded. This is technically data duplication: you can calculate all of this from the raw page-view data, so storing the counts costs extra storage and processing. The engineering decision is whether those overheads are acceptable.
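To make that concrete, here’s a rough sketch of what such an aggregate table might look like as a Laravel migration. The table and column names here are placeholders I’ve made up, not anything final:

```php
use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;

// Hypothetical daily aggregate table: one row per site/path/day,
// with a running view count.
Schema::create('daily_page_counts', function (Blueprint $table) {
    $table->increments('id');
    $table->unsignedInteger('site_id');
    $table->string('path');
    $table->date('day');
    $table->unsignedInteger('views')->default(0);
    $table->unique(['site_id', 'path', 'day']); // one counter per page per day
});
```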

At one end of the scale I can just log a page hit as a row in a table, and put all of the aggregation calculation burden on the reporting end: calculate everything from raw page hits when you’re viewing data.
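Something like this, where everything is computed at query time (assuming a raw page_views table with one row per hit and Laravel’s usual created_at column):

```php
use Carbon\Carbon;
use Illuminate\Support\Facades\DB;

// Report-time aggregation: derive daily totals from raw hits on demand.
$daily = DB::table('page_views')
    ->selectRaw('DATE(created_at) AS day, COUNT(*) AS views')
    ->where('created_at', '>=', Carbon::now()->subDays(7))
    ->groupBy('day')
    ->orderBy('day')
    ->get();
```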

At the other end I could increment a whole stack of aggregated data when recording a page hit.
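A sketch of that version, reusing the hypothetical daily_page_counts table from above; $siteId and $path stand in for whatever the logging endpoint pulls out of the request:

```php
use Illuminate\Support\Facades\DB;

// Write-time aggregation: bump today's counter while logging the hit.
// The increment happens inside MySQL in a single atomic statement, which
// also sidesteps the read-modify-write race between concurrent views.
DB::statement(
    'INSERT INTO daily_page_counts (site_id, path, day, views)
     VALUES (?, ?, CURDATE(), 1)
     ON DUPLICATE KEY UPDATE views = views + 1',
    [$siteId, $path]
);
```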

I probably won’t have huge amounts of data to deal with when live testing so I may not know which of these works best for a long time.

I could write a massive database seeder for testing, and I probably SHOULD, but that’s for another post. I would also need to load-test the system somehow to see how it performs when busy.

And there are half-way houses. I could push aggregation jobs onto a queue and defer them to prevent overloading when the system is busy. Or I could run a job every day to aggregate the day’s data. Or every hour. Or every five minutes. But the question then is: how long does the aggregation take? And that could vary wildly.
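The queue option might look roughly like this as a Laravel job; again, the class name and the daily_page_counts table are my placeholders, not anything that exists yet:

```php
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\DB;

// Hypothetical deferred-aggregation job: the logging endpoint inserts the
// raw row, dispatches this, and returns; a queue worker updates the
// counters whenever it gets to the job.
class AggregatePageView implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public $siteId;
    public $path;

    public function __construct($siteId, $path)
    {
        $this->siteId = $siteId;
        $this->path = $path;
    }

    public function handle()
    {
        DB::statement(
            'INSERT INTO daily_page_counts (site_id, path, day, views)
             VALUES (?, ?, CURDATE(), 1)
             ON DUPLICATE KEY UPDATE views = views + 1',
            [$this->siteId, $this->path]
        );
    }
}

// In the logging endpoint:
// AggregatePageView::dispatch($siteId, $path);
```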

For these early iterations of the app I can do things either way and change it later (by deleting aggregated data, or by creating it from the raw data using a batch job).

But this is something that will need to be tested for and decided upon at some point. It feels like a critical engineering decision for large-scale analytics!

Note: I’m assuming a single server installation or a simple database server + app server setup. At scale you could probably have one or more logging servers and a reporting server connected to a database but I’m not considering that just yet.

Also note: one clever trick I’ve come across is to push the server response to the client before doing any processing. I’m not sure this helps with the aggregation-on-logging issue, but it prevents a delay in sending the response and is probably something I want to do. Example here.
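In Laravel terms, terminable middleware seems like the natural fit for this. A rough sketch (LogPageView is a made-up name): terminate() runs after the response has gone out, because the framework calls $response->send() before $kernel->terminate(), and under PHP-FPM send() also flushes the connection to the client.

```php
use Closure;

class LogPageView
{
    public function handle($request, Closure $next)
    {
        return $next($request); // respond as quickly as possible
    }

    // Runs after the client already has its response.
    public function terminate($request, $response)
    {
        // Record the raw hit, and/or dispatch the aggregation job here,
        // e.g. the hypothetical AggregatePageView::dispatch($siteId, $path).
    }
}
```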

Difficult Dates
https://blog.kownter.com/2018/03/05/difficult-dates/
Mon, 05 Mar 2018 09:46:20 +0000

I had an interesting thought about dates which led to me learning something and then feeling a bit stupid.

The thing to bear in mind here is that I’m thinking a lot about how this tool could accumulate a LOT of data, and how I can make reporting fast. Perhaps I should just trust MySQL, but I don’t.

Laravel, by default, adds created_at and updated_at columns to database tables. These are TIMESTAMP-typed values and always appear in “YYYY-MM-DD HH:MM:SS” format. I figured I’d use the created_at attribute for the timestamp of the pageview.
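Those two columns come from the $table->timestamps() call in a standard migration; a minimal sketch, with a page_views table name I’ve assumed:

```php
use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;

Schema::create('page_views', function (Blueprint $table) {
    $table->increments('id');
    $table->timestamps(); // adds nullable TIMESTAMP created_at / updated_at
});
```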

But then I was thinking about reporting, and how, if I wanted to show views from the last 7 days, say, this might be inefficient. The reasoning was that timestamps appear to be strings like “YYYY-MM-DD HH:MM:SS”: that’s the format you store them in, and that’s the format queries give back. In my head a string comparison is going to be slow, so comparisons against this column felt like they could be bad.

So I actually got all the way to adding the current time as an integer timestamp in a separate attribute before stopping and thinking: how does MySQL store TIMESTAMP values internally? It turns out they’re stored as integers anyway; the “YYYY-MM-DD HH:MM:SS” string is just the display format.

I can use Carbon to easily set up the query parameters, and all the comparisons are done numerically inside MySQL. So that’s cool.
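So a last-7-days query can be as simple as something like this (PageView being a stand-in name for the Eloquent model):

```php
use App\PageView;
use Carbon\Carbon;

// Carbon supplies the bound datetime; MySQL compares the TIMESTAMP
// column numerically under the hood.
$recent = PageView::where('created_at', '>=', Carbon::now()->subDays(7))
    ->count();
```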

Time to roll back!
