analytics – Kownter
https://blog.kownter.com
Simple, private, self-hosted, cookie-free website analytics...one day...maybe

Why am I doing this, again?
Sun, 11 Mar 2018
https://blog.kownter.com/2018/03/11/why-am-i-doing-this-again/

I'm totally open to the idea that I'll probably end up realising there are too many problems to solve here and that I'll use an off-the-shelf, open-source tool instead. But I'm plugging on because it's fun.

I actually now have a very, very basic working system that is collecting data from three low-traffic sites. It does basic tracking and has a single-site display, but that's about all right now. Here's what I've got so far.

There aren't even graphs at this point. This is not huge progress, but it's approaching some kind of minimum viable product that I can iterate on.

Now, I keep asking myself "Why, Ross? Why are you doing this?", and having this MVP actually helps answer some of that.

I like using it

Though it’s incredibly minimal, I actually love using what I’ve built.

It's fast (for now). It's simple. I have zero concerns about privacy. I feel like I can deploy this wherever I like. The data is mine. The data is in the UK. I don't need to tell people I'm using it (do I? Someone please correct me if I'm wrong, because this is the whole premise of the thing!). I could turn off Google Analytics, and the WordPress plugins I use to control it, on a bunch of sites right away.

Aside: I've been slimming down my use of "commercial" WordPress plugins lately. Jetpack, as an example, has become far too corporate and intrusive with its in-dashboard marketing and upgrade nudges. This isn't making me upgrade; it's annoying me and making me want to turn it off. There's an analytics plugin that falls into this category too. I get the need to up-sell, but really…get out of my dashboard!

Importantly, my sites' analytics are easy to get to and all in one place. I could accomplish quite a lot of what I have here with server logs and a tool like AWStats, which usually comes with cPanel hosting (which I use a bit of), or the excellent GoAccess command-line log analyzer. But my stuff is on different servers with different logins, and having it all quickly and easily accessible from a single place makes a real difference. It's so easy to just log in and flick through my sites for insights.

It's got me interested in my sites' analytics again

As a result of this last bit in particular, I've become interested in my analytics again. Who knew that my review of my bike gets viewed a lot, or that some random thing I wrote six years ago still crops up in, presumably, search results? (I'll find that out later on when I do per-page reporting.)

It's also served as a reminder that (for secret reasons that I know about but won't go into publicly) my technical skills page gets a lot of hits too, and it's really out of date (seriously – don't read it, it's old news)! 😱

I must do something about that.

So, yeah. This is fun, interesting and has value. I’ll keep at it and see where it goes.

Should I aggregate counts, and if so, when?
Sun, 11 Mar 2018
https://blog.kownter.com/2018/03/11/should-i-aggregate-counts-if-and-so-when/

As I build this thing I'm constantly thinking about performance.

I need registering a page view to be as fast and efficient as possible: on a busy site you don't want to clog the server up with logging views, and you want to avoid race conditions where two page views update the same bit of data at the same time. (Note: none of MY sites are busy, but I'm thinking this should be a general-purpose tool, so it needs to be efficient.)

I also need reporting to be efficient because we could be dealing with large volumes of data.

There is a sliding scale here: performing some data aggregation when logging a hit would make reporting more efficient, but would slow down view recording and risk race conditions.

Data aggregation would do things like increment a daily/weekly/monthly page count for a site/page/referrer/browser when a page hit is made. This is data duplication: the same information can be calculated from the raw page-view data, so storing the counts duplicates data and carries overheads in both storage and processing. The engineering decision is whether those overheads are acceptable.

At one end of the scale I can just log a page hit as a row in a table, and put all of the aggregation calculation burden on the reporting end: calculate everything from raw page hits when you’re viewing data.

At the other end I could increment a whole stack of aggregated data when recording a page hit.
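To make that scale concrete, here's a minimal sketch of the two extremes. The page_views and daily_page_counts tables, columns and function names are all hypothetical, and I'm assuming MySQL for the atomic increment; this is an illustration, not the real schema.

<?php
// Sketch only: hypothetical table names and columns, MySQL assumed.

use Illuminate\Support\Facades\DB;

// One end of the scale: just record the raw hit and let reporting do
// all of the aggregation work later.
function logRawHit(string $siteId, string $path, ?string $referrer): void
{
    DB::table('page_views')->insert([
        'site_id'    => $siteId,
        'path'       => $path,
        'referrer'   => $referrer,
        'created_at' => now(),
    ]);
}

// The other end: also bump a pre-aggregated daily counter at log time.
// A single INSERT ... ON DUPLICATE KEY UPDATE keeps the increment atomic,
// which helps with two hits landing at the same moment (assumes a unique
// key on site_id + path + day).
function bumpDailyCount(string $siteId, string $path): void
{
    DB::statement(
        'INSERT INTO daily_page_counts (site_id, path, day, views)
         VALUES (?, ?, ?, 1)
         ON DUPLICATE KEY UPDATE views = views + 1',
        [$siteId, $path, now()->toDateString()]
    );
}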

I probably won’t have huge amounts of data to deal with when live testing so I may not know which of these works best for a long time.

I could write some massive database seeders for testing, and I probably SHOULD do, but that's for another post. I would also need to load-test the system somehow to see how it performs when busy.

And there are half-way houses. I could push aggregation jobs to a queue and defer them to prevent overloading when the system is busy. Or I could run a job every day to aggregate the day's data. Or every hour. Or every five minutes. But the question then is: how long does the aggregation take? And that could vary wildly.
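For the "run a job every hour/day" version, Laravel's scheduler makes this cheap to experiment with. A sketch, where AggregateDailyCounts is a hypothetical queued job that rolls raw page views up into daily counts; nothing here exists in the app yet.

<?php
// app/Console/Kernel.php — sketch only. AggregateDailyCounts is a
// hypothetical queued job, not something that exists yet.

namespace App\Console;

use App\Jobs\AggregateDailyCounts;
use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule)
    {
        // Hits are logged instantly as raw rows; the heavier roll-up
        // runs on a queue worker once an hour...
        $schedule->job(new AggregateDailyCounts)->hourly();

        // ...or once a day, shortly after midnight:
        // $schedule->job(new AggregateDailyCounts)->dailyAt('00:10');
    }
}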

For these early iterations of the app I can do things either way and change it later (by deleting aggregated data, or by creating it from the raw data using a batch job).

But this is something that will need to be tested for and decided upon at some point. It feels like a critical engineering decision for large-scale analytics!

Note: I’m assuming a single server installation or a simple database server + app server setup. At scale you could probably have one or more logging servers and a reporting server connected to a database but I’m not considering that just yet.

Also note: one clever trick I've come across is to push the server response to the client before doing any processing. I'm not sure this helps with the issue of aggregating when logging a page hit, but it prevents a delay in sending the response and is probably something I want to do. Example here.
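I haven't dug into exactly how the example does it, but in Laravel the natural home for this seems to be terminable middleware: the framework calls terminate() after the response has been sent to the client (using fastcgi_finish_request() under PHP-FPM), so slower work can happen afterwards. A sketch, with a placeholder middleware name and a hypothetical PageView::record() call:

<?php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;

// Sketch: terminable middleware. handle() lets the (tiny) tracking
// response go out immediately; Laravel calls terminate() after the
// response has been sent, so slower logging work doesn't hold it up.
class LogPageViewAfterResponse
{
    public function handle(Request $request, Closure $next)
    {
        return $next($request);
    }

    public function terminate(Request $request, $response)
    {
        // Hypothetical call: do the actual page-view logging here,
        // after the client already has its response.
        // PageView::record($request);
    }
}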

Thoughts on WHAT to track and report
Mon, 26 Feb 2018
https://blog.kownter.com/2018/02/26/thoughts-on-what-to-track-and-report/

I've been thinking about not just HOW to track, but WHAT to track. And these are related: my tracking method will, to some extent, dictate what I can track. For example, a simple pixel image or a URL referenced in HTML or CSS won't be able to send me the URL of the referring page.

And in meeting my goal of not using cookies and not keeping any personally identifiable information, I won't be able to track users' paths through a website.

This is perfectly OK for some applications. It won't be OK for everyone, but even if you want that kind of insight, I can still report the ratio of conversions against page views.

I wasn’t going to add event tracking, but maybe I’ll add events after all to help with this.  This WILL require a JS tracking code to be installed.

We'll see. Initially I'm happy with views per page over time, browser usage metrics, and referring pages/traffic sources. None of these need personally identifiable information. It's all anonymised and aggregated.

IPs are personally identifiable information and will be logged in server logs (unless I turn this off), but I've seen it argued that, as long as you're careful with log rotation and deletion (including backups), there's a case for keeping this data temporarily without consent.

Getting going
Sat, 24 Feb 2018
https://blog.kownter.com/2018/02/24/getting-going/

I've now got a fresh install of Laravel, which will be the framework I'll try to build this on: it's both the framework I'm most familiar with and the one I'm trying to learn more about.

I'm going to try to take an approach called "Test Driven Development" (TDD), which I'm also trying to learn more about. This process involves writing an automated test BEFORE you write the code that makes the application work. So the test you write fails at first, and then you work until it passes.

I've seen TDD done one test at a time, but I already have an idea of what tests I will need to write, so I've gone ahead and added a bunch of tests by name only. Things like:

  • a_page_view_can_be_tracked
  • a_page_view_for_an_unknown_domain_fails
  • a_page_view_logs_the_correct_user_agent

These tests aren't fleshed out yet; they're reminders of what I need to write. I have some idea of what I'll need for this to work, and I'd normally keep these as todos somewhere, but I'm trying to use my test names as my todos here.
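As a rough illustration, those name-only stubs look something like this in a Laravel feature test. The class name and the incomplete-test markers are my own sketch, not the real code:

<?php

namespace Tests\Feature;

use Tests\TestCase;

class TrackingTest extends TestCase
{
    /** @test */
    public function a_page_view_can_be_tracked()
    {
        $this->markTestIncomplete('todo');
    }

    /** @test */
    public function a_page_view_for_an_unknown_domain_fails()
    {
        $this->markTestIncomplete('todo');
    }

    /** @test */
    public function a_page_view_logs_the_correct_user_agent()
    {
        $this->markTestIncomplete('todo');
    }
}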

I’ve also got some of the code working and tests passing.  I can basically track a simple page view right now.
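For flavour, the first of those tests, fleshed out, might look something like this. The /track endpoint, the payload fields and the page_views table are all placeholders here, not the real API:

<?php

namespace Tests\Feature;

use Illuminate\Foundation\Testing\RefreshDatabase;
use Tests\TestCase;

class TrackPageViewTest extends TestCase
{
    use RefreshDatabase;

    /** @test */
    public function a_page_view_can_be_tracked()
    {
        // Hypothetical endpoint and payload shape.
        $response = $this->post('/track', [
            'domain' => 'example.com',
            'path'   => '/reviews/my-bike',
        ]);

        $response->assertStatus(200);

        // Hypothetical table and column names.
        $this->assertDatabaseHas('page_views', [
            'path' => '/reviews/my-bike',
        ]);
    }
}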

I’ve also created a test for being able to display the results and coded that up to get it passing too.

So, right now, I’ve got something that looks like this:
