Kownter – https://blog.kownter.com
Simple, private, self-hosted, cookie-free website analytics… one day… maybe

Tracking return visits with The Anonymous Cookie
https://blog.kownter.com/2018/05/10/tracking-return-visits-with-the-anonymous-cookie/ – Thu, 10 May 2018 22:08:28 +0000

I’m still working through the pros and cons of using cookies.

On the one hand, they seem like they will allow some super-useful things at relatively low cost, such as tracking a returning visitor in order to show “visitors” and “unique visitors”.

But adding non-essential cookies (analytics cookies are classed as non-essential) means that, in theory, if you’re in the EU you’re supposed to obtain informed, but not explicit, consent for that cookie, which I read as: “you need a pop-up, but not a checkbox”.

A session cookie is, I believe, an “online identifier”, which makes it personally identifiable information subject to the DPA or GDPR in the UK/EU (though I would argue that because you can’t perform a lookup on the session ID it’s not, but who am I to argue?). So session cookies aren’t great for tracking (and wouldn’t scale). But nor is any other unique ID stored in a cookie and sent to the server.

There’s been a great conversation about this topic in a GitHub issue for Kownter’s competitor/collaborator Fathom (I love them, it’s all good). And I’ve realised (yeah, it took a while!) that you don’t need an ID in the cookie. You can just set tracked_with_kownter = 1, or whatever, and that will identify a returning visitor to the site you’re tracking.
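To make that concrete, here is a minimal sketch of what a tracking endpoint could do with such a cookie – assuming a Laravel controller, a hypothetical PageView model and the tracked_with_kownter cookie name; none of this is actual Kownter code:

<?php
// Minimal sketch (hypothetical names) of how a tracking endpoint might use an
// anonymous, value-free cookie to tell new visitors from returning ones.

namespace App\Http\Controllers;

use Illuminate\Http\Request;

class TrackController extends Controller
{
    public function track(Request $request)
    {
        // The cookie carries no ID – its mere presence marks a returning visitor.
        $isReturning = $request->cookie('tracked_with_kownter') === '1';

        // Hypothetical model: record the flag alongside the page view.
        \App\PageView::create([
            'path'         => $request->input('path'),
            'is_returning' => $isReturning,
        ]);

        // Set (or refresh) the anonymous cookie on the way out; 30 days is arbitrary.
        return response('', 204)->cookie('tracked_with_kownter', '1', 60 * 24 * 30);
    }
}

The point is that the server only ever learns “this browser has been counted before”, never who the browser belongs to.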

So the next step is: what can I put in the cookie that is anonymous, but useful?

 

Adding “simple” features
https://blog.kownter.com/2018/05/03/adding-simple-features/ – Thu, 03 May 2018 07:13:43 +0000

I recently added a “simple” toggle so I can show information from different time periods.

[Animated GIF of the time-period toggle]

BUT… this is one of those things that seems trivial but really isn’t, because of the kinds of engineering decisions you might have to make.

In this case I’m building the thing and learning test-driven development as I go, and I wanted to use this as an excuse to:

  • try building some Vue.js components
  • practice using the JavaScript fetch API and promises

And to use fetch I had to write an “API” endpoint to grab the relevant data. So the process for building this was:

  • write tests for the API endpoints (which, in my case, involved building some factory classes too – a job that needed doing anyway)
  • write API endpoint routes
  • write API endpoint controller methods (there’s a rough sketch of these two steps below)
  • create the Vue components
  • plug it all together with the API calls

(You’ll note that I’m not testing the front end)
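As a rough illustration of the route and controller steps, here is a minimal sketch of the kind of endpoint the toggle calls via fetch – the SiteViewsController, the pageViews() relationship and the days query parameter are all hypothetical names, not the actual Kownter code:

<?php
// Minimal sketch (hypothetical names throughout) of an API endpoint the
// time-period toggle can fetch: it returns a JSON view count for a site over
// the chosen number of days.

namespace App\Http\Controllers;

use App\Site;          // hypothetical model with a pageViews() relationship
use Carbon\Carbon;
use Illuminate\Http\Request;

class SiteViewsController extends Controller
{
    public function index(Request $request, Site $site)
    {
        // e.g. GET /api/sites/1/views?days=30 – defaults to the last 7 days
        $days = (int) $request->query('days', 7);

        $views = $site->pageViews()
            ->where('created_at', '>=', Carbon::now()->subDays($days))
            ->count();

        return response()->json(['days' => $days, 'views' => $views]);
    }
}

// In routes/api.php:
// Route::get('/sites/{site}/views', 'SiteViewsController@index');

The Vue component then just fetches that URL with whichever number of days the toggle selects and renders the result.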

Yes, this is modern web development. And yes, none of it was particularly difficult. And yes, it all works fine. And yes, having done this once, subsequent, similar functionality will be quicker to build. But I thought this was a nice insight into how a tiny, trivial-looking user interface improvement actually has many moving parts.

Competition
https://blog.kownter.com/2018/04/23/competition/ – Mon, 23 Apr 2018 22:09:01 +0000

In which I feel truly validated.

I know work on Kownter has stalled – there are other things afoot – but then an “excitable nerd” with 22,000 followers on Twitter says this:

And suddenly the race for simpler, more privacy-aware analytics is on!

Who am I kidding? Paul and his companion Danny will do a WAY better job than I will. BUT, I have a working prototype so maybe I’m a nose ahead and this will inspire me to keep going. [Update: I’m not a nose ahead – Danny already has a prototype too!]

Thoughts on Paul’s mockup:

  • There are return visitors and journey tracking – so presumably they will use cookies? Or some other cleverness that they’ve thought of but I haven’t. This is all walking a line between end-user privacy and functionality. Decisions will need to be made.
  • There are no charts! Which I find interesting. I had it in the back of my mind that I would add them later on. I like the up/down arrows comparing against the previous time period, and YES, I was definitely going to do that!
  • There’s “Signups”, which is also interesting. I’d been toying with the idea of event tracking. I’m using JS tracking code, so no reason I couldn’t make an API for tracking simple events. OR you can get a ratio of people visiting a form page vs people visiting a thank you page, for example. But this won’t work in more dynamic, JS-driven apps.
  • There’s no sign of user agents/browsers. I currently store the raw user agent string, but finding a complete, accurate and up-to-date mapping of UA strings to browser names hasn’t been easy. This is something that an organisation the size of Google can put resources into but I can’t, sadly. Perhaps I should drop it? But it’s data I can get and it can be useful so…hmm… [Update: Daniel Aleksandersen pointed me at Piwik/Matomo’s Device Detector]

I absolutely, 100% applaud the fact that someone else is having a go at building this. Go and sign up for their announcements if you’re interested.

 

Why am I doing this, again?
https://blog.kownter.com/2018/03/11/why-am-i-doing-this-again/ – Sun, 11 Mar 2018 22:41:21 +0000

I’m totally open to the idea that I’ll eventually realise there are too many problems to solve here and end up using an off-the-shelf, open-source tool instead. But I’m plugging on because it’s fun.

I actually now have a very, very basic working system that is collecting data from three low-traffic sites. It does basic tracking and has a single-site display, but that’s about all right now. Here’s what I’ve got:

There aren’t even graphs at this point. This is not huge progress, but it’s approaching some kind of minimum viable product that I can iterate on.

Now, I keep asking myself “Why, Ross. Why are you doing this?” and having this MVP actually helps answer some of that.

I like using it

Though it’s incredibly minimal, I actually love using what I’ve built.

It’s fast (for now). It’s simple. I have zero concerns about privacy. I feel like I can deploy this wherever I like. The data is mine. The data is in the UK. I don’t need to tell people I’m using it (do I? someone please correct me if I’m wrong because this is the whole premise of the thing!). I could turn off Google Analytics and the plugins I use in WordPress to control that on a bunch of sites right away.

Aside: I’ve been slimming down my use of “commercial” WordPress plugins lately. Jetpack, as an example, has become far too corporate and intrusive with its in-dashboard marketing and upgrade nudges. This isn’t making me upgrade; it’s annoying me and making me want to turn it off. There’s an analytics plugin that falls into this category too. I get the need to up-sell, but really… get out of my dashboard!

Importantly, my sites’ analytics are easy to get to and all in one place. I could accomplish quite a lot of what I have here with server logs and a tool like AWStats, which usually comes with cPanel hosting (which I use a bit of), or the excellent GoAccess command-line log analyzer. But I have stuff on different servers with different logins, and having it all quickly and easily accessible from a single place is the real win. It’s so easy to just log in and flick through my sites for insights.

It’s got me interested in my site’s analytics again

Particularly as a result of this last point, I’ve become interested in my analytics again. Who knew that my review of my bike was viewed a lot, or that some random thing I wrote 6 years ago still crops up in, presumably, search results? (I’ll find this out later on when I do per-page reporting.)

It’s also served as a reminder that (for secret reasons that I know about but won’t go into publicly) my technical skills page gets a lot of hits too, and it’s really out of date (seriously – don’t read it, it’s old news)! 😱

I must do something about that.

So, yeah. This is fun, interesting and has value. I’ll keep at it and see where it goes.

Should I aggregate counts, and if so, when?
https://blog.kownter.com/2018/03/11/should-i-aggregate-counts-if-and-so-when/ – Sun, 11 Mar 2018 21:06:48 +0000

As I build this thing I’m constantly thinking about performance.

I need registering a page view to be as fast and efficient as possible: on a busy site you don’t want to clog the server up with logging views, and you want to avoid race conditions where two page views are updating the same bit of data at the same time. (Note: none of MY sites are busy, but I’m thinking this should be a general-purpose tool, so it needs to be efficient.)

I also need reporting to be efficient because we could be dealing with large volumes of data.

There is a scale of performance here: performing some data aggregation when logging a hit would make reporting more efficient, but will slow down view recording and risk race conditions.

Data aggregation would do things like increment a daily/weekly/monthly page count for a site/page/referrer/browser when a page hit is made. This is data duplication: you can calculate this information from the raw page-view data, so storing the counts duplicates data and has overheads in both storage and processing. The engineering decision here is whether these overheads are acceptable.

At one end of the scale I can just log a page hit as a row in a table, and put all of the aggregation calculation burden on the reporting end: calculate everything from raw page hits when you’re viewing data.

At the other end I could increment a whole stack of aggregated data when recording a page hit.
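For illustration, the “aggregate on write” end of the scale might look something like this – a minimal sketch assuming a hypothetical daily_page_counts table with a unique key over (site_id, path, date), not anything that exists in Kownter today:

<?php
// Minimal sketch (hypothetical table and columns) of aggregating on write:
// atomically bump a per-site, per-path, per-day counter as the hit is logged,
// so reporting never has to scan the raw page_views table.

use Illuminate\Support\Facades\DB;

function incrementDailyCount(int $siteId, string $path, string $date): void
{
    // Doing it in a single upsert statement lets MySQL resolve the race
    // between two simultaneous page views hitting the same counter.
    DB::statement(
        'INSERT INTO daily_page_counts (site_id, path, date, views)
         VALUES (?, ?, ?, 1)
         ON DUPLICATE KEY UPDATE views = views + 1',
        [$siteId, $path, $date]
    );
}

The upsert keeps the write path to a single extra query, which is exactly the overhead being weighed up here.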

I probably won’t have huge amounts of data to deal with when live testing so I may not know which of these works best for a long time.

I could write some massive database seeding for testing, and I probably SHOULD, but that’s for another post. I would also need to load-test the system somehow to see how it performs when busy.

And there are half-way houses. I could push aggregation jobs to a queue and defer them to prevent overloading when the system is busy. Or I could run a job every day to aggregate the day’s data. Or every hour. Or every five minutes. But the question then is: how long does the aggregation take? And that could vary wildly.
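A minimal sketch of the “run a job every day” half-way house, assuming Laravel’s scheduler and the same hypothetical daily_page_counts table as above (the command name and columns are made up):

<?php
// Minimal sketch (hypothetical names) of a daily roll-up: an artisan command,
// scheduled once a day, that summarises yesterday's raw page views.

namespace App\Console\Commands;

use Carbon\Carbon;
use Illuminate\Console\Command;
use Illuminate\Support\Facades\DB;

class AggregateDailyViews extends Command
{
    protected $signature = 'kownter:aggregate-daily';
    protected $description = 'Roll up yesterday\'s page views into daily counts';

    public function handle()
    {
        $day = Carbon::yesterday()->toDateString();

        // One set-based query: summarise the raw rows for the day and upsert
        // them into the (hypothetical) daily_page_counts table.
        DB::statement(
            'INSERT INTO daily_page_counts (site_id, path, date, views)
             SELECT site_id, path, DATE(created_at), COUNT(*)
             FROM page_views
             WHERE DATE(created_at) = ?
             GROUP BY site_id, path, DATE(created_at)
             ON DUPLICATE KEY UPDATE views = VALUES(views)',
            [$day]
        );
    }
}

// Scheduled in app/Console/Kernel.php with something like:
// $schedule->command('kownter:aggregate-daily')->dailyAt('00:10');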

For these early iterations of the app I can do things either way and change it later (by deleting aggregated data, or by creating it from the raw data using a batch job).

But this is something that will need to be tested for and decided upon at some point. It feels like a critical engineering decision for large-scale analytics!

Note: I’m assuming a single server installation or a simple database server + app server setup. At scale you could probably have one or more logging servers and a reporting server connected to a database but I’m not considering that just yet.

Also note: one clever trick I’ve come across is to push the server response to the client before doing any processing. I’m not sure it helps with the issue of aggregating while logging a page hit, but it prevents a delay in sending the response and is probably something I want to do. Example here.
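In plain PHP under PHP-FPM that trick looks roughly like the sketch below (the logPageView() function is hypothetical); within Laravel, pushing the logging work onto a queue achieves a similar “respond first, process after” effect:

<?php
// Minimal sketch of responding before processing under PHP-FPM.
// fastcgi_finish_request() flushes the response to the client and closes the
// connection; everything after it runs without the visitor waiting.

http_response_code(204);

if (function_exists('fastcgi_finish_request')) {
    fastcgi_finish_request();   // the browser has its (empty) response now
}

// From here on we're "in the background" as far as the visitor is concerned:
// write the page view, bump aggregate counters, and so on.
logPageView($_GET);             // hypothetical logging function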

Difficult Dates
https://blog.kownter.com/2018/03/05/difficult-dates/ – Mon, 05 Mar 2018 09:46:20 +0000

I had an interesting thought about dates which led to me learning something and then feeling a bit stupid.

The thing to bear in mind here is that I’m thinking a lot about how this tool could accumulate a LOT of data, and how I can make reporting fast. Perhaps I should just trust MySQL, but I don’t.

Laravel, by default, adds created_at and updated_at columns to tables in the database. These are TIMESTAMP-typed values and always appear in “YYYY-MM-DD HH:MM:SS” format. I figured I’d use the created_at attribute for the timestamp of the pageview.

But then I was thinking about reporting, and how, if I wanted to show views from the last 7 days, say, this might be inefficient. The reasoning was that timestamps appear to be strings like “YYYY-MM-DD HH:MM:SS”: that’s how you store them and that’s how they come back in queries. In my head I know that a string comparison is going to be slow, so queries based on comparisons of this column felt like they could be bad.

So I actually got all the way to adding the current time as an integer timestamp in a separate attribute before stopping and thinking: how does MySQL store TIMESTAMP values internally? And, it turns out, they’re stored as integers anyway.

I can use Carbon to easily set up the query parameters when querying this, and all the comparisons are, internally, done numerically. So that’s cool.
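For example, a last-seven-days query against the default created_at column might look like this – a minimal sketch assuming a hypothetical PageView model:

<?php
// Minimal sketch (hypothetical PageView model) of a Carbon-based query against
// the default created_at TIMESTAMP column: views per day for the last week.

use App\PageView;
use Carbon\Carbon;

$viewsPerDay = PageView::where('created_at', '>=', Carbon::now()->subDays(7))
    ->selectRaw('DATE(created_at) AS day, COUNT(*) AS views')
    ->groupBy('day')
    ->orderBy('day')
    ->get();

// MySQL compares against its internal (numeric) TIMESTAMP representation,
// so no separate integer column is needed.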

Time to roll back!

More thoughts on HOW to track
https://blog.kownter.com/2018/02/28/more-thoughts-on-how-to-track/ – Wed, 28 Feb 2018 11:48:52 +0000

There seem to be a few possible ways to do the tracking, and I think they broadly fall into two categories:

  • use JavaScript to send a request
  • request a resource (image, CSS) using an HTML tag or a CSS property

I talked before about the fact that requesting a resource wouldn’t send the details of the referring page. So I may offer both as options, with the resource request being easier to add in some cases and the JS offering more functionality.

But some more questions have come up when thinking about this.

Do I need to send back valid response content?

I want the request and response to be as small as possible.  So can I send back an empty response?

A quick bit of research seems to show that I at least need to set the content type (Aside: I wonder how web servers determine this?).  And I’ll need to send an HTTP status code back.  But if I set that, can the response content itself actually be empty?
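For what it’s worth, a 204 “No Content” response answers the “can it be empty” part directly: a status line and headers, no body at all. A minimal sketch in Laravel, assuming a hypothetical /t tracking route:

<?php
// Minimal sketch of an "empty but valid" tracking response: 204 No Content.

use Illuminate\Support\Facades\Route;

Route::get('/t', function () {
    // ... record the hit here ...

    return response('', 204);

    // If a client really insists on a body and a type, a 1x1 GIF would do:
    // return response($gifBytes, 200)->header('Content-Type', 'image/gif');
});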

How do I prevent caching of responses?

Browsers are pretty good at caching resources these days. So what do I need to do to ensure that a resource response is not cached?
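The usual answer is a set of cache-busting headers on the response. A minimal sketch, continuing the same hypothetical tracking route as above:

<?php
// Minimal sketch of headers that tell browsers (and most proxies) not to cache
// the tracking response.

use Illuminate\Support\Facades\Route;

Route::get('/t', function () {
    // ... record the hit here ...

    return response('', 204)->withHeaders([
        'Cache-Control' => 'no-store, no-cache, must-revalidate, max-age=0',
        'Pragma'        => 'no-cache',   // legacy HTTP/1.0 caches
        'Expires'       => '0',
    ]);
});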

How do I prevent blocking page load?

In what circumstances does loading a resource block the page from loading? I don’t want that. My request should be asynchronous if possible. So how do I achieve that in all cases?

Requesting CSS doesn’t look like a great idea.

Copying an image-generating bit of PHP looks like a good idea though.

I’m suddenly starting to see that this maybe isn’t as easy as it first seems.  Maybe this is why people don’t build their own analytics platforms.

Thoughts on WHAT to track and report
https://blog.kownter.com/2018/02/26/thoughts-on-what-to-track-and-report/ – Mon, 26 Feb 2018 09:05:06 +0000

I’ve been thinking about not just HOW to track, but WHAT to track. And the two are related: my tracking method will, to some extent, dictate what I can track. For example, a simple pixel image or a reference to a URL in HTML or CSS will not be able to send me the URL of the referring page.

And in meeting my goal of not using cookies and not keeping any personally identifiable information, I won’t be able to track users’ paths through a website.

This is perfectly OK for some applications. It’s not OK for everyone, but even if you need that level of detail, it’s still possible to report the ratio of conversions to page views.
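A minimal sketch of that cookie-less conversion idea, assuming a hypothetical PageView model and made-up page paths:

<?php
// Minimal sketch (hypothetical model and paths): compare visits to a form page
// with visits to its thank-you page over the same period.

use App\PageView;
use Carbon\Carbon;

$since = Carbon::now()->subDays(30);

$formViews = PageView::where('path', '/contact')
    ->where('created_at', '>=', $since)->count();

$thankYous = PageView::where('path', '/contact/thanks')
    ->where('created_at', '>=', $since)->count();

// A rough "conversion rate" without tracking any individual's journey.
$conversionRate = $formViews > 0 ? $thankYous / $formViews : 0;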

I wasn’t going to add event tracking, but maybe I’ll add events after all to help with this. That WILL require the JS tracking code to be installed.

We’ll see.  Initially I’m happy with views per page over time, browser usage metrics and referring pages/traffic sources. None of these need personally identifiable information. It’s all anonymised and aggregate.

IPs are personally identifiable information and will be logged in server logs (unless I turn this off), but I’ve seen it argued that, as long as you’re careful with log rotation and deletion (including backups), there’s a case for keeping this data temporarily without consent.

Getting going
https://blog.kownter.com/2018/02/24/getting-going/ – Sat, 24 Feb 2018 10:04:49 +0000

I’ve now got a fresh install of Laravel, which will be the framework I’ll try to build this on – it’s both the framework I’m most familiar with, and the one I’m trying to learn more about.

I’m going to try to take an approach called “Test-Driven Development” (TDD) – which I’m also trying to learn more about. The process involves writing an automated test BEFORE you write the code that makes the application work. So the test you write fails, and then you work until the test passes.

I’ve seen TDD done one test at a time, but I already have an idea of what tests I will need to write, so I’ve gone ahead and added a bunch of tests by name only. Things like:

  • a_page_view_can_be_tracked
  • a_page_view_for_an_unknown_domain_fails
  • a_page_view_logs_the_correct_user_agent

These tests aren’t fleshed out yet; they’re reminders of what I need to write. I have some idea of what I will need for this to work, and I’d normally keep these as todos somewhere, but here I’m trying to use my test names as my todos.
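For illustration, here is a minimal sketch of what the first of those placeholder tests might look like once fleshed out – the /t route, the Site factory, and the page_views columns are all hypothetical names, not the real Kownter code:

<?php
// Minimal sketch (hypothetical route, factory and column names) of the
// a_page_view_can_be_tracked feature test.

namespace Tests\Feature;

use Illuminate\Foundation\Testing\RefreshDatabase;
use Tests\TestCase;

class TrackingTest extends TestCase
{
    use RefreshDatabase;

    /** @test */
    public function a_page_view_can_be_tracked()
    {
        // Assumes the tracking endpoint looks the site up by domain.
        $site = factory(\App\Site::class)->create(['domain' => 'example.com']);

        $response = $this->get('/t?domain=example.com&path=/hello');

        $response->assertStatus(204);
        $this->assertDatabaseHas('page_views', [
            'site_id' => $site->id,
            'path'    => '/hello',
        ]);
    }
}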

I’ve also got some of the code working and tests passing.  I can basically track a simple page view right now.

I’ve also created a test for being able to display the results and coded that up to get it passing too.

So, right now, I’ve got something that looks like this:

Choosing how to track
https://blog.kownter.com/2018/02/23/choosing-how-to-track/ – Fri, 23 Feb 2018 19:39:13 +0000

I’ve been trying to work out how best to do the tracking. There seem to be two options:

  1. Request a resource using CSS or an image
  2. Make a JS call

The main disadvantage of the CSS/image approach is that it won’t be able to send a referrer, so we won’t be able to see where the page that was loaded was accessed from.

There may also be a performance hit as the request might not load asynchronously. I’ll have to do some testing with this.

The JS method requires JS to be added to the site and loaded, BUT it will allow the referrer to be passed on. (I’ll have to work out how.)

There’s also a question about bots. One slight advantage of JS is that it will not track bots that don’t execute JS. What I don’t know is: do the bots that fail to run JS also fail to load other resources? Will a crawler hit the tracking endpoint if CSS or an image is used?

I think I’ll probably end up offering both options. With the JS as an enhanced version if you want to implement it.
