We did our Master’s thesis project on browser fingerprinting as an alternative to cookie-based identification methods. Now, I know what you’re thinking: tracking Internet users without them knowing is evil. Well, it might be, at least if you’re not transparent about what you collect and why you want to track your users. Another crucial feature that needs to be provided in order not to be evil is some way for users to opt out (e.g. Do Not Track, DNT). Anyhow, the ethical issues were not our main concern in the thesis, as we were looking at the technical challenges.
Our approach was to collect a set of informative features from browsers and use machine learning methods to distinguish and identify the browsers. As it turns out, tracking Internet users using only properties of their browsers does not seem to be impossible. However, getting Internet users to willingly share such properties with us did seem close to impossible. We created a website to collect data from helpful users, in order to derive methods for identifying fingerprints, but the turnout was not what we expected: from the launch of the site until today, we’ve had 2,062 visits. We did look at other, much larger data sets, but none of them contained as many browser properties as the data collected by our site. Despite the lack of interesting data, we were able to draw some conclusions.
Machine learning seems to be the way to go. Browser features change more or less rapidly, and the identification method must be able to adapt and respond to changed fingerprints. Many machine learning methods are designed to be flexible, and make the identifier adapt to changes automagically.
Users of different devices, OSs and browsers behave significantly differently. For instance, handheld devices are much more prone to change their IP address than computers, since they are much more mobile (by definition) and therefore move between different networks. Version numbers change at different rates too: Internet Explorer 8 was released on 19 March 2009, at which time Google Chrome was at version 1.0. Today, IE is at version 10, whereas Chrome is at version 19.0. This implies that an identification method must react differently when the same feature changes, depending on which combination of device/OS/browser the fingerprint belongs to. That led us to the conclusion that the data should be partitioned, e.g. so that iPhone fingerprints are not compared to IE fingerprints.
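To make the partitioning idea concrete, here is a minimal sketch in Python. It is not the code from our thesis: the field names `device`, `os` and `browser`, and the helper names `partition_key`, `partition` and `candidates`, are assumptions made up for illustration. It only shows the principle that a new fingerprint is compared against stored fingerprints from its own device/OS/browser group and nothing else.

```python
from collections import defaultdict

# Hypothetical field names; a real data set may label these differently.
PARTITION_FIELDS = ("device", "os", "browser")


def partition_key(fingerprint: dict) -> tuple:
    """Return the (device, os, browser) group a fingerprint belongs to."""
    return tuple(fingerprint.get(field, "unknown") for field in PARTITION_FIELDS)


def partition(fingerprints: list) -> dict:
    """Group stored fingerprints by their device/OS/browser combination."""
    groups = defaultdict(list)
    for fingerprint in fingerprints:
        groups[partition_key(fingerprint)].append(fingerprint)
    return dict(groups)


def candidates(new_fingerprint: dict, groups: dict) -> list:
    """Return only the stored fingerprints from the same partition."""
    return groups.get(partition_key(new_fingerprint), [])
```

With this in place, an iPhone fingerprint is only ever matched against other iPhone fingerprints, and each partition could in principle get its own identification model with its own tolerance for changes.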
Another important conclusion is that it is easy to overdo it, i.e. to create a far more complex algorithm than the problem calls for. We used a baseline algorithm that we called static fingerprinting, which is basically a hash function mapping each fingerprint to a unique browser. This is obviously not adaptive at all, but it does actually give good, or at least acceptable, results under certain conditions: when identifying browsers over short periods of time and/or when perfect accuracy is not required (e.g. in some aggregated measures). The benefit of a static fingerprinting method is that it is dead simple to implement and lightning fast. It is perfect for tracking users in real time over single sessions, but not as perfect for associating separate sessions with the same unique browser. In short, don’t create a complex algorithm when there’s no need to do so.
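For the curious, static fingerprinting can be sketched in a few lines of Python. This is an illustration rather than our actual implementation: the choice of SHA-256, the JSON serialisation and the function name `static_fingerprint` are all assumptions, and the feature names in the docstring are made up.

```python
import hashlib
import json


def static_fingerprint(features: dict) -> str:
    """Map a dict of browser features to a deterministic identifier.

    `features` could hold things like {"user_agent": ..., "screen": ...};
    these keys are hypothetical examples, not a fixed schema.
    """
    # Serialise with sorted keys so that the same set of features always
    # produces the same string, regardless of the order they were collected in.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    # Hash the canonical string; identical features give identical IDs.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The weakness is visible right away: if a single feature changes (a browser update, a new plugin), the hash changes completely and the browser looks like a brand new one. That is why this works within a session but not so well across sessions.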