I was watching Dave Borghuis' fine talk: How I made the municipality pay a 600.000 euro fine for invading your privacy.
But before I start with the topic of this post, let me say that Dave is more than a "Privacy activist / worried & critical citizen", he is in fact the shining light in the Dutch "hacker"-scene, a true role-model. He takes responsibility, protects privacy and approaches not only technology but also authority with mistrust. Latter being an important point in traditional hacker ethics, which I believe we need a lot more (besides people like Dave) in the Netherlands.
But let us dive into the topic of this post. The Dutch city of Enschede has been fined 600 thousand euros by the Dutch data-protection agency for tracking citizens with no proportional purpose or legitimate interest for this data collection/processing. They use the CityTraffic method developed by the RMC consulting company.
The CityTraffic method collects WiFi MAC addresses in beacons hashes them with sha256 and truncates it first to the first 12 hex digits of the hash, and after 24 hours drops 3 more hex digits from the end of the hash, keeping only 9 digits.
A traditional MAC address has two parts:
- the Organizationally Unique Identifier (OUI): 24-bit number in networking equipment that uniquely identifies its seller/producer.
- the remaining 24 bits are assigned by the vendor identified by the OUI to uniquely identify the device.
If you download the list of OUIs from the many places on the internetz, you will find that there is currently about 46 thousand entries in that list (46302 in my copy). This is about ~15.49 (log_2(46302))bit of information, combined with the 24 bits of unique device id that brings us to maximum ~39.5 bits of entropy.
At the end of the talk I used the chance to ask Dave if he considered that these truncated hashes can indeed be reversed by using a pre-computed rainbow table? I even postulated that it might be possible to "reverse" the hashes with GPU/FPGA assisted password crackers like hashcat or john the ripper. Today I tried it out on an Nvidia Jetson AGX Xavier, a small development board that comes with 512 NVIDIA CUDA cores for machine learning - or password cracking. This is a low-power board that consumes maybe up to 30W or so, nothing really powerful.
Session..........: hashcat Status...........: Cracked Hash.Mode........: 1400 (SHA2-256) Hash.Target......: 08..47 Time.Started.....: Wed Jul 27 12:51:16 2022 (12 mins, 51 secs) Time.Estimated...: Wed Jul 27 13:04:07 2022 (0 secs) Kernel.Feature...: Optimized Kernel Guess.Base.......: File (/home/user/OUIs), Left Side Guess.Mod........: Mask (:?1?1:?1?1:?1?1) [9], Right Side Guess.Charset....: -1 ?dABCDEF, -2 Undefined, -3 Undefined, -4 Undefined Speed.#1.........: 581.2 MH/s (7.02ms) @ Accel:8 Loops:128 Thr:512 Vec:1 Recovered........: 1/1 (100.00%) Digests Progress.........: 448010387456/776818655232 (57.67%) Rejected.........: 0/448010387456 (0.00%) Restore.Point....: 0/46302 (0.00%) Restore.Sub.#1...: Salt:0 Amplifier:13672064-13672192 Iteration:0-128 Candidate.Engine.: Device Generator Candidates.#1....: 00:00:00:D4:B4:1C -> 74:88:8B:E9:B4:1C Started: Wed Jul 27 12:51:13 2022 Stopped: Wed Jul 27 13:04:08 2022
It took it 12 minutes to recover the MAC address I hashed using sha256, hashcat estimated a total of 20 minutes to test the whole 39 bits of keyspace on this setup. It turns out no rainbow-tables are necessary, anyone can revert the hashes with a few hundred bucks and negligible amount of electric power.
Running hashcat on a full database would take about 20 minutes, it would just continuously spew out recovered MACs as it iterates through the keyspace and finds matching hashes in the DB. This would make the attack much cheaper per MAC. It must be noted that this works on full hashes, not truncated hashes, for truncated hashes hashcat needs to be extended for this new mode, but that should be a simple exercise for the interested party.
However we need to consider that there might be collisions in the database. Lets examine this.
Collisions are when two different input values both compute to the same output result, Normally with sha256 this is computationally impossible, but when you truncate the hash value, this gets easier and easier. Collisions mean that the ids stored by the city/operator of this system would not able to uniquely identify a user, and thus provide (some negligible) anonymity.
Comparing the ~39.5 bits of input entropy with the 12*4 (= 48) bits of output entropy for the hashes stored for 24 hours means that every MAC has with high probability a unique 48 bit truncated 24 hour hash. However older than 24 hour hashes have only 36 bits, which means that each 36 bit hash maps back to ~ 11.3 MAC addresses. This sounds nice, but there is a few concerns:
- most valid MAC addresses have a OUI that are clearly not phones.
- Even if it is ambiguous which MAC address a hash belongs to, the context like location, time, and possibly other metadata collected, will still uniquely identify a person and by doing so can also recover the MAC address if that is necessary.
But we should also consider the Birthday Paradox, or something
similar. Based on the above, the chance that two random MACs hash to
the same 36 bit hash is about ((11.3 - 1) / 2^39.5). For n random
MAC addresses the probability that they all hash to different hashes
can be seen in the expression to the right.
To have a 50% chance to have one collision we need 232 129 random MAC addresses. If we have only 90 000 MAC addresses there is still a ~ 10% chance for a collision and with 27 000 MACS we still get ~ 1% of a chance for collisions, in the other direction we need 560 000 MACs to have a ~ 99% chance of a collision.
Looking at the probabilities in a vacuum looks like there is a chance for some collisions and those lucky will have some anonymity, although the anonymity set is extremely small. However these hashes don't exist in a vacuum, they come with extra metadata which makes re-identification possible despite collisions.
Summing up all of the above we can conclude that this system essentially replaces a public identifier - the MAC - with a secret and still mostly unique identifier. To quote the GDPR Article 4 point 1:
‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number ...
So hashing does not change anything from the perspective of the GDPR when it comes to the MAC.
This is whole CityTraffic debacle is just yet another example of (Paul) Ohm's law:
No useful database can ever be perfectly anonymous, and as the utility of data increases, the privacy decreases.
By the way some friends (who are more into phones than me) told me that modern smartphones actually randomize their MAC addresses quite rigorously - I know that for Bluetooth they actually do, but I didn't know that they do for WiFi. MAC randomization exists as default by opt-out since 10 years from IPhone 5 upwards, and Android introduced MAC randomization in 2019. So maybe this problem will quickly solve itself, and the surveillance industrial-complex will have to resort to other - even creepier - methods like cameras and facial recognition to track us in cities.