Breached password checker, Part 1 - Anonymize data with k-anonymity.

Elia Bar
3 min readJul 4, 2021

--

Recently I became more and more interested in password security. I came across a big data set containing a massive amount of breached passwords from known websites.

It gave me an idea for a website that will let you explore some frequently used passwords and test your password’s strength, so I came up with my website passwordCheck.app

The idea was to figure out how to create the best data structure for that purpose with the constraints of high security, low budget, and low latency.

Passwords are a sensitive thing. So I wanted to figure out how to query a password in my DB without giving it away.

I didn’t want to pass the password to the server like so:

Not something you want to do!

I did a little research and ran into this method called k-anonymity. It would let you look up a password from my breached passwords DB without using the original password.
It would let users use the platform freely without the concern of malicious use of their passwords.

So what exactly is this k-anonymity?

k-anonymity is a method to anonymize data.
It ensures subjects of the data cannot be re-identified while the data remain useful.
In our case, we are looking at a more specific method that was largely developed by Junade Ali called Hash-based k-anonymity.

This approach works by taking a cryptographic hash of one-dimensional data and truncating the hash such that there are at least k − 1 hash collisions.

Let’s say that Bob has a password ‘mypass’ and he wants to check if it exists on our leaked passwords DB.

First, we want to convert Bob’s password to a cryptographic hash (SHA-1); In our case, ‘mypass’ turns to:
‘e727d1464ae12436e899a726da5b2f11d8381b26’.

While hashes are uni-directional and can’t translate back to the original string, I can still brute force a bunch of words until I get the same hash and infer the actual password.

Now, we will truncate the hash to have five first characters, for example.
‘e727d1464ae12436e899a726da5b2f11d8381b26’ => ‘e727d’

When the user sends us the hash’s prefix, we will return all stored hashes with the same prefix. It should look like this:

The returned result from the server

After receiving this response from the server, we can compare and check if our password exists in the hashes list and see how many times it was exposed without sending it directly to the server.
In our example ‘e727d1464ae12436e899a726da5b2f11d8381b26 ’(mypass) was exposed 631 times.

Now, how secure is this? It depends on the amount of data I possess in my DB and how many hashes are returned with each query.
For any randomly generated hash prefix in my DB, we will get a response that contains roughly 70 different hashes (collisions) for the same prefix.
So for me to guess, the original password is a ~1/70 chance.
It’s important to note that every hash in that file represents an entirely different password from the other passwords in that file, even if they are starting with the same prefix.

The next part will discuss the data structure, storage, and preparation of the data.

You can visit passwordCheck.app and explore your password strength and if it was previously breached.
Thank you for reading!

--

--