claudiucuciurean.com
Projects
Blog
Language EN
What are hashes and why does nobody know your password?
17th October 2021

In Computer Science there are two terms for which the meaning is often confused by non-tech people. “Encryption” and “Hashing". I’ve often heard people being concerned about the privacy of their password: What stops a developer working at X company to just look at my password in the database and use my account?

First of all, lets understand the concept of “authentication”. When you first register on any website, you have to provide a set of credentials thatuniquely identify you. It can be a username or more popularly an email, combined with a password. Emails are usually preferred because if there is an “account verification” policy in place (so a mail is being sent to the provided email address to verify it), its much easier to prevent multiple account creation by the same person. Besides these “public” identifiers, one must provide some sort of private identifier to differentiate them from other users. Traditionally this was in the form of a password, but nowadays there are other methods that can be used as well, such as scanning QR codes (in the case of Whatsapp for example). The purpose of this article is to demystify what happens in the first scenario.

Right, so an email and a password. Basically, when a user registers he states “This is me: the email, and this is what I will use in the future to prove it: the password ”. But this data needs to be stored by the service provider (the website), so the question raised is: How is it possible for the website to store my password but in the same time prevent them/or anybody else (in case they get hacked) from ever knowing it?

Well the answer is: hashes. Any self-respecting website that allows account creation will be using some sort of hashing for storing the password. Hashes are in a way “fingerprints” of the initial information. They are able (although not perfectly) to uniquely identify data (the password), while in the same time be irreversible, meaning that you cannot obtain the same data back just by using the hash. Let’s take an example:

The hash (in SHA512) for

“supersecurepassword”

is

“a321c772fad01dcf739b775e724f64e0e6ae4fbd94dfbb1b5b670ef5c9a65e5204be45227e3d41fd3d8a810967a3925edfdef4d7bfea17118a70ba4be73e1698”

The second string is what a database would hold for a user with that password, if the hashing method were SHA512. There are multiple methods of hashing, some more secure than others. We'll talk about this a bit later, but for now, we must understand that generally, a hash tries to achieve the following things:

  1. It should be able to provide uniqueness to a large variety of inputs (otherwise you could login with a different password than the one provided)
  2. It has to be fast enough so the authentication process doesn’t take too long
  3. It shouldn’t contain traces of the initial information (in our case, “supersecurepassword”, if we reverse engineer the hash, we shouldn’t be able to tell if the password contained a “s” or a “d” or anything else)
  4. It shouldn’t look similar for similar inputs
  5. It should have the same length for different input length sizes.

Apart from the first criterion, hashing systems generally respect these principles. However, the first one is impossible to achieve. And we can prove that by doing some logical math. Let’s say the length of a given hash (aka how many digits and letters it has) is x. How many different hashes exist in this case?

Well, considering the characters used are all the letters (26 in the English alphabet) and all the digits (10 arabic numerals), we have 36 possible characters. The number of possible unique hashes is 36x (any character from the x can be any character from the 36 possible ones => we multiply 36 a x amount of times). Now, lets assume that all the users will be using passwords with the length of x+1. This would mean that the number of possible unique passwords will be 36x+1 which is bigger than 36x. Hence, invariable of the hashing method used, there will always be collisions (the term used for when two different inputs produce the same hash), as we cannot map a bigger number of possible inputs to the possible number of hashes. The important thing is that it never happens with real-life passwords, and given that most of the hashes are at least 32 characters long (and those are the less secure ones), we can rest assured that they exceed the average password, meaning that collisions should be highly unlikely. It's also worth noting that real-world hashes aren't using all the letters.

How is the “irreversible” part being achieved though? Although the mechanics behind the hashing algorithms are more complex than this, I’ll try to explain in simpler analogical terms.

Let’s say you have a number : 10. This number will go through a number of preset operations until a new number will be produced. These preset operations are the same for any number provided, but the result should ( for a reasonably good hash ) be as close to unique as possible (we proved earlier that its not truly possible).

So for education purposes, let’s say the preset operations are the following:

If we do this with our number, 10, we get the following:

10 * 23 = 230 230 / 11 = 20 , and remainder 10

In computer science language we write % (pronounced “modulo”) when we want to take the remainder of a division:

230 % 11 = 10

Finally:

(10 – 5) * 17 = 85

So for our number 10, the result of the pseudo-hashing algorithm would be 85. If we quickly do the same for the consecutive number, 11 we would get -85, while for 9 we would get 68. Although at first sight they might look pretty different, a pattern would easily emerge if one would do this for more numbers. On top of this, if we would try this method on the number 43, we would be surprised to see that the result would also be 85. For this pseudo-hash, the input 10 would collide with 43, as they would produce the same result.

This was just a very simplified version for explaining what the hash algorithms are actually doing. In our example, by using the operation of modulo %, we were able to delete the trace of the initial data, because multiple numbers can have a remainder of 10, if they are divided by 11, so there is no way of telling which number we started with. This is the key for understanding how hashes work, and why its not mathematically possible to obtain the input solely by using the hash. The actual real-world hashes are mainly using the operation of bitwise right shift ( >> ) for losing the trace to the initial data. Just imagine any input as being a number. The “Bitwise” part means the operation is being done on the binary form of the number.

If we would be to represent 10 as a binary number we we would get 1010:

10 = 21 + 23

1 0 1 0 // 10 in binary 3 2 1 0 // the powers of 2 for each digit

Reading 1010 from left to right, we could say:10 is equal to the sum of one 23, zero 22, one 21 and zero 20

11 represented as binary is

1011

because its equal to 10 (so 1010) + 20, hence why we need a 1 on the 0 position.

When we “bitwise right shift” a number, we push the binary number to the right, past an imaginary vertical line. All the digits that pass the line are “deleted”, and the remaining number is what we are left with. If we were to bitwise right shift 10 by 2 in our case:

1010 >> 2 = 10 | 10

which is 2 in binary. For 11, the same operation:

1011 >> 2 = 10 | 11

which is also 2 in binary.

So doing the same operation on 10 or 11, we get the same result, which is great. It means that we managed to successfully “lose our track”, and even if the algorithm would be public (and they are) we wouldn't be able to reverse engineer and obtain 10 using 2, as we wouldn’t know what digits were deleted. This type of operation (of deleting the trace) is done intermittently with operations of adding trash data to the input and altering it until the point in which no data from the initial input is left. The final result will be a string of gibberish that doesn’t hold any useful information, and just acts as a fingerprint for the input. But if the result will be a number, why aren’t just digits in the final hash?

The digits are being divided into blocks of hexadecimal numbers. Hexadecimal numbers can contain letters from A up to F. Have you noticed that the hash for the “supersecretpassword” only contained letters up to F? In reality it’s just a huge number.

Now let’s talk about the different types of hashes and why they are important. There are a variety of hashes to choose from, when you are a developer, most known being: MD5, SHA1, SHA256, SHA512. In a future article I’ll explain them in more depth. They all have upsides and downsides. The hashes vary in length and also in the algorithm. The algorithm part isn’t that crucial, what’s important is the length part. If we produce the MD5 hash and the SHA512 hash for the same password, “supersecretpassword”, we would get:

// MD5 hash bbb2c5e63d2ef893106fdd0d797aa97a // SHA512 hash 6fcb5c284590f67af12334cf27f94a6dc5fb2f27627b9ba8dc20c210df3edd7a596cd3c9961a5c36bfd8e57a9ae15e6859559f8e11c3059704859cabb59d8340

If you were to count, you would see that the first one has 32 characters, while the second one has 128 characters. The name SHA512 is suggestive for how many “bits” (a bit is a 1 or a 0) the binary number contains. 512 bits can be transformed into a hexadecimal number with 128 characters => the reason why the hash has 128 characters. Any 4 bits can be represented by a single digit hexadecimal number. As a general rule of thumb, from a developer’s perspective, the longer the hash:

What do you mean secure? I thought you can’t figure out my password just by using the hash?. Well, I stated that you can’t figure out the password just by using solely the hash. If however you have a “rainbow table”, and the hash of the password, it is possible. A rainbow table represents a collection of hashes. If one would dispose of sufficient computing power and storage space, they could make a list of hashes for all the possible combinations of characters up to a certain length. That’s why websites usually force the users to have password with a minimum size. Some time ago, MD5 was widely used, but as technology evolved and computers became more performant, “rainbow tables” could cover lengthier passwords. Also, a very important rule when creating your password is to diversify the character pool. You should have numbers and special characters as well, not just letters.

Using https://www.md5hashgenerator.com/ we can get the SHA1 hash for “supersecretpassword” and also the same hash for “supers3cretpassword”. Although they have the same number of characters, notice, that the second password contains a number instead of a letter. The two SHA1 hashes are:

// supersecretpassword 18960546905b75c869e7de63961dc185f9a0a7c9 // supers3cretpassword 897c382c820dc5f1d75468c63b5ed7946587e55c

Now, using these hashes, if we go to https://crackstation.net/ we can see that it easily figures the password for the first hash, while the second hash is not yet known (at the moment of writing this. It’s probable in the future it will be mapped as well).

In conclusion, the only thing you can do as a user to keep your password safe is to come up with a reasonably long password, while in the same time use digits and special characters besides letters. This way, even if the hash of your password gets leaked somehow, you should be safe, as “rainbow tables” won't figure out your password.