Hash functions take a string of any length and produce a string of fixed-length, called a hash, that is indistinguishable from random noise.

Irreversible

Hash functions are considered cryptographic primitives, but because they are ideally irreversible, they are not a form of encryption. In other words, you shouldn’t be able to take a hash and derive its input as you would decrypt a ciphertext. In technical terms, it should be computationally infeasible to derive an input from its hash.

Avalanche Effect

If you take two inputs that are exactly the same, they will each produce the same hash. But if any character is different between the two inputs, a strong hash function will produce completely different hashes as a result.

Example:

Input: "Hello, World"
Hash: "8663bab6d124806b9727f89bb4ab9db4cbcc3862f6bbf22024dfa7212aa4ab7d"

Input: "Hello, World!"
Hash: "c98c24b677eff44860afea6f493bbaec5bb1c4cbb209c6fc2bbb47f66ff2ad31"

Notice that all we did was add an exclamation mark, but the two hashes are completely different; this is due to what is called an avalanche effect. The iterative nature of a strong hash function results in small changes to the input causing large changes to the output.

This leads to many use cases. Suppose you download a large 2GB file from the internet and you are provided with the hash of the file. You can verify that the file was not corrupted or tampered with by computing and comparing the hash of the downloaded file with the hash that you were provided with.

Or say you are trying to create the digital signature of another large file, it’s much easier to use RSA on the hash of that file rather than trying to use RSA on the entire file. I’ll talk more about this later.

Hash Collisions

In practice, hash functions are never perfect. Because we are often compressing a long, variable-length message into a shorter, fixed-length hash, there is always a chance that the hashes of two different messages compute to exactly the same value. We call this a hash collision and for a strong hash function it should be very hard to find, but not impossible.

For this reason, whenever you hear the term hash security, it’s a reference to how likely (or unlikely) we are to find a collision, not whether a collision exists.

Algorithms

The following are some notable hash function algorithms:

Algorithm Output Length Security Notes
MD5 128 Considered broken; too easy to find collisions. Not bad for download verification.
SHA-1 160 Improvement over MD5, but collisions have been found recently so not really trusted.
SHA-2 224, 256, 384, 512 The current standard; bigger/better version of SHA-1. Most common is SHA-256 and SHA-512. Not broken yet but structure is similar to SHA-1 and SHA-1 has known weaknesses; so concerning.
SHA-3 (Keccak) 224, 256, 384, 512, SHAKE128, 256 Completely different function than SHA-2. Plug-in alternative to SHA-2 in case SHA-2 is broken but no need to use yet.
PBKDF2* Varies Takes a hash function like HMAC or SHA-2 and iterates it.
bcrypt* 184 Based on Blowfish cipher; GPUs struggle to crack it because difficult to parallelize.

* used for password storage

Examples

MD5

$ echo 'Hello, World!' | openssl dgst -md5
bea8252ff4e80f41719ea13cdf007273

SHA-256

$ echo 'Hello, World!' | openssl dgst -sha256
c98c24b677eff44860afea6f493bbaec5bb1c4cbb209c6fc2bbb47f66ff2ad31

SHA-512

$ echo 'Hello, World!' | openssl dgst -sha512
921618bc6d9f8059437c5e0397b13f973ab7c7a7b81f0ca31b70bf448fd800a460b67efda0020088bc97bf7d9da97a9e2ce7b20d46e066462ec44cf60284f9a7

Use-case: MACs

Recall that stream ciphers need a way to protect ciphertext from tampering. Hash functions can do just that.

Suppose Alice encrypts a message and wants to send that message to Bob. She can append her encryption key (K), to the end of the ciphertext (C), hash it, and then append that hash to the end of her ciphertext before sending it.

mac

Since Bob has the same encryption key, Bob can parse out that ciphertext, append his own key to it, hash it, and verify that his hash matches Alice’s hash; thereby ensuring the intregrity of Alice’s ciphertext. Another term for a hash in this context is a Message Authentication Code (MAC).

There is, however, a vulnerability using MACs on their own because of the way the SHA family of hash functions work (except SHA-3)1, which is why in practice we opt to use Keyed-Hash Message Authentication Codes (HMACs) instead.

HMACs work roughly the same way as MACs except that we split the encryption key into two so that we can perform two hash functions, one with each key.

Use-case: Digital Signatures

Another use-case is to combine hash functions and public-key cryptography for digital signatures.

Instead of using an entire document for a digital signature, we can just use the hash of that document. We then sign the hash by encrypting it with our private key, and that creates our digital signature.

Later when someone wants to verify our digital signature, they can perform the same hash on the same document. They then decrypt our digital signature with our public verification key. If the two hashes match then our digital signature has been verified.




  1. See wiki article on “Length extension attack” for details.