Hashing is a cryptographic process that can be used to validate the authenticity and integrity of various types of input. It is widely used in authentication systems to avoid storing plaintext passwords in databases, but is also used to validate files, documents and other types of data. Incorrect use of hashing functions can lead to serious data breaches, but not using hashing to secure sensitive data in the first place is even worse.
Hashing versus encryption
Hashing is a one-way cryptographic function while encryption is designed to work both ways. Encryption algorithms take input and a secret key and generate a random looking output called a ciphertext. This operation is reversible. Anyone who knows or obtains the secret key can decrypt the ciphertext and read the original input.
Hashing functions are not reversible. The output of a hashing function is a fixed-length string of characters called a hash value, digest or simply a hash. These are not necessarily intended to be kept secret because they cannot be converted back into their original values.
However, one important property of a hashing function is that it is deterministic: the same input must always result in the same hash value. If two different inputs produce the same hash value, it is called a collision and, depending on how computationally easy it is to find such a collision, the hash function can be considered broken from a security point of view.
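The properties described above can be seen in a few lines of Python. This is a minimal sketch that uses SHA-256 from the standard library purely as an illustration (the helper name sha256_hex is made up for this example); it shows that the digest has a fixed length regardless of input size and that the same input always yields the same value.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Fixed-length output: one byte and ten thousand bytes both give 64 hex characters.
print(len(sha256_hex(b"a")))
print(len(sha256_hex(b"a" * 10_000)))

# Deterministic: the same input always produces the same hash value.
print(sha256_hex(b"password") == sha256_hex(b"password"))

# A tiny change in the input produces a different hash value.
print(sha256_hex(b"password") == sha256_hex(b"Password"))
```

A fast general-purpose function like SHA-256 is used here only to show these properties; it is not a suitable choice for storing passwords on its own.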
Hashing is almost always preferable to encryption when storing passwords inside databases because in the event of a compromise attackers won't get access to the plaintext passwords and there's no reason for the website to ever know the user's plaintext password.
If you've ever received those notices that "our representatives will never ask for your password" from various companies, that's part of the reason why they won't: They have no use for it because they don't have your password. They have a non-reversible cryptographic representation of your password—its hash value.
That said, companies that suffer security breaches often misuse the term “encryption” in their public disclosures and advise customers that their passwords are secure because they were encrypted. This is probably because the general audience is not very familiar with the meaning of hashing, so their PR departments want to avoid confusion.
This makes it hard for outside observers to assess the risks associated with a breach, however. If the passwords were truly encrypted, the risk is higher than if they were hashed, and the next question should be: Was the encryption key also compromised? Cases of encryption being used instead of hashing for passwords do happen.
In 2013, Adobe suffered a security breach that resulted in information from millions of accounts being stolen, including encrypted passwords. Adobe had updated most of its systems to use hashing, but the breached server was a backup system that the company planned to decommission, and it stored passwords encrypted with the Triple DES cipher in ECB mode.
While the attackers didn't obtain the decryption key, the use of this cipher in ECB mode is known to leak information because identical plaintext blocks always encrypt to identical ciphertext blocks. This allowed brute-force attacks to recover a significant number of passwords.
"Encryption should only be used in edge cases where it is necessary to be able to obtain the original password," the Open Web Application Security Project (OWASP) said in its recommendations for password storage.
"Some examples of where this might be necessary are: If the application needs to use the password to authenticate against an external legacy system that doesn't support SSO [or] if it is necessary to retrieve individual characters from the password. The ability to decrypt passwords represents a serious security risk, so it should be fully risk assessed. Where possible, an alternative architecture should be used to avoid the need to store passwords in an encrypted form."
How hashing is used in authentication
In authentication systems, when users create a new account and input their chosen password, the application code passes that password through a hashing function and stores the result in the database. When the user wants to authenticate later, the process is repeated and the result is compared to the value from the database. If it's a match, the user provided the right password.
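The registration and login flow described above might look roughly like the following. This is a sketch under stated assumptions rather than a reference implementation: it uses PBKDF2 from Python's standard library, an in-memory dictionary named users stands in for the database, the iteration count is illustrative, and the function names are hypothetical. It also generates a random per-user salt, a practice covered in the section on salt and pepper.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor, not a recommendation

# Hypothetical stand-in for a users table: username -> (salt, password hash)
users: dict[str, tuple[bytes, bytes]] = {}


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a password hash with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def register(username: str, password: str) -> None:
    """Store only the salt and the hash, never the plaintext password."""
    salt = os.urandom(16)
    users[username] = (salt, hash_password(password, salt))


def authenticate(username: str, password: str) -> bool:
    """Repeat the hashing step and compare the result with the stored value."""
    record = users.get(username)
    if record is None:
        return False
    salt, stored_hash = record
    candidate = hash_password(password, salt)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(stored_hash, candidate)


register("alice", "correct horse battery staple")
print(authenticate("alice", "correct horse battery staple"))  # True
print(authenticate("alice", "wrong guess"))                   # False
```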
If the user forgets their password, the password recovery process involves validating their identity—usually by proving ownership of the email that was used to create an account by clicking on a unique password reset link sent via email—and then allowing the user to set a new password and therefore a new password hash in the database.
If the password recovery process results in their old password being sent to the user via email or being displayed to them in the browser, then the implementation is insecure and best security practices were not followed.
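A reset flow of the kind described above never needs the old password. The sketch below is one possible shape for it, with several assumptions: the dictionary reset_requests stands in for a database table, the one-hour lifetime is arbitrary, and the function names are invented for illustration. The server keeps only a hash of the reset token, so a leaked table does not reveal usable reset links.

```python
import hashlib
import hmac
import secrets
import time

TOKEN_TTL_SECONDS = 3600  # assumed lifetime of a reset link

# Hypothetical store: email -> (hash of reset token, expiry timestamp)
reset_requests: dict[str, tuple[bytes, float]] = {}


def _token_hash(token: str) -> bytes:
    return hashlib.sha256(token.encode("utf-8")).digest()


def issue_reset_token(email: str) -> str:
    """Create a random token; the caller emails it to the user inside a link."""
    token = secrets.token_urlsafe(32)
    reset_requests[email] = (_token_hash(token), time.time() + TOKEN_TTL_SECONDS)
    return token


def redeem_reset_token(email: str, token: str) -> bool:
    """Return True once for a valid, unexpired token, then invalidate it."""
    entry = reset_requests.get(email)
    if entry is None:
        return False
    stored_hash, expires_at = entry
    if time.time() > expires_at:
        del reset_requests[email]
        return False
    if not hmac.compare_digest(stored_hash, _token_hash(token)):
        return False
    del reset_requests[email]  # single use
    return True
```

After a successful redemption the application would let the user choose a new password and store its hash, as in the registration step.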
That said, even if hashing is used, developers can make implementation errors, for example by using a hashing function that is known to be insecure and is vulnerable to brute-force cracking attacks. Examples of such hashing schemes that used to be very popular but have been deprecated are MD5 and SHA-1.
Developed in 1991, MD5 was the de facto hashing function for a long time, even after cryptanalysts showed that it is theoretically insecure.
Unfortunately, MD5 is still widely used today in old applications or by developers who don't understand security. The first partial collision attack was theorised in 1996 and a full collision was demonstrated in 2004. Today, MD5 collisions can be found within seconds on a regular home computer and the algorithm is extremely vulnerable to brute-force attacks.
SHA-1 (Secure Hash Algorithm 1) was designed by the NSA in 1995 and was a recommended NIST standard. The function has been known to be insecure against well-funded attackers with access to cloud computing power since 2005.
In 2017, security researchers from Centrum Wiskunde & Informatica (CWI) in the Netherlands, Nanyang Technological University (NTU) in Singapore and Inria in France working with Google demonstrated a practical collision against SHA-1 by producing two different PDF files with the same SHA-1 hash.
SHA-1 has been deprecated for TLS certificates and other uses, but it's still widely used in older devices and systems for a variety of purposes, including validating file signatures in code repositories, software updates and more.
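For file validation of the kind mentioned above, a function without known practical collisions is the safer choice. The following sketch swaps in SHA-256, which the text does not name, and uses hypothetical helper names; it computes a file's digest in chunks and compares it with an expected value published by the file's author.

```python
import hashlib
import hmac
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with the published one."""
    return hmac.compare_digest(file_sha256(path), expected_hex.strip().lower())
```

A matching digest shows the file was not altered in transit, but only if the expected value itself came from a trusted source.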
For password hashing and storage, a recent IETF draft recommends using Argon2 (the winner of the 2015 Password Hashing Competition), bcrypt, scrypt or PBKDF2. However, there is more to hashing than just the algorithm used. For example, a minimum password length of eight characters is also important because it makes brute-force attacks, including dictionary attacks that rely on lists of common passwords from other data breaches, much harder.
Each hash function can also be implemented so that multiple iterations, or passes, of the hashing algorithm are performed for each password. This is also known as the work factor and its goal is to make the result more computationally intensive to crack using brute-force methods. While a higher work factor increases security, it also makes each hashing operation slower because the algorithm is executed multiple times.
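The trade-off can be measured directly. The sketch below, which assumes PBKDF2 from Python's standard library and an arbitrary target of a quarter of a second per hash, doubles the iteration count until a single hash takes at least that long on the machine running it. The function names and default values are illustrative only.

```python
import hashlib
import os
import time


def time_pbkdf2(iterations: int) -> float:
    """Return the seconds taken by one PBKDF2-HMAC-SHA256 computation."""
    salt = os.urandom(16)
    started = time.perf_counter()
    hashlib.pbkdf2_hmac("sha256", b"benchmark-password", salt, iterations)
    return time.perf_counter() - started


def pick_iterations(target_seconds: float = 0.25,
                    initial: int = 10_000,
                    cap: int = 10_000_000) -> int:
    """Double the iteration count until one hash reaches the target time or the cap."""
    iterations = initial
    while iterations < cap and time_pbkdf2(iterations) < target_seconds:
        iterations = min(iterations * 2, cap)
    return iterations


if __name__ == "__main__":
    chosen = pick_iterations()
    print(f"{chosen} iterations take about {time_pbkdf2(chosen):.3f} s on this machine")
```

A single timing run is noisy, so a real benchmark would average several runs on the production hardware under realistic load.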
"There is no golden rule for the ideal work factor—it will depend on the performance of the server and the number of users on the application," OWASP said in its recommendations. "Determining the optimal work factor will require experimentation on the specific server(s) used by the application. As a general rule, calculating a hash should take less than one second, although on higher traffic sites it should be significantly less than this."
Salt and pepper
Another best practice for secure password storage is to combine each password with a randomly generated string of characters called a "salt" and then to hash the result. The salt, which should be unique for every user and password, is then stored along with the hash.
Salting passwords makes certain types of attack much harder or impossible to execute. For example, attackers can pre-compute hashes for a very large number of password combinations and then store them in a database known as a rainbow table.
Later, when they find a leaked password hash, they can just perform a lookup in the database to see if it matches any of the pre-computed hashes. Since salting a password changes the resulting hash, the attacker would have to repeat the pre-computation for every salt, which makes such attacks impractical.
Salting also prevents attackers from discovering duplicate passwords in a database. Even if two or more users choose the same password, the server generates different salts for them and the resulting hashes will be different. The recommendation is for salts to be at least 16 characters long, which significantly increases the complexity and length of the plaintext strings that need to be cracked using computationally intensive brute-force methods.
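Both effects of salting can be shown with a toy example. In this sketch the list of common passwords, the helper names and the iteration count are all made up for illustration, and the pre-computed table is a plain dictionary lookup rather than a true rainbow table, which uses hash chains to save storage space.

```python
import hashlib
import os

COMMON_PASSWORDS = ["123456", "password", "qwerty", "letmein"]


def unsalted_hash(password: str) -> str:
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def salted_hash(password: str, salt: bytes) -> str:
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000).hex()


# Computed once by the attacker and reusable against every unsalted leak.
lookup_table = {unsalted_hash(p): p for p in COMMON_PASSWORDS}

# An unsalted hash of a common password is recovered with a single lookup.
leaked = unsalted_hash("letmein")
print(lookup_table.get(leaked))  # letmein

# Two users pick the same password but receive different random salts.
salt_a, salt_b = os.urandom(16), os.urandom(16)
hash_a = salted_hash("letmein", salt_a)
hash_b = salted_hash("letmein", salt_b)

print(hash_a == hash_b)          # the duplicate password is no longer visible
print(lookup_table.get(hash_a))  # the pre-computed table no longer helps
```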
To add another layer of security, in addition to salts, developers can also combine all passwords with a randomly generated string of at least 32 characters called a pepper. Unlike a salt, which is unique for every password, the pepper is the same for all passwords but should not be stored inside the database. The goal of the pepper is to make it hard for attackers to crack hashes even when they obtain the full database of the application, including the salts.
The pepper can be stored in an application configuration file that is protected with appropriate file system permissions or in a more secure location like a hardware security module (HSM).
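One common way to apply a pepper is to run the password through a keyed hash (HMAC) with the pepper as the key before passing it to the password hashing function. The sketch below takes that approach as an assumption; the environment variable name APP_PASSWORD_PEPPER, the function names and the iteration count are hypothetical, and an environment variable is used here only as a simple stand-in for a protected configuration file or an HSM.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000                    # illustrative work factor
PEPPER_ENV_VAR = "APP_PASSWORD_PEPPER"  # hypothetical name, hex-encoded secret


def load_pepper() -> bytes:
    """Read the pepper from outside the database (here, the environment)."""
    value = os.environ.get(PEPPER_ENV_VAR)
    if not value:
        raise RuntimeError(f"{PEPPER_ENV_VAR} is not set")
    return bytes.fromhex(value)


def hash_with_pepper(password: str, salt: bytes, pepper: bytes) -> bytes:
    """Mix in the shared pepper with HMAC, then apply the salted password hash."""
    peppered = hmac.new(pepper, password.encode("utf-8"), hashlib.sha256).digest()
    return hashlib.pbkdf2_hmac("sha256", peppered, salt, ITERATIONS)
```

With this design an attacker who steals only the database has the salts and hashes but still lacks the key needed to test password guesses.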
"An alternative approach is to hash the passwords as usual and then encrypt the hashes with a symmetrical encryption key before storing them in the database, with the key acting as the pepper," OWASP said. "This avoids some of the issues with the traditional approach to peppering, and it allows for much easier rotation of the pepper if it is believed to be compromised."
Applications that use an insecure or weak hashing algorithm should be migrated to modern hashing functions. One way to do this could be to use the old hashes as the input for the new hashing algorithm, essentially re-hashing the old hashes. However, while this solves the immediate problem, it makes the resulting hashes more vulnerable to cracking than if they were generated directly from the original user input.
Because of this, it's recommended that hashes are regenerated with the new modern algorithm the next time users log in and input their passwords. If a user is not active and doesn't log in for a certain amount of time, their old hash can be invalidated and they can be required to set a new password the next time they log in.
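Upgrading hashes at login could be sketched as follows. The example assumes the legacy scheme was unsalted MD5 and the new one is salted PBKDF2; the Record class, its scheme field and the function names are hypothetical and stand in for whatever the application's user table actually stores.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass

ITERATIONS = 200_000  # illustrative work factor


@dataclass
class Record:
    scheme: str    # "md5" for legacy rows, "pbkdf2" for upgraded rows
    salt: bytes    # empty for legacy rows
    digest: bytes


def _pbkdf2(password: bytes, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password, salt, ITERATIONS)


def login(record: Record, password: str) -> bool:
    """Verify the password and upgrade a legacy hash on success."""
    raw = password.encode("utf-8")
    if record.scheme == "md5":
        if not hmac.compare_digest(record.digest, hashlib.md5(raw).digest()):
            return False
        # The plaintext is available only at this moment, so re-hash it now.
        record.salt = os.urandom(16)
        record.digest = _pbkdf2(raw, record.salt)
        record.scheme = "pbkdf2"
        return True
    return hmac.compare_digest(record.digest, _pbkdf2(raw, record.salt))
```

In a real application the updated record would be written back to the database inside the same login request.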
Finally, the golden rule for all developers when dealing with cryptography: Don't design your own custom algorithms. Cryptography is very hard and the algorithms that are standardised and widely used are usually the result of academic research efforts that are subject to peer review from other cryptographers and cryptanalysts.