By Jon Berryhill
If you’ve encountered a matter involving computer evidence, you may have heard the term “hash value” and wondered what in the world a hash value is. A hash tag “#” (otherwise known as the pound symbol or, originally, an octothorpe), brought to you by Twitter in 2007, is not what this post is about. A hash value and a hash tag are two completely different things. Let’s take a quick dive into this somewhat esoteric term for a critical tool.
A hash value is a common feature used in forensic analysis as well as the cryptographic world. The best definition I’ve seen is that a hash is a function that can be used to map data of an arbitrary size onto data of a fixed size. The word “function” is used here in its strict mathematical sense: the hash value is the result of applying the function to the data. Standard hash algorithms are sets of complex but public mathematical steps. There is nothing secret about them.
Some people equate a hash value to a fingerprint. It provides a way of identifying and verifying a chunk of digital data. You can have a hash value for a single file, groups of files, or even an entire hard drive. A hash value is a harmless-looking string of hexadecimal characters, generally 32 to 64 characters long, depending on the hash algorithm used. There is absolutely nothing in a hash value that will tell you anything about what was hashed or how big it was. Because of the way the algorithms work, the length of the hash value is always the same no matter how much data was processed.
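To make the fixed-length point concrete, here is a minimal sketch using Python's standard hashlib module. The two inputs (a single byte and roughly 10 MB of repeated bytes) are made up purely for illustration; they are not the files discussed in this post.

```python
import hashlib

# Two illustrative inputs of very different sizes (assumed data, not real evidence).
small = b"a"
large = b"a" * 10_000_000  # roughly 10 MB

for label, data in (("small", small), ("large", large)):
    md5_hex = hashlib.md5(data).hexdigest()
    sha256_hex = hashlib.sha256(data).hexdigest()
    print(f"{label}: {len(data)} bytes of input")
    print(f"  MD5     {md5_hex} ({len(md5_hex)} characters)")
    print(f"  SHA-256 {sha256_hex} ({len(sha256_hex)} characters)")
```

Whatever the input size, the MD5 result is 32 hexadecimal characters and the SHA-256 result is 64, and neither string gives any hint of how much data went in.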
So what do they look like?
76f5af6dc1a97facc1f830d7a66cfd35 C:\TEMP\file-144727171111L001 (1).pdf
76f5af6dc1a97facc1f830d7a66cfd35 C:\TEMP\file-144727171111L001 (2).pdf
Above are the computed hash values for two files. Note that the files have different names but the hash values match, because the content of the files is exactly the same. In this case the hash values were computed with a standard algorithm called MD5 (the “MD” is short for Message Digest, the “5” is a version number).
The same files can be processed with the SHA256 algorithm and the results look like this.
95df48581de075511e44aceb2417a0cc125c593dfbc904fcb9ceaa3fefbd30c5 C:\temp\file-144727171111L001 (1).pdf
95df48581de075511e44aceb2417a0cc125c593dfbc904fcb9ceaa3fefbd30c5 C:\temp\file-144727171111L001 (2).pdf
The hash value has nothing to do with the name of a file, and different hash algorithms produce different hash values even when processing the same files. A hash value by itself is of little use unless you also know which hash algorithm was used to create it.
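Listings like the ones above can be produced with a few lines of code. The following is a rough sketch rather than the tool actually used for this post: the function name hash_file, the file names, and the file content are all hypothetical, and the files are created in a temporary folder just for the demonstration.

```python
import hashlib
import os
import tempfile


def hash_file(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Create two files with different names but identical content (made-up data).
with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "report (1).pdf")
    second = os.path.join(folder, "report (2).pdf")
    for path in (first, second):
        with open(path, "wb") as handle:
            handle.write(b"identical example content")

    # The name plays no part in the result; the algorithm does.
    for algorithm in ("md5", "sha256"):
        for path in (first, second):
            print(algorithm, hash_file(path, algorithm), os.path.basename(path))
```

Within each algorithm the two files print the same value, while the MD5 and SHA-256 values for the same file differ from each other, which is why a hash should always be recorded together with the name of the algorithm.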
How are hash values used?
In the forensic analysis community, if I provide a copy of a forensic image file set to another examiner, I also provide the hash value associated with it. The other examiner can compute the hash value for what they received and compare that to the provided hash value. If they match, we know that we are both looking at exactly the same thing. If the hash values don’t match, we know that something is different. The hash value provides no clues as to what is different.
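The exchange described above boils down to a simple comparison. Here is a small sketch of it in Python; the function name matches_expected and the byte strings standing in for an image file set are assumptions for illustration, and SHA-256 is chosen arbitrarily as the agreed algorithm.

```python
import hashlib


def matches_expected(data: bytes, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Hash the received data and compare it with the hash the sender provided."""
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == expected_hex.strip().lower()


# Stand-in for the data one examiner sends (hypothetical content).
original = b"forensic image contents (stand-in)"
provided_hash = hashlib.sha256(original).hexdigest()

# Stand-in for a copy that was changed in transit: only the last byte differs.
altered = original[:-1] + b"!"

print("intact copy matches: ", matches_expected(original, provided_hash))
print("altered copy matches:", matches_expected(altered, provided_hash))
print("provided hash:", provided_hash)
print("altered hash: ", hashlib.sha256(altered).hexdigest())
```

The two printed hashes differ, but nothing in them shows that only one byte changed or where the change is. The comparison answers "same or not" and nothing more.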
In the security and cryptographic community, a well-designed system does not store your password. It stores a computed hash value of your password. If someone is trying to break into your account, it is exceedingly difficult for them to come up with a password that results in the same hash value as yours. In practical terms, you cannot run a hash algorithm backward to recover a password from a given hash value. The hash values of passwords are far less sensitive than the actual passwords, but they still deserve protection, because an attacker who obtains them can hash lists of likely passwords and look for a match, which is how weak passwords get exposed.
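As a rough illustration of storing a hash instead of a password, here is a sketch using PBKDF2 from Python's standard library. Note the assumptions: the random salt, the iteration count, and the function names make_record and check_password are not from this post. They reflect one common way of doing this, not a description of how any particular system works.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor (an assumption, not a recommendation)


def make_record(password: str):
    """Return (salt, digest) to store in place of the password itself."""
    salt = os.urandom(16)  # random per-user salt (assumed practice, not from the post)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(candidate: str, salt: bytes, stored: bytes) -> bool:
    """Re-hash the candidate with the stored salt and compare the results."""
    digest = hashlib.pbkdf2_hmac("sha256", candidate.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, stored)


salt, stored = make_record("correct horse battery staple")
print("stored value:  ", stored.hex())
print("right password:", check_password("correct horse battery staple", salt, stored))
print("wrong password:", check_password("password123", salt, stored))
```

The system only ever keeps the salt and the digest. Checking a login means hashing whatever was typed and comparing, never recovering the original password.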
All that being said, some hash algorithms are more secure than others. In a lab setting, the MD5 hash has been “cracked.” It is possible, with a modest amount of computing power, to create two different files that produce the same MD5 hash value. This is what is called a hash collision. I know of no instance of a hash collision in the “wild.” That’s not to say the MD5 algorithm is useless. You simply have to understand its appropriate uses and limitations.
One of the common uses of hash values in the forensics and law enforcement communities is in child pornography cases. Law enforcement maintains a database of hash values of known child pornography. This way they can share the hash values without having to share, transport or otherwise handle actual contraband material. An examiner can use tools to search seized evidence for files that have matching hash values. If there is a match, the examiner can further examine the highlighted file. The benefit is that an examiner can automate much of the otherwise very tedious and time-consuming process of reviewing what could be millions of pictures or videos on a computer when searching for contraband. It’s not a perfect solution. It can miss contraband items, because a file that has been altered in any way no longer matches the known hash value, but it does save a lot of time and resources. There isn’t a danger of someone being arrested for a false positive because no case is made on just a matching hash value. Someone still has to look at any matches and decide whether each one is a valid hit or not. It’s just a tool.
Similarly, there are hash value sets of known files that can be used to filter out otherwise known or uninteresting files among groups of millions of files, so an examiner can focus on the unique data.
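Both uses of hash sets described above, flagging files that match a list of interest and filtering out files already known to be uninteresting, come down to looking up each file's hash in a set. The sketch below shows the idea with in-memory stand-ins; the function names, file names, and byte strings are all hypothetical, and real forensic tools and hash databases are far more involved.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def triage(files, flagged_hashes, known_hashes):
    """Sort files into flagged, known, and unreviewed groups by hash lookup."""
    result = {"flagged": [], "known": [], "unreviewed": []}
    for name, data in files.items():
        digest = sha256_hex(data)
        if digest in flagged_hashes:
            result["flagged"].append(name)      # needs a human to examine it
        elif digest in known_hashes:
            result["known"].append(name)        # safe to filter out
        else:
            result["unreviewed"].append(name)   # unique data left to review
    return result


# Hypothetical evidence: names mapped to content (stand-ins for files on disk).
files = {
    "item_a.bin": b"content of interest",
    "renamed_copy.bin": b"content of interest",
    "system_file.bin": b"standard operating system file",
    "notes.bin": b"something nobody has catalogued",
    "item_a_edited.bin": b"content of interest.",  # one byte added
}
flagged_hashes = {sha256_hex(b"content of interest")}
known_hashes = {sha256_hex(b"standard operating system file")}

for group, names in triage(files, flagged_hashes, known_hashes).items():
    print(group, names)
```

The renamed copy is still flagged because the name plays no part in the hash, while the edited copy falls into the unreviewed group because a single changed byte produces a completely different hash. That is the sense in which hash matching can miss items.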
There are many other uses of hash values in both the forensic and cryptographic communities, but these examples should give you an idea of some of what is going on the next time you hear “hash value” in reference to an item of digital evidence.