Chapter 3:
Similarity Preserving Hash Functions
Frank Breitinger
Hochschule Darmstadt, CASED
SoSe 2012
Frank Breitinger Hash Functions in Forensics / SoSe 2012 1/45
Repetition
Similarity Preserving Hashing
Approaches and their Tools
Applications for Fuzzy Hashing
Peculiarities
Frank Breitinger Hash Functions in Forensics / SoSe 2012 2/45
Repetition
Repetition
Similarity Preserving Hashing
Approaches and their Tools
Applications for Fuzzy Hashing
Peculiarities
Frank Breitinger Hash Functions in Forensics / SoSe 2012 3/45
Repetition
Type: Similarity Preserving Hash Functions [1/2]
I Further property: similar inputs yield similar hash values.I Alignment Robustness.I Non-Propagation.I If possible: Also fulfill expectations for cryptographic hash
functions (partly).
I Field of Application:I Detection similar files during a forensic investigation (e.g.,
different versions of files).I Biometrics: Template protection.I Malware: Detect obfuscated malware (e.g., metamorphic
malware).I Junk mail detection.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 4/45
Repetition
Type: Similarity Preserving Hash Functions [2/2]
I Possible representatives:I Segment hashes (Tool dcfldd).I Context-triggered piecewise hashes (Tool ssdeep).I Similarity digests (Tool sdhash).
I Example:I ‘I don’t have any fear at home.’ vs
‘I don’t have any bear at home.’I Is there a match?
SHA-1: no match.ssdeep: similarity of 90%.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 5/45
Similarity Preserving Hashing
Repetition
Similarity Preserving Hashing
Approaches and their Tools
Applications for Fuzzy Hashing
Peculiarities
Frank Breitinger Hash Functions in Forensics / SoSe 2012 6/45
Similarity Preserving Hashing
Problem
Let X ⊆ Σ∗ be the set of inputs, let dx denote a distance functionon X and let x1, x2 ∈ X.
In order to identify the similarity between two inputs x1 and x2 werun into two problems:
1. Both inputs (x1, x2) might be very ‘large’ and thus calculatingdx(x1, x2) is very time consuming.
2. If x1 is known and we want to compare it against x2, we needto have x1 available (disk space problem).
Frank Breitinger Hash Functions in Forensics / SoSe 2012 7/45
Similarity Preserving Hashing
Solution
The aim is to use a compression function that obtain the similarityof the domain.
The Definition in the following is an own creation.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 8/45
Similarity Preserving Hashing
Naming
Several terms for the compression function: similarity preservinghashing, similarity digest, fuzzy hash function or similaritypreserving hash function.
We use the term approach for similarity preserving hashing whichconsists of a
similarity preserving hash function, which is a function / algorithm(which is denoted by h) to build a hash value /fingerprint and a
distance function, (denoted by dy) that outputs a similarity scorefor two hash values / fingerprints. The termfingerprint or hash value is due to cryptographic /traditional hash functions.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 9/45
Similarity Preserving Hashing
Solution: Possible candidates
Let h be called a similarity preserving hash function withh : X −→ Y and let dy denote a distance function on Y . Then apossible solution candidate is a setting of (Y, h, dy) with thefollowing properties:
1. |Y | < |X|.2. h is fast computable.
3. dy is fast computable.
Questions:
I Compare these properties to the ones of cryptographic hashfunctions. What is the difference?
I What is fast?
Frank Breitinger Hash Functions in Forensics / SoSe 2012 10/45
Similarity Preserving Hashing
Valid solution candidates
A solution candidate is valid, if there is a εy:
∀x1, x2 : if dx(x1, x2) ≤ εx,
then dy(h(x1), h(x2)) ≤ εy
This issues is called correctness.
In words:Similar inputs yield similar outputs.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 11/45
Similarity Preserving Hashing
Quality of solution candidates
The quality of a solution candidate can be measured by itscompleteness which is defined as follows:
∀x1, x2 : if dy(h(x1), h(x2)) ≤ εy,
then dx(x1, x2) ≤ εx
In words:Similar outputs imply similar inputs.
Often completeness is a little bit too strict. Thus we introduce theterm p-completeness: Let p be a probability with 0 ≤ p ≤ 1.
∀x1, x2 : if dy(h(x1), h(x2)) ≤ εy,
then P (dx(x1, x2) ≤ εx) ≥ p
Frank Breitinger Hash Functions in Forensics / SoSe 2012 12/45
Similarity Preserving Hashing
Main Goal
1. Overcome ‘drawbacks’ of cryptographic hash functions indifferent application contexts.
2. E.g., for computer forensics the main drawbacks are:I Data acquisition: Integrity of copy is destroyed, if some bits
change.
I White-/Blacklisting:I Suspect files similar to known-to-be-bad-files are not detected.
I Fragments are not detected (due to deletion, fragmentation).
Frank Breitinger Hash Functions in Forensics / SoSe 2012 13/45
Similarity Preserving Hashing
Approaches
Currently known approaches:
I Segment hashes (also called block hashes): Tool dcfldd.
I Context-triggered piecewise hashes: Tool ssdeep.
I Similarity digests: Tool sdhash.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 14/45
Approaches and their Tools
Repetition
Similarity Preserving Hashing
Approaches and their Tools
Applications for Fuzzy Hashing
Peculiarities
Frank Breitinger Hash Functions in Forensics / SoSe 2012 15/45
Approaches and their Tools
Segment Hashes including dcfldd
Frank Breitinger Hash Functions in Forensics / SoSe 2012 16/45
Approaches and their Tools
Segment Hashes
1. Underlying idea:I Split input data (volume, file) in blocks of fixed length.
I Compute for each segment its cryptographic hash.
2. Original aim: Improve integrity of storage media.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 17/45
Approaches and their Tools
Segment Hashes: Example - dcfldd
$ dcfldd if=/dev/hda1 of=image-hda1.dd bs=512 hashwindow=4096 hash=sha256
0 - 4096: da0bd2b16c7cd5acb5695e9d81fb6d832cba85312d87e08d0c675e41b608de50
4096 - 8192: 281f4b8ac2dcda0f3fd9a0642a694fedf829d7567a531b1cfc8925f94eebe7a3
8192 - 12288: 1c05a3c7251b666c1ec4a2b689e25f95a92a311613ce685fdf7cbf41552290e5
12288 - 16384: 6c9c17f271f18587bcccf8f9c6154b4bff764664a3eb8ddf06881168c5c4698b
16384 - 20480: c61cd658e73450dfb0dfc9a1d83cdbddd162d9194d81f27f0516bb107280e841
[REMOVED]
Frank Breitinger Hash Functions in Forensics / SoSe 2012 18/45
Approaches and their Tools
Segment Hashes: Example from NIST
1. Sample tool of Nicholas Harbour (since 2002):I dcfldd: An extension of dd.
I Department of Defense – Computer Forensics Laboratory.
I Provides MD5, SHA-1, SHA-2 family.
2. Evaluation by NIST (Douglas White, 2008):I Hashing of File Blocks: When Exact Matches are not Useful.
I NIST worked on Windows 2000 and XP OS files.
I Main result:I File-based data reduction leaves an average 30% of disk space
for human investigation.
I Incorporating block hashes reduces this to an average of 15%.
I Assist in recognising wiped media.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 19/45
Approaches and their Tools
Segment Hashes: Evaluation
1. Anti-Blacklisting is very easy:I Introduce an irrelevant byte in the first sector.
I All segment hashes differ from the stored segment hashes.
I Modified suspect file is not detected.
2. A good technique for whitelisting (see NIST results).
3. Size of segment hash database is large:I 4096 byte block size, SHA-1.
I size of hash value in bytessize of raw data in bytes = 20
4096 = 0.00488=⇒ 1 terabyte of raw data yields a 5 gigabyte hash database.
4. Hash database depends on the hashwindow size.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 20/45
Approaches and their Tools
Context Triggered Piecewise Hashes includingssdeep
Frank Breitinger Hash Functions in Forensics / SoSe 2012 21/45
Approaches and their Tools
Context Triggered Piecewise Hashes
1. Underlying idea:I Split input data (volume, file) in blocks of variable length.
I Compute a hash value for each segment.
I The sequence of these segment hashes is the context triggeredpiecewise hash of the input.
2. Question: How do we achieve blocks of variable length?
Frank Breitinger Hash Functions in Forensics / SoSe 2012 22/45
Approaches and their Tools
Context Triggered Piecewise Hashes (CTPH)
The end points of the blocks are determined by a pseudo randomfunction called rolling hash:
I Its value only depends on the current context.
I A window (e.g., of size 7 bytes) slides over the input.
I Context = Bytes of input data in the current window.
I If rolling hash matches a trigger value, an end point is set.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 23/45
Approaches and their Tools
CTPH: A sample tool
1. ssdeep (based on spamsum).
2. Window size is 7.
3. CTPH is a sequence of printable characters:I LS6B are encoded base64.
I Only the least significant 6 bits (LS6B) of a block hash areconsidered.
4. ssdeep decides about a match:I On base of the weighted edit distance (changing, inserting and
deleting characters) of two CTPHs.
I Edit distance is rescaled to a percentage match score.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 24/45
Approaches and their Tools
Piecewise Hashes: The algorithm
Frank Breitinger Hash Functions in Forensics / SoSe 2012 25/45
Approaches and their Tools
Piecewise Hashes: The algorithm
Frank Breitinger Hash Functions in Forensics / SoSe 2012 26/45
Approaches and their Tools
CTPH: Sample Research Questions
1. Rolling hash:I Shall be efficient and pseudo random.
I Current implementation is fast, but not pseudo random.
I Task: Find a slightly slower, but pseudo random rolling hash.
2. Fragment detection:I Kornblum’s approach fails, if fragments are much smaller than
the original file.
I Task: Find a different approach.
3. Edit distance:I Kornblum’s approach addresses text files (due to spam
detection).
I Task: Find a more general approach, which also addressesimages, videos, ...
Frank Breitinger Hash Functions in Forensics / SoSe 2012 27/45
Approaches and their Tools
CTPH: Origin & Evaluation
1. Originally proposed for spam detection (spamsum by AndrewTridgell, 2002)
2. Ported to forensics by Jesse Kornblum, 2006: ssdeep.
3. Evaluation by different researchers:I Some publications to improve ssdeep.
I Performance.I Detection rate.I Rolling hash.
I Main conclusion: Fail a security analysis → not usable inforensics.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 28/45
Approaches and their Tools
Similarity Digests including sdhash
Frank Breitinger Hash Functions in Forensics / SoSe 2012 29/45
Approaches and their Tools
Similarity Digests
1. Underlying idea:I Use statistical improbable features from input data (volume,
file) of 64 bytes.
I Generate a cryptographic hash (SHA-1) for each feature.
I Divide each hash value in 5 sub-hashes in order to set five bitswithin a Bloom filter.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 30/45
Approaches and their Tools
Similarity Digests: Example from Roussev
1. Sample tool of Vassil Roussev (since 2010):I sdhash.
I Very popular approach; NIST provides databases.
2. Evaluation by J. Beckingham / F. Breitinger / H. Baier(2012):
I Robust approach with several design errors.
I Performance is slower than ssdeep.I A parallelized version is coming.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 31/45
Approaches and their Tools
Similarity Digests: A sample tool
Preparation:
1. Select statistically improbable features on the basis of theirentropy.
I A ‘feature’ is a (substring-) sequence of 64 consecutive bytes.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 32/45
Approaches and their Tools
Similarity Digests: A sample tool
Feature selection:
2. Shannon entropy score H is calculated where P (Xi) is theempirical probability (i.e., the relative frequency) ofencountering ASCII code i within a feature.
H = −255∑i=0
P (Xi) · log2 (P (Xi))
3. Roussev uses a precedence rank Rprec. The least likelyfeatures measured by its entropy score gets the lowest rank.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 33/45
Approaches and their Tools
Similarity Digests: A sample tool
Identification of popular features:
4. A sliding window of a size 64 is going through all Rprec
values. At each position sdhash increments the Rpop score forthe leftmost feature with the lowest Rprec.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 34/45
Approaches and their Tools
Similarity Digests: A sample tool
Feature selection:
Frank Breitinger Hash Functions in Forensics / SoSe 2012 35/45
Approaches and their Tools
Similarity Digests: A sample tool
Fingerprint generation:
5. Hash underlying byte sequences for all Rpop scores higher thana certain threshold using SHA-1.
I Split hash (160 bit) into 5 sub-hashes of 32 bits and use the11 least significant bits of each sub hash.
I E.g., 24645aa1 3b9b9d0a b6e2da45 89fcd42d 215de81c
Least significant 11 bits of each sub-hash:24645aa1 = aa1 = 1010 1010 0001
11 bits: 010 1010 0001
6. Depending on these 5 x 11 bits set 5 bits within a Bloom filter(size 256 bytes = 2048 bits = 211 bits).
I A Bloom filter is a bit vector / array.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 36/45
Approaches and their Tools
Bloom filter
I Empty Bloom filter is a bit array of m bits, all set to 0.I k different hash functions are defined (e.g., sub-hashes within
sdhash).I Each hash function maps or hashes some set element to one
of the m array positions with a uniform random distribution.
http://wikipedia.de
I To query for an element (test whether it is in the set), feed itto each of the k hash functions to get k array positions
Frank Breitinger Hash Functions in Forensics / SoSe 2012 37/45
Applications for Fuzzy Hashing
Repetition
Similarity Preserving Hashing
Approaches and their Tools
Applications for Fuzzy Hashing
Peculiarities
Frank Breitinger Hash Functions in Forensics / SoSe 2012 38/45
Applications for Fuzzy Hashing
Applications
1. Forensics (on the file level): Detect similar files.I Blacklisting:
I Detect manipulated suspicious files.
I Find fragments of suspicious data.
I Identify similar versions of files.
I Whitelisting: Find changed unsuspicious files?
Frank Breitinger Hash Functions in Forensics / SoSe 2012 39/45
Applications for Fuzzy Hashing
Applications
2. Biometrics: Template protection.I Due to privacy only save hash value.I Why do we need similarity preserving hashing and not
cryptographic hashing (e.g., SHA-1)?
3. Malware Detection:I Detect obfuscated malware (e.g., metamorphic malware).
4. Junk mail detection.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 40/45
Peculiarities
Repetition
Similarity Preserving Hashing
Approaches and their Tools
Applications for Fuzzy Hashing
Peculiarities
Frank Breitinger Hash Functions in Forensics / SoSe 2012 41/45
Peculiarities
Database comparison
Let n be the number of hash values entries within a database.
I Comparison time for cryptographic hash functions: O(log2 n).
I Comparison time for similarity preserving hashing: O(n).I Ordering is not possible.
I Fragment detection.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 42/45
Peculiarities
Drawbacks
1. Compression → size of the hash value.I Fixed size of few bits vs. variable of several thousand bytes
2. Ease of computation → creating a hash value.I Currently cryptographic hash functions are faster.
Frank Breitinger Hash Functions in Forensics / SoSe 2012 43/45
Peculiarities
Whitelisting
Does it make sense to use non-cryptographic hash functions forwhitelisting?
Frank Breitinger Hash Functions in Forensics / SoSe 2012 44/45
Questions?
http://www.glasbergen.com/
Frank Breitinger Hash Functions in Forensics / SoSe 2012 45/45