Download - Chapter 3: [1.5ex] Similarity Preserving Hash Functions · Similarity Preserving Hash Functions Frank Breitinger ... I Similarity digests ... Frank Breitinger Hash Functions in Forensics

Chapter 3:

Similarity Preserving Hash Functions

Frank Breitinger

Hochschule Darmstadt, CASED

SoSe 2012

Frank Breitinger Hash Functions in Forensics / SoSe 2012 1/45

Repetition

Similarity Preserving Hashing

Approaches and their Tools

Applications for Fuzzy Hashing

Peculiarities


Repetition

Repetition




Peculiarities


Repetition

Type: Similarity Preserving Hash Functions [1/2]

I Further property: similar inputs yield similar hash values.I Alignment Robustness.I Non-Propagation.I If possible: Also fulfill expectations for cryptographic hash

functions (partly).

I Field of Application:I Detection similar files during a forensic investigation (e.g.,

different versions of files).I Biometrics: Template protection.I Malware: Detect obfuscated malware (e.g., metamorphic

malware).I Junk mail detection.


Repetition

Type: Similarity Preserving Hash Functions [2/2]

I Possible representatives:I Segment hashes (Tool dcfldd).I Context-triggered piecewise hashes (Tool ssdeep).I Similarity digests (Tool sdhash).

I Example:I ‘I don’t have any fear at home.’ vs

‘I don’t have any bear at home.’I Is there a match?

SHA-1: no match.ssdeep: similarity of 90%.



Repetition




Peculiarities



Problem

Let X ⊆ Σ∗ be the set of inputs, let dx denote a distance functionon X and let x1, x2 ∈ X.

In order to identify the similarity between two inputs x1 and x2 werun into two problems:

1. Both inputs (x1, x2) might be very ‘large’ and thus calculatingdx(x1, x2) is very time consuming.

2. If x1 is known and we want to compare it against x2, we needto have x1 available (disk space problem).



Solution

The aim is to use a compression function that obtain the similarityof the domain.

The Definition in the following is an own creation.



Naming

Several terms for the compression function: similarity preservinghashing, similarity digest, fuzzy hash function or similaritypreserving hash function.

We use the term approach for similarity preserving hashing whichconsists of a

similarity preserving hash function, which is a function / algorithm(which is denoted by h) to build a hash value /fingerprint and a

distance function, (denoted by dy) that outputs a similarity scorefor two hash values / fingerprints. The termfingerprint or hash value is due to cryptographic /traditional hash functions.



Solution: Possible candidates

Let h be called a similarity preserving hash function withh : X −→ Y and let dy denote a distance function on Y . Then apossible solution candidate is a setting of (Y, h, dy) with thefollowing properties:

1. |Y | < |X|.2. h is fast computable.

3. dy is fast computable.

Questions:

I Compare these properties to the ones of cryptographic hashfunctions. What is the difference?

I What is fast?



Valid solution candidates

A solution candidate is valid, if there is a εy:

∀x1, x2 : if dx(x1, x2) ≤ εx,

then dy(h(x1), h(x2)) ≤ εy

This issues is called correctness.

In words:Similar inputs yield similar outputs.



Quality of solution candidates

The quality of a solution candidate can be measured by itscompleteness which is defined as follows:

∀x1, x2 : if dy(h(x1), h(x2)) ≤ εy,

then dx(x1, x2) ≤ εx

In words:Similar outputs imply similar inputs.

Often completeness is a little bit too strict. Thus we introduce theterm p-completeness: Let p be a probability with 0 ≤ p ≤ 1.

∀x1, x2 : if dy(h(x1), h(x2)) ≤ εy,

then P (dx(x1, x2) ≤ εx) ≥ p



Main Goal

1. Overcome ‘drawbacks’ of cryptographic hash functions indifferent application contexts.

2. E.g., for computer forensics the main drawbacks are:I Data acquisition: Integrity of copy is destroyed, if some bits

change.

I White-/Blacklisting:I Suspect files similar to known-to-be-bad-files are not detected.

I Fragments are not detected (due to deletion, fragmentation).



Approaches

Currently known approaches:

I Segment hashes (also called block hashes): Tool dcfldd.

I Context-triggered piecewise hashes: Tool ssdeep.

I Similarity digests: Tool sdhash.



Repetition




Peculiarities



Segment Hashes including dcfldd



Segment Hashes

1. Underlying idea:I Split input data (volume, file) in blocks of fixed length.

I Compute for each segment its cryptographic hash.

2. Original aim: Improve integrity of storage media.



Segment Hashes: Example - dcfldd

$ dcfldd if=/dev/hda1 of=image-hda1.dd bs=512 hashwindow=4096 hash=sha256

0 - 4096: da0bd2b16c7cd5acb5695e9d81fb6d832cba85312d87e08d0c675e41b608de50

4096 - 8192: 281f4b8ac2dcda0f3fd9a0642a694fedf829d7567a531b1cfc8925f94eebe7a3

8192 - 12288: 1c05a3c7251b666c1ec4a2b689e25f95a92a311613ce685fdf7cbf41552290e5

12288 - 16384: 6c9c17f271f18587bcccf8f9c6154b4bff764664a3eb8ddf06881168c5c4698b

16384 - 20480: c61cd658e73450dfb0dfc9a1d83cdbddd162d9194d81f27f0516bb107280e841

[REMOVED]



Segment Hashes: Example from NIST

1. Sample tool of Nicholas Harbour (since 2002):I dcfldd: An extension of dd.

I Department of Defense – Computer Forensics Laboratory.

I Provides MD5, SHA-1, SHA-2 family.

2. Evaluation by NIST (Douglas White, 2008):I Hashing of File Blocks: When Exact Matches are not Useful.

I NIST worked on Windows 2000 and XP OS files.

I Main result:I File-based data reduction leaves an average 30% of disk space

for human investigation.

I Incorporating block hashes reduces this to an average of 15%.

I Assist in recognising wiped media.



Segment Hashes: Evaluation

1. Anti-Blacklisting is very easy:I Introduce an irrelevant byte in the first sector.

I All segment hashes differ from the stored segment hashes.

I Modified suspect file is not detected.

2. A good technique for whitelisting (see NIST results).

3. Size of segment hash database is large:I 4096 byte block size, SHA-1.

I size of hash value in bytessize of raw data in bytes = 20

4096 = 0.00488=⇒ 1 terabyte of raw data yields a 5 gigabyte hash database.

4. Hash database depends on the hashwindow size.



Context Triggered Piecewise Hashes includingssdeep



Context Triggered Piecewise Hashes

1. Underlying idea:I Split input data (volume, file) in blocks of variable length.

I Compute a hash value for each segment.

I The sequence of these segment hashes is the context triggeredpiecewise hash of the input.

2. Question: How do we achieve blocks of variable length?



Context Triggered Piecewise Hashes (CTPH)

The end points of the blocks are determined by a pseudo randomfunction called rolling hash:

I Its value only depends on the current context.

I A window (e.g., of size 7 bytes) slides over the input.

I Context = Bytes of input data in the current window.

I If rolling hash matches a trigger value, an end point is set.



CTPH: A sample tool

1. ssdeep (based on spamsum).

2. Window size is 7.

3. CTPH is a sequence of printable characters:I LS6B are encoded base64.

I Only the least significant 6 bits (LS6B) of a block hash areconsidered.

4. ssdeep decides about a match:I On base of the weighted edit distance (changing, inserting and

deleting characters) of two CTPHs.

I Edit distance is rescaled to a percentage match score.



Piecewise Hashes: The algorithm



CTPH: Sample Research Questions

1. Rolling hash:I Shall be efficient and pseudo random.

I Current implementation is fast, but not pseudo random.

I Task: Find a slightly slower, but pseudo random rolling hash.

2. Fragment detection:I Kornblum’s approach fails, if fragments are much smaller than

the original file.

I Task: Find a different approach.

3. Edit distance:I Kornblum’s approach addresses text files (due to spam

detection).

I Task: Find a more general approach, which also addressesimages, videos, ...



CTPH: Origin & Evaluation

1. Originally proposed for spam detection (spamsum by AndrewTridgell, 2002)

2. Ported to forensics by Jesse Kornblum, 2006: ssdeep.

3. Evaluation by different researchers:I Some publications to improve ssdeep.

I Performance.I Detection rate.I Rolling hash.

I Main conclusion: Fail a security analysis → not usable inforensics.



Similarity Digests including sdhash



Similarity Digests

1. Underlying idea:I Use statistical improbable features from input data (volume,

file) of 64 bytes.

I Generate a cryptographic hash (SHA-1) for each feature.

I Divide each hash value in 5 sub-hashes in order to set five bitswithin a Bloom filter.



Similarity Digests: Example from Roussev

1. Sample tool of Vassil Roussev (since 2010):I sdhash.

I Very popular approach; NIST provides databases.

2. Evaluation by J. Beckingham / F. Breitinger / H. Baier(2012):

I Robust approach with several design errors.

I Performance is slower than ssdeep.I A parallelized version is coming.



Similarity Digests: A sample tool

Preparation:

1. Select statistically improbable features on the basis of theirentropy.

I A ‘feature’ is a (substring-) sequence of 64 consecutive bytes.




Feature selection:

2. Shannon entropy score H is calculated where P (Xi) is theempirical probability (i.e., the relative frequency) ofencountering ASCII code i within a feature.

H = −255∑i=0

P (Xi) · log2 (P (Xi))

3. Roussev uses a precedence rank Rprec. The least likelyfeatures measured by its entropy score gets the lowest rank.




Identification of popular features:

4. A sliding window of a size 64 is going through all Rprec

values. At each position sdhash increments the Rpop score forthe leftmost feature with the lowest Rprec.




Feature selection:




Fingerprint generation:

5. Hash underlying byte sequences for all Rpop scores higher thana certain threshold using SHA-1.

I Split hash (160 bit) into 5 sub-hashes of 32 bits and use the11 least significant bits of each sub hash.

I E.g., 24645aa1 3b9b9d0a b6e2da45 89fcd42d 215de81c

Least significant 11 bits of each sub-hash:24645aa1 = aa1 = 1010 1010 0001

11 bits: 010 1010 0001

6. Depending on these 5 x 11 bits set 5 bits within a Bloom filter(size 256 bytes = 2048 bits = 211 bits).

I A Bloom filter is a bit vector / array.



Bloom filter

I Empty Bloom filter is a bit array of m bits, all set to 0.I k different hash functions are defined (e.g., sub-hashes within

sdhash).I Each hash function maps or hashes some set element to one

of the m array positions with a uniform random distribution.

http://wikipedia.de

I To query for an element (test whether it is in the set), feed itto each of the k hash functions to get k array positions



Repetition




Peculiarities



Applications

1. Forensics (on the file level): Detect similar files.I Blacklisting:

I Detect manipulated suspicious files.

I Find fragments of suspicious data.

I Identify similar versions of files.

I Whitelisting: Find changed unsuspicious files?



Applications

2. Biometrics: Template protection.I Due to privacy only save hash value.I Why do we need similarity preserving hashing and not

cryptographic hashing (e.g., SHA-1)?

3. Malware Detection:I Detect obfuscated malware (e.g., metamorphic malware).

4. Junk mail detection.


Peculiarities

Repetition




Peculiarities


Peculiarities

Database comparison

Let n be the number of hash values entries within a database.

I Comparison time for cryptographic hash functions: O(log2 n).

I Comparison time for similarity preserving hashing: O(n).I Ordering is not possible.

I Fragment detection.


Peculiarities

Drawbacks

1. Compression → size of the hash value.I Fixed size of few bits vs. variable of several thousand bytes

2. Ease of computation → creating a hash value.I Currently cryptographic hash functions are faster.


Peculiarities

Whitelisting

Does it make sense to use non-cryptographic hash functions forwhitelisting?


Questions?

http://www.glasbergen.com/


http://www.glasbergen.com/

Download - Chapter 3: [1.5ex] Similarity Preserving Hash Functions · Similarity Preserving Hash Functions Frank Breitinger ... I Similarity digests ... Frank Breitinger Hash Functions in Forensics

Top Related