

Delta Compression of Executable Code - Analysis, Implementation and Application-Specific Improvements

Lothar May

Master Thesis
Information Technology

Supervisors:
Prof. Dr. André Neubauer
Prof. Dr. Michael Tüxen

19 November 2008 (rev2)


Contents

1 Introduction
  1.1 Security Patches
  1.2 Matching With Mismatches
  1.3 Structure
  1.4 Conventions
  1.5 Definitions

2 Analysis
  2.1 Approach
  2.2 Motivation
    2.2.1 Intuitive Substring Matching
    2.2.2 Run Time Considerations
  2.3 Derivation of the Algorithm
    2.3.1 Formal Model
    2.3.2 Restrictions of the Model
    2.3.3 Optimisation using the FFT
    2.3.4 Projecting onto Subspaces
    2.3.5 Optimising Random Behaviour
  2.4 The Algorithm
    2.4.1 Reference Version
    2.4.2 The First Variant
    2.4.3 The Second Variant
    2.4.4 The Third Variant
    2.4.5 Comparison

3 Implementation
  3.1 Considerations
    3.1.1 Portability
    3.1.2 Environment
    3.1.3 Library Integration
  3.2 Implementation
    3.2.1 Structure
    3.2.2 Vector Type
    3.2.3 Selection of Primes
    3.2.4 Core Algorithm
    3.2.5 File Input
    3.2.6 Post Processing
  3.3 Further Steps

4 Improvements
  4.1 Theorem 1.1
    4.1.1 Selecting Primes Randomly
    4.1.2 Numerical Probability
  4.2 Derandomisation
    4.2.1 Selecting Primes Non-Randomly
    4.2.2 Creating φ Non-Randomly

5 Conclusion

Appendix

Bibliography


List of Figures

1  Definition of example strings S and T
2  Plot of match count calculated from example strings S and T
3  Matlab function to calculate the match count vector V
4  Matlab function to find positions with at least 50% matches
5  Matlab function to construct strings S and T according to the formal model
6  Construction example: At position 6 in S the string T randomly matches
7  Plot of the probability of a randomly good match not in X
8  Matlab function to calculate matches using the cyclic correlation
9  Matlab code to calculate the cyclic correlation using the FFT
10 Plot of the cyclic correlation C calculated from example strings S and T
11 Matlab calls to construct data according to the model (|X| = 1)
12 Plot of the cyclic correlation C of model data (|X| = 1)
13 Matlab function to project data before calculating the cyclic correlation
14 Plot of the cyclic correlation C^(1) of projected model data (|X| = 1)
15 Plot of the cyclic correlation C^(2) of projected model data (|X| = 1)
16 Chinese Remainder Theorem: Matlab function which calculates the solution
17 Matlab calls to construct data according to the model (|X| = 3)
18 Plot of the cyclic correlation C of model data (|X| = 3)
19 Plot of the cyclic correlation C^(1) of projected model data (|X| = 3)
20 Plot of the cyclic correlation C^(2) of projected model data (|X| = 3)
21 Matlab function to randomly select primes for the reconstruction of X
22 Plot of the cyclic correlation C of model data (|X| = 3, |Σ| = 65536)
23 Construction example: At position 6 in S a non-existing match of T is found
24 Construction according to figure 23 with a different φ
25 Matlab function to project data with varying φ_i(x)
26 Simple Matlab function to perform "matching with mismatches"
27 Matlab function to check input values according to the model
28 Matlab code which demonstrates the usage of our reference algorithm
29 Matlab function implementing "Algorithm 1.1" of [1]
30 Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=0)
31 Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=1)
32 Limits of m in algorithm 1.1 (ε = 0.1, p = 0.9, t = 2)
33 Matlab calls to construct data using primes around n/10
34 Plot of the cyclic correlation C^(1) of data using primes around n/10
35 Plot of the cyclic correlation C^(2) of data using primes around n/10
36 Matlab function implementing "Algorithm 1.2" of [1]
37 Matlab code which tries to use algorithm 1.2 but fails (pedantic=0)
38 Matlab code which demonstrates the usage of algorithm 1.2 (pedantic=1)
39 Limit of m in algorithm 1.2 (ε = 0.1, p = 0.9, t = 2)
40 Plot of y = −log(√δ + e^(−x)) with δ = 2
41 Matlab function to calculate vector D
42 Plot of the raw cyclic correlation C
43 Plot of the filtered cyclic correlation D
44 Matlab function implementing "Algorithm 1.3" of [1]
45 Matlab code which demonstrates the usage of algorithm 1.3 (pedantic=1)
46 Limit on m in algorithm 1.3 (ε = 0.1, p = 0.9, t = 2)
47 Matlab function to reconstruct X as in theorem 1.1 of [1]
48 Matlab function to provide numeric probability on a good match not in X
49 Helper function to add up good matches not in X
50 Helper function to count good matches not in X
51 Matlab function to compute C as in figure 8 by using the FFT as in figure 9
52 Matlab function to simply calculate the positions of the t largest values
53 Matlab function to iteratively calculate the Cartesian product (2 dimensions)
54 Matlab function to recursively calculate the Cartesian product


List of Tables

1 Simple and slow substring matching
2 Residues and solutions for p_1 = 5003, p_2 = 6007, X = {7700, 8050, 9000}
3 Test results using n = 102400, m = 512, p = 0.9, X = {34, 39411, 101410}
4 Base libraries for the C++ implementation
5 Core modules of the C++ implementation of the algorithm
6 Additional modules needed for delta compression
7 Elements of the pre-calculated index of the implementation
8 Numeric tests on probability of reconstructing X


Chapter 1

Introduction

1.1 Security Patches

Regardless of which operating system (OS) is used, security patches need to be applied frequently. This lays a heavy burden on OS vendors: They need to provide the patches quickly on servers with sufficient bandwidth¹. The users, on the other hand, often need a lot of patience when downloading the patches. If they do not have this patience, their systems might end up being vulnerable to known exploits.

The trigger for a security update is often as simple as changing a few lines of source code, for example to prevent a buffer overflow. Changing these few lines can have an enormous effect, however, if a large executable file needs to be replaced by a patched version. This can lead to a security patch of several megabytes, which is then downloaded by thousands or even millions of people. Not only does this cause huge bandwidth costs: the OS vendor also needs to provide update servers, network devices and manpower to administrate them. Even worse are accumulated patches, for example "Service Packs", which easily surpass a size of 100 megabytes. Deploying these patches is costly and quite a challenge.

The solution not to release patches at all in order to save costs is not an option, because nowadays most computers² are connected to the Internet. Known vulnerabilities need to be fixed as quickly as possible to provide basic security. In larger companies there are usually dedicated servers which maintain a cache of updates for employees. This speeds up the process and saves bandwidth costs, but it is not a general solution. Getting back to the root of the problem, the size of the patches needs to be reduced.

One way to achieve this is to use many small shared libraries instead of big executable files. If the security fix is local, only a small library needs to be replaced. However, it is not always feasible to instruct the programmers how to write their programs, especially if development is not managed centrally. Also, sometimes there are good reasons not to use shared libraries.

¹ Even though not entirely correct, "bandwidth" is used as a synonym for data rate, as is often done in the literature on computer science.
² and also many embedded systems


Not to mention that some security fixes cause changes in multiple files, which will again result in large patches if the new files are copied.

A totally different approach is to consider the existing files on the system, and compare them to the corresponding files containing security fixes, in order to avoid replacing entire files by new ones. Since these fixes usually implement only a few changes, most of the data is already present on the system, within the files which need patching. If this method is applied, and only the differences of the files are stored in security patches, the patches could be tremendously smaller. Actually, some OS vendors have recently started using this technique to deploy at least some of their updates:

• Microsoft Windows Update on Windows XP and above supports "Binary Delta Compression (BDC)" (see [15]), but this is a proprietary system and little information is available on its inner workings.

• FreeBSD and Mac OS X both come with the open source tool "bsdiff 4" (see [2]) and provide update tools which (can) make use of it.

Colin Percival, the author of bsdiff, has written a doctoral thesis [1] on this subject, in which he presents an algorithm to further improve bsdiff 4. Unfortunately, he did not publish the source code for his new algorithm, and there is only little third party material available on it. A reference to bsdiff 6 can be found in [5] (on page 939), but it does not provide a description. The authors of [4] use bsdiff 6 for comparison with their patch tool, but again no details are mentioned. If we are willing to make use of the new algorithm, we need to work through the doctoral thesis.

1.2 Matching With Mismatches

The present master thesis is mainly about the aforementioned doctoral thesis "Matching with Mismatches and Assorted Applications" [1]. Of that thesis, the focus is on the first chapter, which specifies an algorithm for matching with mismatches in three iterations.

What is "matching with mismatches"? It basically means finding something "similar" to what is present, where "similar" implies that it can be exactly the same or might be modified at some places. These possible modifications are the mismatches. Insertions (additional data in between) or deletions (removed data) are not considered for the measurement of the "similarity", only in-place changes.³ This is related to the Hamming distance (see e.g. [14] on page 19): Given a large string S and a small string T, we look for substrings⁴ in S with low Hamming distance to T.

Why is this useful for delta compression of executable code? Our aim is to encode only the differences of two executable files, specifically the original file and a different version including a security fix. Security fixes usually implement only a few source code modifications, and thus very little code is actually added or removed.

³ Other algorithms, which consider insertions and deletions, are described in [3].
⁴ meaning "continuous parts" of the string S with the same length as T


However, due to these modifications, memory addresses throughout the executable file change.⁵ In this context, an algorithm for matching with mismatches helps to find similar "blocks" of the original file in the new file, such that the differences can be identified and encoded. Blocks or parts of blocks which are identical in both files can be referenced and do not need to be copied.

⁵ see also [1] on page 32

1.3 Structure

This thesis is structured as follows:

1. Introduction

In the course of this chapter, conventions and definitions are provided which are used throughout the thesis.

2. Analysis

The analysis contains a step-by-step derivation of the algorithm for matching with mismatches. This includes the motivation for each step as well as numerical examples. Additionally, example code for the different iterations of the algorithm is provided, and basic tests are performed.

3. Implementation

Based on the prior analysis, an implementation of the algorithm using C++ is presented, and possible use in a tool for delta compression is prepared. This chapter describes the basic structure of the implementation and additional considerations like portability and third party libraries.

4. Improvements

With regard to the analysis and the C++ implementation, specific improvements of the algorithm are proposed and illustrated.

5. Conclusion

In this chapter, we provide an overview of what we have achieved and give an outlook identifying possible further projects.

1.4 Conventions

All examples are written in Matlab [19] code. However, neither knowledge of Matlab nor a Matlab license is required to understand this thesis and test the examples. Basic knowledge of C should be sufficient to be able to read the code, and GNU Octave [20] (with the additional packages from Octave-Forge [21]) can be used to run the examples.


Nevertheless, it should be noted that array and vector indexing in Matlab code starts at 1 (i.e. array{1} is the first element). This is opposed to C, where indexing starts at 0 (i.e. array[0]). In spite of this, we still stick to the constraints on all values as described in [1], which are mainly zero-based. So whenever there is an unexplained increment or decrement in the code, the reason is almost certainly the difference in indexing.
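As a small illustration of this convention (our own example, not taken from [1]):

% Zero-based position j (notation of [1]) versus one-based Matlab indexing:
S = 'carry';
j = 0;        % first position in the zero-based notation of [1]
S(j + 1)      % returns 'c', i.e. the element S_0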

Whenever possible, we use the symbols of [1] (on pages iii-iv). This means that the reader may generally switch to and from the doctoral thesis without problems. One notable difference, though, is the use of the vector indices i and j: In [1], the index i is first used as the index of the match count vector (V_i), but later i represents the "vector number" and j becomes the index (e.g. A^(i)_j). For the sake of clarity, we use j as vector index from the start. Additionally, when we are talking about "the vectors" A^(i), B^(i), or C^(i), we actually mean A^(i), B^(i), respectively C^(i) for all i ∈ {1, ..., k}, with k being described in the context.

Some concepts of probability theory are applied in this thesis without further explanation; for more information on this subject we refer to [8], especially chapter 3.

1.5 Definitions

A programmer usually regards executable code as binary data. A string, in contrast to that, is seen as human-readable data, which is binary data with special semantics. This distinction is generally applied because there are certain functions which do not work with all forms of binary data. Yet in the context of this thesis it is not relevant. We therefore take the mathematical point of view (see also [9] on pages 28-29):

Definition (Alphabet)

An alphabet Σ is any finite non-empty set.

Definition (String)

A string over an alphabet Σ is a finite sequence of elements from the alphabet.

These definitions will be used throughout the thesis, so whenever we are talking about a string, we do not impose any semantics on its elements. To give an example, an executable file (analysed at byte level) is a string over the alphabet Σ_exe = {0, 1, ..., 255}.
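As a brief illustration (the file name is hypothetical), such a string can be obtained in Matlab by reading a file at byte level:

% Read an executable as a string over Sigma_exe = {0, 1, ..., 255}.
fid = fopen('example.exe', 'rb');
S = fread(fid, Inf, 'uint8')';   % row vector of byte values
fclose(fid);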


Additionally, we provide some definitions of mathematical terms used in this thesis:

Definition (Ceiling)

⌈·⌉ : R → Z is the ceiling function which "rounds towards +∞", i.e. ⌈x⌉ returns the smallest n ∈ Z which is not less than x. The corresponding Matlab function is "ceil".

Definition (Addition of Sets)

Given A = {a_1, a_2, ..., a_n} ⊆ Z and B = {b_1, b_2, ..., b_m} ⊆ Z, we define

A + B = { a_i + b_j | i ∈ {1, 2, ..., |A|}, j ∈ {1, 2, ..., |B|} }
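A small numerical illustration of this definition (our own example, not from [1]):

% Set addition of A = {1, 2, 5} and B = {10, 20}.
A = [1, 2, 5];
B = [10, 20];
sums = [];
for i = 1 : length(A)
    for j = 1 : length(B)
        sums(end+1) = A(i) + B(j);
    end
end
AplusB = unique(sums)   % = [11 12 15 21 22 25]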

Definition (Interval)

[a, b) = {n ∈ N | a ≤ n < b} for a, b ∈ R. Note that this is not a common definition, because we only allow positive integers as elements of the interval.


Chapter 2

Analysis

2.1 Approach

In his doctoral thesis [1], Colin Percival writes that the first chapter, in which he introduces the new algorithm for matching with mismatches, "is not for the faint of heart"⁶. This is quite true; the mathematics behind the algorithm might seem daunting. We try to ease the pain a bit. However, even if we primarily have applications in mind, a good understanding of the underlying algorithm is essential.

In this chapter we provide a description of the algorithm for matching with mismatches with

• detailed “common sense” motivation and reasoning,

• various numerical examples, including numerical evidence and restrictions.

Our approach is to numerically show how and why the algorithm works. We regard this as a reasonable addition to chapter 1 of [1], which mainly consists of lemmas/theorems and proofs.

2.2 Motivation

2.2.1 Intuitive Substring Matching

Substring matching is used very frequently in everyday life. For example, the "Find..." function of any word processor performs a substring search in a larger string. This search is usually designed to find only exact matches⁷. Whenever a single element does not match, it means that the whole substring does not match.

Recalling that we want to encode the differences of two files, exact matches would help but are too strict to be used in general (see also [1] on page 33). It is more useful to also find sections which mostly match, with some mismatches, where the mismatches could for example be modified memory addresses.

⁶ See [1] on page 3.
⁷ maybe being case insensitive


Intuitively, this can be done as follows: We iterate through the large string and compare the small string with the substring at the current position. For each substring comparison, we count the number of elements which match, and then process these match counts to find the "good" matches.⁸ As an example, consider the strings⁹ S, T and their lengths n, m in figure 1.

S = 'do not tarry water carry';
T = 'carry';
n = length(S); % = 24
m = length(T); % = 5

Figure 1: Definition of example strings S and T

Performing a substring match of T in S using the intuitive algorithm described above leads to the result shown in table 1 with a plot as in figure 2.

Table 1: Simple and slow substring matching

Position (j)  View                        Match Count (V_j)
              do not tarry water carry
 0            carry                        0
 1             carry                       0
 2              carry                      0
 3               carry                     0
 4                carry                    0
 5                 carry                   0
 6                  carry                  1
 7                   carry                 4
 8                    carry                1
 9                     carry               0
10                      carry              0
11                       carry             0
12                        carry            0
13                         carry           1
14                          carry          1
15                           carry         1
16                            carry        0
17                             carry       0
18                              carry      1
19                               carry     5

In the language of mathematics, the match counts can be seen as a vector, and the calculation is formally done as follows (see [1] on page 6):

V_j = \sum_{i=0}^{m-1} \delta(S_{i+j}, T_i) \quad \forall j \in \{0, \dots, n-m\} \qquad (2.1)

⁸ "Good" matches are matches with a high number of matching characters, compared to the maximum possible match count.
⁹ The longer string was taken from J. W. Goethe's "The Sorcerer's Apprentice".


Figure 2: Plot of match count calculated from example strings S and T

The function δ : Σ × Σ → R is in our case the Kronecker delta, i.e.:

\delta(a, b) = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{otherwise} \end{cases} \quad \forall a, b \in \Sigma \qquad (2.2)

Translating equation (2.1) to Matlab code is fairly straightforward (see figure 3), except that we should not forget to check the input constraint.

function [V] = match(S, T)
n = length(S);
m = length(T);
% Check matching predicate.
if (not (m < n))
    error('Invalid vector lengths.');
end
% Calculate match count vector.
V = zeros(1, n-m+1);
for j = 0 : (n-m)
    V(j+1) = sum(S(j+1:j+m) == T);
end

Figure 3: Matlab function to calculate the match count vector V

Now that we are able to calculate the match counts, we need to process them to find good matches. In our example (table 1), the maximum possible match count is m = 5, which is the length of T, the smaller string. Assuming that we wish to find the positions with at least 50% of the maximum match count, we require no less than ⌈m/2⌉ = 3 matches for one substring. Thus, we extract only those positions j which satisfy the predicate V_j ≥ ⌈m/2⌉. This way, we get all the spikes in figure 2, while ignoring small match counts which are "kind of random". This filtering is easily done in Matlab code (see figure 4).


function [J] = find_good_matches(V, m)
J = find(V >= ceil(m/2)) - 1;

Figure 4: Matlab function to find positions with at least 50% matches

We have presented an intuitive algorithm for matching with mismatches, implemented it, and it works fine. However, is this algorithm also applicable to our problem of finding similar sections of two different files, given that these files contain much more data than our test input strings?

2.2.2 Run Time Considerations

We observe that our algorithm requires n−m+1 steps when iterating through S, and each step consists of m element comparisons. Additionally, it requires n−m+1 steps to extract the good matches, even if this is done "on the fly". All in all it will complete in O((n−m+1)(m+1)) = O(nm + n − m² + 1) time.¹⁰

For large n with n ≫ m, the run time can be approximated as O(nm). The factor n seems quite natural, as we need to iterate through (most of) S. However, the factor m is specific to the method of comparison we are using, so there might be room for optimisation.

In the context of comparing two files, n could be the size of one file, and m could be the size of a "block" of the second file which we are trying to match. File sizes tend to be quite large nowadays, for example n_exe = 2,097,152 (2 MB), with a sample block size of m_exe = 2,048 (2 KB). The run time is O(n_exe · m_exe), which results in approximately 4.2950 · 10⁹ steps, and that would be only for matching a single block. Assuming the second file also has a size of 2 MB, and we simply want to match all non-overlapping blocks, that would mean matching n_exe/m_exe = 1024 blocks, i.e. some O(n_exe · m_exe · n_exe/m_exe) = O(n_exe²) steps, which are approximately 4.3980 · 10¹² steps. This is a quadratic time algorithm, (mostly) independent of the block size m. Even modern processors cannot compensate for this.
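The step counts above can be reproduced with a few Matlab statements (a back-of-the-envelope sketch, not part of [1]):

% Step counts for the intuitive algorithm.
n_exe = 2097152;                  % 2 MB
m_exe = 2048;                     % 2 KB block size
steps_one_block = n_exe * m_exe   % approx. 4.2950e9
steps_all_blocks = n_exe^2        % approx. 4.3980e12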

Based on this result, we can be sure that our intuitive algorithm is not quite fast enough (in other words: it is too expensive), and that optimisation is a necessity. There is one thing in our favour, though: We do not have the absolute requirement to always calculate the exact match counts and find only the best matches. If we do not find them, the calculated difference between the two files will be larger, which will result in larger patches, but we will still succeed. With that in mind, one option to speed up this calculation is to estimate the match count vector V using a randomized algorithm with a sufficiently high chance of success. This leads us to the new algorithm for matching with mismatches as described in [1].

¹⁰ For a description of the O-notation as used in this thesis, see [6] on pages 44-45.


2.3 Derivation of the Algorithm

2.3.1 Formal Model

Since we could not find a suitable algorithm quickly and intuitively, we now use a formal way to deal with the problem. Thus, we need to formalize the problem of comparing two similar versions of one executable program. If we choose a block of one file and try to locate a similar block in the other file, we expect to find exactly one good match. This is at the position where the corresponding code before applying the security fix is present. For the sake of a more general approach, instead of simply considering the best match, we plan to find the t best matches. Additionally, there might be not-so-good matches which occur by chance and are random. These need to be considered, because we prefer the good matches over them.

Instead of choosing specific example strings S and T, we generate them randomly from an alphabet Σ subject to a certain condition: There are some positions within S where T matches well, in the sense that each character matches with probability p. The indices of these good matches in S are assumed to be elements of the set X. Even more formally, the model we are using is specified as follows (citing from [1] on pages 7-8):

"Problem space: A problem is determined by a tuple (n, m, t, p, ε, Σ, X) where {n, m, t} ⊂ N, {p, ε} ⊂ R, m < n, 0 < ε, 0 < p, |Σ| is even, and X = {x_1, ..., x_t} ⊂ {0, ..., n−m} with x_i ≤ x_{i+1} − m for 1 ≤ i < t.

Construction: Let a string T of length m be constructed by selecting m characters independently and uniformly from the alphabet Σ. Let a string S̄ be constructed by randomly selecting n characters independently and uniformly from the alphabet Σ. Let a string S of length n be constructed by independently taking S_i = T_{i−x_k} with probability p if ∃ x_k ∈ {i−m, ..., i−1} and S_i = S̄_i otherwise [...]."

In other words (with regard to finding differences of files), we create T from randomly chosen characters to be a "block" of one file. We then create another file S from randomly chosen characters, but at certain positions in S we copy the block T into the file S. This copy of T is not an exact copy, but an "approximate" copy, and the probability p states how accurate the copy is. For example if p = 0.9, 9 of 10 characters will be copied correctly on average. The construction can be performed using a Matlab function as shown in figure 5.

Based on this model, which generates S and T using given positions of good matches, we wish to invert the random construction and find X with a probability of at least 1 − ε. In this context, ε is a non-zero parameter which can be set to achieve the desired "accuracy", but choosing ε will impose certain restrictions on other input values, as we will see later.

In addition to the problem of reconstructing X, we wish to identify the parts of the algorithm which are independent of T. Following the nomenclature of [1], we call a pre-calculation of these parts an "index" of the algorithm. As we need to match several blocks (different strings T) with the same target file (constant string S), such an index can speed up the processing.


function [S, T] = construct(n, m, p, Sigma, X)
% X and Sigma should be sets.
if (not (length(X) == length(unique(X))))
    error('X is not a set');
end
if (not (length(Sigma) == length(unique(Sigma))))
    error('Sigma is not a set');
end
% Check construction predicates.
if (not (m < n && 0 < p && mod(length(Sigma), 2) == 0 ...
        && sum(ismember(X, [0:n-m])) == length(X)))
    error('Invalid construction value.');
end
for i = 1 : length(X) - 1
    if (not (X(i) <= X(i+1) - m))
        error('Invalid set X.');
    end
end
% Choose T independently and uniformly distributed from the alphabet.
T = randsrc(1, m, Sigma);
% Default: Choose S independently and uniformly distributed.
S = randsrc(1, n, Sigma);
% At a set of offsets, characters of S and T are more likely to match.
for i = 1 : n
    x_k = intersect(X, [i-m:i-1]);
    if (length(x_k) == 1 && rand() <= p)
        S(i) = T(i-x_k);
    end
end

Figure 5: Matlab function to construct strings S and T according to the formal model

However, this chapter mainly deals with the reconstruction of X; an index is generated in the C++ implementation in section 3.2.4.

2.3.2 Restrictions of the Model

As Colin Percival points out, "the model given above is quite restrictive" ([1] on page 8) and does not completely fit the situation in practice. One thing to note is that the bytes of executable files usually are neither uniformly distributed nor independent. This will be considered in section 4.2.2.

Whenever dealing with random string constructions, however, we have to consider that all kinds of strings could be constructed, even strings we did not specifically have in mind when defining the model. Namely, there might be randomly good matches of T in S at positions which are not in X. If this happens, the random symbols have formed roughly the substring T within S just out of pure luck, for example as shown in figure 6.

These kinds of matches are distracting in our model, because our aim is to compute X and only X. Let us assume we have at least one such good match, specifically at position i with i ∉ X, i.e. i ∈ X̄ = {0, ..., n−m} − X.


Σ = {a, b, c, ..., z}
X = {1, 12}
T = "abc"
S = "fabcklabchxvabcs"
    (the occurrences of "abc" start at positions 1 ∈ X, 6 ∉ X, and 12 ∈ X)

Figure 6: Construction example: At position 6 in S the string T randomly matches.

Hence, when estimating X, we cannot be sure that X contains the positions of maximum match counts, because V_i (our lucky match count) might actually be larger. Thus, we will have to guess the elements of X, which reduces the probability of finding the correct X to 0.5 or less.

We therefore conclude: To effectively estimate X without "fifty-fifty guessing", all match counts at positions not in X must be smaller than those at positions in X, i.e.:

\forall (i, j) \in \bar{X} \times X : V_i < V_j \qquad (2.3)

The probability q that a randomly good match happens, i.e. that the predicate (2.3) is not true for a specific pair (i, j), depends on several parameters (see also [1] on page 8): The first parameter is m, the size of the string T. The smaller m is, the more likely we are to encounter a randomly good match. The second parameter is p, because if p is near zero, the supposed "good" matches in X will not be very good, and it will be more likely that a random match occurs which is "better". The third parameter is the size of the alphabet, namely |Σ|. The larger the alphabet, the smaller the probability of a randomly good match (if the other factors are constant). Implicitly, and as the last parameter, the probability also depends on |X|. For example, if X = {0, m, 2m, ..., hm} with h ∈ N, hm ≤ n−m, it is not possible to locate any randomly good matches which are not in X.¹¹

We assume m to be the most important of these factors, because it is variable and depends on the string T, while p and |Σ| will be constant in most applications of the model, and |X| is considered to be very small, i.e. |X| ≪ n/m. Hence, as an example, we numerically calculate q for m = [1, 15] with constant p = 0.7, |Σ| = 2 and |X| = 1. Figure 7 shows the result.¹²
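The following Monte-Carlo sketch indicates how such a numerical estimate of q can be obtained; it is our own simplified illustration and not the appendix script mentioned in the footnote. It reuses construct (figure 5) and match (figure 3) and assumes |X| = 1.

function [q] = estimate_q(n, m, p, Sigma, trials)
% Estimate the probability q that some position outside X reaches a match
% count at least as large as the one at the single position in X.
bad = 0;
for trial = 1 : trials
    X = [n - m];                            % one good match, |X| = 1
    [S, T] = construct(n, m, p, Sigma, X);  % figure 5
    V = match(S, T);                        % figure 3
    others = V;
    others(X + 1) = [];                     % match counts at positions not in X
    if (max(others) >= V(X + 1))
        bad = bad + 1;                      % predicate (2.3) violated
    end
end
q = bad / trials;

A call like q = estimate_q(200, 8, 0.7, uint8([0:1]), 500); runs one such experiment.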

Our numerical example suggests that q(m) with the assumptions above is an exponential function, which is given in [1] on page 8 as

q(m) = \exp(-O(m)) \qquad (2.4)

As apparent from figure 7, q becomes negligible for large values of m. So, basically, this limitation of the model imposes some lower bound on m when reconstructing X.

However, if we leave the model for a moment and switch back to what we intend to do, that is matching two files, this limitation turns out to be insignificant for the following reasons:

¹¹ The factor |X| is not mentioned in [1], but it is included here to show that the calculation of q is quite complex. Equation (2.4), also given in [1] on page 8, is based on certain assumptions and cannot be applied in general.
¹² The script used to create this plot can be found in the appendix.


Figure 7: Plot of the probability of a randomly good match not in X

1. If we found random matches between two files that we did not "expect", we would be happy and would gladly accept them.

2. The block size m when matching two files should not be too small anyway, so that proper compression can be achieved.

3. We will set p to be near 1 anyway, otherwise we would not be able to achieve proper results.

Even though we have identified certain restrictions of the model, we can continue with our work with the knowledge that these limitations will not block the progress of solving our problem.

2.3.3 Optimisation using the FFT

Based on the model, we intend to derive an improved algorithm for matching with mismatches. As a first optimisation, the match count vector V in equation (2.1) can be calculated with the help of the Fast Fourier Transform. This requires O(n√(m log m)) time for matching one block (see [1] on page 7)¹³, which is less than the O(nm) time of our intuitive algorithm, but there is still potential for improvement.

To improve on this, we note that we do not specifically need the vector V as described in equation (2.1). When processing V, we extract "large" values, thus a vector containing spikes at the same positions as V would also be perfectly fine.

¹³ Unfortunately, the references mentioned in [1] on using the FFT to calculate V do not cover the FFT at all, and are therefore not very useful in this context.


A vector with this characteristic is the cyclic correlation¹⁴ of the two strings S and T, when treating them as discrete signals with certain properties.

Assuming φ : Σ → R is a function which converts a character to a signal value,¹⁵ the cyclic correlation C is calculated as follows:¹⁶

\forall j \in \{0, \dots, n-1\}:

A_j = \varphi(S_j) \qquad (2.5)

B_j = \begin{cases} \varphi(T_j) & \text{if } j < m \\ 0 & \text{otherwise} \end{cases} \qquad (2.6)

C_j = \sum_{r=0}^{n-1} A_{(r+j) \bmod n} \cdot B_r \qquad (2.7)

In this calculation, the string S is converted to the signal vector A, and T is converted to B with zero padding, such that A and B have the same size. In order to retrieve proper results, we have to define φ in a way that it does not "weight" characters differently. As a counter-example, defining φ to map each element of Σ to a unique numerical representation,

\text{given } \Sigma = \{x_1, x_2, \dots, x_{|\Sigma|}\} = \bigcup_{j=1}^{|\Sigma|} \{x_j\}: \quad \varphi(x_j) = j \qquad (2.8)

will not produce proper results, because matching x_1 and x_1 (which are equal) when calculating the cyclic correlation will yield 1 · 1 = 1,¹⁷ while matching x_1 and x_2 (which are non-equal) will produce 1 · 2 = 2. In other words, certain mismatches will count much more than certain matches, and this is not the result we wish to have.

Instead, we define φ to randomly map half of Σ to 1 and the other half to −1 (similar to [1] on page 12):

\text{Choose } \Sigma' \subset \Sigma \text{ with } |\Sigma'| = \tfrac{1}{2}|\Sigma| \text{ uniformly at random.} \quad \varphi(x) = (-1)^{|\Sigma' \cap \{x\}|} \qquad (2.9)

Note that in case |Σ| > 2 this is a lossy conversion of the characters to signal values, since it maps Σ to {−1, 1}.

¹⁴ see e.g. [11] on page 72
¹⁵ We do not consider the case that the function φ maps to C (as mentioned in [1] on page 27), because for all our requirements mapping to R is sufficient.
¹⁶ This is a simplification based on parts of algorithm 1.1 in [1] on page 12.
¹⁷ The underlying operation in equation (2.7) is a multiplication.


However, in contrast to equation (2.8), mismatches will never produce larger values than matches: Matching x_1 and x_1 will produce 1, matching x_1 and x_2 will produce either 1 or −1, depending on how Σ' was chosen.

Figure 8 shows the calculation of the cyclic correlation in Matlab code, using φ as defined in equation (2.9).

function [C] = match_cyclic_correl(S, T, Sigma)
% Retrieve lengths.
n = length(S);
m = length(T);
% Sigma should be a set and sorted. We assume it to be continuous.
alphabetSize = length(Sigma);
if (not (alphabetSize == length(unique(Sigma)) && issorted(Sigma)))
    error('Sigma is not a sorted set');
end
% Check input predicates.
if (not (m < n && mod(alphabetSize, 2) == 0))
    error('Invalid input value.');
end
Sigma_base = double(Sigma(1));
% Calculate phi.
tmp_phi = ones(1, alphabetSize, 'single');
for j = 1 : alphabetSize/2
    tmp_phi(j) = single(-1);
end
% Use a random mapping.
phi = intrlv(tmp_phi, randperm(alphabetSize));
% Convert S and T to A and B.
A = zeros(1, n, 'single');
B = zeros(1, n, 'single');
for j = 0 : n - 1
    A(j + 1) = phi(double(S(j + 1)) - Sigma_base + 1);
    if (j < m)
        B(j + 1) = phi(double(T(j + 1)) - Sigma_base + 1);
    end
end
% Calculate the cyclic correlation.
C = zeros(1, n, 'single');
for j = 0 : n - 1
    tmpC = single(0);
    for r = 0 : n - 1
        tmpC = tmpC + A(mod(r+j, n) + 1)*B(r+1);
    end
    C(j+1) = tmpC;
end

Figure 8: Matlab function to calculate matches using the cyclic correlation

Due to the nested loop in equation (2.7),¹⁸ the calculation of C requires O(n²) time. The actual improvement comes from the fact that the cyclic correlation can be computed using the FFT (see [12] on pages 545-546) in O(n log₂(n)) time (according to [10]). The corresponding Matlab code is shown in figure 9. In this context, fft and ifft are the Fast Fourier Transform and the inverse FFT, respectively, and conj is the complex conjugate.

¹⁸ because this equation is applied ∀ j ∈ {0, ..., n−1}


% Calculate the cyclic correlation using the fft.
C = ifft(fft(A) .* conj(fft(B)));

Figure 9: Matlab code to calculate the cyclic correlation using the FFT

The resulting vector C does not necessarily contain match counts. Mismatches decrease values in C; e.g., for m = 15 a match count of 10 at position j results in C_j = 10 − 5 = 5. This means that processing C as we processed V in figure 4, by filtering values ≥ ⌈m/2⌉, might lead to different results. This is specifically true for |Σ| > 2, because in that case the string-to-signal conversion is lossy. Thus, there is more "background noise" than in the match count vector V. Performing the matching of the example strings (see figure 1) using the cyclic correlation leads to a result¹⁹ as plotted in figure 10. Negative values in this plot are clear mismatches (because they are the result of accumulated −1 · 1 multiplications, see equation (2.7)). Positive values are likely but not guaranteed to be matches, depending on how suitable the randomly generated φ is for our example strings.
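For the lossless binary case |Σ| = 2, the relation between C and V can be written out explicitly (our own summary of the argument above, not a formula from [1]): each of the V_j matching characters contributes +1 to C_j and each of the m − V_j mismatching characters contributes −1, hence

C_j = V_j - (m - V_j) = 2 V_j - m

which reproduces the example above: 2 · 10 − 15 = 5.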

Figure 10: Plot of the cyclic correlation C calculated from example strings S and T

To find the t best matches using the result of the cyclic correlation, we identify the t positions at which C takes the largest values. In our example, the positions 7 and 19 both indicate full matches, although in fact position 7 only matches 4 of 5 characters (see table 1).

¹⁹ The result may vary because the algorithm is randomized.


This emphasizes the fact that our new method does not always lead to correct results. When choosing φ unluckily, we might even find full matches at positions where none are present.

However, the method is still very useful if certain conditions are met. Let us now apply our model and create S and T according to it, which means that their characters are uniformly distributed²⁰. If m is large enough to make up for the (possibly) lossy conversion done by φ,²¹ the spikes within C will (with high probability) be the good matches we wish to find, since T matches within S "well or not at all" ([1] page 6).

Our initial problem remains, though: The run time is dominated by the O(n log₂(n)) time required for the FFT, which does not scale well for our purpose.²² In addition to that, memory usage is O(3n),²³ which can be too much for use with large files.²⁴ The next step is therefore to "shorten" the lengths of the vectors A and B before calculating the cyclic correlation, while still retaining the necessary information.

2.3.4 Projecting onto Subspaces

In order to be able to reduce the size of the data before calculating the FFT, we need to find a projection (preferably lossless or with only a small loss) which reduces the vector sizes in equations (2.5) and (2.6) but maintains the basic properties, such that we can still perform the cyclic correlation and extract proper results.

A Simplified Approach

We now introduce such a projection (based on [1], page 9), starting with the special case |X| = 1 and |Σ| = 2, which basically means that we are only looking for the best match of T in S, whereby the conversions of strings to signals are lossless. For this case, we construct an example to show how the projection is performed and to explain the mathematical background.

n = 10000;
m = 256;
p = 0.95;
Sigma = uint8([0:1]);
X = [9000];
[S, T] = construct(n, m, p, Sigma, X);
C = match_cyclic_correl(S, T, Sigma);

Figure 11: Matlab calls to construct data according to the model (|X| = 1)

²⁰ except for the substrings in S which match well with T
²¹ To be more exact, this does not solely depend on m, but choosing a large m is one way to deal with this problem.
²² In fact, this method requires even more time than the FFT-based calculation mentioned at the beginning of this section, but it leaves room for optimisation.
²³ implicitly depending on the system-specific size of floating point values
²⁴ At least one would prefer low memory usage, especially when multiple file patches need to be generated in parallel.


Figure 11 lists the Matlab calls used to create example data according to the model. Note that we choose n/m ≫ |X| and select m not to be too small, to make sure that the restrictions of the model (section 2.3.2) do not apply. Figure 12 shows the resulting vector C (following equation (2.7)). There is significant noise and a spike at position j = 9000, which we expected because X = {9000}.

Figure 12: Plot of the cyclic correlation C of model data (|X| = 1)

Now we need to reduce the data size and still get the position of the maximum, j = 9000, as result. Assuming we can somehow reduce the data size modulo a prime number, we could extract the position modulo this prime. Performing this several times with different primes will give us the position modulo multiple primes. To calculate the actual result we can make use of the Chinese Remainder Theorem, which states "that it is possible to reconstruct integers in a certain range from their residues modulo a set of coprime moduli" ([7], page 194). This is possible if the integer x we wish to reconstruct follows the predicate 0 ≤ x < M, where M = p_1 p_2 ... p_k is the product of the coprime integers (with k being the number of primes).

For example, we can choose the primes p_1 = 5003 and p_2 = 6007 (being about n/2 with enough difference; this choice is for simplicity)²⁵, and perform the character-to-signal conversions accumulated modulo each of the primes (based on algorithm 1.1 in [1] on page 12), ∀ i ∈ {1, ..., k}, ∀ j ∈ {0, ..., p_i − 1}:

²⁵ The reconstruction is clearly possible because 0 ≤ n = 10000 < p_1 p_2 = 30053021. We are not required to choose primes, coprime values are sufficient, but we prepare for later changes to the algorithm.


A^{(i)}_j = \sum_{\lambda=0}^{\lceil (n-j)/p_i \rceil - 1} \varphi(S_{j + \lambda p_i}) \qquad (2.10)

B^{(i)}_j = \sum_{\lambda=0}^{\lceil (m-j)/p_i \rceil - 1} \varphi(T_{j + \lambda p_i}) \qquad (2.11)

C^{(i)}_j = \sum_{r=0}^{p_i - 1} A^{(i)}_{(r+j) \bmod p_i} \cdot B^{(i)}_r \qquad (2.12)

This means that we shorten the original vector A_j = φ(S_j) by adding up "roughly the second half of the vector to the first half" (with a different boundary for each prime). Equation (2.10) specifies a projection from Σⁿ → R^(p_i), ∀ i ∈ {1, ..., k}. Given p_i < n, this is a lossy (irreversible) projection, but it maintains certain characteristics. In our example, instead of one large vector A, we now have two vectors: A^(1) with size 5003 and A^(2) with size 6007. The vector B is less affected: only some zeros are cut from its end, since in our case m < p_1 < p_2.²⁶

²⁶ Basically, we could use the same definition for B^(i) as in equation (2.6), but the new definition covers the general case and is therefore preferable.
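As a toy illustration of the folding in equation (2.10) (our own example, not from [1]): for n = 7 and p_i = 3, the projected vector has the three entries

A^(i)_0 = φ(S_0) + φ(S_3) + φ(S_6)
A^(i)_1 = φ(S_1) + φ(S_4)
A^(i)_2 = φ(S_2) + φ(S_5)

so every original position j is still represented, but only through its residue j mod p_i.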


function [C] = match_cyclic_correl_project(S, T, Sigma, Primes)
% Retrieve lengths.
n = length(S);
m = length(T);
k = length(Primes);
% Sigma should be a set and sorted.
% We assume it to be numeric, with values >= 0 and continuous.
alphabet_size = length(Sigma);
if (not (alphabet_size == length(unique(Sigma)) && issorted(Sigma)))
    error('Sigma is not a sorted set');
end
% Check input predicates.
if (not (m < n && mod(alphabet_size, 2) == 0 && Sigma(1) >= 0))
    error('Invalid input value.');
end
sigma_base = double(Sigma(1));
% Calculate phi (character to signal conversion).
tmp_phi = ones(1, alphabet_size, 'single');
for j = 1 : alphabet_size/2
    tmp_phi(j) = single(-1);
end
% Use a random mapping.
phi = intrlv(tmp_phi, randperm(alphabet_size));
% Convert S and T to A and B, project to subspaces.
for i = 1 : k
    A{i} = zeros(1, Primes(i), 'single');
    B{i} = zeros(1, Primes(i), 'single');
    for j = 0 : Primes(i) - 1
        tmpA = single(0);
        tmpB = single(0);
        for lambda = 0 : ceil((n-j)/Primes(i))-1
            tmpA = tmpA + phi(double(S(j+lambda*Primes(i)+1)) ...
                - sigma_base + 1);
        end
        for lambda = 0 : ceil((m-j)/Primes(i))-1
            tmpB = tmpB + phi(double(T(j+lambda*Primes(i)+1)) ...
                - sigma_base + 1);
        end
        A{i}(j+1) = tmpA;
        B{i}(j+1) = tmpB;
    end
end
% Calculate the cyclic correlation using the fft.
for i = 1 : k
    C{i} = ifft(fft(A{i}) .* conj(fft(B{i})));
end

Figure 13: Matlab function to project data before calculating the cyclic correlation

The cyclic correlation is calculated for each of these smaller vectors (see equation (2.12)). Figure 13 shows a Matlab function implementing this "projection onto subspaces"; figures 14 and 15 show the resulting vectors C^(1) and C^(2) for our example. Due to adding up the values before calculating the correlation, the level of noise has increased compared to figure 12, but the maximum value still rises clearly above the noise.


Figure 14: Plot of the cyclic correlation C^(1) of projected model data (|X| = 1)

Figure 15: Plot of the cyclic correlation C^(2) of projected model data (|X| = 1)
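Taken together, the projection and the residue extraction correspond to the following usage sketch (the glue code is our own and assumes the function from figure 13):

% Project, correlate, and read off the residues of the best match.
Primes = [5003, 6007];
C = match_cyclic_correl_project(S, T, Sigma, Primes);
residues = zeros(1, length(Primes));
for i = 1 : length(Primes)
    % The correlation is real up to numerical noise, so compare real parts.
    [val, pos] = max(real(C{i}));
    residues(i) = pos - 1;   % zero-based residue; expected 3997 and 2993 here
end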

The positions of the maximum values in these vectors C^(i) are (with high probability) the maximum position of the original vector C modulo each of the primes p_i. Using the residues modulo these primes, we can reconstruct the position of the maximum correlation.


In our example, the residues are 9000 mod 5003 = 3997 (see figure 14) and 9000 mod 6007 = 2993 (see figure 15). The reconstruction following the Chinese Remainder Theorem is done as follows (see also [7] page 194f), ∀ i ∈ {1, ..., k}:

x \equiv a_i \pmod{p_i}: \quad x \equiv 3997 \pmod{5003}, \quad x \equiv 2993 \pmod{6007} \qquad (2.13)

M = \prod_{i=1}^{k} p_i: \quad M = 5003 \cdot 6007 = 30053021 \qquad (2.14)

M_i = M / p_i: \quad M_1 = 30053021 / 5003 = 6007, \quad M_2 = 30053021 / 6007 = 5003 \qquad (2.15)

N_i M_i \equiv 1 \pmod{p_i}: \quad N_1 \cdot 6007 \equiv 1 \pmod{5003} \text{ has solution } N_1 = -294, \quad N_2 \cdot 5003 \equiv 1 \pmod{6007} \text{ has solution } N_2 = 353 \qquad (2.16)

Finally, the underlying value x can be calculated:

x \equiv a_1 N_1 M_1 + \dots + a_k N_k M_k \pmod{M} \qquad (2.17)
x \equiv 3997 \cdot (-294) \cdot 6007 + 2993 \cdot 353 \cdot 5003 \pmod{30053021}
  \equiv -1773119239 \pmod{30053021}
  \equiv 9000 \pmod{30053021}

While most of the steps are fairly straightforward, the solution of equation (2.16) requires some work. Rearranging it leads to a form which can be solved more easily:

N_i M_i \equiv 1 \pmod{p_i}
\Leftrightarrow N_i M_i = 1 - r \cdot p_i, \quad r \in \mathbb{Z}
\Leftrightarrow N_i M_i + r \cdot p_i = 1 \qquad (2.18)

Since gcd(M_i, p_i) = 1 (by definition of M_i and p_i),²⁷ we can apply the extended Euclidean algorithm (see [6] on pages 859-860) to calculate N_i.

²⁷ gcd: greatest common divisor


Figure 16 shows a Matlab function which performs the reconstruction of an integer according to the Chinese Remainder Theorem.

function [x] = solve_crt(Primes, Residues)
num_primes = length(Primes);
prime_prod = prod(Primes(1 : num_primes));
for i = 1 : num_primes
    M_i = prime_prod/Primes(i);
    % Use extended Euclidean algorithm
    [g, r, N_i] = gcd(Primes(i), M_i);
    % g = r * Primes(i) + N_i * M_i
    % with g = 1 because Primes(i) and M_i are coprime.
    NM{i} = N_i * M_i;
end
x = 0;
for i = 1 : num_primes
    x = x + Residues(i) * NM{i};
end
x = mod(x, prime_prod);

Figure 16: Chinese Remainder Theorem: Matlab function which calculates the solution
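For the worked example above, the function can be used as follows (usage sketch):

% Reconstruct the position 9000 from its residues modulo 5003 and 6007.
x = solve_crt([5003, 6007], [3997, 2993])   % expected result: x = 9000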

Using this algorithm, the position x of the maximum correlation can be uniquely calculated as long as n < M. Instead of one large cyclic correlation, we now have multiple smaller cyclic correlations to calculate, which is an improvement in terms of speed, because the time spent calculating the FFT grows "faster than linear" with n. There are drawbacks which need to be kept in mind, however:

1. The level of noise increases, because we add up values in the vectors A^(i) and (possibly) B^(i). We need to make sure that, at least with high probability, the noise does not increase too much.

2. Finding the position of the maximum correlation is more complex now, but if the time gain is sufficiently large, it is worth the effort.

A Generic Approach

In our introduction of the projection, we only handled the special case that |X| = 1 and |Σ| = 2. Now we extend the previous approach to present a generic solution. For a start, we stick with |Σ| = 2, but consider varying |X|.

The case |X| = 0 does not need to be handled, because it means that zero solutions need to be found, and we are instantly finished. What needs to be covered is the general case |X| ≥ 1.

If we apply the same procedure as above, we stumble upon the fact that we cannot simply use the Chinese Remainder Theorem for the general case. Actually, we are able to reconstruct integers from their residues modulo some primes, but if we consider multiple solutions, we have multiple residues for each prime and do not know which integer they belong to.


n = 10000;
m = 256;
p = 0.95;
Sigma = uint8([0:1]);
X = [7700, 8050, 9000];
[S, T] = construct(n, m, p, Sigma, X);
C = match_cyclic_correl(S, T, Sigma);

Figure 17: Matlab calls to construct data according to the model (|X| = 3)

As an example, consider model data with |X| = 3, generated using the Matlab calls in figure 17. The cyclic correlation C (see equation (2.7)) of this data has three spikes at positions which are elements of X (see figure 18). We need to reconstruct these three positions from their residues modulo the primes p1 and p2 as above.

Figure 18: Plot of the cyclic correlation C of model data (|X| = 3)

Each of the cyclic correlations C(i) of the projected data (calculated as in equation (2.12)) also has three spikes (see figures 19 and 20). Thus, when extracting the positions of the three largest values from each of the vectors C(i), we have three residues modulo each prime. Unfortunately, however, it is not apparent which of the residues modulo one prime "belongs to" a specific residue modulo another prime to reconstruct one of the results.

Intuitively, one might consider simply calculating the result x using the Chinese Remainder Theorem for all combinations of residues and checking each time whether it is a valid value (i.e. whether x ≤ n − m according to the model). This actually works fairly well, given M ≫ n − m, such that it is possible to drop invalid combinations.
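To make this concrete, the following lines sketch the brute-force reconstruction for the two-prime case. The vectors R1 and R2 (our own names, not taken from [1]) are assumed to hold the spike positions extracted from C(1) and C(2), and solve_crt is the function from figure 16:

% Assumptions: Primes = [5003, 6007]; R1, R2 contain the extracted
% residues (spike positions) of C{1} and C{2}; n and m as in the model.
candidates = [];
for a1 = R1
    for a2 = R2
        x = solve_crt(Primes, [a1, a2]);
        if (x <= n - m)          % keep only results in the valid range
            candidates = [candidates, x];
        end
    end
end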


Figure 19: Plot of the cyclic correlation C(1) of projected model data (|X| = 3)

Figure 20: Plot of the cyclic correlation C(2) of projected model data (|X| = 3)


The residues and corresponding solutions for our example are shown in table 2 (with a1 being the position in C(1), and a2 being the position in C(2)). The valid results, i.e. those which are ≤ 9744, are marked with an asterisk. As shown in the table, we have successfully reconstructed the elements of X; all other combinations of residues lead to values well out of range.

Table 2: Residues and solutions for p1 = 5003, p2 = 6007, X = {7700, 8050, 9000}

a1      a2      x (mod 30053021)
2697    1693    7700 *
2697    2043    17067930
2697    2993    11854804
3047    1693    13000841
3047    2043    8050 *
3047    2993    24847945
3997    1693    18214917
3997    2043    5222126
3997    2993    9000 *

Theorem 1.1 in [1] on page 10 establishes a lower bound on the probability that this reconstruction will lead only to the correct results.28 The problem is formulated slightly differently in this theorem. It starts with a set of candidates for the solution, namely {0, . . . , n−1} (assuming m = 1, the minimum reasonable value). For each prime, only those values of this set which are elements of one of the residue classes (of the actual results modulo the prime) are accepted (see [1] on page 10):

X̂ = {0, . . . , n−1} ∩ (X + p1Z) ∩ ··· ∩ (X + pkZ)    (2.19)

The "filtering" by intersecting for each prime is basically the same as trying all combinations and removing invalid results. The set X̂ will always contain the correct results (because all combinations of residues are considered), but it might also contain additional values. A lower probability bound on the condition X̂ = X according to [1] is

1 − n · ( t·log(n)·log(L) / L )^k    (2.20)

with L ∈ R, L ≥ 5 specifying the interval [L, L(1 + 2/log(L))) from which the primes p1, . . . , pk are randomly selected, and t = |X| according to our model definition. This probability bound is used by Colin Percival for further proofs, which is why theorem 1.1 is the very foundation of the algorithm proposed in [1]. Still, what is missing in [1] is a critical analysis of this theorem. We provide this, together with suggestions for improvement, in section 4.1.

For now we choose input values such that the lower probability bound in equation (2.20) is near 1. Hence, we can expect that the set X is properly reconstructed most of the time. To be able to choose input values accordingly, we need to select primes from the specified interval. Figure 21 shows a Matlab function which performs this task.

28 This theorem depends on pi being prime and not just coprime ∀i ∈ {1, . . . , k}, which is why we used prime numbers from the start.


Please note that this function actually creates a random permutation of primes instead of selecting them "uniformly at random" (as described in [1]), which makes it behave slightly differently. This is explained in section 4.1.1.

function [P] = select_primes(L, k)
% Check selection predicate.
if (not (L >= 5 && k >= 1))
    error('Invalid input for prime selection.');
end
% Round upper bound.
primes_upper_bound = L * (1 + 2 / log(L));
max_prime = fix(primes_upper_bound);
% Make sure that values are < upper bound.
if (max_prime == primes_upper_bound)
    max_prime = max_prime - 1;
end
% Retrieve subset of the set of primes.
% Use a permutation to prevent double primes.
primes_set = setdiff(primes(max_prime), primes(L - 1));
primes_set = intrlv(primes_set, randperm(length(primes_set)));
P = primes_set(1:k);

Figure 21: Matlab function to randomly select primes for the reconstruction of X
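A short usage example follows; the value of L is chosen only for illustration and is not one of the derived bounds:

% Select two random primes from the interval [5000, 5000*(1 + 2/log(5000))).
Primes = select_primes(5000, 2)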

Now we have provided a solution which covers the general case |X| ≥ 1. The solution has the drawback of restricting some of the input values, an issue which we will revisit in section 2.4.2.

To provide a fully generic approach, we still need to cover the case |Σ| ≥ 2. However, this is only a small problem: We recall the fact that |Σ| > 2 will cause φ(x) to be a lossy conversion from characters to signal values. This means that when comparing the signal values of different characters, they might turn out to be equal. However, as we discussed in section 2.3.2, the larger |Σ|, the less likely are randomly good matches of T in S (if all other factors are constant). These two effects compensate each other. While the lossy character-to-signal conversion increases the level of the noise in which to find good matches, the reduced probability of random matches decreases the noise. Figure 22 shows vector C constructed as in figure 17 but with Σ = {0, . . . , 65535}. Compared to figure 18 we do not observe any difference except for the influence of the randomly constructed input values. Actually, there is a slight difference in detail: Since φ is a "randomly created" function, there is a certain chance that we create an "unlucky" φ for our specific input values. This issue will be handled in the next section.

Comparing C to the match count vector V (see equation (2.1) on page 7) with varying |Σ| reveals another issue: For larger |Σ|, the level of noise in V is reduced, while the level of noise in C remains constant. This should be kept in mind for applications where C is meant to be used as a direct replacement for V.


Figure 22: Plot of the cyclic correlation C of model data (|X| = 3, |Σ| = 65536)

2.3.5 Optimising Random Behaviour

As we mentioned above, for |Σ| > 2 the conversion performed by φ is lossy. It can randomly happen, though, that φ is defined (see equation (2.9)) such that it performs an "unlucky" conversion of specific input strings. By "unlucky" we mean that the resulting C has a spike at a position where there is no good match, because φ converts different characters to the same signal values. Figure 23 shows an example of an unluckily defined φ in the context of certain input strings.

Σ = {a, b, c, d}
X = {1}
T = "abbd"
S = "dabbd cbabc caddc"   (written in groups of five characters; the substring "abbd" starting at position 1 ∈ X is a true match of T, the second group contains no match of T)

φ(x) = 1 if x = a or x = b, −1 otherwise

⇒ B = (1, 1, 1, −1, 0, 0, . . .)
⇒ A = (−1, 1, 1, 1, −1,  −1, 1, 1, 1, −1,  −1, 1, −1, −1, −1)

The first group of five values in A corresponds to the true match at position 1 ∈ X; the second group shows the same pattern although position 6 is not an element of X.

Figure 23: Construction example: At position 6 in S a non-existing match of T is found


For specific input strings this problem can often be solved by choosing a different ("lucky") φ. In figure 24, φ is redefined, which leads to the correct result.

φ(x) = 1 if x = a or x = c, −1 otherwise

⇒ B = (1, −1, −1, −1, 0, 0, . . .)
⇒ A = (−1, 1, −1, −1, −1,  1, −1, 1, −1, 1,  1, 1, −1, −1, 1)

With this φ, only the first group of five values in A (corresponding to position 1 ∈ X) reproduces the pattern of B.

Figure 24: Construction according to figure 23 with a different φ

However, there is no general solution which works for all input strings as long as φ is a lossy conversion. Given φ, it is always possible to maliciously construct input strings such that a good match of T in S is found where there is none. One tempting way to reduce the probability of finding "false matches" is to analyse the input strings and define φ purposefully instead of at random. This possibility is discussed later in section 4.2.2.

Nevertheless, there is something else we can do: In the equations (2.10) and (2.11), the same φ is used to calculate the vectors A(i) and B(i) for all i ∈ {1, . . . , k}. This means that if a false match occurs, it will occur in all vectors C(i), at positions modulo the respective primes. If we use a different φi for each i to calculate the vectors A(i) and B(i), false matches are still possible, but most likely at different positions for each prime, and therefore likely to be filtered out as invalid results (as in table 2). Extending our previous equations (2.9), (2.10) and (2.11) we now have (see also [1] on page 12):

∀i ∈ {1, . . . , k}:

    Choose Σi ⊂ Σ with |Σi| = |Σ|/2 uniformly at random.

    φi(x) = (−1)^|Σi ∩ {x}|    (2.21)

∀i ∈ {1, . . . , k}; ∀j ∈ {0, . . . , pi − 1}:

    A(i)_j = Σ_{λ=0}^{⌈(n−j)/pi⌉−1} φi(S_{j+λ·pi})    (2.22)

    B(i)_j = Σ_{λ=0}^{⌈(m−j)/pi⌉−1} φi(T_{j+λ·pi})    (2.23)

For further usage, figure 25 shows a Matlab function performing the creation of the φi and the corresponding projection.


function [A, B] = project_onto_subspaces(S, T, Primes, Sigma)
n = length(S);
m = length(T);
alphabet_size = length(Sigma);
sigma_base = double(Sigma(1));
k = length(Primes);
% phi maps half of Sigma to -1, the other half to 1.
tmp_phi = ones(1, alphabet_size, 'single');
for j = 1 : alphabet_size/2
    tmp_phi(j) = single(-1);
end
phi = zeros(k, alphabet_size, 'single');
for i = 1 : k
    % Use a random mapping.
    phi(i, :) = intrlv(tmp_phi, randperm(alphabet_size));
end
% Perform projection onto subspaces of prime dimensions.
for i = 1 : k
    A{i} = zeros(1, Primes(i), 'single');
    B{i} = zeros(1, Primes(i), 'single');
    for j = 0 : Primes(i)-1
        tmpA = single(0);
        tmpB = single(0);
        for lambda = 0 : ceil((n-j)/Primes(i))-1
            tmpA = tmpA + phi(i, double(S(j+lambda*Primes(i)+1)) ...
                - sigma_base + 1);
        end
        for lambda = 0 : ceil((m-j)/Primes(i))-1
            tmpB = tmpB + phi(i, double(T(j+lambda*Primes(i)+1)) ...
                - sigma_base + 1);
        end
        A{i}(j+1) = tmpA;
        B{i}(j+1) = tmpB;
    end
end

Figure 25: Matlab function to project data with varying φi(x)

2.4 The Algorithm

2.4.1 Reference Version

In the following sections, we present different variants of the algorithm to estimate X according to the model. Figure 26 shows an algorithm to be used as a base for later comparisons and tests. This algorithm is not specified in [1]; we have created it by strongly simplifying the first algorithm presented there. It implements only the optimisation using the FFT (see section 2.3.3) and does not reduce the data size. To be able to compare the run time of the different algorithms, calls to the Matlab functions tic and toc are added at the beginning and the end of the function, respectively. tic starts the timer, and when calling toc the elapsed time is displayed.

The Matlab implementation of this algorithm uses a few functions which have not yet been introduced:


check_model_predicates  This function checks input predicates according to our formal model (see section 2.3.1), and aborts if they are not met. It is shown in figure 27.

match_cyclic_correl_fft  This function computes the cyclic correlation C as in figure 8 by using the FFT as in figure 9 (see page 16). For the sake of clarity, this function is provided in the appendix.

pos_of_largest_val  Given two parameters C and t, this function retrieves the positions of the t largest values in C.29 This is beyond the scope of this thesis and is therefore only presented in the appendix.30
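The appendix version is not reproduced here, but a minimal stand-in with the same interface can be sketched as follows; this is our own simplified version based on sorting, not the O(tL) implementation mentioned in footnote 30:

function [Positions] = pos_of_largest_val_sketch(C, t)
% Return the (one-based) positions of the t largest values in C.
[sorted_vals, sorted_idx] = sort(C, 'descend');
Positions = sorted_idx(1 : t);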

function [Xest] = algorithm_simple(S, T, Sigma, t)
tic
n = length(S);
m = length(T);
% Check input predicates. Use placeholder if variable not needed.
check_model_predicates(n, m, 0.9, Sigma, t, 0.1);
% Calculate the cyclic correlation using the FFT.
C = match_cyclic_correl_fft(S, T, Sigma);
Xest = pos_of_largest_val(C, t)-1;
toc

Figure 26: Simple Matlab function to perform “matching with mismatches”

function check_model_predicates(n, m, p, Sigma, t, epsilon)
if (not (length(Sigma) == length(unique(Sigma)) && issorted(Sigma)))
    error('Sigma is not a sorted set');
end
if (not (m < n && 0 < epsilon && 0 < p && mod(length(Sigma), 2) == 0 ...
        && Sigma(1) >= 0))
    error('Invalid input value.');
end
if (not (t > 0))
    error('Nothing to do.');
end

Figure 27: Matlab function to check input values according to the model

29 The returned positions are Matlab indices, which is why an additional decrement is performed to get zero-based positions.

30 Actually, our implementation of the function has a run time of O(tL), assuming that the length of C is L. Using an algorithm based on a priority queue (see e.g. [6] on page 194), a run time of O(L) can be achieved, but this would add a lot of complexity.


Description

After checking its input values according to the model restrictions, this algorithm basically performs all the steps described in section 2.3.3 on page 13. It converts the strings S and T to discrete signals A and B through use of the φ function and computes the cyclic correlation C of these signals. The set X is then estimated by extracting the positions of the t largest values of the cyclic correlation.

Example

Figure 28 shows an example application of the algorithm. Note that the resulting elements in Xest might be in a different order than in the supplied X. This, however, is assumed not to be a problem.31

n = 102400;
m = 512;
p = 0.9;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_simple(S, T, Sigma, length(X))

Figure 28: Matlab code which demonstrates the usage of our reference algorithm

Review

This algorithm has one great benefit: Its run time relies mostly on the FFT; other than that only simple processing is done. Matlab uses the FFTW library [22] for fft/ifft calls, which has been heavily optimised and is very fast even for large input sizes. There are several drawbacks, however: Even FFTW does not calculate the FFT faster than O(n log2(n)), and memory usage is very high. Also, false matches according to section 2.3.5 on page 28 cannot be filtered.

2.4.2 The First Variant

Based on our previous analysis, we are about to present the first version of the algorithm for matching with mismatches according to [1]. Only two minor issues remain to be solved:

1. We need to deduce restrictions on input values from the probability bound in equation (2.20). These restrictions should be based on ε, where 1 − ε is the probability of correctly reconstructing X (according to our model). If ε is near 0, a high chance to succeed is guaranteed, and thus the limits on the input values are more strict. If ε is near 1, the restrictions on the input values are more relaxed.

31 If it is a problem, the resulting set can be sorted.


2. In section 2.3.3 we have emphasized the fact that vector C does not generally contain match counts, and therefore we cannot easily process C to extract all positions with at least 50% matches. This is even more true of the vectors C(i), because some values have been added up. Extracting only the t largest values, as we proposed for C, has the drawback that one of these largest values might be a "false match" according to section 2.3.5, with the effect that we possibly miss one of the t results. Consequently, we still need to specify how to extract spikes from the vectors C(i).

For both issues we accept the solutions given in [1].

1. The input value constraints are specified as follows (see [1] on page 11):

16·log(4n/ε)/p² < m < min( √(32nε)/(t·log(n)), 8(√n + 1)·log(4n/ε)/p² )    (2.24)

The number of primes k and the minimum size of the primes L are calculated from the input values and are therefore indirectly restricted (see [1] on page 12):

k = ⌈ log(2n/ε) / (log(8n) − log(m·t·log(n))) ⌉    (2.25)

L = 8n·log(2kn/ε) / (m·p² − 8·log(2kn/ε))    (2.26)

According to [1] on page 28 the "restrictions placed upon the input parameters, and the values assigned to L, have naturally erred on the side of caution". This means that they have been chosen such that the proofs concerning the algorithm are "successful", but apparently some of them are also the result of trial and error to prevent certain border cases. For now we accept these constraints, but later we will analyse how restrictive they are in applications.

2. For each i ∈ {1, . . . , k}, we will process the vector C(i) by finding all positions j with C(i)_j > mp/2 (according to [1] on page 12). Please note, however, that this does not clearly define the number of character matches we require; it is simply a bound that tends to work reasonably well with our model.

With these supplements, we finally present a Matlab function implementing "Algorithm 1.1" of [1] (on pages 11-12). It is shown in figure 29. All previous results are used in this implementation, and additionally one new helper function is introduced:

cartesian_prod  Given a cell array32 and the number of dimensions, this function calculates the Cartesian product and returns it as a cell array of vectors. The implementation is beyond the scope of this thesis and is provided in the appendix. On a side note, the function has been optimised for two dimensions because this is the most frequent case.

32 A cell array is a Matlab array with dynamic size and content.
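The appendix implementation of cartesian_prod is not reproduced here. For the common two-dimensional case, a minimal sketch of such a helper (our own, under the assumption that exactly two residue sets are combined) could look as follows:

function [Tuples] = cartesian_prod_sketch(Sets, dims)
% Cartesian product of the first two entries of the cell array Sets,
% returned as a cell array of row vectors (sketch for dims == 2 only).
if (dims ~= 2)
    error('This sketch only handles two dimensions.');
end
Tuples = {};
idx = 1;
for a = Sets{1}
    for b = Sets{2}
        Tuples{idx} = [a, b];
        idx = idx + 1;
    end
end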


function [Xest] = algorithm_11(S, T, p, Sigma, t, epsilon, pedantic)
tic
n = length(S);
m = length(T);
% Check input predicates.
check_model_predicates(n, m, p, Sigma, t, epsilon);
% Check additional predicates for the algorithm (if pedantic is nonzero).
if (pedantic) && (not ((16*log(4*n/epsilon))/p^2 < m && ...
        min(sqrt(32*n*epsilon)/(t*log(n)), ...
        (8*(sqrt(n)+1)*log(4*n/epsilon))/p^2) > m))
    error('Invalid m for this algorithm.');
end
% Initialization.
k = ceil(log(2*n/epsilon)/(log(8*n)-log(m*t*log(n))))
L = (8*n*log(2*k*n/epsilon))/(m*p^2-8*log(2*k*n/epsilon))
khat = ceil(log(n)/log(L))
Xest = [];
% Randomly select primes.
Primes = select_primes(L, k)
% Reduce data size by projecting onto subspaces.
[A, B] = project_onto_subspaces(S, T, Primes, Sigma);
% Calculate the cyclic correlation using the FFT.
for i = 1 : k
    C{i} = ifft(fft(A{i}) .* conj(fft(B{i})));
end
% Extract positions of spikes.
for i = 1 : k
    X_residue{i} = find(C{i} > (m*p)/2) - 1;
end
% Calculate all khat tuples using a helper function.
x_tupel = cartesian_prod(X_residue, khat);
% Estimate X by applying the Chinese Remainder Theorem to all tuples.
for t = 1 : size(x_tupel, 2)
    x = solve_crt(Primes(1:khat), x_tupel{t});
    % Additional filtering of invalid values
    xvalid = logical(1);
    for i = khat+1 : k
        if (not (ismember(mod(x, Primes(i)), X_residue{i})))
            xvalid = logical(0);
            break;
        end
    end
    if (xvalid && x <= n - m)
        Xest = union(Xest, x);
    end
end
toc

Figure 29: Matlab function implementing “Algorithm 1.1” of [1]

Description

The algorithm starts by checking its input values, first against the limitations of the model and second against the specific constraints in equation (2.24). It then initialises k and L according to equations (2.25) and (2.26), and also sets a helper variable k̂ (khat) whose sole use is speed optimisation. It selects k primes as shown in figure 21 on page 27 and then projects the input


strings onto subspaces according to section 2.3.5, figure 25 on page 30. Afterwards, the cyclic correlations C(i) of the vectors A(i) and B(i) are calculated using the FFT as in section 2.3.3, figure 9 on page 16. Positions in C(i) with values larger than mp/2 are considered good matches and are thus extracted, and the Chinese Remainder Theorem is applied, combining the extracted positions modulo one prime with the positions modulo each other prime (similar to table 2 on page 26).33 Results which are within the valid range are accepted, forming the estimated set X.34

Example

Figure 30 shows an example application of this algorithm similar to the "reference example" in figure 28. It turns out that the input restrictions in equation (2.24) are not met by these input values, although algorithm 1.1 does produce the correct solution with high probability (as can be verified numerically). Thus the last parameter "pedantic" is set to 0 for this example in order to disable the checking of the input value constraints on m.35 Figure 31 shows a different example where the conditions on m are met, and pedantic is set to 1.

n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_11(S, T, p, Sigma, length(X), epsilon, 0)

Figure 30: Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=0)

n = 1000000;
m = 325;
p = 0.9;
epsilon = 0.7;
Sigma = uint8([0:255]);
X = [999601]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_11(S, T, p, Sigma, length(X), epsilon, 1)

Figure 31: Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=1)

Review

The first thing we note is that ε is just a probability bound, and one can achieve correct results with high numerical probability even if ε is near one. Similarly, the input constraints (equation

33 The exact implementation is slightly different, because only k̂ (khat) primes are used for the reconstruction, and the result is checked against the remaining primes. However, this is only an optimisation; the effect is the same.
34 Following the model, we only accept values x ≤ n − m (instead of x < n as in [1]).
35 This should be done with caution because L might turn out to be negative, especially for small m.


(2.24)) are not irrevocable: we have to comply with them if we wish to use ε as a proven probability bound,36 but often we also achieve good results when ignoring them.

If we provide input values in the context of executable file comparisons, it will be quite hard to stick to the input constraints. To achieve a good encoding of the file differences, we need the correct result of the matching with high probability, so we choose ε = 0.1. We are mainly interested in matches with only a few mismatches, and for that reason set p = 0.9. Further, we assume that we wish to find the two best matches (t = 2), to allow for some flexibility. Given these basic input parameters, n needs to be very large for it to be possible to meet the input conditions. Figure 32 shows the upper and lower limits of m with these input values for varying n. We observe that n needs to be roughly 8 · 10^7 simply to be able to select a valid m, and even then we are restricted to m ≈ 430. Using a block size m of several kilobytes is only possible for huge values of n, way beyond the usual size of executable files. This also means that the run time of the algorithm given in [1] (on page 13) is not valid for file comparisons,37 because it relies (by definition) on

m ≫ 16·log(4n/ε)/p²    (2.27)

i.e. m is required to be a lot larger than its lower limit.

Figure 32: Limits of m in algorithm 1.1 (ε = 0.1, p = 0.9, t = 2)
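The curves of figure 32 can be reproduced with a few lines of Matlab. The following sketch is our own and simply evaluates the two bounds of equation (2.24) over a range of n:

epsilon = 0.1; p = 0.9; t = 2;
n = logspace(5, 9, 200);                 % range of n values to evaluate
lower_limit = 16*log(4*n/epsilon)/p^2;
upper_limit = min(sqrt(32*n*epsilon)./(t*log(n)), ...
                  8*(sqrt(n)+1).*log(4*n/epsilon)/p^2);
semilogx(n, lower_limit, n, upper_limit);
legend('lower limit of m', 'upper limit of m');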

There is even more to say about the input constraints: For our second Matlab example (see figure 31) we selected the relatively small (given the restrictions) n = 10^6 and chose m = 325 to be near its lower limit. In this example the input leads to L ≈ 8.9686 · 10^5, and while the condition L < n is true as observed in [1] (page 13), the interval [L, L(1 + 2/log(L))) ≈

36 Note that ε only guarantees a certain probability of success given random input according to our model.
37 At least not for today's usual file sizes.


[8.9686 · 10^5, 1.0277 · 10^6) from which the primes are randomly chosen actually exceeds n. Therefore, it can happen that primes larger than n are selected, which is inefficient in terms of time and memory, and can even lead to incorrect results, because A(i) is zero-padded for the corresponding i. Even if false results are eventually caught, this is still an undesirable situation. Given the calculation of L as it is, the limits on m are probably not strict enough.

Thinking in terms of L instead of considering the interval [L, L(1 + 2/log(L))) also seems to be a problem in the proofs of [1]. It specifically makes the proof of the algorithm's time bound appear questionable: The size of the cyclic correlation is assumed to be L (see [1] on page 13), but in fact it can be larger, and therefore the size is not L but in the worst case L(1 + 2/log(L)). Possibly this difference can be ignored for "huge" values of n, but it is nevertheless an assumption which should at least have been justified if used in a proof. Actually, the run time equation has the precondition n ≫ 1 (see [1] on page 13), implying asymptotic behaviour, but this underlines the fact that the time bound cannot be applied to our application. We have to expect a slower run time, so further improvement of the algorithm is necessary.

Last but not least, given how the set X is estimated in this algorithm (by extracting many results and filtering those out of range), we have no guarantee of retrieving t results. We can get any number of results. For some applications this might be a desired effect, but when matching files we usually wish to find the t best matches.

2.4.3 The Second Variant

In the first version of the algorithm, we chose primes for the projection according to theorem 1.1 of [1]. If we use smaller primes and still achieve the desired result, we will reduce the size of the corresponding vectors as well as the processing time. However, using smaller primes will result in a higher level of background noise, and random spikes can occur more frequently. They can even rise as high as the matches we wish to find.

In section 2.3.4 we used primes around n/2. For this version, we construct model data using primes around n/10 in order to further reduce the size of the data. The corresponding Matlab calls are shown in figure 33. This example is similar to figure 17 on page 24; it simply specifies a larger n.

n = 50000;
m = 256;
p = 0.95;
Sigma = uint8([0:1]);
X = [7700, 8050, 9000]
Primes = [5003, 6007];
[S, T] = construct(n, m, p, Sigma, X);
C = match_cyclic_correl_project(S, T, Sigma, Primes);

Figure 33: Matlab calls to construct data using primes around n/10

Figures 34 and 35 show the resulting cyclic correlations. As expected, there is a considerably higher level of random noise compared to the previous example (figures 19 and 20).


For C(1), we expect spikes at positions 2697, 3047, 3997. Those actually are present, but we also observe random spikes, e.g. roughly at positions 1500 and 3100. The expected spikes of C(2) at positions 1693, 2043, 2993 are also present, but there are additional spikes, e.g. roughly at positions 100 and 2800. However, it is very unlikely that these spikes occur in all projections at positions which reconstruct a valid result. Therefore, these spikes are "filtered" during reconstruction, because they lead to results > n − m (with high probability). This is basically the same as in section 2.3.5, where "false matches" are removed.

Figure 34: Plot of the cyclic correlation C(1) of data using primes around n/10

Theorem 1.3 in [1] on page 15 addresses this issue from a mathematical point of view. It extends theorem 1.1 of [1] by considering additional random elements for each of the intersections. Assuming that these elements are selected randomly in sets Y(i) ⊂ {0, . . . , n−1} − X for all i ∈ {1, . . . , k},38 this looks as follows:39

X̂ = {0, . . . , n−1} ∩ ((X ∪ Y(1)) + p1Z) ∩ ··· ∩ ((X ∪ Y(k)) + pkZ)    (2.28)

38 This definition has been simplified; for more information see [1].
39 This has been slightly corrected from [1].


Figure 35: Plot of the cyclic correlation C(2) of data using primes around n/10

The probability bound in equation (2.20) is correspondingly extended by the probability β ∈ [0, 1) of one random element "falling into each of the k sets"40. The new bound is specified as follows (see [1] on page 15):

1 − n · ( β + t·log(n)·log(L)/L )^k    (2.29)

In principle, we can use the same procedure as in the first variant of the algorithm but with smaller primes, and have a slightly smaller probability of succeeding. To stay within proven bounds, however, we revisit the input value restrictions and the processing of C(i) from the previous section:

1. The input value constraints should now be deduced from the new probability bound in equation (2.29). Unfortunately, this is not trivial, because it involves establishing propositions about β. Theorem 1.4 in [1] on page 17 performs this task, basically giving guidance on how "high" the spikes of the valid results still need to rise above the noise. The resulting restriction according to [1] on page 19 is (again chosen such that the proof is successful):

m < min( (n²ε/2)^(1/3) / (t·(log(n))²), √(nε/2) / (8p²) )    (2.30)

40 Cited from [1] on page 15. The definition of β given there as "the probability [...] of y falling into each of the k sets" is very fuzzy, since y is undefined.


The number of primes k, the probability β and the minimum size of the primes L are derived values and thus indirectly restricted (see [1] on page 20):41

k = ⌈ log(2n/ε) / (log(n) − log(m·t·log(n)²)) ⌉    (2.31)

β = (1/2) · (ε/(2n))^(1/k)    (2.32)

y = ( √(−log(β)) + √(log(4kt/ε)) )²    (2.33)

L = 2ny / (m·p² − 2y)    (2.34)

We will analyse later how restrictive the limit on m actually is for applications.

2. To solve the problem of algorithm 1.1 that we are not guaranteed to get t results, we can start by extracting the t largest values in the vectors C(i) for each i ∈ {1, . . . , k} and perform the reconstruction. However, as mentioned before, if for any reason we have a random spike rising above one of the results, we will miss some results and might end up with fewer than t values (because invalid results are removed). Given that the number of "additional spikes" in C(i) according to our new approach is expected to be βpi for each i ∈ {1, . . . , k} (see [1] on page 15), we can assume that, in the worst case, all of the additional spikes rise above our actual results. Therefore, we are on the safe side if we extract the βpi + t largest values.

Now that these issues are solved, we present a Matlab function implementing "Algorithm 1.2" of [1] (on pages 19-20). It is shown in figure 36.

Description

At first, this algorithm checks its input values against the limitations of the model as well as against the constraints given in equation (2.30). Next, it initialises k, β and L according to equations (2.31), (2.32) and (2.34). It selects k primes and then projects the input strings onto subspaces. In the next step, the cyclic correlations C(i) of the vectors A(i) and B(i) are calculated using the FFT. The positions of the βpi + t largest values in C(i) are considered as candidates for good matches and are thus extracted, and the Chinese Remainder Theorem is applied, combining the extracted positions modulo one prime with the positions modulo each other prime (similar to table 2 on page 26). Those results that are within the valid range are accepted, and form the estimated set X.

Example

Unfortunately, the Matlab application of Algorithm 1.2 according to the reference example does not run. Figure 37 shows the corresponding calls, ignoring the limit on m in equation (2.30),

41 In equation (2.33) we choose the name y instead of x to avoid a naming conflict.


function [Xest] = algorithm_12(S, T, p, Sigma, t, epsilon, pedantic)
tic
n = length(S);
m = length(T);
% Check input predicates.
check_model_predicates(n, m, p, Sigma, t, epsilon);
% Check additional predicates for the algorithm (if pedantic is nonzero).
if (pedantic) && (not (m < min((((n^2*epsilon)/2)^(1/3))/(t*(log(n)^2)), ...
        sqrt((n*epsilon)/2)/(8*p^2))))
    error('Invalid m for this algorithm.');
end
% Initialization.
k = ceil(log(2*n/epsilon)/(log(n)-log(m*t*log(n)^2)))
beta = (1/2)*((epsilon/(2*n))^(1/k))
y = (sqrt(-log(beta))+sqrt(log((4*k*t)/epsilon)))^2
L = (2*n*y)/(m*p^2-2*y)
khat = ceil(log(n)/log(L))
Xest = [];
% Randomly select primes.
Primes = select_primes(L, k)
% Reduce data size by projecting onto subspaces.
[A, B] = project_onto_subspaces(S, T, Primes, Sigma); pack;
% Calculate the cyclic correlation using the FFT.
for i = 1 : k
    C{i} = ifft(fft(A{i}) .* conj(fft(B{i}))); pack;
end
% Extract positions of spikes.
for i = 1 : k
    X_residue{i} = pos_of_largest_val(C{i}, ceil(beta * Primes(i)) + t)-1;
    %X_residue{i} = find(C{i} > (m*p)/2) - 1;
end
% Calculate all khat tuples using a helper function.
x_tupel = cartesian_prod(X_residue, khat);
% Estimate X by applying the Chinese Remainder Theorem to all tuples.
for t = 1 : size(x_tupel, 2)
    x = solve_crt(Primes(1:khat), x_tupel{t});
    % Additional filtering of invalid values
    xvalid = logical(1);
    for i = khat+1 : k
        if (not (ismember(mod(x, Primes(i)), X_residue{i})))
            xvalid = logical(0);
            break;
        end
    end
    if (xvalid && x <= n - m)
        Xest = union(Xest, x);
    end
end
toc

Figure 36: Matlab function implementing “Algorithm 1.2” of [1]

i.e. using 0 for the parameter "pedantic". However, the algorithm fails, because k turns out to be negative. This points to one major weakness of the algorithm: For our use either n is required to be huge, or m needs to be very, very small. In contrast to Algorithm 1.1, which tends to work well outside the specified limits of m, Algorithm 1.2 seems to strongly depend on


the limitation. Even if we choose m such that k is barely not negative (still ignoring the limit), k is unnecessarily large (e.g. around 20), which in turn has a very negative impact on the run time of the algorithm.

If the limit is respected, however, the algorithm can be run.42 Figure 38 shows an example with pedantic set to 1.

n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_12(S, T, p, Sigma, length(X), epsilon, 0)

Figure 37: Matlab code which tries to use algorithm 1.2 but fails (pedantic=0)

n = 1000000;
m = 90;
p = 0.9;
epsilon = 0.7;
Sigma = uint8([0:255]);
X = [999601]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_12(S, T, p, Sigma, length(X), epsilon, 1)

Figure 38: Matlab code which demonstrates the usage of algorithm 1.2 (pedantic=1)

Review

As mentioned above, the upper limit on m is very strict and cannot be set aside. When analysing it for varying n in the same context as before (i.e. with ε = 0.1, p = 0.9, t = 2), we observe that only small block sizes are allowed with this algorithm. Figure 39 shows the corresponding plot.

However, even if the limit on m is met, the run time and memory usage of the Matlab implementation of algorithm 1.2 are unfortunately rather inappropriate.43 The main reason for this is that βpi + t easily turns out to be several thousand. Thus, thousands of largest values are extracted from the vectors C(i). Since solving according to the Chinese Remainder Theorem is not optimised in Matlab, calculating the solutions for all combinations is a very slow process and clearly the bottleneck of our implementation. Colin Percival shows in [1] on page 22 that the seemingly quadratic run time O((βL + t)²) of this part of the algorithm is actually not quadratic, by definition of β and L. However, even in theory this is questionable: It again relies

42 Please note that this example might take several hours to run, and Matlab might run out of memory due to the large Cartesian product.
43 Specifically, either Matlab runs out of memory or we have given up waiting for results after 8 hours.


Figure 39: Limit of m in algorithm 1.2 (ε = 0.1, p = 0.9, t = 2)

on the assumption that primes of size L are used, while their worst-case size is roughly L(1 + 2/log(L)). The corresponding difference should not be set aside without further comment, especially because it appears in the context of a square. If we use a completely different way to reconstruct X, one which does not tend to have a quadratic run time, this problem will be solved.

2.4.4 The Third Variant

In the third version of the algorithm, the strict limits on m are relaxed, and the size of the primes is reduced even further. This is done by applying a different theory, namely that of Bayesian analysis. As a result, the method to reconstruct X is also changed, with the following background: Instead of calculating all possible solutions according to the Chinese Remainder Theorem, the elements of the vectors C(i) can be added up for all positions up to n modulo the corresponding prime. The result is a single vector F with a length of n:

∀j ∈ {0, . . . , n−1}:

    F_j = Σ_{i=1}^{k} C(i)_{j mod pi}    (2.35)

Spikes which existed e.g. in C(1) at positions modulo p1 and in C(2) at corresponding positions modulo p2 are added up and will lead to spikes in F. Thus, F is actually an approximation of C, and equation (2.35) can be seen as an "inverse projection", because it restores the spikes at the positions where they would have been without the projection. Further processing of F can then be done like the processing of C in our reference algorithm.
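A minimal Matlab sketch of this inverse projection, assuming the cyclic correlations are available in a cell array C (as in the previous figures) together with the vector Primes and the length n of S, is:

% Evaluate equation (2.35): add up the projected correlations.
F = zeros(1, n);
for i = 1 : length(Primes)
    idx = mod(0 : n-1, Primes(i)) + 1;   % map positions 0..n-1 into C{i}
    F = F + C{i}(idx);
end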

While this is intuitively a reasonable result, the approximation according to the Bayesian analysis is a bit different (see [1] on page 24):


∀j ∈ {0, . . . , n−1}:

    F′_j = Σ_{i=1}^{k} ( C(i)_{j mod pi} − mp/2 ) / σ_pi(n, m, j)    (2.36)

with σ being the standard deviation defined in [1] on page 13 as

σ_pi(n, m, j) = |{(x, y) ∈ Z×Z : 0 ≤ x < n, 0 ≤ y < m, x ≡ y + j (mod pi)}|    (2.37)

Using vector F′ instead of F for further processing leads to problems, however:

1. F′ depends on p, and if p cannot be predicted or at least estimated, the results will be falsified (see also [1] on page 26). This is a serious problem, because in an application which is comparing two files, one cannot tell in the first place "how well" these files will match.

2. Using equation (2.36) to calculate F′ will not lead to correct results for maliciously formed X (see [1] on page 24). This is more a theoretical problem, because these X are very unlikely to occur in real applications, but it is still a drawback.

While the best way to solve the first problem is to use F instead of F′, the second problem can be dealt with by performing further processing of C(i) "for some appropriate δ"44 (see [1] on page 25):

∀i ∈ {1, . . . , k}; ∀j ∈ {0, . . . , pi − 1}:

    exp(−D(i)_j) = √δ + exp( −mp·(C(i)_j − mp/2) / (2·σ_pi(n, m, j)) )

    ⇔ D(i)_j = −log( √δ + exp( −mp·(C(i)_j − mp/2) / (2·σ_pi(n, m, j)) ) )    (2.38)

These vectors D(i) are then added up:

∀j ∈ {0, . . . , n−1}:

    F″_j = Σ_{i=1}^{k} D(i)_{j mod pi}    (2.39)

In order to numerically show "what happens" during the calculation of the vectors D(i), we define

    x = mp·(C(i)_j − mp/2) / (2·σ_pi(n, m, j))

to be used as a variable, set δ = 2 and plot y = −log(√δ + e^(−x)). The result is shown in figure 40.

Interpreting this plot, we observe that x > 0 if and only if C(i)_j > mp/2, due to the restrictions of the model. This means that, according to the plot, spikes in the vectors C(i) larger than mp/2 are truncated.

44 This is cited from [1] on page 24 to make it clear that δ is a very abstract value.


Figure 40: Plot of y = −log(√δ + e^(−x)) with δ = 2

function [D] = filter_C(n, m, p, t, C)
% delta corresponds to the value defined later in equation (2.41).
delta = (t*m*p^2*log(n))/n
size = length(C);
D = zeros(1, size, 'single');
for j = 1 : size
    D(j) = -log(sqrt(delta)+exp(-(m*p*(C(j)-m*p/2))/(2*2*n*m/size)));
end

Figure 41: Matlab function to calculate vector D

Since y ≈ x for all x ≤ 0, all other values are relatively maintained by this function. Figure 41 shows a Matlab function which calculates a vector D from C, using the definition of δ as specified later in equation (2.41) and an approximation from [1] on page 28:

σ_pi(n, m, j) ≈ nm/pi    (2.40)

Applying this function to a vector C calculated according to figure 17 on page 24, we clearly see that all values are relatively maintained, but the spikes are truncated. Figures 42 and 43 show the vectors C and D, respectively.45

Based on these numerical observations, it is questionable why Colin Percival states in [1] on page 24, while deriving the algorithm, that "D(i)_j = max(C(i)_j, δ)". This is in contrast to the definition of D(i) in equation (2.38)46, which, to the best of our knowledge, leads to a truncation

45 Note that the second figure has been scaled, but the ratio was maintained.
46 which was cited from [1]


Figure 42: Plot of the raw cyclic correlation C

Figure 43: Plot of the filtered cyclic correlation D

of the spikes in C(i), thereby showing the behaviour of a "min" function.47 Due to this discrepancy,

47 One might argue that truncating large values and increasing small values, respectively, are basically the same, but this is questionable. Increasing all values to a certain minimum level would remove some of the random noise.


and to avoid too much dependency on p, we decide to stick to our initial approach and calculate F in our implementation as specified in equation (2.35).

What remains to be discussed is, as in the previous variants of the algorithm, the restrictions on input values and the further processing to estimate X:

1. No input restrictions are specified in [1] for this algorithm, so only the restrictions of the model apply. The number of primes k and the minimum size of the primes L are calculated as follows (see [1] on page 24):

δ = t·m·p²·log(n) / n    (2.41)

k = ⌈ log(nt/ε) / log(1/(4δ)) ⌉    (2.42)

L = −8n·log( (ε/(nt))^(1/(2k)) − √δ ) / ( m·p² + 8·log( (ε/(nt))^(1/(2k)) − √δ ) )    (2.43)

However, these definitions lead to implicit input restrictions, because negative values of k and L are not applicable. For example, n = 10000, m = 300, t = 3, ε = 0.1 leads to k = −12. We observe that this happens if 4δ > 1, since in that case log(1/(4δ)) < 0. The case 4δ = 1 should also be prevented, as it leads to a division by zero. Therefore, we require 4δ < 1 and derive an input restriction ourselves:48

4δ < 1
⇔ 4·t·m·p²·log(n)/n < 1
⇔ 4·t·m·p²·log(n) < n
⇔ m < n / (4·t·p²·log(n))    (2.44)

This input restriction helps to avoid invalid input values, though we do not claim that it covers all cases.

2. As already mentioned, the processing of F can be done like the processing of C in section 2.4.1 on page 30: We simply find the positions of the t largest values of F, which then form the estimated X.

We now present a Matlab function implementing "Algorithm 1.3" of [1] (on pages 24-25) with the changes described in this section. It is shown in figure 44 on the following page.

Description

The first step of this algorithm is, as in the other variants, to check its input values against the limitations of the model and against the specific constraints we defined in equation (2.44).

48 In addition to the model definitions, we assume n > 1 and t > 0 to prevent division by zero.


function [Xest] = algorithm_13(S, T, p, Sigma, t, epsilon, pedantic)
tic
% Retrieve lengths.
n = length(S);
m = length(T);
% Check input predicates.
check_model_predicates(n, m, p, Sigma, t, epsilon);
% Check additional predicates for the algorithm (if pedantic is nonzero).
if (pedantic) && (not (n > 1 && n/(4*t*p^2*log(n)) > m))
    error('Invalid m for this algorithm.');
end
% Initialization.
delta = (t*m*p^2*log(n))/n
k = ceil(log((n*t)/epsilon)/(log(1/(4*delta))))
L = (-8*n*log((epsilon/(n*t))^(1/(2*k))-sqrt(delta))) ...
    /(m*p^2+8*log((epsilon/(n*t))^(1/(2*k))-sqrt(delta)))
Xest = [];
% Randomly select primes.
Primes = select_primes(L, k)
% Reduce data size by projecting onto subspaces.
[A, B] = project_onto_subspaces(S, T, Primes, Sigma); pack;
% Calculate the cyclic correlation using the FFT.
for i = 1 : k
    C{i} = ifft(fft(A{i}) .* conj(fft(B{i}))); pack;
end
% Estimate X.
F = zeros(1, n, 'single');
for j = 0 : n - 1
    tmpSum = single(0);
    for i = 1 : k
        tmpSum = tmpSum + C{i}(mod(j, Primes(i)) + 1);
    end
    F(j + 1) = tmpSum;
end
Xest = pos_of_largest_val(F, t)-1;
toc

Figure 44: Matlab function implementing “Algorithm 1.3” of [1]

Afterwards, it initialises δ, k and L according to equations (2.41), (2.42) and (2.43). It selects k primes and then projects the input strings onto subspaces. Then the cyclic correlations C(i) of the vectors A(i) and B(i) are calculated using the FFT. In the next step, the vector F is calculated as in equation (2.35) of this section. The set X is estimated by extracting the positions of the t largest values of F.

Example

Using algorithm 1.3 is fairly straightforward, since the limit on m is not too restrictive. Figure 45 shows an example application of this algorithm similar to the "reference example" in figure 28, with pedantic set to 1.


n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_13(S, T, p, Sigma, length(X), epsilon, 1)

Figure 45: Matlab code which demonstrates the usage of algorithm 1.3 (pedantic=1)

Review

The plot of the upper limit of m given ε = 0.1, p = 0.9, t = 2 is shown in figure 46. The limit is very generous, and we can easily select small as well as large block sizes. Consequently, this variant of the algorithm is the best candidate to be used for delta compression of executable code.

Figure 46: Limit on m in algorithm 1.3 (ε = 0.1, p = 0.9, t = 2)

We also note that calculating F does not tend to have a quadratic run time, and is therefore preferable to calculating all combinations of solutions according to the Chinese Remainder Theorem. In theory, it is not even necessary to calculate the whole vector F, as only the largest values of the sum need to be considered (see [1] on page 24).


2.4.5 Comparison

In order to compare the different variants of the algorithm with regard to speed and probability of success, we run them several times with the same input data and check whether they are successful, i.e. whether X is estimated correctly. All tests are performed on a Windows XP system with an Intel Core2 CPU at 2.1 GHz and 1 GB RAM, using Matlab version 7.1.49

Table 3 shows the input values and the corresponding test results of 100 runs, given Σ = {0, . . . , 255}. We skip algorithm 1.2 entirely, because it fails due to input restrictions.50

Table 3: Test results using n = 102400, m = 512, p = 0.9, X = {34,39411,101410}

Algorithm    Average Run Time    Probability of Success
Reference    0.03s               1.0
1.1          2.32s               1.0
1.2          (not applicable)    (not applicable)
1.3          16.88s              1.0
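A test loop along the following lines can reproduce such a comparison. This wrapper is our own sketch; it assumes that n, m, p, Sigma and X are set to the values given above, and the algorithm call can be exchanged for the other variants:

runs = 100;
success = 0;
for r = 1 : runs
    [S, T] = construct(n, m, p, Sigma, X);
    Xest = algorithm_simple(S, T, Sigma, length(X));
    % Count a run as successful if the estimated set equals X.
    if (isequal(sort(Xest(:)), sort(X(:))))
        success = success + 1;
    end
end
probability_of_success = success / runs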

These results are characteristic for large input values with the precondition n ≫ m and m not being too small in the context of matching blocks of executable code. Given data according to the model, the numeric probability of success is usually about 1. Small input values lead to problems with regard to the input restrictions.

Comparing the run times clearly shows that the calculation of the FFT is heavily optimised in Matlab. The more complexity is added outside of the FFT, the slower the run time becomes. This is especially bad for algorithm 1.3, since we chose it for use in delta compression of executable code. Therefore, we intend to write a Matlab-independent implementation of this algorithm.

49 The tests run considerably slower if GNU Octave is used instead.
50 Including a test according to the restrictions of algorithm 1.2 does not help either, because a single run then takes several hours.


Chapter 3

Implementation

In this chapter, we present a C++ implementation of algorithm 1.3 as described in section 2.4.4 on page 43. The implementation aims at overcoming the drawbacks of the Matlab functions concerning memory usage, speed and reusability, to allow further use of this algorithm, for example in a patch tool.

3.1 Considerations

3.1.1 Portability

When planning to implement the algorithm, we need to consider portability from the start, as it severely affects the choice of libraries and the basic technical design of the implementation. In order to achieve portability across a wide range of operating systems, we decide to use CMake [24] as "build environment". Thus, we are able to support compiling the program on Windows and Linux systems as well as BSD-based systems like Mac OS X. This also enables the use of different compilers on the respective systems.

In order to accomplish portability of the implementation itself, we avoid using system-specific calls as much as possible and instead use standard libraries provided by the compiler, or portable third-party libraries.

3.1.2 Environment

Although we wish to optimise for speed, we do not claim to write optimal code for all subproblems. We do not wish to reinvent the wheel; instead, we intend to reuse widespread and optimised algorithms. While interpreted environments like Microsoft .NET offer various algorithms, the programs running in this context are usually considerably slower than corresponding native applications. This is why we choose C++ as the programming language. C++ is based on C and offers low-level functions, but additionally it provides the "Standard Template Library" (STL)51. Its "<algorithm>" header supplies various stable and optimised algorithms

51 The STL is specified in [17]; for documentation see e.g. [18].


like priority queues, which can easily be used. Table 4 shows a list of libraries which we utilise to implement the algorithm for matching with mismatches.

Table 4: Base libraries for the C++ implementation

Library              Functions/modules to use              License
FFTW [22]            Fast Fourier Transform calculation    GPL 2 or higher
STL [17]             Vectors, iterators and algorithms     (redistributable in binary form)
Standard C Lib [16]  File operations, math functions       (redistributable in binary form)

We have to be aware of the fact that this adds dependencies to our implementation of the algorithm. Possible tools based on it will also have these dependencies. However, given the complexity of the algorithm, we think that these dependencies are justified. They will only be present on the side of patch generation; a tool to apply the patch can perform its task without requiring Fast Fourier Transforms and priority queues.52

Finally, we need to make sure that the third-party libraries we are using integrate well enough with our implementation, both in terms of license and technology.

3.1.3 Library Integration

In order to efficiently use the FFTW library, the vectors supplied for the calculation of the Fast Fourier Transform need to be aligned in memory on a 16 byte boundary (see [23] on page 15). With this being the case, FFTW makes use of SIMD53 instructions, if they are available on the target system. Using these instructions tremendously improves the actual run time of the FFT, and therefore we should regard this boundary when allocating memory for our vectors.

FFTW provides the function fftw_malloc to allocate memory on a 16 byte boundary. However, this is in conflict with the valarray template class54 of the C++ STL, which is the usual vector type for efficient mathematical calculations. This type automatically allocates memory and does not guarantee that its data is aligned on a 16 byte boundary. Unfortunately, valarray does not support redefining the memory allocation. This leaves us two options: Either we use the vector template class of the STL and define a so-called "custom allocator", or we have to implement our own vector type.

Since it is hard to control when exactly memory reallocations and temporary copies occur with the vector type, we decide to create our own vector template for use with this algorithm.

Apart from that, the libraries integrate smoothly; for example, we are able to use the C++ type complex with FFTW without problems.

52 The tool applying the patch might still depend on libraries to uncompress the data, but this is beyond the scope of this thesis.

53 Single Instruction Multiple Data
54 A C++ template class is a class which can be used for different base types, e.g. a generic vector for real and complex value types.


3.2 Implementation

3.2.1 Structure

We start the implementation with the Matlab code of algorithm 1.3 in mind (see figure 44 on page 48), but also with regard to delta compression of executable code. Some functions used in Matlab are not available in C/C++, so we have to implement them on our own. Table 5 shows the core modules55 and the corresponding source files of our implementation.

Table 5: Core modules of the C++ implementation of the algorithm

Module                Functions                              Source File
Vector type           Math functions, resizing, copying      rntypes.h
Selection of primes   Select and return a list of primes     prime_numbers.cpp
Core algorithm        Calculate index, perform matching      matching_with_mismatches.cpp
File input            Read input data from files             rndiff.cpp
Post processing       Merge adjacent match results           rndiff.cpp

In order to implement a patch tool for delta compression of executable files based on these core modules, further functions are required. These are listed in table 6. Furthermore, a file format needs to be defined. The implementation of these additional modules is beyond the scope of this thesis.

Table 6: Additional modules needed for delta compression

Module                    Functions
Boundary processing       Check and adjust boundaries of the results (see [1] on page 35)
Encoding of differences   Encode the actual differences (see [1] on pages 38-41)

3.2.2 Vector Type

The main reason for implementing a custom vector type is that we need to use the memory allocation and deallocation functions of FFTW. In addition to that, the vector type offers functions for copying, resizing, applying mathematical functions (e.g. finding the maximum value) and allows direct access to its elements. The vector is implemented as a template class, thereby allowing use with integers as well as real and complex types.

3.2.3 Selection of Primes

We need to select primes as in the Matlab function in figure 21 on page 27. Generally, this can be done by implementing a function to calculate the primes "on the fly" using the Sieve of Eratosthenes (see e.g. [7] on pages 29-30). However, that would unnecessarily cost execution time. Instead, we set a limit on the size of input files, i.e. we disallow files larger than 2 GB,

55 A module is a pool of functions which form a logical unit.

53

Page 62: Delta Compression of Executable Code - Analysis, Implementation and Application-Specific Improvements

3.2. IMPLEMENTATION CHAPTER 3. IMPLEMENTATION

and pre-calculate all primes which we possibly need for the valid input range. These primesare stored in a static array directly in the source code. Thus, choosing primes is very efficientregardless of the interval.

In order to select primes from a given interval, we first find the boundaries of that interval within our array of pre-calculated primes using the STL algorithm lower_bound. Next, we create a random permutation of all primes within these boundaries by utilising the STL algorithm random_shuffle, and then return the first k of these primes as a vector of integers.
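A minimal sketch of this selection step is given below. The table precomputed_primes merely stands in for the much larger static prime array mentioned above, and the function name and signature are assumptions made for illustration.

#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for the static, sorted array of pre-calculated primes; the actual
// table in prime_numbers.cpp covers the whole valid input range.
static const int precomputed_primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37 };
static const int num_precomputed = sizeof(precomputed_primes) / sizeof(precomputed_primes[0]);

// Return up to k randomly chosen, pairwise distinct primes from [low, high).
// std::random_shuffle is used as in the text (pre-C++17 STL).
std::vector<int> select_random_primes(int low, int high, std::size_t k)
{
    // Find the sub-range of the sorted prime table that lies within [low, high).
    const int* first = std::lower_bound(precomputed_primes, precomputed_primes + num_precomputed, low);
    const int* last  = std::lower_bound(first, precomputed_primes + num_precomputed, high);

    // Randomly permute the candidates and keep the first k of them.
    std::vector<int> candidates(first, last);
    std::random_shuffle(candidates.begin(), candidates.end());
    if (candidates.size() > k)
        candidates.resize(k);
    return candidates;
}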

3.2.4 Core Algorithm

In the implementation of the core algorithm, all values which are independent of the specific characters of T are pre-calculated. These values form the index of the algorithm, as described in section 2.3.1 on page 10. The index can be used to efficiently match all blocks of one file with another file, since the block T will change while the file S remains constant.

The elements of the index are shown in table 7.

Table 7: Elements of the pre-calculated index of the implementation

Element                       Description
k                             The number of primes; k = 2 is constant (see [1] on page 34).
L                             The minimum size of the primes, calculated as L = 4√(n log(n)) (see [1] on page 34).
p_i ∀i ∈ {1, . . . , k}       The primes are selected using the module described in the previous section.
φ_i ∀i ∈ {1, . . . , k}       The functions φ_i are randomly created for the alphabet Σ = {0, 1, . . . , 255} with the help of the STL function random_shuffle.
A(i) ∀i ∈ {1, . . . , k}      The vectors A(i) are calculated from S according to equation (2.22) on page 29.
FFTW "plans"                  FFTW is initialised to perform measurements and set up the most efficient way to calculate the FFT on the target system. The results of this step are so-called "plans" which tell FFTW how to execute the FFT for vectors of specific lengths (see also [23] on pages 22-23).

After computing the index, the actual matching of a specific block T with S can be performed. It consists of the following steps:

1. The vectors B(i) are calculated from T according to equation (2.23) on page 29.

2. The cyclic correlations C(i) are computed by use of FFTW functions (specifically, the single-precision real-to-complex FFT) with the plans provided by the index.


3. Using the STL priority_queue, the t largest elements of the sum

   ∑_{i=1}^{k} C(i)_{j mod p_i}

   for all j ∈ {0, . . . , n−1} are calculated (see also equation (2.35) on page 43). In order to save memory, we do not compute the vector F. Instead, we maintain a list of the positions of the t largest values while calculating the above sum (see the sketch below).

4. The resulting positions are returned as a vector of integers.

These steps can be repeated for different blocks T without recreating the index. However, the block size m needs to be kept constant.
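The following is a minimal C++ sketch of step 3, assuming that C[i] holds the cyclic correlation of length p[i]; names and signatures are illustrative and not taken from matching_with_mismatches.cpp.

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Find the positions of the t largest values of F_j = sum_i C[i][j mod p[i]]
// without materialising the vector F.
std::vector<std::size_t> largest_positions(const std::vector<std::vector<float> >& C,
                                           const std::vector<std::size_t>& p,
                                           std::size_t n, std::size_t t)
{
    // Min-heap keyed on the value, so the smallest of the current "top t"
    // candidates sits on top and can be evicted cheaply.
    typedef std::pair<float, std::size_t> Entry;  // (value, position)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;

    for (std::size_t j = 0; j < n; ++j) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < p.size(); ++i)
            sum += C[i][j % p[i]];

        if (heap.size() < t)
            heap.push(Entry(sum, j));
        else if (t > 0 && sum > heap.top().first) {
            heap.pop();
            heap.push(Entry(sum, j));
        }
    }

    // Extract the surviving positions (in no particular order).
    std::vector<std::size_t> result;
    while (!heap.empty()) {
        result.push_back(heap.top().second);
        heap.pop();
    }
    return result;
}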

3.2.5 File Input

The Matlab applications of the algorithm are all based on randomly generated input according to the model. In contrast to that, the C++ implementation provides core modules specifically for delta compression of executable code. Hence, we need to be able to read input data from executable files. This module implements the corresponding functions:

1. First, the size of the two input files is determined.

2. Next, the original file, which is assumed to be present on the target system, is read as the string S, with n being the size of that file.

3. Given S, n and the block size m = √(n log(n)) (according to [1] on page 34), the index is created as described in the previous section.

4. The new file, including e.g. a security fix, is read in blocks of length m, and each block is matched with the original file. For each block, only the best match is considered (t = 1); see the sketch below.

The result is a mapping of positions in the new file to corresponding positions of similar blocks in the original file.
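A hedged sketch of this procedure is shown below; the match_block parameter stands in for the core-algorithm module (index creation is assumed to happen inside it), and the function itself is illustrative rather than the actual rndiff.cpp code.

#include <cmath>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Mirrors the four steps above: read both files, derive the block size
// m = sqrt(n log n), and match every block of the new file against the old one.
// Error handling is omitted in this sketch.
template <typename Matcher>
std::vector<std::pair<std::size_t, std::size_t> >
match_files(const std::string& old_path, const std::string& new_path, Matcher match_block)
{
    std::vector<std::pair<std::size_t, std::size_t> > mapping;

    // Steps 1 and 2: read both files; S is the original file of size n.
    std::ifstream old_in(old_path.c_str(), std::ios::binary);
    std::ifstream new_in(new_path.c_str(), std::ios::binary);
    std::string S((std::istreambuf_iterator<char>(old_in)), std::istreambuf_iterator<char>());
    std::string N((std::istreambuf_iterator<char>(new_in)), std::istreambuf_iterator<char>());
    std::size_t n = S.size();
    if (n < 2)
        return mapping;

    // Step 3: block size m = sqrt(n log(n)) according to [1]; the index over S
    // is assumed to be built by the matcher on first use.
    std::size_t m = static_cast<std::size_t>(std::sqrt(double(n) * std::log(double(n))));

    // Step 4: match each block of the new file; only the best match is kept (t = 1).
    for (std::size_t pos = 0; m > 0 && pos + m <= N.size(); pos += m)
        mapping.push_back(std::make_pair(pos, match_block(S, N.substr(pos, m))));
    return mapping;
}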

3.2.6 Post Processing

Based on the resulting mapping from the previous step, we need to perform processing such that this mapping can be efficiently used by a tool for delta compression of executable code. Since the two executable files which are compared are assumed to be similar, we are very likely to encounter matching (with mismatches) regions larger than the block size m. Therefore, we merge adjacent blocks in the new file which are mapped to corresponding adjacent blocks in the original file into larger blocks. If the two files were equal, this would result in the whole new file being mapped as one large block to the whole original file.57

57 Given that our algorithm correctly identifies all matches.
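The following sketch illustrates this merging step. The Region structure and the function name are hypothetical, and the mapping is assumed to list block matches in increasing order of their position in the new file, spaced m bytes apart.

#include <cstddef>
#include <utility>
#include <vector>

struct Region {
    std::size_t new_pos;   // start of the region in the new file
    std::size_t old_pos;   // start of the matched region in the original file
    std::size_t length;    // length of the region in bytes
};

// Fuse consecutive blocks of the new file whose matches are also consecutive
// in the original file into larger regions.
std::vector<Region> merge_adjacent(
    const std::vector<std::pair<std::size_t, std::size_t> >& mapping, std::size_t m)
{
    std::vector<Region> merged;
    for (std::size_t i = 0; i < mapping.size(); ++i) {
        if (!merged.empty()) {
            Region& last = merged.back();
            // The current block extends the previous region if it starts right
            // after it in the new file and its match starts right after the
            // matched region in the original file.
            if (mapping[i].first == last.new_pos + last.length &&
                mapping[i].second == last.old_pos + last.length) {
                last.length += m;
                continue;
            }
        }
        Region r = { mapping[i].first, mapping[i].second, m };
        merged.push_back(r);
    }
    return merged;
}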


3.3 Further Steps

We name the C++ implementation of the algorithm “rndiff”. It is available at

http://sourceforge.net/projects/rndiff

First tests of the implementation show that, for similar input values, its run time is slower than that of the Matlab reference algorithm presented in section 2.4.1 on page 30. This is due to the fact that neither the projection (specifically the calculation of the vectors B(i)) nor the computation of the sum to reconstruct the results is optimised in the C++ implementation. If the run time turns out to be too slow for specific applications, further effort will have to be put into optimising these parts.

We also note that applying "rndiff" to two equal files usually does not result in a mapping consisting of one block, as one would expect. This is due to the fact that the bytes of executable files are in general not uniformly distributed, an issue which will be addressed in the next chapter.


Chapter 4

Improvements

Small improvements and extensions concerning the algorithm were already integrated directly in the corresponding sections where appropriate. In this chapter, additional improvements which are beyond the scope of other parts of this thesis are presented. Theoretical changes are considered as well as improvements concerning the practical usage of the algorithm, with special regard to reproducibility of results.

4.1 Theorem 1.1

Although the general idea of the third variant of the algorithm (see section 2.4.4 on page 43) is based on theorem 1.1 of [1], it no longer uses the corresponding probability bound. This suggests that the probability bound is not tight enough to be used for a projection with small primes. This section deals with this problem.

4.1.1 Selecting Primes Randomly

The first thing we note about theorem 1.1 is that the primes are selected "uniformly at random" from the available set of primes (see [1] on page 10). Thus, the same prime can be selected twice or more. Since invalid results are "filtered" for each prime, this leads to "less filtering". In other words, once we know that two of the selected primes are equal, the situation is effectively the same as if we had one prime less, and the probability bound decreases accordingly.

However, if we create a random permutation of the set of available primes and select the first k primes of this permutation, we prevent selecting the same prime more than once. Especially for small values of L this is a large improvement.

4.1.2 Numerical Probability

Next, we take a closer look at the lower probability bound of theorem 1.1. We numerically show that there are cases where the reconstruction works with high numerical probability, while the corresponding probability bound is actually negative, i.e. below the possible range of probabilities. In order to do this, we provide a Matlab function implementing the "filtering by intersecting" as in theorem 1.1. The function is shown in figure 47.

function [Xest] = reconstruct_X(n, L, k, X)
if (not (sum(ismember(X, [0:n-1])) == length(X)))
    error('Invalid estimation value.');
end
Primes = select_primes(L, k)
% We cannot use Z as whole set numerically.
% Instead, we use simple bounds.
Xest = [0:n-1];
Z = [-(n-1):n-1];
for i = 1 : k
    pZ = Primes(i) * Z;
    XpZ = [];
    for x = X
        XpZ = union(XpZ, x + pZ);
    end
    Xest = intersect(Xest, XpZ);
end

Figure 47: Matlab function to reconstruct X as in theorem 1.1 of [1]

We run several numeric tests, repeating each test 1000 times. Only exact reconstructions count as successes. Table 8 shows the input values, the numerical probability of correct reconstruction, and the corresponding lower bound according to theorem 1.1 (see equation (2.20) on page 26).

Table 8: Numeric tests on probability of reconstructing X

Input Values                                               Numerical p    Lower Bound
n = 100000, L = 1000, k = 2, X = {99578, 99913, 99967}     0.57           −5691.3
n = 100000, L = 1000, k = 2, X = {10, 1000, 80000}         0.69           −5691.3
n = 10000, L = 1000, k = 2, X = {9578, 9913, 9967}         0.94           −363.31
n = 10000, L = 1000, k = 2, X = {10, 1000, 8000}           0.98           −363.31
n = 5000, L = 1000, k = 2, X = {4578, 4913, 4967}          0.97           −154.77
n = 5000, L = 1000, k = 2, X = {10, 1000, 4000}            1              −154.77
n = 5000, L = 1000, k = 6, X = {4578, 4913, 4967}          1              0.85
n = 5000, L = 1000, k = 6, X = {10, 1000, 4000}            1              0.85


The results suggest that the lower probability bound is very pessimistic. This has several reasons:

• Colin Percival states in the proof of theorem 1.1 (see [1] on pages 10-11) that the product

  ∏_{j=1}^{t} (y − x_j)    (4.1)

  with 0 ≤ y ≤ n−1 and x_j ∈ X ∀ j ∈ {1, . . . , t} is bounded from above by n^t. This is true, but it is an overly pessimistic bound; it cannot be reached even for t = 1 in combination with worst-case values:

  x_1 = 0, y = n−1  ⇒  ∏_{j=1}^{t} (y − x_j) = n−1 < n^1

  Thus, a tighter (but less trivial) upper bound of the product in equation (4.1) is:

  ∏_{j=1}^{t} (n − j) = (n−1)! / (n−t−1)!    (4.2)

  By definition of X in the model (see 2.3.1 on page 10), "X = {x_1, . . . , x_t} ⊂ {0, . . . , n−m} with x_i ≤ x_{i+1} − m for 1 ≤ i < t", meaning that the difference between two elements of X is at least m. This leads to an even tighter upper bound of the product:

  ∏_{j=1}^{t} (n − jm)    (4.3)

• The proof of the probability bound of theorem 1.1 is based on other worst-case assumptions, and the probability that worst-case values are chosen is very small. However, the probability bound is meant to hold for all inputs, i.e. it makes no further assumptions about the context.

To conclude, we can achieve a direct improvement of theorem 1.1 if we use a random permutation of primes instead of selecting them uniformly at random. This change is already implemented in the Matlab function in figure 21 on page 27. There is further potential for improvement by using a different approach in the proof of theorem 1.1.

These results show why we observed that algorithm 1.1 works well outside its proven limits.


4.2 Derandomisation

One problem with the usage of the algorithm presented in chapter 2 is that its result can be different each time we apply it. Especially when the algorithm is used in a tool to create patches, users expect a certain reproducibility so that they trust the program.

Generally, there are two possibilities to achieve this:

1. The result is "guaranteed" by the law of large numbers, regardless of the random selection.

2. Instead of randomly choosing values, these values are calculated according to rules and/or additional input values.

The first of these options is practically impossible to achieve, given the nature of the algorithm. We would need to run the algorithm repeatedly, and even then we could only state that the same result will be given most of the time. Unfortunately, if a different result is returned only once, a user might lose trust in the product. Any application concerning security patches should output reproducible results.

This leaves the second option as the only alternative. We have to devise rules for choosing the values, as well as possible input parameters which can be used to "tweak" these rules.

4.2.1 Selecting Primes Non-Randomly

When selecting the primes from the interval [L, L(1+2/log(L))), it cannot be decided in general whether certain primes are good for use with the algorithm or not. Even for specific input, we cannot easily estimate the quality of the primes. We could, for example, run the algorithm several times with different primes and compare the results, but that would severely increase the run time.

Given that executable files often contain local repetitions of characters (e.g. due to zero padding), one option is to select primes with maximum difference in order to avoid projections which are partly similar (a sketch of such a selection rule follows below). A different option is to purposely select large or small primes in order to influence the speed or quality of the algorithm. Small primes will generally lead to faster execution, and large primes can lead to better results because fewer values are added up during the projection. However, as far as the filtering according to equation (2.19) on page 26 is concerned, no basis for a quick evaluation of the primes is available.
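As an illustration of the first option, the following sketch deterministically picks k primes spread as evenly as possible over the candidate interval. The function is hypothetical and assumes that the sorted list of candidate primes in [L, L(1+2/log(L))) has already been computed.

#include <cstddef>
#include <vector>

// Deterministic selection rule: take k primes whose indices are spread evenly
// over the sorted candidate list, so that their pairwise differences are close
// to the maximum possible. If k exceeds the number of candidates, duplicates
// would occur; a real implementation would reject such parameters.
std::vector<int> select_primes_spread(const std::vector<int>& candidates, std::size_t k)
{
    std::vector<int> chosen;
    if (candidates.empty() || k == 0)
        return chosen;
    if (k == 1) {
        chosen.push_back(candidates.front());
        return chosen;
    }
    for (std::size_t i = 0; i < k; ++i) {
        std::size_t idx = i * (candidates.size() - 1) / (k - 1);
        chosen.push_back(candidates[idx]);
    }
    return chosen;
}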

4.2.2 Creating φ Non-Randomly

Instead of creating the function φ from random sets as in equation (2.21) on page 29, we can purposefully create φ such that it is convenient for the specific input values which are present.

The random creation of φ is based on the assumption that the input characters are selected independently and uniformly from the alphabet. This is usually not the case in executable files: certain commands occur more frequently, and due to alignments and initialisation data, parts of the file are filled with zeros. Additionally, some executable files are statically linked to resources, e.g. bitmaps. Given that this depends on the type of executable file and the underlying operating system, we cannot assume a specific probability distribution of the alphabet for all executable files. But we can calculate the numeric probability distribution of a specific file in a single pass by counting the characters. The result can be used to calculate φ such that φ actually maps half of the input values to 1 and the other half to −1 for non-random input values, as was previously only the case for uniformly distributed input.

Specifically, we can create the function φ as follows: given the probability distribution, we calculate the first bit of the Huffman code (see e.g. [13] on pages 99-113) for each character of Σ. Afterwards, we map φ to −1 for all characters with 0 as the first bit of their Huffman code, and to 1 for all characters with 1 as the first bit. Due to the basic properties of the Huffman code, this usually59 leads to a fair distribution of the mapping between −1 and 1 (a sketch of this construction is given below).

If we use the original file without the security fix for the calculation, we do not even need to store the probability distribution in the patch file, because this file is present on the target system.60

59 If, for example, a certain character has a probability of occurrence larger than 0.5, a fair distribution cannot be achieved by φ. In that case, further pre-processing of the input string could be done, but this is beyond the scope of this thesis.

60 However, we should make sure that the correct file is used, for example by checking a cryptographic hash.
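A minimal sketch of this construction is given below, assuming the byte counts of the original file have already been gathered: a Huffman tree is built over the whole alphabet, and each byte is mapped to −1 or +1 depending on which subtree of the root its leaf lies in, i.e. on the first bit of its Huffman code. The function and type names are hypothetical and not part of rndiff.

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct HuffNode { unsigned long weight; int parent; };

// counts must have 256 entries (one per byte value of Sigma).
std::vector<int> make_phi_from_counts(const std::vector<unsigned long>& counts)
{
    std::vector<HuffNode> nodes;
    for (int c = 0; c < 256; ++c) {
        HuffNode leaf = { counts[c], -1 };
        nodes.push_back(leaf);
    }

    // Min-heap of (weight, node index) for the standard Huffman construction.
    typedef std::pair<unsigned long, int> Entry;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
    for (int i = 0; i < 256; ++i)
        heap.push(Entry(nodes[i].weight, i));

    // Repeatedly merge the two lightest nodes into a new internal node.
    while (heap.size() > 1) {
        Entry a = heap.top(); heap.pop();
        Entry b = heap.top(); heap.pop();
        HuffNode internal = { a.first + b.first, -1 };
        int idx = static_cast<int>(nodes.size());
        nodes.push_back(internal);
        nodes[a.second].parent = idx;
        nodes[b.second].parent = idx;
        heap.push(Entry(internal.weight, idx));
    }
    int root = heap.top().second;

    // Every leaf descends from exactly one of the two children of the root;
    // that child determines the first bit of the leaf's Huffman code.
    int first_child = -1;
    for (std::size_t i = 0; i < nodes.size(); ++i)
        if (nodes[i].parent == root) { first_child = static_cast<int>(i); break; }

    std::vector<int> phi(256);
    for (int c = 0; c < 256; ++c) {
        int node = c;
        while (nodes[node].parent != root)
            node = nodes[node].parent;
        phi[c] = (node == first_child) ? -1 : +1;
    }
    return phi;
}

Since ties in the heap are broken by the node index, the construction is deterministic for a given probability distribution, which supports the reproducibility goal of this chapter.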


Chapter 5

Conclusion

In the present thesis, we have explained the background behind the algorithm for matching with mismatches from [1], with regard to delta compression of executable code. Supported by numerical examples, we have critically analysed the different variants of the algorithm and implemented them in Matlab.

Based on the Matlab code, we have created a C++ implementation of the last variant of the algorithm, specifically optimising for speed and memory usage. This implementation can be used as a base for a tool that creates compact security patches. Further improvements with regard to reproducibility have been proposed, such that the algorithm is able to produce identical output for identical input.

First tests show that, in practice, the run time of all variants of the algorithm severely depends on an efficient implementation of its inner loops and of the sub-algorithms it uses. Since the FFTW library is heavily optimised, we have yet to find a numerical example where one of the variants of the algorithm is faster than our reference algorithm, which mostly relies on the FFT.

It seems hard to beat FFTW in terms of speed, even though the last variant of the algorithm has a run time "sublinear in n" (see [1] on page 9). The O(n log₂(n)) run time of the FFT only becomes significant for very large values of n, because FFTW keeps the constant parts of the run time very low. In contrast to that, the non-FFT elements of the algorithm are not easy to optimise. Therefore, the practical run time of these parts outweighs the theoretical advantage. Additionally, the theoretical run time of the algorithm is only asymptotic for growing values of n.

That being said, the main advantage of the algorithm, in particular of the C++ implementation, is its low memory usage. In order to achieve a speed near that of the reference algorithm for usual file sizes of a few megabytes, more effort needs to be devoted to optimisation.


Appendix

Additional Matlab Functions

Functions used to create the plot in figure 7 on page 13

function [stat] = stat_good_match_not_in_X(n, max_m, p, Sigma, X, runs)
count = [0, 0];
stat = zeros(1, max_m);
% count matches for each m up to max_m
for m = 1 : max_m
    count = sum_good_matches_not_in_X(n, m, p, Sigma, X, runs);
    stat(m) = count(1) / count(2);
end

Figure 48: Matlab function to provide numeric probability on a good match not in X

function [counter] = sum_good_matches_not_in_X(n, m, p, Sigma, X, runs)
counter = [0, 0];
for i = 1 : runs
    [S, T] = construct(n, m, p, Sigma, X);
    V = match(S, T);
    counter = counter + count_good_matches_not_in_X(n, m, X, V);
end

Figure 49: Helper function to add up good matches not in X


function [counter] = count_good_matches_not_in_X(n, m, X, V)
% Check V for "good" matches not in X.
notX = setdiff([0:n-m], X);
% Format of counter: [good_matches_not_in_X, total_count]
counter = [0, length(notX) * length(X)];
for i = notX
    for j = X
        if (not (V(i+1) < V(j+1)))
            counter(1) = counter(1) + 1;
        end
    end
end

Figure 50: Helper function to count good matches not in X

Function to calculate the cyclic correlation C using the FFT

function [C] = match_cyclic_correl_fft(S, T, Sigma)
% Retrieve lengths.
n = length(S);
m = length(T);
% Sigma should be a set and sorted. We assume it to be continuous.
alphabetSize = length(Sigma);
if (not (alphabetSize == length(unique(Sigma)) && issorted(Sigma)))
    error('Sigma is not a sorted set');
end
% Check input predicates.
if (not (m < n && mod(alphabetSize, 2) == 0))
    error('Invalid input value.');
end
Sigma_base = double(Sigma(1));
% Calculate phi.
tmp_phi = ones(1, alphabetSize, 'single');
for j = 1 : alphabetSize/2
    tmp_phi(j) = single(-1);
end
% Use a random mapping.
phi = intrlv(tmp_phi, randperm(alphabetSize));
% Convert S and T to A and B.
A = zeros(1, n, 'single');
B = zeros(1, n, 'single');
for j = 0 : n - 1
    A(j + 1) = phi(double(S(j + 1)) - Sigma_base + 1);
    if (j < m)
        B(j + 1) = phi(double(T(j + 1)) - Sigma_base + 1);
    end
end
% Calculate the cyclic correlation using the fft.
C = ifft(fft(A) .* conj(fft(B)));

Figure 51: Matlab function to compute C as in figure 8 by using the FFT as in figure 9


Function to calculate the positions of the t largest values

function [P] = pos_of_largest_val(X, n)
% The convention is not call-by-reference, so we are not modifying X.
% This function might return the same position multiple
% times, given malicious input. But it is fine for our use.
min_value = min(X);
P = zeros(1, n);
for i = 1 : n
    [a, b] = max(X);
    P(i) = b;
    X(b) = min_value;
end

Figure 52: Matlab function to simply calculate the positions of the t largest values

Functions to calculate the Cartesian product

function [out] = cartesian_prod(cellin, n)
if (not (n == 2)) % use recursive function if n != 2
    out = cartesian_prod_recursive(cellin, n);
else % use optimized function if n == 2
    len1 = length(cellin{1});
    len2 = length(cellin{2});
    out = cell(1, len1 * len2);
    for i = 1 : len1
        for j = 1 : len2
            out{j + (i - 1) * len2} = [cellin{1}(i), cellin{2}(j)];
        end
    end
end

Figure 53: Matlab function to iteratively calculate the Cartesian product (2 dimensions)

function [out] = cartesian_prod_recursive(cellin, n)
if (n == 1)
    for i = 1 : length(cellin{1})
        out{i} = cellin{1}(i);
    end
else
    tmpOut = cartesian_prod_recursive(cellin, n-1);
    k = 1;
    for i = 1 : length(cellin{n})
        for j = 1 : length(tmpOut)
            out{k} = [tmpOut{j}, cellin{n}(i)];
            k = k + 1;
        end
    end
end

Figure 54: Matlab function to recursively calculate the Cartesian product


Bibliography

[1] Colin Percival: Matching with Mismatches and Assorted Applications, D.Phil. thesis, University of Oxford, 2006.

[2] Colin Percival: Naive differences of executable code, http://www.daemonology.net/bsdiff/, 2003.

[3] Gonzalo Navarro: A Guided Tour to Approximate String Matching, ACM Computing Surveys, 33 (1): 31-88, 2001.

[4] Giovanni Motta, James Gustafson, Samson Chen: Differential Compression of Executable Code, in Proceedings of the 2007 Data Compression Conference, pages 103-112, 2007.

[5] David Salomon: Data Compression, The Complete Reference, 4th Edition, Springer 2007.

[6] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein: Introduction to Algorithms, 2nd Edition, The MIT Press 2001.

[7] Manfred R. Schroeder: Number Theory in Science and Communication, 4th Edition, Springer 2006.

[8] Athanasios Papoulis: Probability and Statistics, Prentice-Hall 1990.

[9] John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman: Introduction to Automata Theory, Languages, and Computation, 3rd Edition, Addison-Wesley 2007.

[10] Leo I. Bluestein: A Linear Filtering Approach to the Computation of Discrete Fourier Transform, IEEE Transactions on Audio and Electroacoustics, AU-18 (4): 451-455, 1970.

[11] Hari Krishna: Digital Signal Processing Algorithms: Number Theory, Convolution, Fast Fourier Transforms, and Applications, CRC Press 1998.

[12] William H. Press, Saul A. Teukolsky, William T. Vetterling: Numerical Recipes in C: The Art of Scientific Computing, 2nd Edition, Cambridge University Press 1993.

[13] André Neubauer: Informationstheorie und Quellencodierung: Eine Einführung für Ingenieure, Informatiker und Naturwissenschaftler, J. Schlembach Fachverlag 2006.

[14] André Neubauer: Kanalcodierung: Eine Einführung für Ingenieure, Informatiker und Naturwissenschaftler, J. Schlembach Fachverlag 2006.

[15] Microsoft Corporation: Using Binary Delta Compression (BDC) Technology to Update Windows Operating Systems, BDC_v2.doc, http://www.microsoft.com/downloads/details.aspx?FamilyID=4789196c-d60a-497c-ae89-101a3754bad6&displaylang=en, 2005.

[16] ISO/IEC 9899:1990(E), Programming Languages – C (ISO C90 and ANSI C89 standard), 1990.

[17] ISO/IEC 14882:1998(E), Programming Languages – C++ (ISO and ANSI C++ standard), 1998.

[18] Hewlett-Packard Company: Standard Template Library Programmer's Guide, http://www.sgi.com/tech/stl/, 1994.

[19] The Mathworks, Inc.: MATLAB, http://www.mathworks.com/, Version 7.1.

[20] John W. Eaton and others: GNU Octave, http://www.gnu.org/software/octave/, Version 3.

[21] Octave-Forge: Extra Packages for GNU Octave, http://octave.sourceforge.net/, Version 20080216.

[22] Matteo Frigo, Steven G. Johnson: FFTW, http://www.fftw.org/, Version 3.1.2.

[23] Matteo Frigo, Steven G. Johnson: FFTW Documentation, http://www.fftw.org/fftw3.pdf, for Version 3.1.2, 2006.

[24] Kitware, Inc.: CMake, http://www.cmake.org/, Version 2.6.1.
