The Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science
High Performance Deep Packet Inspection
Thesis submitted for the degree of Doctor of Philosophy
by
Yaron Koral
This work was carried out under the supervision of
Professor Yehuda Afek and Doctor Anat Bremler-Barr
Submitted to the Senate of Tel Aviv University
September 2012
-
© 2012
Copyright by Yaron Koral
All Rights Reserved
-
This work is dedicated to the pursuit of a safe and secure world.
-
Acknowledgements
First and foremost, I would like to thank my advisors, Yehuda Afek and Anat
Bremler-Barr, for their continued support and guidance throughout my Ph.D. I have
learned a lot from you whether in doing research, writing papers or giving presentations.
Above all you taught me how to walk in the world of science and think sharply.
I had the pleasure of working with the following people: David Hay, Yotam Harchol, Shimrit Tzur-David and Victor Zigdon. I thank you for your companionship and support. Working with you was both enriching and a great delight.
Last, and most importantly, I thank my family: my beloved wife Keren; my charming kids Omer, Ofri, Romi and Yarden; and my parents Akiva and Rahel for their unfailing love, encouragement and support.
The work in this thesis was partially supported by the European Research Council
under the European Union’s Seventh Framework Programme (FP7/2007-2013)/ERC
Grant agreement no 259085.
-
Abstract
Deep packet inspection (DPI) is a form of network packet filtering that searches a packet’s content, including its headers, data-protocol structures, and message payload, for the presence of certain patterns. It enables advanced network management, user service, and security functions, as well as Internet data mining, eavesdropping, and censorship. It is currently used by enterprises, service providers, and governments in a wide range of applications.
DPI may be implemented by a wide range of pattern matching algorithms. The general problem of pattern matching is considered fundamental in computer science and has been researched thoroughly over recent decades. Still, when applied to today’s network domain, the traditional algorithms fail to meet current challenges. The first challenge is the continual increase in Internet traffic rates, which requires a design that is scalable in terms of both speed and memory usage. The second challenge arises from the increase in Web traffic compression, driven by the growing popularity of Web surfing over mobile devices. The security device is forced to decompress this traffic prior to inspection, incurring processing and space penalties. The third challenge is the requirement for a solution that is resilient to attacks that overload the security device. We address these challenges here. Moreover, we apply several technological advances to boost the performance of the traditional algorithms, including, for example, the presence of Ternary Content Addressable Memory (TCAM) elements in network devices and the availability of multi-core platforms for the DPI task.
The work presented in this thesis focuses on DPI algorithms and techniques that
relate to network security elements. In Chapter 3, we provide an algorithm for a scalable
design of a DPI engine. Our design reduces the problem of pattern matching to the
well-studied problem of Longest Prefix Match (LPM), which can be solved either in
TCAM, in IP-lookup chips, or in software.
Next we deal with the challenge of DPI over compressed traffic. Chapters 4 and 5
focus on reducing the space and time penalties resulting from the compressed traffic.
These works show that, by using the meta-data generated during the compression stage,
pattern matching over compressed traffic can be accelerated significantly as compared
to traditional pattern matching over non-compressed traffic, and that the space penalty
can be reduced by a factor of six as compared to current designs. Chapter 6 introduces an algorithm for scanning traffic compressed with SDCH, the compression scheme used by Google. Our design gains a performance boost of over 40%.
Finally, we address the challenge of performing DPI when the system is under a denial-of-service attack mounted via algorithmic complexity. We provide a system design that takes advantage of commercial multi-core platforms to efficiently mitigate complexity attacks of varying intensity.
The algorithms and techniques presented in this thesis provide a suitable DPI solution that confronts today’s network challenges.
-
Contents
1 Introduction 1
1.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Overview of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 CompactDFA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 SOP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 SPC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.4 SDCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.5 MCA2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.1 DFA Compression . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.2 Compressed Web-Traffic . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.3 DPI Using Multi-Core Platforms . . . . . . . . . . . . . . . . . . 16
1.3.4 Denial-of-Service Mitigation . . . . . . . . . . . . . . . . . . . . . 17
1.4 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 Background 19
2.1 DFA based Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Compressed Web-Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Gzip Compression . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 SDCH Compression . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Complexity attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 CompactDFA 29
3.1 The CompactDFA Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 CompactDFA Output . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.2 CompactDFA Algorithm . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.3 The Aho-Corasick Algorithm-like Properties . . . . . . . . . . . 32
3.1.4 Stage I: State Grouping . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.5 Stage II: Common Suffix Tree . . . . . . . . . . . . . . . . . . . . 36
3.1.6 Stage III: State and Node Encoding . . . . . . . . . . . . . . . . 38
3.2 CompactDFA for total memory minimization . . . . . . . . . . . . . . 40
3.3 CompactDFA for DFA with strides . . . . . . . . . . . . . . . . . . . . . 41
3.4 Implementing CompactDFA using IP-lookup Solutions . . . . . . . . . . 43
3.4.1 Implementing CompactDFA with non-TCAM IP-lookup solutions 44
3.4.2 Implementing CompactDFA with TCAM . . . . . . . . . . . . . 45
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Space Efficient DPI of Compressed Web Traffic 57
4.1 SOP Packing technique . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.1 Buffer Packing: Swap Out of boundary Pointers (SOP) . . . . . . 58
4.1.2 Huffman Coding Scheme . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.3 Unpacking the Buffer: Gzip Decompression . . . . . . . . . . . . 63
4.2 Combining SOP with ACCH algorithm . . . . . . . . . . . . . . . . . . . 64
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Experimental Environment . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.3 Space and Time Results . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.4 Time Results Analysis . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.5 DPI of Compressed Traffic . . . . . . . . . . . . . . . . . . . . . . 73
5 Shift-based Pattern Matching for Compressed Traffic 75
5.1 The Modified Wu-Manber Algorithm . . . . . . . . . . . . . . . . . . . . 75
5.2 Shift-based Pattern matching for Compressed traffic (SPC) . . . . . . . 78
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.2 Pattern Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3.3 SPC Characteristics Analysis . . . . . . . . . . . . . . . . . . . . 83
5.3.4 SPC Run-Time Performance . . . . . . . . . . . . . . . . . . . . 85
5.3.5 SPC Storage Requirements . . . . . . . . . . . . . . . . . . . . . 86
6 Decompression-Free Inspection 88
6.1 Our Decompression-Free algorithm . . . . . . . . . . . . . . . . . . . . . 88
6.1.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.2 Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.3 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.1.4 Dealing with Gzip over SDCH . . . . . . . . . . . . . . . . . . . . 97
6.2 Regular Expressions Inspection . . . . . . . . . . . . . . . . . . . . . . . 97
6.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7 MCA2 104
7.1 Snort Cache-Miss Complexity Attack . . . . . . . . . . . . . . . . . . . . 104
7.2 The MCA2 System Description . . . . . . . . . . . . . . . . . . . . . . . 106
7.2.1 MCA2 Design overview . . . . . . . . . . . . . . . . . . . . . . . 106
7.2.2 Cross-Thread Communication Mechanism . . . . . . . . . . . . . 109
7.2.3 Thread Allocation Scheme . . . . . . . . . . . . . . . . . . . . . . 111
7.2.4 Flow Affinity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3 MCA2 for Cache-Miss Attacks . . . . . . . . . . . . . . . . . . . . . . . . 113
7.4 MCA2 for Active-States Attacks . . . . . . . . . . . . . . . . . . . . . . . 116
7.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.5.1 Experimental Environment . . . . . . . . . . . . . . . . . . . . . 118
7.5.2 Cache-Miss Attack Simulation Results . . . . . . . . . . . . . . . 120
7.5.3 Active-State Attack Simulation Results . . . . . . . . . . . . . . 123
8 Conclusion 124
Bibliography 125
-
List of Tables
3.1 Statistics of the pattern sets used in Section 3.5 . . . . . . . . . . . . . . 52
3.2 Summary of experimental results for Snort and ClamAV pattern sets . . 54
4.1 Comparison of Time and Space parameters of different algorithms . . . . 69
4.2 Overview of pattern matching with gzip processing . . . . . . . . . . . . 73
5.1 Storage Requirements (KB) . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1 Step by step execution of our algorithm on the example of Section 6.1.1 92
7.1 No-drop setting parameters . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2 The non-common states ratio . . . . . . . . . . . . . . . . . . . . . . . . 116
7.3 Validation of the thread allocation model of Section 7.2.3 . . . . . . . . 122
-
List of Figures
1.1 The goodput of MCA2 for different attack intensities . . . . . . . . . . . 13
2.1 Example of an Aho-Corasick DFA and methods to store it in memory . 20
2.2 LZ77 example on Yahoo! home page . . . . . . . . . . . . . . . . . . . . 24
3.1 Aho-Corasick DFA toy example . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Illustration of the intra-flow interleaving on a single packet . . . . . . . . 48
3.3 Expansion factor under Truncated CompactDFA . . . . . . . . . . . . . 53
3.4 The distribution of the values C . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 The latency of inter-flow and intra-flow interleaving . . . . . . . . . . . . 56
4.1 Sketch of the gzip 32KB memory buffer . . . . . . . . . . . . . . . . . . 58
4.2 Sketch of the memory buffer in different scenarios . . . . . . . . . . . . . 64
4.3 Illustration of common terms . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Sketch of the memory buffer including the Status Vector . . . . . . . . . 68
4.5 HTTP Compression usage among the Alexa top-site lists . . . . . . . . . 70
5.1 MWM algorithm example . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Pointer scan procedure example . . . . . . . . . . . . . . . . . . . . . . . 81
5.3 Skipped Character Ratio (Sr) . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Normalized Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1 Example of an Aho-Corasick Automaton . . . . . . . . . . . . . . . . . . 89
6.2 The depth of first three states of each failure path . . . . . . . . . . . . . 101
6.3 Comparison between the scan-ratio and compression-ratio . . . . . . . . 102
6.4 Comparison when considering also regular expression matching . . . . . 103
7.1 The effects of a cache-miss attack . . . . . . . . . . . . . . . . . . . . . . 105
7.2 Illustration of MCA2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3 Sketch of a record in the bad packet queue . . . . . . . . . . . . . . . . . 110
7.4 Distribution of cache-misses under normal traffic and under attack. . . . 114
7.5 CDF of the percentage of normal traffic packets . . . . . . . . . . . . . . 115
7.6 The total system throughput for a different number of common states . 116
7.7 CDF of the percentage of mild attack . . . . . . . . . . . . . . . . . . . . 117
7.8 Distribution of maximal average number of active states . . . . . . . . . 119
7.9 Average throughput per thread over time . . . . . . . . . . . . . . . . . . 121
7.10 Goodput of Hybrid-FA and of Hybrid-FA with MCA2 full-drop setup . . 122
-
Chapter 1
Introduction
Deep packet inspection (DPI) consists of inspecting both the packet header and payload and alerting the system when signatures of malicious software appear in the traffic. These signatures are identified through pattern matching algorithms that are classified either as string matching, in which the patterns are a set of strings, or regular expression matching, in which the patterns are defined as regular expressions. DPI is a basic element in today’s security tools, such as Network Intrusion Detection/Prevention Systems (NIDS/NIPS) or Web application firewalls, which are used to detect malicious activities. Moreover, DPI and its corresponding pattern matching algorithms are also crucial building blocks for other networking applications such as traffic monitoring and HTTP load-balancing. Today, the performance of security tools is dominated by the speed of the underlying pattern matching algorithms [49].
Both string matching and regular expression matching are fundamental problems
in computer science and have been a topic of intensive research for decades. In what
follows, we provide a brief description of the main approaches to these problems and
explain why they are not adequate for contemporary needs.
The fundamental string matching paradigm derives from the Aho-Corasick (AC) [23] algorithm. This algorithm constructs a deterministic finite automaton (DFA) for detecting all occurrences of patterns from a given set by processing the input in a single pass, performing one state transition for each input byte. An alternative is the shift-based paradigm, which includes the Boyer-Moore (BM) [32] and the modified Wu-Manber (MWM) [94] algorithms. This paradigm aims at improving average-case performance by exploiting heuristics for skipping portions of the input, and it achieves sublinear performance on average.
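To make the automaton-based paradigm concrete, the following is a minimal, illustrative Python sketch of Aho-Corasick construction and scanning. It is not the thesis implementation, and the pattern set in the test below is hypothetical; it only shows the single-pass, one-transition-per-character behavior discussed above.

```python
from collections import deque

def build_ac(patterns):
    """Build an Aho-Corasick automaton: a trie plus failure links,
    where each state records the set of patterns ending there."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    # Failure links computed breadth-first: fail(t) is the longest
    # proper suffix of t's string that is also a trie node.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][ch] if ch in goto[f] and goto[f][ch] != t else 0
            out[t] |= out[fail[t]]
    return goto, fail, out

def scan(text, goto, fail, out):
    """Single pass over `text`: one automaton step per input character."""
    matches, s = [], 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

Note that the worst-case guarantee follows directly from the scan loop: every character advances the automaton, regardless of the input content.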
Likewise, regular expression matching also has two mainstream approaches; these are based on either a deterministic or a non-deterministic finite automaton (DFA or NFA). A DFA has superior runtime complexity of O(1), and thus constant time per input symbol, as compared with an O(n²) runtime complexity for an NFA. An NFA, on the other hand, has superior space complexity of O(n) linear space, as compared with O(2ⁿ) exponential space for a DFA. Complexity is calculated as a function of n, the length of the regular expression [54].
None of the aforementioned traditional solutions is suitable for coping with today’s
requirements, due to problems with scalability, compressed traffic, and resiliency. We
detail these problems below.
Scalability It is essential to increase the speed and reduce the memory requirements
of the pattern matching solutions. As DPI is performed in the critical path of packet
processing, current solutions must handle network speeds of 10–100 Gbps. Moreover,
the solutions must deal with thousands of patterns. For example, the ClamAV [4] virus-
signature database consists of 61K patterns, and the popular Snort NIDS [9] has more
than 30K patterns. Typically, the number of patterns considered by a NIDS grows dramatically over time. The size of the pattern database prohibits the use of a fast memory such as CPU cache or SRAM; thus, the memory requirement has a direct effect on time performance. Current research has focused on reducing the memory requirement by compressing the corresponding DFA [24, 45, 85, 88, 89]; however, all proposed techniques suggest pure-hardware solutions, which usually incur prohibitive deployment and development costs.
Compressed Traffic Scalable DPI solutions should support the increasing rates of Internet traffic. One method for supporting these rates at network servers is to compress the outgoing traffic, thereby transferring data more efficiently. This method is used today for compressing HTTP text when transferring pages over the Web. The sharply increasing number of compressed Web pages is largely motivated by the increase in Web surfing over mobile devices. Sites such as Yahoo!, Google, MSN, YouTube, Facebook and others use HTTP compression to enhance the speed of their content download. For example, in February 2012, W3Techs published a ranking breakdown report [10], which shows that 44.7% of Web sites compress their traffic; when focusing on the top 1,000 sites, a remarkable 83.4% compress their traffic. HTTP 1.1 has a
standard-based method of delivering this compressed content; therefore it is supported
by all modern browsers.
The presence of compressed traffic makes Internet traffic harder to analyze by the
DPI routine because doing so requires two time-consuming phases: traffic decompression
and pattern matching. Currently, most security tools fail to analyze compressed traffic.
In some cases they simply do not scan compressed traffic, thus compromising their
original goal of detecting malicious activities. In other cases, the security tools ensure
that there is no compressed traffic by rewriting the HTTP header between the original
client and server; this solution leads to a waste of bandwidth and a higher cost per bit.
Resiliency Increased traffic rates and compressed traffic are considered legitimate Internet phenomena. Still, NIDSs and firewalls, the security tools that protect against malicious users, naturally become a favored target for illegitimate phenomena such as denial-of-service attacks. A recent trend is a two-phase combined attack on security devices: the attackers first neutralize the device, for example, by overwhelming it with traffic, and then, once it has been knocked down, attack the assets it was protecting. For example, a recent attack on Sony combined a distributed denial-of-service (DDoS) attack with credit card theft [16]. Combined attacks usually affect NIDS and NIPS differently. In a NIDS, where the device operates in stealth mode, only monitoring the traffic and issuing alerts when it detects malicious activity, these DDoS attacks may force the device to stop inspecting part, or all, of the traffic, thereby allowing another attack to pass unnoticed. An in-line NIPS, on the other hand, because it inspects packets on their critical path, might be forced to drop legitimate traffic, in practice causing a denial of service on the servers it is supposed to protect. Bro and Snort, for example, are both vulnerable to this kind of attack [65].
1.1 Method
The research presented in this thesis draws on multiple methods from different disciplines. A major effort was invested in finding new algorithms and design approaches for DPI. We analyze the performance of the algorithms both in the average case, with normal routine traffic, and in the worst case, with traffic from malicious users. The normal traffic is obtained either from prepared data traces, such as the DARPA MIT traces in [5], or from traces collected in our simulation environment using live Internet traffic. The
worst-case traffic traces are usually synthesized with respect to the attack under consideration. Since the source code and pattern sets of most of today’s security tools are publicly available, an adversary may devise a tailored attack accordingly. Therefore, we consider both adaptive adversaries, which can observe the actions of our proposed scheme and respond accordingly in real time, and oblivious adversaries, which cannot.
In addition, we verify the performance of our proposed schemes with simulations and experiments. This is especially crucial for heuristics that have no theoretical performance guarantees. More specifically, we implement all our software solutions and test their performance using both synthetic and real pattern sets, including patterns and regular expressions from contemporary security applications such as Snort [9], ModSecurity [6], Bro [2] and ClamAV [4]. Our simulation environment includes multi-core platforms with different cache architectures, to test their influence on our proposed algorithms.
1.2 Overview of Results
The following subsections briefly describe five research results presented in this thesis.
1.2.1 CompactDFA: Scalable Pattern Matching using Longest Prefix
Match Solutions
Recently, much effort has been devoted to compressing the Aho-Corasick DFA in order
to improve the algorithm’s performance [28, 63, 64, 66, 76, 87–89, 93, 96]. Other works,
such as [77], focused on the construction of the above compressed DFAs using small
intermediate parsers. While most of these works either suggest dedicated hardware
solutions or introduce a non-constant, higher processing time, we present a generic DFA compression algorithm and show how to store the resulting DFA in off-the-shelf hardware. Our novel algorithm works on a large class of Aho-Corasick–like DFAs, whose
unique properties are defined in the following chapter. This algorithm reduces the rule
set to the minimum possible size: only one rule per state. A key observation is that in
prior works the state codes were chosen arbitrarily; we take advantage of this degree of
freedom and add information about the state properties in the state code. This allows
us to encode all transitions to a specific state by a single prefix that captures a set
of current states. Moreover, if a state matches more than one rule, the rule with the
longest prefix is selected. Thus, our scheme reduces the problem of pattern matching
to the well-studied problem of Longest Prefix Match (LPM).
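Once the automaton is encoded this way, each transition lookup becomes a longest-prefix-match query, the same primitive that IP routers solve. A toy sketch of that primitive follows; the rule table, state codes, and actions are invented for illustration, and a linear scan stands in for the TCAM or IP-lookup chip:

```python
def longest_prefix_match(rules, key):
    """Return the action of the longest prefix in `rules` matching `key`,
    emulating a TCAM in which entries are ordered longest-first and the
    first hit wins. The linear scan is for clarity only."""
    best_action, best_len = None, -1
    for prefix, action in rules.items():
        if key.startswith(prefix) and len(prefix) > best_len:
            best_action, best_len = action, len(prefix)
    return best_action

# Hypothetical table: prefixes over state-code bits -> next-state code.
# The empty prefix acts as the default (catch-all) rule.
rules = {"": "fail", "10": "s1", "1011": "s2"}
```

Because a more specific prefix always wins, one rule per state suffices: the prefix captures the whole set of current states that transition to it.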
The reduction in the number of rules comes with a small overhead in the number of bits used to encode a state. For example, a DFA based on Snort requires 17 bits to encode each state when state codes are chosen arbitrarily, while under our scheme it requires 36; for ClamAV’s DFA the code width increases from 22 bits to 59 bits.
In addition, we present two extensions to our basic scheme. Our first extension,
called CompactDFA for total memory minimization, aims at minimizing the product of
the number of rules and the code width, rather than only the number of rules. This
captures situations in which, at some point, the reduction in the number of rules is
not worth the additional bits in the state code. Specifically, for the above pattern
sets, this extension reduced the memory requirement by up to an additional 40%. The
second extension deals with variable-stride DFAs, which were created to speed up the
inspection process by inspecting more than one symbol at a time; like the first extension,
it requires more rules than the number of states.
One of the main advantages of CompactDFA is that it fits into commercially available IP-lookup solutions, implying that they may also be used for performing fast pattern matching. We demonstrate the power of this reduction by implementing the Aho-Corasick algorithm on an IP-lookup chip (as in [84]) and on a TCAM.
Specifically, in TCAM, each rule is mapped to one entry. Since TCAMs are configured with an entry width that is a multiple of 36 or 40 bits, minimizing the number of bits used to encode a state is less important, and the basic CompactDFA, which minimizes the number of rules, is more adequate.
We also deal with two obstacles that arise when using TCAMs: the power consumption and the latency induced by the pipeline in the TCAM chip, which is especially significant since CompactDFA works in a closed-loop manner (that is, the input for one lookup depends on the output of the previous lookup). To overcome the latency problem, we propose two kinds of interleaved execution (namely, inter-flow interleaving and intra-flow interleaving). We show that combining these executions provides low latency (on the order of a few tens of microseconds) at high throughput. We reduce the power consumption of the TCAM by taking advantage of the fact that today’s vendors partition the TCAM into blocks and allow, in every lookup, activation of only some of these blocks. We suggest dividing the rules among different blocks, each associated with a
different subset of symbols. Dividing the rules into blocks in this way reduces the number of bits required for encoding the symbol field of a rule to the logarithm of the number of symbols that are mapped to the same block.
The small memory requirement of the compressed rules and the low power consumption enable the use of multiple TCAMs simultaneously, where each performs pattern matching over different sessions or packets (namely, inter-flow interleaving). Furthermore, one can take advantage of the common multiprocessor architecture of contemporary security tools and design a high-throughput solution, applicable to the common case of multiple sessions/packets. Notice that while state-of-the-art TCAM chips are 5MB in size, high throughput may be achieved using multiple small TCAM chips. For the Snort pattern set, we achieve a throughput of 10Gbps and a latency of less than 60 microseconds by using 5 small TCAMs of 0.5MB each, and as much as 40Gbps (with the same latency) with 20 small TCAMs.
This work was published in the proceedings of Infocom 2010 [35].
1.2.2 Space-Efficient Deep Packet Inspection of Compressed
Web Traffic
Networking devices that perform deep packet inspection (DPI) over compressed traffic must first decompress the message in order to inspect its payload. Gzip compression, which is used for compressed Web traffic, replaces repeated strings with back references, denoted as pointers, to their prior occurrence within the last 32KB of the text. Therefore, the decompression process requires a 32KB buffer of the recently decompressed data, to keep all possible bytes that might be back-referenced by the pointers; this causes a major space penalty. With today’s mid-range firewalls, which are built to support 100K to 200K concurrent connections, keeping a 32KB window buffer for each connection occupies a few gigabytes of main memory. Decompression also causes a time penalty, but this penalty was successfully reduced in [36].
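The 32KB-window mechanism can be made concrete with an idealized decoder sketch: the token stream mixes literal bytes with (distance, length) back references, and decoding a pointer must read bytes that may lie up to 32KB back, which is why the window buffer must be kept per connection. This is illustrative only; real gzip adds Huffman coding on top of the LZ77 stage shown here.

```python
def lz77_decode(tokens, window=32 * 1024):
    """Decode an idealized LZ77 token stream. Each token is either a
    literal byte string or a (distance, length) pointer that copies
    `length` bytes starting `distance` bytes back in the decoded text."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, bytes):
            out += tok
        else:
            dist, length = tok
            assert 0 < dist <= min(window, len(out)), "pointer outside window"
            for _ in range(length):      # byte-by-byte copy permits overlap
                out.append(out[-dist])
    return bytes(out)
```

The overlap case (distance smaller than length) is what makes pointers behave like run-length encoding, e.g. a single byte followed by a (1, n) pointer expands to n+1 copies of that byte.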
This high memory requirement leaves vendors and network operators with three bad options: ignore compressed traffic, forbid compression, or divert the compressed traffic for offline processing. Obviously, none of these is acceptable, as each presents either a security hole or a serious performance degradation.
The basic structure of our approach is to keep the 32KB buffer of all connections compressed, except for the data of the connection whose packet(s) is currently being processed. When a packet arrives, we unpack its connection buffer and process it. One may naïvely suggest keeping only the original compressed data as it was received. However, this approach fails, since the buffer would contain recursive pointers reaching more than 32KB backwards. Our technique, called “Swap Out-of-boundary Pointers” (SOP), packs the connection’s buffer by combining recent information from both the compressed and the uncompressed 32KB buffer to create a new compressed buffer that contains pointers referring only to locations within itself. We show that by employing our technique on real-life data we reduce the space requirement by a factor of 5, with a time penalty of 26%. Note that while our method modifies the compressed data locally, it is transparent to both the client and the server.
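The boundary problem that SOP addresses can be demonstrated on an idealized token stream. The sketch below is not the SOP algorithm of Chapter 4 (which also re-compresses the swapped regions and handles the Huffman layer); it is a simplified, hypothetical illustration of the core idea, namely replacing pointers that reach back across the buffer boundary with the literals they denote, so that the packed buffer is self-contained:

```python
def swap_out_of_boundary(tokens, start):
    """Rebuild an LZ77-style token stream for the text suffix beginning
    at `start`, so that every pointer refers only to positions inside
    the new buffer. Tokens are literal byte strings or (distance,
    length) pointers into the previously decoded text."""
    decoded = bytearray()
    new_tokens = []
    for tok in tokens:
        if isinstance(tok, bytes):
            pos = len(decoded)
            decoded += tok
            if pos + len(tok) > start:           # keep the part after the boundary
                new_tokens.append(tok[max(0, start - pos):])
        else:
            dist, length = tok
            pos = len(decoded)
            for _ in range(length):              # byte-wise copy handles overlap
                decoded.append(decoded[-dist])
            if pos >= start and pos - dist >= start:
                new_tokens.append(tok)           # pointer target stays in-buffer
            elif pos + length > start:
                # Pointer crosses or precedes the boundary: swap to literals.
                new_tokens.append(bytes(decoded[max(pos, start):pos + length]))
    return new_tokens, bytes(decoded[start:])
```

Decoding the returned tokens against an empty history reproduces exactly the text from `start` onward, which is the invariant SOP maintains for each packed connection buffer.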
We further design an algorithm that combines our SOP space-reducing technique with the ACCH algorithm presented in [36] (the ACCH algorithm accelerates pattern matching on compressed HTTP traffic). The combined algorithm achieves an improvement of 42% in time and 79% in space requirements. The time-space tradeoff offered by our technique provides the first solution that enables DPI on compressed traffic at wire speed for network devices such as IPS and Web application firewalls.
This work was published in the proceedings of IFIP Networking 2011 [21]. An
extended version was published in the Computer Communications Journal [22].
1.2.3 Shift-based Pattern Matching for Compressed Web Traffic
In this work we provide a method for accelerating DPI over compressed traffic. The most common compression method for Web traffic is the gzip algorithm, which eliminates repetitions of strings using back references (pointers). The key insight is to store information produced by the pattern matching algorithm while scanning the decompressed traffic and, upon encountering a pointer, use this data either to find a match or to skip scanning that area. Recent work [36] presents the ACCH technique for pattern matching on compressed traffic. This technique decompresses the traffic and then uses data from the decompression phase to accelerate the process. That work analyzed the case of using the well-known Aho-Corasick (AC) [23] algorithm as a multi-pattern matching technique. The basic Aho-Corasick has good worst-case performance, since every character requires traversing exactly one deterministic finite automaton (DFA) edge. However, the adaptation for compressed traffic, where some characters represented by pointers may be skipped, is complicated, since Aho-Corasick requires inspection of
every byte within the traffic.
Inspired by the insights of that work, we investigate the case of performing DPI over
compressed Web traffic using the shift-based multi-pattern matching technique of the
modified Wu-Manber algorithm [94]. The Wu-Manber algorithm does not scan every
position within the traffic; in fact it shifts (skips) scanning areas in which the algorithm
concludes that no pattern starts.
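The shift heuristic can be illustrated with a simplified, Horspool-style variant using single-character blocks. The actual modified Wu-Manber algorithm of Section 5.1 uses multi-character blocks and hash tables; this sketch, with an invented pattern set, only conveys why skipping positions is safe:

```python
def build_shift(patterns):
    """Shift table in the spirit of Wu-Manber, simplified to blocks of
    one character: shift[c] is how far the scan position may safely
    advance when character c is read at the window's last position."""
    m = min(len(p) for p in patterns)    # patterns are truncated to length m
    shift = {}
    for p in patterns:
        for i, c in enumerate(p[:m]):
            shift[c] = min(shift.get(c, m), m - 1 - i)
    return shift, m

def mwm_scan(text, patterns):
    """Scan `text`, skipping regions where no pattern can possibly end."""
    shift, m = build_shift(patterns)
    matches, i = [], m - 1
    while i < len(text):
        s = shift.get(text[i], m)        # characters absent from all prefixes
        if s:                            # allow a full window skip
            i += s
        else:
            start = i - m + 1            # shift 0: verify each candidate pattern
            for p in patterns:
                if text.startswith(p, start):
                    matches.append((start, p))
            i += 1
    return matches
```

A nonzero shift certifies that no pattern prefix of length m can end at the current position, which is exactly the property SPC exploits when it extends the skips to compressed pointers.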
As a preliminary step, we present an improved version of the Wu-Manber algorithm (see Section 5.1). This modification improves both time and space complexity, to fit the large number of patterns in current pattern sets such as the Snort database [9]. We then present our Shift-based Pattern matching for Compressed traffic algorithm (SPC), which accelerates Wu-Manber on compressed traffic. SPC results in a simpler algorithm, with higher throughput and lower storage overhead than the accelerated AC, since the Wu-Manber algorithm’s basic operation involves shifting (skipping) over some of the traffic. Thus, it is natural to combine Wu-Manber with the idea of skipping some of the pointers.
We show that we can skip scanning up to 87.5% of the data and gain a performance
boost of more than 73% as compared to the Wu-Manber algorithm on real Web traffic
and security-tool signatures. Furthermore, we show that the suggested algorithm also
gains a normalized throughput improvement of 51% as compared to ACCH. Finally, the
SPC algorithm reduces the additional space required for previous scan results by half,
by storing only 4KB per connection as compared to the 8KB of ACCH.
This work was published in the proceedings of HPSR 2011 [37].
1.2.4 Decompression-Free Inspection: DPI for Shared Dictionary
Compression over HTTP
Gzip works well as a compression method for each individual HTTP response, but it
often happens that a great deal of data is common to a group of pages. This type
of sharing is known as inter-response redundancy. Therefore, next-generation Web
compression methods are inter-file, where there is one dictionary that may be referenced
by several files. An example of a compression method that uses a shared dictionary is
Shared Dictionary Compression over HTTP (SDCH).
SDCH [38] was proposed by Google Inc.; thus, Google Chrome (Google’s browser)
supports it by default. According to W3Schools [3], Google's Chrome browser surpassed
Mozilla's Firefox browser in March 2012 (after it surpassed Microsoft's Internet Explorer
browser back in April 2011) to become the clear winner of the latest browser wars.
Thus, the popularity of SDCH compression is expected to increase accordingly.
Android is a software stack for mobile devices that includes an operating system,
middleware and key applications. The Android operating system, also introduced by
Google, is currently the world’s best-selling smartphone platform, with a 68.1% market
share worldwide [1]. SDCH code also appears in the Android platform and is likely to be
used in the near future. Therefore, a solution for DPI on shared dictionary compressed
data is essential for this platform as well. SDCH is complementary to gzip or Deflate,
i.e., it could be used before applying gzip. On Web pages containing Google search
results, the data size reduction when adding SDCH compression before gzip is about
40% better than gzip alone.
The idea of the shared dictionary approach is to transmit the data that is common
to the responses once, and after that to send only the parts that differ. In SDCH
notation, the common data is called the dictionary and the differences are stored in a
delta file.
A dictionary is composed of the data used by the compression algorithm, as well as
metadata describing its scope and lifetime. The scope is specified by the domain and
path attributes; thus, a user may download several dictionaries, even from the same
server.
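As a toy illustration of the dictionary/delta split: the real SDCH delta format is VCDIFF, so the ('copy'/'add') tuples below are a hypothetical simplification of the idea, not the wire format.

```python
# Toy shared-dictionary delta decoding (real SDCH deltas use VCDIFF, RFC 3284;
# the ('copy'|'add') operation tuples here are a hypothetical simplification).

def decode_delta(dictionary, delta):
    out = []
    for op in delta:
        if op[0] == 'copy':                 # ('copy', offset, length): from dict
            _, off, length = op
            out.append(dictionary[off:off + length])
        else:                               # ('add', literal): new bytes
            out.append(op[1])
    return ''.join(out)

# Hypothetical dictionary shared by many search-result pages:
dictionary = "<html><head><title>Search</title></head><body>"
delta = [('copy', 0, 19), ('add', 'Results'), ('copy', 25, 21)]
page = decode_delta(dictionary, delta)
# page == "<html><head><title>Results</title></head><body>"
```

Only the short `('add', 'Results')` literal is actually transmitted as new data; the rest is reconstructed from the shared dictionary.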
In this work we present a novel pattern matching algorithm for SDCH. Our algorithm
operates in two phases: the offline phase and the online phase. The offline phase
starts when the device receives the dictionary. In this phase, the algorithm uses the Aho-
Corasick [23] pattern matching algorithm to scan the dictionary for patterns and to mark
auxiliary information that facilitates the scan of the delta files. Once a delta file is
received, it is scanned online using the Aho-Corasick algorithm. Since the delta file
eliminates repetitions of strings using references to the common strings in the dictionary,
our algorithm tries to skip these references, so that each plain-text byte is scanned only
once (either in the offline or the online phase). We show that we skip up to 99% of the
referenced data and gain up to 56% improvement in the performance of the multi-
pattern matching algorithm, compared with scanning the plain text directly.
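The two-phase idea can be sketched as follows. This is a simplified model, not the exact thesis algorithm: plain substring search stands in for Aho-Corasick, and the delta is a hypothetical list of ('copy', offset, length) and ('add', literal) operations rather than real SDCH.

```python
# Two-phase sketch: offline, find all pattern occurrences in the dictionary
# once; online, reuse those results for copied regions and rescan only the
# boundary windows where a pattern could straddle a copy edge.

def offline_scan(dictionary, patterns):
    """Offline phase: locate every pattern occurrence inside the dictionary."""
    hits = []
    for p in patterns:
        i = dictionary.find(p)
        while i != -1:
            hits.append((i, p))
            i = dictionary.find(p, i + 1)
    return hits

def scan_delta(dictionary, delta, patterns, offline_hits):
    k = max(len(p) for p in patterns) - 1   # max bytes a pattern can straddle
    text, skip, matches = "", [], []
    for op in delta:
        if op[0] == 'copy':
            _, off, n = op
            base = len(text)
            for i, p in offline_hits:       # reuse matches fully inside copy
                if off <= i and i + len(p) <= off + n:
                    matches.append((base + i - off, p))
            if n > 2 * k:                   # interior bytes: never rescanned
                skip.append((base + k, base + n - k))
            text += dictionary[off:off + n]
        else:
            text += op[1]
    for p in patterns:                      # rescan only outside copy interiors
        for s in range(len(text) - len(p) + 1):
            if any(a <= s and s + len(p) <= b for a, b in skip):
                continue
            if text.startswith(p, s):
                matches.append((s, p))
    return sorted(set(matches))

dictionary = "hello evil world"
patterns = ["evil", "dxw"]
hits = offline_scan(dictionary, patterns)
delta = [('copy', 0, 16), ('add', 'x'), ('copy', 11, 5)]
matches = scan_delta(dictionary, delta, patterns, hits)
```

Here the match of "evil" comes entirely from the offline results, while "dxw", which straddles copy boundaries, is caught by the boundary rescan.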
We are the first to address the problem of pattern matching algorithms for SDCH. In
addition, we have designed a novel algorithm that scans only a negligible number of bytes
more than once, as our evaluations confirm (see Section 6.3). This is a remarkable result
considering that bytes in the dictionary can be referenced multiple times by different
positions in one delta file and moreover, by different delta files. The SDCH compression
ratio is about 44%, implying that 56% of the data is copied from the dictionary. Thus,
in a single scan, our algorithm achieves 56% improvement over a plain text file scan.
Our algorithm also has low memory consumption. It stores only the dictionary being
used (along with some auxiliary information per dictionary). In the case of SDCH, since
it was developed for Web traffic, one dictionary usually supports many connections.
In other words, the memory consumption depends on the number of dictionaries and
their sizes and not on the number of connections, in contrast to intra-file compression
methods.
Finally, an important contribution is a mechanism to deal with matching regular-
expression signatures in SDCH-compressed traffic. Regular expression signatures are
becoming increasingly popular due to their superior expressibility [26]. We show how to
use our algorithm as a building block for regular expression matching. Our experiments
show that our regular expression matching mechanism gains a similar 56% boost in
performance.
This work was published in the proceedings of Infocom 2012 [33].
1.2.5 MCA2: Multi Core Architecture for Mitigating
Complexity Attacks
This work deals with complexity attacks, which exploit the gap between the resources
the system spends on processing normal packets and on carefully crafted packets that
consume drastically more resources (computing, memory, cache, or other). These
crafted packets, which we call heavy packets, are easy to construct but require very
intensive processing from the system. This implies that a small effort on the attacker’s
side leads to a great effort on the part of the system, which is bound to lose.
We present MCA2—a Multi-Core Architecture for Mitigating Complexity Attacks.
MCA2 essentially isolates the malicious traffic to a fraction of the cores and deals with
legitimate traffic on the remaining cores, which are therefore not affected by the attack.
Our MCA2 system can be configured to mitigate any complexity attack with the
following properties:
1. There are heavy and normal packets, where heavy packets consume considerably
more resources from the security device when being processed.
2. There is a method to identify heavy packets that requires very few resources.
3. Packets can be moved efficiently between system cores.
4. There is a special method that handles heavy packets more efficiently than the
method used for normal packets.1
It turns out that there are quite a few complexity attacks that meet these criteria.
However, we restrict our discussion to the DPI component of NIDS/NIPS. We consider
three examples that have the above properties: cache-miss attack on Snort’s signature
detection engine; active states explosion attack on the Hybrid-FA [27] regular expression
detection engine; and forced construction attack on Bro IDS regular expression detection
engine.
We focus on the first example and use it to explain our method and the above-
mentioned properties. We then show that the active states explosion complexity attack
fits our requirements as well. We back up all our findings with experimental results,
showing the benefits of using MCA2 in conjunction with the NIDS. For the forced
construction attack example, we look at the Bro IDS regular expression detection engine.
Bro takes a lazy approach in order to cope with the large DFA size. Namely, it con-
structs only the DFA parts it actually uses. Normal traffic uses only a small part of the
DFA. Hence, a simple complexity attack forces Bro to construct a large portion of the
DFA, which significantly degrades performance.
With regard to our main example, we target Snort’s DPI engine, which uses some
variant of the Aho-Corasick (AC) [23] algorithm for performing pattern matching. A
complexity attack on the Aho-Corasick algorithm (in a stand-alone environment) is
shown in [34]: Aho-Corasick uses a large DFA that cannot fit entirely in the cache. The
common traffic, however, uses only a very small part of it, resulting in fast memory
references and few cache misses. An attacker can easily craft malicious packets that
cause an exhaustive traversal over the DFA, which in turn pollutes the cache. In this
work, we show for the first time that Snort is indeed vulnerable to this attack: an attack
on its DPI component degrades its overall performance by a factor of 4.2.
After establishing that the threat of this attack is real, we turn to investigate how
MCA2 mitigates such an attack. The key challenge is how to detect and isolate malicious
1This special method usually handles normal packets poorly; otherwise it would have been used by the system in the first place.
traffic. This is done in two steps. First, training data is used to identify and mark the
common states of the DFA. These are the frequently-visited states while processing
normal common traffic. Then, for each packet, we count the fraction of non-common
states visited (out of the total number of states traversed by the packet). As soon as
this fraction exceeds a certain threshold, the packet is marked heavy. When the fraction
of heavy packets is above a second threshold, we allocate one or more cores to deal with
them exclusively, while the rest of the cores continue to process only normal traffic (and
to detect heavy packets); each subsequent heavy packet is moved to one of the dedicated
cores. This process isolates the effect of heavy packets and protects the private caches
of the non-dedicated cores from pollution. MCA2 can be further optimized by running
on the dedicated cores an implementation that is optimized for heavy packets (albeit
with penalty in the normal case).
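The detection step described above can be sketched as follows. The common-state set and the threshold value here are illustrative assumptions, not the thesis's tuned parameters.

```python
# Sketch of the heavy-packet heuristic: a packet is flagged heavy when the
# fraction of visits to non-common DFA states exceeds a threshold.
# The threshold and common-state set below are assumed values for illustration.

HEAVY_FRACTION = 0.3     # assumed threshold on non-common state visits

def is_heavy(visited_states, common_states, threshold=HEAVY_FRACTION):
    """Flag a packet whose DFA traversal spends too much time outside the
    frequently-visited ('common') states marked during training."""
    if not visited_states:
        return False
    uncommon = sum(1 for s in visited_states if s not in common_states)
    return uncommon / len(visited_states) > threshold

common = {0, 1, 2, 5}                             # marked offline from training
assert not is_heavy([0, 1, 0, 2, 5, 1], common)   # normal packet
assert is_heavy([0, 9, 14, 23, 17, 9], common)    # crafted, cache-polluting
```

Once the fraction of packets flagged this way crosses a second (system-wide) threshold, cores would be reallocated as described above.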
The main performance measure we use is the goodput of the system, namely the
volume of the non-malicious packets that were processed. Our experimental results
are summarized in Fig. 1.1, which shows the system's goodput under different attack
intensities (e.g., at an attack intensity of 50%, half of the incoming traffic is malicious).
We compare the performance of MCA2 with two implementations of the Aho-Corasick
algorithm: the first, denoted “Full Matrix AC,” is optimized for well-behaved normal
traffic, and the second, denoted “Compressed AC,” is optimized to work under cache-
miss attack (as described in Section 2.1).
When the system is not allowed to drop packets, MCA2 uses “Full Matrix AC” on
the cores that process normal traffic and “Compressed AC” on the dedicated cores. The
number of cores of each type is dynamically determined as a function of the attack level.
When there is no attack, MCA2 is reduced to “Full Matrix AC.”
We also consider the case when the NIDS/NIPS is allowed to drop packets. Drop-
ping all heavy packets implies that no dedicated threads are required, freeing up all
processing resources for the detection of heavy packets and processing of non-heavy
(mostly legitimate) packets, thus increasing the goodput.
Our experiments show a significant goodput improvement: MCA2 achieves up to
twice the goodput of both implementations, even without dropping packets. Further-
more, it always outperforms a hybrid implementation that chooses the best of the
previous implementations at any given time, with a goodput boost of up to 73%.
As for the second example, we use the regular expressions Hybrid-FA data structure
[Figure omitted: goodput (Mbps) vs. attack intensity (%), with curves for Full Matrix AC, Compressed AC, MCA2, and MCA2 with drop.]
Figure 1.1: The goodput of MCA2 for different attack intensities. MCA2 with no drops maintains a balance between all cores.
to illustrate an active states explosion attack. Hybrid-FA uses a single “head-DFA” for
commonly-used states, while other parts of the automaton are kept as separate DFAs,
which are activated simultaneously when required. Usually, only the “head-DFA” is
activated. Thus, our complexity attack causes the Hybrid-FA to activate many states
in parallel, therefore causing the system to traverse several states per input byte; this
degrades system throughput significantly. We show that MCA2 in full-drop setup can
mitigate such an attack: our experiments show that under a mild active states explosion
attack, the goodput of the system is increased by a factor of 4.8.
This work was published in the proceedings of ANCS 2012 [20].
1.3 Related Work
This section surveys work related to the research presented in this thesis.
1.3.1 DFA Compression
Intensive efforts have been made to implement compact Aho-Corasick-like DFAs that
can fit into faster memory.
Van-Lunteren [89] proposed a novel architecture that supports prioritized tables.
His results are equivalent to the CompactDFA presented in Chapter 3 with a suffix tree
limited to depth 2, and thus have 25 (66) times more rules than the CompactDFA solution
for Snort (ClamAV). CompactDFA is, in some sense, a generalization of [89], which
eliminates all cross-transitions. Song et al. [85] proposed an n-step cache architecture
to eliminate some of the DFA’s cross-transitions. This solution still has 4 (9) times
more rules for Snort (ClamAV) than in CompactDFA. In addition, this solution, like
other hardware solutions [76, 87], uses dedicated hardware and thus is not flexible.
As far as we know, CompactDFA is the first proposed method for reducing the
number of transitions in a DFA to the minimum possible, namely the number of DFA states.
CompactDFA does not depend on any specific hardware architecture or on any statistical
property of the data (as opposed to the work of Tuck et al. [88]).
The papers [96] and [86, 93] encode segments of the patterns in the TCAM and do
not encode the DFA rules. However, both solutions require significantly larger TCAM
(especially [93]) and more SRAM (an order of magnitude more). The work of Lin
et al. [66] encodes the DFA rules in TCAM, just as CompactDFA does. CompactDFA
and [66] are based on the same basic observation, that we can eliminate cross-transitions
by using information from the next state label. However, [66] does not use the bits of
the state to encode the information; on the contrary, they just append to each state
code the last m bytes of its corresponding label to eliminate cross-transitions to depth
m. Thus, for depth 4, [66] requires 62 bits while CompactDFA requires only 36 bits;
hence, their solution is not scalable.
A recent work presented a method for state encoding in a TCAM-based implementation
of an Aho-Corasick NFA rather than an Aho-Corasick DFA [97]. While this method,
which was developed concurrently with ours, shares some of the insights of our work
(e.g., it also eliminates all failure transitions), it is limited to TCAM implementations,
whereas CompactDFA may be used with any known IP-lookup solution. In addition,
unlike our work, the method in [97] does not deal with pipelined TCAM implementa-
tions (which are common in contemporary TCAM chips) and therefore suffers from
significant performance degradation if such TCAMs are used.
Following our work, several methods to perform regular expression matching using
TCAM [71, 78] were suggested. These methods rely on the same high-level principle
of our work: exploiting the degree of freedom in the way states are encoded. Since
these methods deal with regular expression rather than exact string matching, they do
not use AC-DFA, but other automata that are geared to handle regular expressions.
Specifically, [71] uses D2FA, while [78] uses both a DFA and knowledge derived from a
corresponding NFA; both methods then construct a tree (or forest) structure, which is
encoded similarly to CompactDFA. Finally, unlike our work, the methods in [71, 78, 97]
do not deal with pipelined TCAM implementations (which are common in contemporary
TCAM chips) and therefore suffer from significant performance degradation if such
TCAMs are used.
Two additional methods that use TCAMs to handle regular expression matching
were presented by Liu et al. [68] and Zheng et al. [99]. These methods present orthogonal
improvements to utilizing TCAMs. Specifically, the method of Liu et al. [68] is based on
implementing a fast and cheap pre-filter, so that only a portion of the traffic has to be
fully inspected; Zheng et al. [99], on the other hand, suggest a technique that parallelizes
the use of the TCAM by smartly dividing the pattern rule set and the flows among
different TCAM blocks. Naturally, these two approaches can be easily combined with
ours.
Finally, we note that [71] introduces the table consolidation technique, which com-
bines entries even if they lead to different states. This technique trades TCAM memory
for cheaper SRAM memory that stores the different states of each combined entry.
Table consolidation, which requires solving complicated optimization problems, can
also be applied to our results to further reduce TCAM space.
1.3.2 Compressed Web-Traffic
Extensive research has been conducted on performing pattern matching on compressed
files as in [25, 59, 72, 73], but very limited work has been done on compressed traffic.
Requirements for dealing with compressed traffic are: (1) on-line scanning (1-pass),
(2) handling thousands of connections concurrently, and (3) working with the LZ77
compression algorithm, which is used by gzip (as opposed to most papers, which deal
with LZW/LZ78 compression). To the best of our knowledge, [47, 52] are the only
papers that deal with pattern matching over LZ77. However, the algorithms are for a
single pattern and require two passes over the compressed text (file), which is not an
option in network domains that require “on-the-fly” processing.
Klein and Shapira [60] have suggested a modification to the LZ77 compression
algorithm that changes the backward pointers into forward pointers. That modification
makes pattern matching in files easier and may save some of the space required by
the 32KB buffer for each connection. However, their proposal is not implemented in
today's HTTP.
The first paper to analyze the obstacles in dealing with compressed traffic is [36],
but it only accelerated the pattern matching task on compressed traffic and did not
handle the space problem. Furthermore, it still requires full decompression.
Techniques have been developed for in-place compression, the main one being
LZO [75]. While LZO claims to support decompression without memory overhead,
it works with files and assumes that the uncompressed data is available. We assume
decompression of thousands of concurrent connections on-the-fly, without having the
uncompressed data available. Thus, what comes for free in LZO is considered overhead
for compressed Web traffic. Furthermore, while gzip is considered the standard for Web
traffic compression, LZO is not supported by any Web server or Web browser.
1.3.3 DPI Using Multi-Core Platforms
The recent proliferation of multi-core general purpose processors motivated many re-
searchers to reinvestigate well-known problems in this new domain. Among these are
several works that proposed a multi-core solution for DPI processing. These papers’
main focus is on different ways to load balance the system tasks between the available
cores.
Current NIDS/NIPS systems such as Snort [9] and Bro [2] split the load into many
sequential sub-tasks in a pipeline manner. Other works, such as [91], suggest fine-
grained pipelining for parallelizing network applications on multi-core architectures.
This partitioning is effective if the processing cost for each sub-task is similar, which is
usually not the case for NIDS/NIPS.
A different line of research focuses on load balancing the traffic flows equally between
the different cores and performing the inspection in parallel [41, 53, 67, 74, 83]. The load
balancing is based on both the packet header parameters and some layer-7 parameters.
We note that such architectures are orthogonal to our MCA2 algorithm (see Chapter 7)
and may be applied to load balance the work between general threads that process the
normal traffic. If MCA2 is not used in conjunction with these architectures, they are
all vulnerable to complexity attacks.
Becchi et al. [30] focus on DPI and present a performance evaluation scheme for
multiprocessor systems. The proposed design also splits the traffic between several
cores with the same DPI engine that supports regular expression matching. Their
study identifies and evaluates algorithmic and architectural trade-offs and limitations,
and highlights how the presence of caches affects the overall performance. However, it is
geared toward optimizing the normal case and is vulnerable to complexity attacks similar
to those we describe in this work. Such attacks can be mitigated by incorporating MCA2
into this scheme as well.
Another multi-core load-balancing approach is to partition the patterns among the
cores (cf. [90, 95, 98]). Then different DPI algorithms, each specializing in different
kinds of pattern sets, are run on each core. In some cases, the partitioning itself is
done so as to balance the load between the algorithms. It is important to note that
architectures of this kind differ from MCA2 in that each packet is examined by several
cores (each performs only part of the inspection). In addition, they do not take into
account the incoming traffic, and are vulnerable to separate attacks on each core.
1.3.4 Denial-of-Service Mitigation
Kumar et al. [62] present several methods to reduce the size of regular-expression-based
DFAs. One of the mechanisms used in that paper is based on the assumption that normal
flows rarely match more than the first few symbols of any signature. Thus, the most
frequently visited portions of the automaton are used to build a fast path DFA, and the
rest of the automaton is represented by a separate NFA, which is the slow path. The
authors suggest a solution that is similar to MCA2 in that it handles heavy traffic with
a different algorithm and applies a lightweight classification algorithm to distinguish
between heavy and normal traffic. In addition, [62] proposes to protect against denial-
of-service (DoS) attacks by attaching lower priority to flows with higher probability
of being malicious. Nevertheless, that work analyzes the case of a single core, and
therefore could not benefit from the multi-core properties as MCA2 does. Furthermore,
the proposed protection in [62] fails under a continuous DoS attack because the heavy
packets that receive lower priority eventually overload the system buffer. MCA2 is also
resilient to DoS attacks of longer duration.
1.4 Significance
This thesis provides algorithms and techniques in the field of deep packet inspection for
high performance network security tools. These algorithms focus on three problems:
scalability, compressed traffic, and security-tool resiliency.
For the first topic, that of scalability, we are the first to provide a scheme that reduces
the pattern matching problem to the well-studied problem of Longest Prefix Matching
(LPM), which may be solved either in TCAM, in commercially available chips, or in
software.
For the second topic, that of compressed traffic, we are the first to address the
problem and to provide a set of state-of-the-art solutions that achieve good theoretical
and practical results.
As for the third topic, we have uncovered and demonstrated weaknesses of preva-
lent security tools for commercial networks, by devising a denial-of-service algorithmic
complexity attack over the Snort network intrusion detection system. Furthermore, we
are the first to incorporate the common multi-core platform architecture to mitigate
complexity attacks over network security tools.
Chapter 2
Background
In this chapter we provide background on topics that are relevant to the following
chapters: “pattern matching”, “compressed traffic” and “complexity attacks”.
2.1 DFA based Pattern Matching
DPI is a major component in contemporary security tools; it relies heavily on
pattern matching to detect signatures of malicious traffic. We consider the following
two classes of pattern matching: exact matching and regular expression matching. The
former usually uses a deterministic finite automaton (DFA), while the latter uses either
a DFA or a non-deterministic finite automaton (NFA) for the ongoing inspection of the
input data [54]. A sub-category of the latter class is the Ternary Content Addressable
Memory (TCAM) based regular expression matching, which encodes the DFA rules
using TCAM elements (as discussed in Chapter 3).
In our main example, we mostly focus on the exact matching algorithms, which
use DFAs. A DFA is a five-tuple 〈S,Σ, δ, s0, F 〉, where S is a finite set of states, Σ is a
finite set of input symbols, δ : S × Σ → S is a transition function, returning the next
state given the current state and an input symbol, s0 ∈ S is the initial state,
and F ⊆ S is a set of accepting states. The Aho-Corasick algorithm provides a method to
build such an automaton (a.k.a. Aho-Corasick DFA) from a set of patterns. Given the
DFA, a packet is inspected by traversing the automaton symbol by symbol from s0; a
pattern is detected if a state in F is reached in this traversal. Fig. 2.1(a) depicts the
Aho-Corasick DFA for the pattern-set {E,BE,BD,BCD,CDBCAB,BCAA}.
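The inspection procedure just described can be transcribed directly. The toy transition table below encodes a DFA for the single pattern "ab" over the alphabet {a, b}, not the figure's automaton.

```python
# Direct transcription of the DFA-based inspection loop: traverse the automaton
# symbol by symbol from the initial state and report whenever an accepting
# state is reached. The table below is a toy DFA for the pattern "ab".

def inspect(data, delta, s0, accepting):
    s, alerts = s0, []
    for i, symbol in enumerate(data):
        s = delta[(s, symbol)]              # exactly one DFA edge per symbol
        if s in accepting:
            alerts.append(i)                # a pattern ends at position i
    return alerts

delta = {(0, 'a'): 1, (0, 'b'): 0,
         (1, 'a'): 1, (1, 'b'): 2,
         (2, 'a'): 1, (2, 'b'): 0}
print(inspect("aabba", delta, 0, {2}))      # the pattern "ab" ends at position 2
```

Note the worst-case property mentioned earlier: every input symbol costs exactly one transition-table lookup.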
In today’s security tools, Aho-Corasick DFAs are huge—e.g., Snort’s Aho-Corasick
[Figure omitted: (a) the Aho-Corasick DFA for the pattern-set {E, BE, BD, BCD, CDBCAB, BCAA}; (b) its full-matrix encoding; (c) the compressed automaton; (d) the compressed encoding.]
Figure 2.1: Example of an Aho-Corasick DFA and two methods to store it in memory: non-compressed (full-matrix) encoding and compressed encoding. The compressed encoding is derived from a compressed automaton, in which fail transitions are taken without consuming input symbols, and transitions marked with ‘*’ indicate that a match was found.
DFA has 77,182 states for 31,094 patterns—raising the question of how to store it
efficiently in memory. The alternatives naturally trade memory space for execution
time. In addition, most security tools (including Snort) divide their patterns into several
sets, according to the type of traffic.
Snort uses a full-matrix encoding for its Aho-Corasick DFAs as presented in [23]. In
this representation (see Fig. 2.1(b)), transitions are stored in a two-dimensional array
with |S| rows and |Σ| columns. An entry at position (i, j) holds the value of δ(si, j),
implying that the number of bits in each entry is at least ⌈log2 |S|⌉. In the typical
case, when the input is inspected one byte at a time, |Σ| = 256, resulting in an overall
memory footprint of 256|S|⌈log2 |S|⌉ bits. For Snort's Aho-Corasick DFAs, this translates
to a combined footprint of 75.15 MB. On the other hand, the main advantage of this
encoding is that a transition consists of a single memory load operation that directly
reveals the next state.
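A minimal sketch of the full-matrix representation and the footprint formula above; the 2-state, 4-symbol matrix is a toy example, not Snort's table.

```python
import math

# Full-matrix encoding: a |S| x |alphabet| array in which entry (i, j) directly
# holds the next state, so each transition costs a single table lookup.

def footprint_bits(num_states, alphabet_size=256):
    """alphabet * |S| * ceil(log2 |S|) bits, per the formula in the text."""
    return alphabet_size * num_states * math.ceil(math.log2(num_states))

matrix = [[0, 1, 0, 0],      # row i = state s_i, column j = input symbol j
          [0, 1, 1, 0]]      # toy 2-state DFA over a 4-symbol alphabet

def next_state(state, symbol):
    return matrix[state][symbol]
```

The fast lookup is exactly what makes this encoding the default choice for well-behaved traffic, at the cost of the large footprint computed by `footprint_bits`.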
An alternative approach is to implement an AC automaton using the concept of
failure transitions. In such implementations, only part of the outgoing transitions from
each state are stored explicitly. While traversing the automaton, if the transition from
state s with symbol x is not stored explicitly, one will take the failure transition from
s to another state s′ and look for an explicit transition from s′ with x. This process is
repeated until an explicit transition with x is found, resulting in failure paths. Naturally,
since only part of the transitions are stored explicitly, these implementations (sometimes
referred to as AC NFAs) are more compact, but incur higher processing time. A classical
result states that the longest failure path is at most the size of the longest pattern,
and that, regardless of the traffic pattern, the total number of transitions (failure and
explicit) is at most twice the number of symbols. This result does not take into account
the representation of each single state, which determines the time it takes to figure out
whether an explicit rule exists or not.
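A sketch of a single traversal step with failure transitions; the tiny goto/fail tables encode the hypothetical patterns "ab" and "bc".

```python
# Failure-transition traversal sketch: only explicit transitions are stored;
# on a miss, failure links are followed until a state that handles the symbol
# (or the root) is reached.

def step(state, symbol, goto, fail, root=0):
    # goto: dict (state, symbol) -> state, holding only explicit transitions
    while (state, symbol) not in goto and state != root:
        state = fail[state]                 # walk the failure path
    return goto.get((state, symbol), root)

# states: 0 = root, 1 = "a", 2 = "ab", 3 = "b", 4 = "bc"
goto = {(0, 'a'): 1, (0, 'b'): 3, (1, 'b'): 2, (3, 'c'): 4}
fail = {1: 0, 2: 3, 3: 0, 4: 0}

print(step(2, 'c', goto, fail))   # from "ab", reading 'c': fail to "b", reach "bc"
```

Here only 4 explicit transitions are stored instead of a full 5 x |Σ| matrix, illustrating the compactness/processing-time trade-off described above.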
We use the following definitions regarding this encoding: Let the label of a state
s, denoted by L(s), be the concatenation of symbols along the path from the root to
s. Furthermore, let the depth of a state s be the length of the label L(s). The failure
transition from s is always to the state s′ whose label L(s′) is the longest proper suffix of L(s)
among the labels of all other DFA states. This implies the following property of the Aho-Corasick
DFA:
Property 1. If L(s′) is a suffix of L(s) then there is a failure path (namely, a path
comprised only of failure transitions) from state s to state s′.
The DFA is traversed starting from the root. When the traversal goes through an
accepting state, it indicates that some patterns are a suffix of the input; one of these
patterns always corresponds to the label of the accepting state. Formally, we denote
by s.output the set of patterns matched by state s; if s is not an accepting state then
s.output = ∅. Finally, we denote by scan(s, b), the AC procedure when reading input
symbol b while in state s; namely, transiting to a new state s′ after traversing failure
transitions and a forward transition as necessary, and reporting matched patterns in
case s′.output ≠ ∅. scan(s, b) returns the new state s′ as its output. The correctness of
the AC algorithm essentially stems from the following simple property:
Property 2. Let b1, . . . , bn be the input, and let s0, s1, . . . , sn be the sequence of states the
AC algorithm goes through while scanning the symbols one by one (s0 is the root of the
DFA, and si is the state reached after scanning bi). For any i ∈ {0, . . . , n}, L(si) is a suffix
of b1, . . . , bi; furthermore, it is the longest such suffix among all states of the DFA.
There are other encodings that require more than one memory access, but offer
significant memory reduction. Several such encodings exist in the literature [29, 34, 88].
Fig. 2.1(d) depicts one such alternative, as suggested in [34]; this encoding is based on
a compressed automaton as depicted in Figure 2.1(c).
The construction of AC’s DFA is done in two phases. First, the algorithm builds a
trie of the pattern set: All the patterns are added from the root as chains, where each
state corresponds to a single symbol. When patterns share a common prefix, they also
share the corresponding set of states in the trie. In the second phase, additional edges
are added to the trie. These edges deal with situations where the input does not follow
the current chain in the trie (that is, the next symbol is not an edge of the trie) and
therefore we need to transit to a different chain. In such a case, the edge leads to a
state corresponding to a prefix of another pattern, which is equal to the longest suffix
of the previously matched symbols.
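The two construction phases can be sketched compactly in Python; this is a textbook-style Aho-Corasick with failure links (rather than the fully resolved DFA), shown on the pattern set used in Fig. 2.1.

```python
from collections import deque

# Phase 1 builds the trie of the patterns; phase 2 adds the extra transitions
# via the classic BFS computation of failure links. The full DFA is obtained
# by resolving failures eagerly, which this sketch does lazily during scanning.

def build_ac(patterns):
    goto, out, fail = [{}], [set()], [0]
    for p in patterns:                       # phase 1: the pattern trie
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); out.append(set()); fail.append(0)
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    q = deque(goto[0].values())              # phase 2: failure links (BFS)
    while q:
        s = q.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:    # longest suffix with a transition
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]           # inherit matches from the suffix
            q.append(t)
    return goto, fail, out

def ac_scan(text, goto, fail, out):
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)
```

For example, `ac_scan("XBEY", *[x for x in build_ac(["E", "BE"])])` reports both "BE" and its suffix pattern "E" at the same end position, matching the output-set semantics described in the text.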
It is sometimes useful to look at the DFA as a directed graph whose vertex set is S,
with an edge from s1 to s2 labeled x if and only if δ(s1, x) = s2. The
input is inspected one symbol at a time: Given that the algorithm is in some state s ∈ S
and the next symbol of the input is x ∈ Σ, the algorithm applies δ(s, x) to get the next
state s′. If s′ is in F (that is, an accepting state) the algorithm indicates that a pattern
was found. In any case, it then transits to the new state s′.
-
2.2. COMPRESSED WEB-TRAFFIC 23
We use the following simple definitions to capture the meaning of a state s ∈ S:
The depth of a state s, denoted depth(s), is the length (in edges) of the shortest path
between s0 and s. The label of a state s, denoted label(s), is the concatenation of the
edge symbols along the shortest path from s0 to s. Further, for every i ≤ depth(s),
suffix(s, i) ∈ Σ∗ (respectively, prefix(s, i) ∈ Σ∗) is the suffix (prefix) of length i of
label(s). The code of a state s, denoted code(s), is the unique number that is associated
with the state, i.e., the number that encodes the state. Traditionally, this number is
chosen arbitrarily; in this work we take advantage of this degree of freedom.
We use the following classification of DFA transitions (cf. [85]):
• Forward transitions are the edges of the trie; each forward transition links a
state of some depth d to a state of depth d + 1.
• Cross transitions are all other transitions. Each cross transition links a state of
depth d to a state of depth d′ where d′ ≤ d. Cross transitions to the initial state
s0 are also called failure transitions, and cross transitions to states of depth 1
are also called restartable transitions.
2.2 Compressed Web-Traffic
This section provides an overview of the main techniques that are used to compress
Web traffic in the Internet.
2.2.1 Gzip Compression
HTTP 1.1 [19] supports the usage of content-codings to allow a document to be com-
pressed. The RFC suggests three content-codings: gzip, compress and deflate. In fact,
gzip uses deflate as its underlying compression protocol. For the purpose of this the-
sis they are considered the same. Gzip and deflate are the codings commonly
supported by current browsers and Web servers (analyzing captured packets from the
latest versions of the Internet Explorer, FireFox and Chrome browsers shows that
these browsers accept only the gzip and deflate codings).
The gzip algorithm uses a combination of the following compression techniques: first
the text is compressed with the LZ77 algorithm and then the output is compressed with
the Huffman coding. Let us elaborate on the two algorithms:
-
24 CHAPTER 2. BACKGROUND
Figure 2.2: Example of LZ77 compression on the beginning of the Yahoo! homepage. (a) Original; (b) after LZ77 compression.
LZ77 Compression The purpose of LZ77 [100] is to reduce the string presenta-
tion size, by spotting repeated strings within the last 32KB of the uncompressed data.
The algorithm replaces a repeated string by a backward-pointer consisting of a (dis-
tance, length) pair, where distance is a number in [1, 32768] (32K) indicating the backward
distance in bytes to the earlier occurrence of the string, and length is a number in [3, 258]
indicating the length of the repeated string. For example, the text ‘abcdeabc’ can be
compressed to ‘abcde(5,3)’; namely, “go back 5 bytes and copy 3 bytes from that point”.
LZ77 refers to such a pair as a “pointer” and to uncompressed bytes as “literals”.
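To make the pointer semantics concrete, the following minimal Python sketch decodes a stream of literals and (distance, length) pairs. The token-stream representation is our simplification; in real deflate the tokens are themselves Huffman-coded, as described next.

```python
def lz77_decompress(tokens):
    """Decode a stream of literal bytes and (distance, length) back-pointers."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):       # back-pointer: (distance, length)
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])   # byte-by-byte copy: a pointer may
        else:                            # overlap the bytes it produces
            out.append(tok)              # literal byte
    return "".join(out)
```

For the example above, `lz77_decompress(list("abcde") + [(5, 3)])` yields ‘abcdeabc’.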
Fig. 2.2 depicts an example extracted from the ‘Yahoo!’ home page after LZ77
compression. Note that decompression is relatively cheap in time, since it reads and
copies sequential data blocks; it thus relies on spatial locality and requires only a few
memory references.
Huffman Coding The second algorithm used by gzip is the Huffman coding. This
method works on a character-by-character basis, transforming each 8-bit character into a
variable-size codeword; the more frequent the character, the shorter its corresponding
codeword. The codewords are chosen such that no codeword is a prefix of another, so the
end of each codeword can be easily determined. Dictionaries are provided to facilitate
the translation of binary codewords to bytes.
In the gzip format, Huffman coding encodes both ASCII characters (that is, literals)
and pointers into codewords using two dictionaries: one for the literals and the pointer
lengths, and the other for the pointer distances. Huffman coding may use either fixed or
dynamic dictionaries, where the latter achieves a better compression ratio. The Huffman dictionaries for
the two alphabets appear immediately after the header bits and prior to the compressed
data.
-
A common implementation of Huffman decoding (cf. zlib [17]) uses two levels of
lookup tables. The first level stores all codewords of length at most 9 bits in a
table of 2^9 = 512 entries that represents all possible 9-bit inputs; each entry holds the
symbol value and its actual length. If a symbol exceeds 9 bits, there is an additional reference to
a second lookup table. Thus, in most of the cases, decoding a symbol requires only a
single memory reference, while for the less frequent symbols it requires two.
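The two-level table can be sketched as follows. We use a 3-bit first level instead of zlib's 9 bits to keep the example small, a toy prefix-free codebook of our own, and a simple prefix scan in place of a real second-level table; only the structure of the lookup is meant to mirror the description above.

```python
K = 3  # first-level index width; zlib uses 9

def build_tables(codebook):
    """codebook maps a prefix-free bit-string codeword to a symbol.
    Codewords of at most K bits fill every first-level slot they prefix;
    longer codewords leave a marker that redirects to the second level."""
    first = [None] * (1 << K)
    second = {}
    for code, sym in codebook.items():
        if len(code) <= K:
            pad = K - len(code)
            base = int(code, 2) << pad
            for i in range(1 << pad):            # every K-bit input that
                first[base + i] = (sym, len(code))   # starts with this code
        else:
            first[int(code[:K], 2)] = (None, K)  # marker: second lookup needed
            second[code] = sym
    return first, second

def decode(bits, first, second):
    out, pos = [], 0
    while pos < len(bits):
        chunk = bits[pos:pos + K].ljust(K, "0")
        sym, length = first[int(chunk, 2)]
        if sym is None:                          # infrequent long codeword:
            for code, s in second.items():       # one extra lookup
                if bits.startswith(code, pos):
                    sym, length = s, len(code)
                    break
        out.append(sym)
        pos += length
    return out
```

With the codebook {0 → a, 10 → b, 110 → c, 1110 → d, 1111 → e}, short codewords resolve in one table access, while d and e need the second lookup.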
2.2.1.1 Challenges in performing DPI on Compressed HTTP
While transparent to the end-user, compressed Web traffic needs special care by bump-
in-the-wire devices that reside between the server and the client and perform DPI. The
device first needs to decompress the data in order to inspect its payload, since there is no
apparent “easy” way to perform DPI over compressed traffic without decompressing the
data in some way. This is mainly because LZ77 is an adaptive compression algorithm,
namely the text represented by each symbol is determined dynamically by the data.
As a result, the same substring is encoded differently depending on its location within
the text. For example, the pattern ‘abcdef’ can be expressed in the compressed data as
abcde ∗^j (j + 5, 5) f, where ∗^j stands for any j uncompressed bytes, for all possible j < 32763.
One of the main problems with the decompression is its memory requirement; the
straightforward approach requires a 32KB sliding window for each connection. Note
that this requirement is difficult to avoid, since the back-reference pointer can refer to
any point within the sliding window and the pointers may be recursive (i.e., a pointer
may point to an area with a pointer). As opposed to compressed traffic, DPI of non-
compressed traffic requires storing only a two- or four-byte variable that holds the corre-
sponding DFA state, aside from the DFA itself, which is of course stored in any case. Hence,
dealing with compressed traffic poses a significantly higher memory requirement, by a
factor of 8,000 to 16,000. Thus, a mid-range firewall that handles 100K-200K concurrent
connections (like GTA’s G-800 [12], SonicWall’s Pro 3060 [13] or Stonesoft’s StoneGate
SG-500 [14]), needs 3GB-6GB memory while a high-end firewall that supports 500K-
10M concurrent connections (like the Juniper SRX5800 [15] or the Cisco ASA 5550 or
5580 [11]) would need 15GB-300GB memory only for the task of decompression. This
memory requirement not only makes the architecture expensive or outright infeasible,
but also limits the ability to perform caching or to use fast memory chips such as SRAM.
Hence, reducing the space also boosts the speed, because it makes faster memory
technology, such as SRAM, a viable option. This work deals with the
challenges imposed by this space aspect.
Apart from the space penalty described above, the decompression stage also in-
creases the overall time penalty. However, we note that DPI requires significantly more
time than decompression, since decompression is based on reading consecutive mem-
ory locations and therefore enjoys the benefit of the cache block architecture and has a low
per-byte read cost, whereas DPI employs a very large data structure that is accessed
by reads to non-consecutive memory areas and therefore requires expensive main-memory
accesses. In [36] we provided an algorithm that takes advantage of information gathered
by the decompression phase in order to accelerate the commonly used Aho-Corasick pat-
tern matching algorithm. By doing so, we significantly reduced the time requirement
of the entire DPI process on compressed traffic.
2.2.2 SDCH Compression
2.2.2.1 The SDCH Framework
SDCH is a new compression mechanism proposed by Google Inc. In SDCH, a dictionary
is downloaded (as a file) by the user agent from the server. The dictionary contains
strings which are likely to appear in subsequent HTTP responses. If, for example, the
header, footer, JavaScript and CSS are stored in a dictionary possessed by both user
agent and server, the server can construct a delta file by substituting these elements with
references to the dictionary, and the user agent can reconstruct the original page from
the delta file using these references. By substituting dictionary references for repeated
elements in HTTP responses, the payload size is reduced, as cross-payload redundancy
is eliminated. In order to use SDCH, the user agent adds the label SDCH in
the Accept-Encoding field of the HTTP header. The scope of a dictionary is specified
by the domain and path attributes; thus, one server may have several dictionaries, and
the user agent must hold the specific dictionary used by the server in order to decompress
its compressed traffic. If the user agent already has a dictionary from the negotiated
server, it adds the dictionary id as a value to the header Avail-Dictionary. If the user
agent does not have the specific dictionary that was used by the server, the server sends
an HTTP response with the header Get-Dictionary and the dictionary path; now, the
user agent can construct a request to get the dictionary.
-
2.3. COMPLEXITY ATTACK 27
2.2.2.2 The VCDIFF Compression Algorithm
SDCH encoding is built upon the VCDIFF compression data format. The VCDIFF
encoding process uses three types of instructions, called delta instructions: add, run and copy.
add(i, str) means to append to the output i bytes, which are specified in parameter
str. run(i, b) means to append i times the byte b. Finally, copy(p, x) means that
the interval [p, p + x) should be copied from the dictionary (that is, x bytes starting
at position p). The delta file contains the list of instructions with their arguments and
the dictionary is one long string composed of the characters that can be referenced
by the copy instructions in the delta file. In the rest of this thesis, we ignore the
run instruction since it is barely used and can be replaced with an equivalent add for
our purposes.
For example, suppose that the dictionary is DBEAACDBCABC, and the delta file is
given by the following commands:
1. add (3,ABD)
2. copy (0,5)
3. add (1,A)
4. copy (4,5)
5. add (2,AB)
6. copy (9,3)
7. add (4,AACB)
8. copy (5,3)
9. add (1,A)
10. copy (6,3)
The plain-text that should be considered is therefore (bolded bytes were copied
from the dictionary):
ABDDBEAAAACDBCABABCAACBCDBADBC
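The delta application can be sketched directly from these definitions. The tuple representation of the instruction list is our own; as above, run is omitted.

```python
def apply_delta(dictionary, delta):
    """Replay add/copy delta instructions against the shared dictionary."""
    out = []
    for op, a, b in delta:
        if op == "add":                  # add(i, str): append i literal bytes
            assert len(b) == a           # i must equal the literal's length
            out.append(b)
        else:                            # copy(p, x): append dictionary[p:p+x]
            out.append(dictionary[a:a + b])
    return "".join(out)

dictionary = "DBEAACDBCABC"
delta = [("add", 3, "ABD"), ("copy", 0, 5), ("add", 1, "A"), ("copy", 4, 5),
         ("add", 2, "AB"), ("copy", 9, 3), ("add", 4, "AACB"), ("copy", 5, 3),
         ("add", 1, "A"), ("copy", 6, 3)]
```

Applying the ten instructions above to the dictionary reproduces the plain-text ABDDBEAAAACDBCABABCAACBCDBADBC.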
2.3 Complexity attack
In a complexity attack, the attacker exploits the system’s worst-case performance, which
differs from the average case that the system was designed for. Crosby and Wallach were
among the first to demonstrate the phenomenon on the commonly-used Open Hash data
structure [43]: an attacker designs an input that requires O(n) elementary operations
per insertion, instead of O(1) operations that are required on the average.
Recent works show that many other systems and algorithms are vulnerable to com-
plexity attacks including QuickSort [70], regular expression matcher [79], intrusion de-
tection systems [34, 48, 82], the Linux route-table cache [92], SSL authentication al-
gorithm [40], and the retransmission algorithm in wireless networks [31]. Complexity
attacks on different components of NIDS/NIPS were suggested in the past. For exam-
ple, Bro maintains a hash table with the IP header fields of packets as keys; thus, by
tailoring the traffic with specific headers, one can cause the hash insert-operation to
last significantly longer, resulting in Bro's failure. While in some cases modifying the
algorithm suffices to mitigate the problem (e.g., Crosby and Wallach’s attack can be
solved by using hash functions that are not known to the attacker), this does not hold
in general.
-
Chapter 3
CompactDFA: Generic State
Machine Compression for Scalable
Pattern Matching
In this chapter we propose a novel method to compress deterministic finite automata
(DFA), the common data structure for DPI. Compressing the DFA enables storing it
in a faster memory, which in turn yields a significant performance boost.
Related background for pattern matching using DFA is provided in Section 2.1. Related
work is in Section 1.3.1.
3.1 The CompactDFA Scheme
In this section we present our CompactDFA scheme. We begin by describing the scheme's
output, namely a compact encoding of the DFA, and continue with the algorithm
and the intuition behind it.
3.1.1 CompactDFA Output
A straightforward encoding of the Aho-Corasick DFA is to store the set of rules (one
rule for each transition) with the following fields:
Current state field | Symbol field | Next state field
The output of the CompactDFA scheme is a set of compressed rules, such that there
is only one rule per state. This is achieved by cleverly choosing the code of the states.
-
30 CHAPTER 3. COMPACTDFA
Unlike traditional AC-like algorithms, in our compact DFA each rule has the following
structure:
Set of current states | Symbol field | Next state code
The set of current states of each rule is written in a prefix style, i.e., the rule captures
all states whose code matches a specific prefix. Specifically, for each state s, let N(s)
be the incoming neighborhood of s, namely all states that have an edge to s. For every
state s ∈ S, we have one rule whose current-state field is the common prefix of the codes
of the states in N(s) and whose next state is s. Note that the symbol that transfers the
states in N(s) to s is the same for all states in N(s), due to the properties of AC-like
algorithms (see Property 2 in Section 3.1.3).
Fig. 3.1(c) shows the rules produced by CompactDFA on the DFA of Fig. 3.1(a).
For example, Rule 5 in Fig. 3.1(c), which is 〈010**, D, 11010(s11)〉, is the compressed
rule for next state s11 and it replaces three original rules: 〈01000(s3), D, 11010(s11)〉,
〈01001(s5), D, 11010(s11)〉, and 〈01010(s10), D, 11010(s11)〉.
In the compressed set of rules, the code of a state may match multiple rules. As in
forwarding tables in IP networks, the rule with the Longest Prefix Match (LPM)
determines the action. In our example, this is demonstrated by looking at Rules 6 and
10 in Fig. 3.1(c). Suppose that the current state is s8, whose code is 00001, and the
symbol is A. Then, Rule 10 is matched, since the current state's code matches the prefix 00***. In
addition, Rule 6, with current state 000**, is also matched. According to the longest
prefix match rule, Rule 6 determines the next state.
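The longest-prefix-match lookup can be sketched as follows, using Rules 5, 6 and 10 of the example; a linear scan is used for clarity, whereas real LPM engines, as in IP forwarding, use TCAMs or tries.

```python
def lookup(rules, state_code, symbol):
    """Return the next-state code of the matching rule whose current-state
    pattern has the longest explicit (non-wildcard) prefix."""
    def matches(pattern, code):
        return all(p in ("*", c) for p, c in zip(pattern, code))
    best = None
    for pattern, sym, nxt in rules:
        if sym == symbol and matches(pattern, state_code):
            plen = len(pattern.rstrip("*"))   # explicit prefix length
            if best is None or plen > best[0]:
                best = (plen, nxt)
    return best[1] if best else None

# Rules 5, 6 and 10 of Fig. 3.1(c): (set of current states, symbol, next state)
rules = [("010**", "D", "11010"),   # rule 5:  next state s11
         ("000**", "A", "11000"),   # rule 6:  next state s9
         ("00***", "A", "10100")]   # rule 10: next state s7
```

For current-state code 00001 and symbol A, both Rules 6 and 10 match, and Rule 6 wins with the longer prefix, giving next state 11000 (s9), as in the example above.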
3.1.2 CompactDFA Algorithm
This section describes the encoding algorithm of CompactDFA and gives the intuition
behind each of its three stages: State Grouping (Algorithm 1, Section 3.1.4), Common
Suffix Tree Construction (Algorithm 2, Section 3.1.5), and State and Rule Encoding
(Algorithm 3, Section 3.1.6).
The first stage of our algorithm is based on the following insight: Suppose that each
state s is encoded with its label; our goal is to encode with a single rule the incoming
neighborhood N(s), which should appear in the first field of the rule corresponding to
next state s. Note that the labels of all states in N(s) share a common suffix, which
is the label of s without its last symbol. Thus, by assigning code(N(s)) to be label(s)
without its last symbol, padded with “don’t care” symbols in its beginning, and applying
-
3.1. THE COMPACTDFA SCHEME 31
Figure 3.1: A toy example. (a) Aho-Corasick DFA for the patterns {EBC, EBBC, BA, BBA, BCD, CF}; failure and restartable transitions are omitted for clarity. (b) The Common Suffix Tree. (c) The rules of the compressed DFA:

Rule | Current state | Symbol | Next state
  1  |     00000     |   C    | 01001 (s5)
  2  |     00010     |   C    | 01000 (s3)
  3  |     00010     |   B    | 00000 (s4)
  4  |     10010     |   B    | 00010 (s2)
  5  |     010**     |   D    | 11010 (s11)
  6  |     000**     |   A    | 11000 (s9)
  7  |     01***     |   F    | 11100 (s13)
  8  |     00***     |   C    | 01010 (s10)
  9  |     00***     |   B    | 00001 (s8)
 10  |     00***     |   A    | 10100 (s7)
 11  |     *****     |   E    | 10010 (s1)
 12  |     *****     |   C    | 01100 (s12)
 13  |     *****     |   B    | 00011 (s6)
 14  |     *****     |   *    | 10000 (s0)
-
a longest suffix match rule, one correctly captures the transitions of the DFA.
For example, consider Fig. 3.1(a). The code of state s7 is BA. N(s7) = {s6, s2},
label(s6) = B and label(s2) = EB; their common suffix is B, and indeed the code of
N(s7) is “***B”. On the other hand, code(N(s9)) = code({s4, s8}) = “**BB”; thus, if
the current state is s4, whose label is EBB, and the symbol is A, the next state is s9
whose corresponding rule has longer suffix than the rule corresponding to s7.
As demonstrated above, the longest suffix match rule should be applied to resolve
conflicts when more than one rule is matched. Intuitively, this encoding is correct since
all incoming edges to a state s (alternatively, all edges from N(s)) share the same suffix,
which is code(N(s)). Moreover, a cross transition edge from a state s with symbol x
always ends up at a state s′ whose label is the longest suffix (among all state labels) of
the concatenation of label(s) with x.
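A small sketch of this label-based encoding: computing code(N(s)) as the padded common suffix, and resolving a transition by the longest suffix match. The labels and the width of 4 follow the toy example; the helper names are ours.

```python
def suffix_code(labels, width=4):
    """code(N(s)): the common suffix of the labels of N(s), left-padded
    with '*' (don't care) to a fixed width."""
    suf = labels[0]
    for label in labels[1:]:
        while not label.endswith(suf):
            suf = suf[1:]                # shrink until common to all labels
    return "*" * (width - len(suf)) + suf

def longest_suffix_match(codes, label, width=4):
    """Among the codes matching `label` right-aligned, pick the one with
    the longest explicit (non-wildcard) suffix."""
    padded = ("*" * width + label)[-width:]
    candidates = [c for c in codes
                  if all(r in ("*", x) for r, x in zip(c, padded))]
    return max(candidates, key=lambda c: len(c.lstrip("*")))
```

Here `suffix_code(["B", "EB"])` yields "***B" (the code of N(s7)) and `suffix_code(["EBB", "BB"])` yields "**BB" (the code of N(s9)); for the label EBB both codes match, and "**BB" wins with the longer suffix, sending s4 to s9 as in the example.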
However, this code is, first and foremost, extremely wasteful (and thus impractical),
requiring a 32-bit code for the automaton of Fig. 3.1(a) (namely, to encode 4-byte labels)
and hundreds of bits for Snort's DFA. In addition, it uses a longest suffix match