The Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science
High Performance Deep Packet Inspection
Thesis submitted for the degree of Doctor of Philosophy
by
Yaron Koral
This work was carried out under the supervision of
Professor Yehuda Afek and Doctor Anat Bremler-Barr
Submitted to the Senate of Tel Aviv University
September 2012
-
© 2012
Copyright by Yaron Koral
All Rights Reserved
-
This work is dedicated to the pursuit of a safe and secure world.
-
Acknowledgements
First and foremost, I would like to thank my advisors, Yehuda Afek and Anat
Bremler-Barr, for their continued support and guidance throughout my Ph.D. I have
learned a lot from you whether in doing research, writing papers or giving presentations.
Above all you taught me how to walk in the world of science and think sharply.
I had the pleasure of working with the following people: David Hay, Yotam Harchol, Shimrit Tzur-David and Victor Zigdon. I thank you for your companionship and support. Working with you was both enriching and a great delight.
Last, and most importantly, I thank my family: my beloved wife Keren; my charming kids Omer, Ofri, Romi and Yarden; and my parents Akiva and Rahel for their unfailing love, encouragement and support.
The work in this thesis was partially supported by the European Research Council
under the European Union’s Seventh Framework Programme (FP7/2007-2013)/ERC
Grant agreement no 259085.
-
Abstract
Deep packet inspection (DPI) is a form of network packet filtering that searches a packet’s content, including its headers, data-protocol structures, and message payload, for the presence of certain patterns. It enables advanced network management, user service, and security functions, as well as Internet data mining, eavesdropping, and censorship. It is currently used by enterprises, service providers, and governments in a wide range of applications.
DPI may be implemented by a wide range of pattern matching algorithms. The general problem of pattern matching is considered fundamental in computer science and has been researched thoroughly over recent decades. Still, when applied to today’s network domain, the traditional algorithms fail to meet current challenges. The first challenge is the continual increase in Internet traffic rates, which requires a design that is scalable in terms of both speed and memory usage. The second challenge arises from the increase in Web traffic compression, driven by the growing popularity of Web surfing over mobile devices. The security device is forced to decompress this traffic prior to inspection, incurring processing and space penalties. The third challenge is the requirement for a solution that is resilient to attacks that overload the security device. We address these challenges here. Moreover, we apply several technological advances to boost the performance of the traditional algorithms, including, for example, the presence of Ternary Content Addressable Memory (TCAM) elements in network devices and the availability of multi-core platforms for the DPI task.
The work presented in this thesis focuses on DPI algorithms and techniques that
relate to network security elements. In Chapter 3, we provide an algorithm for a scalable
design of a DPI engine. Our design reduces the problem of pattern matching to the
well-studied problem of Longest Prefix Match (LPM), which can be solved either in
TCAM, in IP-lookup chips, or in software.
Next we deal with the challenge of DPI over compressed traffic. Chapters 4 and 5
focus on reducing the space and time penalties resulting from the compressed traffic.
These works show that, by using the meta-data generated during the compression stage,
pattern matching over compressed traffic can be accelerated significantly as compared
to traditional pattern matching over non-compressed traffic, and that the space penalty
can be reduced by a factor of six as compared to current designs. Chapter 6 introduces an algorithm for scanning traffic compressed with SDCH, the compression scheme used by Google. Our design gains a performance boost of over 40%.
Finally, we address the challenge of performing DPI when the system is under a denial-of-service attack mounted via algorithmic complexity. We provide a system design that takes advantage of commercial multi-core platforms to efficiently mitigate complexity attacks of varying intensity.
The algorithms and techniques presented in this thesis provide a suitable DPI solution that confronts today’s network challenges.
-
Contents
1 Introduction 1
1.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Overview of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 CompactDFA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 SOP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 SPC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.4 SDCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.5 MCA2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.1 DFA Compression . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.2 Compressed Web-Traffic . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.3 DPI Using Multi-Core Platforms . . . . . . . . . . . . . . . . . . 16
1.3.4 Denial-of-Service Mitigation . . . . . . . . . . . . . . . . . . . . . 17
1.4 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 Background 19
2.1 DFA based Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Compressed Web-Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Gzip Compression . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 SDCH Compression . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Complexity attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 CompactDFA 29
3.1 The CompactDFA Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 CompactDFA Output . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.2 CompactDFA Algorithm . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.3 The Aho-Corasick Algorithm-like Properties . . . . . . . . . . . 32
3.1.4 Stage I: State Grouping . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.5 Stage II: Common Suffix Tree . . . . . . . . . . . . . . . . . . . . 36
3.1.6 Stage III: State and Node Encoding . . . . . . . . . . . . . . . . 38
3.2 CompactDFA for total memory minimization . . . . . . . . . . . . . . 40
3.3 CompactDFA for DFA with strides . . . . . . . . . . . . . . . . . . . . . 41
3.4 Implementing CompactDFA using IP-lookup Solutions . . . . . . . . . . 43
3.4.1 Implementing CompactDFA with non-TCAM IP-lookup solutions 44
3.4.2 Implementing CompactDFA with TCAM . . . . . . . . . . . . . 45
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Space Efficient DPI of Compressed Web Traffic 57
4.1 SOP Packing technique . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.1 Buffer Packing: Swap Out of boundary Pointers (SOP) . . . . . . 58
4.1.2 Huffman Coding Scheme . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.3 Unpacking the Buffer: Gzip Decompression . . . . . . . . . . . . 63
4.2 Combining SOP with ACCH algorithm . . . . . . . . . . . . . . . . . . . 64
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Experimental Environment . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.3 Space and Time Results . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.4 Time Results Analysis . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.5 DPI of Compressed Traffic . . . . . . . . . . . . . . . . . . . . . . 73
5 Shift-based Pattern Matching for Compressed Traffic 75
5.1 The Modified Wu-Manber Algorithm . . . . . . . . . . . . . . . . . . . . 75
5.2 Shift-based Pattern matching for Compressed traffic (SPC) . . . . . . . 78
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.2 Pattern Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3.3 SPC Characteristics Analysis . . . . . . . . . . . . . . . . . . . . 83
5.3.4 SPC Run-Time Performance . . . . . . . . . . . . . . . . . . . . 85
5.3.5 SPC Storage Requirements . . . . . . . . . . . . . . . . . . . . . 86
6 Decompression-Free Inspection 88
6.1 Our Decompression-Free algorithm . . . . . . . . . . . . . . . . . . . . . 88
6.1.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.2 Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.3 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.1.4 Dealing with Gzip over SDCH . . . . . . . . . . . . . . . . . . . . 97
6.2 Regular Expressions Inspection . . . . . . . . . . . . . . . . . . . . . . . 97
6.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7 MCA2 104
7.1 Snort Cache-Miss Complexity Attack . . . . . . . . . . . . . . . . . . . . 104
7.2 The MCA2 System Description . . . . . . . . . . . . . . . . . . . . . . . 106
7.2.1 MCA2 Design overview . . . . . . . . . . . . . . . . . . . . . . . 106
7.2.2 Cross-Thread Communication Mechanism . . . . . . . . . . . . . 109
7.2.3 Thread Allocation Scheme . . . . . . . . . . . . . . . . . . . . . . 111
7.2.4 Flow Affinity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3 MCA2 for Cache-Miss Attacks . . . . . . . . . . . . . . . . . . . . . . . . 113
7.4 MCA2 for Active-States Attacks . . . . . . . . . . . . . . . . . . . . . . . 116
7.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.5.1 Experimental Environment . . . . . . . . . . . . . . . . . . . . . 118
7.5.2 Cache-Miss Attack Simulation Results . . . . . . . . . . . . . . . 120
7.5.3 Active-State Attack Simulation Results . . . . . . . . . . . . . . 123
8 Conclusion 124
Bibliography 125
-
List of Tables
3.1 Statistics of the pattern sets used in Section 3.5 . . . . . . . . . . . . . . 52
3.2 Summary of experimental results for Snort and ClamAV pattern sets . . 54
4.1 Comparison of Time and Space parameters of different algorithms . . . . 69
4.2 Overview of pattern matching with gzip processing . . . . . . . . . . . . 73
5.1 Storage Requirements (KB) . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1 Step by step execution of our algorithm on the example of Section 6.1.1 92
7.1 No-drop setting parameters . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2 The non-common states ratio . . . . . . . . . . . . . . . . . . . . . . . . 116
7.3 Validation of the thread allocation model of Section 7.2.3 . . . . . . . . 122
-
List of Figures
1.1 The goodput of MCA2 for different attack intensities . . . . . . . . . . . 13
2.1 Example of an Aho-Corasick DFA and methods to store it in memory . 20
2.2 LZ77 example on Yahoo! home page . . . . . . . . . . . . . . . . . . . . 24
3.1 Aho-Corasick DFA toy example . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Illustration of the intra-flow interleaving on a single packet . . . . . . . . 48
3.3 Expansion factor under Truncated CompactDFA . . . . . . . . . . . . . 53
3.4 The distribution of the values C . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 The latency of inter-flow and intra-flow interleaving . . . . . . . . . . . . 56
4.1 Sketch of the gzip 32KB memory buffer . . . . . . . . . . . . . . . . . . 58
4.2 Sketch of the memory buffer in different scenarios . . . . . . . . . . . . . 64
4.3 Illustration of common terms . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Sketch of the memory buffer including the Status Vector . . . . . . . . . 68
4.5 HTTP Compression usage among the Alexa top-site lists . . . . . . . . . 70
5.1 MWM algorithm example . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Pointer scan procedure example . . . . . . . . . . . . . . . . . . . . . . . 81
5.3 Skipped Character Ratio (Sr) . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Normalized Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1 Example of an Aho-Corasick Automaton . . . . . . . . . . . . . . . . . . 89
6.2 The depth of first three states of each failure path . . . . . . . . . . . . . 101
6.3 Comparison between the scan-ratio and compression-ratio . . . . . . . . 102
6.4 Comparison when considering also regular expression matching . . . . . 103
7.1 The effects of a cache-miss attack . . . . . . . . . . . . . . . . . . . . . . 105
7.2 Illustration of MCA2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3 Sketch of a record in the bad packet queue . . . . . . . . . . . . . . . . . 110
7.4 Distribution of cache-misses under normal traffic and under attack. . . . 114
7.5 CDF of the percentage of normal traffic packets . . . . . . . . . . . . . . 115
7.6 The total system throughput for a different number of common states . 116
7.7 CDF of the percentage of mild attack . . . . . . . . . . . . . . . . . . . . 117
7.8 Distribution of maximal average number of active states . . . . . . . . . 119
7.9 Average throughput per thread over time . . . . . . . . . . . . . . . . . . 121
7.10 Goodput of Hybrid-FA and of Hybrid-FA with MCA2 full-drop setup . . 122
-
Chapter 1
Introduction
Deep packet inspection (DPI) consists of inspecting both the packet header and payload and alerting the system when signatures of malicious software appear in the traffic. These signatures are identified through pattern matching algorithms that are classified either as string matching, in which the patterns are a set of strings, or regular expression matching, in which the patterns are defined as regular expressions. DPI is a basic element in today’s security tools, such as Network Intrusion Detection/Prevention Systems (NIDS/NIPS) or Web application firewalls, which are used to detect malicious activities. Moreover, DPI and its corresponding pattern matching algorithms are also crucial building blocks for other networking applications such as traffic monitoring and HTTP load-balancing. Today, the performance of security tools is dominated by the speed of the underlying pattern matching algorithms [49].
Both string matching and regular expression matching are fundamental problems
in computer science and have been a topic of intensive research for decades. In what
follows, we provide a brief description of the main approaches to these problems and
explain why they are not adequate for contemporary needs.
The fundamental string matching paradigm derives from the Aho-Corasick (AC) [23] algorithm. This algorithm constructs a deterministic finite automaton (DFA) for detecting all occurrences of patterns from a given set by processing the input in a single pass, performing one state transition for each input byte. An alternative is the shift-based paradigm, which includes the Boyer-Moore (BM) [32] and the modified Wu-Manber (MWM) [94] algorithms. This paradigm aims at improving average-case performance by exploiting heuristics for skipping portions of the input, and it achieves sublinear performance on average.
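To make the automaton-based paradigm concrete, the following is a minimal, illustrative Python sketch of Aho-Corasick construction and scanning. It is not the thesis implementation, and the pattern set in the test below is hypothetical; it only shows the single-pass, one-transition-per-character behavior discussed above.

```python
from collections import deque

def build_ac(patterns):
    """Build an Aho-Corasick automaton: a trie plus failure links,
    where each state records the set of patterns ending there."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    # Failure links computed breadth-first: fail(t) is the longest
    # proper suffix of t's string that is also a trie node.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][ch] if ch in goto[f] and goto[f][ch] != t else 0
            out[t] |= out[fail[t]]
    return goto, fail, out

def scan(text, goto, fail, out):
    """Single pass over `text`: one automaton step per input character."""
    matches, s = [], 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

Note that the worst-case guarantee follows directly from the scan loop: every character advances the automaton, regardless of the input content.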
Likewise, regular expression matching also has two mainstream approaches; these are based on either a deterministic or a non-deterministic finite automaton (DFA or NFA). A DFA has superior runtime complexity of O(1), and thus constant time per input symbol, as compared with an O(n²) runtime complexity for an NFA. An NFA, on the other hand, has superior space complexity of O(n) linear space, as compared with O(2ⁿ) exponential space for a DFA. Complexity is calculated as a function of n, the length of the regular expression [54].
None of the aforementioned traditional solutions is suitable for coping with today’s
requirements, due to problems with scalability, compressed traffic, and resiliency. We
detail these problems below.
Scalability It is essential to increase the speed and reduce the memory requirements
of the pattern matching solutions. As DPI is performed in the critical path of packet
processing, current solutions must handle network speeds of 10–100 Gbps. Moreover,
the solutions must deal with thousands of patterns. For example, the ClamAV [4] virus-
signature database consists of 61K patterns, and the popular Snort NIDS [9] has more
than 30K patterns. Typically, the number of patterns considered by a NIDS grows dramatically over time. The size of the pattern database prohibits the use of a fast memory such as CPU cache or SRAM; thus, the memory requirement has a direct effect on time performance. Current research has focused on reducing the memory requirement by compressing the corresponding DFA [24, 45, 85, 88, 89]; however, all proposed techniques suggest pure-hardware solutions, which usually incur prohibitive deployment and development costs.
Compressed Traffic Scalable DPI solutions should support the increasing rates of Internet traffic. One method for supporting these rates at network servers is to compress the outgoing traffic, thereby transferring data more efficiently. This method is used today for compressing HTTP text when transferring pages over the Web. The sharply increasing number of compressed Web pages is largely motivated by the increase in Web surfing over mobile devices. Sites such as Yahoo!, Google, MSN, YouTube, Facebook and others use HTTP compression to enhance the speed of their content download. For example, in February 2012, W3Techs published a ranking breakdown report [10], which shows that 44.7% of Web sites compress their traffic; when focusing on the top 1,000 sites, a remarkable 83.4% compress their traffic. HTTP 1.1 has a
standard-based method of delivering this compressed content; therefore it is supported
by all modern browsers.
The presence of compressed traffic makes Internet traffic harder to analyze by the
DPI routine because doing so requires two time-consuming phases: traffic decompression
and pattern matching. Currently, most security tools fail to analyze compressed traffic.
In some cases they simply do not scan compressed traffic, thus compromising their
original goal of detecting malicious activities. In other cases, the security tools ensure
that there is no compressed traffic by rewriting the HTTP header between the original
client and server; this solution leads to a waste of bandwidth and a higher cost per bit.
Resiliency Increased traffic rates and compressed traffic are considered legitimate Internet phenomena. Still, NIDSs and firewalls, the security tools that protect against malicious users, naturally become a favored target for illegitimate phenomena such as denial-of-service attacks. A recent trend is a two-phase combined attack on security devices: the attackers first neutralize the device, for example, by overwhelming it with traffic, and then, once it has been knocked down, attack the assets it was protecting. For example, a recent attack on Sony combined a distributed denial-of-service (DDoS) attack with credit card theft [16]. Combined attacks usually affect NIDS and NIPS differently. In a NIDS, where the device operates in stealth mode, only monitoring the traffic and issuing alerts when it detects malicious activity, these DDoS attacks may force the device to stop inspecting part, or all, of the traffic, thereby allowing another attack to pass unnoticed. An in-line NIPS, on the other hand, because it inspects packets on their critical path, might be forced to drop legitimate traffic, in practice causing a denial of service on the servers it is supposed to protect. Bro and Snort, for example, are both vulnerable to this kind of attack [65].
1.1 Method
The research presented in this thesis draws on multiple methods from different disciplines. A major effort was invested in finding new algorithms and design approaches for DPI. We analyze the performance of the algorithms both in the average case, with normal routine traffic, and in the worst case, with traffic from malicious users. The normal traffic is obtained either from prepared data traces, such as the DARPA MIT traces in [5], or from traces collected in our simulation environment using live Internet traffic. The
worst-case traffic traces are usually synthesized with respect to the attack under consideration. Since the source code and pattern sets of most of today’s security tools are publicly available, an adversary may devise a tailored attack accordingly. Therefore, we consider both adaptive adversaries, which can observe the actions of our proposed scheme and respond accordingly in real time, and oblivious adversaries, which cannot.
In addition, we verify the performance of our proposed schemes with simulations and experiments. This is especially crucial for heuristics that have no theoretical performance guarantees. More specifically, we implement all our software solutions and test their performance using both synthetic and real pattern sets, including patterns and regular expressions from contemporary security applications such as Snort [9], ModSecurity [6], Bro [2] and ClamAV [4]. Our simulation environment includes multi-core platforms with different cache architectures, to test their influence on our proposed algorithms.
1.2 Overview of Results
The following subsections briefly describe five research results presented in this thesis.
1.2.1 CompactDFA: Scalable Pattern Matching using Longest Prefix
Match Solutions
Recently, much effort has been devoted to compressing the Aho-Corasick DFA in order
to improve the algorithm’s performance [28, 63, 64, 66, 76, 87–89, 93, 96]. Other works,
such as [77], focused on the construction of the above compressed DFAs using small
intermediate parsers. While most of these works either suggest dedicated hardware
solutions or introduce a non-constant, higher processing time, we present a generic DFA compression algorithm and show how to store the resulting DFA in off-the-shelf hardware. Our novel algorithm works on a large class of Aho-Corasick–like DFAs, whose
unique properties are defined in the following chapter. This algorithm reduces the rule
set to the minimum possible size: only one rule per state. A key observation is that in
prior works the state codes were chosen arbitrarily; we take advantage of this degree of
freedom and add information about the state properties in the state code. This allows
us to encode all transitions to a specific state by a single prefix that captures a set
of current states. Moreover, if a state matches more than one rule, the rule with the
longest prefix is selected. Thus, our scheme reduces the problem of pattern matching
to the well-studied problem of Longest Prefix Match (LPM).
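Once the automaton is encoded this way, each transition lookup becomes a longest-prefix-match query, the same primitive that IP routers solve. A toy sketch of that primitive follows; the rule table, state codes, and actions are invented for illustration, and a linear scan stands in for the TCAM or IP-lookup chip:

```python
def longest_prefix_match(rules, key):
    """Return the action of the longest prefix in `rules` matching `key`,
    emulating a TCAM in which entries are ordered longest-first and the
    first hit wins. The linear scan is for clarity only."""
    best_action, best_len = None, -1
    for prefix, action in rules.items():
        if key.startswith(prefix) and len(prefix) > best_len:
            best_action, best_len = action, len(prefix)
    return best_action

# Hypothetical table: prefixes over state-code bits -> next-state code.
# The empty prefix acts as the default (catch-all) rule.
rules = {"": "fail", "10": "s1", "1011": "s2"}
```

Because a more specific prefix always wins, one rule per state suffices: the prefix captures the whole set of current states that transition to it.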
The reduction in the number of rules comes with a small overhead in the number of bits used to encode a state. For example, a DFA based on Snort requires 17 bits to encode each state when state codes are chosen arbitrarily, while under our scheme it requires 36; for ClamAV’s DFA the code width increases from 22 bits to 59 bits.
In addition, we present two extensions to our basic scheme. Our first extension,
called CompactDFA for total memory minimization, aims at minimizing the product of
the number of rules and the code width, rather than only the number of rules. This
captures situations in which, at some point, the reduction in the number of rules is
not worth the additional bits in the state code. Specifically, for the above pattern
sets, this extension reduced the memory requirement by up to an additional 40%. The
second extension deals with variable-stride DFAs, which were created to speed up the
inspection process by inspecting more than one symbol at a time; like the first extension,
it requires more rules than the number of states.
One of the main advantages of CompactDFA is that it fits into commercially available IP-lookup solutions, implying that they may also be used for performing fast pattern matching. We demonstrate the power of this reduction by implementing the Aho-Corasick algorithm on an IP-lookup chip (as in [84]) and on a TCAM.
Specifically, in TCAM, each rule is mapped to one entry. Since TCAMs are configured with an entry width that is a multiple of 36 or 40 bits, minimizing the number of bits used to encode a state is less important, and the basic CompactDFA, which minimizes the number of rules, is more adequate.
We also deal with two obstacles that arise when using TCAMs: the power consumption and the latency induced by the pipeline in the TCAM chip, which is especially significant since CompactDFA works in a closed-loop manner (that is, the input for one lookup depends on the output of the previous lookup). To overcome the latency problem, we propose two kinds of interleaved execution (namely, inter-flow interleaving and intra-flow interleaving). We show that combining these executions provides low latency (on the order of a few tens of microseconds) at high throughput. We reduce the power consumption of the TCAM by taking advantage of the fact that today’s vendors partition the TCAM into blocks and allow, in every lookup, activation of only some of these blocks. We suggest dividing the rules among different blocks, each associated with a
different subset of symbols. Dividing the rules into blocks in this way reduces the number of bits required for encoding the symbol field of a rule to the logarithm of the number of symbols that are mapped to the same block.
The small memory requirement of the compressed rules and the low power consumption enable the use of multiple TCAMs simultaneously, where each performs pattern matching over different sessions or packets (namely, inter-flow interleaving). Furthermore, one can take advantage of the common multiprocessor architecture of contemporary security tools and design a high-throughput solution, applicable to the common case of multiple sessions/packets. Notice that while state-of-the-art TCAM chips are 5MB in size, high throughput may be achieved using multiple small TCAM chips. For the Snort pattern set, we achieve a throughput of 10Gbps and a latency of less than 60 microseconds by using 5 small TCAMs of 0.5MB each, and as much as 40Gbps (with the same latency) with 20 small TCAMs.
This work was published in the proceedings of Infocom 2010 [35].
1.2.2 Space-Efficient Deep Packet Inspection of Compressed
Web Traffic
Networking devices that perform deep packet inspection (DPI) over compressed traffic must first decompress the message in order to inspect its payload. Gzip compression, which is used for compressed Web traffic, replaces repeated strings with back references, denoted as pointers, to their prior occurrence within the last 32KB of the text. Therefore, the decompression process requires a 32KB buffer of the recently decompressed data, to keep all possible bytes that might be back-referenced by the pointers; this causes a major space penalty. With today’s mid-range firewalls, which are built to support 100K to 200K concurrent connections, keeping a 32KB window buffer for each connection occupies a few gigabytes of main memory. Decompression also causes a time penalty, but this penalty was successfully reduced in [36].
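The 32KB-window mechanism can be made concrete with an idealized decoder sketch: the token stream mixes literal bytes with (distance, length) back references, and decoding a pointer must read bytes that may lie up to 32KB back, which is why the window buffer must be kept per connection. This is illustrative only; real gzip adds Huffman coding on top of the LZ77 stage shown here.

```python
def lz77_decode(tokens, window=32 * 1024):
    """Decode an idealized LZ77 token stream. Each token is either a
    literal byte string or a (distance, length) pointer that copies
    `length` bytes starting `distance` bytes back in the decoded text."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, bytes):
            out += tok
        else:
            dist, length = tok
            assert 0 < dist <= min(window, len(out)), "pointer outside window"
            for _ in range(length):      # byte-by-byte copy permits overlap
                out.append(out[-dist])
    return bytes(out)
```

The overlap case (distance smaller than length) is what makes pointers behave like run-length encoding, e.g. a single byte followed by a (1, n) pointer expands to n+1 copies of that byte.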
This high memory requirement leaves vendors and network operators with three bad options: ignore compressed traffic, forbid compression, or divert the compressed traffic for offline processing. Obviously, none of these is acceptable, as each presents either a security hole or a serious performance degradation.
The basic structure of our approach is to keep the 32KB buffer of all connections compressed, except for the data of the connection whose packet(s) is currently being processed. When a packet arrives, we unpack its connection buffer and process it. One may naïvely suggest keeping only the original compressed data as it was received. However, this approach fails, since the buffer would contain recursive pointers reaching more than 32KB backwards. Our technique, called “Swap Out-of-boundary Pointers” (SOP), packs the connection’s buffer by combining recent information from both the compressed and the uncompressed 32KB buffer to create a new compressed buffer that contains pointers referring only to locations within itself. We show that by employing our technique on real-life data we reduce the space requirement by a factor of 5, with a time penalty of 26%. Note that while our method modifies the compressed data locally, it is transparent to both the client and the server.
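The boundary problem that SOP addresses can be demonstrated on an idealized token stream. The sketch below is not the SOP algorithm of Chapter 4 (which also re-compresses the swapped regions and handles the Huffman layer); it is a simplified, hypothetical illustration of the core idea, namely replacing pointers that reach back across the buffer boundary with the literals they denote, so that the packed buffer is self-contained:

```python
def swap_out_of_boundary(tokens, start):
    """Rebuild an LZ77-style token stream for the text suffix beginning
    at `start`, so that every pointer refers only to positions inside
    the new buffer. Tokens are literal byte strings or (distance,
    length) pointers into the previously decoded text."""
    decoded = bytearray()
    new_tokens = []
    for tok in tokens:
        if isinstance(tok, bytes):
            pos = len(decoded)
            decoded += tok
            if pos + len(tok) > start:           # keep the part after the boundary
                new_tokens.append(tok[max(0, start - pos):])
        else:
            dist, length = tok
            pos = len(decoded)
            for _ in range(length):              # byte-wise copy handles overlap
                decoded.append(decoded[-dist])
            if pos >= start and pos - dist >= start:
                new_tokens.append(tok)           # pointer target stays in-buffer
            elif pos + length > start:
                # Pointer crosses or precedes the boundary: swap to literals.
                new_tokens.append(bytes(decoded[max(pos, start):pos + length]))
    return new_tokens, bytes(decoded[start:])
```

Decoding the returned tokens against an empty history reproduces exactly the text from `start` onward, which is the invariant SOP maintains for each packed connection buffer.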
We further design an algorithm that combines our SOP space-reducing technique with the ACCH algorithm presented in [36] (the ACCH algorithm accelerates pattern matching on compressed HTTP traffic). The combined algorithm achieves an improvement of 42% in time and 79% in space requirements. The time-space tradeoff offered by our technique provides the first solution that enables DPI on compressed traffic at wire speed for network devices such as IPS and Web application firewalls.
This work was published in the proceedings of IFIP Networking 2011 [21]. An
extended version was published in the Computer Communications Journal [22].
1.2.3 Shift-based Pattern Matching for Compressed Web Traffic
In this work we provide a method for accelerating DPI over compressed traffic. The most common compression method for Web traffic is the gzip algorithm, which eliminates repetitions of strings using back references (pointers). The key insight is to store information produced by the pattern matching algorithm while scanning the decompressed traffic and, upon encountering a pointer, use this data either to find a match or to skip scanning that area. Recent work [36] presents the ACCH technique for pattern matching on compressed traffic. This technique decompresses the traffic and then uses data from the decompression phase to accelerate the process. That work analyzed the case of using the well-known Aho-Corasick (AC) [23] algorithm as a multi-pattern matching technique. The basic Aho-Corasick has good worst-case performance, since every character requires traversing exactly one deterministic finite automaton (DFA) edge. However, the adaptation for compressed traffic, where some characters represented by pointers may be skipped, is complicated, since Aho-Corasick requires inspection of
every byte within the traffic.
Inspired by the insights of that work, we investigate the case of performing DPI over
compressed Web traffic using the shift-based multi-pattern matching technique of the
modified Wu-Manber algorithm [94]. The Wu-Manber algorithm does not scan every
position within the traffic; in fact it shifts (skips) scanning areas in which the algorithm
concludes that no pattern starts.
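The shift heuristic can be illustrated with a simplified, Horspool-style variant using single-character blocks. The actual modified Wu-Manber algorithm of Section 5.1 uses multi-character blocks and hash tables; this sketch, with an invented pattern set, only conveys why skipping positions is safe:

```python
def build_shift(patterns):
    """Shift table in the spirit of Wu-Manber, simplified to blocks of
    one character: shift[c] is how far the scan position may safely
    advance when character c is read at the window's last position."""
    m = min(len(p) for p in patterns)    # patterns are truncated to length m
    shift = {}
    for p in patterns:
        for i, c in enumerate(p[:m]):
            shift[c] = min(shift.get(c, m), m - 1 - i)
    return shift, m

def mwm_scan(text, patterns):
    """Scan `text`, skipping regions where no pattern can possibly end."""
    shift, m = build_shift(patterns)
    matches, i = [], m - 1
    while i < len(text):
        s = shift.get(text[i], m)        # characters absent from all prefixes
        if s:                            # allow a full window skip
            i += s
        else:
            start = i - m + 1            # shift 0: verify each candidate pattern
            for p in patterns:
                if text.startswith(p, start):
                    matches.append((start, p))
            i += 1
    return matches
```

A nonzero shift certifies that no pattern prefix of length m can end at the current position, which is exactly the property SPC exploits when it extends the skips to compressed pointers.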
As a preliminary step, we present an improved version of the Wu-Manber algorithm (see Section 5.1). This modification improves both time and space complexity, to fit the large number of patterns in current pattern sets such as the Snort database [9]. We then present our Shift-based Pattern matching for Compressed traffic algorithm (SPC), which accelerates Wu-Manber on compressed traffic. SPC results in a simpler algorithm, with higher throughput and lower storage overhead than the accelerated AC, since the Wu-Manber algorithm’s basic operation involves shifting (skipping) over some of the traffic. Thus, it is natural to combine Wu-Manber with the idea of skipping some of the pointers.
We show that we can skip scanning up to 87.5% of the data and gain a performance
boost of more than 73% as compared to the Wu-Manber algorithm on real Web traffic
and security-tool signatures. Furthermore, we show that the suggested algorithm also
gains a normalized throughput improvement of 51% as compared to ACCH. Finally, the
SPC algorithm reduces the additional space required for previous scan results by half,
by storing only 4KB per connection as compared to the 8KB of ACCH.
This work was published in the proceedings of HPSR 2011 [37].
1.2.4 Decompression-Free Inspection: DPI for Shared Dictionary
Compression over HTTP
Gzip works well as a compression method for each individual HTTP response, but it
often happens that a great deal of data is common to a group of pages. This type
of sharing is known as inter-response redundancy. Therefore, next-generation Web
compression methods are inter-file, where there is one dictionary that may be referenced
by several files. An example of a compression method that uses a shared dictionary is
Shared Dictionary Compression over HTTP (SDCH).
SDCH [38] was proposed by Google Inc.; thus, Google Chrome (Google’s browser)
supports it by default. According to W3Schools [3], Google's Chrome browser surpassed
Mozilla's Firefox browser in March 2012 (after it surpassed Microsoft's Internet Explorer
browser back in April 2011) to become the clear winner of the latest browser wars.
Thus, the popularity of SDCH compression is expected to increase accordingly.
Android is a software stack for mobile devices that includes an operating system,
middleware and key applications. The Android operating system, also introduced by
Google, is currently the world’s best-selling smartphone platform, with a 68.1% market
share worldwide [1]. SDCH code also appears in the Android platform and is likely to be
used in the near future. Therefore, a solution for DPI on shared dictionary compressed
data is essential for this platform as well. SDCH is complementary to gzip or Deflate,
i.e., it could be used before applying gzip. On Web pages containing Google search
results, the data size reduction when adding SDCH compression before gzip is about
40% better than gzip alone.
The idea of the shared dictionary approach is to transmit the data that is common
to the responses once, and after that to send only the parts that differ. In SDCH
notation, the common data is called the dictionary and the differences are stored in a
delta file.
A dictionary is composed of the data used by the compression algorithm, as well as
metadata describing its scope and lifetime. The scope is specified by the domain and
path attributes; thus, a user may download several dictionaries, even from the same
server.
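As a toy illustration of the dictionary/delta split: the real SDCH delta format is VCDIFF, so the ('copy'/'add') tuples below are a hypothetical simplification of the idea, not the wire format.

```python
# Toy shared-dictionary delta decoding (real SDCH deltas use VCDIFF, RFC 3284;
# the ('copy'|'add') operation tuples here are a hypothetical simplification).

def decode_delta(dictionary, delta):
    out = []
    for op in delta:
        if op[0] == 'copy':                 # ('copy', offset, length): from dict
            _, off, length = op
            out.append(dictionary[off:off + length])
        else:                               # ('add', literal): new bytes
            out.append(op[1])
    return ''.join(out)

# Hypothetical dictionary shared by many search-result pages:
dictionary = "<html><head><title>Search</title></head><body>"
delta = [('copy', 0, 19), ('add', 'Results'), ('copy', 25, 21)]
page = decode_delta(dictionary, delta)
# page == "<html><head><title>Results</title></head><body>"
```

Only the short `('add', 'Results')` literal is actually transmitted as new data; the rest is reconstructed from the shared dictionary.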
In this work we present a novel pattern matching algorithm for SDCH. Our algorithm
operates in two phases: the offline phase and the online phase. The offline phase
starts when the device receives the dictionary. In this phase, the algorithm uses the Aho-
Corasick [23] pattern matching algorithm to scan the dictionary for patterns and to mark
auxiliary information that facilitates the scan of the delta files. Once a delta file is
received, it is scanned online using the Aho-Corasick algorithm. Since the delta file
eliminates repetitions of strings using references to the common strings in the dictionary,
our algorithm tries to skip these references, so that each plain-text byte is scanned only
once (either in the offline or the online phase). We show that we skip up to 99% of the
referenced data and gain up to 56% improvement in the performance of the multi-
pattern matching algorithm, compared with scanning the plain text directly.
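The two-phase idea can be sketched as follows. This is a simplified model, not the exact thesis algorithm: plain substring search stands in for Aho-Corasick, and the delta is a hypothetical list of ('copy', offset, length) and ('add', literal) operations rather than real SDCH.

```python
# Two-phase sketch: offline, find all pattern occurrences in the dictionary
# once; online, reuse those results for copied regions and rescan only the
# boundary windows where a pattern could straddle a copy edge.

def offline_scan(dictionary, patterns):
    """Offline phase: locate every pattern occurrence inside the dictionary."""
    hits = []
    for p in patterns:
        i = dictionary.find(p)
        while i != -1:
            hits.append((i, p))
            i = dictionary.find(p, i + 1)
    return hits

def scan_delta(dictionary, delta, patterns, offline_hits):
    k = max(len(p) for p in patterns) - 1   # max bytes a pattern can straddle
    text, skip, matches = "", [], []
    for op in delta:
        if op[0] == 'copy':
            _, off, n = op
            base = len(text)
            for i, p in offline_hits:       # reuse matches fully inside copy
                if off <= i and i + len(p) <= off + n:
                    matches.append((base + i - off, p))
            if n > 2 * k:                   # interior bytes: never rescanned
                skip.append((base + k, base + n - k))
            text += dictionary[off:off + n]
        else:
            text += op[1]
    for p in patterns:                      # rescan only outside copy interiors
        for s in range(len(text) - len(p) + 1):
            if any(a <= s and s + len(p) <= b for a, b in skip):
                continue
            if text.startswith(p, s):
                matches.append((s, p))
    return sorted(set(matches))

dictionary = "hello evil world"
patterns = ["evil", "dxw"]
hits = offline_scan(dictionary, patterns)
delta = [('copy', 0, 16), ('add', 'x'), ('copy', 11, 5)]
matches = scan_delta(dictionary, delta, patterns, hits)
```

Here the match of "evil" comes entirely from the offline results, while "dxw", which straddles copy boundaries, is caught by the boundary rescan.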
We are the first to address the problem of pattern matching algorithms for SDCH. In
addition, we have designed a novel algorithm that scans only a negligible number of bytes
more than once, as our evaluations confirm (see Section 6.3). This is a remarkable result
considering that bytes in the dictionary can be referenced multiple times by different
positions in one delta file and moreover, by different delta files. The SDCH compression
ratio is about 44%, implying that 56% of the data is copied from the dictionary. Thus,
in a single scan, our algorithm achieves 56% improvement over a plain text file scan.
Our algorithm also has low memory consumption. It stores only the dictionary being
used (along with some auxiliary information per dictionary). In the case of SDCH, since
it was developed for Web traffic, one dictionary usually supports many connections.
In other words, the memory consumption depends on the number of dictionaries and
their sizes and not on the number of connections, in contrast to intra-file compression
methods.
Finally, an important contribution is a mechanism to deal with matching regular-
expression signatures in SDCH-compressed traffic. Regular expression signatures are
becoming increasingly popular due to their superior expressibility [26]. We show how to
use our algorithm as a building block for regular expression matching. Our experiments
show that our regular expression matching mechanism gains a similar 56% boost in
performance.
This work was published in the proceedings of Infocom 2012 [33].
1.2.5 MCA2: Multi Core Architecture for Mitigating
Complexity Attacks
This work deals with complexity attacks, which exploit the gap between the resources
the system spends on processing normal packets and on carefully crafted packets that
consume drastically more resources (computing, memory, cache, or other). These
crafted packets, which we call heavy packets, are easy to construct but require very
intensive processing from the system. This implies that a small effort on the attacker’s
side leads to a great effort on the part of the system, which is bound to lose.
We present MCA2—a Multi-Core Architecture for Mitigating Complexity Attacks.
MCA2 essentially isolates the malicious traffic to a fraction of the cores and deals with
legitimate traffic on the remaining cores, which are therefore not affected by the attack.
Our MCA2 system can be configured to mitigate any complexity attack with the
following properties:
1. There are heavy and normal packets, where heavy packets consume considerably
more resources from the security device when being processed.
2. There is a method to identify heavy packets that requires very few resources.
3. Packets can be moved efficiently between system cores.
4. There is a special method that handles heavy packets more efficiently than the
method used for normal packets.1
It turns out that there are quite a few complexity attacks that meet these criteria.
However, we restrict our discussion to the DPI component of NIDS/NIPS. We consider
three examples that have the above properties: cache-miss attack on Snort’s signature
detection engine; active states explosion attack on the Hybrid-FA [27] regular expression
detection engine; and forced construction attack on Bro IDS regular expression detection
engine.
We focus on the first example and use it to explain our method and the above-
mentioned properties. We then show that the active states explosion complexity attack
fits our requirements as well. We back up all our findings with experimental results,
showing the benefits of using MCA2 in conjunction with the NIDS. For the forced
construction attack example, we look at the Bro IDS regular expression detection engine.
Bro takes a lazy approach in order to cope with the large DFA size. Namely, it con-
structs only the DFA parts it actually uses. Normal traffic uses only a small part of the
DFA. Hence, a simple complexity attack forces Bro to construct a large portion of the
DFA, which significantly degrades performance.
With regard to our main example, we target Snort’s DPI engine, which uses some
variant of the Aho-Corasick (AC) [23] algorithm for performing pattern matching. A
complexity attack on the Aho-Corasick algorithm (in a stand-alone environment) is
shown in [34]: Aho-Corasick uses a large DFA that cannot fit entirely in the cache. The
common traffic, however, uses only a very small part of it, resulting in fast memory
references and few cache misses. An attacker can easily craft malicious packets that
cause an exhaustive traversal over the DFA, which in turn pollutes the cache. In this
work, we show for the first time that Snort is indeed vulnerable to this attack: an attack
on its DPI component degrades its overall performance by a factor of 4.2.
After establishing that the threat of this attack is real, we turn to investigate how
MCA2 mitigates such an attack. The key challenge is how to detect and isolate malicious
1This special method usually handles normal packets poorly; otherwise it would have been used by the system in the first place.
traffic. This is done in two steps. First, training data is used to identify and mark the
common states of the DFA. These are the frequently-visited states while processing
normal common traffic. Then, for each packet, we count the fraction of non-common
states visited (out of the total number of states traversed by the packet). As soon as
this fraction exceeds a certain threshold, the packet is marked heavy. When the fraction
of heavy packets is above a second threshold, we allocate one or more cores to deal with
them exclusively, while the rest of the cores continue to process only normal traffic (and
to detect heavy packets); each subsequent heavy packet is moved to one of the dedicated
cores. This process isolates the effect of heavy packets and protects the private caches
of the non-dedicated cores from pollution. MCA2 can be further optimized by running
on the dedicated cores an implementation that is optimized for heavy packets (albeit
with penalty in the normal case).
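The detection step described above can be sketched as follows. The common-state set and the threshold value here are illustrative assumptions, not the thesis's tuned parameters.

```python
# Sketch of the heavy-packet heuristic: a packet is flagged heavy when the
# fraction of visits to non-common DFA states exceeds a threshold.
# The threshold and common-state set below are assumed values for illustration.

HEAVY_FRACTION = 0.3     # assumed threshold on non-common state visits

def is_heavy(visited_states, common_states, threshold=HEAVY_FRACTION):
    """Flag a packet whose DFA traversal spends too much time outside the
    frequently-visited ('common') states marked during training."""
    if not visited_states:
        return False
    uncommon = sum(1 for s in visited_states if s not in common_states)
    return uncommon / len(visited_states) > threshold

common = {0, 1, 2, 5}                             # marked offline from training
assert not is_heavy([0, 1, 0, 2, 5, 1], common)   # normal packet
assert is_heavy([0, 9, 14, 23, 17, 9], common)    # crafted, cache-polluting
```

Once the fraction of packets flagged this way crosses a second (system-wide) threshold, cores would be reallocated as described above.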
The main performance measure we use is the goodput of the system, namely the
volume of the non-malicious packets that were processed. Our experimental results
are summarized in Fig. 1.1, which shows the system's goodput under different attack
intensities (e.g., at an attack intensity of 50%, half of the incoming traffic is malicious).
We compare the performance of MCA2 with two implementations of the Aho-Corasick
algorithm: the first, denoted “Full Matrix AC,” is optimized for well-behaved normal
traffic, and the second, denoted “Compressed AC,” is optimized to work under cache-
miss attack (as described in Section 2.1).
When the system is not allowed to drop packets, MCA2 uses “Full Matrix AC” on
the cores that process normal traffic and “Compressed AC” on the dedicated cores. The
number of cores of each type is dynamically determined as a function of the attack level.
When there is no attack, MCA2 is reduced to “Full Matrix AC.”
We also consider the case when the NIDS/NIPS is allowed to drop packets. Drop-
ping all heavy packets implies that no dedicated threads are required, freeing up all
processing resources for the detection of heavy packets and processing of non-heavy
(mostly legitimate) packets, thus increasing the goodput.
Our experiments show a significant goodput improvement: MCA2 achieves up to
twice the goodput of both implementations, even without dropping packets. Further-
more, it always outperforms a hybrid implementation that chooses the best of the
previous implementations at any given time, with a goodput boost of up to 73%.
As for the second example, we use the regular expressions Hybrid-FA data structure
[Figure omitted: goodput (Mbps) vs. attack intensity (%), with curves for Full Matrix AC, Compressed AC, MCA2, and MCA2 with drop.]
Figure 1.1: The goodput of MCA2 for different attack intensities. MCA2 with no drops maintains a balance between all cores.
to illustrate an active states explosion attack. Hybrid-FA uses a single “head-DFA” for
commonly-used states, while other parts of the automaton are kept as separate DFAs,
which are activated simultaneously when required. Usually, only the “head-DFA” is
activated. Thus, our complexity attack causes the Hybrid-FA to activate many states
in parallel, therefore causing the system to traverse several states per input byte; this
degrades system throughput significantly. We show that MCA2 in full-drop setup can
mitigate such an attack: our experiments show that under a mild active states explosion
attack, the goodput of the system is increased by a factor of 4.8.
This work was published in the proceedings of ANCS 2012 [20].
1.3 Related Work
This section surveys work related to the research presented in this thesis.
1.3.1 DFA Compression
Intensive efforts have been made to implement compact Aho-Corasick-like DFAs that
can fit into faster memory.
Van-Lunteren [89] proposed a novel architecture that supports prioritized tables.
His results are equivalent to the CompactDFA presented in Chapter 3 with a suffix tree
limited to depth 2, and thus have 25 (66) times more rules than the CompactDFA solution
for Snort (ClamAV). CompactDFA is, in some sense, a generalization of [89], which
eliminates all cross-transitions. Song et al. [85] proposed an n-step cache architecture
to eliminate some of the DFA’s cross-transitions. This solution still has 4 (9) times
more rules for Snort (ClamAV) than in CompactDFA. In addition, this solution, like
other hardware solutions [76, 87], uses dedicated hardware and thus is not flexible.
As far as we know, CompactDFA is the first proposed method for reducing the
number of transitions in a DFA to the minimum possible, namely the number of DFA states.
CompactDFA does not depend on any specific hardware architecture or on any statistical
property of the data (as opposed to the work of Tuck et al. [88]).
The papers [96] and [86, 93] encode segments of the patterns in the TCAM and do
not encode the DFA rules. However, both solutions require significantly larger TCAM
(especially [93]) and more SRAM (an order of magnitude more). The work of Lin
et al. [66] encodes the DFA rules in TCAM, just as CompactDFA does. CompactDFA
and [66] are based on the same basic observation, that we can eliminate cross-transitions
by using information from the next state label. However, [66] does not use the bits of
the state to encode the information; on the contrary, they just append to each state
code the last m bytes of its corresponding label to eliminate cross-transitions to depth
m. Thus, for depth 4, [66] requires 62 bits while CompactDFA requires only 36 bits;
hence, their solution is not scalable.
A recent work presented a method for state encoding in a TCAM-based implementation
of an Aho-Corasick NFA rather than an Aho-Corasick DFA [97]. While this method,
which was developed concurrently with ours, shares some of the insights of our work
(e.g., it also eliminates all failure transitions), it is limited to TCAM implementations,
whereas CompactDFA may be used with any known IP-lookup solution. In addition,
unlike our work, the method in [97] does not deal with pipelined TCAM implementa-
tions (which are common in contemporary TCAM chips) and therefore suffers from
significant performance degradation if such TCAMs are used.
Following our work, several methods to perform regular expression matching using
TCAM [71, 78] were suggested. These methods rely on the same high-level principle
of our work: exploiting the degree of freedom in the way states are encoded. Since
these methods deal with regular expression rather than exact string matching, they do
not use AC-DFA, but other automata that are geared to handle regular expressions.
Specifically, [71] uses D2FA, while [78] uses both a DFA and knowledge derived from a
corresponding NFA; both methods then construct a tree (or forest) structure, which is
encoded similarly to CompactDFA. Finally, unlike our work, the methods in [71, 78, 97]
do not deal with pipelined TCAM implementations (which are common in contemporary
TCAM chips) and therefore suffer from significant performance degradation if such
TCAMs are used.
Two additional methods that use TCAMs to handle regular expression matching
were presented by Liu et al. [68] and Zheng et al. [99]. These methods present orthogonal
improvements to utilizing TCAMs. Specifically, the method of Liu et al. [68] is based on
implementing a fast and cheap pre-filter, so that only a portion of the traffic has to be
fully inspected; Zheng et al. [99], on the other hand, suggest a technique that parallelizes
the use of the TCAM by smartly dividing the pattern rule set and the flows among
different TCAM blocks. Naturally, these two approaches can be easily combined with
ours.
Finally, we note that [71] introduces the table consolidation technique, which com-
bines entries even if they lead to different states. This technique trades TCAM memory
for cheaper SRAM memory that stores the different states of each combined entry.
Table consolidation, which requires solving complicated optimization problems, can
also be applied to our results to further reduce TCAM space.
1.3.2 Compressed Web-Traffic
Extensive research has been conducted on performing pattern matching on compressed
files as in [25, 59, 72, 73], but very limited work has been done on compressed traffic.
Requirements for dealing with compressed traffic are: (1) on-line scanning (1-pass),
(2) handling thousands of connections concurrently, and (3) working with the LZ77
compression algorithm, which is used by gzip (as opposed to most papers, which deal
with LZW/LZ78 compression). To the best of our knowledge, [47, 52] are the only
papers that deal with pattern matching over LZ77. However, the algorithms are for a
single pattern and require two passes over the compressed text (file), which is not an
option in network domains that require “on-the-fly” processing.
Klein and Shapira [60] have suggested a modification to the LZ77 compression
algorithm that changes the backward pointers into forward pointers. That modification
makes pattern matching in files easier and may save some of the space required by
the 32KB buffer for each connection. However, their proposal is not implemented in
today's HTTP.
The first paper to analyze the obstacles in dealing with compressed traffic is [36],
but it only accelerated the pattern matching task on compressed traffic and did not
handle the space problem. Furthermore, it still requires full decompression.
Techniques have been developed for in-place compression, the main one being
LZO [75]. While LZO claims to support decompression without memory overhead,
it works with files and assumes that the uncompressed data is available. We assume
decompression of thousands of concurrent connections on-the-fly, without having the
uncompressed data available. Thus, what comes for free in LZO is considered overhead
for compressed Web traffic. Furthermore, while gzip is considered the standard for Web
traffic compression, LZO is not supported by any Web server or Web browser.
1.3.3 DPI Using Multi-Core Platforms
The recent proliferation of multi-core general purpose processors motivated many re-
searchers to reinvestigate well-known problems in this new domain. Among these are
several works that proposed a multi-core solution for DPI processing. These papers’
main focus is on different ways to load balance the system tasks between the available
cores.
Current NIDS/NIPS systems such as Snort [9] and Bro [2] split the load into many
sequential sub-tasks in a pipeline manner. Other works, such as [91], suggest fine-
grained pipelining for parallelizing network applications on multi-core architectures.
This partitioning is effective if the processing cost for each sub-task is similar, which is
usually not the case for NIDS/NIPS.
A different line of research focuses on load balancing the traffic flows equally between
the different cores and performing the inspection in parallel [41, 53, 67, 74, 83]. The load
balancing is based on both the packet header parameters and some layer-7 parameters.
We note that such architectures are orthogonal to our MCA2 algorithm (see Chapter 7)
and may be applied to load balance the work between general threads that process the
normal traffic. If MCA2 is not used in conjunction with these architectures, they are
all vulnerable to complexity attacks.
Becchi et al. [30] focus on DPI and present a performance evaluation scheme for
multiprocessor systems. The proposed design also splits the traffic between several
cores with the same DPI engine that supports regular expression matching. Their
study identifies and evaluates algorithmic and architectural trade-offs and limitations,
and highlights how the presence of caches affects the overall performance. However, it is
geared toward optimizing the normal case and is vulnerable to complexity attacks similar
to those we describe in this work. Such attacks can be mitigated by incorporating MCA2
into this scheme as well.
Another multi-core load-balancing approach is to partition the patterns among the
cores (cf. [90, 95, 98]). Then different DPI algorithms, each specializing in different
kinds of pattern sets, are run on each core. In some cases, the partitioning itself is
done so as to balance the load between the algorithms. It is important to note that
architectures of this kind differ from MCA2 in that each packet is examined by several
cores (each performs only part of the inspection). In addition, they do not take into
account the incoming traffic, and are vulnerable to separate attacks on each core.
1.3.4 Denial-of-Service Mitigation
Kumar et al. [62] present several methods to reduce the size of regular-expression-based
DFAs. One of the mechanisms used in that paper is based on the assumption that normal
flows rarely match more than the first few symbols of any signature. Thus, the most
frequently visited portions of the automaton are used to build a fast path DFA, and the
rest of the automaton is represented by a separate NFA, which is the slow path. The
authors suggest a solution that is similar to MCA2 in that it handles heavy traffic with
a different algorithm and applies a lightweight classification algorithm to distinguish
between heavy and normal traffic. In addition, [62] proposes to protect against denial-
of-service (DoS) attacks by attaching lower priority to flows with higher probability
of being malicious. Nevertheless, that work analyzes the case of a single core, and
therefore could not benefit from the multi-core properties as MCA2 does. Furthermore,
the proposed protection in [62] fails under a continuous DoS attack because the heavy
packets that receive lower priority eventually overload the system buffer. MCA2 is also
resilient to DoS attacks of longer duration.
1.4 Significance
This thesis provides algorithms and techniques in the field of deep packet inspection for
high performance network security tools. These algorithms focus on three problems:
scalability, compressed traffic, and security-tool resiliency.
For the first topic, that of scalability, we are the first to provide a scheme that reduces
the pattern matching problem to the well-studied problem of Longest Prefix Matching
(LPM), which may be solved either in TCAM, in commercially available chips, or in
software.
For the second topic, that of compressed traffic, we are the first to address the
problem and to provide a set of state-of-the-art solutions that achieve good theoretical
and practical results.
As for the third topic, we have uncovered and demonstrated weaknesses of preva-
lent security tools for commercial networks, by devising a denial-of-service algorithmic
complexity attack over the Snort network intrusion detection system. Furthermore, we
are the first to incorporate the common multi-core platform architecture to mitigate
complexity attacks over network security tools.
Chapter 2
Background
In this chapter we provide background on topics that are relevant to the following
chapters: “pattern matching”, “compressed traffic” and “complexity attacks”.
2.1 DFA based Pattern Matching
DPI is a major component in contemporary security tools; it relies heavily on
pattern matching to detect signatures of malicious traffic. We consider the following
two classes of pattern matching: exact matching and regular expression matching. The
former usually uses a deterministic finite automaton (DFA), while the latter uses either
a DFA or a non-deterministic finite automaton (NFA) for the ongoing inspection of the
input data [54]. A sub-category of the latter class is the Ternary Content Addressable
Memory (TCAM) based regular expression matching, which encodes the DFA rules
using TCAM elements (as discussed in Chapter 3).
In our main example, we mostly focus on the exact matching algorithms, which
use DFAs. A DFA is a five-tuple 〈S,Σ, δ, s0, F 〉, where S is a finite set of states, Σ is a
finite set of input symbols, δ : S × Σ → S is a transition function, returning the next
state given the current state and an input symbol, s0 ∈ S is the initial state,
and F ⊆ S is a set of accepting states. The Aho-Corasick algorithm provides a method to
build such an automaton (a.k.a. Aho-Corasick DFA) from a set of patterns. Given the
DFA, a packet is inspected by traversing the automaton symbol by symbol from s0; a
pattern is detected if a state in F is reached in this traversal. Fig. 2.1(a) depicts the
Aho-Corasick DFA for the pattern-set {E,BE,BD,BCD,CDBCAB,BCAA}.
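The inspection procedure just described can be transcribed directly. The toy transition table below encodes a DFA for the single pattern "ab" over the alphabet {a, b}, not the figure's automaton.

```python
# Direct transcription of the DFA-based inspection loop: traverse the automaton
# symbol by symbol from the initial state and report whenever an accepting
# state is reached. The table below is a toy DFA for the pattern "ab".

def inspect(data, delta, s0, accepting):
    s, alerts = s0, []
    for i, symbol in enumerate(data):
        s = delta[(s, symbol)]              # exactly one DFA edge per symbol
        if s in accepting:
            alerts.append(i)                # a pattern ends at position i
    return alerts

delta = {(0, 'a'): 1, (0, 'b'): 0,
         (1, 'a'): 1, (1, 'b'): 2,
         (2, 'a'): 1, (2, 'b'): 0}
print(inspect("aabba", delta, 0, {2}))      # the pattern "ab" ends at position 2
```

Note the worst-case property mentioned earlier: every input symbol costs exactly one transition-table lookup.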
In today’s security tools, Aho-Corasick DFAs are huge—e.g., Snort’s Aho-Corasick
[Figure omitted: (a) the Aho-Corasick DFA for the pattern-set {E, BE, BD, BCD, CDBCAB, BCAA}; (b) its full-matrix encoding; (c) the compressed automaton; (d) the compressed encoding.]
Figure 2.1: Example of an Aho-Corasick DFA and two methods to store it in memory: non-compressed (full-matrix) encoding and compressed encoding. The compressed encoding is derived from a compressed automaton, in which fail transitions are taken without consuming input symbols, and transitions marked with ‘*’ indicate that a match was found.
DFA has 77,182 states for 31,094 patterns—raising the question of how to store it
efficiently in memory. The alternatives naturally trade memory space for execution
time. In addition, most security tools (including Snort) divide their patterns into several
sets, according to the type of traffic.
Snort uses a full-matrix encoding for its Aho-Corasick DFAs as presented in [23]. In
this representation (see Fig. 2.1(b)), transitions are stored in a two-dimensional array
with |S| rows and |Σ| columns. An entry at position (i, j) holds the value of δ(si, j),
implying that the number of bits in each entry is at least ⌈log2 |S|⌉. In the typical
case, when the input is inspected one byte at a time, |Σ| = 256, resulting in an overall
memory footprint of 256|S|⌈log2 |S|⌉ bits. For Snort's Aho-Corasick DFAs, this translates
to a combined footprint of 75.15 MB. On the other hand, the main advantage of this
encoding is that a transition consists of a single memory load operation that directly
reveals the next state.
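A minimal sketch of the full-matrix representation and the footprint formula above; the 2-state, 4-symbol matrix is a toy example, not Snort's table.

```python
import math

# Full-matrix encoding: a |S| x |alphabet| array in which entry (i, j) directly
# holds the next state, so each transition costs a single table lookup.

def footprint_bits(num_states, alphabet_size=256):
    """alphabet * |S| * ceil(log2 |S|) bits, per the formula in the text."""
    return alphabet_size * num_states * math.ceil(math.log2(num_states))

matrix = [[0, 1, 0, 0],      # row i = state s_i, column j = input symbol j
          [0, 1, 1, 0]]      # toy 2-state DFA over a 4-symbol alphabet

def next_state(state, symbol):
    return matrix[state][symbol]
```

The fast lookup is exactly what makes this encoding the default choice for well-behaved traffic, at the cost of the large footprint computed by `footprint_bits`.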
An alternative approach is to implement an AC automaton using the concept of
failure transitions. In such implementations, only part of the outgoing transitions from
each state are stored explicitly. While traversing the automaton, if the transition from
state s with symbol x is not stored explicitly, one will take the failure transition from
s to another state s′ and look for an explicit transition from s′ with x. This process is
repeated until an explicit transition with x is found, resulting in failure paths. Naturally,
since only part of the transitions are stored explicitly, these implementations (sometimes
referred to as AC NFAs) are more compact, but incur higher processing time. A classical
result states that the longest failure path is at most the size of the longest pattern,
and that, regardless of the traffic pattern, the total number of transitions (failure and
explicit) is at most twice the number of symbols. This result does not take into account
the representation of each single state, which determines the time it takes to figure out
whether an explicit rule exists or not.
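A sketch of a single traversal step with failure transitions; the tiny goto/fail tables encode the hypothetical patterns "ab" and "bc".

```python
# Failure-transition traversal sketch: only explicit transitions are stored;
# on a miss, failure links are followed until a state that handles the symbol
# (or the root) is reached.

def step(state, symbol, goto, fail, root=0):
    # goto: dict (state, symbol) -> state, holding only explicit transitions
    while (state, symbol) not in goto and state != root:
        state = fail[state]                 # walk the failure path
    return goto.get((state, symbol), root)

# states: 0 = root, 1 = "a", 2 = "ab", 3 = "b", 4 = "bc"
goto = {(0, 'a'): 1, (0, 'b'): 3, (1, 'b'): 2, (3, 'c'): 4}
fail = {1: 0, 2: 3, 3: 0, 4: 0}

print(step(2, 'c', goto, fail))   # from "ab", reading 'c': fail to "b", reach "bc"
```

Here only 4 explicit transitions are stored instead of a full 5 x |Σ| matrix, illustrating the compactness/processing-time trade-off described above.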
We use the following definitions regarding this encoding: Let the label of a state
s, denoted by L(s), be the concatenation of symbols along the path from the root to
s. Furthermore, let the depth of a state s be the length of the label L(s). The failure
transition from s is always to the state s′ whose label L(s′) is the longest proper suffix of L(s)
among the labels of all other DFA states. This implies the following property of the Aho-Corasick
DFA:
Property 1. If L(s′) is a suffix of L(s) then there is a failure path (namely, a path
comprised only of failure transitions) from state s to state s′.
The DFA is traversed starting from the root. When the traversal goes through an
accepting state, it indicates that some patterns are a suffix of the input; one of these
patterns always corresponds to the label of the accepting state. Formally, we denote
by s.output the set of patterns matched by state s; if s is not an accepting state then
s.output = ∅. Finally, we denote by scan(s, b), the AC procedure when reading input
symbol b while in state s; namely, transiting to a new state s′ after traversing failure
transitions and a forward transition as necessary, and reporting matched patterns in
case s′.output ≠ ∅. scan(s, b) returns the new state s′ as its output. The correctness of
the AC algorithm essentially stems from the following simple property:
Property 2. Let b1, . . . , bn be the input, and let s0, s1, . . . , sn be the sequence of states the
AC algorithm goes through while scanning the symbols one by one (s0 is the root of the
DFA, and si is the state reached after scanning bi). For any i ∈ {0, . . . , n}, L(si) is a suffix
of b1, . . . , bi; furthermore, it is the longest such suffix among all states of the DFA.
There are other encodings that require more than one memory access, but offer
significant memory reduction. Several such encodings exist in the literature [29, 34, 88].
Fig. 2.1(d) depicts one such alternative, as suggested in [34]; this encoding is based on
a compressed automaton as depicted in Figure 2.1(c).
The construction of AC’s DFA is done in two phases. First, the algorithm builds a
trie of the pattern set: All the patterns are added from the root as chains, where each
state corresponds to a single symbol. When patterns share a common prefix, they also
share the corresponding set of states in the trie. In the second phase, additional edges
are added to the trie. These edges deal with situations where the input does not follow
the current chain in the trie (that is, the next symbol is not an edge of the trie) and
therefore we need to transit to a different chain. In such a case, the edge leads to a
state corresponding to a prefix of another pattern, which is equal to the longest suffix
of the previously matched symbols.
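The two construction phases can be sketched compactly in Python; this is a textbook-style Aho-Corasick with failure links (rather than the fully resolved DFA), shown on the pattern set used in Fig. 2.1.

```python
from collections import deque

# Phase 1 builds the trie of the patterns; phase 2 adds the extra transitions
# via the classic BFS computation of failure links. The full DFA is obtained
# by resolving failures eagerly, which this sketch does lazily during scanning.

def build_ac(patterns):
    goto, out, fail = [{}], [set()], [0]
    for p in patterns:                       # phase 1: the pattern trie
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); out.append(set()); fail.append(0)
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    q = deque(goto[0].values())              # phase 2: failure links (BFS)
    while q:
        s = q.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:    # longest suffix with a transition
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]           # inherit matches from the suffix
            q.append(t)
    return goto, fail, out

def ac_scan(text, goto, fail, out):
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)
```

For example, `ac_scan("XBEY", *[x for x in build_ac(["E", "BE"])])` reports both "BE" and its suffix pattern "E" at the same end position, matching the output-set semantics described in the text.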
It is sometimes useful to look at the DFA as a directed graph whose vertex set is S,
with an edge from s1 to s2 labeled x if and only if δ(s1, x) = s2. The
input is inspected one symbol at a time: Given that the algorithm is in some state s ∈ S
and the next symbol of the input is x ∈ Σ, the algorithm applies δ(s, x) to get the next
state s′. If s′ is in F (that is, an accepting state) the algorithm indicates that a pattern
was found. In any case, it then transits to the new state s′.
-
2.2. COMPRESSED WEB-TRAFFIC 23
We use the following simple definitions to capture the meaning of a state s ∈ S:
The depth of a state s, denoted depth(s), is the length (in edges) of the shortest path
between s0 and s. The label of a state s, denoted label(s), is the concatenation of the
edge symbols along the shortest path from s0 to s. Further, for every i ≤ depth(s),
suffix(s, i) ∈ Σ∗ (respectively, prefix(s, i) ∈ Σ∗) is the suffix (prefix) of length i of
label(s). The code of a state s, denoted code(s), is the unique number that is associated
with the state, i.e., the number that encodes the state. Traditionally, this number is
chosen arbitrarily; in this work we take advantage of this degree of freedom.
We use the following classification of DFA transitions (cf. [85]):
• Forward transitions are the edges of the trie; each forward transition links a
state of some depth d to a state of depth d + 1.
• Cross transitions are all other transitions. Each cross transition links a state of
depth d to a state of depth d′ where d′ ≤ d. Cross transitions to the initial state
s0 are also called failure transitions, and cross transitions to states of depth 1
are also called restartable transitions.
2.2 Compressed Web-Traffic
This section provides an overview of the main techniques that are used to compress
Web traffic in the Internet.
2.2.1 Gzip Compression
HTTP 1.1 [19] supports the usage of content-codings to allow a document to be com-
pressed. The RFC suggests three content-codings: gzip, compress and deflate. In fact,
gzip uses deflate as its underlying compression protocol. For the purpose of this the-
sis they are considered the same. Gzip and deflate are the codings commonly
supported by current browsers and Web servers (analyzing captured packets from the
latest versions of the Internet Explorer, FireFox and Chrome browsers shows that
these browsers accept only the gzip and deflate codings).
The gzip algorithm uses a combination of the following compression techniques: first
the text is compressed with the LZ77 algorithm and then the output is compressed with
the Huffman coding. Let us elaborate on the two algorithms:
-
24 CHAPTER 2. BACKGROUND
Figure 2.2: Example of LZ77 compression on the beginning of the Yahoo! homepage. (a) Original; (b) after LZ77 compression.
LZ77 Compression The purpose of LZ77 [100] is to reduce the string presenta-
tion size, by spotting repeated strings within the last 32KB of the uncompressed data.
The algorithm replaces a repeated string by a backward-pointer consisting of a (dis-
tance, length) pair, where distance is a number in [1, 32768] (32K) indicating the backward
distance in bytes to the earlier occurrence of the string, and length is a number in [3, 258]
indicating the length of the repeated string. For example, the text ‘abcdeabc’ can be
compressed to ‘abcde(5,3)’; namely, “go back 5 bytes and copy 3 bytes from that point”.
LZ77 refers to such a pair as a “pointer” and to uncompressed bytes as “literals”.
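To make the pointer semantics concrete, the following minimal Python sketch decodes a stream of literals and (distance, length) pairs. The token-stream representation is our simplification; in real deflate the tokens are themselves Huffman-coded, as described next.

```python
def lz77_decompress(tokens):
    """Decode a stream of literal bytes and (distance, length) back-pointers."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):       # back-pointer: (distance, length)
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])   # byte-by-byte copy: a pointer may
        else:                            # overlap the bytes it produces
            out.append(tok)              # literal byte
    return "".join(out)
```

For the example above, `lz77_decompress(list("abcde") + [(5, 3)])` yields ‘abcdeabc’.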
Fig. 2.2 depicts an example extracted from the ‘Yahoo!’ home page after LZ77
compression. Note that decompression is relatively cheap in time, since it reads and
copies sequential data blocks; it thus relies on spatial locality and requires only a few
memory references.
Huffman Coding The second algorithm used by gzip is the Huffman coding. This
method works on a character-by-character basis, transforming each 8-bit character into a
variable-size codeword; the more frequent the character, the shorter its corresponding
codeword. The codewords are chosen such that no codeword is a prefix of another, so the
end of each codeword can be easily determined. Dictionaries are provided to facilitate
the translation of binary codewords to bytes.
In the gzip format, Huffman coding encodes both ASCII characters (that is, literals)
and pointers into codewords using two dictionaries: one for the literals and the pointer
lengths, and the other for the pointer distances. Huffman coding may use either fixed or
dynamic dictionaries, where the latter achieves a better compression ratio. The Huffman dictionaries for
the two alphabets appear immediately after the header bits and prior to the compressed
data.
-
A common implementation of Huffman decoding (cf. zlib [17]) uses two levels of
lookup tables. The first level stores all codewords of length at most 9 bits in a
table of 2^9 = 512 entries that represents all possible 9-bit inputs; each entry holds the
symbol value and its actual length. If a symbol exceeds 9 bits, there is an additional reference to
a second lookup table. Thus, in most of the cases, decoding a symbol requires only a
single memory reference, while for the less frequent symbols it requires two.
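The two-level table can be sketched as follows. We use a 3-bit first level instead of zlib's 9 bits to keep the example small, a toy prefix-free codebook of our own, and a simple prefix scan in place of a real second-level table; only the structure of the lookup is meant to mirror the description above.

```python
K = 3  # first-level index width; zlib uses 9

def build_tables(codebook):
    """codebook maps a prefix-free bit-string codeword to a symbol.
    Codewords of at most K bits fill every first-level slot they prefix;
    longer codewords leave a marker that redirects to the second level."""
    first = [None] * (1 << K)
    second = {}
    for code, sym in codebook.items():
        if len(code) <= K:
            pad = K - len(code)
            base = int(code, 2) << pad
            for i in range(1 << pad):            # every K-bit input that
                first[base + i] = (sym, len(code))   # starts with this code
        else:
            first[int(code[:K], 2)] = (None, K)  # marker: second lookup needed
            second[code] = sym
    return first, second

def decode(bits, first, second):
    out, pos = [], 0
    while pos < len(bits):
        chunk = bits[pos:pos + K].ljust(K, "0")
        sym, length = first[int(chunk, 2)]
        if sym is None:                          # infrequent long codeword:
            for code, s in second.items():       # one extra lookup
                if bits.startswith(code, pos):
                    sym, length = s, len(code)
                    break
        out.append(sym)
        pos += length
    return out
```

With the codebook {0 → a, 10 → b, 110 → c, 1110 → d, 1111 → e}, short codewords resolve in one table access, while d and e need the second lookup.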
2.2.1.1 Challenges in performing DPI on Compressed HTTP
While transparent to the end-user, compressed Web traffic needs special care by bump-
in-the-wire devices that reside between the server and the client and perform DPI. The
device first needs to decompress the data in order to inspect its payload, since there is no
apparent “easy” way to perform DPI over compressed traffic without decompressing the
data in some way. This is mainly because LZ77 is an adaptive compression algorithm,
namely the text represented by each symbol is determined dynamically by the data.
As a result, the same substring is encoded differently depending on its location within
the text. For example, the pattern ‘abcdef’ can be expressed in the compressed data as
abcde ∗^j (j + 5, 5) f, where ∗^j stands for any j uncompressed bytes, for all possible j < 32763.
One of the main problems with the decompression is its memory requirement; the
straightforward approach requires a 32KB sliding window for each connection. Note
that this requirement is difficult to avoid, since the back-reference pointer can refer to
any point within the sliding window and the pointers may be recursive (i.e., a pointer
may point to an area with a pointer). As opposed to compressed traffic, DPI of non-
compressed traffic requires storing only a two- or four-byte variable that holds the corre-
sponding DFA state, aside from the DFA itself, which is of course stored in any case. Hence,
dealing with compressed traffic poses a significantly higher memory requirement, by a
factor of 8,000 to 16,000. Thus, a mid-range firewall that handles 100K-200K concurrent
connections (like GTA’s G-800 [12], SonicWall’s Pro 3060 [13] or Stonesoft’s StoneGate
SG-500 [14]), needs 3GB-6GB memory while a high-end firewall that supports 500K-
10M concurrent connections (like the Juniper SRX5800 [15] or the Cisco ASA 5550 or
5580 [11]) would need 15GB-300GB memory only for the task of decompression. This
memory requirement not only makes the architecture expensive or outright infeasible,
but also limits the ability to perform caching or to use fast memory chips such as SRAM.
Hence, reducing the space also boosts the speed, because it makes faster memory
technology, such as SRAM, a viable option. This work deals with the
challenges imposed by this space aspect.
Apart from the space penalty described above, the decompression stage also in-
creases the overall time penalty. However, we note that DPI requires significantly more
time than decompression, since decompression is based on reading consecutive mem-
ory locations and therefore enjoys the benefit of the cache block architecture and has a low
per-byte read cost, whereas DPI employs a very large data structure that is accessed
by reads to non-consecutive memory areas and therefore requires expensive main-memory
accesses. In [36] we provided an algorithm that takes advantage of information gathered
by the decompression phase in order to accelerate the commonly used Aho-Corasick pat-
tern matching algorithm. By doing so, we significantly reduced the time requirement
of the entire DPI process on compressed traffic.
2.2.2 SDCH Compression
2.2.2.1 The SDCH Framework
SDCH is a new compression mechanism proposed by Google Inc. In SDCH, a dictionary
is downloaded (as a file) by the user agent from the server. The dictionary contains
strings which are likely to appear in subsequent HTTP responses. If, for example, the
header, footer, JavaScript and CSS are stored in a dictionary possessed by both user
agent and server, the server can construct a delta file by substituting these elements with
references to the dictionary, and the user agent can reconstruct the original page from
the delta file using these references. By substituting dictionary references for repeated
elements in HTTP responses, the payload size is reduced, as cross-payload redundancy
is eliminated. In order to use SDCH, the user agent adds the label SDCH in
the Accept-Encoding field of the HTTP header. The scope of a dictionary is specified
by the domain and path attributes; thus, one server may have several dictionaries, and
the user agent must hold the specific dictionary used by the server in order to decompress
its compressed traffic. If the user agent already has a dictionary from the negotiated
server, it adds the dictionary id as a value to the header Avail-Dictionary. If the user
agent does not have the specific dictionary that was used by the server, the server sends
an HTTP response with the header Get-Dictionary and the dictionary path; now, the
user agent can construct a request to get the dictionary.
-
2.3. COMPLEXITY ATTACK 27
2.2.2.2 The VCDIFF Compression Algorithm
SDCH encoding is built upon the VCDIFF compression data format. The VCDIFF
encoding process uses three types of instructions, called delta instructions: add, run and copy.
add(i, str) means to append to the output i bytes, which are specified in parameter
str. run(i, b) means to append i times the byte b. Finally, copy(p, x) means that
the interval [p, p + x) should be copied from the dictionary (that is, x bytes starting
at position p). The delta file contains the list of instructions with their arguments and
the dictionary is one long string composed of the characters that can be referenced
by the copy instructions in the delta file. In the rest of this thesis, we ignore the
run instruction since it is barely used and can be replaced with an equivalent add for
our purposes.
For example, suppose that the dictionary is DBEAACDBCABC, and the delta file is
given by the following commands:
1. add (3,ABD)
2. copy (0,5)
3. add (1,A)
4. copy (4,5)
5. add (2,AB)
6. copy (9,3)
7. add (4,AACB)
8. copy (5,3)
9. add (1,A)
10. copy (6,3)
The plain-text that should be considered is therefore (bolded bytes were copied
from the dictionary):
ABDDBEAAAACDBCABABCAACBCDBADBC
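The delta application can be sketched directly from these definitions. The tuple representation of the instruction list is our own; as above, run is omitted.

```python
def apply_delta(dictionary, delta):
    """Replay add/copy delta instructions against the shared dictionary."""
    out = []
    for op, a, b in delta:
        if op == "add":                  # add(i, str): append i literal bytes
            assert len(b) == a           # i must equal the literal's length
            out.append(b)
        else:                            # copy(p, x): append dictionary[p:p+x]
            out.append(dictionary[a:a + b])
    return "".join(out)

dictionary = "DBEAACDBCABC"
delta = [("add", 3, "ABD"), ("copy", 0, 5), ("add", 1, "A"), ("copy", 4, 5),
         ("add", 2, "AB"), ("copy", 9, 3), ("add", 4, "AACB"), ("copy", 5, 3),
         ("add", 1, "A"), ("copy", 6, 3)]
```

Applying the ten instructions above to the dictionary reproduces the plain-text ABDDBEAAAACDBCABABCAACBCDBADBC.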
2.3 Complexity attack
In a complexity attack, the attacker exploits the system’s worst-case performance, which
differs from the average case that the system was designed for. Crosby and Wallach were
among the first to demonstrate the phenomenon on the commonly-used Open Hash data
structure [43]: an attacker designs an input that requires O(n) elementary operations
per insertion, instead of O(1) operations that are required on the average.
Recent works show that many other systems and algorithms are vulnerable to com-
plexity attacks including QuickSort [70], regular expression matcher [79], intrusion de-
tection systems [34, 48, 82], the Linux route-table cache [92], SSL authentication al-
gorithm [40], and the retransmission algorithm in wireless networks [31]. Complexity
attacks on different components of NIDS/NIPS were suggested in the past. For exam-
ple, Bro maintains a hash table with the IP header fields of packets as keys; thus, by
tailoring the traffic with specific headers, one can cause the hash insert-operation to
last significantly longer, resulting in Bro's failure. While in some cases modifying the
algorithm suffices to mitigate the problem (e.g., Crosby and Wallach’s attack can be
solved by using hash functions that are not known to the attacker), this does not hold
in general.
-
Chapter 3
CompactDFA: Generic State
Machine Compression for Scalable
Pattern Matching
In this chapter we propose a novel method to compress deterministic finite automata
(DFA), the common data structure for DPI. Compressing the DFA enables storing it
in a faster memory, which in turn yields a significant performance boost.
Related background for pattern matching using DFA is provided in Section 2.1. Related
work is in Section 1.3.1.
3.1 The CompactDFA Scheme
In this section we present our CompactDFA scheme. We begin by describing the scheme's
output, namely a compact encoding of the DFA, and continue with the algorithm
and the intuition behind it.
3.1.1 CompactDFA Output
A straightforward encoding of the Aho-Corasick DFA is to store the set of rules (one
rule for each transition) with the following fields:
Current state field | Symbol field | Next state field
The output of the CompactDFA scheme is a set of compressed rules, such that there
is only one rule per state. This is achieved by cleverly choosing the code of the states.
-
30 CHAPTER 3. COMPACTDFA
Unlike traditional AC-like algorithms, in our compact DFA each rule has the following
structure:
Set of current states | Symbol field | Next state code
The set of current states of each rule is written in a prefix style, i.e., the rule captures
all states whose code matches a specific prefix. Specifically, for each state s, let N(s)
be the incoming neighborhood of s, namely all states that have an edge to s. For every
state s ∈ S, we have one rule whose current-state field is the common prefix of the codes
of the states in N(s) and whose next state is s. Note that the symbol that transfers the
states in N(s) to s is the same for all states in N(s), due to the properties of AC-like
algorithms (see Property 2 in Section 3.1.3).
Fig. 3.1(c) shows the rules produced by CompactDFA on the DFA of Fig. 3.1(a).
For example, Rule 5 in Fig. 3.1(c), which is 〈010**, D, 11010(s11)〉, is the compressed
rule for next state s11 and it replaces three original rules: 〈01000(s3), D, 11010(s11)〉,
〈01001(s5), D, 11010(s11)〉, and 〈01010(s10), D, 11010(s11)〉.
In the compressed set of rules, the code of a state may match multiple rules. As in
forwarding tables in IP networks, the rule with the Longest Prefix Match (LPM)
determines the action. In our example, this is demonstrated by looking at Rules 6 and
10 in Fig. 3.1(c). Suppose that the current state is s8, whose code is 00001, and the
symbol is A. Then, Rule 10 is matched, since the current state's code matches the prefix 00***. In
addition, Rule 6, with current state 000**, is also matched. According to the longest
prefix match rule, Rule 6 determines the next state.
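The longest-prefix-match lookup can be sketched as follows, using Rules 5, 6 and 10 of the example; a linear scan is used for clarity, whereas real LPM engines, as in IP forwarding, use TCAMs or tries.

```python
def lookup(rules, state_code, symbol):
    """Return the next-state code of the matching rule whose current-state
    pattern has the longest explicit (non-wildcard) prefix."""
    def matches(pattern, code):
        return all(p in ("*", c) for p, c in zip(pattern, code))
    best = None
    for pattern, sym, nxt in rules:
        if sym == symbol and matches(pattern, state_code):
            plen = len(pattern.rstrip("*"))   # explicit prefix length
            if best is None or plen > best[0]:
                best = (plen, nxt)
    return best[1] if best else None

# Rules 5, 6 and 10 of Fig. 3.1(c): (set of current states, symbol, next state)
rules = [("010**", "D", "11010"),   # rule 5:  next state s11
         ("000**", "A", "11000"),   # rule 6:  next state s9
         ("00***", "A", "10100")]   # rule 10: next state s7
```

For current-state code 00001 and symbol A, both Rules 6 and 10 match, and Rule 6 wins with the longer prefix, giving next state 11000 (s9), as in the example above.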
3.1.2 CompactDFA Algorithm
This section describes the encoding algorithm of CompactDFA and gives the intuition
behind each of its three stages: State Grouping (Algorithm 1, Section 3.1.4), Common
Suffix Tree Construction (Algorithm 2, Section 3.1.5), and State and Rule Encoding
(Algorithm 3, Section 3.1.6).
The first stage of our algorithm is based on the following insight: Suppose that each
state s is encoded with its label; our goal is to encode with a single rule the incoming
neighborhood N(s), which should appear in the first field of the rule corresponding to
next state s. Note that the labels of all states in N(s) share a common suffix, which
is the label of s without its last symbol. Thus, by assigning code(N(s)) to be label(s)
without its last symbol, padded with “don’t care” symbols in its beginning, and applying
-
3.1. THE COMPACTDFA SCHEME 31
Figure 3.1: A toy example. (a) Aho-Corasick DFA for the patterns {EBC, EBBC, BA, BBA, BCD, CF}; failure and restartable transitions are omitted for clarity. (b) The Common Suffix Tree. (c) The rules of the compressed DFA:

Rule | Current state | Symbol | Next state
  1  |     00000     |   C    | 01001 (s5)
  2  |     00010     |   C    | 01000 (s3)
  3  |     00010     |   B    | 00000 (s4)
  4  |     10010     |   B    | 00010 (s2)
  5  |     010**     |   D    | 11010 (s11)
  6  |     000**     |   A    | 11000 (s9)
  7  |     01***     |   F    | 11100 (s13)
  8  |     00***     |   C    | 01010 (s10)
  9  |     00***     |   B    | 00001 (s8)
 10  |     00***     |   A    | 10100 (s7)
 11  |     *****     |   E    | 10010 (s1)
 12  |     *****     |   C    | 01100 (s12)
 13  |     *****     |   B    | 00011 (s6)
 14  |     *****     |   *    | 10000 (s0)
-
a longest suffix match rule, one correctly captures the transitions of the DFA.
For example, consider Fig. 3.1(a). The code of state s7 is BA. N(s7) = {s6, s2},
label(s6) = B and label(s2) = EB; their common suffix is B, and indeed the code of
N(s7) is “***B”. On the other hand, code(N(s9)) = code({s4, s8}) = “**BB”; thus, if
the current state is s4, whose label is EBB, and the symbol is A, the next state is s9
whose corresponding rule has longer suffix than the rule corresponding to s7.
As demonstrated above, the longest suffix match rule should be applied to resolve
conflicts when more than one rule is matched. Intuitively, this encoding is correct since
all incoming edges to a state s (alternatively, all edges from N(s)) share the same suffix,
which is code(N(s)). Moreover, a cross transition edge from a state s with symbol x
always ends up at a state s′ whose label is the longest suffix (among all state labels) of
the concatenation of label(s) with x.
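A small sketch of this label-based encoding: computing code(N(s)) as the padded common suffix, and resolving a transition by the longest suffix match. The labels and the width of 4 follow the toy example; the helper names are ours.

```python
def suffix_code(labels, width=4):
    """code(N(s)): the common suffix of the labels of N(s), left-padded
    with '*' (don't care) to a fixed width."""
    suf = labels[0]
    for label in labels[1:]:
        while not label.endswith(suf):
            suf = suf[1:]                # shrink until common to all labels
    return "*" * (width - len(suf)) + suf

def longest_suffix_match(codes, label, width=4):
    """Among the codes matching `label` right-aligned, pick the one with
    the longest explicit (non-wildcard) suffix."""
    padded = ("*" * width + label)[-width:]
    candidates = [c for c in codes
                  if all(r in ("*", x) for r, x in zip(c, padded))]
    return max(candidates, key=lambda c: len(c.lstrip("*")))
```

Here `suffix_code(["B", "EB"])` yields "***B" (the code of N(s7)) and `suffix_code(["EBB", "BB"])` yields "**BB" (the code of N(s9)); for the label EBB both codes match, and "**BB" wins with the longer suffix, sending s4 to s9 as in the example.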
However, this code is, first and foremost, extremely wasteful (and thus impractical),
requiring a 32-bit code for the automaton of Fig. 3.1(a) (namely, to encode 4-byte labels)
and hundreds of bits for Snort's DFA. In addition, it uses a longest suffix match