FPGA Based String Matching for Network Processing Applications
Janardhan Singaraju, John A. Chandy
Presented by: Justin Riseborough, Albert Tirtariyadi
ENGG*3050 RCS Winter 2014
March 24, 2014
2
Content
Introduction
String Lookup Cache
◦ Architectures
◦ System Interaction
◦ Systems comparison
Network Intrusion Detection
◦ Architectures
◦ System Interaction
◦ Implementations
Critique
3
Keywords
Network processing
String matching
Content Addressable Memory (CAM) & Cache
Bottlenecks
Fixed-Size/Non-Fixed-Size keys
Cascading, propagating
Parallelism
4
Introduction
String matching is used in search engines and network intrusion detection
Network processing applications require frequent string matching for specific keywords
As networks get faster, it becomes more difficult for general-purpose processors (GPPs) to keep up
Bottlenecks are found in memory and in slow algorithms and implementation methods
5
Current Implementations
Software Algorithms
Rabin-Karp
◦ Compares hashes of inputs instead of direct character matching
Knuth-Morris-Pratt
◦ Character-by-character matching; skips non-matching
Boyer-Moore
◦ Uses pre-computed functions to determine shifting distance
Hardware Implementation
Finite automata methods
◦ Translates finite automata graphs to FPGA circuitry
CAMs
◦ Caches and lookup tables
◦ Cellular automata
◦ Finite state machines
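To make the hash-comparison idea behind Rabin-Karp concrete, here is a minimal software sketch. It is not taken from the paper: the function name `rabin_karp` and the `base` and `mod` values are illustrative assumptions, and a direct comparison is added after each hash hit to rule out collisions.

```python
def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 1_000_003) -> list[int]:
    """Return every index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the character that leaves the window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # Compare hashes first; confirm with a direct comparison on a hash hit.
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits
```

Each window shift costs a constant number of operations, but the work is still sequential per character, which is the kind of software cost the hardware approaches in this talk try to avoid.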
STRING LOOKUP CACHE
Section I
6
7
String Lookup Cache
Hardware implementation based on CAMs, cellular automata and caching
Caches retain frequently used values, reducing the need to constantly look up address values
Compatible with parallel processing, prefix sharing and pattern partitioning
Very high throughput with low area overhead
Drawback of CAMs and hardware caches is the reliance on fixed-size keys
◦ Implementations for non-fixed-size keys require additional overhead
8
System Architecture
9
Content Addressable Memory
Hardware implementation of 2D [associative] arrays/ADT
In VLSI, the cells are transistors
In an FPGA, storage cells are registers, comparators are XOR gates
10
CAM as Character Match Array (CMA)
Takes characters from the network processor on successive clock cycles
Each column corresponds to one character of a keyword
Input character is applied simultaneously to all n columns
A column's match signal goes high if all input bits match
Storage cell used to indicate end of keyword
11
Processor Element (PE) Array
An array of finite state machines that carries out the approximate match algorithm
May contain multiple keywords from the CAM
Takes the match signals from the CAM and sets PE flags, which are forwarded to subsequent PEs
Evaluates entire input strings in linear time relative to the size of the input stream
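The interaction between the character match array and the PE array can be approximated in software. This is a sketch under stated assumptions, not the paper's circuit: it models exact key lookup only, one input character per clock cycle, and the names `build_cma` and `lookup` and the per-column start and end flags are hypothetical.

```python
def build_cma(keywords):
    """Lay keywords out in consecutive columns; flag the first and last column of each."""
    chars, starts, ends = [], [], []
    for word in keywords:
        for j, ch in enumerate(word):
            chars.append(ch)
            starts.append(j == 0)
            ends.append(j == len(word) - 1)   # end-of-keyword storage cell
    return chars, starts, ends


def lookup(cma, key):
    """Return the PE number that flags a full match of key, or None."""
    chars, starts, ends = cma
    flags = [False] * len(chars)
    for cycle, ch in enumerate(key):                     # one character per clock cycle
        col_match = [stored == ch for stored in chars]   # broadcast to all columns at once
        new_flags = []
        for i in range(len(chars)):
            if starts[i]:
                enabled = cycle == 0       # a keyword can only begin on the first character
            else:
                enabled = flags[i - 1]     # flag forwarded from the previous PE
            new_flags.append(col_match[i] and enabled)
        flags = new_flags
    for i, flag in enumerate(flags):
        if flag and ends[i]:
            return i
    return None
```

The loop runs once per input character, which mirrors the linear-time claim above; in hardware, the inner loop over columns happens in parallel within a single cycle.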
12
CMA and PE Interaction
13
Map Table and Outputs
The map table takes the PE# and outputs the address of the value or an indirect pointer to the value object
The map table has as many slots as there are PEs
Long words can leave holes in the map table
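One way to picture the holes is the sketch below. It assumes that only the last PE of each keyword has a map table entry, so every other slot stays empty; that reading, the `build_map_table` name and the example values are illustrative assumptions rather than details confirmed by the slides.

```python
def build_map_table(entries):
    """entries: list of (keyword, value) pairs with non-empty keywords.

    Returns (table, end_pe), where table has one slot per PE and
    end_pe maps each keyword to the PE number of its last character.
    """
    table = []
    end_pe = {}
    for word, value in entries:
        table.extend([None] * (len(word) - 1))  # holes: PEs that do not end a keyword
        table.append(value)                     # slot used by the keyword's final PE
        end_pe[word] = len(table) - 1
    return table, end_pe


table, end_pe = build_map_table([("FOO", "addr0"), ("INTRUSION", "addr1")])
holes = table.count(None)
```

With a 3-character and a 9-character keyword, the table has 12 slots and only 2 are used, so longer keywords waste proportionally more of the table.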
14
System Interaction
15
Implementations Comparison

FPGA Implementation (Xilinx Virtex-II Pro FPGA, XC2VP230-7)
Number of characters    256      512      1024
Slices                  2403     4812     9880
Frequency (MHz)         380.1    476.9    460.2
Throughput (Gb/s)       12.2     15.3     14.7
Searches per second     254 M    318 M    307 M

Software Implementation (1GHz PowerPC Computer)
Number of characters    256      512      1024
Time per search (ns)    1128     1305     1582
Throughput (Gb/s)       0.043    0.037    0.030
Searches per second     887K     766K     632K
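The software searches-per-second figures follow directly from the time per search, and the two implementations can be compared as a ratio. The short calculation below uses only numbers from the comparison above; the variable and function names are illustrative.

```python
# Figures copied from the comparison, keyed by number of characters.
fpga_searches_per_s = {256: 254e6, 512: 318e6, 1024: 307e6}
sw_time_per_search_ns = {256: 1128, 512: 1305, 1024: 1582}


def searches_per_second(time_ns):
    """Convert a time per search in nanoseconds to searches per second."""
    return 1e9 / time_ns


# Ratio of FPGA searches per second to software searches per second.
speedup = {
    n: fpga_searches_per_s[n] / searches_per_second(t)
    for n, t in sw_time_per_search_ns.items()
}
```

The reciprocal of 1128 ns is about 887K searches per second, matching the software row, and the resulting ratios are roughly 287x, 415x and 486x in favour of the FPGA for 256, 512 and 1024 characters.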
NETWORK INTRUSION DETECTION
Section II
16
17
Network Intrusion Detection
The process of identifying and analyzing packets that may contain threats to the organization’s network
Time-consuming process whose cost grows quickly as the defined rule set (signatures) grows
String matching is the most computationally intensive part of intrusion detection
◦ Every incoming packet is compared against several pre-defined signatures
18
Problems in the CAM Architecture
CAM-based designs cannot easily handle regular expressions
NIDS signatures are not of a fixed size
◦ (e.g. the CAM contains FOO and BAR and the input stream is AFOOBARCD. In a 3-character setup, the comparisons are made against AFO, OBA and RCD; none of these match, so both keywords slip right through the detection system)
CAM arrays are very large in area
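The alignment problem in the FOO/BAR example can be reproduced in a few lines. This is an illustrative software sketch, not the hardware design; `aligned_matches` and `sliding_matches` are hypothetical names.

```python
def aligned_matches(stream, keywords, width=3):
    """Compare only fixed-width blocks that start on a width-aligned boundary."""
    blocks = [stream[i:i + width] for i in range(0, len(stream), width)]
    return [(i * width, block) for i, block in enumerate(blocks) if block in keywords]


def sliding_matches(stream, keywords):
    """Compare at every character offset, so alignment no longer matters."""
    return [
        (i, keyword)
        for i in range(len(stream))
        for keyword in sorted(keywords)
        if stream.startswith(keyword, i)
    ]


missed = aligned_matches("AFOOBARCD", {"FOO", "BAR"})   # blocks are AFO, OBA, RCD
found = sliding_matches("AFOOBARCD", {"FOO", "BAR"})
```

The aligned comparison returns no matches for AFOOBARCD, while the per-offset comparison reports FOO at offset 1 and BAR at offset 4.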
19
Proposed Solution
Use discrete comparators instead of CAMs
◦ Sacrifices the ability to update signatures dynamically; a fair tradeoff as signatures change relatively infrequently
Use p rows of comparators for parallelism, matching several characters in one clock cycle
Remove the aligned keyword approach, as incoming streams may not be aligned to a certain size boundary
20
System Architecture
21
Processor Architecture
22
Processor Architecture
23
Processor Element Flow
Start at the beginning of the signature
Each PE's output is based on the previous PE and the current PE
If both the previous signal and the current signal are a match, propagate the match signal until the end of the signature
At the end of the signature, if the entire signature matches, flag the sig_match output
24
Signature Match Processor Example
Input string ‘144’ is processed over 2 clock cycles
‘1’ is checked in first cycle, sets off a match signal into the SMA
‘4’ is checked in second cycle, sets off match signal into the SMA
Match signal for ‘1’ is present from previous clock cycle
25
Signature Match Processor Example
The ‘4’ is duplicated, so it simply propagates the first match signal to the second as a carry
Since this is the end of the signature, the output is a match due to the propagated match signals && sig_end
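The propagation of match signals through to sig_end can be sketched as follows. The model assumes p input characters are consumed per clock cycle and that a signature may start at any offset; the function name `scan` and the exact cycle assignment are assumptions for illustration and may not match the figures in the paper.

```python
def scan(stream, signature, p=2):
    """Return (cycle, end_offset) for each occurrence of signature in stream."""
    m = len(signature)
    if m == 0:
        return []
    match = [False] * m   # match[i]: signature[:i + 1] ends at the last character seen
    hits = []
    for cycle, start in enumerate(range(0, len(stream), p)):
        for offset, ch in enumerate(stream[start:start + p]):   # the p comparator rows
            prev = match
            match = [
                ch == signature[i] and (i == 0 or prev[i - 1])
                for i in range(m)
            ]
            if match[m - 1]:          # propagated match signal && sig_end
                hits.append((cycle, start + offset))
    return hits
```

With `p=2`, `scan("144", "144")` reports the match in cycle 1, the second clock cycle: the match signal raised by '1' is carried through both '4' characters until it reaches the end of the signature.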
26
Address Output Logic
In order for the SMP to be useful, we also need to know which signatures caused the match
This is handled by the word match buffer, which maintains the position of the signature match
When the last character being processed has been reached, the match address output logic begins working on the buffer entries
27
Address Output Logic
A binary tree is used for the matching signatures
Decoding starts, and a signal is sent to the control circuitry stating there are matches
A pointer then propagates up the tree, generating a bit of the final address based on matches
Binary trees are fast and efficient; time to process is ~M cycles, where M is the number of matches
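A software analogue of tree-based address decoding is sketched below. It assumes the word match buffer is a list of booleans whose length is a power of two, and it walks from the root down to a leaf, which may differ from the direction used in the paper's circuit; all function names are hypothetical.

```python
def build_or_tree(buffer):
    """levels[0] is the word match buffer; each higher level ORs pairs of children."""
    levels = [list(buffer)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] or prev[i + 1] for i in range(0, len(prev), 2)])
    return levels


def next_match_address(levels):
    """Return the lowest matching address, one bit per tree level, or None."""
    if not levels[-1][0]:
        return None                       # root is low: no matches to report
    index = 0
    for level in reversed(levels[:-1]):   # from just below the root down to the leaves
        left = 2 * index
        index = left if level[left] else left + 1   # each choice fixes one address bit
    return index


def drain(buffer):
    """Report every match address, one per iteration (M iterations for M matches)."""
    buffer = list(buffer)
    addresses = []
    while True:
        addr = next_match_address(build_or_tree(buffer))
        if addr is None:
            return addresses
        addresses.append(addr)
        buffer[addr] = False              # clear the entry that was just reported
```

Rebuilding the tree on every iteration keeps the sketch short; hardware would update the OR signals combinationally instead. The outer loop still runs once per match, in line with the ~M cycle figure above.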
28
FPGA Implementation
As parallelism increases, throughput increases, but frequency decreases due to the added complexity
As the number of characters increases, area increases, while frequency and throughput decrease
29
Implementation Comparison
30
Critique
New terms and unknown works referred to
Difficult to follow in some areas due to inconsistencies and how the topic is presented
Lots of procedure / methodology on implementation
Very detailed work
Good examples to strengthen theoretical explanations
Implementation data given for comparison purposes
QUESTIONS?
31
32
References
All figures and information used in this presentation were pulled from the article:
Janardhan Singaraju, John A. Chandy, FPGA Based String Matching For Network Processing, ScienceDirect Microprocessors and Microsystems, December 14, 2007