FPGA Based String Matching for Network Processing Applications
Janardhan Singaraju, John A. Chandy

Presented by: Justin Riseborough, Albert Tirtariyadi

ENGG*3050 RCS Winter 2014
March 24, 2014

Content

Introduction

String Lookup Cache
◦ Architectures
◦ System Interaction
◦ Systems comparison

Network Intrusion Detection
◦ Architectures
◦ System Interaction
◦ Implementations

Critique

Keywords

Network processing

String matching

Content Addressable Memory (CAM) & Cache

Bottlenecks

Fixed-Size/Non-Fixed-Size keys

Cascading, propagating

Parallelism

Introduction

String matching is used in search engines and network intrusion detection

Network processing applications require frequent string matching for specific keywords

As networks get faster, it becomes more difficult for general-purpose processors (GPPs) to keep up

Bottlenecks are found in memory and in slow algorithms and implementation methods

Current Implementations

Software Algorithms

Rabin-Karp
◦ Compares hashes of inputs instead of direct character matching

Knuth-Morris-Pratt
◦ Character-by-character matching; skips alignments that cannot match

Boyer-Moore
◦ Uses pre-computed functions to determine shifting distance

Hardware Implementation

Finite automata methods
◦ Translates finite automata graphs to FPGA circuitry

CAMs
◦ Caches and lookup tables
◦ Cellular automata
◦ Finite state machines
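To make the first software algorithm in the list above concrete, here is a minimal Rabin-Karp sketch. The function name, the base of 256 and the modulus are illustrative choices, not values from the paper. It compares a rolling hash of each text window with the pattern hash and falls back to a direct comparison only when the hashes agree.

```python
def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 1_000_003) -> list:
    """Return every index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # Compare hashes first; confirm directly to rule out hash collisions.
        if w_hash == p_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits
```

The per-window work is a constant number of arithmetic operations, which is why the hash comparison is cheaper than re-reading every character of the window.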

STRING LOOKUP CACHE

Section I

String Lookup Cache

Hardware implementation based on CAMs, cellular automata and caching

Caches retain frequently used values, reducing the need to constantly look up address values

Compatible with parallel processing, prefix sharing and pattern partitioning

Very high throughput with low area overhead

Drawback of CAMs and hardware caches is the reliance on fixed-size keys
◦ Implementations for non-fixed-size keys require additional overhead

System Architecture

Content Addressable Memory

Hardware implementation of 2D [associative] arrays/ADT

In VLSI, the cells are transistors

In an FPGA, storage cells are registers, comparators are XOR gates
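A small software model may help picture the CAM described above. This is a sketch under assumptions: the class name `Cam` and its methods are hypothetical, and the list comprehension stands in for comparisons that the hardware performs on all stored words at once.

```python
from typing import List, Optional


class Cam:
    """Software model of a small CAM: every stored word is compared with the key."""

    def __init__(self, words: List[int]):
        self.words = list(words)  # storage cells (registers on an FPGA)

    def match_lines(self, key: int) -> List[bool]:
        # XOR is zero only when every bit of the stored word equals the key bit.
        return [(word ^ key) == 0 for word in self.words]

    def lookup(self, key: int) -> Optional[int]:
        """Return the address of the first matching word, or None on a miss."""
        lines = self.match_lines(key)
        return lines.index(True) if True in lines else None
```

For example, `Cam([0x46, 0x4F, 0x42]).lookup(0x42)` returns address 2, while a key that is not stored returns `None`.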

CAM as Character Match Array (CMA)

Takes characters from the network processor on successive clock cycles

Each column corresponds to a character in a keyword

The input character is applied simultaneously to all n columns

A column's match signal goes high if all of its bits match the input bits

A storage cell is used to indicate the end of a keyword
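The column behaviour described above can be sketched as a function that applies one input character to every column. The function name and the packing of keywords back to back into consecutive columns are assumptions made for illustration.

```python
from typing import List, Tuple


def column_matches(keywords: List[str], ch: str) -> Tuple[List[bool], List[bool]]:
    """Apply one input character to every column of the array at once."""
    columns = "".join(keywords)  # one column per keyword character (assumed layout)
    # End-of-keyword bit: set on the column holding a keyword's last character.
    ends = [False] * len(columns)
    pos = 0
    for word in keywords:
        pos += len(word)
        ends[pos - 1] = True
    # A column's match signal is high when its stored character equals the input.
    match = [stored == ch for stored in columns]
    return match, ends
```

With the keywords FOO and BAR stored, the input character O raises the match signal on the second and third columns only, and the end-of-keyword bits sit on the third and sixth columns.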

Processor Element (PE) Array

An array of finite state machines that carries out the approximate match algorithm

May contain multiple keywords from the CAM

Takes the match signals from the CAM and sets a PE flag, which is forwarded to subsequent PEs

Evaluates entire input strings in time linear in the size of the input stream
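The flag forwarding described above can be sketched for the exact-match case only; the approximate-match behaviour is not modelled here. The function name `lookup_pe`, the rule that the first PE of each keyword is enabled only on the first key character, and the choice to return the PE number holding the keyword's last character are all assumptions for illustration.

```python
from typing import List, Optional


def lookup_pe(keywords: List[str], key: str) -> Optional[int]:
    """Return the PE number at which `key` completes a stored keyword, else None."""
    columns = "".join(keywords)
    starts, ends = set(), []
    pos = 0
    for word in keywords:
        starts.add(pos)          # first PE of this keyword
        pos += len(word)
        ends.append(pos - 1)     # PE holding the end-of-keyword marker
    flags = [False] * len(columns)
    for t, ch in enumerate(key):  # one input character per clock cycle
        # A PE sets its flag when its column matches and the previous PE's
        # flag was set in the previous cycle (or the key has just started).
        flags = [
            columns[i] == ch and ((t == 0) if i in starts else flags[i - 1])
            for i in range(len(columns))
        ]
    for pe in ends:
        if flags[pe]:
            return pe
    return None
```

Each input character costs one update of the flag vector, so the work grows linearly with the length of the key, as the last bullet above states.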

CMA and PE Interaction

Map Table and Outputs

The map table takes the PE# and outputs the address of the value or an indirect pointer to the value object

The map table has as many slots as there are PEs

Long words can leave holes in the map table
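One way to picture the holes is the sketch below. It assumes, as an interpretation rather than a detail stated on the slide, that only the PE holding a keyword's last character has a meaningful map-table entry. The function name and the addresses are hypothetical.

```python
from typing import List, Optional, Tuple


def build_map_table(entries: List[Tuple[str, int]]) -> List[Optional[int]]:
    """Build a table with one slot per PE; only end-of-keyword PEs hold an address."""
    table: List[Optional[int]] = []
    for keyword, address in entries:
        table.extend([None] * (len(keyword) - 1))  # unused slots ("holes")
        table.append(address)                      # slot of the keyword's last PE
    return table


# Hypothetical value addresses for two 3-character keywords.
table = build_map_table([("FOO", 0x100), ("BAR", 0x200)])
holes = table.count(None)
```

Under this assumption the two keywords occupy six slots but use only two, and the number of unused slots grows with keyword length.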

System Interaction

Implementations Comparison

FPGA Implementation (Xilinx Virtex-II Pro FPGA, XC2VP230-7)

Number of characters     256      512      1024
Slices                   2403     4812     9880
Frequency (MHz)          380.1    476.9    460.2
Throughput (Gb/s)        12.2     15.3     14.7
Searches per second      254 M    318 M    307 M

Software Implementation (1GHz PowerPC Computer)

Number of characters     256      512      1024
Time per search (ns)     1128     1305     1582
Throughput (Gb/s)        0.043    0.037    0.030
Searches per second      887 K    766 K    632 K
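The software searches-per-second figures follow directly from the time per search, and dividing the FPGA rate by the software rate gives the speedup. The short check below uses only numbers from the comparison above; the function and variable names are illustrative.

```python
# Values copied from the comparison above, keyed by number of characters.
TIME_PER_SEARCH_NS = {256: 1128, 512: 1305, 1024: 1582}       # software
FPGA_SEARCHES_PER_S = {256: 254e6, 512: 318e6, 1024: 307e6}   # FPGA


def software_searches_per_second(chars: int) -> float:
    """One search every t nanoseconds is 1e9 / t searches per second."""
    return 1e9 / TIME_PER_SEARCH_NS[chars]


def fpga_speedup(chars: int) -> float:
    """Ratio of the FPGA search rate to the software search rate."""
    return FPGA_SEARCHES_PER_S[chars] / software_searches_per_second(chars)
```

Rounded to thousands, the computed software rates reproduce the 887 K, 766 K and 632 K entries, and the resulting speedup is a few hundred times at every size.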

NETWORK INTRUSION DETECTION

Section II

Network Intrusion Detection

The process of identifying and analyzing packets that may contain threats to the organization’s network

Time-consuming process whose cost grows quickly as the defined rule set or signature set grows

String matching is the most computationally intensive part of intrusion detection
◦ Every incoming packet is compared against several pre-defined signatures

Problems in the CAM Architecture

CAM-based designs cannot easily handle regular expressions

NIDS signatures are not of a fixed size
◦ (e.g. the CAM contains FOO and BAR and the input stream is AFOOBARCD. In a 3-character setup, the comparisons are made against AFO, OBA and RCD; none of these match, so the stream slips right through the detection system)

CAM arrays are very large in area
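The FOO/BAR example can be reproduced in a few lines. The two function names are illustrative: the first models the fixed-width, aligned comparison and the second checks every offset of the stream.

```python
def aligned_hits(stream: str, signatures: set, width: int) -> list:
    """Compare fixed-width, aligned chunks of the stream against the signatures."""
    chunks = [stream[i:i + width] for i in range(0, len(stream), width)]
    return [chunk for chunk in chunks if chunk in signatures]


def sliding_hits(stream: str, signatures: set) -> list:
    """Check every offset, so signatures at any alignment are found."""
    return [
        (i, sig)
        for i in range(len(stream))
        for sig in sorted(signatures)
        if stream.startswith(sig, i)
    ]
```

On the stream AFOOBARCD the aligned version sees only AFO, OBA and RCD and reports nothing, while the sliding version finds FOO at offset 1 and BAR at offset 4.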

Proposed Solution

Use discrete comparators instead of CAMs
◦ Sacrifices the ability to update signatures dynamically; a fair tradeoff as signatures change relatively infrequently

Use p rows of comparators for parallelism, matching several characters in one clock cycle

Remove the aligned keyword approach, as incoming streams may not be aligned to a certain size boundary

System Architecture

Processor Architecture

Processor Architecture

Processor Element Flow

Start at the beginning of the signature

Each step is based on the previous PE and the current PE

If the previous signal and the current signal are both matches, propagate the match signal until the end of the signature

At the end of the signature, if the entire signature matched, flag the sig_match output
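The flow above can be sketched as a scan of an unaligned stream that consumes p characters per clock cycle. This is a software simplification under assumptions: the function name `scan` is hypothetical, and the p comparator rows, which work in the same cycle in hardware, are modelled as an inner loop.

```python
from typing import List, Tuple


def scan(stream: str, signature: str, p: int = 1) -> Tuple[List[int], int]:
    """Return (stream indices where the signature ends, clock cycles used)."""
    m = len(signature)
    if m == 0:
        return [], 0
    # state[i]: signature[:i + 1] has matched, ending at the current character.
    state = [False] * m
    ends, cycles = [], 0
    for base in range(0, len(stream), p):
        cycles += 1  # p characters are consumed per clock cycle
        for offset, ch in enumerate(stream[base:base + p]):
            # A PE matches when its own character matches and the previous
            # PE matched on the previous character (the first PE always may).
            state = [
                signature[i] == ch and (i == 0 or state[i - 1])
                for i in range(m)
            ]
            if state[m - 1]:  # chain intact at the end of the signature: sig_match
                ends.append(base + offset)
    return ends, cycles
```

With `p = 2`, scanning the stream 144 for the signature 144 reports a match ending at index 2 after two cycles. Whether this corresponds exactly to the paper's cycle accounting is not something this sketch claims.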

Signature Match Processor Example

Input string ‘144’ is processed over 2 clock cycles

‘1’ is checked in the first cycle and sets off a match signal into the SMA

‘4’ is checked in the second cycle and sets off a match signal into the SMA

The match signal for ‘1’ is still present from the previous clock cycle

Signature Match Processor Example

The ‘4’ is duplicated, so the first match signal simply propagates to the second as a carry

Since this is the end of the signature, the output is a match: propagated match signals && sig_end

Address Output Logic

In order for the SMP to be useful, we also need to know which signatures caused the match

This is handled by the word match buffer, which maintains the position of the signature match

When the last character being processed has been reached, the match address output logic begins working on the buffer entries

Address Output Logic

A binary tree is used for the matching signatures

Decoding starts, and a signal is sent to the control circuitry stating there are matches

A pointer then propagates up the tree, generating a bit of the final address based on matches

Binary trees are fast and efficient; time to process is ~M cycles, where M is the number of matches
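A tree-based address encoder along these lines can be sketched as follows. This is an illustration under assumptions, not the paper's circuit: the function names are hypothetical, the buffer length is assumed to be a power of two, and the sketch walks from the root towards a leaf, whereas the slide describes the pointer moving up the tree. Either way, one address bit is produced per tree level.

```python
from typing import List, Optional


def first_match_address(match_buffer: List[bool]) -> Optional[int]:
    """Return the lowest address whose match bit is set, or None if none is set."""
    n = len(match_buffer)
    if n == 0 or n & (n - 1):
        raise ValueError("buffer length must be a power of two")
    # Heap layout: node k has children 2k and 2k + 1; leaves start at index n.
    tree = [False] * n + list(match_buffer)
    for k in range(n - 1, 0, -1):
        tree[k] = tree[2 * k] or tree[2 * k + 1]  # OR of everything below node k
    if not tree[1]:
        return None  # root low: no matches to report
    node, address = 1, 0
    while node < n:
        go_right = not tree[2 * node]  # prefer the left subtree when it has a match
        node = 2 * node + int(go_right)
        address = (address << 1) | int(go_right)  # one address bit per level
    return address


def all_match_addresses(match_buffer: List[bool]) -> List[int]:
    """Report every match; the loop runs once per match (M iterations)."""
    buffer = list(match_buffer)
    found = []
    while (addr := first_match_address(buffer)) is not None:
        found.append(addr)
        buffer[addr] = False  # clear the reported entry and continue
    return found
```

The outer loop runs once per set bit, which mirrors the ~M cycles figure quoted above for M matches.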

FPGA Implementation

As parallelism increases, throughput increases while frequency decreases due to complexity

As the number of characters increases, area increases while frequency and throughput decrease
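Throughput can rise even as the clock slows because it depends on both the clock rate and the number of characters consumed per cycle. The sketch below assumes the usual relation of characters per cycle times bits per character times clock frequency. The clock values used in the example are hypothetical and are not results from the paper.

```python
def throughput_gbps(chars_per_cycle: int, clock_mhz: float, bits_per_char: int = 8) -> float:
    """Throughput in Gb/s for a design consuming chars_per_cycle characters per clock."""
    return chars_per_cycle * bits_per_char * clock_mhz / 1000.0


# Hypothetical clocks, for illustration only: a wider but slower design
# can still move more data per second than a narrow, faster one.
narrow = throughput_gbps(chars_per_cycle=1, clock_mhz=300.0)
wide = throughput_gbps(chars_per_cycle=4, clock_mhz=200.0)
```

In this made-up case the four-character design runs at two thirds of the clock rate yet delivers well over twice the throughput.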

Implementation Comparison

Critique

New terms and unknown works referred to

Difficult to follow in some areas due to inconsistencies and how the topic is presented

Lots of procedure / methodology on implementation

Very detailed work

Good examples to strengthen theoretical explanations

Implementation data given for comparison purposes

QUESTIONS?

References

All figures and information used in this presentation are taken from the article:

Janardhan Singaraju, John A. Chandy, "FPGA Based String Matching for Network Processing Applications," Microprocessors and Microsystems (ScienceDirect), December 14, 2007
