deep packet/flow analysis using gpus · 2018-07-23 · deep packet/flow analysis using gpus qian...

Deep Packet/Flow Analysis using GPUs Qian Gong, Wenji Wu, Phil DeMar (Fermilab)

Motivation Performance

Challenges

• Single-threaded packet analysis system has limited performance,e.g., Snort: 0.2 Gbit/s when perform DPA

• Multi-queue NICs and multicore systems provide opportunities.

Multithreading GDPFA Framework

GDPFA Key Mechanism

Conclusion

• Network packet monitoring and deep packet analysis (DPA) arewidely used to support applications such as intrusion detection,surveillance, traffic control and statistics gathering.

• The growing network traffic rate and rising sophistication in the typeof network attacks are pushing for faster and more complex (e.g.,flow-based) packet processing systems.

• GPUs work exceptionally well for network packet analysis; but asolid flow-based GPU packet inspection tool is missing.

Challenges in DPA for TCP Packets

• Payload of packet affiliated to the same TCP stream need to beassembled before matching against pre-defined patterns.

• Conventional packet normalizers have to buffer all packets followinga missing packet, until they become in-sequence again, to preventTCP fragmentation evasion attacks.

T T C A

T

T

A T A KC

A K

ATC T

A

KAlready received and forwarded data

Buffered data

New data

Internet Router

Internal networkPort mirrorSwitch

Network AnalyzerMulti-GbE

NIC

RSS

RQ1

Core 2

Core 2

… …RQ2

Challenges in Packet Capturing

…

….

Captured Data in Queue

Cap

ture

th

read

1

Cap

ture

th

read

2

Cap

ture

th

read

n

RQ 1

. .

RQ 2

RQ n

Mul

ti-qu

eue

NIC

Mul

ti-co

re

Host

sys

tem

s

GPU Domain

Packet buffer 2

Flow Classification

TCP Stream Reassembly

Pattern Matching

Stream Processing

connec next

…

connection records

next hash bucket

hash(4-tuple)

TCB Connection Table

hk state

Statistical Analysis

Packet buffer 1

Key FunctionsFlow Classification and Reordering:• Inter-flow classification: sort streams according to their TCP 4-tuple• Intra-flow reorder: sort same-stream affiliated packets by their sequence numberPacket Normalization: drop duplicate packets and merge the overlapping payloadPattern Matching: match the payload against fixed stringsStream Processing: keep the internal states between consecutive batches

Flow Classification

raw packet

packets in flow and sequence order

sort by (4-tuple|seq #)

p3 p2 p1 p4 p5 p9p6p7p8 p4

p1 p3 p7 p2 p4 p9p8p6p4 p5

next packet array

flow identifier

1 1 1 2 2 3332 2

filter + scan

Stream Normalization

3 7 4 5 98end end endn/a

bytes of overlapping data (prefer new data) scan seq #

0 n1 n2 0 n3 n6n50n4n/a

Pattern Matching

Intra-batch: AC automaton

Keywords: X = {he, his, she, hers}

state transition automaton

• One thread per packet• Each thread scans extra n bytes

towards its consecutive packet, where n is the maximum branch length of the AC tree

Inter-batch: AC automaton &AC-suffix automaton

thread k

thread k+1

thread k+2

Received packetsNew packets

AC-Suffix-Tree

Suffix set of X: {e,is,s,he,ers,rs}

Out-of-order Packets

case 1AC state

AC-suffix state

case 2

AC-suffix state case 3

AC state

Parallel sorting for intra-batch flow classification and packet reordering.

0 1 3 6 9

4 7

2 5 8

h e r s{hers}

{his}

{she}

seh

is

0

1 2 3 4 5

6

e i r s

7

hr

s8 9

e s s10

AC and AC-suffix automaton to preserve the inter-batch pattern matching states.

Traffic source: real traffics mirrored from the Fermilab border router, with a mean packet length 1042-byte.Base system: Intel Xeon CPU E5-2650 v3, NVIDIA GPU K40Pattern for searching: 2120 string extracted from Snort rules

• Develop a highly efficient packet/flow analysis tool hat fully utilizes the parallelism of multicore systems, NICs, and GPUs.

• Present a novel GPU-centric TCP state management and stream reassembly framework.

• Perform on-the-fly DPA without requiring dropping or buffering the OOO packets by implementing an AC-suffix-tree method on GPU.

BPF filter Output

Out-of-order (OOO) packet

Comparison to Existing ToolsSnort1 AC-suffix GASPP3 Ours

Computing platform CPU CPU GPU GPUMethods Stream

ReassemblySplit

detectionIntra-batch

stream reassembly

In-batch streamreassembly +

inter-batch split detection

Detection over OOO packets

✔ ✔ limited ✔

Resistance to fragmentation flood

N Y N Y

Throughput Low Low High High

[1] Roesch, Martin. "Snort: Lightweight intrusion detection for networks." Lisa. Vol. 99. No. 1. 1999.[2] Chen, Xinming, et al. ``AC-suffix-tree: Buffer free string matching on out-of-sequence packets." Architectures for Networking and Communications Systems (ANCS), 2011 Seventh ACM/IEEE Symposium on. IEEE, 2011.[3] Vasiliadis, Giorgos, et al. "Design and Implementation of a Stateful Network Packet Processing Framework for GPUs." IEEE/ACM Transactions on Networking (2016).

GPU CPUTCP Reassembly 552.28 Gbit/s 4.55 Gbit/s (Libnids)Pattern Matching 62.53 Gbit/s 0.271 Gbit/s (Snort)

Raw Processing Throughput

GPU (8-capture-threads) CPU (single-threading) CPU (8-threads)w/ pkt transfer wo/ pkt transfer Snort1 CPU-AC-suffix2 CPU-AC-suffix

31.95 Gbit/s 54.61 Gbit/s 0.248 Gbit/s 0.613 Gbit/s 3.575 Gbit/s

End-to-End Throughput

Selective Prior Work[1] Wenji Wu, et al. “Deep Packet Analysis using GPUs”. GTC 2017.[2] Wenji Wu, Phil DeMar “Wirecap: a novel packet capture engine for commodity NICs in high-speed networks”. Proceedings of the 2014 Conference on Internet Measurement Conference. ACM, 2014.

deep packet/flow analysis using gpus · 2018-07-23 · deep packet/flow analysis using gpus qian...

Documents