deep packet/flow analysis using gpus · 2018-07-23 · deep packet/flow analysis using gpus qian...

1
Deep Packet/Flow Analysis using GPUs Qian Gong, Wenji Wu, Phil DeMar (Fermilab) Motivation Performance Challenges Single-threaded packet analysis system has limited performance, e.g., Snort: 0.2 Gbit/s when perform DPA Multi-queue NICs and multicore systems provide opportunities. Multithreading GDPFA Framework GDPFA Key Mechanism Conclusion Network packet monitoring and deep packet analysis (DPA) are widely used to support applications such as intrusion detection, surveillance, traffic control and statistics gathering. The growing network traffic rate and rising sophistication in the type of network attacks are pushing for faster and more complex (e.g., flow-based) packet processing systems. GPUs work exceptionally well for network packet analysis; but a solid flow-based GPU packet inspection tool is missing. Challenges in DPA for TCP Packets Payload of packet affiliated to the same TCP stream need to be assembled before matching against pre-defined patterns. Conventional packet normalizers have to buffer all packets following a missing packet, until they become in-sequence again, to prevent TCP fragmentation evasion attacks. T T C A T T A T A K C A K A T C T A K Already received and forwarded data Buffered data New data Internet Router Internal network Port mirror Switch Network Analyzer Multi- GbE NIC RSS RQ1 Core 2 Core 2 RQ2 Challenges in Packet Capturing . Captured Data in Queue Capture thread 1 Capture thread 2 Capture thread n RQ 1 . . RQ 2 RQ n Multi-queue NIC Multi-core Host systems GPU Domain Packet buffer 2 Flow Classification TCP Stream Reassembly Pattern Matching Stream Processing connec next connection records next hash bucket hash(4-tuple) TCB Connection Table h k state Statistical Analysis Packet buffer 1 Key Functions Flow Classification and Reordering: Inter-flow classification: sort streams according to their TCP 4-tuple Intra-flow reorder: sort same-stream affiliated packets by their sequence number Packet Normalization: drop duplicate packets and merge the overlapping payload Pattern Matching: match the payload against fixed strings Stream Processing: keep the internal states between consecutive batches Flow Classification raw packet packets in flow and sequence order sort by (4-tuple|seq #) p3 p2 p1 p4 p5 p9 p6 p7 p8 p4 p1 p3 p7 p2 p4 p9 p8 p6 p4 p5 next packet array flow identifier 1 1 1 2 2 3 3 3 2 2 filter + scan Stream Normalization 3 7 4 5 9 8 end end end n/a bytes of overlapping data (prefer new data) scan seq # 0 n 1 n 2 0 n 3 n 6 n 5 0 n 4 n/a Pattern Matching Intra-batch: AC automaton Keywords: X = {he, his, she, hers} state transition automaton One thread per packet Each thread scans extra n bytes towards its consecutive packet, where n is the maximum branch length of the AC tree Inter-batch: AC automaton &AC-suffix automaton thread k thread k+1 thread k+2 Received packets New packets AC-Suffix-Tree Suffix set of X: {e,is,s,he,ers,rs} Out-of-order Packets case 1 AC state AC-suffix state case 2 AC-suffix state case 3 AC state Parallel sorting for intra-batch flow classification and packet reordering. 0 1 3 6 9 4 7 2 5 8 h e r s {hers} {his} {she} s e h i s 0 1 2 3 4 5 6 e i r s 7 h r s 8 9 e s s 10 AC and AC-suffix automaton to preserve the inter-batch pattern matching states. Traffic source: real traffics mirrored from the Fermilab border router, with a mean packet length 1042-byte. Base system: Intel Xeon CPU E5-2650 v3, NVIDIA GPU K40 Pattern for searching: 2120 string extracted from Snort rules Develop a highly efficient packet/flow analysis tool hat fully utilizes the parallelism of multicore systems, NICs, and GPUs. Present a novel GPU-centric TCP state management and stream reassembly framework. Perform on-the-fly DPA without requiring dropping or buffering the OOO packets by implementing an AC-suffix-tree method on GPU. BPF filter Output Out-of-order (OOO) packet Comparison to Existing Tools Snort 1 AC-suffix GASPP 3 Ours Computing platform CPU CPU GPU GPU Methods Stream Reassembly Split detection Intra-batch stream reassembly In-batch stream reassembly + inter-batch split detection Detection over OOO packets limited Resistance to fragmentation flood N Y N Y Throughput Low Low High High [1] Roesch, Martin. "Snort: Lightweight intrusion detection for networks." Lisa. Vol. 99. No. 1. 1999. [2] Chen, Xinming, et al. ``AC-suffix-tree: Buffer free string matching on out-of-sequence packets." Architectures for Networking and Communications Systems (ANCS), 2011 Seventh ACM/IEEE Symposium on. IEEE, 2011. [3] Vasiliadis, Giorgos, et al. "Design and Implementation of a Stateful Network Packet Processing Framework for GPUs." IEEE/ACM Transactions on Networking (2016). GPU CPU TCP Reassembly 552.28 Gbit/s 4.55 Gbit/s (Libnids) Pattern Matching 62.53 Gbit/s 0.271 Gbit/s (Snort) Raw Processing Throughput GPU (8-capture-threads) CPU (single-threading) CPU (8-threads) w/ pkt transfer wo/ pkt transfer Snort 1 CPU-AC-suffix 2 CPU-AC-suffix 31.95 Gbit/s 54.61 Gbit/s 0.248 Gbit /s 0.613 Gbit/s 3.575 Gbit/s End-to-End Throughput Selective Prior Work [1] Wenji Wu, et al. “Deep Packet Analysis using GPUs”. GTC 2017. [2] Wenji Wu, Phil DeMar “Wirecap: a novel packet capture engine for commodity NICs in high-speed networks”. Proceedings of the 2014 Conference on Internet Measurement Conference. ACM, 2014.

Upload: others

Post on 20-Apr-2020

15 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Deep Packet/Flow Analysis using GPUs · 2018-07-23 · Deep Packet/Flow Analysis using GPUs Qian Gong, Wenji Wu, Phil DeMar (Fermilab) Motivation Performance Challenges • Single-threaded

Deep Packet/Flow Analysis using GPUs Qian Gong, Wenji Wu, Phil DeMar (Fermilab)

Motivation Performance

Challenges

• Single-threaded packet analysis system has limited performance,e.g., Snort: 0.2 Gbit/s when perform DPA

• Multi-queue NICs and multicore systems provide opportunities.

Multithreading GDPFA Framework

GDPFA Key Mechanism

Conclusion

• Network packet monitoring and deep packet analysis (DPA) arewidely used to support applications such as intrusion detection,surveillance, traffic control and statistics gathering.

• The growing network traffic rate and rising sophistication in the typeof network attacks are pushing for faster and more complex (e.g.,flow-based) packet processing systems.

• GPUs work exceptionally well for network packet analysis; but asolid flow-based GPU packet inspection tool is missing.

Challenges in DPA for TCP Packets

• Payload of packet affiliated to the same TCP stream need to beassembled before matching against pre-defined patterns.

• Conventional packet normalizers have to buffer all packets followinga missing packet, until they become in-sequence again, to preventTCP fragmentation evasion attacks.

T T C A

T

T

A T A KC

A K

ATC T

A

KAlready received and forwarded data

Buffered data

New data

Internet Router

Internal networkPort mirrorSwitch

Network AnalyzerMulti-GbE

NIC

RSS

RQ1

Core 2

Core 2

… …RQ2

Challenges in Packet Capturing

….

Captured Data in Queue

Cap

ture

th

read

1

Cap

ture

th

read

2

Cap

ture

th

read

n

RQ 1

. .

RQ 2

RQ n

Mul

ti-qu

eue

NIC

Mul

ti-co

re

Host

sys

tem

s

GPU Domain

Packet buffer 2

Flow Classification

TCP Stream Reassembly

Pattern Matching

Stream Processing

connec next

connection records

next hash bucket

hash(4-tuple)

TCB Connection Table

hk state

Statistical Analysis

Packet buffer 1

Key FunctionsFlow Classification and Reordering:• Inter-flow classification: sort streams according to their TCP 4-tuple• Intra-flow reorder: sort same-stream affiliated packets by their sequence numberPacket Normalization: drop duplicate packets and merge the overlapping payloadPattern Matching: match the payload against fixed stringsStream Processing: keep the internal states between consecutive batches

Flow Classification

raw packet

packets in flow and sequence order

sort by (4-tuple|seq #)

p3 p2 p1 p4 p5 p9p6p7p8 p4

p1 p3 p7 p2 p4 p9p8p6p4 p5

next packet array

flow identifier

1 1 1 2 2 3332 2

filter + scan

Stream Normalization

3 7 4 5 98end end endn/a

bytes of overlapping data (prefer new data) scan seq #

0 n1 n2 0 n3 n6n50n4n/a

Pattern Matching

Intra-batch: AC automaton

Keywords: X = {he, his, she, hers}

state transition automaton

• One thread per packet• Each thread scans extra n bytes

towards its consecutive packet, where n is the maximum branch length of the AC tree

Inter-batch: AC automaton &AC-suffix automaton

thread k

thread k+1

thread k+2

Received packetsNew packets

AC-Suffix-Tree

Suffix set of X: {e,is,s,he,ers,rs}

Out-of-order Packets

case 1AC state

AC-suffix state

case 2

AC-suffix state case 3

AC state

Parallel sorting for intra-batch flow classification and packet reordering.

0 1 3 6 9

4 7

2 5 8

h e r s{hers}

{his}

{she}

seh

is

0

1 2 3 4 5

6

e i r s

7

hr

s8 9

e s s10

AC and AC-suffix automaton to preserve the inter-batch pattern matching states.

Traffic source: real traffics mirrored from the Fermilab border router, with a mean packet length 1042-byte.Base system: Intel Xeon CPU E5-2650 v3, NVIDIA GPU K40Pattern for searching: 2120 string extracted from Snort rules

• Develop a highly efficient packet/flow analysis tool hat fully utilizes the parallelism of multicore systems, NICs, and GPUs.

• Present a novel GPU-centric TCP state management and stream reassembly framework.

• Perform on-the-fly DPA without requiring dropping or buffering the OOO packets by implementing an AC-suffix-tree method on GPU.

BPF filter Output

Out-of-order (OOO) packet

Comparison to Existing ToolsSnort1 AC-suffix GASPP3 Ours

Computing platform CPU CPU GPU GPUMethods Stream

ReassemblySplit

detectionIntra-batch

stream reassembly

In-batch streamreassembly +

inter-batch split detection

Detection over OOO packets

✔ ✔ limited ✔

Resistance to fragmentation flood

N Y N Y

Throughput Low Low High High

[1] Roesch, Martin. "Snort: Lightweight intrusion detection for networks." Lisa. Vol. 99. No. 1. 1999.[2] Chen, Xinming, et al. ``AC-suffix-tree: Buffer free string matching on out-of-sequence packets." Architectures for Networking and Communications Systems (ANCS), 2011 Seventh ACM/IEEE Symposium on. IEEE, 2011.[3] Vasiliadis, Giorgos, et al. "Design and Implementation of a Stateful Network Packet Processing Framework for GPUs." IEEE/ACM Transactions on Networking (2016).

GPU CPUTCP Reassembly 552.28 Gbit/s 4.55 Gbit/s (Libnids)Pattern Matching 62.53 Gbit/s 0.271 Gbit/s (Snort)

Raw Processing Throughput

GPU (8-capture-threads) CPU (single-threading) CPU (8-threads)w/ pkt transfer wo/ pkt transfer Snort1 CPU-AC-suffix2 CPU-AC-suffix

31.95 Gbit/s 54.61 Gbit/s 0.248 Gbit/s 0.613 Gbit/s 3.575 Gbit/s

End-to-End Throughput

Selective Prior Work[1] Wenji Wu, et al. “Deep Packet Analysis using GPUs”. GTC 2017.[2] Wenji Wu, Phil DeMar “Wirecap: a novel packet capture engine for commodity NICs in high-speed networks”. Proceedings of the 2014 Conference on Internet Measurement Conference. ACM, 2014.