deep packet/flow analysis using gpus · 2018-07-23 · deep packet/flow analysis using gpus qian...
TRANSCRIPT
Deep Packet/Flow Analysis using GPUs Qian Gong, Wenji Wu, Phil DeMar (Fermilab)
Motivation Performance
Challenges
• Single-threaded packet analysis system has limited performance,e.g., Snort: 0.2 Gbit/s when perform DPA
• Multi-queue NICs and multicore systems provide opportunities.
Multithreading GDPFA Framework
GDPFA Key Mechanism
Conclusion
• Network packet monitoring and deep packet analysis (DPA) arewidely used to support applications such as intrusion detection,surveillance, traffic control and statistics gathering.
• The growing network traffic rate and rising sophistication in the typeof network attacks are pushing for faster and more complex (e.g.,flow-based) packet processing systems.
• GPUs work exceptionally well for network packet analysis; but asolid flow-based GPU packet inspection tool is missing.
Challenges in DPA for TCP Packets
• Payload of packet affiliated to the same TCP stream need to beassembled before matching against pre-defined patterns.
• Conventional packet normalizers have to buffer all packets followinga missing packet, until they become in-sequence again, to preventTCP fragmentation evasion attacks.
T T C A
T
T
A T A KC
A K
ATC T
A
KAlready received and forwarded data
Buffered data
New data
Internet Router
Internal networkPort mirrorSwitch
Network AnalyzerMulti-GbE
NIC
RSS
RQ1
Core 2
Core 2
… …RQ2
Challenges in Packet Capturing
…
….
Captured Data in Queue
Cap
ture
th
read
1
Cap
ture
th
read
2
Cap
ture
th
read
n
RQ 1
. .
RQ 2
RQ n
Mul
ti-qu
eue
NIC
Mul
ti-co
re
Host
sys
tem
s
GPU Domain
Packet buffer 2
Flow Classification
TCP Stream Reassembly
Pattern Matching
Stream Processing
connec next
…
connection records
next hash bucket
hash(4-tuple)
TCB Connection Table
hk state
Statistical Analysis
Packet buffer 1
Key FunctionsFlow Classification and Reordering:• Inter-flow classification: sort streams according to their TCP 4-tuple• Intra-flow reorder: sort same-stream affiliated packets by their sequence numberPacket Normalization: drop duplicate packets and merge the overlapping payloadPattern Matching: match the payload against fixed stringsStream Processing: keep the internal states between consecutive batches
Flow Classification
raw packet
packets in flow and sequence order
sort by (4-tuple|seq #)
p3 p2 p1 p4 p5 p9p6p7p8 p4
p1 p3 p7 p2 p4 p9p8p6p4 p5
next packet array
flow identifier
1 1 1 2 2 3332 2
filter + scan
Stream Normalization
3 7 4 5 98end end endn/a
bytes of overlapping data (prefer new data) scan seq #
0 n1 n2 0 n3 n6n50n4n/a
Pattern Matching
Intra-batch: AC automaton
Keywords: X = {he, his, she, hers}
state transition automaton
• One thread per packet• Each thread scans extra n bytes
towards its consecutive packet, where n is the maximum branch length of the AC tree
Inter-batch: AC automaton &AC-suffix automaton
thread k
thread k+1
thread k+2
Received packetsNew packets
AC-Suffix-Tree
Suffix set of X: {e,is,s,he,ers,rs}
Out-of-order Packets
case 1AC state
AC-suffix state
case 2
AC-suffix state case 3
AC state
Parallel sorting for intra-batch flow classification and packet reordering.
0 1 3 6 9
4 7
2 5 8
h e r s{hers}
{his}
{she}
seh
is
0
1 2 3 4 5
6
e i r s
7
hr
s8 9
e s s10
AC and AC-suffix automaton to preserve the inter-batch pattern matching states.
Traffic source: real traffics mirrored from the Fermilab border router, with a mean packet length 1042-byte.Base system: Intel Xeon CPU E5-2650 v3, NVIDIA GPU K40Pattern for searching: 2120 string extracted from Snort rules
• Develop a highly efficient packet/flow analysis tool hat fully utilizes the parallelism of multicore systems, NICs, and GPUs.
• Present a novel GPU-centric TCP state management and stream reassembly framework.
• Perform on-the-fly DPA without requiring dropping or buffering the OOO packets by implementing an AC-suffix-tree method on GPU.
BPF filter Output
Out-of-order (OOO) packet
Comparison to Existing ToolsSnort1 AC-suffix GASPP3 Ours
Computing platform CPU CPU GPU GPUMethods Stream
ReassemblySplit
detectionIntra-batch
stream reassembly
In-batch streamreassembly +
inter-batch split detection
Detection over OOO packets
✔ ✔ limited ✔
Resistance to fragmentation flood
N Y N Y
Throughput Low Low High High
[1] Roesch, Martin. "Snort: Lightweight intrusion detection for networks." Lisa. Vol. 99. No. 1. 1999.[2] Chen, Xinming, et al. ``AC-suffix-tree: Buffer free string matching on out-of-sequence packets." Architectures for Networking and Communications Systems (ANCS), 2011 Seventh ACM/IEEE Symposium on. IEEE, 2011.[3] Vasiliadis, Giorgos, et al. "Design and Implementation of a Stateful Network Packet Processing Framework for GPUs." IEEE/ACM Transactions on Networking (2016).
GPU CPUTCP Reassembly 552.28 Gbit/s 4.55 Gbit/s (Libnids)Pattern Matching 62.53 Gbit/s 0.271 Gbit/s (Snort)
Raw Processing Throughput
GPU (8-capture-threads) CPU (single-threading) CPU (8-threads)w/ pkt transfer wo/ pkt transfer Snort1 CPU-AC-suffix2 CPU-AC-suffix
31.95 Gbit/s 54.61 Gbit/s 0.248 Gbit/s 0.613 Gbit/s 3.575 Gbit/s
End-to-End Throughput
Selective Prior Work[1] Wenji Wu, et al. “Deep Packet Analysis using GPUs”. GTC 2017.[2] Wenji Wu, Phil DeMar “Wirecap: a novel packet capture engine for commodity NICs in high-speed networks”. Proceedings of the 2014 Conference on Internet Measurement Conference. ACM, 2014.