Network Traffic Monitoring and Analysis with GPUs
Wenji Wu, Phil DeMar
1. The Problem
In high-speed networks (10 Gbps and above), network traffic monitoring and analysis applications that require scrutiny on a per-packet basis typically demand immense computing power and very high I/O throughput. These applications face extreme performance and scalability challenges.
2. Our Solution: GPU-Based Traffic Monitoring and Analysis Tools
At Fermilab, we have prototyped a network traffic monitoring and analysis system using GPUs.
3. Why Choose GPU?
Characteristics of packet-based network monitoring & analysis applications:
• Time constraints on packet processing.
• Compute- and I/O-throughput-intensive.
• High levels of data parallelism.
• Extremely poor temporal locality of data.
Requirements on computing platform for high performance network monitoring & analysis applications:
• High compute power
• Ample memory bandwidth
• Ability to handle the data parallelism inherent in network data
• Easy programmability
Three types of computing platforms available:
• NPU/ASIC
• CPU
• GPU
The GPU is well suited for network monitoring and analysis in high-speed networks:
Features                        NPU/ASIC   CPU   GPU
High compute power              Varies     ✖     ✔
High memory bandwidth           Varies     ✖     ✔
Easy programmability            ✖          ✔     ✔
Data-parallel execution model   ✖          ✖     ✔
Architecture Comparison
[Figure: System architecture. (1) Traffic Capture: network packets arriving at the NICs are captured into packet buffers in host memory. (2) Preprocessing: captured data is assembled into packet chunks and copied into the GPU domain. (3) Monitoring & Analysis: monitoring and analysis kernels process the packets on the GPU. (4) Output Display: results are returned to user space for display.]
Four types of Logical Entities:
(1) Traffic Capture (2) Preprocessing (3) Monitoring & Analysis (4) Output Display
Key Mechanisms:
• Partial packet capture approach: the GPU has a relatively small memory, so only packet headers, rather than entire packets, are copied into the GPU domain.
• A new packet I/O engine: uses commodity NICs to capture network traffic into the CPU domain without packet drops.
• A GPU-accelerated library for network monitoring and analysis, consisting of dozens of CUDA kernels that can be combined in multiple ways for the intended tasks.
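As a concrete illustration of the partial-capture idea, the sketch below copies only the first few bytes of each packet into a fixed-stride header buffer destined for the GPU. This is a minimal sequential C sketch; the function name, the `SNAPLEN` value, and the buffer layout are assumptions for illustration, not the system's actual code.

```c
#include <stdint.h>
#include <string.h>

/* Assumed snapshot length: Ethernet (14) + IPv4 (20) + TCP (20) headers. */
#define SNAPLEN 54

/* Partial packet capture: copy only the first SNAPLEN bytes of each packet
 * into a contiguous, fixed-stride header buffer for transfer to the GPU.
 * The original wire length is recorded so analysis kernels can still reason
 * about packet sizes. Returns the number of bytes to transfer. */
static size_t snapshot_headers(const uint8_t *const *pkts, const size_t *lens,
                               size_t npkts, uint8_t *hdr_buf,
                               uint16_t *orig_len)
{
    size_t off = 0;
    for (size_t i = 0; i < npkts; i++) {
        size_t n = lens[i] < SNAPLEN ? lens[i] : SNAPLEN;
        memcpy(hdr_buf + off, pkts[i], n);   /* truncate the payload     */
        orig_len[i] = (uint16_t)lens[i];     /* remember the wire length */
        off += SNAPLEN;                      /* fixed stride on the GPU  */
    }
    return off;
}
```

The fixed stride keeps GPU memory accesses regular: thread `i` finds its header at offset `i * SNAPLEN`.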
4. System Architecture
Packet I/O Engine
[Figure: Incoming packets are received by the NIC into pre-allocated packet buffer chunks attached to the receive descriptor ring in the OS kernel. Captured chunks are memory-mapped into user space for processing, then recycled to the pool of free packet buffer chunks.]
Key techniques:
• Pre-allocated large packet buffers
• Packet-level batch processing
• Memory-mapping based zero-copy
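A minimal sketch of the pre-allocation and recycling idea, assuming a simple chunk pool (all names and sizes here are illustrative; the real engine attaches chunks to the NIC descriptor ring and hands them to user space via mmap):

```c
#include <stddef.h>

/* Illustrative sizes; the real engine pre-allocates large buffers. */
#define CHUNK_BYTES 2048
#define NCHUNKS 8

/* A chunk holds a batch of packets: the NIC driver fills it, user space
 * processes the whole batch (zero-copy via mmap in the real engine),
 * and the chunk is then recycled rather than freed. */
struct chunk { unsigned char data[CHUNK_BYTES]; size_t used; };

static struct chunk pool[NCHUNKS];   /* allocated once at startup */
static int free_stack[NCHUNKS];
static int nfree;

static void pool_init(void) {
    nfree = NCHUNKS;
    for (int i = 0; i < NCHUNKS; i++)
        free_stack[i] = i;
}

/* Attach a free chunk to the receive ring (NULL if the pool is empty). */
static struct chunk *chunk_get(void) {
    return nfree > 0 ? &pool[free_stack[--nfree]] : NULL;
}

/* Return a processed chunk to the free pool for reuse. */
static void chunk_recycle(struct chunk *c) {
    c->used = 0;
    free_stack[nfree++] = (int)(c - pool);
}
```

Because chunks are allocated once and recycled, the capture path performs no per-packet allocation, which is one reason the engine can sustain capture without drops.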
Packet-Filtering Kernel
[Figure: The packet-filtering kernel runs in three data-parallel steps over a batch of packets raw_pkts[] = [p1 … p8]:
(1) Filtering: each packet is tested against the filter, producing filtering_buf[] = [1, 0, 1, 1, 0, 0, 1, 0] (1 = match).
(2) Scan: an exclusive prefix sum over filtering_buf[] yields scan_buf[] = [0, 1, 1, 2, 3, 3, 3, 4], the output index of each matching packet.
(3) Compact: matching packets are gathered into filtered_pkts[] = [p1, p3, p4, p7].]
Advanced packet-filtering capabilities at wire speed are necessary so that only the packets of interest are analyzed.
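The filtering, scan, and compact steps of the packet-filtering kernel can be sketched sequentially in C as follows. This is a simplified illustration, not the actual CUDA code: packets are reduced to their IP protocol numbers, and the names `filter_udp`, `exclusive_scan`, and `compact` are hypothetical.

```c
/* Sequential sketch of the three steps; in the actual system each step is
 * a data-parallel CUDA kernel with (roughly) one thread per packet. */

/* Step 1: Filtering — flag each packet that matches the filter ("udp"). */
static void filter_udp(const int *proto, int n, int *flags) {
    for (int i = 0; i < n; i++)
        flags[i] = (proto[i] == 17);        /* 17 = IPPROTO_UDP */
}

/* Step 2: Scan — exclusive prefix sum assigns each match an output slot. */
static void exclusive_scan(const int *flags, int n, int *scan) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        scan[i] = sum;
        sum += flags[i];
    }
}

/* Step 3: Compact — gather matching packets into a dense output array;
 * returns the number of packets that passed the filter. */
static int compact(const int *pkts, const int *flags, const int *scan,
                   int n, int *out) {
    for (int i = 0; i < n; i++)
        if (flags[i])
            out[scan[i]] = pkts[i];
    return n > 0 ? scan[n - 1] + flags[n - 1] : 0;
}
```

The scan step is what makes the compaction data-parallel on the GPU: each matching packet learns its output index independently, so all matches can be written concurrently without contention.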
5. Initial Results
• The GPU can significantly speed up packet processing. Compared to a single CPU core, the GPU's speedup ratios range from 8.82 to 17.04; compared to a 6-core CPU, they range from 1.54 to 3.20.
GPU Packet-filtering Algorithm Evaluation
[Figure: Throughput (unit: million packets/s) for two BPF filters, "udp" (left) and "net 131.225.107 & tcp" (right), on Data-set-1 through Data-set-4. Configurations compared: mmap-gpu, standard-gpu, and single- and multi-core CPU implementations (s-cpu / m-cpu) at 1.6, 2.0, and 2.4 GHz.]