TRANSCRIPT
High-Performance Packet Classification on GPU
Authors: Shijie Zhou, Shreyas G. Singapura and Viktor K. Prasanna
Published in: HPEC 2014
Presenter: Gang Chi
Date: 2014/12/2
Department of Computer Science and Information Engineering National Cheng Kung University, Taiwan R.O.C.
Introduction (1/2)
This paper investigates the GPU's parallelism and memory-access characteristics and implements a packet classifier using CUDA.
The basic operations of this design are binary range-tree search and the bitwise AND operation.
The design is optimized by storing the range-trees in shared memory as compact arrays without explicit pointers.
National Cheng Kung University CSIE Computer & Internet Architecture Lab
Introduction (2/2)
When the rule set size is 512, the design achieves a throughput of 85 MPPS and an average processing latency of 4.9 us per packet.
Compared with an implementation on a state-of-the-art multi-core platform, this design demonstrates a 1.9x improvement in throughput.
CUDA Memory Model
Type          | Location | Access cycle | Size
Global memory | Off-chip | >100         | 1~32 GB per GPU
L1 cache      | On-chip  | 1~32         | 16 or 48 KB per SMX
L2 cache      | On-chip  | 1~32         | 64 KB per SMX
Registers     | On-chip  | n/a          | 32-bit x 65536 per SMX
Algorithm
Phase 1: each thread examines N/K rules and produces a local classification result.
Phase 2: the rule with the highest priority among the K local results is identified in log K steps.
Pre-process
The rules are pre-processed to construct a binary range-tree for each individual field. Every leaf node is assigned BVs (bit vectors) that indicate which rules match when the search reaches that leaf node.
Search
Each thread performs the binary range-tree search sequentially, field by field. After 5 tree searches, 5 BVs are produced.
The 5 BVs are merged by a bitwise AND operation to obtain the final BV.
The result is the index of the first non-zero bit.
• Ex: BV=00100, Result=2
• Ex: BV=00000, Result=65536 (no rule matched)
Search in Binary Range Tree
Ex: searching for the value 4 in the range-tree
Identify Global Result
Experimental Platform
CUDA 5.0
Intel E5-2665 x2
• 2.4 GHz
• 8 cores
NVIDIA K20 Kepler GPU
• 705.5 MHz
• 13 SMX with 2496 CUDA cores in total
• 5 GB GDDR5
Latency and Throughput
Comparison with the Multi-core Implementation
[8] S. Zhou, Y. Qu and V. K. Prasanna, “Multi-core implementation of decomposition-based packet classification algorithms,” in Parallel Computing Technologies (PaCT), pp. 105-119, 2013.