TRANSCRIPT
High-Performance Packet Classification on GPU
Authors: Shijie Zhou, Shreyas G. Singapura and Viktor K. Prasanna
Published in: HPEC 2014
Presenter: Gang Chi
Date: 2014/12/2
Department of Computer Science and Information Engineering National Cheng Kung University, Taiwan R.O.C.
Introduction (1/2)
This paper investigates the GPU's parallelism and memory-access characteristics and implements a packet classifier using CUDA.
The basic operations of this design are binary range-tree search and the bitwise AND operation.
The design is optimized by storing the range-trees in shared memory as compact arrays without explicit pointers.
National Cheng Kung University CSIE Computer & Internet Architecture Lab
Introduction (2/2)
When the rule set size is 512, the design achieves a throughput of 85 MPPS and an average processing latency of 4.9 us per packet.
Compared with an implementation on a state-of-the-art multi-core platform, this design demonstrates a 1.9x improvement in throughput.
CUDA Memory Model
Type          | Location | Access cycle | Size
Global memory | Off-chip | >100         | 1~32 GB per GPU
L1 cache      | On-chip  | 1~32         | 16 or 48 KB per SMX
L2 cache      | On-chip  | 1~32         | 64 KB per SMX
Registers     | On-chip  | n/a          | 32-bit x 65536 per SMX
Algorithm
Phase 1: each thread examines N/K rules and produces a local classification result.
Phase 2: the rule with the highest priority among the K local results is identified in log K steps.
Pre-process
The rules are pre-processed to construct a binary range-tree for each individual field. Every leaf node is assigned BVs (bit vectors) that indicate which rules match when the search reaches that leaf node.
Search
Each thread performs the binary range-tree search sequentially, field by field. After 5 tree searches, 5 BVs are produced.
The 5 BVs are merged by a bitwise AND operation to obtain the final BV.
The result is the index of the first non-zero bit.
• Ex: BV=00100, Result=2
• Ex: BV=00000, Result=65536 (no rule matched)
Search in Binary Range Tree
Ex: searching for the value 4 in the range-tree
Identify Global Result
Experimental Platform
CUDA 5.0
Intel E5-2665 x2
• 2.4 GHz
• 8 cores
NVIDIA K20 Kepler GPU
• 705.5 MHz
• 13 SMX with 2496 CUDA cores in total
• 5 GB GDDR5
Latency and Throughput
Comparison with the Multi-core Implementation
[8] S. Zhou, Y. Qu and V. K. Prasanna, “Multi-core implementation of decomposition-based packet classification algorithms,” in Parallel Computing Technologies (PaCT), pp. 105-119, 2013.