
Page 1: Content-Based Matching on GPUs

High Performance Content-Based Matching Using GPUs

Alessandro Margara and Gianpaolo Cugola
[email protected], [email protected]

Dip. Elettronica e Informazione (DEI), Politecnico di Milano

DEBS 2011

Page 2: Content-Based Matching on GPUs


The Problem: Content-Based Matching

Publishers -> Content-Based Matching -> Subscribers

• Example subscription: (Smoke=true and Room="Kitchen") or (Light>30 and Room="Bedroom")
• Example event: Light=50, Room=Bedroom, Sender="Sensor1"

Terminology (with reference to the example above):
• Attribute: a name/value pair carried by an event, e.g. Light=50
• Constraint: a condition on a single attribute, e.g. Light>30
• Filter: a conjunction of constraints, e.g. Light>30 and Room="Bedroom"
• Predicate: a disjunction of filters, i.e. the whole subscription

Page 3: Content-Based Matching on GPUs


Programming GPUs: CUDA
• Introduced by Nvidia in 2006
• General purpose parallel computing architecture
  – New instruction set
  – New programming model
  – Programmable using high-level languages
    • CUDA C (a C dialect)

Page 4: Content-Based Matching on GPUs


Programming Model: Basics
– The device (GPU) acts as a coprocessor for the host (CPU) and has its own separate memory space
  • It is necessary to copy input data from the main memory to the GPU memory before starting a computation…
  • …and to copy results back to the main memory when the computation finishes
    – Often the most expensive operations
      » Involve sending information through the PCI-Ex bus
      » Bandwidth but also latency
    – Also requires serialization of data structures!
      » They must be kept simple

Page 5: Content-Based Matching on GPUs


Typical Workflow
• Allocate memory on the device
• Serialize and copy data to the device
• Execute one or more kernels on the device
• Wait for the device to finish processing
• Copy results back
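As an illustration of this workflow, the sketch below shows the host-side steps in CUDA C around a placeholder kernel. The kernel and all names are illustrative assumptions, not part of the presented system.

#include <cuda_runtime.h>

// Placeholder kernel: doubles every element of the input.
__global__ void processKernel(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2;
}

void runOnDevice(const int *hostIn, int *hostOut, int n) {
    int *devIn, *devOut;
    size_t bytes = n * sizeof(int);

    cudaMalloc((void **)&devIn, bytes);                           // allocate memory on the device
    cudaMalloc((void **)&devOut, bytes);
    cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);     // copy (serialized) input data to the device

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    processKernel<<<blocks, threads>>>(devIn, devOut, n);         // execute the kernel

    cudaDeviceSynchronize();                                      // wait for the device to finish processing
    cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);   // copy results back to main memory

    cudaFree(devIn);
    cudaFree(devOut);
}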

Page 6: Content-Based Matching on GPUs


Programming Model: Fundamentals
• Single Program Multiple Threads implementation strategy
  – A single kernel (function) is executed by multiple threads in parallel
• Threads are organized in blocks
  – Threads within different blocks operate independently
  – Threads within the same block cooperate to solve a single sub-problem
• The runtime provides the blockIdx and threadIdx variables, to uniquely identify each running thread
  – Accessing such variables is the only way to differentiate the work done by different threads (see the sketch below)
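A minimal sketch of this idea (illustrative names, not the authors' code): each thread combines blockIdx and threadIdx into a global index and uses it to select the data it operates on.

__global__ void scaleKernel(float *data, float factor, int n) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // unique index for each thread
    if (globalId < n)                                      // guard: the last block may be partially full
        data[globalId] *= factor;
}

// Launch example: enough blocks of 256 threads to cover n elements.
// scaleKernel<<<(n + 255) / 256, 256>>>(devData, 2.0f, n);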

Page 7: Content-Based Matching on GPUs


Programming Model: Memory Management
• Hierarchical organization of memory
  – All threads have access to the same common global memory
    • Large (512MB-6GB) but slow (DRAM)
    • Stores information received from the host
    • Persistent across different kernel calls
  – Threads within a block coordinate themselves using a shared memory
    • Implemented on-chip
    • Fast but limited (16-48KB)
  – Each thread has its own local memory
• The shared memory is the only "cache" available (see the sketch below)
  – No hardware/system support
  – Must be explicitly controlled by the application code
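A minimal sketch of using shared memory as an explicitly managed cache (assuming a block size of 256 threads; names are illustrative): each thread loads one element from global memory into the block's on-chip tile, the block synchronizes, and subsequent reads hit the fast copy instead of DRAM.

__global__ void sumTile(const float *in, float *blockSums, int n) {
    __shared__ float tile[256];                       // on-chip, visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;       // one global-memory read per thread
    __syncthreads();                                  // wait until the whole tile is loaded

    if (threadIdx.x == 0) {                           // naive reduction over the shared tile
        float sum = 0.0f;
        for (int k = 0; k < blockDim.x; ++k)
            sum += tile[k];
        blockSums[blockIdx.x] = sum;
    }
}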

Page 8: Content-Based Matching on GPUs


More on Memory Management
• Without hardware managed caches, accesses to global memory can easily become a bottleneck
• Issues to consider when designing algorithms and data structures
  – Maximize usage of shared (block local) memory
    • Without exceeding its size
  – Threads with contiguous ids should access contiguous global memory regions
    • The hardware can combine them into a few memory-wide accesses (compare the two sketches below)
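A hypothetical illustration of that access-pattern rule: in coalescedCopy, threads with contiguous ids read contiguous addresses, so the hardware can merge a warp's loads into a few wide transactions; in stridedCopy, neighbouring threads read far-apart elements and the loads cannot be combined.

__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                       // thread i touches element i: coalesced
}

__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];        // neighbouring threads read scattered elements
}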

Page 9: Content-Based Matching on GPUs


Hardware Implementation
• An array of Streaming Multiprocessors (SMs) containing many (extremely simple) processing cores
  – Each SM executes threads in groups of 32 called warps
    • Scheduling is performed in hardware with zero overhead
  – Optimized for data parallel problems
    • Maximum efficiency only if all threads in a warp agree on the execution path

Page 10: Content-Based Matching on GPUs


Some Numbers
• NVIDIA GTX 460
  – 1GB RAM (global memory)
  – 7 Streaming Multiprocessors
  – Each SM contains 48 cores
  – Each SM manages up to 48 warps (32 threads each)
• Up to 10752 threads managed concurrently (7 SMs x 48 warps x 32 threads)
  – Up to 336 threads running concurrently (7 SMs x 48 cores)
• Today's cheap GPU: less than 160$

Page 11: Content-Based Matching on GPUs


Existing Algorithms
• Two approaches
  – Counting algorithms
  – Tree-based algorithms
• Complex data structures to optimize sequential execution
  – Trees, maps, …
  – Lots of pointers!
• Hardly fit the data parallel programming model!

Page 12: Content-Based Matching on GPUs


Algorithm Description
• Example subscriptions
  – Interface S1: F1 = (A>10 and B=20), F2 = (B>15 and C<30)
  – Interface S2: F3 = (D=20)
• Constraints are stored together with the filter they belong to:

  Constraint  Filter
  A>10        F1
  B=20        F1
  B>15        F2
  C<30        F2
  D=20        F3

• Each filter records its size (number of constraints), a count of constraints satisfied so far, and the interface it was received from:

  Filter  Size  Count  Interface
  F1      2     0      S1
  F2      2     0      S1
  F3      1     0      S2

• Example event: A=12, B=20
  – A=12 satisfies A>10 (F1's count becomes 1); B=20 satisfies B=20 and B>15 (F1's count becomes 2, F2's count becomes 1)
  – F1's count equals its size, so F1 matches and interface S1 is selected; F2 and F3 do not match
• A sketch of these flat data structures follows below
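For illustration only (field and array names are mine, not the authors'), the flat, pointer-free structures of the example above could look as follows in C: constraints grouped per attribute in plain arrays, plus per-filter size, count, and interface arrays.

enum Op { LT, EQ, GT };

struct Constraint {
    enum Op op;        // e.g. GT for ">"
    int     value;     // e.g. 10
    int     filterId;  // filter the constraint belongs to: F1 -> 0, F2 -> 1, F3 -> 2
};

/* Constraints on attribute "A": A>10 (belongs to F1). */
struct Constraint constraintsA[] = { { GT, 10, 0 } };
/* Constraints on attribute "B": B=20 (F1) and B>15 (F2). */
struct Constraint constraintsB[] = { { EQ, 20, 0 }, { GT, 15, 1 } };

int filterSize[]      = { 2, 2, 1 };   /* number of constraints of F1, F2, F3 */
int filterCount[]     = { 0, 0, 0 };   /* satisfied constraints, reset before each event */
int filterInterface[] = { 0, 0, 1 };   /* S1, S1, S2 */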

Page 13: Content-Based Matching on GPUs


Algorithm Description
• Constraints on the same attribute name are stored in an array on the GPU
  – Contiguous memory regions
• When processing an event E, the CPU selects all relevant constraint arrays
  – Based on the names of the attributes in E (see the host-side sketch below)
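A host-side sketch of this selection step, with hypothetical names: the CPU keeps, for each attribute name, a handle to the constraint array already resident in GPU memory, and picks the arrays matching the attributes of the incoming event.

#include <string.h>

struct ConstraintArray {
    const char *attributeName;   // e.g. "A"
    void       *devConstraints;  // device pointer to this attribute's constraint array
    int         numConstraints;
};

/* Returns the constraint array registered for attributeName,
 * or NULL if no filter constrains that attribute. */
struct ConstraintArray *selectArray(struct ConstraintArray *arrays, int numArrays,
                                    const char *attributeName) {
    for (int i = 0; i < numArrays; ++i)
        if (strcmp(arrays[i].attributeName, attributeName) == 0)
            return &arrays[i];
    return NULL;
}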

Page 14: Content-Based Matching on GPUs


Algorithm Description
• Bi-dimensional organization of threads
  – One thread for each attribute/constraint pair
• Threads in the same block evaluate the same attribute
  – It can be copied into shared memory
• Threads with contiguous ids access contiguous constraints
  – Accesses combined into a few memory-wide operations
• Filter counts are updated with an atomic operation (see the kernel sketch below)
• Example event attributes: A=7, B=32, C=21
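A minimal kernel sketch of this organization (all names and the operator encoding are illustrative assumptions, not the authors' code): blockIdx.y selects the event attribute, the attribute is staged in shared memory, contiguous threads evaluate contiguous constraints, and each satisfied constraint atomically increments its filter's count.

struct GpuConstraint { int op; int value; int filterId; };   // op: 0 "<", 1 "=", 2 ">"
struct EventAttr     { int nameId; int value; };

// Launched with a 2D grid: gridDim.y = number of event attributes,
// gridDim.x * blockDim.x >= constraints of the largest attribute array.
__global__ void evalConstraints(const struct EventAttr *attrs,
                                struct GpuConstraint * const *constraintArrays,
                                const int *constraintsPerAttr,
                                unsigned int *filterCount) {
    __shared__ struct EventAttr attr;                 // the attribute handled by this block
    if (threadIdx.x == 0)
        attr = attrs[blockIdx.y];                     // copy the attribute into shared memory
    __syncthreads();

    int c = blockIdx.x * blockDim.x + threadIdx.x;    // constraint index: contiguous threads
    if (c >= constraintsPerAttr[blockIdx.y])          // access contiguous constraints
        return;

    struct GpuConstraint cons = constraintArrays[blockIdx.y][c];
    int satisfied = (cons.op == 0 && attr.value <  cons.value) ||
                    (cons.op == 1 && attr.value == cons.value) ||
                    (cons.op == 2 && attr.value >  cons.value);
    if (satisfied)
        atomicAdd(&filterCount[cons.filterId], 1u);   // one more satisfied constraint for this filter
}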

Page 15: Content-Based Matching on GPUs


Improvement
• Problem: before processing each event we need to reset the filter counts and the interface selection vector
• Naïve version: use a memset
  – Communication with the GPU introduces additional delay
• Solution: keep two copies of the filter counts and of the interface vector
• While processing an event
  – One copy is used
  – The other copy is reset for the next event
  – Inside the same kernel
• No communication overhead (see the sketch below)
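A sketch of the double-buffering idea (variable names are mine): both copies of the counters live on the GPU; while the copy selected by cur accumulates counts for the current event, the same kernel zeroes the other copy so it is ready for the next event, with no extra memset or host round-trip.

__global__ void evalAndReset(unsigned int *filterCount0,
                             unsigned int *filterCount1,
                             int numFilters, int cur /* 0 or 1 */) {
    unsigned int *inUse  = cur ? filterCount1 : filterCount0;  // counters for the current event
    unsigned int *toZero = cur ? filterCount0 : filterCount1;  // counters for the next event

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numFilters)
        toZero[i] = 0;      // reset the spare copy while the current event is being processed

    /* ... constraint evaluation as in the previous sketch, updating inUse[...] ... */
    (void)inUse;            // placeholder: evaluation omitted in this sketch
}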

Page 16: Content-Based Matching on GPUs


Results: Default Scenario
• Comparison against a state-of-the-art sequential implementation
  – SFF (Siena) 1.9.4
  – AMD CPU @ 2.8GHz
• Default scenario
  – Relatively "simple"
  – 10 interfaces, 25k filters, 1M constraints
• Analysis changing various parameters
• We measure latency
  – Processing time for a single event
[Plot: latency in the default scenario; roughly 7x speedup over SFF]

Page 17: Content-Based Matching on GPUs


Results: Number of Constraints
[Plot: latency while varying the number of constraints; speedup around 10x]

Page 18: Content-Based Matching on GPUs


Results: Number of Filters
[Plot: latency while varying the number of filters; speedup around 13x]

Page 19: Content-Based Matching on GPUs


Results
• What is the time needed to install subscriptions?
  – Need to serialize data structures
  – Need to copy from CPU memory to GPU memory
  – But the data structures are simple!
• Memory requirements?
  – 35MB in the default scenario
  – Up to 200MB in all our tests
  – Not a problem for a modern GPU

Page 20: Content-Based Matching on GPUs


Results
• We measured the latency when processing a single event
  – 0.14ms processing time, i.e. about 7000 events/s if events are processed one at a time (1 event / 0.14ms ≈ 7100 events/s)
• What about the maximum throughput?
  – Measured maximum throughput: 9400 events/s

Page 21: Content-Based Matching on GPUs


Conclusions
• Benefits of GPUs in a wide range of scenarios
  – In particular in the most challenging workloads
• Additional advantage
  – It leaves the CPU free to perform other tasks
    • E.g. communication related tasks
• Available for download
  – Includes a translator from Siena subscriptions / messages
  – More info at http://home.dei.polimi.it/margara

Page 22: Content-Based Matching on GPUs


Future Work
• We are currently working with multi-core CPUs
  – Using OpenMP
• We are currently testing our algorithm within a real system
  – Both GPUs and multi-core CPUs
  – Take into account communication overhead
  – Measure latency and throughput
• We plan to explore the advantages of GPUs with probabilistic (as opposed to exact) matching
  – Encoded filters (Bloom filters)
  – Balance between performance and percentage of false positives

Page 23: Content-Based Matching on GPUs


Questions?