
High-Throughput Subset Matching on Commodity GPU-Based Systems

Daniele Rogora∗, Michele Papalini$, Koorosh Khazaei∗, Alessandro Margara%, Antonio Carzaniga∗, Gianpaolo Cugola%

presented by Daniele Rogora

% Politecnico di Milano, Milano, Italy
∗ Università della Svizzera italiana, Lugano, Switzerland
$ Cisco Systems, Paris, France

EuroSys 2017

1 / 30

Subset Match

Useful in many scenarios

Social networks, Twitter

Data Center management

Service brokering

Cloud 3.0

2 / 30

Example

Subscribers   Tag Set
...
Daniele       {#football, #acmilan}
              {#politics, #Italy}
Antonio       {#politics, #USA}
              {#chomsky}
...

Incoming tag sets:
  #politics, #USA, #Italy
  #acmilan, #closing, #news, #football

3 / 30

Tagsets Representation

Representation of tagsets with Bloom filters
  a bitvector of size m
  k independent hash functions h1, ..., hk
  hi : Tags → {1, ..., m}

Example: (k = 2, m = 10)
  D = {politics, Italy, USA}
  h1 and h2 map each tag to positions in 1..10, and those positions are set to 1
  [Figure: the resulting 10-bit vector, with five bits set to 1 and five left at 0; the exact positions are not recoverable from this transcript.]

4 / 30
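The encoding described on this slide can be sketched as follows. This is a minimal illustration, not TagMatch's actual code: the name `bloom_encode` and the use of salted SHA-256 digests as the k hash functions are assumptions made for the example, and bit positions are 0-based here rather than 1-based as on the slide.

```python
import hashlib


def bloom_encode(tags, m=10, k=2):
    """Return an m-bit Bloom filter (as a Python int) for a set of tags.

    Assumption: the i-th hash function is SHA-256 of "i:tag" reduced
    modulo m. The slides do not say which hash functions are used.
    """
    bits = 0
    for tag in tags:
        for i in range(k):
            digest = hashlib.sha256(f"{i}:{tag}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:8], "big") % m  # 0-based
            bits |= 1 << position
    return bits


d = bloom_encode({"politics", "Italy", "USA"})
print(format(d, "010b"))
```

The property that matters for subset matching is that a subset of tags always yields a subset of bits, so `bloom_encode({"politics"}) & ~d` is always zero. The converse does not hold, which is where false positives come from.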

Example

Subscribers   Bit String
...
k1            1001101000
              0010010011
k2            1001000011
              0000101000
...

Incoming tag set #politics, #USA, #Italy, encoded as a bit string: 1011010011

5 / 30

Model

Tagset table

Bit String    Keys
1000100000    k2
1010000100    k4, k2
0110100000    k3
0011100010    k6, k2
0010101000    k5, k2
0000100100    k2

Query stream: 0110101100

Output
  match:        k2, k3, k5, k2
  match-unique: k2, k3, k5

The stream of filters is intense: 6k queries/s
The database is huge: 212M tag sets

6 / 30
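The two output modes on this slide can be reproduced with a short sketch. The table and query are copied from the slide; the function names `is_subset`, `match` and `match_unique` are chosen for this example, and the linear scan is only the naive baseline, not the system's algorithm.

```python
# Tagset table from the slide: (bit string, keys).
TAGSET_TABLE = [
    ("1000100000", ["k2"]),
    ("1010000100", ["k4", "k2"]),
    ("0110100000", ["k3"]),
    ("0011100010", ["k6", "k2"]),
    ("0010101000", ["k5", "k2"]),
    ("0000100100", ["k2"]),
]


def is_subset(s, q):
    """True if every bit set in s is also set in q."""
    return s & ~q == 0


def match(query, table):
    """Keys of all matching rows, duplicates included."""
    q = int(query, 2)
    keys = []
    for bits, row_keys in table:
        if is_subset(int(bits, 2), q):
            keys.extend(row_keys)
    return keys


def match_unique(query, table):
    """Keys of all matching rows, each reported once."""
    return sorted(set(match(query, table)))


print(match("0110101100", TAGSET_TABLE))
print(match_unique("0110101100", TAGSET_TABLE))
```

With the slide's query, rows 0110100000, 0010101000 and 0000100100 match, so `match` reports k2 twice while `match_unique` reports it once.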

A Complex Problem

Throughput (thousand queries per second) by database size:

system                              20M      40M     212M
MongoDB                               —        —        —
GPU-only, plain                    0.40     0.20     0.04
GPU-only, plain with batching     11.50     6.30     1.20
CPU-only, fast prefix tree        21.10    14.00     4.30
CPU-only, state-of-the-art ICN    27.60    17.40        —
CPU-only, TagMatch                 3.90     3.40     0.68
TagMatch                         268.80   144.40    35.30

Rivest, 1976

7 / 30

TagMatch

8 / 30

First Approach: using GPUs

Kernel: a grid of thread blocks (Block 0, Block 1, ..., Block n)
tagset table: s0, s1, s2, ..., sn−2, sn−1, sn (thread i handles si)

One query q per kernel launch:

    thread i
      if (si ⊆ q)
        results.add(q)

A batch of queries q0 q1 q2 q3 q4 ... q255 per kernel launch:

    thread i
      for (q ∈ q0 ... q255)
        if (si ⊆ q)
          results.add(q)

CPU: launch kernel
CPU: merge matches (results) with keys (key table)

This is not fast enough (throughput in thousand queries per second, from the table above):

system                              20M      40M     212M
GPU-only, plain                    0.40     0.20     0.04
GPU-only, plain with batching     11.50     6.30     1.20
TagMatch                         268.80   144.40    35.30

9 / 30

Partitioning

lots of filters share many bits...
we could filter out many filters efficiently and quickly...

Bit String    Keys
1000100000    k2
1010100100    k4, k2
0110100000    k3
0011000010    k6, k2
0011101000    k5, k2
0001100100    k2

Query: 0001011100

Five of the six filters above have the fifth bit set, while the query does not, so a single check on that bit rules out all five.

and we can do that efficiently on the CPU, while preserving batches

10 / 30

Model

Input queries (stream): {@POTUS, energy, policy}, {@Chomsky, education}, {@ggreenwald, NSA}⋆, ...
(⋆ marks a "unique" query)

Bloom-filter encoding:
  q1  = 010101···11
  q2  = 011111···01
  q3⋆ = 001110···11
  ...

1. Pre-process (CPU), using the partition table:
     0    none
     1    010001···01 → P1
     2    001100···00 → P2
          001010···11 → P3
          001011···01 → P4
     3    000101···10 → P5
     ...
     191  ...
   Output, one batch per partition:
     batch1 (P1): ..., q2
     batch2 (P2): ..., q2, q3
     batch3 (P3): ..., q1, q3
     ...

2. Subset match (GPU), using the tagset table:
     P1:  011011···01 ↔ 1
          010101···11 ↔ 2
          010101···01 ↔ 3
          ...
     P2:  001101···10 ↔ 62
          001101···01 ↔ 63
          001100···11 ↔ 64
          ...
     ...
   Output, one result set per batch:
     results1: q2,1, q2,3, ...
     results2: q2,63, q3,71, ...
     results3: q1,324, q3,99, ...
     ...

3. Key lookup/reduce (CPU), using the key table:
     1  → k1, k2
     3  → k2, k6, k8
     ...
     63 → k5, k8, k13
     ...

4. Merge (CPU), producing the results (stream):
     q1  → k3, k13, ...
     q2  → k1, k2, k2, k6, k8, k5, k8, k13, ...
     q3⋆ → k9, k3, k37, k3, k7, ...
     ...

11 / 30

Partitioning

12 / 30

Partitioning

Max size: 3

Start: a single partition holding all bit strings

P   Bit String
0   1000100000
    1010000100
    0110100000
    0011100010
    0010101000
    0001101101
    0000110100
    0000110001
    0000010110
    0000001110

After the first split:

P   Bit String
0   1010000100
    0001101101
    0000110100
    0000010110
    0000001110
1   1000100000
    0110100000
    0011100010
    0010101000
    0000110001

After the second split (no partition exceeds the max size), with the mask of each partition:

P   Mask         Bit String
0   0000100100   0001101101
                 0000110100
1   0000000100   1010000100
                 0000010110
                 0000001110
2   0010100000   0110100000
                 0011100010
                 0010101000
3   0000100000   1000100000
                 0000110001

13 / 30
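The splitting shown above can be reproduced with a small recursive sketch. Two points are assumptions inferred from the example rather than stated on the slide: the pivot is the bit position whose number of ones is closest to half of the partition (lowest position on ties), and the mask is the set of bits common to all members of a partition. The names `split` and `mask` are chosen for this example.

```python
def split(strings, max_size, width):
    """Recursively split a list of bit strings into partitions of at most max_size."""
    if len(strings) <= max_size:
        return [strings]
    best_pos, best_dist = None, None
    for pos in range(width):
        ones = sum(1 for s in strings if s[pos] == "1")
        if ones == 0 or ones == len(strings):
            continue  # this bit does not separate anything
        dist = abs(2 * ones - len(strings))  # distance from an even split
        if best_dist is None or dist < best_dist:
            best_pos, best_dist = pos, dist
    if best_pos is None:
        return [strings]  # all strings identical: cannot split further
    with_bit = [s for s in strings if s[best_pos] == "1"]
    without_bit = [s for s in strings if s[best_pos] == "0"]
    return split(with_bit, max_size, width) + split(without_bit, max_size, width)


def mask(strings, width):
    """Bits set in every string of the partition (bitwise AND)."""
    common = (1 << width) - 1
    for s in strings:
        common &= int(s, 2)
    return format(common, f"0{width}b")


TABLE = ["1000100000", "1010000100", "0110100000", "0011100010", "0010101000",
         "0001101101", "0000110100", "0000110001", "0000010110", "0000001110"]

for number, part in enumerate(split(TABLE, max_size=3, width=10)):
    print(number, mask(part, 10), part)
```

On the slide's ten bit strings this yields the same four partitions and masks as the final table. A query only needs to be sent to a partition whose mask is a subset of the query, because the mask is contained in every member of that partition.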

Pre-process

14 / 30

Pre Process

front end: a thread pool looks up each query in the partition table

1st bit   Mask
...
2         0010100000 → P2
4         0000100100 → P0
          0000100000 → P3
7         0000000100 → P1
...

partition queues: P0, P1, P2, P3, ..., Pn
GPU handlers, GPU scheduler

[Animation: queries q0, q1, q2 arrive one at a time and are appended to the partition queues. A queue that fills up (here with q0 q1 q2) is flushed to the GPU. Queries still waiting in the other queues (here q1 and q2) are flushed when the timeout expires.]

15 / 30
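A possible reading of the front end is sketched below. The masks are the ones on the slide; everything else is an assumption made for illustration: the way the "1st bit" index is consulted, the names `route` and `PartitionQueues`, the per-queue timer, and the list `flushed` that stands in for the GPU handlers. Masks are assumed to be non-empty, and times are integer milliseconds.

```python
from collections import defaultdict

WIDTH = 10  # bits per filter in the slide's toy example

# Partition masks copied from the slide.
MASKS = {"P0": 0b0000100100, "P1": 0b0000000100,
         "P2": 0b0010100000, "P3": 0b0000100000}


def first_bit(mask):
    """0-based position of the leftmost 1 of a non-empty mask."""
    return WIDTH - mask.bit_length()


# Partition table indexed by the first set bit of each mask.
INDEX = defaultdict(list)
for name, mask in MASKS.items():
    INDEX[first_bit(mask)].append((mask, name))


def route(query):
    """Partitions whose mask is a subset of the query."""
    targets = []
    for pos in range(WIDTH):
        if (query >> (WIDTH - 1 - pos)) & 1:  # only set bits can start a mask
            for mask, name in INDEX.get(pos, []):
                if mask & ~query == 0:
                    targets.append(name)
    return targets


class PartitionQueues:
    def __init__(self, batch_size, timeout_ms):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.queues = {}   # partition -> queued queries
        self.oldest = {}   # partition -> arrival time of its oldest query
        self.flushed = []  # (partition, batch) pairs handed to the GPU

    def submit(self, query, now_ms):
        for name in route(query):
            queue = self.queues.setdefault(name, [])
            if not queue:
                self.oldest[name] = now_ms
            queue.append(query)
            if len(queue) >= self.batch_size:
                self._flush(name)  # batch is full

    def tick(self, now_ms):
        for name in list(self.oldest):
            if now_ms - self.oldest[name] >= self.timeout_ms:
                self._flush(name)  # timeout expired

    def _flush(self, name):
        self.flushed.append((name, self.queues.pop(name)))
        del self.oldest[name]
```

For example, the query 0001011100 is routed only to P1, while 0110101100 contains all four masks and is queued for every partition. The timeout bounds how long a query can wait in a queue that fills slowly.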

Optimization

16 / 30

GPU Optimization

Kernel input: a batch of queries q0 q1 q2 q3 q4 ... q255
tagset table: s0, s1, s2, ..., sn, spread over Block 0, Block 1, ..., Block n

Each block holds a sorted run of bit strings, one per thread:

Block 0                 Block 1
t0   | 1110110110       t0   | 0110010110
t1   | 1110110000       t1   | 0110001110
t2   | 1110100000       t2   | 0101101011
...  | ...              ...  | ...
t255 | 1110010100       t255 | 0011101101

Phase 1: Thread 0 computes the common prefix of the block; the other threads are idle

    first = 1110110110
    last = 1110010100
    first ⊕ last = 0000100010
    common prefix = 1110000000

Phase 2: thread i tests one query of the batch against the common prefix (prefix ⊆ qi?)

    the queries that pass are collected in Q = q1 q3 q21 q0 q200 q177

Phase 3: every thread matches its own bit string f against the queries in Q only

    for (qi ∈ Q)
      if (f ⊆ qi)
        results.add(qi)

17 / 30
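The three phases can be imitated sequentially to check the arithmetic. This is a plain Python sketch of the logic, not a GPU kernel: the names `common_prefix` and `block_match` are invented for the example, and the loops stand in for threads that would run in parallel. It relies on the block being sorted, so that every bit string between the first and the last shares their common prefix.

```python
def common_prefix(first, last, width):
    """Leading bits shared by first and last, with the remaining bits cleared."""
    diff = first ^ last
    if diff == 0:
        return first
    keep_low = diff.bit_length()  # bits from the first difference downwards
    high_mask = ((1 << width) - 1) & ~((1 << keep_low) - 1)
    return first & high_mask


def block_match(block, batch, width):
    """Return (query index, thread index) pairs for one sorted block."""
    # Phase 1: one thread computes the prefix from the first and last entries.
    prefix = common_prefix(block[0], block[-1], width)
    # Phase 2: keep only the queries that contain the whole prefix.
    selected = [qi for qi, q in enumerate(batch) if prefix & ~q == 0]
    # Phase 3: each thread tests its own bit string against the survivors.
    results = []
    for ti, f in enumerate(block):
        for qi in selected:
            if f & ~batch[qi] == 0:
                results.append((qi, ti))
    return results


block = [0b1110110110, 0b1110110000, 0b1110100000, 0b1110010100]
batch = [0b1111110111, 0b0110110110, 0b1110100001]
print(format(common_prefix(block[0], block[-1], 10), "010b"))
print(block_match(block, batch, 10))
```

Since the prefix is contained in every bit string of the block, a query that fails the Phase 2 test cannot match anything in the block, so discarding it never loses a result.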

Workflow Optimization

Result buffers on the GPU and on the CPU each have a Size field and a Data field (for example Size = 3, Data = q7, q21, q1).

Baseline: two copies and two synchronizations per kernel

    run kernel → copy res size → sync → copy res data → sync → process res

First optimization: one copy and one synchronization per kernel

    run kernel → copy all res → sync → process res

Second optimization: two Size/Data buffers on the GPU and two on the CPU, so that consecutive kernels overlap

    run kernel → copy res → sync → process res
                             run kernel → copy res → sync → process res

[Animation: successive kernels fill the buffers with results such as q7, q21, q1 (3 entries), q207, q17 (2 entries) and q87, q12, q1, q5 (4 entries).]

18 / 30

Evaluation

1 single machine
24 (48) physical (virtual) CPU cores
2 Nvidia Titan X

19 / 30

Scalability

Does it scale with bigger databases?

[Plot: throughput (thousand queries/s, logarithmic axis from 1 to 100) against database size (% of the full Twitter database, 20 to 100). Series: TagMatch, match; TagMatch, match-unique; prefix tree, match; prefix tree, match-unique. The full database is labeled "Twitter".]

20 / 30

Threads

Does it scale with bigger machines?

[Plot: throughput (thousand queries/s, 0 to 50) against number of threads (8 to 48). Series: TagMatch, match; TagMatch, match-unique; prefix tree, match; prefix tree, match-unique. Annotation: "GPU limit!"]

21 / 30

Latency

Does batching kill latency?

[Plot: latency (s, 0 to 4) against timeout (ms: 200, 400, 600, 800, no limit). Shown: 1%, 25%, median, 75%, 99% and maximum.]

22 / 30

Memory usage

How much memory does it need?

[Plot: memory usage (GB, 5 to 30) against database size (% of the full Twitter database, 0 to 100). Series: GPU, I/O buffers; GPU, tagset table; Host.]

23 / 30

Conclusion

subset matching
  - computationally complex
  - highly parallelizable

TagMatch
  - implements an efficient CPU/GPU pipeline

https://github.com/carzaniga/TagMatch

24 / 30

Partition size

[Plot: throughput (thousand queries/s, 0 to 40) against MAXP, the maximum size of partitions (thousands, 0 to 900). Series: match; match-unique.]

26 / 30

MongoDB

[Plot: throughput (queries/s, logarithmic axis from 10^-1 to 10^6) against number of tags per query (4 to 10). Series: TagMatch 1M, TagMatch 3M, TagMatch 5M; MongoDB 1M, MongoDB 3M, MongoDB 5M.]

27 / 30

Partitioning time

[Plot: time (s, 0 to 50) against database size (% of the full Twitter database, 10 to 100). Series: balanced partitioning.]

28 / 30

More tags

[Plot 1: throughput (thousand queries/s, logarithmic axis from 0.1 to 1000) against number of additional tags per query (0 to 9). Series: TagMatch; prefix tree.]

[Plot 2: output throughput (thousand keys/s, logarithmic axis from 100 to 100000) against number of additional tags per query (0 to 9). Series: TagMatch; prefix tree.]

29 / 30

Descriptors Representation

Representation of tagsets with Bloom filters
  a bitvector of size m
  k independent hash functions h1, ..., hk
  hi : Tags → {1, ..., m}

Example: (k = 2, m = 10), D = {politics, Italy, USA}
  [Figure: the resulting 10-bit vector, with five bits set to 1.]

Concretely, in our implementation: m = 192, k = 7

False positives: testing S1 ⊆ S2 with Bloom filters gives a false positive with probability

    \left(1 - e^{-k|S_2|/m}\right)^{k\,|S_1 \setminus S_2|}

For example, when |S2| = 10 and |S1 \ S2| = 3, we have a false positive with probability 10^-11

30 / 30
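The numeric example on this slide can be checked directly. The sketch below evaluates the expression (1 - e^(-k|S2|/m))^(k|S1 \ S2|) with the parameters given on the slide; the function name `false_positive_probability` is chosen for this example.

```python
import math


def false_positive_probability(m, k, size_s2, size_s1_minus_s2):
    """Probability that all k * |S1 \\ S2| extra bits are already set in S2's filter."""
    bit_is_set = 1.0 - math.exp(-k * size_s2 / m)  # chance that a given bit is 1
    return bit_is_set ** (k * size_s1_minus_s2)


p = false_positive_probability(m=192, k=7, size_s2=10, size_s1_minus_s2=3)
print(f"{p:.2e}")
```

With m = 192 and k = 7, a single bit of the filter for S2 is set with probability of about 0.31, and a false positive requires 21 such bits to be set at once, which gives a value on the order of 10^-11.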
