gpu-acceleration of in-memory data analytics 2017/active17-sitaridi.pdf · design system by...

GPU-Acceleration of In-Memory Data Analytics Evangelia Sitaridi AWS Redshift


GPU-Acceleration of In-Memory Data Analytics

Evangelia Sitaridi, AWS Redshift

GPUs for Telcos
• Fast query time
• Quickly identify network problems
• Respond fast to customers

• Geospatial visualization
• Take advantage of GPU visualization capabilities

No time to index data

[Figure: SMS hub traffic]

*Picture taken from: http://www.vizualytics.com/Solutions/Telecom/Telecom.html

GPUs for Social Media Analytics

Search terms: debate
Match regexp: /\B#\w*[a-zA-Z]+\w*/
Filter location
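The hashtag filter can be tried out with a minimal Python sketch. The pattern is the one on the slide; the helper names `find_hashtags` and `mentions` and the sample text are illustrative assumptions, not part of the talk.

```python
import re

# Hashtag pattern from the slide, written as a Python raw string.
HASHTAG = re.compile(r"\B#\w*[a-zA-Z]+\w*")

def find_hashtags(text):
    """Return every hashtag in text that contains at least one ASCII letter."""
    return HASHTAG.findall(text)

def mentions(text, term):
    """True if some hashtag in text equals '#' + term, ignoring case."""
    wanted = "#" + term.lower()
    return any(tag.lower() == wanted for tag in find_hashtags(text))

sample = "Watch the #debate tonight #2016 a#b"
print(find_hashtags(sample))
print(mentions(sample, "debate"))
```

The leading `\B` rejects a `#` that directly follows a word character, and `[a-zA-Z]+` rejects purely numeric tags.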

Challenges for GPU Databases
• Special threading model → Increased programming complexity
• Which algorithms are more efficient on GPUs?
• How much do multiple code paths increase the cost of code maintenance?

• Special memory architecture
• How to adapt the data layout?

• Limited memory capacity
• Data transfer cost between CPUs and GPUs
a) Through the PCIe link to the GPU
b) From the storage system to the GPU

• Fair comparison against software-based solutions


Outline
o CPU vs GPU introduction
o Accelerated wildcard string search
  o Insight: Change the layout of the strings in the GPU main memory
  o 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
o Gompresso: Massively parallel decompression
  o Insight: Trade off compression ratio for increased parallelism
  o 2X speed-up & 1.2X energy savings against multi-core state-of-the-art CPU libraries
o GPUs on the cloud

CPU-GPU Analogies

CPU                         GPU
CPU thread                  GPU warp
RAM                         Global memory
Tens of threads             Thousands of threads
Hundreds of GBs capacity    Few tens of GBs
Goal: Low latency           Goal: High throughput (overlapping different instructions)

GPU Architecture

[Figure, built up over several slides: a CUDA kernel launches many GPU threads, which are grouped into warps (warp 1 … warp n). The warp is the unit of execution. The streaming multiprocessors (SM1, SM2, … SM15) are drawn with warp schedulers, a register file, and shared memory, on top of global memory. One warp is traced through the branch "if (condition) a++; else b++; endif", from "Branch" to "Branch complete".]

K40: 15 streaming multiprocessors
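The branch trace in the figure can be imitated with a small host-side simulation. This is a Python sketch under simplifying assumptions, not GPU code: the function `run_warp` is hypothetical, and it models only the common behaviour that threads of one warp taking different branch paths are executed one path at a time, with the other threads masked off.

```python
def run_warp(conditions, a, b):
    """Simulate one warp executing `if (cond) a++; else b++;` in lockstep.

    Returns the updated per-thread a and b lists and the number of
    serialized passes the warp needed (one per branch path actually taken).
    """
    a, b = list(a), list(b)
    passes = 0
    for path_taken in (True, False):
        # Active-thread mask for this branch path.
        mask = [bool(c) == path_taken for c in conditions]
        if not any(mask):
            continue  # no thread takes this path, so it costs nothing
        passes += 1
        for t, active in enumerate(mask):
            if active:
                if path_taken:
                    a[t] += 1
                else:
                    b[t] += 1
    return a, b, passes

# Divergent warp: even threads take the if-path, odd threads the else-path.
divergent = [t % 2 == 0 for t in range(32)]
_, _, passes = run_warp(divergent, [0] * 32, [0] * 32)
print(passes)
```

A warp whose threads all agree on the condition needs a single pass, which is why data-dependent branching inside a warp is worth avoiding.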


Text Query Applications

GENOMIC DATA
Input: ACGTACCTGATCGTAGGATCCCAAGTACATCATTTC
Search pattern: ACC

DATABASE COLUMNS
Id   Address
3    “9 Front St, Washington DC, 20001”
8    “3338 A Lockport Pl #6, Margate City, NJ, 8402”
9    “18 3rd Ave, New York, NY, 10016”
15   “88 Sw 28th Ter, Harrison, NJ, 7029”
16   “87895 Concord Rd, La Mesa, CA, 91142”
Search pattern: “*3rd Ave*New York*”

Wildcard searches: Q2, 9, 13, 14, 16, 20 of TPC-H contain expensive LIKE predicates

Wildcard Search Challenges
• Approaches that simplify search cannot be applied
• String indexes, e.g. suffix trees
  • For the query '%customer%complaints', multiple queries need to be issued
  • '%customer%' AND '%complaints%'
  • Confirm results

• Dictionary compression
  • Wildcard searches are not simplified by using dictionaries
  • String data needs to be decompressed
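The "confirm results" step can be seen in a short Python sketch. The functions `like_match` and `both_substrings` are hypothetical helpers written for illustration; `like_match` supports only the `%` wildcard (not `_`), and the sample strings are made up.

```python
def like_match(text, pattern):
    """Match text against a SQL LIKE pattern that uses only '%' wildcards."""
    parts = pattern.split("%")
    if len(parts) == 1:
        return text == pattern
    first, *middle, last = parts
    # The first and last segments are anchored to the ends of the text.
    if not text.startswith(first):
        return False
    pos = len(first)
    end = len(text) - len(last)
    if end < pos or not text.endswith(last):
        return False
    # Middle segments must appear in order; taking the leftmost hit is safe.
    for seg in middle:
        idx = text.find(seg, pos, end)
        if idx < 0:
            return False
        pos = idx + len(seg)
    return True

def both_substrings(text, words):
    """What an index could answer: every word occurs somewhere, in any order."""
    return all(w in text for w in words)

row = "Complaints forwarded by Customer"
print(both_substrings(row, ["Customer", "Complaints"]))
print(like_match(row, "%Customer%Complaints%"))
```

The row passes the two independent substring filters but fails the ordered pattern, so index lookups alone produce candidates that still have to be verified against the full string.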

Background: How to Search Text Fast?

Knuth-Morris-Pratt Algorithm

Input:   ACACATACCTACTTTACGTACGT
Pattern: ACACACG

Advance to the next character:
a) If the input matches the pattern
b) While there is a mismatch, shift to the left in the pattern
Stop when the beginning of the pattern has been reached

Step 6 (i=5, j=5): Character mismatch
Step 7 (i=5, j=1): Shift pattern

Shift pattern table: -1 0 0 1 2 3 4
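As a reference point, here is a minimal Python sketch of textbook Knuth-Morris-Pratt. It is a sequential CPU version, not the GPU kernel from the talk, and the function names are illustrative. For the pattern ACACACG it reproduces the shift table shown on the slide.

```python
def shift_table(pattern):
    """table[j] is the pattern position to resume from after a mismatch at
    pattern[j]; -1 means "give up on this input character and advance"."""
    table = [-1] * len(pattern)
    k = -1
    for j in range(1, len(pattern)):
        # Extend the longest border of pattern[:j-1] by pattern[j-1].
        while k >= 0 and pattern[k] != pattern[j - 1]:
            k = table[k]
        k += 1
        table[j] = k
    return table

def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = shift_table(pattern)
    i = j = 0
    while i < len(text):
        # While there is a mismatch, shift left in the pattern.
        while j >= 0 and text[i] != pattern[j]:
            j = table[j]
        # Match (or pattern start reached): advance to the next character.
        i += 1
        j += 1
        if j == len(pattern):
            return i - j
    return -1

print(shift_table("ACACACG"))
print(kmp_search("ACACATACCTACTTTACGTACGT", "ACACACG"))
```

The inner `while` loop is the data-dependent part: its trip count differs from input position to input position, which matters once many threads run the loop in lockstep.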

GPU Limiting Factor: Cache Pressure

Tesla K40 architecture:
• Warp size: 32
• Stream multiprocessors: 15
• Warps in each SM: 64

Threads matching different strings:
• Cache footprint: 30720 cache lines
• L2 capacity: 12288 cache lines

Smaller cache size per thread than CPUs: Need improved locality!
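The footprint figure follows from the three K40 parameters if every resident thread touches its own cache line. The Python arithmetic below spells that out; the one-line-per-thread reading is an assumption about how the slide arrived at 30720, and the 128-byte line size is the value quoted on the pivoting slide.

```python
WARP_SIZE = 32          # threads per warp
NUM_SMS = 15            # stream multiprocessors on the K40
WARPS_PER_SM = 64       # resident warps per SM
L2_LINES = 12288        # L2 capacity in cache lines
CACHE_LINE_BYTES = 128  # line size quoted elsewhere in the deck

resident_threads = WARP_SIZE * NUM_SMS * WARPS_PER_SM
# Assumption: each thread matches a different string, so it needs its own line.
footprint_lines = resident_threads
l2_bytes = L2_LINES * CACHE_LINE_BYTES
oversubscription = footprint_lines / L2_LINES

print(footprint_lines, l2_bytes, oversubscription)
```

Under these assumptions the working set is 2.5 times the L2 capacity, so lines are evicted before they are fully consumed unless neighbouring threads share them.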

Adapt Memory Layout: Pivoting Strings

Baseline (contiguous) layout:
CTAACCGAGTAAAGAACGTAAACTCATTCGACTAAACCGAGTAAAGA…

Pivoted layout (String 1, String 2, String 3 are matched by threads T0, T1, T2):
CTAAACGTCTAA…
CCGAAAACACCG…
GTAATCATAGTA…
AAGATCGAAAGA…

– Split strings into equally sized pieces
– Interleave pieces in memory → Improve locality

Initially: Each warp loads a cache line (128 bytes)

Partial solution: Threads might progress at different rates
• In the presence of partial matches, some threads might fall “behind”
• Memory divergence!
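The pivoting step itself is simple to express. The Python sketch below is an illustration with hypothetical function names (`pivot`, `piece_offset`); it uses 4-character pieces and three 16-character strings read off the pivoted layout on the slide.

```python
def pivot(strings, piece):
    """Interleave equal-length strings piece by piece: piece 0 of every
    string, then piece 1 of every string, and so on."""
    length = len(strings[0])
    if any(len(s) != length for s in strings) or length % piece:
        raise ValueError("strings must share a length divisible by piece")
    out = []
    for start in range(0, length, piece):
        for s in strings:
            out.append(s[start:start + piece])
    return "".join(out)

def piece_offset(string_id, piece_id, num_strings, piece):
    """Offset in the pivoted buffer of a given piece of a given string."""
    return (piece_id * num_strings + string_id) * piece

strings = ["CTAACCGAGTAAAAGA", "ACGTAAACTCATTCGA", "CTAAACCGAGTAAAGA"]
pivoted = pivot(strings, 4)
print(pivoted[:12])
print(pivoted[piece_offset(1, 2, 3, 4):][:4])
```

Threads working on the same piece index read adjacent addresses, so one cache line serves many threads as long as they stay in step.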

Transform Control Flow of KMP

Knuth-Morris-Pratt Algorithm
Input:   ACACATACCTACTTTACGTACGT
Pattern: ACACACG
Shift pattern table: -1 0 0 1 2 3 4

Step 6 (i=5, j=5): Character mismatch → Shift pattern
While loop:
  i=5, j=3: Mismatch → Shift pattern
  i=5, j=1: Mismatch → Shift pattern
Step 7 (i=6, j=0)

KMP Hybrid: Advance input in pivoted piece size

GPU vs. CPU Comparison

select s_suppkey
from supplier
where s_comment like '%Customer%Complaints%'

– Performance metrics
  – Price ($)
  – Performance (GB/s)
  – Performance per $
  – Estimated energy consumption

– Evaluate three systems
  – CPU-only system
  – GPU-only system
  – CPU+GPU combined system

GPU vs. CPU Comparison

CPU: Dual-socket E5-2620 – Band. 102.4 GB/s
GPU: Tesla K40 – Band. 288 GB/s

                      GPU     CPU (Boost BM)   CPU (CMPISTRI)   CPU+GPU
Price ($)             3100    952              952              4052
Performance (GB/s)    98.7    40.75            43.1             138.7
Energy consumed (J)   1.27    2.49             2.35             1.78
Performance/$         31.89   42.8             45.28            34.25

Best value per row: price CPU (952), performance CPU+GPU (138.7), energy GPU (1.27), performance/$ CPU with CMPISTRI (45.28)

Design system by choosing the desired trade-offs
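The Performance/$ row can be approximately reproduced from the price and throughput rows. The Python sketch below assumes the unit is MB/s per dollar, which the slide does not state; small differences from the published values presumably come from rounding of the throughput figures.

```python
# (price in $, performance in GB/s) as listed on the slide.
systems = {
    "GPU": (3100, 98.7),
    "CPU (Boost BM)": (952, 40.75),
    "CPU (CMPISTRI)": (952, 43.1),
    "CPU+GPU": (4052, 138.7),
}

def perf_per_dollar(price, gb_per_s):
    """Throughput per dollar, in MB/s per $ (assumed unit)."""
    return gb_per_s * 1000 / price

ratios = {name: perf_per_dollar(*vals) for name, vals in systems.items()}
best = max(ratios, key=ratios.get)

for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
print("best performance per dollar:", best)
```

The same table read row by row gives a different winner for each metric, which is the point of the closing line about choosing trade-offs.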


Example: Why Use Compression?

[Figure: databases (query engine, database), cloud warehouses, and data lakes, with data stored in Amazon S3]

A) Reduce basic S3 costs
B) Reduce query costs

Decompression speed is more important than compression speed

Background: LZ77 Compression

Input characters (positions 0 1 2 3 …):
ATTACTAGAATGTTACTAATCTGATCGGGCCGGGCCTG

The input is divided into a sliding window buffer and the unencoded lookahead characters. The compressor finds the longest match for the lookahead in the window.

Output:
ATTACTAGAATGT(2,5)…

• Back-references: (Position, Length)
• Literals: Unmatched characters

Background: LZ77 Decompression

Input data block (tokens): (0,4)M(5,4)COMM…
Window buffer contents: WIKIPEDIA.CO
Output data block: WIKIMEDIACOMM
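Both LZ77 examples can be checked with a small sequential decoder. This Python sketch assumes, as the slide examples suggest, that a back-reference is an absolute (position, length) pair into the window buffer followed by the output written so far; the token representation (strings for literals, tuples for back-references) is an illustrative choice.

```python
def lz77_decode(tokens, window=""):
    """Decode a token list. A str token is a run of literals; a tuple token
    (position, length) copies from the window plus the output so far."""
    buf = list(window)
    for tok in tokens:
        if isinstance(tok, tuple):
            pos, length = tok
            # Copy one character at a time so a reference may overlap
            # the region it is currently writing.
            for k in range(length):
                buf.append(buf[pos + k])
        else:
            buf.extend(tok)
    return "".join(buf[len(window):])

print(lz77_decode([(0, 4), "M", (5, 4), "COMM"], window="WIKIPEDIA.CO"))
print(lz77_decode(["ATTACTAGAATGT", (2, 5)]))
```

The loop is inherently sequential: each back-reference may read bytes produced by the token just before it, which is the obstacle the GPU design has to work around.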

How to Parallelize Decompression?

Split the input file into independent blocks:
• Thread 1: Data block 1
• Thread 2: Data block 2

>1000 threads available!

Naïve approach performance: 200 MB/s << 250 GB/s (K20x)
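The block-splitting idea can be demonstrated on a CPU with Python's standard `zlib` module and a thread pool. This is only an analogy for the naive scheme, not the GPU implementation: the block size, worker count, and function names are arbitrary choices for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks(data, block_size):
    """Compress each fixed-size block on its own, so that no block
    refers back into another block."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_blocks(blocks, workers=4):
    """Decompress the blocks concurrently and concatenate them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))

data = b"ACGT" * 10000 + b"TTGACA" * 500
blocks = compress_blocks(data, 4096)
assert decompress_blocks(blocks) == data
print(len(blocks), "independent blocks")
```

Independent blocks cost some compression ratio, because matches cannot cross block boundaries, and each block is still decoded by a single sequential thread.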

GPU LZ77 Decompression

Compressed input (tokens): CCGA(0,2)CGG(4,3)AGTT(12,4)
Uncompressed output: CCGACCCGGCCCAGTTCCGA

Improve utilization: Group each string of literals with the following back-reference

1) Read tokens (parallel)
2) Write literals (parallel prefix sum): CCGACGGAGTT
3.1) Compute uncompressed output
3.2) Write uncompressed output

Problem: Back-references processed in parallel might be dependent!
→ Use the voting function __ballot to detect dependencies
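Step 2 relies on each token group knowing where its output starts before any group has been decoded. The Python sketch below computes those start offsets with an exclusive prefix sum over the group sizes, using the tokens from the slide. It runs sequentially and uses hypothetical names; on a GPU the prefix sum and the literal writes would be carried out by the threads of a warp in parallel.

```python
from itertools import accumulate

def group_offsets(groups):
    """groups is a list of (literals, (position, length)) pairs. Returns the
    output start offset of every group and the total output size."""
    sizes = [len(lit) + ref[1] for lit, ref in groups]
    # Exclusive prefix sum: running total minus the group's own size.
    starts = [total - size for total, size in zip(accumulate(sizes), sizes)]
    return starts, sum(sizes)

def write_literals(groups):
    """Place every literal run at its final position. Slots that belong to
    back-references stay None until a later step fills them."""
    starts, total = group_offsets(groups)
    out = [None] * total
    for (lit, _), start in zip(groups, starts):  # independent per group
        out[start:start + len(lit)] = lit
    return out, starts

groups = [("CCGA", (0, 2)), ("CGG", (4, 3)), ("AGTT", (12, 4))]
out, starts = write_literals(groups)
print(starts)
print("".join(c if c else "." for c in out))
```

Once the offsets are known, the literal writes do not depend on one another; only the back-reference copies can.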

How to Handle Thread Dependencies?

MRR
1) Write literals (parallel)
2) While (!all back-references written)
   a) Check dependencies satisfied (parallel)
   b) Copy back-references w/o pending dependencies

Uncompressed input: …CCGACGTTCCGT…
Tokens: CCGA(0,3)T(4,4)
Second loop: Dependencies satisfied

DE
A) Compression: Only search for matches w/o dependencies
B) Decompression: Copy back-references (fully parallel)
Tokens: CCGA(0,3)T(0,3)T

Bandwidth: 7 GB/s
Bandwidth: 16 GB/s
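The round-based loop can be simulated sequentially to see how many rounds each token sequence needs. This Python sketch makes several assumptions: back-references are absolute (position, length) pairs, a reference is "ready" once all of its source bytes are written, references never overlap their own destination, and a trailing literal is modelled as a group with a zero-length reference. The function name is hypothetical.

```python
def decode_in_rounds(groups):
    """groups is a list of (literals, (position, length)) pairs.
    Returns the decoded string and the number of copy rounds used."""
    sizes = [len(lit) + ref[1] for lit, ref in groups]
    out = [None] * sum(sizes)
    pending, start = [], 0
    # 1) Write literals; remember where each back-reference must land.
    for (lit, (pos, length)), size in zip(groups, sizes):
        out[start:start + len(lit)] = lit
        if length > 0:
            pending.append((start + len(lit), pos, length))
        start += size
    rounds = 0
    # 2) Repeat until every back-reference has been written.
    while pending:
        # a) Check which references have all of their source bytes available.
        ready = [r for r in pending
                 if all(out[r[1] + k] is not None for k in range(r[2]))]
        if not ready:
            raise ValueError("self-overlapping reference is not supported")
        # b) Copy the references without pending dependencies.
        for dest, pos, length in ready:
            out[dest:dest + length] = out[pos:pos + length]
        pending = [r for r in pending if r not in ready]
        rounds += 1
    return "".join(out), rounds

nested = [("CCGA", (0, 3)), ("T", (4, 4))]               # CCGA(0,3)T(4,4)
flat = [("CCGA", (0, 3)), ("T", (0, 3)), ("T", (0, 0))]  # CCGA(0,3)T(0,3)T
print(decode_in_rounds(nested))
print(decode_in_rounds(flat))
```

Under these assumptions both sequences decode to the same text, but the first needs two rounds because (4,4) reads bytes produced by (0,3), while the second resolves in one round because both references point into literals.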

Decompression Skyline

[Figure: decompression skyline comparing byte-level encoding and bit-level encoding]

CPU: Dual-socket E5-2620 – Band. 102.4 GB/s
GPU: Tesla K40 – Band. 288 GB/s

GPUs on the Cloud
• Cloud offerings
  • AWS
  • Google Cloud
  • Microsoft Azure
  • IBM Softlayer
  • Nimbix

• Opportunity
  • Evaluate the usefulness of GPUs/FPGAs without the high investment

• Special considerations
  • Charging model
  • Scaling capabilities
  • Software licensing

Summary
o Accelerated wildcard string search
  o Insight: Change the layout of the strings in the GPU main memory
  o 3X speed-up & 2X energy savings against parallel state-of-the-art CPU libraries
o Gompresso: Massively parallel decompression
  o Insight: Trade off compression ratio for increased parallelism
  o 2X speed-up & 1.2X energy savings against multi-core state-of-the-art CPU libraries
o GPUs on the cloud: Open questions