unlocking gpu potential with jit - hpts · –dbms c: cpu-based, vector-at-a-time, simd, based on...
Post on 21-Oct-2020
13 Views
Preview:
TRANSCRIPT
-
Unlocking GPU potential with JITAnastasia Ailamaki
with Periklis Chrysogelos and Panos Sioulas
-
One hardware does not fit all
2Rethink query engines for accelerator-level parallelism
single-coresingle-CPU
multi-coresingle-CPU
multi-coremulti-CPUmulti-GPU
multi-coremulti-CPU
Time
-
0
1
2
> GPU memory < GPU memory
Exec
utio
n tim
e(v
s com
mer
cial
CPU
DBM
S)
Data Size
CPU CommercialHybrid Commercial (Projected)GPU CommercialCPU PrototypeHybrid Prototype (Projected)GPU Prototype
0
1
2
3
4
5
6
> GPU memory < GPU memory
Thro
ughp
ut(v
s com
mer
cial
CPU
DBM
S)
Data Size
CPU CommercialHybrid Commercial (Projected)GPU CommercialCPU PrototypeHybrid Prototype (Projected)GPU Prototype
3
One hardware fits all: The end of an efficient story
20%-80% throughput loss due to lack of portability
-
Designing query engines for heterogeneous HW
4Decomposition of design space to find sweet spot
HardwareOblivious
HardwareConscious
PerformancePortability
-
OLAP in heterogeneous servers: design space
5Selective obliviousness
intra-operator
inter-device
intra-device
Performance depends on μ-arch
Portability impacted by specialization Inject target-specific info using codegen
Limited device inter-operabilityEncapsulate heterogeneity and balance load
GPU⨝
CU⨝⨝σ
GPU⨝⨝σGPU⨝⨝σ
GPU⨝⨝σ
Tune operators to memory hierarchy specifics
[CIDR2019]
-
Optimizer can produce cross-device plans
Inter-device: HetExchange• Decouple data- from control-flow• Operators encapsulate trait conversions
6
filter
unpack
cpu2gpu
gpu2cpu
pack
mem-move
mem-move
aggregate
router
router
unpack
aggregate[VLDB2019]
Flow Scope Trait
ControlDelegation Heterogeneous Parallelism
Routing Homogeneous Parallelism
DataTransfer Data Locality
Granularity Execution Granularity
filter
aggregate
scan
-
Device Boundary Crossings
• Cross-device pipelined execution
• Hand-over execution to next device
• Launch kernels/threads, synchronize, backpressure
• Only operators aware of device heterogeneity
7Encapsulate heterogeneous parallelism
filter
unpack
cpu2gpu
gpu2cpu
pack
mem-move
mem-move
aggregate
router
router
unpack
aggregate
-
Concurrent Execution
• Horizontal & Vertical parallelism
• Instantiate pipelines multiple times
• Routing policies: load-balance, partition, locality
8Encapsulate homogeneous parallelism
filter
unpack
cpu2gpu
gpu2cpu
pack
mem-move
mem-move
aggregate
router
router
unpack
aggregate
-
Data Transfers
• Handle memory transfers/prefetching
• Hide memory topology
• Overlap transfers with execution
9Hide memory heterogeneity
filter
unpack
cpu2gpu
gpu2cpu
pack
mem-move
mem-move
aggregate
router
router
unpack
aggregate
-
Execution Granularity
• Processing: in-registers => tuple-at-a-time
• Memory transfers: packets => block-at-a-time
• Transition between execution granularities
• Create homogeneous (reg. policy) packets
10Transition between execution granularities
filter
unpack
cpu2gpu
gpu2cpu
pack
mem-move
mem-move
aggregate
router
router
unpack
aggregate
-
filter
aggregate
scan
HetExchange
Logical plan
Heterogeneity-aware plans
11
Efficiency&
Operator portability
aggregate
router
segmenter
mem-move
cpu2gpu
unpack
filter
router
aggregate
gpu2cpu
JIT
SELECT SUM(a)FROM TWHERE b > 42
-
aggregate
router
segmenter
mem-move
cpu2gpu
unpack
filter
router
aggregate
gpu2cpu
9
8
4
3
2
1
7
10
11 5
6
devicecrossing
devicecrossing
JIT
pipeline id x
GPU pipelineCPU pipeline
instances
HetExchange in a JITed engine
12Generate pipelines and instantiate
Run
routing point
routing point
filter
aggregate
scan
Logical plan
SELECT SUM(a)FROM TWHERE b > 42
HetExchange
-
aggregate
router
segmenter
mem-move
cpu2gpu
unpack
filter
router
aggregate
gpu2cpu
9
8
4
3
2
1
7
10
11 5
6
devicecrossing
devicecrossing
JIT
13
HetExchange in a JITed engine
-
14
Device providers
Inject target-specific info using the JIT infrastructure
-
nonpartitioned
Device-optimized operators• Same challenges• Similar algorithms• Different mappings
15Reuse algorithms, specialize mappings to hardware
⨝ σΓ
⨝ sort-merge⨝ ⨝⨝ radix-
Boncz et al.[VLDB1999]
Sioulas et al. [ICDE2019]
Fanout: L1 & TLB size Scratchpad size
ScratchpadL1Placement during probe:
⨝
Ξ
-
nonpartitioned
Hardware-dependent JIT code
16Lower generic description to device-specific code
⨝ σΓ
⨝ sort-merge⨝ ⨝⨝ radix-⨝
Ξ
scan
radix-join
scan
CPUProvider
GPUProvider
gpu radix-joincpu radix-joinDevice-optimized implementation
Device-independent implementation
Hardware-aware algorithm
Radix-joinCPU-mapping
Radix-joinGPU-mapping
-
Experimental Setup• 2x Intel Xeon E5-2650L v3 12-core @ 1.80GHz, 256GB RAM
• 2x NVIDIA GeForce GTX1080, 8GB, PCIe3 x16 per GPU
• DBMS C/G: state-of-the-art commercial DBMS– DBMS C: CPU-based, vector-at-a-time, SIMD, based on MonetDB/X100
– DBMS G: GPU-based, JIT engine
17
-
Performance on CPU-resident data
18Hybrid throughput = 88.5% (CPU-only + GPU-only), on average
0
50
100
150
200
DBMS C ProteusCPUs
ProteusHybrid
Projected ProteusGPUs
DBMS G
Exec
utio
n Ti
me
(s)
SSB SF1000, 600GB CSVworking set: 92-138GB / query
PCIe
-
A glimpse into the future• Effect of interconnects and GPU compute power• Access to high-throughput network
19
-
Adapting access method to query
Up to 45% speed-up by tuning access method to hardware
[CIDR2020]
0
20
40
60
80
100
PCIe-GeForce PCIe-Tesla NVLink-Tesla
Exec
utio
n ti
me
(s)
Interconnect technology and GPU type
Eager Lazy SemilazySSB SF1000, 600GB CSV
working set: 92-138GB / query2GPUs per configuration
GPU-only execution
-
Towards placing the CPU on the side
• Shared & limited PCIe buses to NIC/GPU
• Similar intra/inter-server BW
• Direct NIC-GPU access (RDMA)
21Avoid CPU bottleneck => Device-centric OLAP engines
Remote machineRemote storage
Remote machineRemote storage
~12GBps ~12GBps
~12GBps
~32GBps
~12GBps
-
JIT unleashes ALP
• Run on all available devices
• Relational operators oblivious to heterogeneity
• Fast: Inject target-specific information through codegen
• Result: 5x-10x versus CPU-/GPU-specialized systems
22
Thank you!
top related