FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs

N. Satish, C. Kim, J. Chhugani, A. Nguyen, V. Lee, D. Kim, P. Dubey

SIGMOD 2010

Presented by: Andy Hwang


Motivation

• Index trees are not optimized for architecture

• Only one node is accessed per tree level, so cache lines are used ineffectively

• Prefetching cannot be used (the next node depends on comparing the search key with the parent)

• Nodes in different pages, causing TLB misses

• Previous work optimized for page, cache, SIMD separately, not together

• Compression can be used to save memory bandwidth


Motivation: Index Tree Layout

Bad for traversal


Motivation

Hierarchical Blocking

CPU/GPU Implementation

Compression

Throughput/Response Time

Summary/Discussion


Hierarchical Blocking

Optimize for accesses (SIMD/cache/memory)
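A minimal sketch of the blocking idea, assuming a complete binary tree over sorted keys and blocks ordered breadth-first at every nesting level (that ordering is an assumption of this sketch; the slides do not spell it out). The names `bfs_levels`, `emit`, and `hierarchical_layout` are hypothetical, and `depths` lists block depths from innermost (SIMD) to outermost.

```python
def bfs_levels(keys, lo, hi, levels, out):
    """Append the top `levels` levels of the subtree over keys[lo:hi] in
    breadth-first order; return the key ranges of the subtrees left below."""
    frontier = [(lo, hi)]
    for _ in range(levels):
        nxt = []
        for a, b in frontier:
            if a < b:
                m = (a + b) // 2
                out.append(keys[m])
                nxt += [(a, m), (m + 1, b)]
        frontier = nxt
    return [(a, b) for a, b in frontier if a < b]


def emit(keys, lo, hi, levels, depths, out):
    """Lay out `levels` levels as nested blocks (depths: innermost first)."""
    if not depths:
        return bfs_levels(keys, lo, hi, levels, out)
    frontier, remaining = [(lo, hi)], levels
    while remaining > 0 and frontier:
        step = min(depths[-1], remaining)
        nxt = []
        for a, b in frontier:
            # Each block at this nesting level is itself laid out
            # using the smaller block depths.
            nxt += emit(keys, a, b, step, depths[:-1], out)
        frontier = nxt
        remaining -= step
    return frontier


def hierarchical_layout(sorted_keys, depths):
    """Reorder 2**h - 1 sorted keys so each block is contiguous in memory."""
    n = len(sorted_keys)
    height = n.bit_length()
    assert n == 2 ** height - 1, "sketch assumes a complete binary tree"
    out = []
    emit(sorted_keys, 0, n, height, depths, out)
    return out


# 15 keys, SIMD blocks of depth 2 (3 keys each)
layout = hierarchical_layout(list(range(1, 16)), (2,))
print(layout)
```

With 15 keys and a block depth of 2, the root block (8, 4, 12) comes first, followed by the four child blocks, so every group of three consecutive keys is one SIMD block.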



Tree Construction

• Assuming 4-byte (32-bit) keys

• Block size depends on SIMD instruction width, cache line size, and page size

• Use one SIMD instruction to calculate multiple indices

• Parallelize output across CPU cores / GPU streaming multiprocessors


Tree Construction: CPU

• 128-bit SIMD = max 4 nodes at once

• SIMD block = 2 tree levels (3 nodes)

• 64-byte cache line = max 16 nodes

• Cache line block = 4 levels (15 nodes)

• 2MB page size

• Page block = 19 levels

• 4KB page = 10 levels
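The block depths above follow from simple arithmetic: a block of depth d holds 2^d - 1 keys, so the depth is the largest d whose keys fit in the given capacity. A small sketch, assuming 4-byte keys as on the slides; the helper name `block_depth` is hypothetical.

```python
KEY_BYTES = 4  # 4-byte keys, as assumed on the slides


def block_depth(capacity_bytes, key_bytes=KEY_BYTES):
    """Largest d such that a complete subtree of 2**d - 1 keys fits."""
    max_keys = capacity_bytes // key_bytes
    # 2**d - 1 <= max_keys  <=>  d <= floor(log2(max_keys + 1))
    return (max_keys + 1).bit_length() - 1


depths = {
    "128-bit SIMD": block_depth(16),
    "64-byte cache line": block_depth(64),
    "4KB page": block_depth(4 * 1024),
    "2MB page": block_depth(2 * 1024 * 1024),
}
for name, d in depths.items():
    print(f"{name}: depth {d}, {2 ** d - 1} keys per block")
```

This reproduces the slide's figures: depth 2 (3 nodes) for SIMD, depth 4 (15 nodes) for a cache line, 10 levels for a 4KB page, and 19 levels for a 2MB page.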


Tree Construction: GPU

• SIMD width of 32 data elements (one thread warp)

• Various SIMD block sizes possible (up to 32)

• Set SIMD block depth to 4 (15 nodes) to match the half-warp instruction granularity

• No cache exposed – cache line block size set equal to SIMD block size

Tree Traversal: CPU

Tree Traversal: GPU

Simultaneous Queries

• Issue queries in parallel on the hardware

• Software pipelining used to hide cache/TLB miss or GPU memory latency

• CPU: 8 concurrent queries per thread, 64 total

• GPU: 2 concurrent queries per thread warp, 960 total
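The interleaving behind these numbers can be sketched as follows. This is an illustration only: it advances a batch of queries one tree level at a time over a plain breadth-first tree, so that in a real implementation the memory access of one query could overlap with work on the others. Python cannot issue prefetches, and the function names `to_bfs` and `search_batch` are hypothetical.

```python
from bisect import bisect_left


def to_bfs(sorted_keys):
    """Reorder sorted keys (length 2**h - 1) into breadth-first tree order."""
    out, frontier = [], [(0, len(sorted_keys))]
    while frontier:
        nxt = []
        for lo, hi in frontier:
            if lo < hi:
                mid = (lo + hi) // 2
                out.append(sorted_keys[mid])
                nxt += [(lo, mid), (mid + 1, hi)]
        frontier = nxt
    return out


def search_batch(tree, queries):
    """Return, for each query, the number of stored keys smaller than it."""
    height = len(tree).bit_length()      # tree holds 2**height - 1 keys
    pos = [0] * len(queries)             # one cursor per in-flight query
    for _ in range(height):              # outer loop: tree levels
        for i, q in enumerate(queries):  # inner loop: rotate over queries
            # A native implementation would prefetch the next node here
            # and move on to the next query while the load is in flight.
            pos[i] = 2 * pos[i] + 1 + (q > tree[pos[i]])
    return [p - len(tree) for p in pos]


keys = list(range(10, 80, 10))                # 7 sorted keys
tree = to_bfs(keys)
queries = [5, 10, 35, 40, 55, 70, 75, 80]     # 8 concurrent queries
ranks = search_batch(tree, queries)
print(ranks)
```

The batch size of 8 mirrors the per-thread CPU figure on the slide; the results agree with `bisect_left` on the sorted keys.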

Optimization Speedup

CPU vs GPU Search Throughput

Tree Traversal: MICA

• Intel Many-Core Architecture Platform

• Intel GPGPU effort

• 32KB L1, 256KB L2 (partitioned)

• 4 threads/core

• Traversal code similar to CPU

• 16-wide SIMD

• SIMD block depth = 4 (15 nodes at once)


Tree Traversal: MICA

Throughput (million queries / sec)

        Small Tree (64K keys)    Large Tree (16M keys)
CPU     280                      60
GPU     150                      100
MICA    667                      183

Benefits of both CPU and GPU!


Compression

• Key sizes vary in practice

• Key size affects cache line and page usage

• Non-Contiguous Common Prefix

• Hashing keys based on their difference (partial keys)

• 4-bit blocks as unit of compression

• SIMD instruction to find similarity and compress
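One reading of the bullets above, sketched under assumptions: treat each key as a sequence of 4-bit blocks, drop every block position where all keys agree, and concatenate what remains into a partial key. The function names and the three sample keys are hypothetical, and the scalar loop stands in for the SIMD comparison mentioned on the slide.

```python
KEY_BITS = 32  # sample keys are 32-bit integers (8 blocks of 4 bits)


def differing_positions(keys, key_bits=KEY_BITS):
    """Positions (0 = most significant) of 4-bit blocks that differ
    between at least two keys."""
    positions = []
    for p in range(key_bits // 4):
        shift = key_bits - 4 * (p + 1)
        if len({(k >> shift) & 0xF for k in keys}) > 1:
            positions.append(p)
    return positions


def compress(key, positions, key_bits=KEY_BITS):
    """Concatenate the 4-bit blocks of `key` at the given positions."""
    out = 0
    for p in positions:
        shift = key_bits - 4 * (p + 1)
        out = (out << 4) | ((key >> shift) & 0xF)
    return out


keys = [0x12A45601, 0x12B45602, 0x12C45F03]
positions = differing_positions(keys)
partial = [compress(k, positions) for k in keys]
print(positions, [hex(p) for p in partial])
```

Here only blocks 2, 5, and 7 differ, so each 32-bit key shrinks to a 12-bit partial key. Because the dropped blocks are identical in every stored key, the partial keys keep the same relative order as the full keys.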


Compression

• First page partial key size is larger (128 bits) to reduce false positives

• Subsequent pages have a partial key size of 32 bits

• Construction overhead increased

• +75% for variable-size keys, +30% for integer keys

• During traversal, the query key is compressed
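Why false positives arise can be shown with a small sketch. It assumes the compressor keeps fixed 4-bit block positions and that a final full-key comparison rejects mismatches; both are assumptions of this example. The positions, keys, and the dictionary used in place of the tree are hypothetical.

```python
KEY_BITS = 32
POSITIONS = (2, 5, 7)  # hypothetical block positions kept by the compressor


def partial_key(key):
    """Concatenate the 4-bit blocks of `key` at POSITIONS."""
    out = 0
    for p in POSITIONS:
        out = (out << 4) | ((key >> (KEY_BITS - 4 * (p + 1))) & 0xF)
    return out


stored = [0x12A45601, 0x12B45602, 0x12C45F03]
# A dict stands in for the tree search over partial keys.
index = {partial_key(k): k for k in stored}


def lookup(query):
    """Search with the compressed query, then verify against the full key."""
    candidate = index.get(partial_key(query))
    return candidate if candidate == query else None


# This query differs from every stored key only in dropped blocks,
# so its partial key collides with a stored one.
query = 0x99A45601
print(hex(partial_key(query)), lookup(query))
```

The query is not in the index, yet its partial key equals that of `0x12A45601`. Keeping more bits in the partial key, as the first page does, makes such collisions less likely.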

Compression: Alphabet Size

Compression: Throughput

Query Batching/Buffering

Summary

• Hierarchical blocking to optimize search tree for page, cache, SIMD instructions

• Architecture-aware block depths

• CPU/GPU/MICA implementations

• Fast construction, search, and parallel queries for varying tree sizes

• Hide memory latency wherever possible

• NCCP compression for integer and variable-length keys

• Throughput/response time for different query batching schemes


Discussion

• Focus on throughput

• Assumes large number of queries

• Not much info on latency

• Updates

• Full reconstruction? Flushed from cache?

• Synthetic workloads

• Deployment
