TRANSCRIPT
Vectorized Execution
@Andy_Pavlo // 15-721 // Spring 2019
Advanced Database Systems
Lecture #20
CMU 15-721 (Spring 2019)
Background
Hardware
Vectorized Algorithms (Columbia)
VECTORIZATION
The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, into a vector implementation that performs one operation on multiple pairs of operands at once.
WHY THIS MATTERS

Say we can parallelize our algorithm over 32 cores.
Each core has 4-wide SIMD registers.
Potential Speed-up: 32x × 4x = 128x
MULTI-CORE CPUS
Use a small number of high-powered cores.
→ Intel Xeon Skylake / Kaby Lake
→ High power consumption and area per core.

Massively superscalar and aggressive out-of-order execution
→ Instructions are issued from a sequential stream.
→ Check for dependencies between instructions.
→ Process multiple instructions per clock cycle.
MANY INTEGRATED CORES (MIC)
Use a larger number of low-powered cores.
→ Intel Xeon Phi
→ Low power consumption and area per core.
→ Expanded SIMD instructions with larger register sizes.

Knights Ferry (Columbia Paper)
→ Non-superscalar and in-order execution
→ Cores = Intel P54C (aka Pentium from the 1990s).

Knights Landing (Since 2016)
→ Superscalar and out-of-order execution.
→ Cores = Silvermont (aka Atom)
SINGLE INSTRUCTION, MULTIPLE DATA
A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously.
All major ISAs have microarchitecture support for SIMD operations.
→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX-512
→ PowerPC: AltiVec
→ ARM: NEON
SIMD EXAMPLE

X + Y = Z, with X = [8 7 6 5 4 3 2 1] and Y = [1 1 1 1 1 1 1 1].

SISD
for (i=0; i<n; i++) {
  Z[i] = X[i] + Y[i];
}
One pair of elements per iteration: x1+y1, then x2+y2, …, then xn+yn.

SIMD
The same loop processes four elements per iteration. Each iteration loads four values of X and four values of Y into 128-bit SIMD registers and performs a single vector add:
  [8 7 6 5] + [1 1 1 1] = [9 8 7 6]
  [4 3 2 1] + [1 1 1 1] = [5 4 3 2]
Result: Z = [9 8 7 6 5 4 3 2].
STREAMING SIMD EXTENSIONS (SSE)
SSE is a collection of SIMD instructions that target special 128-bit SIMD registers.
These registers can be packed with four 32-bit scalars after which an operation can be performed on each of the four elements simultaneously.
First introduced by Intel in 1999.
SIMD INSTRUCTIONS (1)

Data Movement
→ Moving data in and out of vector registers

Arithmetic Operations
→ Apply operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes)
→ Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN

Logical Instructions
→ Logical operations on multiple data items
→ Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS
SIMD INSTRUCTIONS (2)

Comparison Instructions
→ Comparing multiple data items (==, <, <=, >, >=, !=)

Shuffle Instructions
→ Move data in between SIMD registers

Miscellaneous
→ Conversion: Transform data between x86 and SIMD registers.
→ Cache Control: Move data directly from SIMD registers to memory (bypassing CPU cache).
INTEL SIMD EXTENSIONS

Year  Extension  Width     Integers  Single-P  Double-P
1997  MMX        64 bits   ✔
1999  SSE        128 bits  ✔         ✔ (×4)
2001  SSE2       128 bits  ✔         ✔         ✔ (×2)
2004  SSE3       128 bits  ✔         ✔         ✔
2006  SSSE3      128 bits  ✔         ✔         ✔
2006  SSE4.1     128 bits  ✔         ✔         ✔
2008  SSE4.2     128 bits  ✔         ✔         ✔
2011  AVX        256 bits  ✔         ✔ (×8)    ✔ (×4)
2013  AVX2       256 bits  ✔         ✔         ✔
2017  AVX-512    512 bits  ✔         ✔ (×16)   ✔ (×8)

Source: James Reinders
VECTORIZATION

Choice #1: Automatic Vectorization
Choice #2: Compiler Hints
Choice #3: Explicit Vectorization

The choices trade ease of use (highest for #1) against programmer control (highest for #3).

Source: James Reinders
AUTOMATIC VECTORIZATION

The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation.

Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.
AUTOMATIC VECTORIZATION

This loop is not legal to automatically vectorize: X, Y, and Z might point to the same address (e.g., Z = X + 1), and the code is written such that the addition is described as being done sequentially, so the compiler must preserve that order.

void add(int *X, int *Y, int *Z) {
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
COMPILER HINTS

Provide the compiler with additional information about the code to let it know that it is safe to vectorize.

Two approaches:
→ Give explicit information about memory locations.
→ Tell the compiler to ignore vector dependencies.
COMPILER HINTS

The restrict qualifier (standardized in C99; most C++ compilers accept __restrict) tells the compiler that the arrays are distinct locations in memory.

void add(int *restrict X, int *restrict Y, int *restrict Z) {
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
COMPILER HINTS

This pragma tells the compiler to ignore loop dependencies for the vectors. It is up to you to make sure that this is correct.

void add(int *X, int *Y, int *Z) {
  #pragma ivdep
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
EXPLICIT VECTORIZATION

Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions.

Potentially not portable.
EXPLICIT VECTORIZATION

Store the vectors in 128-bit SIMD registers, then invoke the intrinsic to add the vectors together and write them to the output location.

void add(int *X, int *Y, int *Z) {
  __m128i *vecX = (__m128i*)X;
  __m128i *vecY = (__m128i*)Y;
  __m128i *vecZ = (__m128i*)Z;
  for (int i = 0; i < MAX/4; i++) {
    _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
  }
}
VECTORIZATION DIRECTION

Approach #1: Horizontal
→ Perform operation on all elements together within a single vector.
   Example: SIMD add over [0 1 2 3] produces the single value 6.

Approach #2: Vertical
→ Perform operation in an elementwise manner on elements of each vector.
   Example: [0 1 2 3] + [1 1 1 1] = [1 2 3 4].
EXPLICIT VECTORIZATION

Linear Access Operators
→ Predicate evaluation
→ Compression

Ad-hoc Vectorization
→ Sorting
→ Merging

Composable Operations
→ Multi-way trees
→ Bucketized hash tables

Source: Orestis Polychroniou
VECTORIZED DBMS ALGORITHMS

Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality.
→ Favor vertical vectorization by processing different input data per lane.
→ Maximize lane utilization by executing different things per lane subset.

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)
FUNDAMENTAL OPERATIONS

Selective Load
Selective Store
Selective Gather
Selective Scatter
FUNDAMENTAL VECTOR OPERATIONS

Selective Load
Given the vector [A B C D], the mask [0 1 0 1], and memory [U V W X Y Z …], load consecutive values from memory into only the lanes whose mask bit is set: first U, then V, so the vector becomes [A U C V].

Selective Store
Given the vector [A B C D], the mask [0 1 0 1], and memory [U V W X Y Z …], write only the masked lanes to consecutive memory locations: first B, then D, so memory becomes [B D W X Y Z …].
Selective Gather
Given the index vector [2 1 5 3] and memory [U V W X Y Z …] (positions 0 1 2 3 4 5), fetch the value at each lane's index into the value vector: the result is [W V Z X].

Selective Scatter
Given the value vector [A B C D] and the index vector [2 1 5 3], write each value to the memory position given by its lane's index: memory becomes [U B A D Y C …].
ISSUES

Gathers and scatters are not really executed in parallel because the L1 cache allows only one or two distinct accesses per cycle.

Gathers are only supported in newer CPUs.

Selective loads and stores are also implemented in Xeon CPUs using vector permutations.
VECTORIZED OPERATORS

Selection Scans
Hash Tables
Partitioning

Paper provides additional info:
→ Joins, Sorting, Bloom filters.

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)
SELECTION SCANS

Scalar (Branching)
i = 0
for t in table:
  key = t.key
  if (key ≥ low) && (key ≤ high):
    copy(t, output[i])
    i = i + 1

Scalar (Branchless)
i = 0
for t in table:
  copy(t, output[i])
  key = t.key
  m = (key ≥ low ? 1 : 0) && (key ≤ high ? 1 : 0)
  i = i + m

Source: Bogdan Raducanu
SELECTION SCANS

Vectorized
i = 0
for vt in table:
  simdLoad(vt.key, vk)
  vm = (vk ≥ low ? 1 : 0) && (vk ≤ high ? 1 : 0)
  simdStore(vt, vm, output[i])
  i = i + |vm ≠ false|

Example:
SELECT * FROM table WHERE key >= "O" AND key <= "U"

Table (ID, KEY): (1,J) (2,O) (3,Y) (4,S) (5,U) (6,X)

Key Vector:       J O Y S U X
SIMD Compare
Mask:             0 1 0 1 1 0
All Offsets:      0 1 2 3 4 5
SIMD Store
Matched Offsets:  1 3 4
SELECTION SCANS

[Figure: Throughput (billion tuples/sec) vs. selectivity (%) for Scalar (Branching), Scalar (Branchless), Vectorized (Early Mat), and Vectorized (Late Mat), on MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). Throughput is capped by memory bandwidth and declines as selectivity grows.]
HASH TABLES PROBING

Scalar
Linear Probing Hash Table (KEY, PAYLOAD)
For input key k1, compute the hash index h1 = hash(k1), then compare k1 against one key at a time (k9, then k3, then k8, …) until the matching slot k1 is found.

Vectorized (Horizontal)
Linear Probing Bucketized Hash Table (KEYS, PAYLOAD)
For input key k1, compute the hash index h1 = hash(k1), then use a single SIMD Compare of k1 against the entire bucket [k9 k3 k8 k1], producing the matched mask [0 0 0 1].

Vectorized (Vertical)
Linear Probing Hash Table (KEY, PAYLOAD)
Probe four input keys at once:
  Input Key Vector:   k1 k2 k3 k4
  Hash Index Vector:  h1 h2 h3 h4   (hash(key) per lane)
  SIMD Gather fetches the key stored at each lane's slot: k1 k99 k88 k4
  SIMD Compare against the input keys yields the mask: 1 0 0 1
The matched lanes (k1, k4) are refilled with new input keys (k5, k6) and their hash indexes (h5, h6), while the unmatched lanes advance to the next slot: the hash index vector becomes h5 h2+1 h3+1 h6.
HASH TABLES PROBING

[Figure: Throughput (billion tuples/sec) vs. hash table size for Scalar, Vectorized (Horizontal), and Vectorized (Vertical), on MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). Throughput drops once the hash table no longer fits in cache.]
PARTITIONING HISTOGRAM

Use scatters and gathers to increment counts.
Replicate the histogram to handle collisions.

For the input key vector [k1 k2 k3 k4], a SIMD radix computation produces the hash index vector [h1 h2 h3 h4], and a SIMD add increments the corresponding histogram buckets. If two lanes map to the same bucket, a plain SIMD scatter applies only one of the +1 updates (a missing update). The fix is to replicate the histogram once per vector lane, scatter each lane's increment into its own copy, and sum the copies afterward.
JOINS

No Partitioning
→ Build one shared hash table using atomics
→ Partially vectorized

Min Partitioning
→ Partition the building table
→ Build one hash table per thread
→ Fully vectorized

Max Partitioning
→ Partition both tables repeatedly
→ Build and probe cache-resident hash tables
→ Fully vectorized
JOINS

[Figure: Join time (sec) for Scalar vs. Vector implementations of No Partitioning, Min Partitioning, and Max Partitioning, broken down into Partition, Build, Probe, and Build+Probe phases. 200M ⨝ 200M tuples (32-bit keys & payloads), Xeon Phi 7120P – 61 Cores + 4×HT.]
PARTING THOUGHTS

Vectorization is essential for OLAP queries.

These algorithms don't work when the data exceeds your CPU cache.

We can combine all the intra-query parallelism optimizations we've talked about in a DBMS.
→ Multiple threads processing the same query.
→ Each thread can execute a compiled plan.
→ The compiled plan can invoke vectorized operations.
NEXT CLASS

Compilation vs. Vectorization