
  • DATABASE SYSTEM IMPLEMENTATION

    GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ

    LECTURE #23: VECTORIZED EXECUTION

  • ANATOMY OF A DATABASE SYSTEM

    Process Manager
    → Connection Manager + Admission Control

    Query Processor
    → Query Parser
    → Query Optimizer
    → Query Executor

    Transactional Storage Manager
    → Lock Manager (Concurrency Control)
    → Access Methods (or Indexes)
    → Buffer Pool Manager
    → Log Manager

    Shared Utilities
    → Memory Manager + Disk Manager
    → Networking Manager

    2

    Source: Anatomy of a Database System
    http://cs.brown.edu/courses/cs295-11/2006/anatomyofadatabase.pdf

  • TODAY’S AGENDA

    Background
    Hardware
    Vectorized Algorithms (Columbia)

    3

  • VECTORIZATION

    The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, into a vector implementation, which performs one operation on multiple pairs of operands at once.

    4

  • WHY THIS MATTERS

    Say we can parallelize our algorithm across 32 cores, and each core has 4-wide SIMD registers.

    Potential Speed-up: 32x × 4x = 128x

    5

  • MULTI-CORE CPUS

    Use a small number of high-powered cores.
    → Intel Xeon Skylake / Kaby Lake
    → High power consumption and area per core.

    Massively superscalar and aggressive out-of-order execution
    → Instructions are issued from a sequential stream.
    → Check for dependencies between instructions.
    → Process multiple instructions per clock cycle.

    6

  • MANY INTEGRATED CORES (MIC)

    Use a larger number of low-powered cores.
    → Intel Xeon Phi
    → Low power consumption and area per core.
    → Expanded SIMD instructions with larger register sizes.

    Knights Ferry (Columbia Paper)
    → Non-superscalar and in-order execution
    → Cores = Intel P54C (aka Pentium from the 1990s).

    Knights Landing (Since 2016)
    → Superscalar and out-of-order execution.
    → Cores = Silvermont (aka Atom)

    7

    http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
    https://en.wikipedia.org/wiki/P5_(microarchitecture)
    https://en.wikipedia.org/wiki/Xeon_Phi


  • SINGLE INSTRUCTION, MULTIPLE DATA

    A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously.

    All major ISAs have microarchitecture support for SIMD operations.
    → x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX-512
    → PowerPC: Altivec
    → ARM: NEON

    11

  • SIMD EXAMPLE

    12

    X + Y = Z:

    $\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix}$

  • SIMD EXAMPLE

    13–22

    X + Y = Z over eight-element arrays:

    for (i = 0; i < n; i++) {
      Z[i] = X[i] + Y[i];
    }

    SISD: the loop performs one addition per iteration, stepping through X, Y, and Z element by element.

    SIMD: the same loop processes four elements per instruction by packing them into 4-wide vector registers.

  • STREAMING SIMD EXTENSIONS (SSE)

    SSE is a collection of SIMD instructions that target special 128-bit SIMD registers. These registers can be packed with four 32-bit scalars, after which an operation can be performed on each of the four elements simultaneously.

    First introduced by Intel in 1999.

    23

  • SIMD INSTRUCTIONS (1)

    Data Movement
    → Moving data in and out of vector registers

    Arithmetic Operations
    → Apply operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes)
    → Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN

    Logical Instructions
    → Logical operations on multiple data items
    → Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS

    24

  • SIMD INSTRUCTIONS (2)

    Comparison Instructions
    → Comparing multiple data items (==, <, <=, >, >=, !=)

    Shuffle Instructions
    → Move data in between SIMD registers

    Miscellaneous
    → Conversion: Transform data between x86 and SIMD registers.
    → Cache Control: Move data directly from SIMD registers to memory (bypassing the CPU cache).

    25

  • INTEL SIMD EXTENSIONS

    26

    Year  Extension  Width
    1997  MMX        64 bits
    1999  SSE        128 bits  (×4 single-precision)
    2001  SSE2       128 bits  (×2 double-precision)
    2004  SSE3       128 bits
    2006  SSSE3      128 bits
    2006  SSE4.1     128 bits
    2008  SSE4.2     128 bits
    2011  AVX        256 bits  (×8 single, ×4 double)
    2013  AVX2       256 bits
    2017  AVX-512    512 bits  (×16 single, ×8 double)

    Source: James Reinders
    https://www.youtube.com/watch?v=_OJmxi4-twY

  • WHY NOT GPUS?

    Moving data back and forth between DRAM and the GPU over the PCI-E bus is slow.

    There are some newer GPU-enabled DBMSs
    → Examples: MapD, SQream, Kinetica

    Emerging co-processors that can share the CPU’s memory may change this.
    → Examples: AMD’s APU, Intel’s Knights Landing

    27

    https://www.mapd.com/
    http://sqream.com/
    https://www.kinetica.com/

  • VECTORIZATION

    Choice #1: Automatic Vectorization

    Choice #2: Compiler Hints

    Choice #3: Explicit Vectorization

    Moving down this list trades ease of use for programmer control.

    28

    Source: James Reinders
    https://www.youtube.com/watch?v=_OJmxi4-twY


  • AUTOMATIC VECTORIZATION

    The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation.

    Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.

    30

  • AUTOMATIC VECTORIZATION

    31

    void add(int *X, int *Y, int *Z) {
      for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
      }
    }

  • AUTOMATIC VECTORIZATION

    32–35

    This loop is not legal to automatically vectorize: the pointers X, Y, and Z might point to the same (or overlapping) addresses, so the compiler must assume the additions are done sequentially, exactly as written.

    These might point to the same address!

    void add(int *X, int *Y, int *Z) {
      for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
      }
    }
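
    To see why aliasing blocks vectorization, here is a small self-contained illustration (my own example, not from the slides): when Z overlaps X shifted by one element, the scalar loop and a 4-wide SIMD add produce different answers.

    #include <stdio.h>

    int main(void) {
      int buf[5] = {1, 1, 1, 1, 1};
      int *X = buf, *Y = buf, *Z = buf + 1;   /* Z aliases X shifted by 1 */
      for (int i = 0; i < 4; i++) {
        Z[i] = X[i] + Y[i];   /* each add sees the previous write */
      }
      /* Scalar order prints 1 2 4 8 16. A 4-wide SIMD add would read
         X[0..3] = 1,1,1,1 before any write and produce 1 2 2 2 2. */
      for (int i = 0; i < 5; i++) printf("%d ", buf[i]);
      printf("\n");
      return 0;
    }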

  • COMPILER HINTS

    Provide the compiler with additional information about the code to let it know that it is safe to vectorize.

    Two approaches:
    → Give explicit information about memory locations.
    → Tell the compiler to ignore vector dependencies.

    36

  • COMPILER HINTS

    The restrict keyword (standard in C99; available in C++ compilers as the __restrict extension) tells the compiler that the arrays are distinct locations in memory.

    37

    void add(int *restrict X,
             int *restrict Y,
             int *restrict Z) {
      for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
      }
    }

  • COMPILER HINTS

    This pragma tells the compiler to ignore loop dependencies for the vectors.

    It’s up to you to make sure that this is correct.

    38

    void add(int *X, int *Y, int *Z) {
      #pragma ivdep
      for (int i = 0; i < MAX; i++) {
        Z[i] = X[i] + Y[i];
      }
    }
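
    #pragma ivdep is the Intel compiler's spelling. As a hedged sketch, two widely available alternatives (assuming GCC for the first, and an OpenMP-4.0-capable compiler for the second) look like this:

    #define MAX 1024

    /* GCC dialect of ivdep. */
    void add_gcc(int *X, int *Y, int *Z) {
      #pragma GCC ivdep
      for (int i = 0; i < MAX; i++)
        Z[i] = X[i] + Y[i];
    }

    /* OpenMP SIMD directive (compile with -fopenmp or -fopenmp-simd). */
    void add_omp(int *X, int *Y, int *Z) {
      #pragma omp simd
      for (int i = 0; i < MAX; i++)
        Z[i] = X[i] + Y[i];
    }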

  • EXPLICIT VECTORIZATION

    Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions.

    Potentially not portable.

    39

  • EXPLICIT VECTORIZATION

    Store the vectors in 128-bit SIMD registers.

    Then invoke the intrinsic to add together the vectors and write them to the output location.

    40

    void add(int *X, int *Y, int *Z) {
      __m128i *vecX = (__m128i*)X;
      __m128i *vecY = (__m128i*)Y;
      __m128i *vecZ = (__m128i*)Z;
      for (int i = 0; i < MAX/4; i++) {
        _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
      }
    }
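
    A self-contained, runnable version of this sketch (my own scaffolding around the slide's function; assumes SSE2 and 16-byte-aligned arrays whose length is a multiple of 4):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    #define MAX 8

    void add(int *X, int *Y, int *Z) {
      __m128i *vecX = (__m128i*)X;
      __m128i *vecY = (__m128i*)Y;
      __m128i *vecZ = (__m128i*)Z;
      for (int i = 0; i < MAX/4; i++) {
        /* _mm_store_si128 requires 16-byte alignment; use
           _mm_loadu_si128/_mm_storeu_si128 for unaligned data. */
        _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
      }
    }

    int main(void) {
      _Alignas(16) int X[MAX] = {1, 2, 3, 4, 5, 6, 7, 8};
      _Alignas(16) int Y[MAX] = {8, 7, 6, 5, 4, 3, 2, 1};
      _Alignas(16) int Z[MAX];
      add(X, Y, Z);
      for (int i = 0; i < MAX; i++) printf("%d ", Z[i]);  /* prints 9 eight times */
      printf("\n");
      return 0;
    }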

  • VECTORIZATION DIRECTION

    Approach #1: Horizontal
    → Perform operation on all elements together within a single vector.

    Approach #2: Vertical
    → Perform operation in an elementwise manner on elements of each vector.

    41

    Source: Przemysław Karpiński
    https://gain-performance.com/2017/05/01/umesimd-tutorial-2-calculation/

  • VECTORIZATION DIRECTION

    42–43

    [Figure: a horizontal SIMD add reduces the single vector (0, 1, 2, 3) to its sum, 6; a vertical SIMD add combines (0, 1, 2, 3) and (1, 1, 1, 1) elementwise into (1, 2, 3, 4).]

    Source: Przemysław Karpiński
    https://gain-performance.com/2017/05/01/umesimd-tutorial-2-calculation/
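
    The same two directions expressed with x86 intrinsics (an illustrative sketch, assuming SSSE3 for _mm_hadd_epi32; the lane values match the figure above):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
      __m128i v = _mm_setr_epi32(0, 1, 2, 3);
      __m128i w = _mm_setr_epi32(1, 1, 1, 1);

      /* Vertical: elementwise add of two vectors -> (1, 2, 3, 4). */
      __m128i vert = _mm_add_epi32(v, w);

      /* Horizontal: reduce one vector to its sum -> 6. */
      __m128i h = _mm_hadd_epi32(v, v);  /* (0+1, 2+3, 0+1, 2+3) */
      h = _mm_hadd_epi32(h, h);          /* (6, 6, 6, 6)         */
      int sum = _mm_cvtsi128_si32(h);

      int out[4];
      _mm_storeu_si128((__m128i*)out, vert);
      printf("vertical = %d %d %d %d, horizontal sum = %d\n",
             out[0], out[1], out[2], out[3], sum);
      return 0;
    }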

  • EXPLICIT VECTORIZATION

    Linear Access Operators
    → Predicate evaluation
    → Compression

    Ad-hoc Vectorization
    → Sorting
    → Merging

    Composable Operations
    → Multi-way trees
    → Bucketized hash tables

    44

    Source: Orestis Polychroniou
    http://www.cs.columbia.edu/~orestis

  • VECTORIZED DBMS ALGORITHMS

    Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality.
    → Favor vertical vectorization by processing different input data per lane.
    → Maximize lane utilization by executing different things per lane subset.

    45

    RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES
    SIGMOD 2015

  • FUNDAMENTAL OPERATIONS

    Selective Load
    Selective Store
    Selective Gather
    Selective Scatter

    46

  • FUNDAMENTAL VECTOR OPERATIONS

    47–54

    Selective Load

    Vector: A B C D    Mask: 0 1 0 1    Memory: U V W X Y Z …

    The next contiguous values from memory (U, then V) are loaded only into the lanes where the mask is set, yielding A U C V.

  • FUNDAMENTAL VECTOR OPERATIONS

    55–60

    Selective Store

    Vector: A B C D    Mask: 0 1 0 1    Memory: U V W X Y Z …

    The lanes where the mask is set (B, then D) are written to the next contiguous memory locations.

  • FUNDAMENTAL VECTOR OPERATIONS

    61–66

    Selective Gather

    Index Vector: 2 1 5 3    Memory (offsets 0–5): U V W X Y Z …

    Each lane loads the memory value at its index, so the value vector becomes W V Z X.

  • FUNDAMENTAL VECTOR OPERATIONS

    67–71

    Selective Scatter

    Value Vector: A B C D    Index Vector: 2 1 5 3    Memory (offsets 0–5): U V W X Y Z …

    Each lane writes its value to the memory location at its index, so memory becomes U B A D Y C.
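
    On modern x86 these four operations map directly onto AVX-512 intrinsics. A hedged sketch (assumes an AVX-512F-capable CPU and 16 lanes of 32-bit integers; compile with -mavx512f):

    #include <immintrin.h>

    void fundamental_ops(int *mem, const int *indexes) {
      __m512i vec    = _mm512_loadu_si512(mem);
      __mmask16 mask = 0xAAAA;  /* every other lane set */
      __m512i vidx   = _mm512_loadu_si512(indexes);

      /* Selective load: pack the next contiguous memory values
         into the lanes where the mask is set. */
      vec = _mm512_mask_expandloadu_epi32(vec, mask, mem);

      /* Selective store: write the masked lanes contiguously. */
      _mm512_mask_compressstoreu_epi32(mem, mask, vec);

      /* Gather: each lane loads mem[indexes[lane]] (scale = 4 bytes). */
      __m512i gathered = _mm512_i32gather_epi32(vidx, mem, 4);

      /* Scatter: each lane stores its value to mem[indexes[lane]]. */
      _mm512_i32scatter_epi32(mem, vidx, gathered, 4);
    }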

  • ISSUES

    Gathers and scatters are not really executed in parallel because the L1 cache only allows one or two distinct accesses per cycle.

    Gathers are only supported in newer CPUs.

    Selective loads and stores are also emulated in Xeon CPUs using vector permutations.

    72

    https://software.intel.com/en-us/node/683481

  • VECTORIZED OPERATORS

    Selection Scans
    Hash Tables
    Partitioning

    Paper provides additional info:
    → Joins, Sorting, Bloom filters.

    73

    RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES
    SIGMOD 2015

  • SELECTION SCANS

    74

    SELECT * FROM table
    WHERE key >= $(low) AND key <= $(high)

  • SELECTION SCANS

    75–79

    Scalar (Branching)

    i = 0
    for t in table:
      key = t.key
      if (key≥low) && (key≤high):
        copy(t, output[i])
        i = i + 1

    Scalar (Branchless)

    i = 0
    for t in table:
      copy(t, output[i])
      key = t.key
      m = (key≥low ? 1 : 0) && (key≤high ? 1 : 0)
      i = i + m

    Source: Bogdan Raducanu
    http://15721.courses.cs.cmu.edu/spring2017/papers/20-compilation/p1231-raducanu.pdf
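
    As a concrete illustration of the branchless variant (my own C sketch, not from the slides; the table is modeled as a plain array of 32-bit keys and the output stores matching offsets):

    #include <stddef.h>
    #include <stdint.h>

    size_t scan_branchless(const int32_t *keys, size_t n,
                           int32_t low, int32_t high, uint32_t *out) {
      size_t i = 0;
      for (size_t t = 0; t < n; t++) {
        out[i] = (uint32_t)t;                    /* write unconditionally */
        int m = (keys[t] >= low) & (keys[t] <= high);
        i += (size_t)m;                          /* advance only on match */
      }
      return i;  /* number of matches */
    }

    The unconditional store plus predicated increment removes the data-dependent branch, so throughput stays flat across selectivities instead of collapsing when the branch becomes unpredictable.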

  • SELECTION SCANS

    80–91

    Vectorized

    i = 0
    for vt in table:
      simdLoad(vt.key, vk)
      vm = (vk≥low ? 1 : 0) && (vk≤high ? 1 : 0)
      simdStore(vt, vm, output[i])
      i = i + |vm≠false|

    Example:

    SELECT * FROM table
    WHERE key >= "O" AND key <= "U"

    Table (ID, KEY): (1, J) (2, O) (3, Y) (4, S) (5, U) (6, X)

    Key Vector: J O Y S U X
    SIMD Compare → Mask: 0 1 0 1 1 0
    All Offsets: 0 1 2 3 4 5
    SIMD Store (masked) → Matched Offsets: 1 3 4
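
    A hedged AVX-512 sketch of this loop over 32-bit integer keys (assumes AVX-512F, n divisible by 16, and a GCC/Clang-style __builtin_popcount; out receives the offsets of matching keys):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    size_t scan_simd(const int32_t *keys, size_t n,
                     int32_t low, int32_t high, int32_t *out) {
      const __m512i vlow  = _mm512_set1_epi32(low);
      const __m512i vhigh = _mm512_set1_epi32(high);
      const __m512i step  = _mm512_set1_epi32(16);
      __m512i offsets = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                          8, 9, 10, 11, 12, 13, 14, 15);
      size_t i = 0;
      for (size_t t = 0; t < n; t += 16) {
        __m512i vk = _mm512_loadu_si512(keys + t);
        /* vm = (vk >= low) && (vk <= high), one bit per lane. */
        __mmask16 vm = _mm512_cmpge_epi32_mask(vk, vlow)
                     & _mm512_cmple_epi32_mask(vk, vhigh);
        /* Selective store: pack matching offsets into the output. */
        _mm512_mask_compressstoreu_epi32(out + i, vm, offsets);
        i += (size_t)__builtin_popcount((unsigned)vm);
        offsets = _mm512_add_epi32(offsets, step);
      }
      return i;
    }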

  • SELECTION SCANS

    92–97

    [Figure: throughput (billion tuples/sec) vs. selectivity (%) for Scalar (Branching), Scalar (Branchless), Vectorized (Early Mat), and Vectorized (Late Mat), on two platforms: MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). Data labels range up to roughly 5.7 billion tuples/sec; throughput falls as selectivity grows, and all variants converge once the scan hits the memory-bandwidth ceiling ("Memory Bandwidth" annotations).]

  • HASH TABLES – PROBING

    98–102

    Scalar

    Linear Probing Hash Table (KEY, PAYLOAD)

    Probe with input key k1: compute h1 = hash(k1), then compare k1 against the key in each slot (k9, then k3, then k8, …), advancing one slot at a time until k1 is found or an empty slot is reached.

  • HASH TABLES – PROBING

    103–106

    Vectorized (Horizontal)

    Linear Probing Bucketized Hash Table (KEYS, PAYLOAD): each bucket holds several keys.

    Probe with input key k1: compute h1 = hash(k1), then SIMD Compare k1 against all keys in the bucket (k9 k3 k8 k1) at once → Matched Mask 0 0 0 1.

  • HASH TABLES – PROBING

    107–114

    Vectorized (Vertical)

    Linear Probing Hash Table (KEY, PAYLOAD)

    Probe with a whole vector of input keys (k1 k2 k3 k4):
    → Compute the Hash Index Vector (h1 h2 h3 h4) with hash(key).
    → SIMD Gather the table keys at those slots (k1 k99 k88 k4).
    → SIMD Compare against the input keys → mask 1 0 0 1.
    → Matched lanes (k1, k4) emit their results and are refilled with fresh input keys (k5, k6); unmatched lanes advance to the next slot (h2+1, h3+1).
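
    A simplified AVX-512 sketch of vertical probing (my own illustration under strong assumptions: AVX-512F, a power-of-two table of 32-bit keys, a toy multiplicative hash, every probe key present, and no lane refill; finished lanes simply go idle instead of taking new keys):

    #include <immintrin.h>
    #include <stdint.h>

    /* match_slot: 16 entries, lane-indexed; receives the slot where
       each probe key was found. */
    void probe_vertical(const int32_t *ht_keys, uint32_t ht_mask,
                        const int32_t *probe, int32_t *match_slot) {
      __m512i keys = _mm512_loadu_si512(probe);      /* 16 probe keys */
      /* Toy multiplicative hash per lane, masked into the table. */
      __m512i idx = _mm512_and_si512(
          _mm512_mullo_epi32(keys, _mm512_set1_epi32((int32_t)0x9E3779B1)),
          _mm512_set1_epi32((int32_t)ht_mask));
      __mmask16 active = 0xFFFF;
      while (active) {
        /* Gather the table key at each active lane's current slot. */
        __m512i cur = _mm512_mask_i32gather_epi32(
            _mm512_setzero_si512(), active, idx, ht_keys, 4);
        /* Lanes whose gathered key equals their probe key are done. */
        __mmask16 hit = _mm512_mask_cmpeq_epi32_mask(active, cur, keys);
        _mm512_mask_storeu_epi32(match_slot, hit, idx);
        active = (__mmask16)(active & ~hit);
        /* Remaining lanes advance to the next slot (linear probing). */
        idx = _mm512_and_si512(
            _mm512_add_epi32(idx, _mm512_set1_epi32(1)),
            _mm512_set1_epi32((int32_t)ht_mask));
      }
    }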

  • HASH TABLES – PROBING

    115–119

    [Figure: throughput (billion tuples/sec) vs. hash table size (4KB–64MB) for Scalar, Vectorized (Horizontal), and Vectorized (Vertical), on MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). Data labels range up to roughly 2.4 billion tuples/sec; the vectorized advantage shrinks once the table no longer fits in cache ("Out of Cache" annotations).]

  • PARTITIONING – HISTOGRAM

    120–126

    Use scatters and gathers to increment counts. Replicate the histogram to handle collisions.

    Input Key Vector (k1 k2 k3 k4) → SIMD Radix → Hash Index Vector (h1 h2 h3 h4).

    A plain SIMD scatter of +1 into a single histogram loses updates when two lanes hit the same bucket. Instead, replicate the histogram once per vector lane (# of Vector Lanes copies): each lane scatters its +1 into its own copy, so lanes never collide, and a final SIMD Add folds the copies into the actual histogram.
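
    A hedged AVX-512 sketch of the replicated update (assumes AVX-512F; the hypothetical layout hist[bucket * 16 + lane] keeps one copy per lane so lanes in the same vector never collide):

    #include <immintrin.h>
    #include <stdint.h>

    enum { LANES = 16 };

    /* hist has NBUCKETS * LANES entries: one histogram copy per lane. */
    void histogram_update(int32_t *hist, __m512i hash_idx) {
      const __m512i lane_id = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                                8, 9, 10, 11, 12, 13, 14, 15);
      /* Per-lane slot: bucket * LANES + lane. */
      __m512i slot = _mm512_add_epi32(
          _mm512_mullo_epi32(hash_idx, _mm512_set1_epi32(LANES)),
          lane_id);
      /* Gather current counts, add 1, scatter back. Distinct slots per
         lane make the read-modify-write safe within one vector. */
      __m512i cnt = _mm512_i32gather_epi32(slot, hist, 4);
      cnt = _mm512_add_epi32(cnt, _mm512_set1_epi32(1));
      _mm512_i32scatter_epi32(hist, slot, cnt, 4);
    }
    /* Afterwards: final[b] = sum over lanes of hist[b * LANES + lane]. */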

  • JOINS

    No Partitioning
    → Build one shared hash table using atomics
    → Partially vectorized

    Min Partitioning
    → Partition building table
    → Build one hash table per thread
    → Fully vectorized

    Max Partitioning
    → Partition both tables repeatedly
    → Build and probe cache-resident hash tables
    → Fully vectorized

    127

  • JOINS

    128

    [Figure: join time (sec), broken into Partition, Build, Probe, and Build+Probe phases, for Scalar vs. Vector variants of No Partitioning, Min Partitioning, and Max Partitioning. Workload: 200M ⨝ 200M tuples (32-bit keys & payloads) on a Xeon Phi 7120P – 61 Cores + 4×HT.]

  • PARTING THOUGHTS

    Vectorization is essential for OLAP queries. These algorithms don’t work as well when the data exceeds your CPU cache.

    We can combine all the intra-query parallelism optimizations we’ve talked about in a DBMS.
    → Multiple threads processing the same query.
    → Each thread can execute a compiled plan.
    → The compiled plan can invoke vectorized operations.

    129

  • NEXT CLASS

    Reminder: No class on 4/18 (Thu).

    Reminder: Guest lecture on 4/23 (Tue).

    Reminder: Extra credit due on 4/18 (Thu).

    Reminder: Final presentation on 4/25 (Thu).

    130