future vector microprocessor xtensions...
TRANSCRIPT
Timothy Hayes, Oscar Palomar,
Osman Unsal, Adrian Cristal, Mateo Valero
FUTURE VECTOR MICROPROCESSOR
EXTENSIONS FOR DATA AGGREGATIONS
ISCA-43 (2016)
Motivation
Data generation growing at an exponential rate
Increasing demand to summarise/aggregate data quickly
Since ~2005, frequency scaling no longer viable
Explicit forms of parallelism must be used
Data-level parallelism (DLP) is excellent when available
Vector SIMD ISAs are highly efficient
Compact representation – Implicit parallelism – Scalable
Energy-efficient hardware implementations
Vector SIMD ISA perfect when DLP is regular
Many algorithms need transformations to be regular
Transformations often hurt performance
Future Vector Microprocessor Extensions for Data Aggregations2
Contributions
We examine applicability of vector SIMD to aggregations
Propose and evaluate different algorithms
1. With transformations to use typical vector instructions
2. Vectorise directly using our novel vector instructions
Evaluate with many datasets
Five unique distributions
Twenty-two cardinalities
Speedups between 2.7x and 7.6x over scalar baseline
Future Vector Microprocessor Extensions for Data Aggregations
This work is an extension to our HPCA-21 article– VSR Sort: A Novel Vectorised Sorting Algorithm. (2015) Hayes et al.
I will skip many things due to time constraints
3
Presentation Contents
I. Motivation
II. What is Data Aggregation?
III. Experimental Setup
IV. Algorithms
1. Scalar Baseline
2. Polytable
3. Sorted Reduce
4. Monotable
5. Partially Sorted Monotable
6. Summary
V. Conclusions
Future Vector Microprocessor Extensions for Data Aggregations4
What is a Data Aggregation?
Frequently occurring operation found in
SQL GROUP BY queries
MapReduce
Statistical Languages
OLAP Cubes
Reduction of key-value pairs
Aggregation function, e.g.
SUM
MINIMUM
MAXIMUM
AVERAGE
Future Vector Microprocessor Extensions for Data Aggregations5
7 5 13 25 85 33 9 44
valu
es
0 1 3 2 2 0 3 1keys
40+
What is a Data Aggregation?
Future Vector Microprocessor Extensions for Data Aggregations6
49+ 110+ 22+
Presentation Contents
I. Motivation
II. What is Data Aggregation?
III. Experimental Setup
IV. Algorithms
1. Scalar Baseline
2. Polytable
3. Sorted Reduce
4. Monotable
5. Partially Sorted Monotable
6. Summary
V. Conclusions
Future Vector Microprocessor Extensions for Data Aggregations7
Query and Algorithms
Scalar baseline – no vector instructions
Regular DLP – Typical vector instructions
A. Polytable
B. Standard Sorted Reduce [not in presentation]
Irregular DLP – Novel vector instructions
A. Advanced Sorted Reduce
B. Monotable
C. Partially Sorted Monotable
Future Vector Microprocessor Extensions for Data Aggregations
SELECT key, COUNT(*), SUM(value)
FROM table GROUP BY key
8
Datasets: Five Distributions
1. Uniform
2. Sorted
3. Sequential
4. Heavy Hitter
5. Zipfian
Future Vector Microprocessor Extensions for Data Aggregations9
4 8 2 3 6 7 4 4 1 6 6 7 1 5 2 4 8 1 3 1 2 2 3 3 7 8 5 5 7 6 5 8
1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
3 8 2 3 6 7 3 4 1 3 6 3 3 5 3 4 8 3 3 1 3 2 3 3 7 3 3 5 7 3 5 8
6 8 4 3 2 1 2 2 3 4 8 7 6 1 2 7 3 1 2 5 4 1 5 1 3 1 1 1 1 2 1 5
Datasets: Cardinalities
Number of unique keys within dataset, e.g.
N = 10,000,000
C = 4 10,000,000
Grouped into four cardinality divisions1. Low cardinalities – many repeated keys
2. Low-normal cardinalities
3. High-normal cardinalities
4. High cardinalities – many unique keys
Future Vector Microprocessor Extensions for Data Aggregations10
N=8, C=11 1 1 1 1 1 1 1
N=8, C=21 1 2 1 2 2 2 1
N=8, C=42 1 1 3 4 2 4 1
N=8, C=83 2 8 5 4 6 1 7
Simulation Framework
Custom Simulation Framework
PTLsim – 32 nm Westmere microarchitecture
DRAMSim2 – DDR3-1333
Extended vector SIMD support
Heavily influenced from classical vector machines, e.g. CRAY-1
Emphasis on integer operations
16x vector registers with 64x 64bit elements
Pipelined functional units with 4x lockstepped parallel lanes
Masked operations
Indexed memory operations, i.e. gather/scatter
Integrated in out-of-order superscalar pipeline
Future Vector Microprocessor Extensions for Data Aggregations11
Presentation Contents
I. Motivation
II. What is Data Aggregation?
III. Experimental Setup
IV. Algorithms
1. Scalar Baseline
2. Polytable
3. Sorted Reduce
4. Monotable
5. Partially Sorted Monotable
6. Summary
V. Conclusions
Future Vector Microprocessor Extensions for Data Aggregations12
Scalar Baseline
Future Vector Microprocessor Extensions for Data Aggregations
1 5 5 3
27 19 43 31
0
0
0
0
0
keys
values
table
1
2
3
4
5
13
31
Scalar Baseline – Results
0
15
30
45
60
75
90
105
120
135
4 9
19 38 76
152
305
610
1,220
2,441
4,882
9,765
19,531
39,062
78,125
156,250
312,500
625,000
1,250,000
2,500,000
5,000,000
10,000,000
low low-normal high-normal high
cycl
es
pe
r tu
ple
uniform sorted sequential hhitter zipf
Future Vector Microprocessor Extensions for Data Aggregations14
Presentation Contents
I. Motivation
II. What is Data Aggregation?
III. Experimental Setup
IV. Algorithms
1. Scalar Baseline
2. Polytable
3. Sorted Reduce
4. Monotable
5. Partially Sorted Monotable
6. Summary
V. Conclusions
Future Vector Microprocessor Extensions for Data Aggregations15
Polytable
Future Vector Microprocessor Extensions for Data Aggregations
1 5 5 3
27 19 43 31
0
0
0
0
0
keys
values
table
1
2
3
4
5
16
process m key-values
gather-modify-scatter conflict
Polytable
Future Vector Microprocessor Extensions for Data Aggregations
0
0
0
0
0
m local tables
1
2
3
4
5
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
17
1 5 5 3
27 19 43 31
keys
values
✓31
43 19
27
Polytable – Results
0
15
30
45
60
75
90
105
120
135
4 9
19
38
76
152
305
610
1,220
2,441
4,882
9,765
19,531
39,062
78,125
156,250
312,500
625,000
1,250,000
2,500,000
5,000,000
10,000,000
low low-normal high-normal high
cycl
es
pe
r tu
ple
uniform sorted sequential hhitter zipf
scalar
0
15
30
45
60
75
90
105
120
135
4 9
19
38
76
152
305
610
1,220
2,441
4,882
9,765
19,531
39,062
78,125
156,250
312,500
625,000
1,250,000
2,500,000
5,000,000
10,000,000
low low-normal high-normal high
cycl
es
pe
r tu
ple
uniform sorted sequential hhitter zipf
polytable
Future Vector Microprocessor Extensions for Data Aggregations18
Presentation Contents
I. Motivation
II. What is Data Aggregation?
III. Experimental Setup
IV. Algorithms
1. Scalar Baseline
2. Polytable
3. Sorted Reduce
4. Monotable
5. Partially Sorted Monotable
6. Summary
V. Conclusions
Future Vector Microprocessor Extensions for Data Aggregations19
Sorted Reduce
Future Vector Microprocessor Extensions for Data Aggregations
5 1 5 3
27 19 43 31
keys
values
Our new sorting algorithm from HPCA-21
Based on vectorised radix sort
Uses novel vector SIMD instructions
Avoids gather-modify-scatter conflicts
Vector Prior Instances (VPI)
20
Sorted Reduce
Future Vector Microprocessor Extensions for Data Aggregations21
inp
ut
ou
tpu
t
2 2 3 1
0 1 0 0
least significant element
most significant element
5 1 5 3
27 19 43 31
keys
values
VPI
Sorted Reduce
Future Vector Microprocessor Extensions for Data Aggregations
sort
22
output
5 1 5 3
27 19 43 31
keys
values
1 + reduce
3 + reduce
5 + reduce
5 5 3 1
27 43 31 19
19
31
70
Sorted Reduce – Results
0
15
30
45
60
75
90
105
120
135
4 9
19
38
76
152
305
610
1,220
2,441
4,882
9,765
19,531
39,062
78,125
156,250
312,500
625,000
1,250,000
2,500,000
5,000,000
10,000,000
low low-normal high-normal high
cycl
es
pe
r tu
ple
uniform sorted sequential hhitter zipf
0
15
30
45
60
75
90
105
120
135
4 9
19
38
76
152
305
610
1,220
2,441
4,882
9,765
19,531
39,062
78,125
156,250
312,500
625,000
1,250,000
2,500,000
5,000,000
10,000,000
low low-normal high-normal high
cycl
es
pe
r tu
ple
uniform sorted sequential hhitter zipf
scalar advanced sorted reduce
Future Vector Microprocessor Extensions for Data Aggregations23
Presentation Contents
I. Motivation
II. What is Data Aggregation?
III. Experimental Setup
IV. Algorithms
1. Scalar Baseline
2. Polytable
3. Sorted Reduce
4. Monotable
5. Partially Sorted Monotable
6. Summary
V. Conclusions
Future Vector Microprocessor Extensions for Data Aggregations24
Monotable
The Polytable algorithm needs to replicate tables
Avoids gather-modify-scatter conflicts
Hurts performance
The Sorted Reduce algorithm uses VSR Sort
VSR Sort uses VPI to resolve gather-modify-scatter conflicts
Could VPI also be used to optimise Polytable?
VPI is not sufficient, but…
Hardware could be reused
Create similar-style but different instruction
Vector Group Aggregate: SUM (VGAsum)
Similar to VPI but uses second vector of values
Vectorise scalar baseline without transformations
Future Vector Microprocessor Extensions for Data Aggregations25
Monotable
Future Vector Microprocessor Extensions for Data Aggregations26
va
lue
ou
tpu
t
6 12 3 5ke
y
2 0 2 2
VGAsummost significant element
least significant element
6 12 9 14
Monotable – Results
0
15
30
45
60
75
90
105
120
135
4 9
19
38
76
152
305
610
1,220
2,441
4,882
9,765
19,531
39,062
78,125
156,250
312,500
625,000
1,250,000
2,500,000
5,000,000
10,000,000
low low-normal high-normal high
cycl
es
pe
r tu
ple
uniform sorted sequential hhitter zipf
0
15
30
45
60
75
90
105
120
135
4 9
19
38
76
152
305
610
1,220
2,441
4,882
9,765
19,531
39,062
78,125
156,250
312,500
625,000
1,250,000
2,500,000
5,000,000
10,000,000
low low-normal high-normal high
cycl
es
pe
r tu
ple
uniform sorted sequential hhitter zipf
scalar monotable
Future Vector Microprocessor Extensions for Data Aggregations27
Presentation Contents
I. Motivation
II. What is Data Aggregation?
III. Experimental Setup
IV. Algorithms
1. Scalar Baseline
2. Polytable
3. Sorted Reduce
4. Monotable
5. Partially Sorted Monotable
6. Summary
V. Conclusions
Future Vector Microprocessor Extensions for Data Aggregations28
Partially Sorted Monotable
Losing locality hurts performance
Fully sorting can have a high overhead
VSR Sort has O(k.n) complexity
If we reduce the ‘k’, we reduce the overhead
92,345
0001,0110,1000,1011,1001
Future Vector Microprocessor Extensions for Data Aggregations
Ignore LSBssort on MSBs (partition)
k bits
key
29
Partially Sorted Monotable – Results
0
15
30
45
60
75
90
105
120
135
4 9
19
38
76
152
305
610
1,220
2,441
4,882
9,765
19,531
39,062
78,125
156,250
312,500
625,000
1,250,000
2,500,000
5,000,000
10,000,000
low low-normal high-normal high
cycl
es
pe
r tu
ple
uniform sorted sequential hhitter zipf
0
15
30
45
60
75
90
105
120
135
4 9
19
38
76
152
305
610
1,220
2,441
4,882
9,765
19,531
39,062
78,125
156,250
312,500
625,000
1,250,000
2,500,000
5,000,000
10,000,000
low low-normal high-normal high
cycl
es
pe
r tu
ple
uniform sorted sequential hhitter zipf
scalar partially sorted monotable
Future Vector Microprocessor Extensions for Data Aggregations30
Presentation Contents
I. Motivation
II. What is Data Aggregation?
III. Experimental Setup
IV. Algorithms
1. Scalar Baseline
2. Polytable
3. Sorted Reduce
4. Monotable
5. Partially Sorted Monotable
6. Summary
V. Conclusions
Future Vector Microprocessor Extensions for Data Aggregations31
Summary – Best Speedups Overall
1
2
3
4
5
6
7
8
low low-normal high-normal high
aver
age
spee
du
p o
ver
scal
ar
cardinality
uniform sorted sequential hhitter zipf
Future Vector Microprocessor Extensions for Data Aggregations
2.7x
7.6x
32
Summary – Best Speedups Overall
1
2
3
4
5
6
7
8
low low-normal high-normal high
aver
age
spee
du
p o
ver
scal
ar
cardinality
uniform sorted sequential hhitter zipfm
on
o
po
ly
mo
no
mo
no
mo
no
mo
no
po
ly
mo
no
mo
no
mo
no
ps-
mo
no
sorte
d r
ed
uce
mo
no
ps-
mo
no
ps-
mo
no
ps-
mo
no
mo
no
mo
no
ps-
mo
no
ps-
mo
no
Future Vector Microprocessor Extensions for Data Aggregations
2.7x
7.6x
33
Presentation Contents
I. Motivation
II. What is Data Aggregation?
III. Experimental Setup
IV. Algorithms
1. Scalar Baseline
2. Polytable
3. Sorted Reduce
4. Monotable
5. Partially Sorted Monotable
6. Summary
V. Conclusions
Future Vector Microprocessor Extensions for Data Aggregations34
Conclusions
Aggregating data quickly is important
DLP & SIMD is an attractive way to accelerate it
Aggregation algorithms are simple but DLP is irregular
We proposed various algorithms
A. Use transformations and typical vector SIMD instructions
B. Avoid transformations using our novel vector instructions
Evaluated using many data distributions and cardinalities
Speedups between 2.7x and 7.6x over scalar baseline
Best solution is dependent on input characteristics
Future Vector Microprocessor Extensions for Data Aggregations35