
Page 1: T4: Compiling Sequential Code for Effective Speculative

T4: Compiling Sequential Code for Effective Speculative Parallelization in Hardware

Victor A. Ying, Mark C. Jeffrey*, Daniel Sanchez

ISCA 2020

*University of Toronto starting Fall 2020

Page 2: T4: Compiling Sequential Code for Effective Speculative

Parallelization: Gap between programmers and hardware

Multicores are everywhere
Programmers still write sequential code


Speculative parallelization: new architectures and compilers to parallelize sequential code without knowing what is safe to run in parallel

Intel Skylake-SP (2017): 28 cores per die


Page 3: T4: Compiling Sequential Code for Effective Speculative

T4: Trees of Tiny Timestamped Tasks

Our T4 compiler exploits recently proposed hardware features:
◦ Timestamps encode order, letting tasks spawn out of order
◦ Trees unfold branches in parallel for high-throughput spawn
◦ Compiler optimizations make task spawn efficient
◦ Efficient parallel spawns allow for tiny tasks (10s of instructions)
» Tiny tasks create opportunities to reduce communication and improve locality

We target hard-to-parallelize C/C++ benchmarks from SPEC CPU2006
◦ Modest overheads (gmean 31% on 1 core)
◦ Speedups up to 49× on 64 cores


swarm.csail.mit.edu

Page 4: T4: Compiling Sequential Code for Effective Speculative


Background

T4 Principles in Action

T4: Parallelizing Entire Programs

Evaluation


Page 5: T4: Compiling Sequential Code for Effective Speculative

Thread-Level Speculation (TLS) [Multiscalar ('92-'98), Hydra ('94-'05), Superthreaded ('96), Atlas ('99), Krishnan et al. ('98-'01), STAMPede ('98-'08), Cintra et al. ('00, '02), IMT ('03), TCC ('04), POSH ('06), Bulk ('06), Luo et al. ('09-'13), RASP ('11), MTX ('10-'20), and many others]

◦ Divide program into tasks (e.g., loop iterations or function calls)

◦ Speculatively execute tasks in parallel

◦ Detect dependencies at runtime and recover

Prior TLS systems did not scale many real-world programs beyond a few cores due to

◦ Expensive aborts

◦ Serial bottlenecks in task spawns or commits


Page 6: T4: Compiling Sequential Code for Effective Speculative

TLS creates chains of tasks

Example: maximal independent set
◦ Iterates through vertices in graph

One task per outer-loop iteration
◦ Each task spawns the next

◦ Hardware tries to run tasks in parallel

Hardware tracks memory accesses to discover data dependences


[Figure: TLS spawns a chain of tasks A, B, C, D, E, F, ... one after another over time]

for (int v = 0; v < numVertices; v++) {
  if (state[v] == UNVISITED) {
    state[v] = INCLUDED;
    for (int nbr : neighbors(v))
      state[nbr] = EXCLUDED;
  }
}

[Figure annotation: rd/wr arrows mark the indirect memory accesses that create data dependences between tasks]


Page 7: T4: Compiling Sequential Code for Effective Speculative

Task chains incur costly misspeculation recovery

Tasks abort if they violate a data dependence

Tasks that abort must roll back their effects, including successors they spawned or forwarded data to


[Figure: the task chain A-F; a rd/wr dependence violation forces tasks D, E, F and their successors to abort and re-execute as D′, E′, F′]

for (int v = 0; v < numVertices; v++) {
  if (state[v] == UNVISITED) {
    state[v] = INCLUDED;
    for (int nbr : neighbors(v))
      state[nbr] = EXCLUDED;
  }
}

Unselective aborts waste a lot of work


Page 8: T4: Compiling Sequential Code for Effective Speculative

Swarm architecture [Jeffrey et al. MICRO'15, MICRO'16, MICRO'18; Subramanian et al. ISCA'17]

Execution model:
◦ Program comprises timestamped tasks
◦ Tasks spawn children with greater or equal timestamp
◦ Tasks appear to run sequentially, in timestamp order

Detects order violations and selectively aborts dependent tasks

Distributed task units queue, dispatch, and commit multiple tasks per cycle
◦ <2% area overhead
◦ Runs hundreds of tiny speculative tasks
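To make this execution model concrete, here is a minimal sketch of the earlier maximal-independent-set loop written as timestamped tasks. The swarm::spawn primitive, the visit function, and the data declarations are illustrative assumptions, not Swarm's or T4's actual interface:

#include <cstdint>
#include <vector>

// Hypothetical primitive: run fn as a speculative task ordered by timestamp ts.
// Tasks appear to execute in timestamp order; conflicts cause selective aborts.
namespace swarm {
  template <typename Fn> void spawn(uint64_t ts, Fn fn);
}

enum VertexState { UNVISITED, INCLUDED, EXCLUDED };
std::vector<VertexState> state;           // per-vertex state
std::vector<std::vector<int>> adj;        // adjacency lists
int numVertices;

// One outer-loop iteration as a timestamped task that spawns its successor.
void visit(uint64_t ts, int v) {
  if (v >= numVertices) return;
  if (state[v] == UNVISITED) {
    state[v] = INCLUDED;
    for (int nbr : adj[v]) state[nbr] = EXCLUDED;
  }
  swarm::spawn(ts + 1, [=] { visit(ts + 1, v + 1); });  // child gets a greater timestamp
}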


Page 9: T4: Compiling Sequential Code for Effective Speculative


Background

T4 Principles in Action

T4: Parallelizing Entire Programs

Evaluation

Page 10: T4: Compiling Sequential Code for Effective Speculative

T4’s decoupled spawn enables selective aborts

T4 compiles sequential C/C++ to exploit parallelism on Swarm

Put most work into worker tasks at the leaves of the task tree
◦ Use Swarm's mechanisms for cheap selective aborts


for (int v = 0; v < numVertices; v++) {
  if (state[v] == UNVISITED) {
    state[v] = INCLUDED;
    for (int nbr : neighbors(v))
      state[nbr] = EXCLUDED;
  }
}

[Figure: spawner tasks form the upper levels of the task tree and worker tasks A-F sit at the leaves; a rd/wr conflict aborts only worker D, which re-executes as D′]


Page 11: T4: Compiling Sequential Code for Effective Speculative

Tiny tasks make aborts cheap

Isolate contentious memory accesses into tiny tasks, to limit the damage when they abort


Parallelize outer loop only vs. parallelize both loops:

for (int v = 0; v < numVertices; v++) {
  if (state[v] == UNVISITED) {
    state[v] = INCLUDED;
    for (int nbr : neighbors(v))
      state[nbr] = EXCLUDED;
  }
}

[Figure: task timelines for the two options; with both loops parallelized, each state[nbr] write runs in its own tiny task, so an abort wastes only a few instructions]

Tiny tasks (a few instructions) are difficult to spawn effectively
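To make "parallelize both loops" concrete, here is a sketch (reusing the hypothetical swarm::spawn primitive and data declarations from the earlier sketch; not T4's generated code) in which each state[nbr] update becomes its own tiny worker task; the next slide addresses how to spawn so many tiny tasks efficiently:

#include <cstdint>
#include <vector>
namespace swarm { template <typename Fn> void spawn(uint64_t ts, Fn fn); }  // hypothetical
enum VertexState { UNVISITED, INCLUDED, EXCLUDED };
extern std::vector<VertexState> state;
extern std::vector<std::vector<int>> adj;

// Outer-loop iteration for vertex v, split so each contentious write is
// isolated in a tiny task of its own.
void visitVertex(uint64_t ts, int v) {
  if (state[v] == UNVISITED) {
    state[v] = INCLUDED;
    for (int nbr : adj[v]) {
      // Tiny worker: a single write to state[nbr]. If it conflicts with
      // another task, only these few instructions are aborted and re-run.
      swarm::spawn(ts, [=] { state[nbr] = EXCLUDED; });
    }
  }
}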

Page 12: T4: Compiling Sequential Code for Effective Speculative

Tiny tasks (a few instructions) are difficult to spawn effectively

T4’s balanced task trees enable scalability

Spawners recursively divide the range of iterations


for (int v = 0; v < numVertices; v++) {
  if (state[v] == UNVISITED) {
    state[v] = INCLUDED;
    for (int nbr : neighbors(v))
      state[nbr] = EXCLUDED;
  }
}


[Figure: a balanced tree in which spawner tasks at the upper levels recursively spawn more spawners, with worker tasks at the leaves]

Balanced spawner trees reduce critical path length to O(log(tripcount))
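A minimal sketch of such a recursive range split, again using the hypothetical swarm::spawn primitive and the visitVertex worker from the earlier sketches; this only illustrates the structure, not the code T4 emits:

#include <cstdint>
namespace swarm { template <typename Fn> void spawn(uint64_t ts, Fn fn); }  // hypothetical
void visitVertex(uint64_t ts, int v);   // worker from the earlier sketch
extern int numVertices;

// Spawner covering outer-loop iterations [lo, hi); its timestamp is lo.
// Each worker's timestamp is its iteration index, preserving sequential order.
void spawnRange(int lo, int hi) {
  const int GRAIN = 2;                              // iterations per leaf spawner
  if (hi - lo <= GRAIN) {
    for (int v = lo; v < hi; v++)
      swarm::spawn(v, [=] { visitVertex(v, v); });  // worker task
    return;
  }
  int mid = lo + (hi - lo) / 2;
  swarm::spawn(lo,  [=] { spawnRange(lo, mid); });  // left child spawner
  swarm::spawn(mid, [=] { spawnRange(mid, hi); });  // right child spawner
}

// spawnRange(0, numVertices) unfolds a balanced tree whose depth, and hence
// spawn critical path, is O(log numVertices).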

Page 13: T4: Compiling Sequential Code for Effective Speculative


Background

T4 Principles in Action

T4: Parallelizing Entire Programs

Evaluation

Page 14: T4: Compiling Sequential Code for Effective Speculative

T4: Parallelizing entire real-world programs

T4 divides the entire program into tasks, starting from the first instruction of main()

T4 automatically generates tasks from
◦ Loop iterations
◦ Function calls
◦ Continuations of the above

T4 extracts nested parallelism from the entire program despite
◦ Loops with unknown tripcount
◦ Opaque function calls
◦ Data-dependent control flow
◦ Arbitrary pointer manipulation


Page 15: T4: Compiling Sequential Code for Effective Speculative

Progressive expansion of unknown-tripcount loops

Progressive expansion generates balanced spawner trees for loops with unknown tripcount

- loops with break statements

- while loops


Source code:

int i = 0;
while (status[i]) {
  if (foo(i)) break;
  i++;
}

Task for each iteration:

void iter(Timestamp i) {
  if (!done) {
    if (!status[i]) done = 1;
    else if (foo(i)) done = 1;
  }
}

[Figure: the spawner tree progressively expands, with spawner tasks at timestamps 0, 2, 4, 6, 8, 10, 12 and worker tasks iter(0) through iter(13) at the leaves]
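One plausible shape for progressive expansion, sketched only as an illustration (not T4's exact transformation, and using the same assumed swarm::spawn primitive as before): a spawner covers its own range of iterations and, unless the loop has already exited, optimistically spawns the next spawner over a range twice as large, so the spawn critical path stays logarithmic in the number of iterations actually executed.

#include <cstdint>
namespace swarm { template <typename Fn> void spawn(uint64_t ts, Fn fn); }  // hypothetical
using Timestamp = uint64_t;      // as in the iter() task above
extern int done;                 // shared flag set by iter() when the loop exits
void iter(Timestamp i);          // per-iteration task from the slide above

// Spawner covering the speculative range [base, base + span).
void expand(Timestamp base, Timestamp span) {
  for (Timestamp i = base; i < base + span; i++)
    swarm::spawn(i, [=] { iter(i); });        // (a balanced subtree in practice)
  if (!done)                                  // iterations spawned past the loop
    swarm::spawn(base + span,                 // exit become no-ops via the done flag
                 [=] { expand(base + span, 2 * span); });
}

// expand(0, 2) progressively covers iterations 0-1, then 2-5, then 6-13, ...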

Page 16: T4: Compiling Sequential Code for Effective Speculative

Continuation-passing style eliminates the call stack

Problem: Independent function spawns serialize on stack-frame allocation

Solution:
◦ When needed, T4 allocates continuation closures on the heap instead
◦ T4 optimizations ensure most tasks don't need memory allocation
◦ These software techniques could apply to any TLS system


[Figure: calls to f and g from different iterations run as separate tasks]

for (int i = 0; i < N; i++) {
  float x = f();
  if (x > 0.0) g(x);
}
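A rough sketch of the idea, with hypothetical names and the same assumed swarm::spawn primitive (not T4's generated code): the call to f() and its continuation become separate tasks, and the continuation's live value x travels in a small heap-allocated closure rather than a stack frame, so independent iterations do not serialize on a shared stack:

#include <cstdint>
namespace swarm { template <typename Fn> void spawn(uint64_t ts, Fn fn); }  // hypothetical

float f();
void g(float x);

struct Cont { uint64_t ts; float x; };   // continuation state, allocated on the heap

// Task for the call to f(); it spawns its continuation as a separate task.
void callF(uint64_t ts) {
  float x = f();
  Cont* k = new Cont{ts + 1, x};         // closure on the heap, not a stack frame
  swarm::spawn(k->ts, [=] {              // continuation: the rest of the loop body
    if (k->x > 0.0f) g(k->x);
    delete k;
  });
}

// Each iteration becomes a callF task (spawned via a spawner tree in practice),
// using two timestamps per iteration: one for the call, one for its continuation,
// so the sequential order f(i), g(i), f(i+1), ... is preserved.
void spawnIteration(int i) {
  swarm::spawn(2 * (uint64_t)i, [=] { callF(2 * (uint64_t)i); });
}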

Page 17: T4: Compiling Sequential Code for Effective Speculative

Spatial-hint generation for locality

Tiny tasks may access only one memory location, which is known when the task is spawned.

Hardware uses these spatial hints to improve locality:
◦ Map each address to a tile.
◦ Send tasks for that address to that tile.

[Figure: a tiled multicore with memory controller/IO blocks at the edges; a task's hint address (e.g., 0xE6823) selects the tile it is sent to]
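As a sketch (the hint-taking spawn overload below is an assumption for illustration, not the actual interface): the tiny worker writes exactly one element of state[], so the address of that element can be computed at spawn time and passed along as the spatial hint:

#include <cstdint>

// Hypothetical overload: like spawn, but with a spatial hint that hardware
// maps to a tile; the task is queued and run at that tile.
namespace swarm {
  template <typename Fn> void spawn(uint64_t ts, uint64_t hint, Fn fn);
}

enum VertexState { UNVISITED, INCLUDED, EXCLUDED };
extern VertexState state[];

// Spawn the tiny "exclude neighbor" worker with a hint derived from the one
// address it will write, so tasks touching the same data go to the same tile.
void spawnExclude(uint64_t ts, int nbr) {
  uint64_t hint = reinterpret_cast<uintptr_t>(&state[nbr]);
  swarm::spawn(ts, hint, [=] { state[nbr] = EXCLUDED; });
}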



Page 18: T4: Compiling Sequential Code for Effective Speculative

Manual annotations for task splitting

Programmer may add task boundaries for tiny tasks

Guaranteed to have no effect on program output

Added <0.1% to source code


Page 19: T4: Compiling Sequential Code for Effective Speculative

T4 implementation in LLVM/Clang

Intraprocedural passes: small compile times (linear in code size)

Use all standard LLVM optimizations to generate high-quality code

More in the paper:
◦ Topological sorting to generate timestamps
◦ Bundling stack allocations to the heap with privatization
◦ Loop task coarsening to reduce false sharing of cache lines
◦ Case studies and sensitivity studies


[Figure: compilation pipeline: C/C++ source code → Clang frontend → T4 parallelization passes → LLVM backend optimizations (e.g., -O3) → x86_64 code generation → object file]

Page 20: T4: Compiling Sequential Code for Effective Speculative


Background

T4 Principles in Action

T4: Parallelizing Entire Programs

Evaluation

Page 21: T4: Compiling Sequential Code for Effective Speculative

Methodology

1-, 4-, 16-, and 64-core systems

C/C++ benchmarks from SPEC CPU2006

All speedups normalized to serial code compiled with clang -O3


4-wide superscalar out-of-order cores (Haswell-like)

32 KB L1 caches
1 MB L2 cache per 4-core tile
4 MB L3 slice per 4-core tile

256 entries 64 entries/tile (1024 tasks for 64-core chip)

Four 4×4 mesh networks

Page 22: T4: Compiling Sequential Code for Effective Speculative

T4 scales to tens of cores


Hot loops have some independent iterations

Hot loops have serializing variables updated every iteration


Page 23: T4: Compiling Sequential Code for Effective Speculative

T4 overheads are moderate


[Figure: 1-core run time of serial code compiled with -O3 vs. code parallelized with T4, per benchmark]

Task-spawn overheads are 31% (geometric mean)

Page 24: T4: Compiling Sequential Code for Effective Speculative

Parallelization redoubles performance


Cores spend most time executing useful work, not aborting

[Figure: per-benchmark parallel speedup]

Page 25: T4: Compiling Sequential Code for Effective Speculative

Parallelization redoubles performance


T4 scales many programs to tens of cores

[Figure: per-benchmark parallel speedup]

Page 26: T4: Compiling Sequential Code for Effective Speculative

Contributions

The T4 compiler provides the parallelization needed to let sequential programmers use multicores

T4 broadens the applications for which speculative parallelization is effective by exploiting the recent Swarm architecture
◦ Parallelization of sequential C/C++ yields speedups of up to 49× on 64 cores

New code transformations:
◦ Decoupled spawners enable cheap selective aborts of tiny tasks
◦ Progressive expansion: balanced task trees for unknown-tripcount loops
◦ Stack elimination and loop task coarsening reduce false sharing
◦ Task spawn optimizations


Page 27: T4: Compiling Sequential Code for Effective Speculative

Questions?

T4 is open-source and available for you to build on:

Join the online Q&A @ ISCA: first paper in Session 2B on June 1, 2020
9am in Los Angeles, noon in New York, 6pm in Brussels, midnight in Beijing


swarm.csail.mit.edu