Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting

Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars

International Symposium on Microarchitecture (MICRO), 2016

October 18, 2016

Rampant Dynamism in Datacenters

Datacenters

Co-running of applications

Microarchitectural flexibility

Platform diversity

Dynamism affects the runtime availability of resources

Dynamism - Dynamic factors that affect application runtime environments

Static Compiler Optimizations

Compilation assumptions might not be met at runtime

Resource dependent static optimizations do not react to dynamism

Loop Tiling

Restructures memory access pattern to utilize data reuse

Conceptualized before multicore era, presenting little dynamism

Figure: Static vs. Ideal tiling in four scenarios: Normal, Co-running application, Partitioned cache, Different architecture
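To make the loop tiling idea above concrete, here is a minimal sketch of a 3D-tiled matrix multiply next to its untiled form. The function names and the tile parameters `ti`, `tj`, `tk` are illustrative assumptions, not code from the talk; the point is only that the tile shape is a free parameter that a static compiler must fix ahead of time.

```python
def matmul_naive(a, b):
    """Reference triple loop: c = a * b for list-of-lists matrices."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, ti, tj, tk):
    """Same computation, iterated tile by tile.

    (ti, tj, tk) is the tile shape (hypothetical parameter names). Each
    ti x tj x tk block touches a small working set, so data loaded into
    the cache is reused before it is evicted.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, ti):          # loops over tiles
        for jj in range(0, p, tj):
            for kk in range(0, m, tk):
                # loops inside one tile; min() handles partial edge tiles
                for i in range(ii, min(ii + ti, n)):
                    for j in range(jj, min(jj + tj, p)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tk, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Both functions compute the same result; only the traversal order changes. Which `(ti, tj, tk)` runs fastest depends on how much cache the loop actually has at runtime, which is the quantity that co-runners, cache partitioning and platform changes alter.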

Co-runner Tiling Comparison

Dynamism requires rethinking cache tiling

Figure: panels comparing Static vs Dynamic tiling

Design Objectives

Dynamic – Should react to changes in runtime environment

High accuracy – Should identify a high-performance tiling strategy

Low overhead – Should have low dynamic performance overhead

Current techniques are not enough

White-box approaches

Math kernel libraries like Intel MKL, ATLAS

Online generation of a black-box model

Table: White-box approach and BLAS libraries compared on three criteria: Dynamic, Accuracy, Low-overhead

ShapeShifter

Key Components

Dynamic tile generation

Detect tiling opportunities

Find a high-performance tile

Figure: ShapeShifter system diagram. Labels: Application 1, Application 2, Companion 1, Companion 2, Dynamic compiler, Code cache, Tiled loop, REM, Tile selector, Companion controller

Companion thread (Protean Code + Polly)

Runtime Environment Monitor (REM)

Tile Selector

Protean Code, MICRO 2014 and Polly, PLDI 2008

Overview

Online training – select tile size and generate training data

Tile selection – generate black-box model and select suitable tile shape

Monitored execution – detect tiling opportunities

Figure: Overview diagram. Phases: Online training, Tile selection, Monitored execution. Steps: Find tile size, Training set, Collect cache stats, Tile performance model, Choose tile. Trigger: Runtime environment change. Components: Tile selector, REM, Dynamic compiler
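The three phases above can be read as a small state machine. The sketch below is one possible reading, not the authors' implementation: it assumes that a runtime environment change seen during monitored execution sends the system back to online training, and the miss-rate check standing in for the REM (`environment_changed`, with its `threshold`) is entirely hypothetical.

```python
from enum import Enum


class Phase(Enum):
    ONLINE_TRAINING = "online training"          # pick tile size, gather samples
    TILE_SELECTION = "tile selection"            # build model, choose tile shape
    MONITORED_EXECUTION = "monitored execution"  # run and watch for changes


def environment_changed(baseline_miss_rate, current_miss_rate, threshold=0.2):
    """Hypothetical REM check: report a relative shift in cache miss rate.

    The metric and the 20% threshold are assumptions for illustration.
    """
    return abs(current_miss_rate - baseline_miss_rate) > threshold * baseline_miss_rate


def next_phase(phase, changed=False):
    """Advance the phase cycle; `changed` only matters while monitoring."""
    if phase is Phase.ONLINE_TRAINING:
        return Phase.TILE_SELECTION
    if phase is Phase.TILE_SELECTION:
        return Phase.MONITORED_EXECUTION
    # Monitored execution: stay put until the environment shifts (assumed).
    return Phase.ONLINE_TRAINING if changed else Phase.MONITORED_EXECUTION
```

A companion thread could call `next_phase` once per monitoring interval, passing in the result of the environment check, so that the application itself never blocks on the decision.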

Tile Selection

Black-box model is generated online

Uses tile parameters and IPC from tile data

Model is specific to application and its current runtime environment

Predicts a tile suitable to current runtime environment

Figure: Tile selection flow. Inputs: Tile parameters and IPC (training data). Model: Black-box model. Outputs: IPCpred for the set of tile shapes of predicted size, IPCmax, T_shapeshifter
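As a rough illustration of the selection step, the sketch below fits nothing more than an inverse-distance-weighted nearest-neighbour estimate over the training samples and picks the candidate shape with the highest predicted IPC. The slides do not say which model family ShapeShifter uses, so this regressor is a stand-in; the function names, the candidate dimensions and the `k` parameter are all assumptions.

```python
import math


def predict_ipc(training, shape, k=3):
    """Estimate IPC for `shape` from the k nearest training samples.

    training: list of ((ti, tj, tk), measured_ipc) pairs.
    Distance is taken in log2 space so that doubling any tile
    dimension counts as one step. This model is a stand-in.
    """
    def dist(s, t):
        return math.sqrt(sum((math.log2(x) - math.log2(y)) ** 2
                             for x, y in zip(s, t)))

    nearest = sorted(training, key=lambda sample: dist(sample[0], shape))[:k]
    num = den = 0.0
    for sample_shape, ipc in nearest:
        d = dist(sample_shape, shape)
        if d == 0.0:
            return ipc  # exact match: reuse the measurement
        num += ipc / d
        den += 1.0 / d
    return num / den


def shapes_of_size(size, dims=(8, 16, 32, 64, 128)):
    """All (ti, tj, tk) built from `dims` whose product equals `size`."""
    return [(i, j, k) for i in dims for j in dims for k in dims
            if i * j * k == size]


def choose_tile(training, size):
    """Return the candidate shape with the maximum predicted IPC."""
    return max(shapes_of_size(size), key=lambda s: predict_ipc(training, s))
```

Because the training samples are measured in the current environment, the same code yields a different choice when a co-runner arrives or the cache allocation changes; no explicit cache model is needed.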

Insight for Co-optimization

Challenging to retile multiple applications simultaneously

Tile shape and tile size contribute differently to cache interference

Co-optimization – Find tile size for apps and then tile shape one-by-one
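The size-then-shape ordering can be sketched as a two-pass search. This is a simplified reading of the slide, not the paper's algorithm: `measure` is a hypothetical callback that would run all applications together with the given tiles and return an aggregate score such as combined IPC, and `shapes_for` is a hypothetical generator of candidate shapes for a size.

```python
def co_optimize(apps, sizes, shapes_for, measure):
    """Two-pass co-optimization sketch.

    apps:       list of application names
    sizes:      candidate tile sizes (non-empty)
    shapes_for: size -> non-empty list of candidate shapes (hypothetical)
    measure:    {app: shape} -> score, higher is better (hypothetical)
    """
    # Start every app from the first shape of the first candidate size.
    config = {app: shapes_for(sizes[0])[0] for app in apps}
    chosen_size = {}

    # Pass 1: settle a tile size per app, using a default shape per size.
    for app in apps:
        def score_size(size):
            trial = dict(config)
            trial[app] = shapes_for(size)[0]
            return measure(trial)
        chosen_size[app] = max(sizes, key=score_size)
        config[app] = shapes_for(chosen_size[app])[0]

    # Pass 2: with sizes fixed, refine the shape one app at a time.
    for app in apps:
        def score_shape(shape):
            trial = dict(config)
            trial[app] = shape
            return measure(trial)
        config[app] = max(shapes_for(chosen_size[app]), key=score_shape)

    return config
```

Splitting the search this way keeps the number of trial configurations linear in the number of applications, instead of exploring every combination of shapes across all co-runners at once.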

Experimental Evaluation

Methodology

Polybench application suite

Three sources of dynamism

Co-running applications

Microarchitectural flexibility – cache partitioning

Platform diversity

Three platforms

AMD Bulldozer

Intel Haswell

Intel Atom

Tiling is performed in the shared cache

Co-runner

Arrival/departure of a co-runner

Static Best – best tile with no co-runner

Co-runner change

syr2k to correlation

Change in cache allocations

Microarchitectural Flexibility

Microarchitectural flexibility – cache partitioning

Static Best – best tile with no cache resizing (16-way enabled)

Platform Diversity

Platform diversity – Intel Atom, Intel Haswell and AMD Bulldozer

Static Best – best tile on AMD Bulldozer

Conclusions

ShapeShifter – an end-to-end dynamic loop co-optimization system

Adapt tiling strategy to the application runtime environment

Loop co-optimization – tiling multiple applications on the fly

Novel black-box modelling approach – fast and accurate

ShapeShifter achieves significant performance improvements across different sources of dynamism

Q/A

Why does the black-box model work?

There is a trade-off between the best tiling strategy and performance

We show that ShapeShifter chooses a tile close to the best one

Why 3D tiling?

Built on Polly, but the technique is not restricted to 3D tiling

Also memorize the compilation times

Two reasons for slowdown – tile doesn’t matter, or the black-box model is not good enough

Remember cache sizes

Prior work – refresh

Overhead – Companion thread

Three sources of overhead

Dynamic Compilation – 136 ms on Intel Haswell, 430 ms on AMD Bulldozer

Code redirection

Training

Overhead – training

Black-box model

Multiple high-performance tiles

ShapeShifter chooses one of the high-performance tiles

ShapeShifter vs Dynamic Oracle

ShapeShifter achieves 93% of the dynamic oracle performance on average

Co-runner
