fix the code. don’t tweak the hardware: a new compiler approach to voltage - frequency scaling

68
Uppsala University Department of Information Technology Fix the code. Don’t tweak the hardware: A new compiler approach to Voltage - Frequency scaling Sweden Alexandra Jimborean Konstantinos Koukos Vasileios Spiliopoulos

Upload: thetis

Post on 23-Feb-2016

32 views

Category:

Documents


0 download

DESCRIPTION

Sweden Alexandra Jimborean Konstantinos Koukos Vasileios Spiliopoulos David Black-Schaffer Stefanos Kaxiras. Fix the code. Don’t tweak the hardware: A new compiler approach to Voltage - Frequency scaling. Goal. Keep PERFORMANCE!. Motivation & Background. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Uppsala UniversityDepartment of Information Technology

Fix the code. Don’t tweak the hardware: A new compiler approach

to Voltage - Frequency scaling

Sweden

Alexandra Jimborean Konstantinos Koukos Vasileios Spiliopoulos David Black-Schaffer

Stefanos Kaxiras

Page 2: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Keep PERFORMANCE!Goal

Page 3: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Motivation & Background

DFVS power management … Quadratic power improvement (f, V2)

at most linear performance slowdown (f ) Dennard scaling problem

DVFS is dead! Long live DVFS!

Must make DVFS much more efficient Optimize the software to match the hardware

P≈Ceff f V2

DVFS: Dynamic Voltage Frequency ScalingBa

ckgr

ound

Automatically tune the code at compile time!

Page 4: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Max f - Coupled ExecutionCoupled Execution

fmax fmax fmax fmax fmax fmax fmax

Koukos et.al. Decoupled Access-Execute –ICS 2013-

Back

grou

nd Memory bound

Compute bound

Energy waste

Optimal f - Coupled Executionfop

t

fopt fopt fopt fopt fopt fopt

Performance loss

Page 5: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

fmax fmax fmax

Ideal DVFS (fmin - fmax) - Coupled Execution

Decoupled Execution (DAE)

Decoupled Execution

• Decouple the compute (execute) and memory (access)– Access: prefetches the data to the L1 cache– Execute: using the data from the L1 cache

• Access at low frequency and Execute at high frequency

fmax fmin fmax fmin fmax fmin fmax

fmin fmax

Access (prefetch) Execute (compute)

Koukos et.al. Decoupled Access-Execute –ICS 2013-

Back

grou

nd

Page 6: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Code generation and optimization for DVFS

Cont

ribut

ions

Automatically generated access phase by the compiler !

Page 7: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

How do we implement DAE? Input: Task-based parallel workloads Why?

Schedule tasks independently Control task size Easy to DAE automatically!

LLVM IR code transformations

Output: Split each task Access phase prefetch data Execute phase compute

Cont

ribut

ions

Page 8: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

fHigh

fLow

How do we implement DAE? Size the tasks to fit in the private cache (e.g.,L1)

Access phase: prefetches the data Remove arithmetic computation Keep (replicate) address calculation Turn loads into prefetches Remove stores to globals (no side effects)

Execute phase: original (unmodified) task Scheduled to run on same core right after Access

DVFS Access and Execute independently

Cont

ribut

ions

Save energy

Compute fast

Page 9: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

DAE CompilationAccess phase - created automatically by the compiler must: Be lean (not too many instructions) Prefetch the right amount of data

Derive the access patterns from the execute code and optimize their prefetch

1. Affine polyhedral model

2. Non-affine (complex control flow codes) skeleton with optimizations

Cont

ribut

ions

Page 10: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

1. Affine codes: Polyhedral model

for ( i = 0; i < N; i++) for ( j = 0; j < N; j++) prefetch A[i][j]

for ( i = 0; i < N; i++) for ( j = i+1; j < N; j++){

A[j][i]2 /= A[i][i]1; for ( k = i+1; k < N; k++)

A[j][k]3 -= A[j][i]2 * A[i]

[k]4; }

Cont

ribut

ions

Efficient prefetching!

Page 11: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Polyhedral algorithm

1. Compute linear functions for load instructions2. Compute the set of accessed memory locations per

instruction3. Compute the convex union of these sets

Cont

ribut

ions

Page 12: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Polyhedral algorithm

1. Compute linear functions for load instructions2. Compute the set of accessed memory locations per

instruction

3. Compute the convex union of these sets4. Generate the loop nest of minimum depth prefetching

them

Restricted to affine codes.

Highly efficient

prefetching!~30% of loops inour benchmarksCo

ntrib

utio

ns

Page 13: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

2. Non-affine codes, complex CFG

1. Clone the original task

2. Keep only instructions involved in: Memory address computation CFG

3. Replace load with prefetch instructions Discard store instructions to globals

4. Apply optimizations lightweight access code

Cont

ribut

ions

Page 14: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Optimizations: CFGExecute : for ( i = 0; i < N; i++) for ( j = i+1; j < N; j++){

if (cond)

A[j][i] += A[i][j];

A[i][j] -= N; }

Access : for ( i = 0; i < N; i++) for ( j = i+1; j < N; j++)

prefetch A[i][j];

Execute : for ( i = 0; i < N; i++) for ( j = i+1; j < N; j++){

if (cond)

A[j][i] += A[i][j];

A[i][j] -= N; }

Cont

ribut

ions

Improvements: profile & prefetch

the hot path.

Problem: complex CFG heavy access

phase.

Solution: only prefetch

addresses guaranteed to be

accessed.

Page 15: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Optimizations: address computationExecute : for ( i = 0; i < N; i++)

A[B[i]] -= ...;

Access : for ( i = 0; i < N; i++)

prefetch B[i];

Access : for ( i = 0; i < N; i++){

prefetch B[i];

x = load B[i]; prefetch A[x];}Co

ntrib

utio

ns

Improvements: profile address comp. overhead vs. prefetching

benefits.

Problem: address computation

requires memory accesses.

Solution: access memory and

prefetch only the last indirection.

Page 16: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Contributions in a nutshell

2 methods to generate the access phase:1. Polyhedral – affine codes

Analyze access pattern Generate minimal-depth loop

2. Optimized skeleton – non-affine codes Create clone Add prefetch, remove stores Optimize lightweight access version

Cont

ribut

ions

Page 17: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Methodology

Met

hodo

logy

Page 18: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Overcoming HW limitations Current hardware does not support:

Per-core DVFS Low-overhead DVFS But getting there …

Met

hodo

logy

Performance: Measured on real HW Energy: Modeled & verified on real HW

Model the benefits of per-core DVFSfor future hardware !

Spiliopoulos et.al. Green Governors –IGCC 2011

Page 19: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Evaluation

Eval

uatio

n

Page 20: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

FFT RunTime Profile – 4 threads

Page 21: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

Page 22: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

Page 23: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

Page 24: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

Page 25: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

Page 26: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

FFT EnergyProfile – 4 threads

Page 27: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

FFT EnergyProfile – 4 threads

Page 28: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

FFT EnergyProfile – 4 threads

Page 29: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

FFT EnergyProfile – 4 threads

Page 30: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

FFT EnergyProfile – 4 threads

Page 31: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled execution

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax

FFT EnergyProfile – 4 threads

Page 32: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

fmin fmax Access: fmin Execute: fmin fmax

Performance “unaffected”FFT EnergyProfile – 4 threads

Page 33: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Performance “unaffected”

Page 34: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Performance “unaffected”

Page 35: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Performance “unaffected”

Page 36: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Performance “unaffected”

Page 37: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Performance “unaffected”

Page 38: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Performance “unaffected”

Page 39: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Energy savings Performance “unaffected”

Page 40: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Energy savings Performance “unaffected”

Page 41: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Energy savings Performance “unaffected”

Page 42: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Energy savings Performance “unaffected”

Page 43: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Energy savings Performance “unaffected”

Page 44: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Coupled vs Manual DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

Energy savings Performance “unaffected”

Page 45: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 46: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 47: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 48: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 49: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 50: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 51: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 52: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 53: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 54: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 55: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 56: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAE

FFT RunTime Profile – 4 threads

Eval

uatio

n

FFT EnergyProfile – 4 threads

fmin fmax Access: fmin Execute: fmin fmax

AutoDAE access phase prefetches more data at fmin compared to ManualDAE

Energy savings Performance “unaffected”

Page 57: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

CAE vs M-DAE vs Auto DAEAutoDAE access phase prefetches more data at fmin

compared to ManualDAE

Energy savings Performance “unaffected”FFT EnergyProfile – 4 threadsFFT RunTime Profile – 4 threads

Eval

uatio

n

AutoDAE competitive with

hand crafted codes

fmin fmax Access: fmin Execute: fmin fmax

Page 58: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Putting it all together

Runtime (normalized to original at fmax)

The lower the better!

Eval

uatio

nEDP (normalized to original at fmax)

CAE: 26% EDP improvement

CAE:15% Performance degradation

Page 59: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Putting it all together

Runtime (normalized to original at fmax)

The lower the better!

Eval

uatio

nEDP (normalized to original at fmax)

CAE: 26% EDP improvementM-DAE: 20% EDP improvementA-DAE: 22% EDP improvement

CAE: 15% Perf. degradationM-DAE: <1% Perf. degradation

A-DAE: <1% Perf. degradation

500ns latency, per-core DVFS

Page 60: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Putting it all together

Runtime (normalized to original at fmax)Auto-DAE outperformsthe manually crafted

codes since the compiler derives the access

phase from an internallyoptimized task version.

Eval

uatio

nEDP (normalized to original at fmax) CAE: 26% EDP improvement

M-DAE: 23% EDP improvementA-DAE: 25% EDP improvement

CAE: 15% Per. degradationM-DAE: <4% Perf. degradation

A-DAE: <4% Perf. degradation

500ns latency, per-core DVFS

Page 61: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Putting it all together

Runtime (normalized to original at fmax)

Eval

uatio

nEDP (normalized to original at fmax) CAE: 26% EDP improvement

M-DAE: 22% EDP improvementA-DAE: 25% EDP improvement

CAE: 15% Per. degradationM-DAE: <4% Perf. degradation

A-DAE: <4% Perf. degradation

Coupled execution provides optimal EDP of 26% with 15%

performance degradation!

Automatic DAE provides close to optimal EDP of 25% with

less than 4%performance degradation!

500ns latency, per-core DVFS

Page 62: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Runtime(normalized to original at fmax)

Statistics

EDP (normalized to original at fmax)

Skeleton instead of polyhedral method:

75% overhead

Eval

uatio

nSlowdown

Polyhedral method -5% to 1%

Skeleton method -10% to 18%

Page 63: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Take-away message Goal: Save energy, keep performance Tool: DVFS

Problem: Fine granularity of memory- and compute-bound phases impossible to adjust f at this rate

Solution: Decoupled Access Execute

Contribution: Automatic DAE (compiler) Results: Equivalent to hand crafted code

(25% EDP improvement compared to original)

Conc

lusio

ns

Page 64: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Uppsala UniversityDepartment of Information Technology

Fix the code. Don’t tweak the hardware: A new compiler approach

to Voltage - Frequency scaling

Sweden

Alexandra Jimborean Konstantinos Koukos Vasileios Spiliopoulos David Black-Schaffer

Stefanos Kaxiras

Page 65: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Statistics # of loops that are detected to be affine:

6 out of 22 loops 2 out of 7 analyzed benchmarks

# of tasks per benchmark highly varying 45K - 50M 1-4 loops per benchmark granularity: 5-320 usec for the execute and 2-30 for

the access phase % of TA: 1.8%(LU, cholesky), 19% FFT, 42-49%

Page 66: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Evaluation: Optimizing DAE For some applications we can choose

better frequencies ! Access phase: Complex calculations

Slightly higher f than min is optimal Execute phase: Too many / irregular writes

Slightly lower f than max is optimal Optimal “OptEDP” policy

Use power and performance models to predict f,V that has the optimal EDP and set this at runtime

“Spiliopoulos et.al., Green governors: A framework for continuously adaptive DVFS.”

Page 67: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Understanding DVFS

Low Frequency

Maximum Frequency

Stall (MEM) X

Miss penalty

Maximum Frequency

Miss penalty

Low Frequency

Memory Bound Compute Bound

fmax fmin fmax fmin fmax fmin fmax

Impossible to DVFS this fast

How do we solve this problem ?

Energy waste

Real programs

Back

grou

nd

Performance Loss

Page 68: Fix the code. Don’t tweak the hardware: A new compiler approach to  Voltage - Frequency  scaling

Info

rmat

ions

tekn

olog

i

Institutionen för informationsteknologi | www.it.uu.se

Methodology

Performance: Measured on real HW Energy: Modeled & verified on real HW

Estimate Access / Execute

Energy for each f,V

Profile:

on real HWfor each f

Time

IPC PowerModel

Estimate Power Model Accuracy

Now we can model the benefits of per-core DVFSon real hardware !

Verify

Spiliopoulos et.al. Green Governors –IGCC 2011

Met

hodo

logy