Energy Efficient Computing Systems Exploiting Online Tuning and Output Quality Management Gianluca Palermo


Post on 18-Mar-2022



Extra-Functional Properties

Functional Description

What to do…

How it is done…

Extra-Functional Properties

Extra-Functional Properties

Energy constraint

Power = f * C * V^2 + P_static

Simple Power-Performance trade-off

[Figure: Power and Delay versus Platform Frequency, with a Dynamic Power curve, a Delay constraint, a Power constraint, and a region marked "???".]

Energy = Power * Delay

Energy Vs Power

•  Power poses constraints
   ‒  E.g. power delivery or cooling solution
•  Energy is in most cases the ultimate metric
   ‒  It measures the cost of performing a fixed task
•  How can I reduce energy?
   ‒  Voltage and Frequency Scaling
   ‒  Dynamic Power Switching
   ‒  […]

(Dynamic) Voltage and Frequency Scaling (DVFS)

•  Small reductions in voltage can be very significant
   ‒  Power is proportional to frequency and to the square of voltage
•  Effective way to reduce energy by providing just-enough computing power
•  F and V are not independent variables
   ‒  Choosing the right VF operating point is not straightforward

K. Choi, R. Soma, M. Pedram, "Dynamic voltage and frequency scaling based on workload decomposition", ISLPED 2004.

[Figure: two Power (ticks 0.1, 1) versus Execution Time (ticks 1 to 4) plots of a workload W against a DEADLINE: one at Freq = F, one at Freq = F/3.]
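The DVFS trade-off can be sketched numerically with the formulas from the earlier slides, Power = f * C * V^2 + P_static and Energy = Power * Delay. This is a minimal sketch, not a model of any real platform: the operating points in `OPPS`, the effective capacitance `c_eff`, and the static power values are all assumed for illustration.

```python
# Hypothetical (frequency in Hz, voltage in V) operating points; not from any real platform.
OPPS = [(1.0e9, 1.0), (0.66e9, 0.85), (0.33e9, 0.7)]


def task_energy(cycles, f_hz, v, c_eff, p_static):
    """Return (energy in J, delay in s) for a fixed amount of work."""
    delay = cycles / f_hz                      # Delay for a fixed task
    power = f_hz * c_eff * v ** 2 + p_static   # Power = f * C * V^2 + P_static
    return power * delay, delay                # Energy = Power * Delay


def pick_opp(cycles, deadline_s, c_eff=1e-9, p_static=0.1):
    """Pick the minimum-energy operating point that still meets the deadline."""
    feasible = []
    for f_hz, v in OPPS:
        energy, delay = task_energy(cycles, f_hz, v, c_eff, p_static)
        if delay <= deadline_s:
            feasible.append((energy, f_hz, v, delay))
    return min(feasible) if feasible else None


if __name__ == "__main__":
    print(pick_opp(1e9, deadline_s=3.5))                # low static power: slowest point wins
    print(pick_opp(1e9, deadline_s=3.5, p_static=1.0))  # high static power: fastest point wins
```

With the assumed numbers, a low static power favors the slowest point that meets the deadline, while a high static power favors finishing quickly. This is one reason choosing the right VF operating point is not straightforward.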

Dynamic Power Switching (DPS)

•  Keeping devices powered on consumes power
   ‒  Cut or reduce power on idle device portions
•  Significant overheads for switching to deep low-power states
•  Different low-power states
   ‒  Some of them need a context save/restore
   ‒  E.g. retention states -> off states
•  Wakeup latency reduces responsiveness
•  In some contexts also called Race-to-Halt

[Figure: two Power (ticks 0.1, 1) versus Execution Time (ticks 1 to 4) plots of a workload W, one annotated with the switching Overhead.]
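The switching overhead mentioned for DPS suggests a simple break-even rule: powering down pays off only if the idle interval is long enough to amortize the transition energy. The sketch below assumes a single low-power state with a fixed transition energy and wakeup latency; all parameter names and values are hypothetical.

```python
def break_even_time(p_idle, p_sleep, e_overhead):
    """Idle time (s) beyond which sleeping costs less energy than staying on.

    Staying on costs p_idle * t; sleeping costs e_overhead + p_sleep * t.
    """
    return e_overhead / (p_idle - p_sleep)


def should_power_down(t_idle, p_idle, p_sleep, e_overhead, wakeup_latency):
    """Decide whether to enter the low-power state for an idle interval."""
    if t_idle < wakeup_latency:
        # The device could not wake up in time: responsiveness would suffer.
        return False
    return t_idle > break_even_time(p_idle, p_sleep, e_overhead)


if __name__ == "__main__":
    # Assumed values: 0.1 W idle, 0.01 W asleep, 45 mJ to save/restore context.
    for t in (0.2, 1.0):
        print(t, should_power_down(t, 0.1, 0.01, 0.045, wakeup_latency=0.05))
```

Deeper states (e.g. off rather than retention) would typically have a lower `p_sleep` but a larger `e_overhead` and `wakeup_latency`, which moves the break-even point.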

Is there any way to reduce the amount of computation performed, and so consume less energy?

… Longer Week-Ends are my Dream

[Figure: two Power (ticks 0.1, 1) versus Execution Time (ticks 1 to 4) plots: one with workloads W, W and one with reduced workloads W', W'.]

Approximate Computing

[Figure: Performance, Energy, and Accuracy; plots of Accuracy versus Performance and Energy Reduction.]

Approximate Computing Applications

•  Approximation adds complexity
   ‒  Not all codes can handle it
•  … but many expensive applications or kernels are naturally error-tolerant
   ‒  Make use of analog inputs
      ‒  E.g. operating on noisy real-world data from noisy sensors
   ‒  Provide analog output
      ‒  E.g. targeting human perception
   ‒  Provide multiple good-enough results or no unique answer
      ‒  E.g. web search
   ‒  Compute iteratively towards convergence
      ‒  E.g. applications that converge over the number of iterations

V. Chippa et al., "Analysis and characterization of inherent application resilience for approximate computing", DAC 2013.

Possible Application Domains…

•  Image Processing
•  Big Data Analytics
•  Robotics
•  Drug Discovery
•  Graph Analytics
•  Traffic Prediction
•  Multimedia

… where 100% accuracy is not always required

HOWEVER …

… we have to pay attention


No Traffic

Let’s have an example…


2 eyes = 3 dimensions

Left camera Right camera

Reference disparity

K. Zhang et al., "Cross-Based Local Stereo Matching Using Orthogonal Integral Images", IEEE Transactions on Circuits and Systems for Video Technology, 2009.

Tunable Stereo Matching


Left camera Right camera

Reference disparity

5 Application Knobs*

QoR: Disparity Error

[Figure: three configurations labeled 1, 2, 3.]

* Paone et al., "An Exploration Methodology for a Customizable OpenCL Stereo-Matching Application Targeted to an Industrial Multi-Cluster Architecture", CODES+ISSS 2012.

Trading-off Accuracy

[Figure: Accuracy (QoR) versus Performance with operating points 1, 2, 3; Performance axis ticks 1 FPS, 5 FPS, 10 FPS and, in a second version, 0.5 FPS, 2 FPS, 4 FPS.]

Extra-functional requirements: What if …
1.  Performance = 4 FPS
2.  Performance = F(Speed)
3.  Min Energy; QoR > 50%; Perf >= 1

[Figure: two Power (ticks 0.1, 1) versus Execution Time plots between New Frame events: one with Idle time, one after Applying VFS.]

D. Gadioli et al., "Application Autotuning to support runtime adaptivity in multicore architectures", SAMOS 2015.
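The third "What if" requirement (Min Energy; QoR > 50%; Perf >= 1) can be read as a constrained selection among operating points. The sketch below is illustrative only: the three points and their FPS, QoR, and energy figures are invented, and it does not reproduce the autotuner from the cited paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    name: str
    fps: float      # performance, frames per second
    qor: float      # quality of result, percent
    energy: float   # energy per frame, J (hypothetical)


# Invented numbers standing in for operating points 1, 2, 3.
POINTS = [
    OperatingPoint("1", fps=1.0, qor=95.0, energy=3.0),
    OperatingPoint("2", fps=5.0, qor=70.0, energy=1.2),
    OperatingPoint("3", fps=10.0, qor=40.0, energy=0.6),
]


def select(points, min_qor, min_fps):
    """Minimize energy subject to QoR > min_qor and fps >= min_fps."""
    feasible = [p for p in points if p.qor > min_qor and p.fps >= min_fps]
    return min(feasible, key=lambda p: p.energy, default=None)


if __name__ == "__main__":
    print(select(POINTS, min_qor=50.0, min_fps=1.0))
```

With these assumed numbers, point 3 is the cheapest but violates the QoR bound, so the selection falls on point 2.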

State of the Art…

In the literature, approximate computing (AC) has been addressed from different perspectives and at different levels:

•  Applications: possible application domains and knobs to trade off accuracy and performance
•  Compilers and Code Transformations: strategies for approximation through code manipulation
•  Architectures and Circuits: use of inexact HW (e.g. neural accelerators, voltage over-scaling, approximate memories)
•  Frameworks: DSL and language support for specifying quality requirements and approximation possibilities

S. Mittal, "A survey of techniques for approximate computing", ACM Computing Surveys, pp. 1–34, 2016.

Application-level Approximations

•  Application parameters
   ‒  Video resolution, #MC samples
•  Task skipping
•  Multiversioning
   ‒  Considering different performance/accuracy trade-offs
•  Approximate Memoization
   ‒  Use a partial key for the lookup

Computation model that represents an approximate application as a pipeline:
J. San Miguel et al., "The anytime automaton", ISCA 2016.

V. Vassiliadis et al., "Exploiting Significance of Computations for Energy-Constrained Approximate Computing", IJPP 2016.


DSL or annotation based approaches:

J. Ansel et al., "PetaBricks: a language and compiler for algorithmic choice", PLDI 2009.

W. Baek et al., "Green: a framework for supporting energy-conscious programming using controlled approximation", PLDI 2010.


Partial key for the lookup: (p & 0xffff0000)

M. Alvarez et al., "Fuzzy Memoization for Floating-Point Multimedia Applications", TCOM 2005.
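Approximate memoization with a partial key can be sketched as follows: the lookup key keeps only the upper bits of the input's float32 encoding, so nearby inputs share one cached result. This is a minimal sketch assuming a 32-bit float input and the 0xffff0000 mask from the slide; the `FuzzyMemo` class is hypothetical and does not reproduce the hardware table of the cited paper.

```python
import math
import struct

MASK = 0xFFFF0000  # keeps sign, 8 exponent bits, and the top 7 mantissa bits


def partial_key(x):
    """Return the masked IEEE-754 float32 bit pattern of x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits & MASK


class FuzzyMemo:
    """Memoize fn on the partial key, trading accuracy for cache hits."""

    def __init__(self, fn):
        self.fn = fn
        self.table = {}
        self.hits = 0

    def __call__(self, x):
        key = partial_key(x)
        if key in self.table:
            self.hits += 1
            return self.table[key]   # result of an earlier, nearby input
        result = self.fn(x)
        self.table[key] = result
        return result


if __name__ == "__main__":
    fuzzy_sqrt = FuzzyMemo(math.sqrt)
    print(fuzzy_sqrt(2.0), fuzzy_sqrt(2.001), fuzzy_sqrt.hits)
```

Dropping 16 mantissa bits makes inputs within roughly 0.8% of each other collide, so the mask width acts as the accuracy knob.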

Compiler Approaches

-  Precision Scaling
   -  FP64->FP32->FP16
   -  Float2Int
   -  Custom precision

N. Ho et al., "Efficient floating point precision tuning for approximate computing", ASP-DAC 2017.

-  Loop Perforation
   -  Modulo Perforation
   -  Truncation Perforation
   -  Randomized Perforation (rand( ))

S. Misailovic et al., "Quality of service profiling", ICSE 2010.
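The three perforation variants can be illustrated by the set of loop iterations each one keeps. The sketch below is one possible reading of the names, applied to a mean computation: the parameters `stride`, `rate`, and `seed` are assumptions, and a real compiler would transform the loop itself rather than build index lists.

```python
import random


def modulo_indices(n, stride):
    """Modulo perforation: execute only every stride-th iteration."""
    return [i for i in range(n) if i % stride == 0]


def truncation_indices(n, rate):
    """Truncation perforation: drop the last rate * n iterations."""
    return list(range(n - int(n * rate)))


def randomized_indices(n, rate, seed=0):
    """Randomized perforation: skip each iteration with probability rate."""
    rng = random.Random(seed)
    return [i for i in range(n) if rng.random() >= rate]


def perforated_mean(data, indices):
    """Approximate the mean using only the iterations that were kept."""
    return sum(data[i] for i in indices) / len(indices)


if __name__ == "__main__":
    data = [float(i) for i in range(100)]
    print("exact     ", sum(data) / len(data))
    print("modulo    ", perforated_mean(data, modulo_indices(100, 2)))
    print("truncation", perforated_mean(data, truncation_indices(100, 0.25)))
    print("randomized", perforated_mean(data, randomized_indices(100, 0.5)))
```

On this monotonic input, truncation biases the result far more than modulo perforation, which shows that the error of a perforated loop depends on the data as well as on the skip rate.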

Approximate Hardware

Very large literature … Dual-VDD architectures, approximate adders/multipliers, NPUs, etc.

[Figure: Truffle Core (VDDH, VDDL); CPU; ADD.]

    ld    0x04 r1
    ld    0x08 r2
    add.a r1 r2 r3
    st.a  0x0c r3

H. Esmaeilzadeh et al., "Architecture support for disciplined approximate programming", ASPLOS 2012.

NPU (PARROT)
•  Efficient hardware implementations
•  Capable of mimicking many APPROXIMABLE computations (after training)
•  Fault tolerant

H. Esmaeilzadeh et al., "Neural Acceleration for General-Purpose Approximate Programs", MICRO 2012.

EDA perspective

S. Venkataramani et al., "SALSA: systematic logic synthesis of approximate circuits", DAC 2012.

I. Scarabottolo et al., "Circuit Carving: A Methodology for the Design of Approximate Hardware", DATE 2018.

Approximate Frameworks (1)

EnerJ
•  Java extension for approximate computation
•  Including compiler and runtime, envisioning the usage of approximate HW

    @Approx int a = ...;
    @Precise int p = ...;

A. Sampson et al., "EnerJ: Approximate Data Types for Safe and General Low-Power Computation", PLDI 2011.

Approximate Frameworks (2)

Silvano et al., "Autotuning and adaptivity in energy efficient HPC systems: the ANTAREX toolbox", Computing Frontiers 2018.

•  Functional Description
   ‒  C/C++ w/ OpenMP, MPI, OpenCL, Matlab
•  Extra-Functional Requirements and Approximation-aware transformations
   ‒  E.g. adaptivity features, tuning knobs, code transformations, power/performance constraints
   ‒  DSL: LARA (AOP)
•  DSL Weaver
•  Multiversioning aspect including Modulo Perforation

D. Gadioli et al., "SOCRATES: A seamless online compiler and system runtime autotuning framework for energy-aware applications", DATE 2017.

Need for monitoring the Quality of Result (QoR)

Controlling the Approximation

[Figure: Application Inputs feed a Golden Model and an Approximate Model, whose outputs go to a Checker; the check can be done Off-line or On-line.]

It has to be noted that establishing a suitable error measure is highly application dependent … and the value is data dependent.
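The golden-model / approximate-model / checker arrangement can be sketched in a few lines. Everything here is a stand-in chosen for illustration: the golden model squares its inputs, the approximate model squares inputs rounded to one decimal, and the error measure is the mean relative error, which is only one of many application-dependent choices.

```python
def golden_model(xs):
    """Exact reference computation (stand-in)."""
    return [x * x for x in xs]


def approximate_model(xs):
    """Approximate computation: works on inputs rounded to one decimal."""
    return [round(x, 1) ** 2 for x in xs]


def mean_relative_error(golden, approx, eps=1e-12):
    """Error measure; the right metric is highly application dependent."""
    total = sum(abs(g - a) / max(abs(g), eps) for g, a in zip(golden, approx))
    return total / len(golden)


def checker(inputs, threshold):
    """Compare both models on the same inputs and report (error, accepted)."""
    err = mean_relative_error(golden_model(inputs), approximate_model(inputs))
    return err, err <= threshold


if __name__ == "__main__":
    print(checker([1.0, 2.0, 3.0], threshold=0.01))   # inputs unaffected by rounding
    print(checker([1.04, 2.04], threshold=0.01))      # inputs hurt by rounding
```

The two calls use the same approximation and threshold but different inputs, and only the first passes: the measured error is data dependent, which is why off-line checks alone may not be enough.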

Online Monitoring: an Example

[Figure: Rumba. Application Inputs, NPU, CPU, Detect, Recover, Approximate & Robust Results.]

•  Error-aware input classification
•  Output temporal similarity

Khudia et al., "Rumba: An Online Quality Management System for Approximate Computing", ISCA 2015.

Off-line Quality Profiling

•  Pure autotuning approaches
   ‒  Error = f(x,i)

At design time:
1.  Instrument the application
2.  Perform a Design Space Exploration
3.  Store the Pareto front

Operating Point: knob1 = 5, knob2 = 1; Objective1 = 1000, Objective2 = 50
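Step 3 of the design-time flow, storing the Pareto front, can be sketched as a dominance filter over the explored operating points. The sketch assumes both objectives are to be minimized; apart from the operating point shown on the slide (knob1 = 5, knob2 = 1, objectives 1000 and 50), the explored points are invented.

```python
def dominates(a, b):
    """True if objectives a are no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the operating points whose objectives no other point dominates.

    Each point is a (knobs, objectives) pair; objectives are minimized.
    """
    return [p for p in points
            if not any(dominates(q[1], p[1]) for q in points)]


# Hypothetical result of a design space exploration.
EXPLORED = [
    ({"knob1": 5, "knob2": 1}, (1000, 50)),   # operating point from the slide
    ({"knob1": 3, "knob2": 1}, (1200, 60)),   # dominated by the first point
    ({"knob1": 1, "knob2": 2}, (800, 70)),
    ({"knob1": 1, "knob2": 1}, (1500, 20)),
]

if __name__ == "__main__":
    for knobs, objectives in pareto_front(EXPLORED):
        print(knobs, objectives)
```

Only the non-dominated points need to be shipped with the application, which keeps the run-time search space small.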

Configuring a Tunable Application

[Figure: Deploy-Time: a Configuration file sets the App-Knobs of the Tunable Program. Run-Time: the Tunable Program processes the Input on the Target Platform.]

Off-line Quality Profiling (cont…)

•  Pure autotuning approaches
   ‒  Error = f(x,i)
•  Proactive and/or reactive approaches:
   ‒  Error < K => x' : f(x',i) < K

[Figure: Design-Time: EFP Models. Run-Time: a Dynamic Autotuner takes the Goal, the Input Features, and the Monitors, and sets the Knobs of the Run-time Tunable Program on the Target Platform.]
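The rule "Error < K => x' : f(x',i) < K" can be sketched as a lookup over profiled knowledge: a proactive step picks the cheapest knob x' whose expected error for the current input features stays below K, and a reactive step rescales the expectations using what the monitors observe. The knowledge table, the feature classes, and the correction rule below are assumptions made for illustration, not the API of any specific autotuner.

```python
# Hypothetical design-time knowledge: (knob, input_class) -> (expected_error, cost)
KNOWLEDGE = {
    (1, "small"): (0.00, 10.0), (2, "small"): (0.02, 6.0), (4, "small"): (0.08, 3.0),
    (1, "large"): (0.00, 40.0), (2, "large"): (0.05, 22.0), (4, "large"): (0.20, 12.0),
}


def correction_from_monitor(observed_error, expected_error):
    """Reactive step: ratio between what was observed and what was expected."""
    return observed_error / expected_error if expected_error > 0 else 1.0


def select_knob(knowledge, input_class, k, correction=1.0):
    """Proactive step: cheapest x' with corrected f(x', i) < K, or None."""
    candidates = [
        (cost, knob)
        for (knob, cls), (err, cost) in knowledge.items()
        if cls == input_class and err * correction < k
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    print(select_knob(KNOWLEDGE, "small", k=0.1))            # input-aware choice
    print(select_knob(KNOWLEDGE, "large", k=0.1))
    drift = correction_from_monitor(observed_error=0.12, expected_error=0.08)
    print(select_knob(KNOWLEDGE, "small", k=0.1, correction=drift))
```

In the last call the monitored error exceeds the profiled expectation, so the selection backs off to a less aggressive knob.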


mARGOt: D. Gadioli et al., "Application autotuning to support runtime adaptivity in multicore architectures", SAMOS 2015. https://gitlab.com/margot_project/

CAPRI: X. Sui et al., "Proactive Control of Approximate Programs", ASPLOS 2016.

Control System
•  Use a machine-learning based approach to build the error model f_e and the cost model f_c
•  Assume p(i) is equal for all inputs
   ‒  easy to change the assumption if needed

[Figure: Offline: a Profiler runs the Tunable Program on Training Inputs with the Error Metric and the Cost Metric, and a Model Builder produces the Error Model f_e and the Cost Model f_c. Online: a Controller takes the Input and (ε, π) and configures the Tunable Program. Blue boxes are provided by programmers.]

PowerDial: H. Hoffmann et al., "Dynamic knobs for responsive power-aware computing", ASPLOS 2011.

H. Hoffmann, "JouleGuard: […]", SOSP 2015.
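The offline/online split of the control system can be sketched with table-based stand-ins for the two models. Here f_e is just the list of profiled errors per knob and f_c the mean profiled cost, and the pair (ε, π) is read as "the error must stay within ε with probability at least π", with all training inputs equally likely. This reading, the profile data, and the function names are assumptions; the learned models of the cited paper are not reproduced.

```python
def build_models(profile):
    """Offline: build error model f_e and cost model f_c from profiling data.

    profile maps each knob setting to [(error, cost), ...], one per training input.
    """
    f_e = {x: sorted(e for e, _ in samples) for x, samples in profile.items()}
    f_c = {x: sum(c for _, c in samples) / len(samples)
           for x, samples in profile.items()}
    return f_e, f_c


def prob_error_within(f_e, x, eps):
    """Estimate P(error <= eps) for knob x, assuming equally likely inputs."""
    return sum(1 for e in f_e[x] if e <= eps) / len(f_e[x])


def controller(f_e, f_c, eps, pi):
    """Online: cheapest knob whose error stays within eps with probability >= pi."""
    ok = [x for x in f_e if prob_error_within(f_e, x, eps) >= pi]
    return min(ok, key=lambda x: f_c[x], default=None)


# Hypothetical profiling results for three knob settings on four training inputs.
PROFILE = {
    0: [(0.0, 10.0)] * 4,
    1: [(0.01, 6.0), (0.02, 6.5), (0.03, 5.5), (0.20, 6.0)],
    2: [(0.05, 3.0), (0.10, 3.0), (0.30, 3.0), (0.40, 3.0)],
}

if __name__ == "__main__":
    f_e, f_c = build_models(PROFILE)
    print(controller(f_e, f_c, eps=0.05, pi=0.75))
    print(controller(f_e, f_c, eps=0.05, pi=0.90))
```

Tightening π from 0.75 to 0.90 forces the controller back to the exact setting, because knob 1 misses the error bound on one of the four training inputs.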

Removing Offline Quality Profiling

IRA – Input Responsive Approximation
M. Laurenzano et al., "Input responsiveness: using canary inputs to dynamically steer approximation", PLDI 2016.

mARGOt - AGORA
D. Gadioli et al., "mARGOt: a Dynamic Autotuning Framework Targeting Adaptivity and Controllable Approximation", SoftwareX.
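The canary-input idea named in the IRA title can be sketched as follows: a small sample of the actual input is processed exactly and with each candidate approximation, and the most aggressive candidate that meets the quality target on the sample is then used for the full input. The sampling stride, the quality metric, and the two perforated-mean approximations are assumptions chosen to keep the sketch short.

```python
def make_canary(data, stride):
    """Build a small canary input by subsampling the real input."""
    return data[::stride]


def mean(xs):
    return sum(xs) / len(xs)


def quality(reference, result):
    """1.0 means identical to the exact result (assumes reference != 0)."""
    return 1.0 - abs(reference - result) / abs(reference)


def choose_approximation(data, exact_fn, approx_fns, target, stride=10):
    """Try candidates on the canary, most aggressive first; fall back to exact."""
    canary = make_canary(data, stride)
    reference = exact_fn(canary)
    for name, fn in approx_fns:
        if quality(reference, fn(canary)) >= target:
            return name, fn
    return "exact", exact_fn


# Candidate approximations, ordered from most to least aggressive.
APPROX = [
    ("skip-8", lambda xs: mean(xs[::8])),
    ("skip-2", lambda xs: mean(xs[::2])),
]

if __name__ == "__main__":
    data = [float(i) for i in range(1000)]
    name, fn = choose_approximation(data, mean, APPROX, target=0.98)
    print(name, fn(data))
```

Because the choice is made on a sample of the current input, no offline quality profile is required, at the price of the extra canary runs.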

Conclusions… Why Take Care of Approximation?