energy efficient computing systems exploiting online
TRANSCRIPT
Energy Efficient Computing Systems Exploiting Online Tuning and Output Quality Management
Gianluca Palermo
Extra-Functional Properties
2
Func%onal)Descrip%on)
What%to%do…%%
How%It%is%done…%
Extra3Func%onal)Proper%es)
Energy constraint
Power = f * C * V2 + Pstatic
Simple Power-Performance trade-off
4
Power%
Delay%
Pla5orm)Frequency%
Dynamic Power
???
Energy = Power * Delay Delay
constraint
Power constraint
Energy Vs Power
5
• Power poses constraints • E.g. power delivery or cooling solution
• Energy is in most of the cases the ultimate metric • It measures the cost of performing a fixed task
• How can I reduce Energy? • Voltage and Frequency Scaling • Dynamic Power Switching • […]
• Small reductions in voltage can be very significant • Power is proportional to frequency and square of
voltage • Effective way to reduce energy by providing just-
enough computing power • F and V are not independent variables • Choosing the right VF operating point is not
straightforward
(Dynamic) Voltage and Frequency Scaling (DVFS)
6
K.)Choi,)R.)Soma,)M.)Pedram)“Dynamic)voltage)and)frequency)
scaling)based)on)workload)decomposi%on”)ISLPED)2004)
%1% %2% )3) %4%
Power%
1%
Execu<on%Time%
0.1%
)Freq)=)F)W
)DEADLINE)
Execu<on%Time%
Power%
%1% %2% )3) %4%
1%
0.1%
)DEADLINE)
W
)Freq)=)F/3)
• Keeping devices powered-on consumes power • Cut or reduce power on idle device portions
• Significant overheads for switching to deep low-power states • Different low-power states
• Some of them need the context save/restore • E.g. Retention states –> off states
• Wakeup latency reduces the responsiveness • In some contexts also called Race-to-Halt
Dynamic Power Switching (DPS)
7
%1% %2% %3% %4%
Power%
1% Overhead)
Execu<on%Time%
0.1%
%1% %2% %3% %4%
Power%
1%
Execu<on%Time%
0.1%
W
W
Is there any way to reduce the amount of computation to be performed to consume less?
… Longer Week-Ends are my Dream
8
Execu<on%Time%
Power%
%1% %2% )3) %4%
1%
0.1%
W W
Execu<on%Time%
Power%
%1% %2% )3) %4%
1%
0.1%
W’) W’)
Approximate Computing
Performance%
Energy%
Accuracy%
Performance%and%%Energy%Reduc<on%
Accuracy%
Performance%and%%Energy%Reduc<on%
9
• Approximation adds complexity ‒ Not all codes can handle it
• … but many expensive applications or kernels are naturally error-tolerant ‒ Make use of analog inputs
‒ E.g. operating noisy real-world data from noisy sensor ‒ Provide analog output
‒ E.g. targeting human perception ‒ Provide multiple good-enough results or no unique answer
‒ E.g. web search ‒ Compute iteratively towards convergence
‒ E.g. convergent applications over the number of iterations.
Approximate Computing Applications
10
V.)Chippa)et)al.“Analysis)and)characteriza%on)of)inherent)applica%on)resilience)for)approximate)compu%ng)”)DAC)2013.))
Possible Application Domains…
11
Image&Processing&Big&Data&Analy4cs&Robo4cs&
Drug&Discovery&
Graph&Analy4cs&
Traffic&Predic4on&
Mul4media&
…)where)100%)of)accuracy)not)always)required)
HOWEVER …
Let’s have an example…
13
2 eyes = 3 dimensions
Left camera Right camera
Reference disparityK.)Zhang)et)al.)“Cross3Based)Local)Stereo)Matching)Using)Orthogonal)Integral)Images”.)IEEE)Transac%ons)On)Circuits)and)Systems)For)Video)Technology)2009)
Tunable Stereo Matching
14
Left camera Right camera
Reference disparity
*)Paone)et)al.)“An)Explora%on)Methodology)for)a)Customizable)OpenCL)Stereo3Matching)Applica%on)Targeted)to)an)Industrial)Mul%3Cluster)Architecture)")In)CODES+ISSS)2012)
1%
2%
3%
QoR Disparity
Error
5%Applica<on%%Knobs*%
Trading-off Accuracy
15
Accuracy)QoR
)
5FPS%1FPS%Performance))
10FPS%
1
2
3
Extra-functional requirements: What if … 1. Performance = 4FPS 2. Performance = F(Speed) 3. Min Energy; QOR>50%; Perf>=1
2
1 2 3
2FPS%0.5FPS% 4FPS%
2
Execu<on%Time%
Power%
%New%Frame%
1%
0.1%
%New%Frame%
Execu<on%Time%
Power%
%New%Frame%
1%
0.1%
%New%Frame%
Idle)
Applying)VFS)
D.Gadioli)et)al.)“Applica%on)Autotuning)to)support)run%me)adap%vity)in)mul%core)architectures”)SAMOS)2015)
Architectures)and)Circuits)Use%of%Inexact%HW%(e.g.%Neural%
accelerators,%voltage%over%scaling,%approximate%memories)%
Compilers)and))Code)Transforma%ons)
Strategies%for%approxima<on%due%to%code%manipula<on%
Applica%ons))) Possible%applica<on%domains%and%knobs%to%tradeToff%Accuracy%and%Performance%%
In literature AC has been faced from different perspectives and at different levels
State of the Art…
16
S.)Migal,)“A)survey)of)techniques)for)approximate)compu%ng,”)ACM)Compu%ng)Surveys,)pp.)1–34,)2016.))
DSL%and%Language%supports%for%specifying%quality%requirements%and%
approxima<on%possibili<es%%
Fram
eworks)
• Application parameters
• Task skipping
• Multiversioning • Considering different performance/accuracy trade-offs
• Approximate Memoization • User partial key for the lookup
Application-level Approximations
17
Applica%ons)))
Compilers)and))Code)Transforma%ons)
Architectures)and)Circuits)
Fram
eworks)
Video Resolution #MC Samples
…
Computation model that represents an approximate application as a pipeline
J.)SanMiguel)et)al)"The)any%me)automaton.")
ISCA)2016.)
V.)Vassilliadis)et)al)")Exploi%ng)Significance)of)Computa%ons)for)Energy3Constrained)Approximate)
Compu%ng")IJPP)2016.)
• Application parameters
• Task skipping
• Multiversioning • Considering different performance/accuracy trade-offs
• Approximate Memoization • User partial key for the lookup
Application-level Approximations
17
Applica%ons)))
Compilers)and))Code)Transforma%ons)
Architectures)and)Circuits)
Fram
eworks)
Video Resolution #MC Samples
…
DSL or annotation based approaches%J)Ansel)et)al.)“PetaBricks:)a)language)and)compiler)for)algorithmic)choice”)PLDI)2009)
B.)Woongki)et)al)"Green:)a)framework)for)suppor%ng)energy3conscious)programming)using)controlled)approxima%on.")PLDI)2010)
• Application parameters
• Task skipping
• Multiversioning • Considering different performance/accuracy trade-offs
• Approximate Memoization • User partial key for the lookup
Application-level Approximations
17
Applica%ons)))
Compilers)and))Code)Transforma%ons)
Architectures)and)Circuits)
Fram
eworks)
Video Resolution #MC Samples
…
(p#&�xffff0000)#
M.)Alvarez)et)al)"Fuzzy)Memoiza%on)for)Floa%ng3Point)Mul%media)Applica%ons")TCOM)2005.)
- Precision Scaling - FP64->FP32->FP16 - Float2Int - Custom precision
- Loop Perforation
Compiler Approaches
18
Applica%ons)))
Compilers)and))Code)Transforma%ons)
Architectures)and)Circuits)
Fram
eworks)
N.)Ho)et)al.)"Efficient)floa%ng)point)precision)tuning)for)approximate)compu%ng,")
ASP3DAC)2017.)
Modulo%Perfora<on%
Trunca<on%Perfora<on%
Randomized%Perfora<on%
rand( )
S.)Misailovic)et)al.)"Quality)of)service)profiling.")ICSE)2010)
Very large literature … Dual VDD architectures, Approximate Adders/Multipliers, NPU etc…
Approximate Hardware
19
Truffle%Core%
VDDH$
VDDL
CPU
• Efficient%hardware%implementa<ons%
• Capable%to%mimic%many%APPROXIMABLE&computa<ons%(a]er%training)%
• Fault%tolerant%%
ADD
Applica%ons)))
Compilers)and))Code)Transforma%ons)
Architectures)and)Circuits)
Fram
eworks)
ld 0x04 r1 ld 0x08 r2 add.a r1 r2 r3 st.a 0x0c r3
H.)Esmaelizadeh)et)al.“Neural)Accelera%on)for)General3Purpose)Approximate)Programs”)MICRO)2012)
H.)Esmaelizadeh)et)al.)“Architecture)support)for)disciplined)approximate)programming”)ASPLOS)2012)
PARROT&
EDA perspective
S.)Venkataramani)et)al,)“SALSA:))systema%c)logic)synthesis)of)approximate)circuits”)DAC)2012)I.)Scarabogolo)et)al)“Circuit)Carving:)A)Methodology)for)the)Design)of)Approximate)Hardware”)DATE)2018)
EnerJ • Java extension with for
Approximate Computation
• Including Compiler and Runtime envisioning the usage of approximate HW
Approximate Frameworks (1)
20
int a = ...; int p = ...;
@approx @precise
Applica%ons)))
Compilers)and))Code)Transforma%ons)
Architectures)and)Circuits)
Fram
eworks)
A.)Sampson)et)al.)“EnerJ:)Approximate)Data)Types)for)Safe)and)General)Low3Power)Computa%on”)PLDI)2011)
Approximate Frameworks (2)
21 Silvano)et)al.)“Autotuning)and)adap%vity)in)energy)efficient)HPC)
systems:)the)ANTAREX)toolbox”)Compu%ng)Fron%ers)2018)
DSL WEAVER
Func%onal)Descrip%on)
C/C++%w/%OpenMP,%%MPI,%OpenCL,%Matlab%
Extra3Func%onal)Requirements)and)Approxima%on)aware3transforma%ons)E.g.))Adap%vity)features,)Tuning)Knobs,))))))))))Code)transforma%ons,)))))))))))Power/Performance)constraints)
DSL L A RA
AOP
Multiversioing aspect including Modulo Perforation
Applica%ons)))
Compilers)and))Code)Transforma%ons)
Architectures)and)Circuits)
Fram
eworks)
D.)Gadioli)et)al.)“SOCRATES:)A)seamless)online)compiler)
and)system)run%me)autotuning)framework)for)energy3aware)applica%ons”)
DATE)2017)
Need for monitoring the Quality of Result (QoR)
Controlling the Approximation
22
Golden))Model)
Approximate)Model)
Checker)
It)has)to)be)noted)that)establishing)a)suitable)error)measure)is)highly)applica%on)
dependent…)
Applica%on)Inputs)
Off3line)
On3line)…)and)the)value)is)data)dependent)
Online Monitoring: an Example
23
Applica<on%Inputs%
Approximate%Results%
&%Robust%
CPU)
NPU)
Recover%
Detect%
Rumba)Khudia)et)al.)“Rumba:)An)
Online)Quality)Management)System)for)Approximate)Compu%ng”)ISCA)2015)
Error-Aware Input classification Output Temporal Similarity
• Pure autotuning approaches • Error = f(x,i)
Off-line Quality Profiling
At design time: 1. Instrument the application 2. Perform a Design Space Exploration 3. Store the Pareto front
knob1%=%5%knob2%=%1%
Objec<ve1%=%1000%Objec<ve2%=%50%
Operating Point
24
Configuring a Tunable Application
Input%
Tunable%Program%
Target%Pladorm%
RunT%Time%
App3Knobs)
DeployTTime%
Configuration file
25
• Pure autotuning approaches • Error = f(x,i)
• Proactive and/or reactive approaches: • Error < K => x’ : f(x’,i) < K
Off-line Quality Profiling (cont…)
26
EFP%Models%Dynamic)
)Autotuner)
Input% Features%
RunT<me%Tunable%Program%
EFP%Models%
EFP%Models%
EFP%Models%
Target%Pladorm%
RunTTime%
Knobs)
Goal%
DesignTTime%
Monitors%
• Pure autotuning approaches • Error = f(x,i)
• Proactive and/or reactive approaches: • Error < K => x’ : f(x’,i) < K
Off-line Quality Profiling (cont…)
27 D.)Gadioli)et)al.)“Applica%on)autotuning)to)support)run%me)
adap%vity)in)mul%core)architectures”)SAMOS15)hgps://gitlab.com/margot_project/)
Online*
Control*System*• Use*machineOlearning*based*approach*to*build*error*model*fe*and*cost*
model*fc*• Assume*p(i)"is"equal"for"all"inputs ""
– easy"to"change"assumpKon"if"needed*
22*
Offline*
Blue"boxes"provided"by"programmers"**
Error*Model*fe*
Cost*Model*fc*
Model*Builder*
Training*Inputs*Training*Inputs*Training*Inputs*
Error*Metric*
Controller*
Input* (ε,*π)*
Tunable*Program*
Cost*Metric*
Profiler*
X.)Sui)et)al.)“Proac%ve)Control)of)Approximate)Programs”)ASPLOS)2016)
H.)Hoffmann)et)al.)"Dynamic)knobs)for)responsive)power3aware)compu%ng.")ASPLOS)2012)
CAPRI%
mARGOt%
PowerDial%
H.)Hoffmann))”JouleGuard:)[…]")SOSP)2015)
IRA – Input Responsive Approximation
Removing Offline Quality Profiling
28
mARGOt - AGORA
D.)Gadioli)et)al.)“mARGOt:)a)Dynamic)Autotuning)Framework)Targe%ng)Adap%vity)and)Controllable)Approxima%on”)SoxwareX)
M.)Laurenzano)et)al.)"Input)responsiveness:)using)canary)inputs)to)dynamically)steer)approxima%on.")PLDI)16.)