ISMS Keynote – 9 Feb 2015
A Model‐driven Approach for Time‐energy Performance of Parallel Applications
Yong Meng TEO+* and Lavanya Ramapantulu
Department of Computer Science
National University of Singapore
email: [email protected]
url: www.comp.nus.edu.sg/~teoym
+ Visiting Professor, Chinese Academy of Sciences
* Centre for Business Analytics, NUS
My Interests
Recent PhD Theses
1. Specification and Verification of Shared‐memory Concurrent Programs, Le Duy Khanh, Dec 2014.
2. Parallelism‐Energy Performance Analysis of Multicore Systems, B.M. Tudor, Jan 2014. [IPDPS 2012 PhD Forum Best Poster Award]
3. On Flash Crowd Performance of Peer‐Assisted File Distribution, C. Carbunaru, June 2014.
4. Strategy‐proof Resource Pricing in Federated Systems, M. Mihailescu, 2012. [Best Paper Award ‐ 10th International Conference on Algorithms and Architectures for Parallel Processing, May 2010]
5. Composable Simulation Models and their Formal Validation, Claudia Szabo, 2010. [ACM SIGSIM 2009 Best PhD Student Paper Award]
Teaching
• CS5224 Cloud Computing, CS3210 Parallel Computing, CS5239 Computer Systems Performance Analysis, …
NUS School of Computing
NUS faculties and schools: Faculty of Arts and Social Sciences, School of Business, School of Computing, Faculty of Dentistry, School of Design and Environment, Faculty of Engineering, Faculty of Law, Yong Loo Lin School of Medicine, Yong Siew Toh Conservatory of Music, Saw Swee Hock School of Public Health, Faculty of Science, University Scholars Programme, Yale‐NUS College, Lee Kuan Yew School of Public Policy, NUS Graduate School for Integrative Sciences & Engineering, Duke‐NUS Graduate Medical School Singapore
• Established July 1998 (formerly DISCS within FoS)
• Departments:
  – Computer Science
  – Information Systems
• Staff strength:
  – 111 (academic staff)
  – 115 (research staff)
• Student population: ~2,330 (total)
  – 1,800 undergraduates
  – 530 graduate students
Recent Rankings
Massachusetts Institute of Technology
Carnegie Mellon University
University of Cambridge
University of California, Berkeley
National University of Singapore
The Hong Kong University of Science and Technology
University of Edinburgh
Faster is better
time (traditionally) & energy
Energy Use of Datacenters
• Energy consumption of large‐scale data centers and its costs are significant
  – 2006: 6,000 data centers in the US consumed 61 × 10^9 kWh of energy, 1.5% of all electricity consumption, at a cost of $4.5 billion
  – 2006‐2011: from 7 GW to 12 GW, 10 new power plants
• 1998‐2007: performance of supercomputers (+7,000%) has increased 3.5 times faster than their operating efficiency* (+2,000%)
• Effort to reduce energy use is focused on the computing, networking, and storage activities of a data center – our focus
* operating efficiency of a system = performance per Watt of power
Datacenter Energy Usage
[Figure from Barroso L.A., et al., The Datacenter as a Computer: An Introduction to the Design of Warehouse‐Scale Machines, 2nd Edition, 2013]
Outline
• Motivation
• Objective
• Research Questions
• Time‐energy Performance [ICPP 2014]
• Heterogeneous Low‐power Systems
• Summary
Motivation
• computing platforms are increasingly heterogeneous
– processors: brawny vs wimpy, big‐little, accelerators, …
– supercomputer with accelerators
– data centers with different server generations
– heterogeneous cloud computing resources with different price‐performance
• ARM Cortex‐A9
• big.LITTLE: ARM A15 + A7
• NVIDIA Jetson TK1
  – CPU: 4‐core ARM A15
  – GPU: 192‐core NVIDIA Kepler
System Utilization vs Percentage Power Usage
[Figure: power and energy efficiency curves; x‐axis: percentage of system utilization (0 to 100); y‐axis: percentage of power usage (0 to 100); the typical operating region is marked.]
1. A typical Google cluster spends most of its time in the 10‐50% utilization range: a mismatch between server workload profile and server energy efficiency
2. Energy‐proportional system (ideal): energy consumed is proportional to the amount of work done
Energy‐proportional Systems
• Even when power requirements scale linearly with the load, energy efficiency is not a linear function of load; an idle system still uses 50% power
• An ideal system consumes no power when idle, very little power under a light load and, gradually, more power as the load increases
• Dynamic power range: lower and upper range of the power consumption of a device
  – Processor (70%), DRAM (50%), disk drive (25%), network switches (15%), human (??%)
  – wider range is better
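The first bullet can be checked with a small sketch. It assumes a hypothetical linear power model normalized to peak power, P(u) = P_idle + (1 - P_idle) × u, with the 50% idle figure quoted above; the function names are illustrative and not part of the talk.

```python
def relative_power(utilization, idle_fraction=0.5):
    """Power as a fraction of peak, under an assumed linear power model."""
    return idle_fraction + (1.0 - idle_fraction) * utilization


def relative_efficiency(utilization, idle_fraction=0.5):
    """Work per unit of power, normalized so that full load gives 1.0."""
    return utilization / relative_power(utilization, idle_fraction)


# Power grows linearly with utilization, but efficiency does not.
for u in (0.1, 0.3, 0.5, 1.0):
    print(f"utilization {u:.0%}: power {relative_power(u):.2f}, "
          f"efficiency {relative_efficiency(u):.2f}")
```

Under this assumed model a node at 10% utilization draws 55% of peak power, so its efficiency is roughly 18% of the full‐load value, which is why the 10‐50% operating region is costly.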
Is a human being an energy‐proportional system?
• idle (70 W), average (120 W), peak (1‐2 kW)
• dynamic power range = 1 - 70/1000 > 90%
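The dynamic power range arithmetic for the human example can be written as a one‐line function. This is a minimal sketch; the function name is an assumption, and the inputs are the idle and lower peak figures from the slide.

```python
def dynamic_power_range(idle_watts, peak_watts):
    """Fraction of peak power that varies with load: 1 - idle/peak."""
    return 1.0 - idle_watts / peak_watts


# Human: 70 W idle, 1 kW peak (the lower end of the 1-2 kW range).
human_range = dynamic_power_range(70, 1000)
print(f"human dynamic power range: {human_range:.0%}")
```

Using the upper peak of 2 kW only widens the range, so the "> 90%" bound holds across the whole 1‐2 kW interval.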
Research Questions
1. Can we replace traditional servers with low‐power nodes? [SIGMETRICS2013]
2. How do we configure energy‐efficient heterogeneous clusters (data centers)? [ICPP2014, IPDPS2015]
3. What is the cost of processing big data on low‐power servers? [VLDB2015]
4. Is dataflow a suitable model of computation and scheduling to scale out workloads on low‐power servers? [PACT2013 workshop]
General Objective
To develop models and techniques for dynamic resource provisioning to achieve energy‐efficient computing while meeting performance deadlines
Approach:
1. generalized analytic performance model for configuring application resource demand (this talk)
2. technique for runtime provisioning using polymorphic tasks
3. …
Time‐energy Performance
L. Ramapantulu, B.M. Tudor, D. Loghin, T. Vu and Y.M. Teo, Modeling the Energy Efficiency of Heterogeneous Clusters, Proc of 43rd International Conference on Parallel Processing, Minneapolis, USA, Sep 2014.
Reducing Power: Wimpy vs Brawny Servers
[Figure: power [W] versus performance [MFLOPS] for a brawny node and a wimpy node. Annotations: marginal improvement in performance at high power; high idle power.]
Objective
How do we configure energy‐efficient heterogeneous clusters (data centers)?
Given an application with an energy budget and an execution time deadline, determine efficient configurations to run the application
Motivating Example: Configuring Heterogeneous Systems
What is the total number of possible configurations to run an application with ten AMD and ten ARM nodes?
Total = 36,380 configurations
• mixed configurations = 10 ARM nodes × 4 cores per ARM node × 5 core frequencies per ARM node × 10 AMD nodes × 6 cores per AMD node × 3 core frequencies per AMD node = 36,000
• AMD only = 10 × 6 × 3 = 180
• ARM only = 10 × 4 × 5 = 200
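The count can be reproduced by enumerating the configuration space directly. The sketch below assumes that a configuration for one node type is a triple (active nodes, active cores per node, frequency level); the dictionary layout and function name are illustrative only.

```python
from itertools import product

# Node counts, cores per node and frequency levels from the example above.
ARM = {"nodes": 10, "cores": 4, "freqs": 5}
AMD = {"nodes": 10, "cores": 6, "freqs": 3}


def single_type_configs(spec):
    """All (active nodes, cores per node, frequency level) triples for one node type."""
    return list(product(range(1, spec["nodes"] + 1),
                        range(1, spec["cores"] + 1),
                        range(spec["freqs"])))


arm_only = single_type_configs(ARM)
amd_only = single_type_configs(AMD)
mixed = len(arm_only) * len(amd_only)  # every ARM choice paired with every AMD choice
total = mixed + len(arm_only) + len(amd_only)
print(len(arm_only), len(amd_only), mixed, total)
```

Exhaustively measuring a space of this size is impractical, which motivates predicting time and energy from a model instead.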
Contributions
• Model‐driven approach: measurement‐based analytical model to determine energy‐efficient configurations on a mix of heterogeneous nodes
  – meets a time deadline with minimum energy
• Our analysis shows that the energy‐deadline Pareto frontier consisting of heterogeneous mixes is almost always more energy‐efficient than homogeneous clusters
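To make the energy‐deadline Pareto frontier concrete, here is a minimal sketch that filters predicted (time, energy) pairs. The configuration names and numbers are invented for illustration and are not results from the talk; the function names are assumptions.

```python
def pareto_frontier(configs):
    """Keep configurations that no other configuration beats on both time and energy.

    configs: list of (name, time_s, energy_j) tuples.
    """
    frontier = []
    best_energy = float("inf")
    # Scan in order of increasing time; keep a point only if it lowers energy.
    for cfg in sorted(configs, key=lambda c: (c[1], c[2])):
        if cfg[2] < best_energy:
            frontier.append(cfg)
            best_energy = cfg[2]
    return frontier


def min_energy_within_deadline(configs, deadline_s):
    """Lowest-energy configuration that meets the deadline, or None."""
    feasible = [c for c in configs if c[1] <= deadline_s]
    return min(feasible, key=lambda c: c[2]) if feasible else None


# Hypothetical predictions (name, time in seconds, energy in joules).
candidates = [
    ("arm-only", 100.0, 50.0),
    ("amd-only", 40.0, 120.0),
    ("mix-a", 60.0, 70.0),
    ("mix-b", 80.0, 90.0),
]
frontier = pareto_frontier(candidates)
choice = min_energy_within_deadline(candidates, deadline_s=70.0)
print([c[0] for c in frontier], choice[0])
```

In this made‐up data "mix-b" is dominated by "mix-a" (slower and more energy), so it drops off the frontier; with a 70 s deadline the selected configuration is "mix-a".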
Model‐driven Approach
[Diagram labels: Applications; Heterogeneous Systems; workload parameters; system parameters; Non‐intrusive Baseline Execution; baseline measurement; Time‐Energy Performance Model; energy‐efficient Pareto‐optimal configurations]
• considers different ISAs
• resource overlap
• unifying unit of work
Applications
Domain              Program        Problem Size
HPC                 EP             2,147,483,648 random numbers
Web Server          memcached      600,000 GET/SET operations
Streaming video     x264           600 frames, 704 × 576
Financial           Black‐scholes  500,000 stock options
Speech recognition  Julius         2,310,559 samples
Web security        RSA‐2048       5,000 key verifications
Heterogeneous System
• ARM v7‐A Cortex‐A9: quad‐core, 0.2 to 1.4 GHz
• AMD K10, x86_64: six‐core, 0.8 to 2.1 GHz
Baseline Execution
• Measurements needed only for a single node, for each type of node
  – non‐intrusive hardware performance counters
• Execute the program for a very small problem size
  – measure instructions, computation cycles and stall cycles
  – e.g. measure instructions per GET operation of memcached
• Execute micro‐benchmarks to measure active and stall power of processor cores
Execution Time Model
A parallel application runs on n_ARM ARM nodes and n_AMD AMD nodes.
• match the execution rates between ARM and AMD nodes: $T(n_{ARM}) \approx T(n_{AMD})$
• within a type of node, workload is equally divided: $T(n_{ARM}) \approx T(1) / n_{ARM}$
• CPU and I/O overlap: $T(1) \approx \max(T_{CPU}, T_{I/O})$
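Matching the execution rates of the two node types amounts to splitting the work so that both sides finish together. The sketch below assumes that time scales linearly with the share of work, that work is divided equally within a node type (so n nodes are n times faster than one), and that the single‐node times are known; all numbers and names are hypothetical.

```python
def single_node_time(t_cpu, t_io):
    """Single-node time when CPU work and I/O overlap."""
    return max(t_cpu, t_io)


def split_work(t1_arm, n_arm, t1_amd, n_amd):
    """Return (ARM share of the work, common finish time).

    t1_arm, t1_amd: time for ONE node of each type to run the whole workload.
    """
    rate_arm = n_arm / t1_arm  # workloads per second delivered by all ARM nodes
    rate_amd = n_amd / t1_amd  # workloads per second delivered by all AMD nodes
    arm_share = rate_arm / (rate_arm + rate_amd)
    finish_time = 1.0 / (rate_arm + rate_amd)
    return arm_share, finish_time


# Hypothetical single-node times in seconds for the whole workload.
t1_arm = single_node_time(t_cpu=400.0, t_io=150.0)
t1_amd = single_node_time(t_cpu=100.0, t_io=60.0)
arm_share, finish_time = split_work(t1_arm, 10, t1_amd, 10)
print(arm_share, finish_time)
```

With these invented inputs the ARM side receives a fifth of the work, and both sides finish at the same time, which is the condition T(nARM) ≈ T(nAMD).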
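Execution Time Model
$T_{core} \approx T_{core,work} + T_{core,stall}$
$T_{core,work} \approx I_{core} \times WPI / f$
$T_{core,stall} \approx I_{core} \times SPI_{core} / f$
• stall cycles increase linearly with
  – increase in core clock frequency
  – increase in the number of cores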
Stalls due to Memory Contention
Energy Model
• Total Energy = $E_{ARM} \times n_{ARM} + E_{AMD} \times n_{AMD}$
• $E_{node} = E_{core} + E_{mem} + E_{I/O} + E_{idle}$
• $E_{core} = P_{core,act} \times T_{core,work} + P_{core,stall} \times T_{core,stall}$
  – power × time
  – uses the execution time model
  – measured values for $P_{core,act}$, $P_{core,stall}$, $P_{I/O}$
  – $P_{mem,act}$, $P_{mem,stall}$ for ARM and AMD from literature and spec.
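The three energy equations compose as follows. This is a sketch only: the power and time values are invented placeholders rather than measurements, and the memory, I/O and idle energies are passed in directly instead of being modeled.

```python
def core_energy(p_act, t_work, p_stall, t_stall):
    """E(core) = P_core,act * T_core,work + P_core,stall * T_core,stall."""
    return p_act * t_work + p_stall * t_stall


def node_energy(e_core, e_mem, e_io, e_idle):
    """E_node = E(core) + E(mem) + E(I/O) + E(idle)."""
    return e_core + e_mem + e_io + e_idle


def total_energy(e_arm, n_arm, e_amd, n_amd):
    """Total Energy = E_ARM * n_ARM + E_AMD * n_AMD."""
    return e_arm * n_arm + e_amd * n_amd


# Hypothetical per-node values: power in watts, time in seconds, energy in joules.
e_arm = node_energy(core_energy(2.0, 10.0, 1.0, 5.0), e_mem=5.0, e_io=2.0, e_idle=8.0)
e_amd = node_energy(core_energy(20.0, 3.0, 10.0, 1.0), e_mem=10.0, e_io=4.0, e_idle=36.0)
cluster_energy = total_energy(e_arm, 10, e_amd, 2)
print(e_arm, e_amd, cluster_energy)
```

The work and stall times fed into `core_energy` are the outputs of the execution time model, which is how the two models are coupled.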
Model Summary
Execution Time Model
$T = \max(T_{ARM}, T_{AMD})$
$T_{ARM} = \max(T_{CPU,ARM}, T_{I/O,ARM})$
$T_{CPU,ARM} = \max(T_{core,ARM}, T_{mem,ARM})$
$T_{core,ARM} = I_{core,ARM} \times (WPI_{ARM} + SPI_{core,ARM}) / f_{ARM}$
$T_{mem,ARM} = I_{core,ARM} \times (WPI_{ARM} + SPI_{mem,ARM}) / f_{ARM}$
$T_{i/o,ARM} = \max(T_{I/O,ARM}, 1/\lambda_{I/O})$
Energy Model
$E = E_{ARM} + E_{AMD}$
$E_{ARM} = (E_{core,ARM} + E_{mem,ARM} + E_{I/O,ARM} + E_{idle,ARM}) \times n_{ARM}$
$E_{core,ARM} = (P_{core,act,ARM} \times T_{act,ARM} + P_{core,stall,ARM} \times T_{stall,ARM}) \times c_{act,ARM}$
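The execution time summary can be traced with a short sketch. It assumes that the cycle terms are divided by the clock frequency f, and it reads WPI and SPI as work and stall cycles per instruction, which is an interpretation since the slide does not define them. All inputs are invented.

```python
def core_time(instructions, wpi, spi_core, freq_hz):
    """T_core = I * (WPI + SPI_core) / f, assuming cycles divided by frequency."""
    return instructions * (wpi + spi_core) / freq_hz


def mem_time(instructions, wpi, spi_mem, freq_hz):
    """T_mem = I * (WPI + SPI_mem) / f, under the same assumption."""
    return instructions * (wpi + spi_mem) / freq_hz


def node_type_time(t_core, t_mem, t_io):
    """T_type = max(T_CPU, T_I/O) with T_CPU = max(T_core, T_mem)."""
    return max(max(t_core, t_mem), t_io)


def cluster_time(t_arm, t_amd):
    """T = max(T_ARM, T_AMD): the slower node type sets the finish time."""
    return max(t_arm, t_amd)


# Hypothetical ARM inputs: 1e9 instructions at 1 GHz.
t_core_arm = core_time(1e9, wpi=1.0, spi_core=0.5, freq_hz=1e9)
t_mem_arm = mem_time(1e9, wpi=1.0, spi_mem=1.0, freq_hz=1e9)
t_arm = node_type_time(t_core_arm, t_mem_arm, t_io=0.5)
t_total = cluster_time(t_arm, t_amd=1.2)
print(t_core_arm, t_mem_arm, t_arm, t_total)
```

In this example the memory term dominates on ARM, so the ARM side, and therefore the whole run, is memory bound.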
Model Validation
Performance‐to‐Power Ratio
[Figure annotations: memory bound on ARM; x86 ISA has special instructions for cryptography]
Research Questions
1. Is heterogeneity better than homogeneity?
2. Are larger mixes of heterogeneous nodes better?
3. …
Heterogeneity versus Homogeneity
[Figure: the 36,380 configurations]
Heterogeneity
• Enables a sweet region
• Saves more energy for a given deadline
Are larger mixes better?
• Larger mixes are more energy efficient
• Larger mixes enable a larger number of “sweet spots”
Observations
1. Heterogeneity allows larger energy savings compared to homogeneous systems.
2. Larger mixes increase the number of configurations in the sweet region.
3. …
Conclusions
• measurement‐driven analytical model to determine energy‐efficient configurations for a single workload on a heterogeneous mix with different ISAs
• heterogeneity is almost always more energy‐efficient than homogeneity
  – but not for programs with a large sequential fraction and high parallel overhead
L. Ramapantulu, B.M. Tudor, D. Loghin, T. Vu and Y.M. Teo, Modeling the Energy Efficiency of Heterogeneous Clusters, Proceedings of the 43rd International Conference on Parallel Processing, Minneapolis, USA, Sep 2014.
Heterogeneous Low‐power Systems
1. Nov 2014 – 12‐node Heterogeneous CPU‐GPU Cluster (Jetson TK1) with 44 ARM cores & 2,304 GPU cores
2. Aug 2014 – 32‐core ODROID‐XU3 big.LITTLE (ARM A15 + A7)
3. Jun 2014 – Brawny and Wimpy Systems with GPUs (NVIDIA GTX 750 Ti)
4. Oct 2013 – Heterogeneous Low‐power System with GPU (NVIDIA Tegra 3 based Kayla platform)
5. Sep 2013 – 32‐core ODROID XU+E big.LITTLE (ARM A15 + A7)
29 June 2015 Computer Systems Group ‐ Clusters 37
big.LITTLE Node: ODROID XU+E (Sep 2013)
• CPU: Samsung Exynos 5410 Octa with ARM Cortex‐A15 (1.6 GHz) quad core + Cortex‐A7 quad core + GPU
• Memory: 2 GB LPDDR3
• Power: 5 V, 4 A
big.LITTLE ODROID XU+E Cluster (Sep 2013): 16 A15 + 16 A7
Brawny & Wimpy Systems with GPU (June 2014)*
[Diagram labels: Brawny System with GPU (Dell Optiplex + NVIDIA GTX 750 Ti); Wimpy System with GPU (Kayla DevKit + NVIDIA GTX 750 Ti); Controller; 1 Gbps links; Power Monitor; Power Line (240 V AC); Serial interface]
* Big Data on Heterogeneous Systems with GPUs, Nvidia GPU Technology Workshop, Singapore, July 2014.
6‐node Nvidia Jetson TK1 Cluster (Nov 2014)
• CPU: 4‐core ARM Cortex‐A15
• GPU: 192‐core NVIDIA Kepler
• Memory: 2 GB LPDDR3
• Storage: 16 GB eMMC 4.51
• Network: 1 Gbps
• Power: 12 V, 5 A
Publications
http://www.comp.nus.edu.sg/~teoym
1. D. Loghin, B.M. Tudor, H. Zhang, B.C. Ooi and Y.M. Teo, A Performance Study of Big Data on Small Nodes, Proc of 41st International Conference on Very Large Data Bases, Vol. 8, No. 7, Hawaii, USA, Aug 31‐Sep 4, 2015.
2. L. Ramapantulu, D. Loghin and Y.M. Teo, An Approach for Energy Efficient Execution of Hybrid Parallel Programs, Proceedings of 29th IEEE International Parallel & Distributed Processing Symposium, Hyderabad, India, May 25‐29, 2015 (acceptance 22%).
3. D. Loghin, B.M. Tudor and Y.M. Teo, An Approach for Direct Dataflow Execution on Contemporary Multicore Systems, Proc of 3rd International Workshop on Dataflow Execution Models for Extreme Scale Computing, IEEE Computer Society Press, in conjunction with PACT2013, Edinburgh, Scotland, Sep 2013.
4. B.M. Tudor and Y.M. Teo, On Understanding the Energy Consumption of ARM‐based Multicore Servers, ACM SIGMETRICS, Carnegie Mellon University, Pittsburgh, USA, June 17‐21, 2013 [acceptance: 27 of 196] (featured article in HPCwire: Mapping the Energy Envelope of Multicore ARM Chips, 6 June 2013).
5. B.M. Tudor and Y.M. Teo, Towards Modeling Parallelism and Energy Performance of Multicore Systems, Proc of 26th IEEE International Parallel & Distributed Processing Symposium, Shanghai, China, May 21‐25, 2012. [PhD Forum Best Poster Award]
6. B. Tudor, Y.M. Teo and S. See, Understanding Off‐chip Contention of Parallel Programs in Chip Multiprocessors, Proc. of 40th International Conference on Parallel Processing, Taipei, Taiwan, Sep 2011 (acceptance 19%).
Questions & Answers
Thank you!
http://www.comp.nus.edu.sg/~teoym
Email: [email protected]
Acknowledgements
Computer Systems Group, NUS
Funding: National Research Foundation, Ministry of Education, Nvidia, Oracle (Sun), …