AYŞE TOSUN 2008800072
CMPE 511 COMPUTER ARCHITECTURE
25.12.2008
Power Efficiency and Voltage Scaling on CMP
OUTLINE
“Compiler Directed Proactive Power Management for Networks”
“Reducing Energy Consumption of On-Chip Network Through a Hybrid Compiler-Runtime Approach”
Proposed Approach
Results
  Energy Consumption
  Performance Penalty
F. LI, G. CHEN, M. KANDEMIR, M.J. IRWIN
Compiler Directed Proactive Power Management for Networks
Introduction
Parallel computation platforms (on-chip and off-chip) make power/energy optimization an important issue.
The most common prior approaches are hardware-based and reactive.
They manage the power consumption of the network as a dynamic response to message traffic.
They control network power by observing past network traffic and estimating future traffic.
Disadvantages:
  Missing important opportunities for saving power.
  Incurring performance penalties due to inaccuracies in predicting idle and active times.
Motivation
Potential power penalty of reactive schemes:
  Waiting unnecessarily to ensure that the link is idle.
  The transition time the link needs to go from the power-down state to the power-on state when the next request comes.
Goal: eliminate these problems of reactive power management schemes.
A PROACTIVE and COMPILER-BASED approach for LOOP-INTENSIVE applications, whose data access and communication patterns can be captured at COMPILE TIME.
Proposed Approach
The compiler algorithm decides:
1. Which links of the network can be turned off
2. When they can be turned off
3. When to reactivate turned-off links
Experiments on embedded on-chip networks
  Effective in reducing network energy
  More power savings
  No observable performance penalty
Network Architecture and Hardware Support
M x N mesh architecture
p_i = i-th node
p_i and p_j are adjacent to each other and are connected by links (i, j) and (j, i).
Assumption: the system runs a single embedded application, a set of parallel processes, at a given time.
Each node executes at most one process.
The set of links involved in a connection between two non-adjacent nodes is determined by a specific routing algorithm (X-Y routing).
A message is first passed in the x-dimension and then in the y-dimension.
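The X-Y routing rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `xy_route` and the representation of nodes as `(x, y)` coordinates are assumptions made for the example.

```python
def xy_route(src, dst):
    """Return the directed links (pairs of (x, y) nodes) on the X-Y route."""
    (x, y), (dx, dy) = src, dst
    links = []
    # First move along the x dimension...
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    # ...then along the y dimension.
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


print(xy_route((0, 0), (2, 1)))
# [((0, 0), (1, 0)), ((1, 0), (2, 0)), ((2, 0), (2, 1))]
```

Because the route depends only on the source and destination, the set of links used by a connection is known statically, which is what lets a compiler reason about link usage.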
Network Architecture
To support link shutdown
To monitor link utilization in the sender node
• Time-out counter: decremented at each clock tick
• When it reaches zero, Tx and Rx are turned off to conserve energy
Annotations from the hardware figure (counter and control flags):
• When 0: the link is turned off
• When 1: the link is shared
• When 1: the link won't be used for a long period of time
• When 1: the sender will re-use the link
Network Architecture
Using control flags:
1. Controls the states of the links.
2. Keeps track of the route from source to destination when they are not adjacent to each other.
LAST: the program can turn off idle links earlier than the pure time-out hardware mechanism.
HOLD and SHARED: the program can prevent nodes from turning off links that are still needed, which reduces potential power/energy penalties.
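A toy model may help show how a time-out counter and the LAST and HOLD flags could interact. The slides do not give the exact hardware semantics, so everything here is an assumption for illustration: the class name `LinkController`, the reload value `timeout`, and the rule that HOLD overrides LAST. SHARED is left out of the sketch.

```python
class LinkController:
    """Toy model of one sender-side link with a time-out counter (hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout   # assumed reload value, in clock ticks
        self.counter = timeout
        self.powered = True
        self.hold = False

    def send(self, last=False, hold=False):
        """Send one message; the flags are assumed to travel with it."""
        self.powered = True      # reactivation latency is ignored in this sketch
        self.hold = hold
        if last and not hold:
            self.powered = False  # LAST: shut down right after the message, no waiting
        else:
            self.counter = self.timeout

    def tick(self):
        """Advance one clock tick; the counter runs only while HOLD is clear."""
        if self.powered and not self.hold:
            self.counter -= 1
            if self.counter <= 0:
                self.powered = False  # Tx and Rx are turned off


link = LinkController(timeout=3)
link.send()
for _ in range(3):
    link.tick()
print(link.powered)  # False: the pure time-out path needed 3 idle ticks
```

With `send(last=True)` the same link is off immediately, which is the saving the LAST flag is meant to capture.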
High-Level Power Parameters
To manage link power at compiler level:
Compiler turns off a link if it predicts that link will be idle for a time period that is longer than threshold, T.
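The threshold test can be illustrated as follows. This is a sketch under stated assumptions: the compiler is assumed to have already estimated the cycles at which a link is used, and the function name `turnoff_windows` and the cycle values are made up for the example.

```python
def turnoff_windows(use_times, threshold):
    """Return (start, end) idle gaps between consecutive uses that exceed the threshold."""
    times = sorted(use_times)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > threshold]


# Estimated use cycles of one link, with T = 100 cycles.
print(turnoff_windows([0, 10, 500, 520, 2000], threshold=100))
# [(10, 500), (520, 2000)]
```

Only the two long gaps qualify; the short gaps stay powered because turning the link off there would cost more than it saves.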
Compiler Support: Setting Link Control Flags for Link Turnoffs
Iteration sets H and G are computed using Presburger formulas.
Finding the optimal H and G is very hard, if not impossible.
Connection(i,j): set of links used by the connection between nodes i and j.
Targets(i,I): set of nodes to which node i sends messages at iteration I.
Links(i,I): links used by node i at iteration I, computed from Connection and Targets.
Use(i,I,q): set of links used by node i during iterations I to I+q.
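The relationship between these sets can be sketched directly. The code below is an illustration only: it assumes Links(i, I) is the union of Connection(i, j) over all targets j, and that Use(i, I, q) covers the inclusive range of iterations I to I+q. The 1x4 chain topology and the send pattern are hypothetical.

```python
def links(i, it, targets, connection):
    """Links(i, I): union of Connection(i, j) over every j in Targets(i, I)."""
    used = set()
    for j in targets(i, it):
        used |= connection(i, j)
    return used


def use(i, it, q, targets, connection):
    """Use(i, I, q): links used by i in iterations I..I+q (inclusive range assumed)."""
    used = set()
    for k in range(it, it + q + 1):
        used |= links(i, k, targets, connection)
    return used


# Hypothetical 1x4 chain of nodes 0..3; messages travel hop by hop.
def connection(i, j):
    step = 1 if j > i else -1
    return {(n, n + step) for n in range(i, j, step)}


# Hypothetical pattern: node 0 sends to node 1 on even iterations, node 3 on odd ones.
def targets(i, it):
    if i != 0:
        return set()
    return {1} if it % 2 == 0 else {3}


print(sorted(links(0, 0, targets, connection)))   # [(0, 1)]
print(sorted(use(0, 0, 1, targets, connection)))  # [(0, 1), (1, 2), (2, 3)]
```

A link that does not appear in Use(i, I, q) for a large enough q is a candidate for being turned off after iteration I.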
Compiler Support: Inserting Pre-activation Instructions
Pre-activation: a link that is currently turned off can be turned on a certain number of cycles before it is actually used.
The communication link is thus activated before it is needed, which avoids incurring the associated reactivation latency.
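As a small worked example of the timing, the turn-on instruction has to be issued at least one reactivation latency before the next use. The function name and the cycle values below are assumptions for illustration, not figures from the paper.

```python
def preactivation_cycle(next_use_cycle, reactivation_latency, turnoff_cycle):
    """Cycle at which to issue the link turn-on so the link is ready at its next use."""
    start = next_use_cycle - reactivation_latency
    # The turn-on can never be scheduled before the link was turned off.
    return max(start, turnoff_cycle)


# Link turned off at cycle 200, next used at cycle 1000, 30 cycles to reactivate.
print(preactivation_cycle(1000, 30, 200))  # 970
```

Issuing the turn-on at cycle 970 hides the 30-cycle reactivation latency behind computation instead of stalling the message at cycle 1000.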
Example
Discussion
Their approach requires the following two conditions to be satisfied:
1. Message routing in the network must be static.
   The set of links used to transfer a message from one node to another must be determined at compile time.
2. The message passing behavior of the embedded application must be predictable at compile time.
   Many parallel applications satisfy this requirement, as they are array/loop-intensive codes with rare conditional flow of execution.
Although they apply the system to a single embedded application, this is not strictly necessary for their approach to be applicable.
Experimental Results
They introduced a link power model and power simulator to compare their approach with hw-based approach.
Experimental Results
Power-on: link energy consumed in link power-on states
Reactivation: energy penalty for reactivation
Results of SW are normalized w.r.t. HW results.
Compiler-based saves more energy, since it shuts down a link proactively.
When it decides to turn off a link, link does not need to wait for some time to turn off.
Since it is proactive instead of reactive, it assesses the benefits of link turnoff more accurately.
However, it incurs extra energy overheads in processors and switches due to extra instructions and logic circuits.
But they are negligible when compared to others.
Experimental Results
Avg. 6.6% latency with HW
Since no adaptive routing algorithm
Avg. network latency penalty of 3.5% due to reactivation delay
Sensitivity analysis
  Change mesh size
  Change data size
G. CHEN, F. LI, M. KANDEMIR
Reducing Energy Consumption of On-Chip Network Through a Hybrid Compiler-Runtime Approach
Introduction
Compiler-runtime approach for reducing power consumption in the context of the NoC based chip multiprocessor (CMP) architectures.
Observation: Same communication patterns across the nodes of a mesh based CMP repeat themselves in successive iterations of a loop nest.
NoC
  Communication between different blocks
  Expandable, and reconfigurable to handle different communication patterns
  Can easily respond to fault conditions where one or more connections are disabled
Critical issue: power consumption of the NoC
  The NoC by itself is responsible for up to 36% of the overall power consumption of a SoC.
Introduction
This paper investigates automated compiler support in reducing power consumption of an NoC based two-dimensional mesh architecture that uses a static routing algorithm.
The NoC is exposed to the compiler through an interface.
Goal: let the compiler modify the application source code.
Goal: manage the power consumption of communication links through voltage scaling.
Two stages:
Start-Up Phase: gather link usage statistics during the execution of the first few iterations of a given loop nest, using hardware support.
Stable Phase: use the collected statistics, with the compiler, to reduce link voltage levels (admissible delays) without affecting communication latency.
Proposed Approach
                       HW-based   Hybrid
Energy savings         24.9%      38.1%
Performance overhead   8.3%       2.1%
Link Voltage Scaling Options: On-Chip Mesh Network
Each node consists of a processor, a switch and a small memory module.
Switch: 5 incoming and 5 outgoing ports
Each incoming port contains a queue to buffer messages.
When the queue is full, the outgoing port is blocked.
The switch is used to read/set the state of the in/out ports.
Link Voltage Scaling Options: Scaling Link Voltage
Link(i,j) refers to the link between sender Ni and receiver Nj.
A parallel program may use multiple connections (paths between non-adjacent nodes) during its execution.
These connections may share some communication links.
Packets transferred have to contend for the shared links.
Such contention may increase the transmission time of packets.
To take advantage of this observation:
  Calculate the voltage of communication links using link throughput-based and link slack-based scaling methods to select a lower voltage level.
Link Throughput Based Voltage Scaling
Data rate of a link (λij): max. number of data packets that can be transferred from the outgoing port to the incoming port of this link during a unit of time.
  Depends on the voltage of the outgoing and incoming ports.
Throughput of a link (μij): number of packets forwarded from the incoming port of this link to other links during a unit period of time.
  The throughput of a link is limited by the throughput of the links to which it is connected.
Under heavy traffic, contention at bottleneck links can be severe.
A link forwarding packets to a bottleneck link can become congested.
  The bottleneck link cannot accept packets fast enough.
Link Throughput Based Voltage Scaling
If a link is congested, its input queue will fill up:
  No more packets can flow into this link until at least one packet in the queue is forwarded to another link.
  Its throughput can be much lower than its data rate.
Therefore, when congestion happens:
  The bandwidth of bottleneck links is fully utilized, while
  the bandwidth of congested links is underutilized.
Reduce the voltage of congested links to conserve energy without significantly degrading overall performance.
A communication link is considered congested during a period if the queue associated with this link never becomes empty during that period.
Link Throughput Based Voltage Scaling
Strategy: if, during a given period, the queue associated with link (i,j) never becomes empty and μij < λij, then reduce the voltage of the link to the lowest level v such that f(v) > μij, where f(v) is the max. data rate that a communication link can provide at voltage v.
Reducing the data rate of a link that is not congested may hurt performance.
So they apply this strategy only to congested links with the specified properties.
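The selection rule can be written out as a short sketch. It assumes a small table of discrete voltage levels with their maximum data rates f(v), and that f grows with v; the function name, the level values, and the rates are hypothetical.

```python
def scale_congested_link(mu, lam, queue_never_empty, rate_of, current_v):
    """Return the voltage for a link after throughput-based scaling.

    rate_of maps each available voltage level v to f(v), its maximum data rate.
    """
    if not (queue_never_empty and mu < lam):
        return current_v                  # not congested: leave the link untouched
    candidates = [v for v, f in rate_of.items() if f > mu]
    return min(candidates, default=current_v)


# Hypothetical voltage levels (V) and data rates (packets per unit time).
levels = {0.8: 2.0, 1.0: 4.0, 1.2: 6.0, 1.4: 8.0}
print(scale_congested_link(3.5, 8.0, True, levels, 1.4))   # 1.0
print(scale_congested_link(3.5, 8.0, False, levels, 1.4))  # 1.4
```

In the first call the link only forwards 3.5 packets per unit time, so the 1.0 V level (rate 4.0) is enough; in the second call the queue does become empty, so the link is not treated as congested and keeps its voltage.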
Link Slack Based Voltage Scaling
For links whose queues may become empty during a given period of time:
  There are opportunities to scale down the voltage without significantly degrading system performance.
Hybrid Approach: Hardware Support
Structure of a link from switch i to j
Control voltage of the circuit
Control voltage of outgoing port
Registers to count slack, cycles, emptiness of incoming port queues, etc.
Hybrid Approach: Compiler Support
Take a message passing based parallel code as input
Partition each loop nest in each parallel process code into set of voltage scaling regions
Set link voltages upon entering a voltage region
Start-up: link usage info is collected individually.
Stable: set suitable link voltage levels for the different voltage regions.
Determining Voltage Scaling Regions
Voltage scaling region
  Loops or loop nests such that a communication pattern repeats itself at every iteration
Partitioning algorithm
  Put as many loops or loop nests as possible in the same voltage scaling region
  Minimize the number of link voltage changes (overheads)
Communication pattern
  If two loop nests have different communication patterns, they are not likely to exercise communication links in the same way
Determining Voltage Scaling Regions
First, compute the communication pattern for loop nest L.
If this pattern is not ε, treat the entire loop nest as a single region.
Otherwise, partition it into inner loops with different communication patterns.
When partitioning, compute the communication pattern for each inner loop.
Put adjacent inner loops with the same pattern into the same region.
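The final grouping step can be sketched as below. This covers only the merging of adjacent inner loops, not the recursive descent into loop nests, and it assumes each inner loop already carries a comparable communication pattern; the loop names and pattern labels are hypothetical.

```python
from itertools import groupby


def partition_regions(inner_loops):
    """Group adjacent inner loops that share a communication pattern.

    inner_loops is a list of (loop_id, pattern) pairs in program order.
    """
    regions = []
    for pattern, group in groupby(inner_loops, key=lambda lp: lp[1]):
        regions.append((pattern, [loop_id for loop_id, _ in group]))
    return regions


loops = [("L1", "A"), ("L2", "A"), ("L3", "B"), ("L4", "A")]
print(partition_regions(loops))
# [('A', ['L1', 'L2']), ('B', ['L3']), ('A', ['L4'])]
```

L4 has the same pattern as L1 and L2 but is not adjacent to them, so it forms its own region. Each region boundary is a point where link voltages may have to change, which is why fewer, larger regions mean less overhead.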
Determining Voltage Scaling Regions
Loops 3, 5, 6, 7, 8 are inner loops.
Communication patterns are calculated by counting the number of send and receive statements in their bodies.
Assume 3, 5, 6, 7 have the same pattern.
2 encloses the same
3 encloses the same (adjacent to 2)
1 and 4 have different patterns
Code Transformation For Voltage Scaling
Start-up
  A data structure called the sampling context is created
  Before the backward jump of each loop:
    Collect link usage info
    Store it in the sampling context
  Calculate suitable voltage levels
Stable
  For each region, set the voltage levels
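The two phases can be pictured with a small sketch of what the inserted code might do for one link. This is not the authors' transformation: the class layout, the use of peak throughput as the only statistic, and the voltage table are all assumptions for illustration.

```python
class SamplingContext:
    """Hypothetical per-link sampling context for one voltage scaling region."""

    def __init__(self, startup_iters):
        self.startup_iters = startup_iters
        self.samples = []     # one throughput reading per start-up iteration
        self.voltage = None   # decided when the start-up phase ends

    def end_of_iteration(self, throughput, rate_of):
        """Call before the loop's backward jump; returns the voltage to apply, if any."""
        if self.voltage is None:
            # Start-up phase: collect link usage info and store it.
            self.samples.append(throughput)
            if len(self.samples) == self.startup_iters:
                peak = max(self.samples)
                fast_enough = [v for v, f in rate_of.items() if f > peak]
                self.voltage = min(fast_enough, default=max(rate_of))
        # Stable phase: keep returning the chosen level.
        return self.voltage


levels = {0.8: 2.0, 1.0: 4.0, 1.2: 6.0}
ctx = SamplingContext(startup_iters=3)
print([ctx.end_of_iteration(t, levels) for t in [1.0, 3.0, 2.5, 2.0]])
# [None, None, 1.0, 1.0]
```

The first two iterations only sample; at the end of the third the peak throughput of 3.0 selects the 1.0 V level, and every later iteration runs in the stable phase with that level.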
Experiments
Code level msg optimizations
Additional compilation time overhead was below 30% for all benchmark codes
The number of voltage levels detected varied between 3 and 28.
Static code size increase was less than 10%.
Dynamic instruction count increase is negligible.
Custom network simulator
Experiments
Network energy savings: 38.1% (hybrid) vs. 24.9% (HW).
Hybrid approach
  Sets voltage levels proactively.
  Changes the voltage to a suitable level directly.
Voltage scaling has costs.
Avg. performance degradation: 2.1% (hybrid) vs. 8.3% (HW).
Hybrid scales voltage based on:
  Max. throughput
  Number of slack cycles for each link
Experiments
Due to modifications in application codes:
  Performance and energy penalties in mesh nodes
First table:
  Energy overheads: all dynamic and leakage overheads that occur in CPUs and memory components.
  Avg. overhead: 1.13%
Second table:
  Normalized total energy consumption
  Reduced by 10.7% vs. 4.29%.
  Hybrid scheme performs better when all overheads are accounted for
Experiments
Comparison with the compiler-based approach and the optimal scheme
Hybrid takes runtime communication behavior into account and catches the opportunities for link reuse
  Achieves higher energy savings
Hybrid combined with the compiler-based link shutdown scheme
  Close to optimal
  Extra network energy in start-up
  Selection of sub-optimal voltage levels
QUESTIONS
Thank you.