PUE: when a perfect metric becomes the enemy of good datacenter design
Posted on 24-Jul-2016
Cyrille Brisson – Janne Paananen
Eaton Electrical
Why PUE is a great, widely
adopted hygiene KPI
• PUE measures energy expended on non-value-added tasks
– Removing power defects that need not exist
– Removing heat, which is a by-product of computing
• PUE clearly helps to improve efficiency
– Short-term by providing a baseline for improvement
– Long-term by pointing at obvious waste, even though it
was not created as an industry benchmark
There are other, less good reasons
why PUE is so popular…
• It is the only reasonably well defined and accepted energy
efficiency measurement in the datacenter industry
• It fits both the organization of traditional enterprise datacenters
and the co-lo space by separating Facilities and IT efficiency
• It provides a simple number everyone believes they understand,
and that has sometimes turned into a marketing gimmick
… but overall it provides a useful
benchmark for newly built DCs
• Improvement has accelerated now that PUE is widely
adopted, and new datacenters should aim for a PUE <1.2
–because they can!
– Good hygiene (containment, 5S…) and modern
cooling technologies have made <1.1 partial PUE
achievable in most western climates
– Modern critical power chain can achieve <1.1 partial
PUE at any load by leveraging DSPs, 3-level
inverters and smart reticulation
DSPs help increase efficiency
without trade-offs on resiliency…
Real-time distributed monitoring of the critical bus in the UPS enables
synchronized behavior in time-critical situations – no SPF
… by enabling instant adaptation of
the power chain to the load level
Decentralized processing allows for instant reactions
The coming IEC 30134 will correct
some potential issues such as…
• Confirming that PUE has to be calculated as an average
value over 12 months, not a theoretical spot value
• Standardizing calculation to help spread best practices in
design and improvement
• Establishing clear criteria for using it as a benchmark for new
datacenters
… and there are potential further
ideas for improvement
PUE should be calculated not only across the year, but
also at different load levels (20, 40, 60, 80 %) to anticipate
the full impact of more energy-proportional IT hardware and
larger arrays under the same hypervisor / cloud stack.
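The averaging requirement above can be sketched in a few lines. This is a minimal illustration with made-up monthly meter readings (none of these numbers come from the slides): PUE is computed as a ratio of 12 months of energy, as the coming standard requires, rather than as a single spot value.

```python
# Hypothetical monthly energy readings in kWh (illustrative only).
monthly_total_kwh = [820, 800, 790, 760, 730, 750,
                     780, 790, 770, 760, 790, 810]  # whole-facility energy
monthly_it_kwh    = [600, 600, 595, 590, 585, 580,
                     590, 595, 590, 585, 590, 600]  # IT equipment energy

# PUE per the coming standard: annual facility energy / annual IT energy.
annual_pue = sum(monthly_total_kwh) / sum(monthly_it_kwh)

# Spot values vary with season and load; the annual ratio is the KPI.
spot_pues = [t / i for t, i in zip(monthly_total_kwh, monthly_it_kwh)]

print(round(annual_pue, 3))
print(round(min(spot_pues), 3), round(max(spot_pues), 3))
```

The same energy ratio could be binned by IT load level (20/40/60/80 % of design load) instead of by month to produce the load-level view the slide proposes.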
But… it is possible to improve a
datacenter’s overall efficiency and
degrade PUE at the same time!
• Example #1: a DC equipped with a monolithic UPS using
2-level converters degraded its PUE by:
– Consolidating loads and decommissioning servers
– Upgrading to more energy-proportional servers
(the reduced IT load left the UPS running far below its
efficiency sweet spot, so its fixed losses weighed more)
• Example #2: a DC degraded its PUE by enabling heat
recovery into the district heating network:
– Using a pump to drive the heat exchanger
between its water circuit and the district
heating's
(the pump's draw counts as facility power, while the
recovered heat earns no credit in the PUE formula)
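Example #1 can be made concrete with a short worked calculation. The numbers below are hypothetical (not from the slides), chosen only to show how consolidation can cut total energy draw while the PUE ratio gets worse, because much of the facility overhead is fixed rather than proportional.

```python
def pue(it_kw, overhead_kw):
    """PUE = total facility power / IT power."""
    return (it_kw + overhead_kw) / it_kw

# Before consolidation: 1000 kW of IT load, 400 kW of overhead
# (UPS losses, cooling, distribution).
before = pue(1000, 400)   # 1.40

# After consolidation: IT load halves to 500 kW, but overhead only
# drops to 250 kW because a large share of it is fixed.
after = pue(500, 250)     # 1.50

print(before, after)                  # PUE degrades: 1.40 -> 1.50
print(1000 + 400, 500 + 250)          # total draw falls: 1400 -> 750 kW
```

The metric moves in the wrong direction even though the datacenter now consumes roughly half the energy for the same useful work, which is exactly the slide's point.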
Chasing a super-low PUE can be
positively harmful to overall IT efficiency
• 2 recent trends can cause more harm than good if applied with
insufficient due diligence:
– Moving too much resilience from infrastructure to IT may mean
that the extra IT redundancy needed to maintain the SLAs of
critical applications outstrips the facility gains
– The latest increases in recommended operating temperatures
clearly threaten IT efficiency
Applications have varying uptime
and consistency requirements
• At the most critical end of the spectrum, some
applications (e.g. financial / payment records) run on
ACID databases with no RPO or RTO allowance
• Even some less-critical applications have SLAs that
will force IT redundancy to go up, as the probability
of failure of leaner infrastructure increases
• The cost of keeping multiple versions of applications
and data synchronized increases fast with the
number of instances (linked to probability of failure),
distance & latency
Eliminating power protection suits
only certain types of applications
• If you run a small number of applications not using
ACID databases over a large number of globally
distributed datacenters, you can probably eliminate
backup power layers and tolerate faults
• If you run stateful customer loads and rely on
traditional databases, “saving” on infrastructure could
cost you a lot in duplicated HW and bandwidth
High temperature reduces IT
efficiency in 3 ways
• Processor power leakage
• Fan power
• Vibration from fans and InRow coolers kills hard-drive
efficiency by increasing the number of cycles
required to complete transactions
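The fan-power point follows from the fan affinity laws, under which fan power rises roughly with the cube of fan speed. A minimal sketch with assumed numbers (the wattage and speeds are illustrative, not from the slides):

```python
def fan_power(base_watts, base_rpm, rpm):
    """Fan affinity law: power scales with (rpm / base_rpm) cubed."""
    return base_watts * (rpm / base_rpm) ** 3

# A server fan drawing 10 W at 5000 rpm, forced to spin 30 % faster
# because of a higher inlet temperature:
print(round(fan_power(10, 5000, 6500), 1))  # ~22.0 W, more than double
```

A modest speed increase thus more than doubles fan draw, which is how raising operating temperatures can quietly erase facility-side savings.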
As fan and InRow vibration levels increase,
IO throughput drops and the time
taken to complete workloads goes up
[Charts: OLTP workload throughput vs. vibration; time taken to
update a 10 TB database vs. vibration]
Conclusion
• A datacenter is a system delivering IT services at a
certain environmental cost, and must be optimized
as a whole
– Best practices show what is possible: PUE < 1.2
– Experimenting beyond best practices induces
trade-offs that must be carefully researched
• There is no such thing as a free lunch.