406 bruce worthington_windows_server_power_efficiency_slideshare

CMG ‘08 INTERNATIONAL WINDOWS SERVER POWER EFFICIENCY Dr. Bruce Worthington Principal Software Development Lead Windows Server Performance Microsoft Corporation

Upload: bruce-worthington

Post on 09-Jan-2017


Page 1: 406 bruce worthington_windows_server_power_efficiency_slideshare

CMG‘08 INTERNATIONAL conference

WINDOWS SERVER POWER EFFICIENCY

Dr. Bruce Worthington
Principal Software Development Lead
Windows Server Performance
Microsoft Corporation

Page 2: 406 bruce worthington_windows_server_power_efficiency_slideshare

Server Power Ground Rules
• TANSTAAFL: Everything is a trade-off
    Performance, Power, Functionality, Capacity, Cost, Reliability, Availability, Manageability, Maintainability, Usability, Environmental Impact, Lifetime, Footprint, Security, Morale
• Saving Power vs. Power Efficiency
    More work at a fixed power level, or
    less power at a fixed work level
• Shifting component power efficiencies

Page 3: 406 bruce worthington_windows_server_power_efficiency_slideshare

Outline
• Motivation
• Background
• Windows Server 2003 → 2008
• Windows Server 2008 R2
• Power Diagnostics and Control
• Summary

Page 4: 406 bruce worthington_windows_server_power_efficiency_slideshare

Rising Cost of Ownership
• From 2000 to 2006
    Computing performance: 25x
    Energy efficiency: 8x
    US electricity cost: 1.35x
    Power per $1K of server: 4x
    Server(+) world electricity: >2x
        ○ >1% of total world production
• Datacenters use 2% of all US electricity

Page 5: 406 bruce worthington_windows_server_power_efficiency_slideshare

Scale: Kilowatts → Megawatts
• Idle high-performance servers
    50-80% of max power draw
    2 sockets ~ 250 W
    4 sockets ~ 500 W
    8 sockets ~ 1000 W
• 25 15K rpm 2.5” disks + SAN = 3U
    ~ 300/450 W (idle/active)
• 10,000 2-socket 1U servers ~ 1-3 MW
• Datacenter “container” ~ 0.5 MW
    ~1500 servers + storage + infrastructure

Page 6: 406 bruce worthington_windows_server_power_efficiency_slideshare

Datacenter Energy Demand
• Data centers are energy-intensive facilities
    Server racks now designed to carry 25 kW load
    Surging demand for data storage
    Typical facility ~ 1 MW, can be > 20 MW (even 200 MW)
    Nationally 1.5% of US electricity consumption in 2006
        ○ Doubling every 5 years
• Significant data center building boom
    Power and cooling constraints in existing facilities
    Growing demand for compute cycles
    Growing computing performance
    Commoditized hardware
    Declining cost of computing

Page 7: 406 bruce worthington_windows_server_power_efficiency_slideshare

15 MW Datacenter Monthly Costs
“Good” (PUE=1.7) Internet-scale datacenter with DAS
    Servers: $3,000,000
    Infrastructure: $1,800,000
    Power: $1,000,000
3 yr server and 15 yr infrastructure amortization
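The three monthly figures above can be turned into cost shares with a few lines of arithmetic. The sketch below is only an illustration; the dictionary keys and the function name `cost_shares` are hypothetical, and the dollar amounts are the ones quoted on the slide.

```python
# Monthly figures quoted on the slide (USD).
monthly_costs = {
    "servers": 3_000_000,
    "infrastructure": 1_800_000,
    "power": 1_000_000,
}

def cost_shares(costs):
    """Return each item's fraction of the total monthly cost."""
    total = sum(costs.values())
    return {name: amount / total for name, amount in costs.items()}

shares = cost_shares(monthly_costs)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these inputs, power is a bit over one sixth of the monthly total, while amortized servers account for roughly half.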

Page 8: 406 bruce worthington_windows_server_power_efficiency_slideshare

Datacenter Costs Breakdown - 2
    IT Equipment: 50%
    Cooling: 25%
    Air Movement: 12%
    Electricity Transformer/UPS: 10%
    Lighting, etc.: 3%
Source: EYP Mission Critical Facilities Inc., New York
Other than a common power source they are not connected.

Page 9: 406 bruce worthington_windows_server_power_efficiency_slideshare


Datacenter Costs Breakdown - 1

Page 10: 406 bruce worthington_windows_server_power_efficiency_slideshare


Electricity Use by End-Use: 2000 - 2006

Page 11: 406 bruce worthington_windows_server_power_efficiency_slideshare

Environmental Impact
• Governments, businesses, and organizations are trying to reduce the production of greenhouse gases
• New EPA Energy Star mandates for enterprise server power efficiencies

Page 12: 406 bruce worthington_windows_server_power_efficiency_slideshare

Outline
• Motivation
• Background
    ACPI Power States
    Component Power
• Windows Server 2003 → 2008
• Windows Server 2008 R2
• Power Diagnostics and Control
• Summary

Page 13: 406 bruce worthington_windows_server_power_efficiency_slideshare

ACPI Power State Definitions
• Performance states (P-states)
    Dynamic voltage and frequency scaling
    More than linear savings (cubic function)
• Throttle states (T-states)
    Linear scaling of CPU clock
• “Power” states (C-states)
    Low-power idle (CPU “sleep”) states
    Turn off increasing amounts of silicon in package
• System sleep states (S-states)
    On, standby, hibernate, off
    MS has not encouraged S-state support for servers
        ○ Changing with the increased focus on power
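The "cubic" versus "linear" distinction between P-states and T-states can be illustrated with the textbook dynamic-power model. The sketch below assumes dynamic power proportional to C·V²·f and, as a simplification not stated in the deck, that voltage scales linearly with frequency; leakage and idle power are ignored, and both function names are hypothetical.

```python
def relative_dynamic_power(freq_fraction):
    """Dynamic power relative to full speed under DVFS.

    Assumes P ~ C * V^2 * f with V proportional to f (an idealization).
    """
    voltage_fraction = freq_fraction  # assumption: voltage tracks frequency
    return voltage_fraction ** 2 * freq_fraction

def relative_throttled_power(duty_fraction):
    """Linear clock throttling: voltage is unchanged, so power tracks the duty cycle."""
    return duty_fraction

for fraction in (1.0, 0.75, 0.5):
    print(fraction, relative_dynamic_power(fraction), relative_throttled_power(fraction))
```

Under these assumptions, halving the frequency with a P-state cuts dynamic power to one eighth, while halving the clock with a T-state only halves it.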

Page 14: 406 bruce worthington_windows_server_power_efficiency_slideshare

ACPI Power State State Machine
• For entire system
    Global System States (G-States)
    Sleeping States (S-States)
        ○ Standby (S1), Hibernate (S4), …
• For processor only
    Processor Performance States (P-States)
        ○ Different processor frequency and voltage
    Processor Throttling States (T-States)
        ○ Processor clock throttling to reduce processor utilization (and capacity)
    Processor Power States (C-States)
        ○ Processor is executing instructions (C0)
        ○ Processor is idle (C1, C2, …)
• Other devices
    Device Power States (D-States)
        ○ Similar to C-States, but for devices other than processors

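[Diagram: ACPI state machine. Global states G0 (S0) Working, G1 Sleeping (S1, S2, S3, S4), G2 (S5) Soft Off, G3 Mech Off, and Legacy, linked by Wake Event, Power Failure/Power Off, and BIOS Routine transitions. Within G0: device states D0-D3 for Modem, HDD, and CDROM, and CPU states C0, C1, C2, … Cn, with Performance State Px and Throttling under C0.]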

Page 15: 406 bruce worthington_windows_server_power_efficiency_slideshare

ACPI Specification Versions
• WS03 complies with ACPI 2.0
• WS08 complies with ACPI 3.0
    Multiprocessor
        ○ Dependent (ganged) and independent control
        ○ Independent control w/ dependent behavior (may transition or not based on other processors’ states)
• MS has some ideas for ACPI 3.5

Page 16: 406 bruce worthington_windows_server_power_efficiency_slideshare

ACPI Power State Dependencies
• Dependency domains for ACPI power states (assumes S0)
    Logical processors in the same domain should have the same C-state, P-state, or T-state
    No dependence between a processor’s C-state domain, P-state domain, or T-state domain
• OS control mechanisms based on dependency relationships
    Dependent control: Transitioning one processor to a new state causes other processor(s) to transition to the same state
    Independent control: Transitioning one processor to a new P-state or T-state does not affect other processors’ power states
    Independent control, dependent behavior: Transitioning one processor to a new P-state or T-state may or may not transition other processor(s) to the same state, based on the current state of the other processor(s) that share this relationship

Page 17: 406 bruce worthington_windows_server_power_efficiency_slideshare

P-States
• Windows processor performance states are enabled by default
• Power policy allows flexible use of performance states
    Values for min / max processor speed
    Expressed as a percentage of maximum processor frequency
    Windows will round up to the nearest available state
• Processor- and workload-dependent impact
    E.g., one system configuration was determined to have insignificant perf impact from capping P-states at P1, but significant power savings
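The "round up to the nearest available state" rule can be sketched as a small lookup. This is an illustration of the rule as stated, not Windows' actual implementation; the function name is hypothetical, and the list of available percentages simply reuses the percentage column from the deck's example tables.

```python
# Percent-of-maximum values for the available P-states (illustrative,
# borrowed from the example tables elsewhere in this deck).
AVAILABLE_STATES = [100, 90, 85, 75, 60, 50]

def round_up_to_state(requested_percent, available=AVAILABLE_STATES):
    """Return the lowest available percentage that is >= the requested value."""
    candidates = [p for p in available if p >= requested_percent]
    # A request above the fastest state falls back to the fastest state.
    return min(candidates) if candidates else max(available)

print(round_up_to_state(80))  # a policy value of 80% selects the 85% state
print(round_up_to_state(5))   # a 5% minimum selects the slowest state, 50%
```

Rounding up rather than down means a policy value never yields less performance than the administrator asked for.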

Page 18: 406 bruce worthington_windows_server_power_efficiency_slideshare

Demand Based Switching
• Power policy will always use DBS within the range defined by min / max frequency
    Full range or subset of available P-states
    Policy may be set to use only one performance state (min / max / intermediate)
• Will not include linear clock throttle states

Page 19: 406 bruce worthington_windows_server_power_efficiency_slideshare

P-State Policy: Frequencies - 1
• Example: Processor state power policy
    Note: This is the default policy in WS08
    Intended to minimize performance hit

State  Freq (MHz)  %    Type
0      2800        100  Performance
1      2520        90   Performance
2      2142        85   Performance
3      1607        75   Performance
4      964         60   Performance
5      482         50   Performance

Callouts on the table: Maximum Processor State, Minimum Processor State

Page 20: 406 bruce worthington_windows_server_power_efficiency_slideshare

P-State Policy Settings
• Example: Processor state power policy
    Using a subset of available states
    Can use any contiguous range
    Some performance loss (may not be significant) unless P0 included (targets minimal perf hit)

State  Freq (MHz)  %    Type
0      2800        100  Performance
1      2520        90   Performance
2      2142        85   Performance
3      1607        75   Performance
4      964         60   Performance
5      482         50   Performance

Callouts on the table: Maximum Processor State, Minimum Processor State

Page 21: 406 bruce worthington_windows_server_power_efficiency_slideshare

P-State Policy: Frequencies - 3
• Example: Processor state power policy
    Locking processor at one state
    Any available state may be selected
    Some performance loss (may not be significant) unless P0 is the state chosen (a la High Perf mode)

State  Freq (MHz)  %    Type
0      2800        100  Performance
1      2520        90   Performance
2      2142        85   Performance
3      1607        75   Performance
4      964         60   Performance
5      482         50   Performance

Callout on the table: Min & Max Processor State

Page 22: 406 bruce worthington_windows_server_power_efficiency_slideshare


Setting P-State Parameters

Page 23: 406 bruce worthington_windows_server_power_efficiency_slideshare


Use Perfmon to Monitor P-State

Processor Performance / % of Max Frequency

Page 24: 406 bruce worthington_windows_server_power_efficiency_slideshare

T-States
• Linear clock throttle states (T-states)
    Compared to P-states, T-states do not save energy when performing identical workloads
    However, throttle states may be useful for some scenarios (thermal overload)
    By default, WS08 uses T-states only if P-states are unavailable or in case of thermal overload
    No DBS: only the Maximum Processor State parameter is used
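Why T-states do not save energy on identical workloads follows from energy = power × time. The sketch below is a deliberately idealized model, not a measurement: it assumes throttled power scales linearly with the clock, P-state power scales with the cube of frequency, and run time scales inversely with frequency. Static and platform power are ignored, and the function name is hypothetical.

```python
def energy_for_fixed_work(freq_fraction, mode):
    """Relative energy to finish a fixed amount of work (1.0 = full speed)."""
    run_time = 1.0 / freq_fraction  # same number of cycles at a slower clock
    if mode == "t-state":
        power = freq_fraction        # linear throttle, voltage unchanged
    elif mode == "p-state":
        power = freq_fraction ** 3   # idealized voltage and frequency scaling
    else:
        raise ValueError(f"unknown mode: {mode}")
    return power * run_time

print(energy_for_fixed_work(0.5, "t-state"))  # same energy, twice the time
print(energy_for_fixed_work(0.5, "p-state"))  # less energy, twice the time
```

In this model a T-state stretches the job without reducing the energy it consumes, which is consistent with reserving throttling for thermal emergencies.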

Page 25: 406 bruce worthington_windows_server_power_efficiency_slideshare

T-State Policy: Frequencies
• Default use of linear throttle states
• Performance is directly affected by throttling

State  Freq (MHz)  %    Type
0      2800        100  Performance
1      2520        90   Performance
2      2380        85   Performance
3      2100        75   Performance
4      1680        60   Performance
5      1400        50   Performance
6      1400        50   Throttle
7      1120        40   Throttle
8      840         30   Throttle
9      560         20   Throttle

DBS Allowed (Performance states)
No DBS Allowed (Throttle states)

Page 26: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Capping / Budgeting
• Enforcing per-server power limits (static or dynamic)
    Calculations based on “plate rating” are often over-configured
        ○ Stranded capacity
    OS may not be able to respond fast enough to enforce hard limits when power spikes
    Typically lower-power P-states attempted, then T-states engaged as necessary
        ○ OS might not get a good estimate of the resulting effective frequency
        ○ Monitoring applications and diagnostic tools may give incorrect data
        ○ Opposite strategy from OS, where P-states move towards higher performance modes when load increases
• Potentially huge (and potentially unexpected) hit in performance right when it is most vital
        ○ Sudden hardware throttling should be last resort

Page 27: 406 bruce worthington_windows_server_power_efficiency_slideshare

C-States
• Although hardware may support more than 3 C-states, Windows only utilizes a maximum of 3. But that doesn’t mean Windows only uses the first three hardware C-states:
    C1 = hardware C1
    C2 = hardware C?
        ○ Lowest-power-consuming C-state with _CST of type 2
    C3 = hardware Cn
• Wouldn’t expect P-state to affect C-state power, but it does on some processors
    WS08R2 handles this by providing the capability to drop to Pn before transitioning to C-state

Page 28: 406 bruce worthington_windows_server_power_efficiency_slideshare

Processor Power Management - 1
• CPUs have increasing number and ranges of P-states and C-states
• Ballpark expectations per socket:
    A few watts per P-state
    Tens of watts for lowest C-state(s)
• Varying impact to server throughput and responsiveness
• Mature, reliable technology
    Significant deployments in mobile and desktops

Page 29: 406 bruce worthington_windows_server_power_efficiency_slideshare

Processor Power Management - 2
• No user intervention required
• Managed by the operating system
• Balances power savings with CPU utilization
    Kernel selects target P-state based on processor utilization history, Windows power policies, thread scheduler, system heuristics, node/socket/HW thread hierarchy
    Transition processor to “sleep” C-states when idle (i.e., no thread to run on that processor)

Page 30: 406 bruce worthington_windows_server_power_efficiency_slideshare

Processor Power Management - 3
• Windows’ power policy includes various parameters that influence how the kernel chooses target power states
• Low voltage/power processors must be evaluated and targeted for the right scenarios
    Reduces OS power management flexibility
    Additional servers are required if the workload is CPU-bottlenecked

Page 31: 406 bruce worthington_windows_server_power_efficiency_slideshare

Hardware Support
• The correctness of all PPM tools and settings relies on accurate hardware / firmware support
    Broken BIOSes found in some previous-generation servers
    Reporting
        ○ Initialization of ACPI tables (e.g., power states, memory and I/O controller locations)
        ○ P-state and C-state monitoring
    Controlling
        ○ PPM algorithm depends on correct historical information
        ○ HW should comply/cooperate with OS power state requests

Page 32: 406 bruce worthington_windows_server_power_efficiency_slideshare

Processor Power Management
Working together with OEMs/IHVs - 1
• Hardware must support PPM capabilities
• ACPI namespace must describe capabilities and contain processor objects
• On a processor there may be multiple independently-managed power planes, potentially shared between components, such as:
    Cores, Caches, Memory Controllers, and Bus/Serial interface(s) to other processors or IO components
• The performance impacts of turning off various pieces of silicon must be carefully weighed and understood
        ○ Snooping caches must be flushed before being shut down
        ○ Memory or IO channels attached to a package must still be accessible by other packages
        ○ Bus/Serial interfaces must be running for active caches, memory, or IO
        ○ Different components have different power-up delays from the various power states they support

Page 33: 406 bruce worthington_windows_server_power_efficiency_slideshare

Collaborative Power Budgeting
• Ideal WS08R2 strategy
    Platform guarantees operation within the allocated budget (HW Fail-safe)
    OS scales power/perf according to workload and respects platform notifications
    New R2 Beta option: OS specifies target utilization and HW selects P-states accordingly
• Otherwise, if the OS and HW are fighting for power management control, both power and performance will suffer
    Hardware-directed power control settings are on by default in some BIOSes

Page 34: 406 bruce worthington_windows_server_power_efficiency_slideshare

Servers Defaulting to Hardware-Controlled Power Mgmt
• Hardware-directed power control settings are on by default in some BIOSes
    Platform alters P-states, C-states, T-states, and/or D-states without OS information
        ○ One alternative is to have platform dynamically restrict the available states and update the OS via ACPI (<= 2 Hz)
    May take over processor performance counters!
        ○ Obviously this is a big concern when using performance monitoring tools that utilize the on-CPU counters

Page 35: 406 bruce worthington_windows_server_power_efficiency_slideshare

Outline
• Motivation
• Background
    ACPI Power States
    Component Power
• Windows Server 2003 → 2008
• Windows Server 2008 R2
• Power Diagnostics and Control
• Summary

Page 36: 406 bruce worthington_windows_server_power_efficiency_slideshare

Component Power Metering
• Only a small set of server models provide component power reporting
• Extra HW instrumentation (or fragile probing) is needed to monitor component power usage on most platforms
• Simplest alternative is to populate and then take away any removable components and track the overall system power delta

Page 37: 406 bruce worthington_windows_server_power_efficiency_slideshare

Example Component Power Distribution #1
Idle 3-Year-Old 4-Socket Single-Core Server

Page 38: 406 bruce worthington_windows_server_power_efficiency_slideshare

Example Component Power Distribution #2
Idle 4-Socket Quad-Core Server

Page 39: 406 bruce worthington_windows_server_power_efficiency_slideshare

Example Component Power Distribution #3
    CPU (2): 46%
    Mobo, 8GB RAM: 18%
    PCI Cards (3): 17%
    SCSI HDD (4): 12%
    Other: 7%
Processor power management represents the best opportunity today
Source: Intel Server Products Power Budget Analysis Tool
http://www.intel.com/support/motherboards/server/sb/cs-016976.htm

Page 40: 406 bruce worthington_windows_server_power_efficiency_slideshare

Selecting Memory Components
• Lots of permutations for a given capacity
    Family (e.g., DDR#)
        ○ FB DIMMs draw more power
    DIMM count
        ○ Especially for FB, where bus may decrease frequency if enough DIMMs
    Bus frequencies
    Ranks
    Density
    Data width
    Channel count
• Low power memory must be evaluated and targeted for the right scenarios
    Additional servers are required if the workload is memory-bottlenecked

Page 41: 406 bruce worthington_windows_server_power_efficiency_slideshare

Memory Power Savings
• Select the right type and number of DIMMs for the workload
• Reduce memory accesses
    Overall
        ○ Smaller working set
        ○ Better cache hit ratios
        ○ Probably better performance, too
• More memory power states
    Compare server memory idle characteristics to mobile memory
    Deeper self-refresh states
        ○ Takes memory longer to come out of deeper states

Page 42: 406 bruce worthington_windows_server_power_efficiency_slideshare

“Green Memory”

Tech  Marking   Datarate  Capacity  Density  DQ   Ranks  Power/DIMM
DDR2  PC2-5300  667 MHz   1GB       256Mb    x4   DR     18.1 W
DDR2  PC2-5300  667 MHz   1GB       256Mb    x8   QR     18.6 W
DDR2  PC2-5300  667 MHz   1GB       512Mb    x4   SR     7.6 W
DDR2  PC2-5300  667 MHz   1GB       512Mb    x8   DR     7.8 W
DDR2  ECC       667 MHz   1GB       1Gb      x16  DR     6.1 W
DDR2  No ECC    667 MHz   1GB       1Gb      x16  DR     5.5 W
DDR2  PC2-5300  667 MHz   4GB       1Gb      x4   DR     14.0 W
DDR2  PC2-5300  667 MHz   4GB       1Gb      x8   QR     14.4 W
DDR2  PC2-5300  667 MHz   4GB       2Gb      x4   SR     8.6 W
DDR2  PC2-5300  667 MHz   4GB       2Gb      x8   DR     8.8 W

No "by 16" part with 4Gb density

Page 43: 406 bruce worthington_windows_server_power_efficiency_slideshare

Networking Power
• NIC idle power (examples)
    100 Mb: 1 W
    1 Gb: 5 W
    Quad 1 Gb: 5-9 W
    10 Gb: 10-15 W
    Quad 10 Gb: 17 W
• Don’t forget network switch power
• Windows Networking Optimizations
    NDIS DPC timer period
    Wake-on-LAN (see content in WinHEC 2008)
    Low Power on Network Disconnect

Page 44: 406 bruce worthington_windows_server_power_efficiency_slideshare

Hard Disk Power
• Decreasing radius
    Cubic power relationship (Power ~ Radius^3)
    3.5” 15K RPM drive = ~12/18 W (idle/active)
    2.5” 15K RPM drive = ~6/9 W (idle/active)
• Decreasing rotational speed
    Quintic power relationship (Power ~ RPM^5)
    15K RPM = 2 ms avg rotational delay (serial workload)
    10K RPM = 3 ms avg (~3-4 W idle)
    7.2K RPM = 4 ms avg (may have slower seek as well)
• Frequently spinning down enterprise drives not advisable (yet)
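The average rotational delays quoted above correspond to the time for half a revolution. A minimal check of that arithmetic, with a hypothetical function name:

```python
def avg_rotational_delay_ms(rpm):
    """Average rotational delay: the time for half a revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

for rpm in (15_000, 10_000, 7_200):
    print(f"{rpm} RPM: {avg_rotational_delay_ms(rpm):.2f} ms")
```

The 7.2K RPM case works out to slightly over 4 ms, which the slide rounds to 4 ms.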

Page 45: 406 bruce worthington_windows_server_power_efficiency_slideshare

Storage Controller Power
• HBA / storage connection interface
    E.g., PCI-X and PCI-e cards 5-8 W idle
• Array Controller
    E.g., small SAN ctlr (2U) = 200/300 W (idle/active in direct attached mode)
• Disk Interface
    SCSI: 80 → 160 → 320 MB/s
    FC: 1 → 2 → 4 → 8 Gb/s
    SAS/SATA: 1.5 → 3.0 → 6.0 Gb/s

Page 46: 406 bruce worthington_windows_server_power_efficiency_slideshare

PCI-Express Power Management
• Support for Active State Power Management (ASPM)
    a.k.a. Link State Power Management
    In-box power policy for ASPM state
    Requires OS control of PCI Express features
    Available white paper

Page 47: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Supply Efficiency
• Power Factor: phase delta between input voltage and current
    Active Power Factor Correction (PFC)
• Ratio of input:output power (AC → DC)
    Entropy means 100% efficiency is unobtainable
    Default supplies at 70%; new models up to 85%
• Previous power supplies were often optimized for high workload levels, but most servers run at 5-20% of capacity (for now)
• Decreases power without decreasing perf

Page 48: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Supply Efficiency
“80 Plus” Requirement for Energy Star (July ‘08)
• 80% minimum efficiency at 20%, 50%, and 100% of rated output
    Previous power supplies often optimized for high loads, but most servers run at 5-20%
• Minimum power factor of 0.9 or greater at 100% of rated output
• Decrease power without decreasing perf
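The two thresholds on this slide can be expressed as a simple check. The sketch below encodes only what the slide states (at least 80% efficiency at 20%, 50%, and 100% of rated output, and a power factor of at least 0.9 at full load); the function name and input layout are hypothetical, and this is not the official certification procedure.

```python
def meets_80_plus(efficiency_by_load, power_factor_at_full_load):
    """Check a power supply against the thresholds listed on the slide.

    efficiency_by_load maps load fraction (0.2, 0.5, 1.0) to efficiency (0-1).
    A missing load point counts as a failure.
    """
    required_loads = (0.2, 0.5, 1.0)
    efficient = all(
        efficiency_by_load.get(load, 0.0) >= 0.80 for load in required_loads
    )
    return efficient and power_factor_at_full_load >= 0.9

# Illustrative (made-up) measurements for two supplies.
print(meets_80_plus({0.2: 0.82, 0.5: 0.85, 1.0: 0.81}, 0.95))
print(meets_80_plus({0.2: 0.74, 0.5: 0.84, 1.0: 0.82}, 0.95))
```

The second supply fails only at the 20% load point, which is the region where the slide notes most servers actually run.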

Page 49: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Supply Waste Power

Efficiency (%)            Output Power  Required Input Power  Waste Power  Waste Power Cost per Annum
70 (default)              500 W         714 W                 214 W        $183.15
80 (near 80 Plus Bronze)  500 W         625 W                 125 W        $106.98
85 (80 Plus Silver)       500 W         588 W                 88 W         $75.31
90 (above 80 Plus Gold)   500 W         555 W                 55 W         $47.07
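The table's columns follow from input power = output power / efficiency and waste = input − output. The sketch below reproduces that arithmetic; the electricity price of about $0.0977 per kWh is an assumption back-calculated from the table's cost column (the slide does not state it), continuous 24x365 operation is also assumed, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365
PRICE_PER_KWH = 0.0977  # assumption: inferred from the table, not stated on the slide

def psu_waste(output_watts, efficiency):
    """Return (required input watts, waste watts) for a given efficiency (0-1)."""
    input_watts = output_watts / efficiency
    return input_watts, input_watts - output_watts

def annual_waste_cost(waste_watts, price_per_kwh=PRICE_PER_KWH):
    """Yearly cost of the wasted power, assuming continuous operation."""
    return waste_watts / 1000 * HOURS_PER_YEAR * price_per_kwh

for efficiency in (0.70, 0.80, 0.85, 0.90):
    input_watts, waste_watts = psu_waste(500, efficiency)
    print(f"{efficiency:.0%}: input {input_watts:.0f} W, waste {waste_watts:.0f} W, "
          f"${annual_waste_cost(waste_watts):.2f}/yr")
```

Small differences from the table are expected because the slide appears to truncate the waste wattage before computing cost.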

Page 50: 406 bruce worthington_windows_server_power_efficiency_slideshare

Fan Power
• Fans in some 1U servers consume 15-20% of overall system power
• Fixed vs. variable-speed fans
• Decrease power without decreasing perf

Page 51: 406 bruce worthington_windows_server_power_efficiency_slideshare

Outline
• Motivation
• Background
• Windows Server 2003 → 2008
    Overview
    Server Power Measurements
• Windows Server 2008 R2
• Power Diagnostics and Control
• Summary

Page 52: 406 bruce worthington_windows_server_power_efficiency_slideshare

Windows Server 2003
• ACPI 2.0 compliant
• Windows processor driver required for specific CPU make/model
• Requires selecting appropriate power policy
• Each system power policy includes a processor throttling policy
    Highest (default), lowest, or full range of P-states
• OEMs or server administrators may create additional power plans

Page 53: 406 bruce worthington_windows_server_power_efficiency_slideshare

Windows Server 2008 - 1
• ACPI 2.0 and 3.0 compliant
• Native OS support for PPM on multiprocessor systems
• Default power settings refined for each release (including WS08R2)
• Windows Server 2008 & SP2
    Simplified configuration model
    Group Policy over power settings
    Power management enabled by default (“Balanced Mode”)

Page 54: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Plans

Power Plan        Min P-state  Max P-state
Balanced          5%           100%
Power Saver       5%           50%
High Performance  100%         100%

Page 55: 406 bruce worthington_windows_server_power_efficiency_slideshare

Windows Server 2008 - 2
• T-states used only when no P-states available
• Power management parameterization for improved flexibility of P- and T-state algorithms
    Additional tunings available for OEMs to customize to processor, chipset, platform, role, etc.
• Improved C3 support
• Very hard to generalize, but 2-10% improvement in power efficiency observed at mid-to-low utilization levels (vs. 2003)

Page 56: 406 bruce worthington_windows_server_power_efficiency_slideshare

Processor Power Management
Windows Server Releases
• Fully supported by WS03, WS08, and WS08R2
• Feature parity with Windows client operating systems
    For example, WS08 has full support for:
        ○ ACPI 2.0, 3.0 processor objects, Notify() events
        ○ Power policy for tuning Operating System (OS) target state algorithms
        ○ Deep idle C-states

Page 57: 406 bruce worthington_windows_server_power_efficiency_slideshare


Default Power Parameters - 1

* = May not appear in Control Panel options by default

PPM parameters

Non-PPM parameters

Page 58: 406 bruce worthington_windows_server_power_efficiency_slideshare


Default Power Parameters - 2

How frequent

Change P-state or not

How to change

Page 59: 406 bruce worthington_windows_server_power_efficiency_slideshare


Default Power Parameters - 3

Entry idle, promote only

Deep idle, demote only

Page 60: 406 bruce worthington_windows_server_power_efficiency_slideshare

Idle Improvement Techniques
• Shut down unnecessary services, applications, roles, devices, drivers
• Avoid polling and spinning in tight loops
• Avoid high-res periodic timers (<10 ms)
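The advice to avoid polling can be illustrated with a blocking wait on an event instead of a check-and-sleep loop. The sketch below uses Python's standard `threading` module purely as an illustration of the pattern; it is not Windows-specific code, and the names `wait_for_signal`, `ready`, and `timer` are hypothetical.

```python
import threading

def wait_for_signal(event, timeout_s):
    """Block until the event is set; the waiting thread sleeps instead of spinning."""
    return event.wait(timeout=timeout_s)

ready = threading.Event()

# Simulate work that finishes 50 ms from now on another thread.
timer = threading.Timer(0.05, ready.set)
timer.start()

signalled = wait_for_signal(ready, timeout_s=2.0)
timer.join()
print("signalled:", signalled)
```

A polling version would wake the processor on every loop iteration; the blocking wait leaves it free to stay in an idle state until the event fires.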

Page 61: 406 bruce worthington_windows_server_power_efficiency_slideshare

Outline
• Motivation
• Background
• Windows Server 2003 → 2008
    Overview
    Server Power Measurements
• Windows Server 2008 R2
• Power Diagnostics and Control
• Summary

Page 62: 406 bruce worthington_windows_server_power_efficiency_slideshare

Measuring Power
• Few existing Windows servers are equipped with comprehensive power metering capabilities
    In the future, servers are likely to have onboard power meters
        ○ AC power (into the power supply)
        ○ DC power (out of the power supply)
        ○ For individual components (CPU, RAM, IO, fans, disks, …)
• The Windows Server Performance team has resorted to two strategies:
    Metering at the wall (AC)
    Directly probing specially manufactured server motherboards (solder and data acquisition)

Page 63: 406 bruce worthington_windows_server_power_efficiency_slideshare

Measuring Power Efficiency
Which Watts/Amps to measure?
• Total server (wall) power
• External power
    Network switches/hubs
    Storage (disks, array controllers, SANs)
    Power distribution and conditioning
    HVAC
• Internal component power
    Processor package
        ○ Threads, cores, caches, memory controllers, cross-package interconnect controllers, IO controllers (e.g., PCI-E)
    Memory (controllers, DIMMs, ranks, banks)
    Chipsets (north bridge, south bridge, IO controllers)
    Power supplies
        ○ AC in, multiple DC out
        ○ Redundant (active/active, active/passive)
    IO (network, storage, video, USB)
        ○ Embedded components and expansion cards
    Fans and other internal misc.

Page 64: 406 bruce worthington_windows_server_power_efficiency_slideshare

Measuring Power Efficiency
• Traditional performance benchmarks optimize for high throughput or low response time by using all resources
• The load line approach tracks power use as load varies
    Pick a power point and see how much load can be handled
    Pick a load point and see how much power is required
• Workload breadth
    Database, web server, file server, etc.
• MS uses SPECpower (a la SPECjbb) and is adding customer-accepted performance benchmarks
    TPC-C/E/H, SpecWeb, NetBench, SAP, SPEC, …
    Semi-internal: FSCT, LCW2, Web Fundamentals, TermSrv, PerfGates, …
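The two ways of reading a load line ("pick a power point" and "pick a load point") can be sketched over a list of measured (throughput, watts) pairs. The sample values below are made up for illustration, and all function names are hypothetical; this is not the lab's tooling.

```python
def load_line_efficiency(samples):
    """samples: list of (throughput_ops_per_s, avg_watts). Returns ops per watt per point."""
    return [ops / watts for ops, watts in samples]

def max_load_within(samples, power_budget_watts):
    """Pick a power point: highest measured throughput whose power fits the budget."""
    feasible = [ops for ops, watts in samples if watts <= power_budget_watts]
    return max(feasible) if feasible else None

def power_needed_for(samples, target_ops):
    """Pick a load point: lowest measured power among points reaching the target."""
    feasible = [watts for ops, watts in samples if ops >= target_ops]
    return min(feasible) if feasible else None

# Made-up measurements: idle draws 180 W, full load draws 300 W.
samples = [(0, 180.0), (25_000, 210.0), (50_000, 240.0), (75_000, 270.0), (100_000, 300.0)]

print(load_line_efficiency(samples))
print(max_load_within(samples, 250))
print(power_needed_for(samples, 60_000))
```

Because idle power is a large fraction of peak in this example, ops per watt keeps improving all the way to full load, which is the shape the load line is meant to expose.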

Page 65: 406 bruce worthington_windows_server_power_efficiency_slideshare

Measuring Power Efficiency
Which workloads to test?
• Workload breadth
    Database, web server, file server, etc.
        ○ Need to prioritize based on potential for power savings and for broadest customer coverage
    Each has unique “work accomplished” metrics (e.g., ops per second)
• Industry standard workloads, such as SPEC and TPC
• Custom workloads designed to test power scenarios
• Microsoft is currently using SPECpower and customer-accepted performance benchmarks to convey power efficiency
    TPC-C/E/H, SpecWeb, NetBench, SAP, SPEC, …
    Semi-internal: FSCT, LCW2, Web Fundamentals, TermSrv, PerfGates, …

Page 66: 406 bruce worthington_windows_server_power_efficiency_slideshare

Industry Standard Workloads
• SPEC
    SPECpower is the only standardized benchmark at this point
        ○ Single workload defined to date
            Order processing for a wholesale supplier running typical Java business applications
            Basically SPECjbb with some changes
            Minimal I/O and kernel time
        ○ Other SPEC benchmarks could have a “power” version, and each one may or may not be modified from the “perf” version
• TPC
    Could add a power metric to each of their existing benchmarks, but details are still being worked out
        ○ What is server power vs. storage power?
        ○ What needs to be installed in the audited server?
            I suspect they will stick to the same approach used for pricing, in that the system has to be available as a purchasable product
        ○ What about the “price” of power?
        ○ Etc.

Page 67: 406 bruce worthington_windows_server_power_efficiency_slideshare

Measuring Power Efficiency
Windows Server Performance Lab
• Methodology for obtaining power load line data for TPC-C, TPC-E, FSCT, and Web Fundamentals has been demonstrated
    Benchmark loads varied by throttling number of active users
    Multiple workloads tested in Hyper-V environment
• SPECpower has been successfully tuned
• Data has been gathered on 2-, 4-, and 8-socket systems with various processors
    Wall-socket power measurements
    Component power measurement by brute force (device extraction)

Page 68: 406 bruce worthington_windows_server_power_efficiency_slideshare

Varying Load Levels

Iteration  SPECpower (Reduce load)  TPC-E (Reduce users)  FSCT (Increase users)
1          100% load                100% of max users     0 users
2          90% load                 ~90% of max users     10% of max users
3          80% load                 ~80% of max users     20% of max users
4          70% load                 ~70% of max users     30% of max users
5          60% load                 ~60% of max users     40% of max users
6          50% load                 ~50% of max users     50% of max users
7          40% load                 ~40% of max users     60% of max users
8          30% load                 ~30% of max users     70% of max users
9          20% load                 ~20% of max users     80% of max users
10         10% load                 ~10% of max users     90% of max users
11         0% load                  0 users               100% of max users

Similar strategy used for Web Fundamentals
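The iteration schedule in the table is a sweep in 10% steps, run downward for SPECpower and TPC-E and upward for FSCT. A minimal generator for such a schedule might look like the following; the function name and parameters are hypothetical.

```python
def load_schedule(steps=10, decreasing=True):
    """Target load fractions in equal steps between 0% and 100% of maximum."""
    levels = [i / steps for i in range(steps + 1)]
    return levels[::-1] if decreasing else levels

print([f"{level:.0%}" for level in load_schedule()])                   # SPECpower / TPC-E style
print([f"{level:.0%}" for level in load_schedule(decreasing=False)])   # FSCT style
```

Each level would then be held long enough to collect a stable power reading before moving to the next iteration.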

Page 69: 406 bruce worthington_windows_server_power_efficiency_slideshare


Testbed

Server Storage(Database Workloads)

Clients /Controller

Page 70: 406 bruce worthington_windows_server_power_efficiency_slideshare

HW and SW Test Configurations
• Sample platforms
    2-socket and 4-socket quad-core
    8-socket dual-core
    x64 (AMD and Intel); ia64
• Hardware- and software-controlled power management modes
• WS03, WS08, WS08SP2 (prerelease), and WS08R2 (prerelease)
• Windows power schemes
    Balanced, High Performance, Power Saver, …
    P-State settings and heuristics
    C-State settings and heuristics
    Parameterized power management optimizations
        ○ E.g., core parking, tick skipping

Page 71: 406 bruce worthington_windows_server_power_efficiency_slideshare

SPECpower: WS03 and WS08

[Chart: Power (% of max watts, 60-100%) versus workload (% of max ssj_ops, 0-100%) for W2K3.SP1, W2K8.RTM, and W2K8.SP2. 2 sockets, 8 cores total.]

Page 72: 406 bruce worthington_windows_server_power_efficiency_slideshare

SPECpower & FSCT: WS03 and WS08

[Chart: SPECpower throughput and power at different workload levels on a 4-socket quad-core system. Power (% of maximum watts, 60-100%) versus workload (% of maximum throughput, 0-100%) for Windows Server 2003 and Windows Server 2008.]

[Chart: FSCT throughput and power at different workload levels on a 2-socket dual-core system. Power (% of maximum watts, 60-100%) versus workload (% of maximum throughput, 0-100%) for Windows Server 2003 and Windows Server 2008.]

Page 73: 406 bruce worthington_windows_server_power_efficiency_slideshare

TPC-E: WS03 and WS08

[Chart: TPC-E power usage at varying workload levels; watts (% of maximum) versus workload (% of maximum tpsE); Windows Server 2003 and Windows Server 2008]

[Chart: TPC-E power efficiency (tpsE/Watt) at varying workload levels; tpsE/Watt (% of maximum) versus workload (% of maximum tpsE); Windows Server 2003 and Windows Server 2008]

Page 74: 406 bruce worthington_windows_server_power_efficiency_slideshare

OOB Windows Server 2008

[Chart: SPECpower throughput (ssj_ops) and power at varying workload levels; power (% of maximum) and ssj_ops per Watt (% of maximum) versus workload (% of max ssj_ops)]

[Chart: Processor utilization and frequency as SPECpower workload decreases over time; average processor utilization and processor frequency (% of maximum) versus time (minutes)]

Page 75: 406 bruce worthington_windows_server_power_efficiency_slideshare

TPC-E: Windows Server 2008

[Chart: Distribution of P-States as workload decreases over time; cumulative P-state distribution (P0, P1, P2, P3, P4, C1) versus time (minutes)]

Page 76: 406 bruce worthington_windows_server_power_efficiency_slideshare

WS03/IIS6 and WS08/IIS7

4 quad-core CPUs, 16 GB, RAID-5 array

                               Measured   Projected
Server Config  Active Clients  Avg Watts  kWh / yr  Cost  Kg of CO2
WS03, IIS6     0               468        4100      375   3190
WS08, IIS7     0               457        4000      357   3110
WS03, IIS6     20              537        4700      430   3660
WS08, IIS7     20              500        4380      401   3410
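The projected kWh / yr figures appear to follow from the measured average watts if the server is assumed to run at that level for a full year. That is an inference from the numbers, not something the slide states. A small sketch of the arithmetic follows; the function names are illustrative, and the cost and CO2 conversion factors are not given in the slide, so they are not reproduced here.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, assuming continuous operation


def annual_kwh(avg_watts):
    """Convert an average power draw in watts to energy per year in kWh."""
    return avg_watts * HOURS_PER_YEAR / 1000


def annual_savings_kwh(baseline_watts, improved_watts):
    """Energy saved per year when average draw drops from baseline to improved."""
    return annual_kwh(baseline_watts) - annual_kwh(improved_watts)


if __name__ == "__main__":
    for label, watts in [("WS03, IIS6, 20 clients", 537),
                         ("WS08, IIS7, 20 clients", 500)]:
        print(f"{label}: {annual_kwh(watts):.0f} kWh / yr")
    print(f"Savings: {annual_savings_kwh(537, 500):.0f} kWh / yr")
```

Under this assumption the results land within a few kWh of the table's rounded values.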

Page 77: 406 bruce worthington_windows_server_power_efficiency_slideshare

Outline

Motivation
Background
Windows Server 2003 and 2008
Windows Server 2008 R2
  Server Energy Vision
  Idle Power Optimizations
  Core Parking
  Hyper-V (v2)
  Power Metering and Budgeting
  SSD
Power Diagnostics and Control
Summary

Page 78: 406 bruce worthington_windows_server_power_efficiency_slideshare

Windows Server Core Energy Vision

Dynamic Data Center
  Coordination across all data center components to scale infrastructure and computing according to business needs
Scalable Node: Server power efficiency
  Low idle power consumption
  Power consumption should scale with load

Page 79: 406 bruce worthington_windows_server_power_efficiency_slideshare

Dynamic Data Center

Holistic approach spanning all infrastructure, not just the computing nodes
Reducing waste and optimizing performance
  Scaling and migrating workloads
  Coordination with power and cooling systems
  Watch out for over-eager workload consolidation or low-power component acquisition
Building platform and management infrastructure

Page 80: 406 bruce worthington_windows_server_power_efficiency_slideshare

Dynamic Data Center – The Problem

Addressing energy consumption in the data center requires a holistic approach spanning all infrastructure, not just the computing nodes
Many factors affect how a data center consumes energy
  Hardware, workload, time of day/week/year, locality, etc.
  Data centers are generally statically configured for peak load
Tremendous opportunities for reducing waste and optimizing performance exist
  Scaling and migrating workloads across groups of machines
  Coordination with power and cooling systems
  Opportunities also exist for unexpected reduction in computing capacity through over-eager workload consolidation or low-power component acquisition without proper planning / testing

Page 81: 406 bruce worthington_windows_server_power_efficiency_slideshare

Dynamic Data Center – The Vision

Enable the management of aggregate servers in conjunction with data center infrastructure
Deliver this through building platform and management infrastructure
  Power metering and budgeting
  Virtualization and workload migration
  Standards-based management technologies
  Coordination between in-band and out-of-band management systems

Page 82: 406 bruce worthington_windows_server_power_efficiency_slideshare

Scalable Node

Today power consumption does not scale in line with server utilization
  Typical commodity servers consume 50-70% of the maximum power when completely idle
  Basic approaches:
    ○ Increase server utilization via virtualization
    ○ Reduce power when full performance not needed
    ○ Power down / put to sleep excess servers
Work with partners to provide the best power and performance by managing the system efficiently
Windows power management improvements

Page 83: 406 bruce worthington_windows_server_power_efficiency_slideshare

Scalable Node – The Problem

Today power consumption does not scale in line with server utilization
  Typical commodity servers consume 50-70% of the maximum power when completely idle
    ○ Idle servers have low efficiency due to high idle power
    ○ Efficiency rises with utilization due to idle power amortization
  Tremendous opportunities exist for reducing energy needs
    ○ Reduce power when full performance is not required
    ○ Leverage virtualization solutions to increase server utilization
    ○ Power down servers when they are not needed
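To make "idle power amortization" concrete, here is a small worked sketch. It assumes a purely linear power model between idle and maximum draw, which is a simplification introduced for illustration and not a claim from the slides; the function names are hypothetical.

```python
def relative_power(utilization, idle_fraction):
    """Power as a fraction of maximum, assuming a linear idle-to-max model.

    utilization: 0.0 (idle) to 1.0 (fully loaded)
    idle_fraction: idle power divided by maximum power (e.g., 0.6 for 60%)
    """
    return idle_fraction + (1.0 - idle_fraction) * utilization


def relative_efficiency(utilization, idle_fraction):
    """Work per watt, normalized so that a fully loaded server scores 1.0."""
    return utilization / relative_power(utilization, idle_fraction)


if __name__ == "__main__":
    # Idle draw of 60% of maximum, inside the 50-70% range quoted above.
    for u in (0.1, 0.25, 0.5, 0.75, 1.0):
        print(f"utilization {u:4.0%}: efficiency {relative_efficiency(u, 0.6):.2f}")
```

Under this model, a server idling at 60% of maximum power delivers well under a fifth of its best-case work per watt at 10% utilization, which is why raising utilization or lowering idle power pays off.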

Page 84: 406 bruce worthington_windows_server_power_efficiency_slideshare

Scalable Node – The Vision

Work with partners to provide the best power and performance by managing the system efficiently
Deliver this through improvements to Windows Power Management
  Build on existing infrastructure and extend Windows value
  Enhancements to processor power management
  Focus on idle and low-to-medium workload levels
  Support for device performance states

Page 85: 406 bruce worthington_windows_server_power_efficiency_slideshare

Windows Server 2008 R2 - 1

Refined “Balanced Mode” defaults to optimize power efficiency
Takes advantage of advances in server platform hardware (e.g., powering down individual cores or sockets)
Configurable power settings for new features (e.g., core parking)
P-state and C-state selection algorithms updated
Increased support for joint OS/HW power management

Page 86: 406 bruce worthington_windows_server_power_efficiency_slideshare

Windows Server 2008 R2 - 2

Simplified configuration model
Group Policy control over all power settings
Rich command line interface and refined UI elements
In-band WMI power metering and budgeting support
Remote manageability of power policy via WMI
Additional qualification logo to indicate enhanced power management support

Page 87: 406 bruce worthington_windows_server_power_efficiency_slideshare

Windows Device Power Management

Extensible power policy infrastructure
  Allows easy incorporation of power management-enabled devices
    ○ Device power settings integrate with Windows system power policy
    ○ Device power settings can appear in Advanced power UI
    ○ Rich notification support
  Allows for true OEM power management innovation and value

Page 88: 406 bruce worthington_windows_server_power_efficiency_slideshare

Enhanced Power Management Logo

Additional Qualification logo for “Enhanced Power Management” that indicates support for the following:
  Processor power management through Windows
  Power metering and budgeting
  Power On/Off via WS-Management (SMASH)

Page 89: 406 bruce worthington_windows_server_power_efficiency_slideshare

Windows Server 2008 P-State Parameters

Balanced Mode Settings    WS08        R2 Pre-Beta  WS08R2
Time Check                100 ms      100 ms       50 ms
Increase Time             100 ms      100 ms       50 ms
Decrease Time             300 ms      100 ms       50 ms
Increase Percentage       30%         70%          80%
Decrease Percentage       50%         30%          70%
Domain Accounting Policy  0 (On)      Always Off   Always Off
Increase Policy           IDEAL (0)   IDEAL (0)    SINGLE (1)
Decrease Policy           SINGLE (1)  SINGLE (1)   IDEAL (0)
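As a rough illustration of how increase and decrease thresholds can drive P-state selection, the sketch below steps between states using the WS08R2 percentages from the table (80% and 70%). It is a deliberately simplified model and not the Windows algorithm: it ignores the time-check and increase/decrease time parameters and domain accounting, it does not model the IDEAL policy, and it assumes SINGLE means "move one state" and ROCKET means "jump to the extreme state". The function name and the strict comparisons are illustrative choices.

```python
def next_pstate(current, utilization_pct, lowest_state,
                increase_pct=80, decrease_pct=70,
                increase_policy="SINGLE", decrease_policy="SINGLE"):
    """Pick the next P-state index (0 = highest performance).

    current: current P-state index
    utilization_pct: utilization observed over the last check interval
    lowest_state: index of the lowest-performance P-state
    """
    if utilization_pct > increase_pct:
        # Busy: move toward P0.
        if increase_policy == "ROCKET":
            return 0
        return max(current - 1, 0)
    if utilization_pct < decrease_pct:
        # Lightly loaded: move toward the lowest-performance state.
        if decrease_policy == "ROCKET":
            return lowest_state
        return min(current + 1, lowest_state)
    # Between the two thresholds: hold the current state.
    return current


if __name__ == "__main__":
    state = 2
    for util in (95, 95, 75, 40, 40, 40):
        state = next_pstate(state, util, lowest_state=4)
        print(f"utilization {util:3d}% -> P{state}")
```

The gap between the two thresholds acts as a hysteresis band, so a load hovering around one value does not cause constant state changes.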

Page 90: 406 bruce worthington_windows_server_power_efficiency_slideshare

Optimized for Low-to-Medium Loads

Even though 100% utilization may have the highest power efficiency, few servers run at full capacity
  Servers at maximum utilization provide fewer opportunities for power optimizations
In the short term, targeting low-utilization servers will provide the most benefit
In the medium term, targeting medium-utilization servers will provide increased benefit
  E.g., consolidation and virtualization will increase average utilization levels


Page 92: 406 bruce worthington_windows_server_power_efficiency_slideshare

Get Idle; Stay Idle

Shut down unnecessary services, applications, roles, devices, drivers
Avoid polling and spinning in tight loops
Avoid high-res periodic timers (<10 ms)
Timer Coalescing
Intelligent Timer Tick Distribution (ITTD)
Use NUMA-based affinity for threads and interrupts
  Thread (via APIs and tools): soft (IdealProc), hard (affinity mask)
  Interrupts (via IntPolicy.exe)
Idle improvements extend to Hyper-V
  Significant reduction in platform interrupt activity
  Enables power savings and greater scalability

Page 93: 406 bruce worthington_windows_server_power_efficiency_slideshare

Timer Coalescing

Platform energy efficiency can be improved by extending idle periods
  New timer coalescing API enables callers to specify a tolerance for due time
  Enables the kernel to expire multiple timers at the same time
Extensions should integrate with WS08R2 API/DDI

[Diagram: periodic timer events relative to the 15.6 ms timer tick, Vista versus Windows 7]
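The idea of coalescing can be illustrated with a small simulation. This is only a sketch of the general technique (greedy grouping of timers whose tolerance windows overlap), not the Windows kernel's algorithm; the function name and the (due, tolerance) tuple representation are assumptions made for the example.

```python
def coalesce(timers):
    """Return the wake-up times needed to service all timers.

    timers: list of (due_ms, tolerance_ms); a timer may expire anywhere
    in the window [due_ms, due_ms + tolerance_ms].
    """
    wakeups = []
    # Process timers in order of their latest allowed expiry.
    for due, tolerance in sorted(timers, key=lambda t: t[0] + t[1]):
        if wakeups and due <= wakeups[-1]:
            # The most recent wake-up already falls inside this timer's window.
            continue
        # Otherwise wake as late as this timer allows, so later timers can share it.
        wakeups.append(due + tolerance)
    return wakeups


if __name__ == "__main__":
    timers = [(10, 0), (12, 5), (15, 5), (30, 10), (35, 2)]
    print(coalesce(timers))
```

In this example five timers are serviced by three wake-ups; with zero tolerance every distinct due time would need its own wake-up, shortening idle periods.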

Page 94: 406 bruce worthington_windows_server_power_efficiency_slideshare

Intelligent Timer Tick Distribution (Tick Skipping)

Extend processor sleep states by not waking the CPU unnecessarily
CPU 0 handles the periodic system timer tick; other processors are signaled as necessary
Non-timer interrupts will still wake sleeping processors
Not available on IA64
Only enabled on systems with more C-states than just C1

Page 95: 406 bruce worthington_windows_server_power_efficiency_slideshare

Background Process Management

Background activity on the macro scale (minutes, hours) is also important for power
  E.g., disk defragmentation, AV scans
  Prevents low-power idle and sleep modes
  Will collapsing multiple background activities result in a significantly heavier load during that interval and thus potentially impede concurrent foreground activity?
Unified Background Process Manager (UBPM)
  New WS08R2 infrastructure
  Drives scheduling of services and scheduled tasks
  Transparent to users, IT pros, and existing APIs
  Enables trigger-starting services
  Delivers usage data and metrics to Microsoft via CEIP

Page 96: 406 bruce worthington_windows_server_power_efficiency_slideshare

UBPM: Trigger-Start Services

Many services configured to Autostart and wait for rare events
UBPM enables Trigger-Start services based on environmental changes
  Device arrival/removal, IP address change, domain join, etc.
  Examples
    ○ Bluetooth service is started only if a Bluetooth radio is currently attached
    ○ BitLocker encryption service started only when new volumes detected
ISV Call to Action
  Leverage trigger-start capability for value-add services
  Validate performance impact with XPerf tools
    ○ Performance impact can be positive or negative

Page 97: 406 bruce worthington_windows_server_power_efficiency_slideshare

Coordinated Processor Clocking Control

New processor performance state interface described via ACPI
Feature enables OS and HW platform coordination of processor power management
  Platform is in direct control of T-states and P-states
  OS dynamically specifies processor performance requirements on per-processor basis as a percentage of maximum frequency
  Platform is responsible for delivering requested performance
    ○ In some cases, like a power budget condition, the platform may underdeliver, but must report this


Page 99: 406 bruce worthington_windows_server_power_efficiency_slideshare

Processor Core Parking

This is a Windows scheduler optimization, not HW!
Goals
  Save power on multi-processor systems by dynamically scaling number of active cores to match workload
  Drop parked cores into deepest C-states
Approach
  Use historical information to predict future workload
  Calculate number of cores needed
  Heuristically select the “unparked” cores
Monitoring
  Perfmon and ETW
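The "predict, then calculate cores needed" steps of the approach can be sketched as follows. This is a hypothetical illustration, not the Windows heuristic: the prediction is a plain average of recent samples, and the 80% per-core target utilization, the function name, and the parameters are all assumptions made for the example.

```python
def cores_needed(history_pct, total_cores, target_pct=80, min_unparked=1):
    """Estimate how many cores should stay unparked.

    history_pct: recent whole-system utilization samples, each 0..100
    total_cores: number of logical cores in the system
    target_pct: desired utilization of each unparked core (assumed value)
    """
    # Predicted demand in "core-percent": average utilization times core count.
    demand = sum(history_pct) * total_cores
    capacity_per_core = len(history_pct) * target_pct
    # Integer ceiling division avoids floating-point boundary surprises.
    needed = -(-demand // capacity_per_core)
    return max(min_unparked, min(total_cores, needed))


if __name__ == "__main__":
    print(cores_needed([10, 14, 12], total_cores=16))   # light load
    print(cores_needed([95, 100, 98], total_cores=16))  # heavy load
```

Lowering the assumed target leaves more cores unparked and improves responsiveness to rising load, at the cost of some power savings.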

Page 100: 406 bruce worthington_windows_server_power_efficiency_slideshare

Processor (Logical) Core Parking

Logical core = HW thread (e.g., Intel® Hyperthreading)
Extension of Windows’ processor performance state engine
  Configurable via power policy settings
Parking may reduce performance, depending on the parameter settings, by reducing OS responsiveness to rising load levels
Parking could improve performance by concentrating work onto a smaller number of cores

Page 101: 406 bruce worthington_windows_server_power_efficiency_slideshare

Selecting Cores to Park - 1

WS08R2 (Beta) approach:
  Leave one logical core unparked per NUMA node
Other possible approaches, including customizable minimum unparked entities
  Park entire packages at once
  Park logical cores individually, regardless of packages
  Leave one logical core unparked per socket
  Leave one logical core unparked per physical core
Affinitized activity does tend to unpark logical cores that must be used (selection heuristic)
  Beta tracks affinitized threads, not DPCs / Interrupts
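The "one logical core unparked per NUMA node" rule can be sketched like this. The topology dictionary, the function name, and the round-robin order used to pick additional cores are assumptions for illustration; the slides only state the per-node minimum, not how the remaining cores are chosen.

```python
def select_unparked(topology, cores_needed):
    """Choose which logical cores stay unparked.

    topology: dict mapping NUMA node id -> list of logical core ids
    cores_needed: target number of unparked cores from the workload estimate
    """
    # Rule from the slide: keep at least one logical core unparked per NUMA node.
    unparked = [cores[0] for cores in topology.values() if cores]
    remaining = [list(cores[1:]) for cores in topology.values()]
    # Assumed policy: add further cores round-robin across nodes until the target is met.
    while len(unparked) < cores_needed and any(remaining):
        for node_cores in remaining:
            if node_cores and len(unparked) < cores_needed:
                unparked.append(node_cores.pop(0))
    return sorted(unparked)


if __name__ == "__main__":
    two_nodes = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
    print(select_unparked(two_nodes, 1))
    print(select_unparked(two_nodes, 3))
```

Note that the per-node minimum can leave more cores unparked than the workload estimate asks for, as in the first call above.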

Page 102: 406 bruce worthington_windows_server_power_efficiency_slideshare

Selecting Cores to Park - 2

Parking algorithm takes many inputs. At a minimum:
  Time since the last parking decision was made
  Average frequencies of each core over the last time interval
  Average CPU “utilization” over the last time interval
Possible additional inputs depending on parameter setting and final WS08R2 refinements:
  ○ Power state domains (i.e., groups of associated cores)
  ○ Current processor P-States
  ○ P-State change rate policies (SINGLE, ROCKET, IDEAL)
  ○ Affinitized DPCs / Interrupts
  ○ Time spent in affinitized activity
  ○ More comprehensive or longer historical information
  ○ More system component topology information


Page 104: 406 bruce worthington_windows_server_power_efficiency_slideshare

Hyper-V Power Management

Full P-state/C-state management already integrated between Windows root partition and Hyper-V v1 (WS08)
Enlightenments added in Hyper-V v2 (WS08R2)
  Hypervisor delivers child clocks without requiring root interaction, plus Intelligent Timer Tick Distribution (to children)
  Core parking enabled for all partitions

Page 105: 406 bruce worthington_windows_server_power_efficiency_slideshare

Web Fundamentals Dynamic: WS08

[Chart: Adding load to each guest; watts and throughput (reqs/sec) versus system utilization percentage]

[Chart: Adding guests; watts and throughput (reqs/sec) versus system utilization percentage]

For these experiments, the highest system utilization under WF workload is ~80%. This issue has been subsequently resolved.

Page 106: 406 bruce worthington_windows_server_power_efficiency_slideshare

SPECpower: WS08

[Chart: Adding load to each guest; watts and throughput (in thousands) versus system utilization percentage]

[Chart: Adding guests; watts and throughput (in thousands) versus system utilization percentage]

Page 107: 406 bruce worthington_windows_server_power_efficiency_slideshare

SPECpower + WF (WS08)

[Chart: SPECpower throughput and server power usage versus total system utilization; watts and throughput (in thousands) versus system utilization percentage]

[Chart: Power usage for various throughput levels; power (% of watts) versus workload (% of maximum throughput)]

• 4 Guests running WF (4940 requests/sec)
• ~25% system utilization; ~35% guest virtual processor utilization
• 4 Guests running SPECpower (similar efficiency as single workload)


Page 109: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting

In the future, servers are likely to have onboard power meters
  AC power (into the power supply)
  DC power (out of the power supply)
  For components (CPU, RAM, IO, fans, disks, …)
WS08R2 provides the capability to monitor such meters, as well as communicate with power management logic, through standard Windows and ACPI interfaces
Power budget information is reported to OS
  Optional support for configuring the budget from within Windows

Page 110: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting

[Diagram labels: WMI consumers (System Center, admin scripts, management tools, ...); WMI namespace root\cimv2\power (Power Supply class, Power Meter class, Power Meter Events); user-mode power service with power WMI providers; standard Windows IOCTL interface; in-box ACPI-based implementation (vendors provide ACPI code in firmware); other vendor-specific implementations; hardware (BMC hardware); legend: implemented in WS08R2]

Page 111: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting

WS08R2 introduces the ability to report power consumption and budgeting information
  Server platform reports this in-band to the OS via ACPI
  No additional drivers or HW changes are required, only platform support
Power information is exposed via WMI
  Adheres to the DMTF Power Supply Profile v1.01
Power budget information is reported to the OS
  Optional support for configuring the budget from within Windows
Extendable to enable per-device metering
  WDM driver interface available
Design goals
  Standard hardware and software interfaces
  Native infrastructure, easily extendable
  Leverages existing platform technology

Page 112: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – Usage

Statistical/inventory/auditing
Data center can monitor power consumption across nodes
Administrator can write scripts to control power policies and receive power condition events
Model can be extended to per-device meters
Another set of metrics for virtualization and consolidation

Page 113: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – WDM

Standard Windows driver IOCTL interface
Event model based on pending IO requests (IRPs)
Two separate device interfaces
Consumed by the WMI providers
An alternative to the ACPI implementation
Future direction – potentially consumed by the kernel power manager
Documented on MSDN

Page 114: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – ACPI

Rationale
  Works as the abstraction layer to the underlying platform technology (IPMI, WSMAN, etc.)
  Scales across different platforms
  Does not require special drivers
  Requires only firmware updates
Currently being proposed to the ACPI 4.0 specification
Delegate tasks to the BMC (e.g., rolling average calculation, polling for events, etc.)
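The rolling average mentioned as a BMC task is simple to picture in code. The sketch below is a generic fixed-window average with a budget check; the class name, the window size, and the budget comparison are illustrative assumptions and do not describe any particular BMC firmware or the ACPI power meter interface.

```python
from collections import deque


class RollingAverage:
    """Fixed-window rolling average of power samples (watts)."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)
        self.total = 0.0

    def add(self, watts):
        """Record a sample and return the average over the current window."""
        if len(self.samples) == self.samples.maxlen:
            # The oldest sample is about to be evicted by the append below.
            self.total -= self.samples[0]
        self.samples.append(watts)
        self.total += watts
        return self.total / len(self.samples)


def over_budget(average_watts, budget_watts):
    """Hypothetical trip-point check against a configured power budget."""
    return average_watts > budget_watts


if __name__ == "__main__":
    meter = RollingAverage(window=3)
    for reading in (300, 330, 360, 390):
        avg = meter.add(reading)
        print(f"reading {reading} W, average {avg:.0f} W, "
              f"over 350 W budget: {over_budget(avg, 350)}")
```

Averaging over a window keeps short spikes from triggering budget events, which is one reason to do this work close to the meter rather than polling from the OS.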

Page 115: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – ACPI

Power supply device
  Extends the current power source device
  Control method to publish capabilities
Power meter device
  Similar to control method for batteries
  A set of control methods to get capabilities and set configuration parameters, trip points, and configure hardware enforced limits
  Event notification via Notify codes

Page 116: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – ACPI

WS08R2 will provide
  In-box driver to support power meter device(s) described in ACPI
  In-box IPMI operation region handler as part of the Microsoft IPMI driver – allowing ACPI control methods to communicate with IPMI using the KCS protocol
    ○ Format similar to the SMBUS OpRegion
    ○ 3rd-party IPMI drivers can register OpRegion handler for other IPMI protocol(s)
    ○ Also proposed to ACPI 4.0 specification


Page 118: 406 bruce worthington_windows_server_power_efficiency_slideshare

WS08R2 Enables Improved Endurance for SSD Technology

SSD can identify itself differently from HDD in ATA as defined through ATA8-ACS
  Identify Word 217: Nominal media rotation rate
Reporting non-rotating media will allow WS08R2 to set Defrag off as default, improving device endurance by reducing writes
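A short sketch of how the rotation-rate word can be interpreted follows. The value ranges reflect my reading of the ATA8-ACS definition of IDENTIFY DEVICE word 217 (0000h not reported, 0001h non-rotating media, 0401h to FFFEh a rate in rpm); the function names and the defrag decision helper are hypothetical and only mirror the behavior described on the slide.

```python
def describe_rotation_rate(word_217):
    """Interpret IDENTIFY DEVICE word 217 (nominal media rotation rate)."""
    if word_217 == 0x0000:
        return "not reported"
    if word_217 == 0x0001:
        return "non-rotating"  # e.g., a solid state device
    if 0x0401 <= word_217 <= 0xFFFE:
        return f"{word_217} rpm"
    return "reserved"


def defrag_enabled_by_default(word_217):
    """Hypothetical policy helper: skip scheduled defrag on non-rotating media."""
    return describe_rotation_rate(word_217) != "non-rotating"


if __name__ == "__main__":
    for value in (0x0000, 0x0001, 7200, 15000):
        print(f"0x{value:04X}: {describe_rotation_rate(value)}, "
              f"defrag default on: {defrag_enabled_by_default(value)}")
```

Devices that do not report a rate are treated here like rotating disks, which is the conservative choice for this example.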

Page 119: 406 bruce worthington_windows_server_power_efficiency_slideshare

WS08R2 Enables Optimization for SSD Technology

Microsoft implementation of “Trim” feature
  NTFS will send down delete notification to the device supporting “trim”
    ○ File system operations: Format, Delete, Truncate, Compression
    ○ OS internal processes: e.g., Snapshot, Volume Manager
Three optimization opportunities for the device
  Enhancing device wear leveling by eliminating merge operation for all deleted data blocks
  Making early garbage collection possible for fast write
  Keeping device’s unused storage area as high as possible; more room for device wear leveling

Page 120: 406 bruce worthington_windows_server_power_efficiency_slideshare

Outline

Motivation
Background
Windows Server 2003 and 2008
Windows Server 2008 R2
Power Diagnostics and Control
  Perfmon/Resmon
  Pwrtest
  Powercfg
Summary

Page 121: 406 bruce worthington_windows_server_power_efficiency_slideshare

Check Processor ACPI States: System Event ID 4

Page 122: 406 bruce worthington_windows_server_power_efficiency_slideshare

Check Power State Settings

Kernel Debugger
  !ppmperf
    Provides P-state and T-state information
  !ppmidle
    Provides C-state information

Page 123: 406 bruce worthington_windows_server_power_efficiency_slideshare

Monitoring Power Status - 1

System Event Log: ID 4
Perfmon/Logman
  Processor
    ○ Provide average C-state information
      % C1/2/3 Time and C1/2/3 Transitions/sec
  Processor Information
    ○ Parking status
  Processor Performance
    ○ Only present if P-states are exposed
    ○ Provide current P-state information (e.g., avg freq)
Resource Monitor
  CPU % Max Frequency average and graph

Page 124: 406 bruce worthington_windows_server_power_efficiency_slideshare


Perfmon: Processor Frequency

Page 125: 406 bruce worthington_windows_server_power_efficiency_slideshare


Perfmon: Processor Frequency vs. Utilization

Page 126: 406 bruce worthington_windows_server_power_efficiency_slideshare


Resmon: Processor Frequency

Page 127: 406 bruce worthington_windows_server_power_efficiency_slideshare

Monitoring Power Status - 2

ETW tracing (Windows Perf Tool Kit)
  Xperf -on power
Pwrtest.exe
  Logs use of P-, T-, and C-states
  Pwrtest /ppm
    ○ Sampling P-state and C-state performance
  Pwrtest /ppm /live
    ○ Event-driven logging for all the P-state and C-state transitions

Page 128: 406 bruce worthington_windows_server_power_efficiency_slideshare

Pwrtest.exe /info:ppm

C:\Program Files\Microsoft PwrTest> pwrtest /info:ppm

PROCESSOR_POWER_INFORMATION
  CPU Number       = 0
  MaxMhz           = xxxx
  CurrentMhz       = yyyy
  MhzLimit         = zzzz
  MaxIdleState     = M
  CurrentIdleState = N
  InstanceName: CPU Model X

(continued)
Processor Performance States
  PerfStates:
    Max Transition Latency: xx us
    Number of States: yy

    State  Speed (Mhz)   Type
    0      aaaa (100%)   Perf
    1      bbbb ( ss%)   Perf
    2      cccc ( tt%)   Perf
    3      dddd ( uu%)   Throttle
    4      eeee ( vv%)   Throttle
    5      ffff ( ww%)   Throttle

Page 129: 406 bruce worthington_windows_server_power_efficiency_slideshare

Pwrtest.exe in Logging Mode - 1

C:\Program Files\Microsoft PwrTest> pwrtest /ppm

     Elapsed  Idle  C1   C2   C3   P-     Freq  Freq   Perf/
Cpu  [ms]     [%]   [%]  [%]  [%]  State  [%]   [MHz]  Throttle
---  -------  ----  ---  ---  ---  -----  ----  -----  --------
0    5007     98    0    73   26   2      54    1000   P
1    5007     99    0    93   6    2      54    1000   P
0    10014    97    0    72   27   2      54    1000   P
1    10014    97    0    91   8    2      54    1000   P
0    15021    88    1    0    0    2      54    1000   P
1    15021    89    1    0    0    2      54    1000   P
0    20028    99    0    0    100  2      54    1000   P

Page 130: 406 bruce worthington_windows_server_power_efficiency_slideshare

Pwrtest.exe in Logging Mode - 2

C:\Program Files\Microsoft PwrTest> pwrtest /ppm /live
Waiting for PPM Events. Press 'Ctrl-C' to quit...
Timestamp     Proc#  Event Information
-------------------------------------------------------------------------------
21:27:41.133  0      Idle State Demotion (Old:2, New:1, Affinity:0x1)
21:27:41.133  1      Idle State Demotion (Old:2, New:1, Affinity:0x2)
21:27:41.196  1      Perf State Change (State:0, Speed:1833 Mhz)
21:27:41.196  1      Domain Perf State Change (State:0, Speed:1833 Mhz, Affinity:0x3)
21:27:41.196  0      Idle State Demotion (Old:1, New:0, Affinity:0x1)

Page 131: 406 bruce worthington_windows_server_power_efficiency_slideshare


Setting P-State Parameters

Page 132: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Controls: Powercfg.exe

Configure power settings within a specific power scheme (WS03+)
WS08R2: Detect common energy efficiency problems (via /ENERGY flag)
  USB device selective suspend
  Processor Power Management (PPM)
  Inefficient power policy settings
  Platform timer resolution
  Platform firmware problems
  …and more

Page 133: 406 bruce worthington_windows_server_power_efficiency_slideshare

Powercfg.exe Example

Configure power setting within a specific power scheme
  Set AC, DC values for individual settings
  Every power setting belongs to a Subgroup
  -setdcvalueindex used for battery scenario

C:\> powercfg.exe -setacvalueindex <SCHEME> <SUBGROUP> <SETTING> <VALUE>

C:\> powercfg.exe -setacvalueindex SCHEME_BALANCED SUB_SLEEP STANDBYIDLE 0

Page 134: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Efficiency Diagnostics

“Powercfg /ENERGY” to start tracing
  Close open applications and documents first
Inbox with WS08R2 only
  Leverages new inbox ETW instrumentation
  Advanced users can run utility and view HTML output
  Automatically executed when the system is idle [Win7]
  Reports data to Microsoft via Customer Experience Improvement Program (CEIP)
Attend COR-C633 Microsoft Tools for Energy Efficiency Diagnostics for demo and details

Page 135: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Efficiency Diagnostics: Detected problems

Problem Area: USB Device Selective Suspend
  Data Collected: Individual device suspend transitions; % of time device was in suspend state
  Warning Threshold: < 80% suspend time
  Error Threshold: < 50% suspend time

Problem Area: Power Policy Settings
  Data Collected: Idle timeouts (dim, display, sleep); PPM configuration; Power plan personality; 802.11 Wireless Power Save
  Warning Threshold: Idle timeouts < EnergyStar 4.0 Recommendations
  Error Threshold: Idle timeouts disabled

Problem Area: Processor Utilization
  Data Collected: Overall utilization; Per-process utilization (any process over .1%); Top 3 module utilization in each process
  Warning Threshold: Total utilization > 2%
  Error Threshold: Total utilization > 4%

Page 136: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Efficiency Diagnostics: Detected problems

Problem Area: Timer Resolution Requests
  Data Collected: Current system timer interrupt period (e.g., 15.6 ms); Applications with outstanding timer requests, request amount
  Warning Threshold: None
  Error Threshold: Timer interrupt period < 15.6 ms

Problem Area: Platform Capabilities
  Data Collected: Firmware validation problems; PCI Express ASPM status
  Warning Threshold: None
  Error Threshold: If any capability is disabled or missing

Page 137: 406 bruce worthington_windows_server_power_efficiency_slideshare

Lab Issues: Processor Utilization is Based on Non-Idle Wall Time
Idle == idle loop or HALT
It doesn’t take frequency into account, so 100% CPU utilization could be at P0 or at Pn
○ There may actually be more performance on the table
Idle time will include the time taken to return from C-states (HALT), which could be microseconds
CPU utilization will include cache warm-up effects if the cache has been flushed to reach the deepest C-states
CPU utilization will include latencies caused by remote memory being in low-power states
○ In particular, AMD and future Intel processors where memory is socket-attached
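One way to see why wall-time utilization can mislead is to scale it by the current clock frequency. The Python sketch below is a rough illustration, assuming performance scales linearly with frequency (an approximation that ignores memory-bound behavior); the function name and the example frequencies are hypothetical.

```python
def frequency_adjusted_utilization(busy_pct: float,
                                   current_mhz: float,
                                   max_mhz: float) -> float:
    """Scale non-idle wall-time utilization by the ratio of current to
    maximum frequency (assumes performance is proportional to frequency)."""
    if max_mhz <= 0:
        raise ValueError("max_mhz must be positive")
    return busy_pct * current_mhz / max_mhz
```

Under this assumption, a processor reported as 100% busy while running at 2000 MHz on a part capable of 3000 MHz is using only about two thirds of its potential throughput.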

Page 138: 406 bruce worthington_windows_server_power_efficiency_slideshare

Lab Issues: OS vs. HW C-States
Only three C-states selected by the OS:
○ C1: C1 in HW
○ C2: lowest power “type 2” C-state reported by HW
○ C3: Cn in HW
Perfmon shows OS perspective of C-states

Page 139: 406 bruce worthington_windows_server_power_efficiency_slideshare

Outline
Motivation
Background
Windows Server 2003 2008
Windows Server 2008 R2
Power Diagnostics and Control
Summary

Page 140: 406 bruce worthington_windows_server_power_efficiency_slideshare

Summary
Windows Server 2008 and 2008 R2 deliver real energy savings for the data center
New WS08R2 features deliver enhanced power efficiency and better manageability
○ Improvements to idle and low-to-medium workload operating efficiency
○ Management of power policy via WMI
○ Power metering support provides energy consumption information through Windows

Page 141: 406 bruce worthington_windows_server_power_efficiency_slideshare

Future Work Example: NonVolatile Memory (NVM)
Solid State Disk (current server usage)
Potential additional layer(s) in memory hierarchy
○ Cache (a la ReadyBoost)
○ DRAM complement
Very low power when idle
○ But low-power DRAM may narrow the gap significantly
Poor performance of random writes
○ Could be improved by coalescing and remapping writes
Block orientation
○ Difficult to use as DRAM complement
Limited lifetime of Flash cells
○ Future NVM technologies may improve on this

Page 142: 406 bruce worthington_windows_server_power_efficiency_slideshare

Call to Action - 1
Make sure any reduction in server capabilities is a planned-for and acceptable tradeoff between power and performance (e.g.)
○ TANSTAAFL, Do More With Less
○ Reduce idle activity and power consumption
○ Validate new platform power management using Power Efficiency Diagnostics
ISV/IHV Call to Action for Power: eliminate activity during workload idle periods in applications and drivers
○ Target average idle period at minimum >100ms
○ Provide software with adjustable tradeoffs between power and performance when appropriate
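The >100ms idle-period target can be checked against a trace of wake-up timestamps. The Python sketch below assumes a hypothetical input format (a sorted list of times, in milliseconds, at which the processor left idle) and treats each wake-up as having negligible duration unless `busy_ms_per_wake` is supplied.

```python
def average_idle_period_ms(wake_times_ms, busy_ms_per_wake=0.0):
    """Mean gap between consecutive wake-ups, minus the assumed busy time."""
    if len(wake_times_ms) < 2:
        raise ValueError("need at least two wake-up timestamps")
    gaps = [later - earlier - busy_ms_per_wake
            for earlier, later in zip(wake_times_ms, wake_times_ms[1:])]
    return sum(gaps) / len(gaps)


def meets_idle_target(wake_times_ms, target_ms=100.0):
    """True when the average idle period exceeds the target (default 100 ms)."""
    return average_idle_period_ms(wake_times_ms) > target_ms
```

A component that wakes the processor every 10 ms yields a 10 ms average idle period and misses the target; one that wakes it every 250 ms meets it.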

Page 143: 406 bruce worthington_windows_server_power_efficiency_slideshare

Call To Action - 2
Build power efficient platforms and solutions
Expose complete processor (and memory and device) information from BIOS
Ensure drivers and applications work with core parking enabled
Speak with Microsoft about creating ACPI-based power meter and supply devices
Get the Enhanced Power Management logo
Review microsoft.com power whitepapers and presentations

Page 144: 406 bruce worthington_windows_server_power_efficiency_slideshare

The Power of WinHEC 2008!
COR-T540: Windows 7 Power Management Overview
MBL-T541: Improving Platform Energy Efficiency (part 1)
MBL-T541: Improving Platform Energy Efficiency (part 2)
COR-C622: Discussion: Windows 7 Power Management
COR-T542: NDIS 6.20: Core Network Power Management Fundamentals
COR-C633: Microsoft Tools for Energy Efficiency Diagnostics
ENT-T551: Windows Server Power Management Overview
ENT-T552: Windows Server Power Management Implementation Details
ENT-C630: Windows Server and Intel® Dynamic Power Technology for Data Centers
COR-S559: Power-Performance Benchmarks, AMD, and Scalable Windows with HP Integrity Servers, HP

Page 145: 406 bruce worthington_windows_server_power_efficiency_slideshare

Additional Resources
WDK available with pre-Beta
Web Resources:
White papers and presentations at www.microsoft.com (search on “power”)
○ http://www.microsoft.com/whdc (search on “power”)
Windows Hardware Developer Central – Power Management: …/whdc/system/pnppwr/
Processor Power Management in Windows Vista and Windows Server 2008: …/whdc/system/pnppwr/powermgmt/ProcPowerMgmt.mspx
ACPI / Power Management: …/whdc/system/pnppwr/powermgmt/default.mspx
Recommendations for Power Budgeting with Windows Server: …/whdc/system/pnppwr/powermgmt/Svr_PowerBudget.mspx
Active State Power Management in Windows Vista: …/whdc/connect/pci/aspm.mspx
○ Windows Server 2008 Power Savings: http://download.microsoft.com/download/4/5/9/459033a1-6ee2-45b3-ae76-a2dd1da3e81b/Windows_Server_2008_Power_Savings.docx
○ Designing Efficient Background Processes for Windows (Trigger-Start Services): http://go.microsoft.com/fwlink/?LinkId=128622
ACPI Specifications: http://www.acpi.info
80 Plus Program for power supplies: http://www.80plus.org
Energy Star Power Supply Specification Draft: http://www.energystar.gov/ia/partners/prod_development/new_specs/downloads/Draft1_Server_Spec.pdf
E-mail: Server Power Feedback alias [email protected]

Page 146: 406 bruce worthington_windows_server_power_efficiency_slideshare

Sources
Estimating Total Power Consumption by Servers in the U.S. and the World – Jonathan G. Koomey, Ph.D. http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
Bureau of Labor Statistics http://data.bls.gov/cgi-bin/cpicalc.pl
US Energy Information Administration http://www.eia.doe.gov/fuelelectric.html
AFCOM Data Center Institute’s Five Bold Predictions, 2006 http://www.afcom.com/News_Releases/Afcom_In_The_News_05010601.asp
Intel Server Products Power Budget Analysis Tool http://www.intel.com/support/motherboards/server/sb/cs-016976.htm
Data center TCO benefits of reduced air flow -- Malone, Vinson, and Bash
Various Gartner press releases
Aperture Research Institute
EYP Mission Critical Facilities Inc.
Power In, Dollars Out: How to Stem the Flow in the Data Center http://www.microsoft.com/whdc/system/pnppwr/powermgmt/Svr_Pwr_ITAdmin.mspx

Page 147: 406 bruce worthington_windows_server_power_efficiency_slideshare


© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after

the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Page 148: 406 bruce worthington_windows_server_power_efficiency_slideshare


Find references for the following figures and include in the Resources slide

Page 149: 406 bruce worthington_windows_server_power_efficiency_slideshare

15 MW Datacenter Monthly Costs
“Good” (PUE=1.7) Internet-scale datacenter with DAS
Servers: $3,000,000
Infrastructure: $1,800,000
Power: $1,000,000
3 yr server and 15 yr infrastructure amortization
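The relative weight of each cost category follows directly from the three monthly figures above. The short Python sketch below computes the shares; the dictionary and function names are illustrative, and only the dollar amounts come from the slide.

```python
# Monthly cost figures taken from the slide; everything else is illustrative.
monthly_costs = {
    "Servers": 3_000_000,
    "Infrastructure": 1_800_000,
    "Power": 1_000_000,
}


def cost_shares(costs):
    """Return each category's fraction of the total monthly cost."""
    total = sum(costs.values())
    return {name: amount / total for name, amount in costs.items()}


shares = cost_shares(monthly_costs)
```

With these inputs the total is $5,800,000 per month, of which servers account for roughly 51.7%, infrastructure 31.0%, and power 17.2%.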

Page 150: 406 bruce worthington_windows_server_power_efficiency_slideshare

Growing Energy Demand
2004 Energy Consumption = ~100 quads
2004 Energy Expenditures = ~$910 billion
[Figure: U.S. Energy Consumption 1949 - 2004, All Fuels (TBTU), 0 to 36000, plotted over 1940 to 2010. Industrial = red, Transportation = purple, Residential = green, Commercial = blue.]

Page 151: 406 bruce worthington_windows_server_power_efficiency_slideshare


Datacenter Costs Breakdown - 1

Page 152: 406 bruce worthington_windows_server_power_efficiency_slideshare


Electricity Use by End-Use: 2000 - 2006

Page 153: 406 bruce worthington_windows_server_power_efficiency_slideshare


Page 154: 406 bruce worthington_windows_server_power_efficiency_slideshare


BACKUP SLIDES

Page 155: 406 bruce worthington_windows_server_power_efficiency_slideshare


POWER METERING and BUDGETING

Page 156: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering
In the future, servers are likely to have onboard power meters
○ AC power (into the power supply)
○ DC power (out of the power supply)
○ For individual components (CPU, RAM, IO, fans, disks, …)
WS08R2 provides the capability to monitor such meters, as well as communicate with power management logic, through standard Windows and ACPI interfaces

Page 157: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting
WS08R2 introduces the ability to report power consumption and budgeting information
○ Server platform reports this in-band to the OS via ACPI
○ No additional drivers or HW changes are required, only platform support
Power information is exposed via WMI
○ Adheres to the DMTF Power Supply Profile v1.01
Power budget information is reported to the OS
○ Optional support for configuring the budget from within Windows
Extendable to enable per-device metering
○ WDM driver interface available
Design goals
○ Standard hardware and software interfaces
○ Native infrastructure, easily extendable
○ Leverages existing platform technology

Page 158: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting
[Diagram: layered architecture, listed from top to bottom]
WMI consumers: System Center, admin scripts, management tools, ...
WMI namespace root\cimv2\power: Power Supply class, Power Meter class, Power Meter Events
User-mode Power Service: Power WMI providers
Standard Windows IOCTL interface
In-box ACPI-based implementation (vendors provide ACPI code in firmware); other vendor specific implementations…
Hardware: BMC hardware
Legend: Implemented in WS08R2

Page 159: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – WMI
Based on the DMTF management profiles
New power namespace – root\cimv2\power
1) Power supply device
○ Inventory information
○ Capabilities/characteristics
○ Redundancy information
[Diagram: class hierarchy, base class followed by the Win32 classes shown beneath it]
CIM_NumericSensor: Win32_PowerMeter
CIM_PowerSupply: Win32_PowerSupply
__ExtrinsicEvent: Win32_PowerMeterEvent, Win32_PowerSupplyEvent

Page 160: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – WMI
2) Power meter device
○ Inventory information
○ Capabilities/characteristics
○ Latest meter measurements
○ OS-Configurable trip-points
○ Configurable platform enforced limit
3) Power supply/meter events
○ Notification for changes in configuration and capabilities
○ Notification for trip-points crossed and platform limit enforced

Page 161: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – Usage
Statistical/inventory/auditing
Data center can monitor power consumption across nodes
Administrator can write scripts to control power policies and receive power condition events
Model can be extended to per-device meters
Another set of metrics for virtualization and consolidation

Page 162: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – WDM
Standard Windows driver IOCTL interface
Event model based on pending IO requests (IRPs)
Two separate device interfaces
Consumed by the WMI providers
An alternative to the ACPI implementation
Future direction – potentially consumed by the kernel power manager
Documented on MSDN

Page 163: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – ACPI
Rationale
○ Works as the abstraction layer to the underlying platform technology (IPMI, WSMAN, etc.)
○ Scales across different platforms
○ Does not require special drivers
○ Requires only firmware updates
Currently being proposed to the ACPI 4.0 specification
Delegate tasks to the BMC (e.g., rolling average calculation, polling for events, etc.)

Page 164: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – ACPI
Power supply device
○ Extends the current power source device
○ Control method to publish capabilities
Power meter device
○ Similar to control method batteries
○ A set of control methods to get capabilities and set configuration parameters, trip points, and configure hardware enforced limits
○ Event notification via Notify codes

Page 165: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power Metering and Budgeting – ACPI
WS08R2 will provide
○ In-box driver to support power meter device(s) described in ACPI
○ In-box IPMI operation region handler as part of the Microsoft IPMI driver, allowing ACPI control methods to communicate with IPMI using the KCS protocol
Format similar to the SMBUS OpRegion
3rd-party IPMI drivers can register OpRegion handler for other IPMI protocol(s)
Also proposed to ACPI 4.0 specification

Page 166: 406 bruce worthington_windows_server_power_efficiency_slideshare

Architecture Details
[Diagram: power metering components across user mode, kernel mode, firmware, and hardware]
User mode: User-mode Power Service (power supply provider, power meter provider, power policy provider)
Device interfaces: power supply interface, power meter interface (IOCTLs)
Kernel mode: Power Manager, acpi.sys, acpipsu.sys, acpipmi.sys, xyzpsu.sys, xyzpmi.sys, Microsoft IPMI driver (ipmidrv.sys) with IPMI handler
Firmware: ACPI control methods (e.g., Query power supply information, Set power meter trip points, Get power meter capabilities)
Hardware: BMC hardware, reached over the IPMI KCS protocol
Other labels: WDF drivers; Interprets; IPMI OpRegion encountered; Event feedback; Power policy feedback

Page 167: 406 bruce worthington_windows_server_power_efficiency_slideshare


STORAGE: SSDs

Page 168: 406 bruce worthington_windows_server_power_efficiency_slideshare

Flash SSD versus HDD (Jun ‘08)

Attribute | HDD | Flash SSD
Endurance (write cycles per bit) | 10^12 | 10^5 (SLC*), 10^6 (MLC*)
Cost per byte | 1x | 2.5x – 25x
Performance: Small random read requests | 1x | 10 – 100x
Active Power (Watts/byte) | 10-20x | 1x
Shock Resistance, Non-operating | 100g, 200g (2010) | 1500g
Shock Resistance, Operating | ~10g | 100g
Thermal (°C) | 5-55 | 0-70

* SLC – Single Level Cell
* MLC – Multi Level Cell

Page 169: 406 bruce worthington_windows_server_power_efficiency_slideshare

Flash Characteristics (Jun ’08)
Chip
○ Read 50 MB/s
○ Write 25 MB/s
○ Scales with number of chips
Read Latency
○ 25 μs to start, 100 μs for 2KB “page”
Write Latency
○ 200 to 300 μs for 2KB “page”
○ 2,000 μs to erase
Active Power
○ 1-2 Watts for 8 chips + controller

Page 170: 406 bruce worthington_windows_server_power_efficiency_slideshare

SSD High-IOps Workload TCO
Decrease TCO for IOps-intensive systems
○ IOps bottleneck causes customers to buy spindles instead of capacity, driving up TCO and operational complexity (e.g., workload balancing)
○ SSDs provide less expensive systems for same performance targets
Smaller form factors

Page 171: 406 bruce worthington_windows_server_power_efficiency_slideshare

SSD Performance Concerns - 1
Random write perf
○ Could be alleviated with next generation of products
○ New technological problems may arise with future generations (no guarantee that it will stay at same level)
Potential bottleneck on erasing/block cleaning
Mixing workloads creates unexpected performance characteristics
○ Read:write ratio, request sizes, sequentiality

Page 172: 406 bruce worthington_windows_server_power_efficiency_slideshare

SSD Performance Concerns - 2
First-pass performance might be better than steady-state
When nearing EOL, perf may degrade as blocks are removed from pool
Does mapping metadata have to be re-read/initialized after power failure?
Need enough onboard parallelism to keep internal serial interfaces from becoming bottlenecks
Just like disk arrays, the wrong stripe unit size can kill perf in an SSD array

Page 173: 406 bruce worthington_windows_server_power_efficiency_slideshare

WS08R2 Enables Improved Endurance for SSD Technology
SSD can identify itself differently from HDD in ATA as defined by ATA8-ACS
○ Identify Word 217: Nominal media rotation rate
Reporting non-rotating media will allow WS08R2 to set Defrag off as default, improving device endurance by reducing writes
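As an illustration of how word 217 distinguishes an SSD from a rotating disk, the Python sketch below decodes the field from a raw 512-byte IDENTIFY DEVICE buffer. It assumes the buffer has already been obtained by some other means (how to issue the command is outside this sketch), and the function name is hypothetical. The value ranges follow ATA8-ACS: 0000h means the rate is not reported, 0001h means non-rotating media, and 0401h through FFFEh give the rate in rpm.

```python
import struct


def nominal_rotation_rate(identify_data: bytes) -> str:
    """Decode IDENTIFY DEVICE word 217 (nominal media rotation rate)."""
    if len(identify_data) < 512:
        raise ValueError("IDENTIFY DEVICE data must be 512 bytes")
    # The buffer holds 256 little-endian 16-bit words; word N is at byte 2*N.
    (word,) = struct.unpack_from("<H", identify_data, 217 * 2)
    if word == 0x0000:
        return "not reported"
    if word == 0x0001:
        return "non-rotating"  # solid state device
    if 0x0401 <= word <= 0xFFFE:
        return f"{word} rpm"
    return "reserved"
```

A caller could then disable scheduled defragmentation whenever the result is "non-rotating", which is the behavior the slide describes for WS08R2.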

Page 174: 406 bruce worthington_windows_server_power_efficiency_slideshare

WS08R2 Enables Optimization for SSD Technology
Microsoft implementation of “Trim” feature
○ NTFS will send down delete notification to the device supporting “trim”
○ File system operations: Format, Delete, Truncate, Compression
○ OS internal processes: e.g., Snapshot, Volume Manager
Three optimization opportunities for the device
○ Enhancing device wear leveling by eliminating merge operation for all deleted data blocks
○ Making early garbage collection possible for fast write
○ Keeping device’s unused storage area as high as possible; more room for device wear leveling

Page 175: 406 bruce worthington_windows_server_power_efficiency_slideshare

Parallelism Tradeoffs
No one scheme optimal for all workloads
Highly sequential
○ Striping, ganging (for scale), and interleaving
Inherent parallelism in workload
○ Independent, deeply parallel request streams to the flash chips
Poor cleaning efficiency (no locality)
○ Background, intra-chip cleaning
With faster serial connect, intra-chip ops are less important

Page 176: 406 bruce worthington_windows_server_power_efficiency_slideshare

SSD Performance Trends (Sequential write)
[Figure: sequential write performance trend chart]
Sequential performance continues to improve
MLC drive’s performance is increasing
Sequential performance advantage is big and real
Source: a subset of sample data from internal lab

Page 177: 406 bruce worthington_windows_server_power_efficiency_slideshare

SSD Performance Trends (Random write)
[Figure: random write performance trend chart]
Random write speed is increasing
MLC drive’s random write is also improving
Random performance issues are being solved
Source: a subset of sample data from internal lab

Page 178: 406 bruce worthington_windows_server_power_efficiency_slideshare

SSD Cost Trends
[Figure: SSD cost ($) forecast chart]
The mark shows where the cost is today
And it continues down
The device cost will be in affordable range by 2010
1TB SSD is on the radar
Source: Semiconductor Forecast Worldwide--Forecast Database [SEQS-WW-DB-DATA], Gartner August 2008, by Joe Unsworth, et al.

Page 179: 406 bruce worthington_windows_server_power_efficiency_slideshare


VIRTUALIZATION

Page 180: 406 bruce worthington_windows_server_power_efficiency_slideshare

Hyper-V Power Management
Full P-state/C-state management already integrated between Windows root partition and Hyper-V v1 (WS08)
Enlightenments, such as timer assist, added in Hyper-V v2 (WS08R2)
○ Hypervisor delivers child clocks without requiring root interaction, plus ITTD
○ Core parking enabled for all partitions

Page 181: 406 bruce worthington_windows_server_power_efficiency_slideshare

WS08R2 Hyper-V Core Parking: Overview
Scheduling virtual machines on a single server for density as opposed to dispersion
○ This allows cores to be “parked” (put to sleep) by placing them into deep C states
Benefits
○ Significantly enhances Green IT by being able to reduce power required for CPUs
Idle improvements extend to Hyper-V
○ Significant reduction in platform interrupt activity
○ Enables power savings and greater scalability
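The idea of scheduling for density rather than dispersion can be illustrated with a simple bin-packing sketch. The Python code below is a toy model only: it uses a first-fit-decreasing heuristic chosen for illustration, not the Hyper-V scheduler's actual algorithm, and the function names and the 100% per-core capacity are assumptions.

```python
def pack_loads(loads_pct, num_cores, capacity_pct=100.0):
    """Place virtual processor loads on as few cores as possible
    (first-fit decreasing). Returns the utilization of each core."""
    cores = [0.0] * num_cores
    for load in sorted(loads_pct, reverse=True):
        for i in range(num_cores):
            if cores[i] + load <= capacity_pct:
                cores[i] += load
                break
        else:
            raise ValueError("not enough capacity for all loads")
    return cores


def parked_cores(cores):
    """Indices of cores with no load, which are candidates for parking."""
    return [i for i, used in enumerate(cores) if used == 0.0]
```

With eight guests at 35% each on a 16 LP server, this packing fills four cores to 70% and leaves twelve cores idle and eligible for deep C-states, whereas a dispersed placement would keep eight cores lightly loaded.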

Page 182: 406 bruce worthington_windows_server_power_efficiency_slideshare

Windows Server 2008: 16 LP Server

Page 183: 406 bruce worthington_windows_server_power_efficiency_slideshare

WS08R2 Hyper-V Core Parking: 16 LP Server
[Figure: two callouts, each reading: Processor is “parked”]

Page 184: 406 bruce worthington_windows_server_power_efficiency_slideshare


VIRTUALIZATION TESTS

Page 185: 406 bruce worthington_windows_server_power_efficiency_slideshare

Hyper-V Power Efficiency: Windows Server Performance Lab
Testbed Configurations
Single Workloads
○ Web Fundamentals
○ SPECpower
Mixed Workloads

Page 186: 406 bruce worthington_windows_server_power_efficiency_slideshare

Hyper-V Power Efficiency: Workload configuration - 1
Methodology for obtaining power load line data for TPC-C, TPC-E, FSCT, and Web Fundamentals has been demonstrated
○ Benchmark loads varied by throttling number of active users
○ Multiple workloads tested in Hyper-V environment
SPECpower has been successfully tuned

Page 187: 406 bruce worthington_windows_server_power_efficiency_slideshare

Workload Characteristics
Web Fundamentals (WF)
○ Dynamic scenario
○ CPU-bound workload
SPECpower (modified SPECjbb)
○ Kit version 1.0
○ Java version: JDK 1.6.0_02
○ JVM options: -Xms1024m -Xmx1024m -XXaggressive -XXlargePages -XXthroughputCompaction -XXcallprofiling -XXlazyUnlocking -Xgc:genpar -XXgcthreads:2 -XXtlasize:min=8k,preferred=1024k

Page 188: 406 bruce worthington_windows_server_power_efficiency_slideshare

Hyper-V Power Efficiency: Workload configuration - 2
Single workloads
○ All the guests run the same workload
○ Two scenarios: fixing the number of active guests and scaling the load in each guest; fixing the load in each guest and activating more guests
Mixed workloads
○ Half of guests run each workload
○ Fixed load in WF guests (~35% CPU utilization each)
○ Varying load in SPECpower guests

Page 189: 406 bruce worthington_windows_server_power_efficiency_slideshare

HW and SW Test Configurations
Hardware
○ 2-socket quad-core processors (Minimal P-States)
○ 16GB memory: 4x4GB 667MHz DIMMs
○ External (wall) power monitor
Software
○ OS: Windows Server 2008 (OS Power Management: Balanced mode)
○ Hyper-V v2 (pre-release build)
○ Configured with 8 guests: single virtual processor (3.16GHz), 1.75GB memory

Page 190: 406 bruce worthington_windows_server_power_efficiency_slideshare

Web Fundamentals Dynamic: Adding Load to Each Guest
[Figure: Power usage for various throughput levels. Watts (250 to 370) versus Throughput (Requests / Sec, 0 to 25000).]
[Figure: Throughput and power usage versus total system utilization. Watts (250 to 370) and Throughput (Requests / Sec, 0 to 25000) versus System Utilization Percentage.]
For these experiments, the highest system utilization under WF workload is ~80%. This issue has been subsequently resolved.

Page 191: 406 bruce worthington_windows_server_power_efficiency_slideshare

Web Fundamentals Dynamic: Activating Guests - 1
[Figure: Throughput and power usage versus total system utilization. Watts (250 to 370) and Throughput (requests / sec, 0 to 25000) versus System Utilization Percentage.]
[Figure: Power usage for various throughput levels. Watts (250 to 370) versus Throughput (requests / sec, 0 to 25000).]
• Data points from left to right: 0 guest, 1 guest, 2 guests, …, 8 guests active
• Each active guest tries to run at the maximum load

Page 192: 406 bruce worthington_windows_server_power_efficiency_slideshare

Web Fundamentals Dynamic: Activating Guests - 2
[Figure: Virtual processor utilizations for different numbers of active guests. Virtual Processor Utilization (%, 0 to 100) versus Number of Guests (1 to 8), one series per guest (Guest 1 to Guest 8).]
• The maximum utilization of each guest decreases as more guests are activated. Most of this decrease has been subsequently removed.

Page 193: 406 bruce worthington_windows_server_power_efficiency_slideshare

SPECpower: Adding Load to Each Guest - 1
[Figure: Throughput and power usage versus total system utilization. Watts (250 to 390) and Throughput (0 to 300,000) versus System Utilization Percentage.]
[Figure: Power usage for various throughput levels. Power (% of Max Watts, 65% to 95%) versus Workload (% of maximum throughput, 0% to 100%).]

Page 194: 406 bruce worthington_windows_server_power_efficiency_slideshare

SPECpower: Adding Load to Each Guest - 2
[Figure: Average processor frequency for various workload levels. Frequency (MHz) versus Workload (% of Max Throughput, 0% to 100%).]

Page 195: 406 bruce worthington_windows_server_power_efficiency_slideshare

SPECpower: Adding Load to Each Guest - 3
[Figure: Virtual processor utilizations for various workload levels. Utilization (%, 0 to 100) versus Load (100%, 80%, 60%, 40%, 20%, Active Idle), one series per guest (Guest 1 to Guest 8).]
[Figure: Logical processor utilizations for various workload levels. Utilization (%, 0 to 100) versus Load (100%, 80%, 60%, 40%, 20%, Active Idle), one series per processor (Proc 1 to Proc 8).]
• All the guests change load levels concurrently.
• The VM scheduler is biased towards utilizing higher numbered processors.

Page 196: 406 bruce worthington_windows_server_power_efficiency_slideshare

SPECpower: Activating Guests
[Figure: Throughput and power usage versus total system utilization. Watts (250 to 390) and Throughput (0 to 300,000) versus System Utilization Percentage.]
[Figure: Power usage for various throughput levels. Power (% of Max Watts, 60% to 100%) versus Workload (% of maximum throughput, 0% to 100%).]
Similar scalability behavior of power and throughput as when adding load.

Page 197: 406 bruce worthington_windows_server_power_efficiency_slideshare

SPECpower and WF: Mixed Workloads - 1
[Figure: SPECpower throughput and server power usage versus total system utilization. Watts (250 to 390) and Throughput (0 to 140,000) versus System Utilization Percentage.]
[Figure: Power usage for various throughput levels. Power (% of Watts, 60% to 100%) versus Workload (% of maximum throughput, 0% to 100%).]
• 4 Guests running WF (4940 req/sec)
• ~25% system utilization; ~35% guest virtual processor utilization
• 4 Guests running SPECpower (similar efficiency as single workload)

Page 198: 406 bruce worthington_windows_server_power_efficiency_slideshare

SPECpower and WF: Mixed Workloads - 2
[Figure: Average processor frequency for various levels of SPECpower workload. Frequency (MHz) versus Workload (% of Max SPECPower Throughput, 0% to 100%).]

Page 199: 406 bruce worthington_windows_server_power_efficiency_slideshare

Hyper-V Power Efficiency: Future Experiments
More workloads
Different workload mix scenarios
○ Different combinations of fixed and varying workloads
More VM configurations
○ Multiple virtual processors per guest
○ Oversubscription

Page 200: 406 bruce worthington_windows_server_power_efficiency_slideshare


WMI

Page 201: 406 bruce worthington_windows_server_power_efficiency_slideshare

Power WMI Provider
Enables power policy configuration through standard WMI interface
○ Change power setting values
○ Activate a given plan
○ Conforms to DMTF data model
To get started…
○ Change a power setting: Win32_PowerSetting
○ Activate a plan: Win32_PowerPlan.Activate() method
Attend ENT-T552 Windows Server Power Management Implementation Details for additional details

Page 202: 406 bruce worthington_windows_server_power_efficiency_slideshare

Configuration and Administration
WMI interfaces to query and set configuration settings
○ Configuration of systems
○ Global administration
○ Management applications
WMI interfaces to query current and hardware capabilities
○ 3rd party applications
○ Diagnostics

Page 203: 406 bruce worthington_windows_server_power_efficiency_slideshare

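WMI – Set Power Settings

' Display blank timeout; the backslash is doubled because the ID is embedded in a WQL string
TargetSetting = "Microsoft:PowerSetting\\{3c0bc021-c8a8-4e07-a973-6b14cbcb2b7e}"
Set objWMIService = GetObject("WinMgmts:\\.\root\cimv2\power")

' Find the per-plan data index objects for the target setting
Set SettingIndices = objWMIService.ExecQuery("ASSOCIATORS OF {Win32_PowerSetting.InstanceID=" & chr(34) & TargetSetting & chr(34) & "} WHERE ResultClass = Win32_PowerSettingDataIndex")

For Each SettingIndex In SettingIndices
  EscapedID = Replace(SettingIndex.InstanceID, "\", "\\")
  ' ExecQuery returns a collection, so iterate over the associated plan(s)
  Set Plans = objWMIService.ExecQuery("ASSOCIATORS OF {Win32_PowerSettingDataIndex.InstanceID=" & chr(34) & EscapedID & chr(34) & "} WHERE ResultClass = Win32_PowerPlan")
  For Each Plan In Plans
    If Plan.IsActive Then
      SettingIndex.SettingIndexValue = 120 ' value is in seconds (2 minutes)
      SettingIndex.Put_
      Plan.Activate() ' re-activate the plan so the new value takes effect
    End If
  Next
Next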
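The query strings in the script above are easy to get wrong because the instance ID must be quoted and its backslashes doubled inside the WQL string. The Python sketch below only builds the query text; it does not talk to WMI, and the helper names are hypothetical. The class names are the ones that appear on these slides.

```python
def escape_wql(value: str) -> str:
    """Escape backslashes and double quotes for use inside a WQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def associators_query(source_class: str, instance_id: str,
                      result_class: str = None) -> str:
    """Build an ASSOCIATORS OF query keyed on InstanceID."""
    query = (f'ASSOCIATORS OF {{{source_class}.InstanceID='
             f'"{escape_wql(instance_id)}"}}')
    if result_class:
        query += f" WHERE ResultClass = {result_class}"
    return query
```

The resulting string can be passed to `ExecQuery` from any WMI client. Note that `instance_id` here is the raw ID with single backslashes, whereas the `TargetSetting` literal in the script above is already escaped.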

Page 204: 406 bruce worthington_windows_server_power_efficiency_slideshare

Remote Power Manageability - 1
WS08R2 supports the configuration of power policy via WMI
○ Local and remote management via WMI
○ Adheres to DMTF conventions for setting data
○ Scriptable
Includes support for reading and writing of all power plan and setting data
Active power plan can be changed remotely
Power Action can be carried out (sending a node to S3)

Page 205: 406 bruce worthington_windows_server_power_efficiency_slideshare

Remote Power Manageability - 2
[Diagram: class relationship among objects and associations in the power namespace]
Win32_PowerPlan
Win32_PowerSetting
Win32_PowerSettingSubgroup
Win32_PowerSettingDataIndex
Win32_PowerSettingDefinition
Win32_PowerSettingDefinitionPossibleValue
Win32_PowerSettingDefinitionRangeData
Win32_PowerSettingCapabilities
Win32_PowerSettingInSubgroup
Win32_PowerSettingDataIndexInPlan
Win32_PowerSettingElementDataIndex
Win32_PowerSettingDefineCapabilities

Page 206: 406 bruce worthington_windows_server_power_efficiency_slideshare

Remote Power Manageability - 3

Get the Active Plan:
Set objWMIService = GetObject("WinMgmts:\\.\root\cimv2\power")
Set PowerPlans = objWMIService.InstancesOf("Win32_PowerPlan")
For Each PowerPlan in PowerPlans
  If PowerPlan.IsActive Then
    wscript.echo "Current Plan: " & PowerPlan.ElementName
  End If
Next

Set the Active Plan:
PowerPlan.Activate()

Page 207: 406 bruce worthington_windows_server_power_efficiency_slideshare

Remote Power Manageability - 4

Get all power settings in the Active Plan:
(Continued with PowerPlan)

EscapedInstanceID = Replace(PowerPlan.InstanceID, "\", "\\")
Set PowerSettingIndexes = objWMIService.ExecQuery("ASSOCIATORS OF {Win32_PowerPlan.InstanceID=" & chr(34) & EscapedInstanceID & chr(34) & "}")

For Each PowerSettingIndex in PowerSettingIndexes
  EscapedInstanceID = Replace(PowerSettingIndex.InstanceID, "\", "\\")
  Set PowerSettings = objWMIService.ExecQuery("ASSOCIATORS OF {Win32_PowerSettingDataIndex.InstanceID=" & chr(34) & EscapedInstanceID & chr(34) & "} WHERE ResultClass = Win32_PowerSetting")
  For Each PowerSetting in PowerSettings
    wscript.echo "Power Setting: " & PowerSetting.InstanceID
    wscript.echo "Description: " & PowerSetting.Description
    wscript.echo "Index Value: " & PowerSettingIndex.SettingIndexValue