resource management: assessing the warm water facilities

Post on 24-Jan-2022

7 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

Resource Management: Assessing the Warm Water Facilities and Systems Supporting ORNL’s SummitJim Rogers

Director, Computing and FacilitiesNational Center for Computational SciencesOak Ridge National Laboratory

22

ORNL Leadership Class Systems 2004 - 2018

2012Cray XK7

Titan

27PF

18.5TF

25 TF

54 TF

62 TF

263 TF

1 PF

2.5PF

2004Cray X1E Phoenix

2005Cray XT3

Jaguar

2006Cray XT3

Jaguar

2007Cray XT4

Jaguar

2008Cray XT4

Jaguar

2008Cray XT5

Jaguar

2009Cray XT5

Jaguar

From 2004 – 2018, HPC systems relied on chiller-based cooling (5.5°C / 42°F

supply) with annualized PUEs to ~1.4

33

ORNL’s Transition to Warmer Facility Supply Temperatures

~1.5EF

200PF

27PF

2012Cray XK7

Titan

2021/2022HPE/Cray Frontier

2018/2019IBM

Summit

Titan: Refrigerant-based per-rack cooling with direct rejection of heat to cold 5.5°C water• Below dewpoint• 100% use of chillers

Summit: A combination of direct on-package cooling and RDHX with 21°C / 70°Fsupply is > 95% room-neutral.• Above dewpoint• Contribution by chillers

~20% of the hours of the year

Frontier: Custom packaging is >95% room-neutral with a 32°C / 90°F supply.• ~100% Evaporative Cooling,

with supplemental HVAC for parasitic loads

4

Motivation for “Warmer” Cooling Solutions Serving HPC Centers

• Reduced Cost, Both CAPEX and OPEX– Reduce or eliminate the need for traditional

chillers. • No chillers, no ozone-depleting refrigerants (GREEN)

– Oak Ridge calculates an annualized PUE for air--cooled devices of no better than 1.4 (ASHRAE Zone 4A – Mixed Humid)

Oak Ridge, TN

– HPC power budgets continue to grow – Summit has a design point for >12MW (HPC-only). Minimizing PUE/ITUE is critical to the budget.

• Easier, more reliable design– Design is reduced to pumps, evaporative cooling, heat exchangers.– Traditional chilled water may not be necessary at all (NREL, NCAR, et al)

55

OLCF Facilities Supporting Summit• Titan – 9MW @ heavy

load• Sitting on 250

pounds/ft2 raised floor• Uses 42F water and

special CDUs (XDPs)

• Summit – 256 compute cabinets on-slab

• 100% room-neutral design uses RDHX

• 20MW warm-water cooling plant using centralized CDU/secondary loop

66

• Summit– Demand: 3.3 (idle)-11.5MW;

• Secondary Loop– Supply 3300GPM (12,500

liters/min) @ 21°C; Return @ 29-33°C

– CPUs and GPUs use cold plates– DIMMs and parasitic loads use

RDHX– Storage and Network use RDHX

21°C/20MW/7700 ton Facility System Design System

77

Medium Water Temperature Cooling DesignPrimary Loop uses Evaporative Cooling Towers (~80% of the hours of the year)

When the MTW RETURN is above the 21C setpoint, use a second set of Trim HX (with 5.5C on the other side) to drive MTW to the 21C setpoint.

The need for the trim-loop is about 20% of the hours in the year, and can ramp 0-100% to meet the setpoint back to Summit

A New Challenge – Titan and rotated off-line, so there is no significant load on the Chillers[1-5]. NOAA moved to the warm water loop (~500GPM demand)

Existing chilled water cooling loop

New primarycooling loop

88

99

10

Benefits of ORNL’s Warm-Water (21°C) Mechanical Plant

Warm Water• Total Capacity to manage 20MW of IT

load

• Warm Water allows annualized PUE of 1.1– For each ~$1M cost per MW-year for

consumption on Summit;– A corresponding ~$100k cost per MW-year

for waste heat management

• Titan was $6-7M/year plus $1.8-$2M to remove the waste heat

• Summit will have a similar “power bill”, but closer to $500k annually for waste heat.

Integrated System/Facility Operation• Integration with the PLC allows us to tune

water pressure and flow– Better delta(t); less pumping energy

• Integration with IBM’s OpenBMC provides information necessary to protect ~37K CPUs and GPUs from inadequate flow across the cold plates

• Integration with the scheduler allows us to correlate power and temperature data with individual applications

11

Summit’s Power Demand, Aug 2018– Aug 2019

Early Access, Gordon Bell, Acceptance Testing

Transition to Operations

Full Production

12

Detailed Analysis of Summit’s Annual PUE

1.00

1.05

1.10

1.15

1.20

1.25

1.30

0

2000

4000

6000

8000

10000

12000

14000

2018

/08/

0120

18/0

8/11

2018

/08/

2220

18/0

9/02

2018

/09/

1220

18/0

9/23

2018

/10/

0420

18/1

0/14

2018

/10/

2520

18/1

1/05

2018

/11/

1620

18/1

1/26

2018

/12/

0720

18/1

2/18

2018

/12/

2820

19/0

1/08

2019

/01/

1920

19/0

1/30

2019

/02/

0920

19/0

2/20

2019

/03/

0320

19/0

3/13

2019

/03/

2420

19/0

4/04

2019

/04/

1520

19/0

4/25

2019

/05/

0620

19/0

5/17

2019

/05/

2720

19/0

6/07

2019

/06/

1820

19/0

6/28

2019

/07/

0920

19/0

7/20

2019

/07/

3120

19/0

8/10

2019

/08/

21

PUE

Cool

ing

(kW

cm)

Date

Cooling Source August 1 2018-August 31 2019 With 1 Week PUE

CHW CTW 1 Week PUE (kW)

Seasonal Transition (Summer to Fall)

Winter/Spring: 100% Evaporative Cooling

Seasonal Transition (to Summer)

1313

Summit’s RM Challenge – Data Volume

13

TOTAL ~470,000 metrics per second available for real-time analytics

• Data Streams Include:

• IBM OpenBMC framework (99 metrics/node/second x 4608 nodes = ~460,000/sec) - OOB

• IBM LSF jobs data for running applications (~10sec update interval)

• NOAA weather/wet bulb for Oak Ridge area (continuous, external)

• Programmable Logic Circuit (PLC) water flow (continuous, protected as part of BAS)

14

Sample from Grafana Dashboard (live version shown)

15

MTW Performance

Summit (System) Impact:• No substantive impact to CPU or GPU

operating conditions (junction temperatures)

• One degree increase in room temperature (worst case about 72.5° F)

Year over Year, 2018 to 2019:• IT Demand (kW-h) is up 19%• CHW use is 39% LESS than in 2018• The one-degree adjustment of supply

temperature provided a savings of $40,000

2018:Supply 70°F & ~4500GPM 2019:Supply 71°F & ~3300GPM

Energy used by Chilled Water

Total Energy, Mechanical Plant

PUE Improves

Summit IT Load (kW-h) (Cumulative) Increases

Wet Bulb

1616

Summit Power/Jobs Every Second

Click toPlayMovie

1717

Summit Temperature/Jobs Every Second

Click toPlayMovie

1818

top related