
Post on 05-Aug-2019


Page 1: Performance Analysis of Computer Systems - TU Dresden · – Architecture and performance analysis of High Performance Computers – Programming methods and techniques for HPC systems

Holger Brunst ([email protected])

Matthias S. Mueller ([email protected])

Center for Information Services and High Performance Computing (ZIH)

Performance Analysis of Computer Systems

Introduction

Page 2

Organization

Lecture: Every Wednesday in INF E001 from 13:00 to 14:30

Labs: Every Thursday in INF E069 from 13:00 to 14:30

First Exercise: October 20th

– You need an account in the PC pools!!

All slides will be in English

Ten minute summary of last lecture at the beginning of each lecture

List of attendees

Slide 2 LARS: Introduction and Motivation

Page 3

Class Material on the Web

Slides will be put on the web prior to or shortly after each class

The slides from last year are still online.

– http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/lehre/ws1011/lars

Be aware of updates for this term.

– http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/lehre/ws1012/lars

Page 4

Class Outline (tentative)

15 lectures with corresponding exercises

Class structure

– Introduction and motivation

– Performance requirements, metrics, and common evaluation mistakes

– Workload types, selection, and characterization

– Commonly used benchmarks

– Monitoring techniques

– Capacity planning for future systems

– Performance data presentation

– Summarizing measured data

– Regression models

– Experimental design

– Performance simulation and prediction

– Introduction to queuing theory

Page 5

Literature

Raj Jain: The Art of Computer Systems Performance Analysis

John Wiley & Sons, Inc., 1991 (ISBN: 0-471-50336-3)

Rainer Klar, Peter Dauphin, Franz Hartleb, Richard Hofmann, Bernd Mohr, Andreas Quick, Markus Siegle: Messung und Modellierung paralleler und verteilter Rechensysteme

B.G. Teubner Verlag, Stuttgart, 1995 (ISBN: 3-519-02144-7)

Dongarra, Gentzsch, Eds.: Computer Benchmarks, Advances in Parallel Computing 8, North Holland, 1993 (ISBN: 0-444-81518-x)

Page 6

Holger Brunst ([email protected])

Matthias S. Mueller ([email protected])

Introduction and Motivation

Why is Performance Analysis Important?

Page 7

Overview

Development of hardware performance

Implications on application performance

Compute power at Technische Universität Dresden

Research at ZIH

Some advertising

Page 8

Moore’s Law: 2X Transistors / “year”

“Cramming More Components onto Integrated Circuits”

Gordon Moore, Electronics, 1965

# of transistors per cost-effective integrated circuit doubles every N months (18 ≤ N ≤ 24)
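A minimal sketch of what this doubling rule implies, assuming pure exponential growth. The function name `transistor_growth` and the 10-year horizon are illustrative choices, not part of the lecture.

```python
def transistor_growth(years: float, doubling_months: float) -> float:
    """Growth factor after `years` if the count doubles every `doubling_months` months."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (12.0 * years / doubling_months)


# Compare the two ends of the 18..24 month range over one decade.
for n in (18, 24):
    print(f"N = {n} months: factor {transistor_growth(10, n):.0f} in 10 years")
```

The spread between the two bounds is large: a small change in N compounds into very different transistor budgets after a decade.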

Page 9

Performance Development in TOP500

Page 10

Extrapolation to Exascale

Matthias S. Müller

[Figure: TOP500 performance development extrapolated to exascale; logarithmic y-axis from 100 Mflop/s to 1 Eflop/s]

Erich Strohmaier: Highlights of the 37th TOP500 List, ISC‘11

Page 11

John Shalf (NERSC, LBNL)

Page 12

Number of Cores per System is Increasing Rapidly

[Figure: Total # of cores in the Top15 systems, semi-annual lists from June 1993 to December 2008; y-axis: processors, 0 to 1,200,000]

Page 13

Number of Cores per System is Increasing Rapidly

Page 14

Cray XT5 (Jaguar) at Oak Ridge National Laboratory

Page 15

IBM Roadrunner at Los Alamos National Laboratory

First computer to surpass the 1 Petaflop/s (10^15 FLOPS, about 2^50) barrier

Installed at Los Alamos National Laboratory

Hybrid Architecture

13,824 AMD Opteron cores

116,640 IBM PowerXCell 8i cores

Cost: $120 million

Page 16

IBM BlueGene/P (JUGENE) at Research Centre Jülich

Number five in TOP 500

Installed at Forschungszentrum Jülich

72 Racks with 32 node cards x 32 compute cards (total 73728)

294,912 PowerPC 450 cores, 850 MHz

144 TB main memory
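The JUGENE figures can be cross-checked with a few lines of arithmetic. This is a sketch: the value of four cores per compute card is an assumption derived from the slide's own totals (294,912 / 73,728), and 1 TB is taken as 1024 GB.

```python
RACKS = 72
NODE_CARDS_PER_RACK = 32
COMPUTE_CARDS_PER_NODE_CARD = 32
CORES_PER_COMPUTE_CARD = 4   # assumption: derived from 294,912 / 73,728
MAIN_MEMORY_TB = 144

compute_cards = RACKS * NODE_CARDS_PER_RACK * COMPUTE_CARDS_PER_NODE_CARD
cores = compute_cards * CORES_PER_COMPUTE_CARD
# Memory per compute card in GB, taking 1 TB = 1024 GB.
memory_per_card_gb = MAIN_MEMORY_TB * 1024 / compute_cards

print(compute_cards, cores, memory_per_card_gb)
```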

Page 17

K Computer System

No. 1 system in TOP500 (June 2011)

“K” means 10^16

>80,000 Processors

>640,000 Cores

10 MW power consumption

SPARC64 VIIIfx CPU

16 GB/node, 2 GB/core

Direct water cooling

Page 18

What Kind of Know-How is Required for HPC?

Algorithms and methods

Performance Analysis

Programming (Paradigms and details of implementations)

Operation of supercomputers (network, infrastructure, service, support)

Page 19

Challenges

Languages

– Fortran95, C/C++, Java,

– Also scripting languages!

Parallelization:

– MPI, OpenMP, CUDA/SILC, Threading

Network

– Ethernet, Infiniband, Myrinet, …

Scheduling

– Distributed components, job scheduling, process scheduling

System architecture

– Processors, memory hierarchy

Page 20

Holger Brunst ([email protected])

Matthias S. Mueller ([email protected])

Application Performance

Page 21

From Modeling to Execution

Page 22

Environment / Machine Room

Why is performance hard and energy efficiency harder?

– Both are vertical/cross cutting problems:

Matthias Mueller

Digital Logic Level

Microarchitecture Level

Hardware

Instruction Set Architecture Level

Operating System Machine Level

Assembly Language Level

Problem-oriented Language Level

Performance

Energy Efficiency

Design and Implementation

Algorithm

Mathematical Representation

Modeling

Page 23

Short History of X86 CPUs

CPU          Year  Bit width  #Transistors  Clock        Structure  L1 / L2 / L3
4004         1971  4          2,300         740 kHz      10 µm
8008         1972  8          3,500         500 kHz      10 µm
8086         1978  16         29,000        10 MHz       3 µm
80286        1982  16         134,000       25 MHz       1.5 µm
80386        1985  32         275,000       33 MHz       1 µm
80486        1989  32         1,200,000     50 MHz       0.8 µm     8K
Pentium I    1994  32         3,100,000     66 MHz       0.8 µm     8K
Pentium II   1997  32         7,500,000     300 MHz      0.35 µm    16K/512K*
Pentium III  1999  32         9,500,000     600 MHz      0.25 µm    16K/512K*
Pentium IV   2000  32         42,000,000    1.5 GHz      0.18 µm    8K/256K
P IV F       2005  64                       2.8-3.8 GHz  90 nm      16K/2MB
Core i7      2008  64         781,000,000   3.2 GHz      45 nm      32K/256K/8MB
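As a rough consistency check against Moore's Law, the first and last rows of the table give an average transistor doubling time. This is only a sketch: it assumes steady exponential growth between the two data points, and the helper name `doubling_time_months` is hypothetical.

```python
import math


def doubling_time_months(count_start: float, year_start: int,
                         count_end: float, year_end: int) -> float:
    """Average doubling time in months between two (year, transistor count) points."""
    doublings = math.log2(count_end / count_start)
    return 12.0 * (year_end - year_start) / doublings


# 4004 (1971, 2,300 transistors) to Core i7 (2008, 781,000,000 transistors)
months = doubling_time_months(2_300, 1971, 781_000_000, 2008)
print(f"Average doubling time: {months:.1f} months")
```

With the table's values the result is close to the upper end of the 18 to 24 month range.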

Page 24

Intel Nehalem

Released 2008

4 cores

781.000.000 transistors

45nm technology

32 KB L1 data cache, 32 KB L1 instruction cache

256 KB L2 cache

8 MB shared L3 cache

Hyperthreading

3.2 GHz × 4 cores × 4 FLOPs/cycle = 51.2 Gflop/s peak

Integrated memory controller

QPI between processors
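The peak figure on this slide is the product of clock rate, core count, and floating-point operations per cycle. A minimal sketch of that calculation follows; the function names are illustrative, and the efficiency example uses the Linpack and peak values quoted for the SGI Altix 4700 later in this deck.

```python
def peak_gflops(clock_ghz: float, cores: int, flops_per_cycle: int) -> float:
    """Theoretical peak in Gflop/s: clock rate x cores x FLOPs per cycle per core."""
    return clock_ghz * cores * flops_per_cycle


def efficiency(measured: float, peak: float) -> float:
    """Fraction of the theoretical peak that a benchmark actually reached."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return measured / peak


nehalem_peak = peak_gflops(3.2, 4, 4)   # values from the Nehalem slide
altix_eff = efficiency(11.9, 13.0)      # Linpack vs. peak, both in Tflop/s
print(f"Nehalem peak: {nehalem_peak:.1f} Gflop/s")
print(f"Altix 4700 Linpack efficiency: {altix_eff:.1%}")
```

Peak performance is an upper bound only; real applications are usually limited by the other factors listed on the following slides, such as bandwidth and latency.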

Page 25

Nehalem Core

[Die diagram labels:]

Execution Units

Out-of-Order Scheduling & Retirement

L2 Cache & Interrupt Servicing

Instruction Fetch & L1 Cache

Branch Prediction

Instruction Decode & Microcode

Paging

L1 Data Cache

Memory Ordering & Execution

Page 26

Potential factors limiting performance

“Peak performance”

Floating point units

Integer units

… any other feature of micro architecture

Bandwidth (L1,L2,L3, main memory, other cores, other nodes)

Latency (L1,L2,L3, main memory, other cores, other nodes)

Power consumption

Page 27

Performance development in TOP500

Page 28

Does the rest of the system develop at CPU speed?

[Figure: Processor-DRAM memory gap (latency); relative performance on a logarithmic axis (1 to 1000) over time, 1980 to 2000]

µProc (“Moore’s Law”): 60% / yr. (2X / 1.5 yr)

DRAM: 9% / yr. (2X / 10 yrs)

Processor-memory performance gap: grows 50% / year
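The annual growth rates quoted for this figure can be compounded to see how quickly the gap opens. The sketch below assumes constant yearly rates (60% for processors, 9% for DRAM, as on the slide); the function names are hypothetical, and the slide's doubling times are rounded values.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant annual growth rate (0.60 means 60%)."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


def relative_gap(years: float, cpu_growth: float = 0.60,
                 dram_growth: float = 0.09) -> float:
    """Ratio of CPU to DRAM performance after `years`, starting from parity."""
    return ((1.0 + cpu_growth) / (1.0 + dram_growth)) ** years


print(f"CPU doubling time: {doubling_time_years(0.60):.2f} years")
print(f"Gap growth per year: {relative_gap(1) - 1:.0%}")
print(f"Gap after 20 years: factor {relative_gap(20):.0f}")
```

The yearly ratio 1.60 / 1.09 is about 1.47, which matches the roughly 50% per year stated for the gap.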

Page 29

Performance Trends measured by SPECint

Source: Hennessy, Patterson: “Computer Architecture: A Quantitative Approach”.

Page 30

CPUint2006 development 2005 - 2009

Page 31

Performance Trends measured by SPECint

2009

23%

Page 32

CPUfp2006 development 1991 - 2009

CPU 95

Released 1995

602 results between 3/1991 and 1/2001

CPUfp2000

Released 2000

1385 results between 10/1996 and 2/2007

CPUfp2006

Released 2006

1217 results between 4/1997 and 4/2009

42%

33%

30%

Page 33

Performance Trends over a 20-year life cycle

Page 34

Performance Trends over a 20-year life cycle

Where is your

application?

Page 35

Holger Brunst ([email protected])

Matthias S. Mueller ([email protected])

Center for Information Services and HPC

A short introduction

Page 36

HPC in Germany: Gauß-Allianz

Matthias S. Müller

Members:

– GCS, HLRN (RRZN, ZIB), RWTH, TU Dresden, RZG, TU Darmstadt, DWD, DKRZ, SCC

– G-CSC, PC^2, RRZE, DFN, DESY, RRZK

Page 37

Responsibilities of ZIH

Providing infrastructure and qualified service for TU Dresden and Saxony

Research topics

– Architecture and performance analysis of High Performance Computers

– Programming methods and techniques for HPC systems

– Software tools to support programming and optimization

– Modeling algorithms of biological processes

– Mathematical models, algorithms, and efficient implementations

Role of mediator between vendors, developers, and users

Adoption and preparation of new concepts, methods, and techniques

Teaching and Education

Page 38

Compute Server Infrastructure

[Diagram labels: HPC component, main memory 6.5 TB; PC farm; HPC-SAN, disk capacity 68 TB; PC-SAN, disk capacity 68 TB; petabyte tape archive, capacity 1 PB; link bandwidths 8 GB/s, 4 GB/s, 4 GB/s, 1.8 GB/s]

HPC-Component

– SGI® Altix® 4700

– 2048 Montecito cores

– 6.5 TByte main memory

PC-Farm

System from Linux Networx

AMD Opteron CPUs (dual core, 2.6 GHz)

728 boards with 2592 cores

Infiniband networks between the nodes

Page 39

HPC-System: SGI Altix 4700 (Mars)

32 x 42U Racks

1024 sockets with Itanium 2 Montecito dual-core CPUs (1.6 GHz / 9 MB L3 cache)

13 TFlop/s peak performance

11.9 TFlop/s Linpack

6.5 TB shared memory

Page 40

Linux Networx PC-Farm (Deimos)

– 26 water cooled racks (Knürr)

– 1296 AMD Opteron x85 dual-core CPUs (2.6 GHz)

– 728 compute nodes with 2 (384), 4 (232) or 8 (112) cores

– 2 master and 11 Lustre servers

– 2 GB memory per core

– 68 TB SAN disc (RAID 6)

– Local scratch discs (70, 150, 290 GB)

– 2 4x-Infiniband Fabrics (MPI + I/O)

– OS: SuSE SLES 10

– Batch system: LSF

– Compiler: Pathscale, PGI, Intel, Gnu

– ISV-Codes: Ansys100, CFX, Fluent, Gaussian, LS-DYNA, Matlab, MSC
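The node and core counts in this list are mutually consistent, which a short calculation confirms. The dictionary layout below is an illustrative choice; all numbers are taken from the list above.

```python
# cores per node -> number of nodes of that type (from the Deimos slide)
node_groups = {2: 384, 4: 232, 8: 112}

total_nodes = sum(node_groups.values())
total_cores = sum(cores * nodes for cores, nodes in node_groups.items())
dual_core_cpus = total_cores // 2          # each Opteron x85 has two cores
total_memory_gb = total_cores * 2          # 2 GB memory per core

print(total_nodes, total_cores, dual_core_cpus, total_memory_gb)
```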

Page 41

Computer Rooms – Extension to the Building

Page 42

Performance of Supercomputers at ZIH

[Figure: Performance of supercomputers at ZIH by year; logarithmic y-axis in TFLOPS from 0.0001 to 10000; reference curves for TOP500 rank 1, rank 10, and rank 500]

– VP200-EX: 472 MFlops (rank 500)

– Cray T3E: 28 GFlops (rank 237)

– SGI Origin 2000: 16.5 GFlops (rank 236)

– SGI Origin 3800: 85.4 GFlops (rank 351)

– PC-Farm: 10.88 TFlops (rank 79)

– SGI Altix: 11.9 TFlops (rank 49)

– HRSK-II stage 1

– HRSK-II stage 2

– RWTH: 300 TFlops

Matthias S. Müller

Page 43

Procurement of the analysis component of the data center

Michael Kluge

Megware 2011, 75 kW, 50 Tflop/s peak

– 4 racks, 5760 cores, 12 TiB main memory

– 90 nodes with 64 cores each

– Heterogeneous memory configuration, 64-512 GiB per node

Linux Networx 2006, 250 kW, 13 Tflop/s peak

– 26 racks, 2576 cores, 5.5 TiB main memory

HRSK/152

Page 44

Analysis component: network topology

Michael Kluge

[Diagram labels: QDR leaf (5x); QDR spine (2x); 20 nodes (5x); deimos SDR fabric; 724 nodes; Lustre; scratch; 2x login; 2x management; Ethernet; TU Dresden /HOME /BACKUP; TU Dresden campus Ethernet (10 Gbit/s) / DFN uplink]

Page 45

4-socket AMD Opteron 6276 node

64 cores

– 2.3 GHz (Turbo up to 3.3 GHz)

– Dual-core modules

• 16 KiB L1 per core

• 2 MiB L2 per module

• One FPU shared by both cores

– 16 MiB (2x 8) L3 per processor

• 12 MiB usable, since 4 MiB are reserved for HT Assist

2 chips per processor

– Connected via HyperTransport

– 8 NUMA nodes in a 4-socket system

Half-width HT 3.0 links between sockets

– 12.8 GB/s each (6.4 per direction)

128 GiB DDR3-1333, 4 channels per socket

Daniel Molka

Page 46

Analysis component: key figures

Michael Kluge

Per node                         Megware 2011        LNXI 2006
Cores                            64 (4 chips)        2-8 (1-4 chips)
Core type                        Opteron 6276        Opteron 285
Main memory                      64-512 GiB          4-32 GiB
Peak floating-point performance  147 GF/s per chip   10.4 GF/s per chip
Peak memory bandwidth            35 GB/s per chip    6.4 GB/s per chip
Interconnect                     QDR; 4 GB/s; 1 µs   SDR; 1 GB/s; 1.7 µs
Operating system                 SuSE SLES           SuSE SLES

Whole system                     Megware 2011        LNXI 2006
Peak floating-point performance  53 TF/s             13 TF/s
Peak memory bandwidth            12 TB/s             8.2 TB/s
Bisection bandwidth              120 GB/s            60 GB/s
Batch system                     LSF                 LSF
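One way to read these figures is as machine balance, the peak memory bandwidth available per peak floating-point operation. The sketch below uses the per-chip values from the comparison above and the node count of 90 quoted for the Megware system earlier in the deck; the `Chip` class and its method name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    peak_gflops: float   # peak floating-point performance per chip, GF/s
    mem_bw_gbs: float    # peak memory bandwidth per chip, GB/s

    def balance(self) -> float:
        """Bytes of memory bandwidth per floating-point operation at peak."""
        return self.mem_bw_gbs / self.peak_gflops


opteron_6276 = Chip("Opteron 6276", 147.0, 35.0)
opteron_285 = Chip("Opteron 285", 10.4, 6.4)

for chip in (opteron_6276, opteron_285):
    print(f"{chip.name}: {chip.balance():.2f} byte/flop")

# System peak of the Megware cluster: 90 nodes x 4 chips x 147 GF/s
megware_peak_tflops = 90 * 4 * opteron_6276.peak_gflops / 1000
print(f"Megware peak: {megware_peak_tflops:.1f} TF/s")
```

The newer chip offers far more floating-point performance but less bandwidth per operation, so memory-bound codes are likely to see smaller speedups than the peak numbers suggest.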

Page 47

Center for Information Services and High Performance Computing (ZIH)

HRSK-II

Plans for the next HPC infrastructure in Dresden

Page 48

Architecture of future system (HRSK-II)

[Diagram labels: 30 Gbit/s cluster uplink; throughput component; HPC component; system blocks; archive; home; Ethernet; ZIH backbone; HRSK-II router; HTW, Chemnitz, Leipzig, Freiberg; Erlangen, Potsdam; SAN layer; servers (area 1, area 2); HPC server; disk pool (SATA, SAS, FLASH); HRSK-II concept; Flexible Agile Storage System]

Technische Universität Dresden

Page 49

Power Consumption Monitoring

[Diagram labels: throughput component; HPC component; Flexible Agile Storage System; HRSK-II concept; ZIH infrastructure; block 1 to block N; HPC; batch system; login; I/O analyses; energy efficiency analyses; optimization; control; accesses; consumption]

High Precision

High Frequency

From complete system down to single nodes

Page 50

Basics of storage concept: FASS

Combination of technologies to meet a variety of demands

Users are transparently mapped to different storage pools

Storage infrastructure supports continuous monitoring

Analysis of usage statistics and patterns allows optimization of data location and targeted user support

Characterization of different performance demands:

Name          Minimum size  Sustained BW (GiB/s)  Random I/O ops/s  File creates/s
Checkpoint    1-2 PiB       70-100                ca. 1,000         ca. 5,000
Scratch       3-5 PiB       ca. 20                ca. 10,000        ca. 5,000
Transactions  Main memory   70-100                ca. 100,000       ca. 50,000

Page 51

Architecture of storage concept (FASS)

[Diagram labels: HPC component; throughput component; batch system; login; storage; access statistics; Flexible Storage System (FASS); user A to user Z; net; switch 1, switch 2; server 1 to server N; server/file systems; SSD, SAS, SATA; transaction, checkpoint, scratch; analysis; optimization and control; export; ZIH infrastructure]

Page 52

Location

Phase 1 in current machine room

– 3,500,000

– <100 m²

– <300 kW

Phase 2 in new location

– 11,500,000

– <600 m²

– <2.5 MW

Page 53

Operations schedule

[Timeline axis: 2006 to 2019]

Phase     Date
RFP       September 2011
Contract  March 2012
Phase 1   June 2012
Phase 2   October 2013

Page 54

Holger Brunst ([email protected])

Matthias S. Mueller ([email protected])

Research at ZIH

Selected Projects and Activities

Page 55

Research areas at ZIH

Software tools to support programming and optimization

Programming methods and techniques for HPC systems

Grid computing

Mathematical methods, algorithms, and efficient implementations

Architecture and performance analysis of High Performance Computers

Algorithms and methods for modeling biological processes

Page 56

Software tools …

Vampir

– Visualization and analysis of parallel applications

Marmot/MUST

– Detection of incorrect use of the MPI communication library

ParBench

– Analysis of multiprogramming properties

BenchIT

– Execution, archiving, and presentation of benchmarks and their results

Screenshots: Marmot for Windows

Page 57

Vampir: Framework

Page 58

Vampir: Timelines

Page 59

Vampir: Summaries

Page 60

BenchIT

BenchIT measurement core

Command line interface

GUI

Website

Page 61

Cluster Challenge 2008

Challenge:

– 6 students

– 44 hours

– 1 (self-assembled) cluster with max. 3.1 kW power consumption

– 5 scientific applications

Goal:

– Maximum job throughput within the competition time

Participants:

Purdue University with SiCortex, University of Alberta with SGI, TUD/IU with IBM & Myricom, Taiwan with HP, Arizona State with Cray/MS, Colorado with Aspen Systems, MIT with Dell

Page 62

Cluster Challenge 2008

Page 63

Cluster Challenge 2008

Hardware optimizations

– 10G Myrinet interconnect (~120 W for switch + host adapters)

– Optimal DIMM configuration for the applications (16 GB per node)

– Booting from USB sticks and using the local disks only when necessary

– Determining the power consumption profiles of the applications in order to choose the “right” total number of nodes

Software optimizations

– Use of commercial compilers where it makes sense (significant effort)

– Tracing the applications to understand and optimize communication

Throughput optimizations

– Using the power consumption and runtime estimates to utilize the cluster optimally

Result: 1st place
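Choosing the node count under a fixed power limit can be sketched as a simple budget calculation. The 3.1 kW limit and the roughly 120 W for the interconnect come from the Cluster Challenge slides; the per-node power values are purely hypothetical placeholders for measured application profiles, and `max_nodes` is an illustrative helper, not the team's actual method.

```python
import math


def max_nodes(budget_w: float, fixed_w: float, node_w: float) -> int:
    """Largest node count whose total draw stays within the power budget."""
    if node_w <= 0:
        raise ValueError("node_w must be positive")
    return max(0, math.floor((budget_w - fixed_w) / node_w))


BUDGET_W = 3100.0        # competition limit: 3.1 kW
INTERCONNECT_W = 120.0   # switch + host adapters, as quoted on the slide

# Hypothetical per-node power draw (watts) for two application profiles.
for app, node_w in {"compute-bound app": 280.0, "memory-bound app": 240.0}.items():
    print(f"{app}: up to {max_nodes(BUDGET_W, INTERCONNECT_W, node_w)} nodes")
```

Because different applications draw different power, the usable node count can change from job to job, which is why per-application profiles matter for throughput.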

Page 64

Cluster Challenge 2008

Page 65

Holger Brunst ([email protected])

Matthias S. Mueller ([email protected])

ZIH as an Employer

Page 66

Infrastructure

High-performance computers:

Workplaces:

Page 67

International Collaboration

Tracing

ParMA

VI HPS

Open MPI

Page 68

Future Prospects

In the many-core era, parallel computing is becoming ever more important

Contacts with international partners

Industry contacts: IBM, SUN, Cray, SGI, NEC, Intel, AMD, …

Possible stays abroad or industry internships

– Examples of stays abroad

• LLNL, CA, U.S.A.

• BSC, Barcelona, Spain

• Eugene, OR, U.S.A.

• ORNL, U.S.A.

– Examples of internships:

• Cray

• IBM

Page 69

Holger Brunst ([email protected])

Matthias S. Mueller ([email protected])

Thank you!

Hope to see you next time…