TRANSCRIPT
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Performance Analysis of Computer Systems
Introduction
Organization
Lecture: Every Wednesday in INF E001 from 13:00 to 14:30
Labs: Every Thursday in INF E069 from 13:00 to 14:30
First Exercise: October 20th
– You need an account in the PC pools!!
All slides will be in English
Ten minute summary of last lecture at the beginning of each lecture
List of attendees
Slide 2 LARS: Introduction and Motivation
Class Material on the Web
Slides will be put on the web prior to or shortly after each class
The slides from last year are still online.
– http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/lehre/ws1011/lars
Be aware of updates for this term.
– http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/lehre/ws1012/lars
Slide 3 LARS: Introduction and Motivation
Class Outline (tentative)
15 lectures with corresponding exercises
Class structure
– Introduction and motivation
– Performance requirements, metrics, and common evaluation mistakes
– Workload types, selection, and characterization
– Commonly used benchmarks
– Monitoring techniques
– Capacity planning for future systems
– Performance data presentation
– Summarizing measured data
– Regression models
– Experimental design
– Performance simulation and prediction
– Introduction to queuing theory
Slide 4 LARS: Introduction and Motivation
Literature
Raj Jain: The Art of Computer Systems Performance Analysis
John Wiley & Sons, Inc., 1991 (ISBN: 0-471-50336-3)
Rainer Klar, Peter Dauphin, Frank Hartleb, Richard Hofmann, Bernd Mohr, Andreas Quick, Markus Siegle: Messung und Modellierung paralleler und verteilter Rechensysteme, B.G. Teubner Verlag, Stuttgart, 1995 (ISBN: 3-519-02144-7)
Dongarra, Gentzsch, Eds.: Computer Benchmarks, Advances in Parallel Computing 8, North Holland, 1993 (ISBN: 0-444-81518-x)
Slide 5 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Introduction and Motivation
Why is Performance Analysis Important?
Overview
Development of hardware performance
Implications on application performance
Compute power at Technische Universität Dresden
Research at ZIH
Some advertising
Slide 7 LARS: Introduction and Motivation
Moore’s Law: 2X Transistors / “year”
“Cramming More Components onto Integrated Circuits”
Gordon Moore, Electronics, 1965
# of transistors per cost-effective integrated circuit doubles every N months (18 ≤ N ≤ 24)
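As a rough illustration of what this doubling rate implies, here is a minimal sketch; the 4004's 2,300 transistors and the 2008 end point are illustrative choices, not taken from this slide:

```python
# Minimal sketch: extrapolate transistor counts under Moore's law.
# The doubling period N (in months) is a parameter; 18 and 24 bound the range quoted above.

def transistors(start_count, years, doubling_months):
    """Transistor count after `years` years, doubling every `doubling_months` months."""
    return start_count * 2 ** (years * 12 / doubling_months)

if __name__ == "__main__":
    # Illustrative starting point: the Intel 4004 (1971) with about 2,300 transistors.
    for n in (18, 24):
        estimate = transistors(2_300, 2008 - 1971, n)
        print(f"N = {n} months -> ~{estimate:.2e} transistors in 2008")
```

With N = 24 months the estimate lands in the same order of magnitude as the Core i7's 781 million transistors listed later in this lecture.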
Slide 8 LARS: Introduction and Motivation
Performance Development in TOP500
Slide 9 LARS: Introduction and Motivation
Extrapolation to Exascale
Matthias S. Müller
[Chart: TOP500 performance development extrapolated to exascale; axis from 100 Mflop/s to 1 Eflop/s.]
Sources: Erich Strohmaier: Highlights of the 37th TOP500 List, ISC'11; John Shalf (NERSC, LBNL)
Slide 11 LARS: Introduction and Motivation
Number of Cores per System is Increasing Rapidly
Total # of Cores in Top15
[Chart: total number of processor cores in the Top15 systems, June 1993 to December 2008; y-axis from 0 to 1,200,000 processors.]
Slide 12 LARS: Introduction and Motivation
Number of Cores per System is Increasing Rapidly
Slide 13 LARS: Introduction and Motivation
Cray XT5 (Jaguar) at Oak Ridge National Laboratory
Slide 14 LARS: Introduction and Motivation
IBM Roadrunner at Los Alamos National Laboratory
First computer to surpass the 1 Petaflop/s (10^15 FLOPS) barrier
Installed at Los Alamos National Laboratories
Hybrid Architecture
13,824 AMD Opteron cores
116,640 IBM PowerXCell 8i cores
Cost: $120 million
Slide 15 LARS: Introduction and Motivation
IBM BlueGene/P (JUGENE) at Research Centre Jülich
Number five in TOP 500
Installed at Forschungszentrum Jülich
72 Racks with 32 node cards x 32 compute cards (total 73728)
294,912 PowerPC 450, 850 MHz
144 TB main memory
Slide 16 LARS: Introduction and Motivation
K Computer System
Nr. 1 System in TOP500 (June 2011)
“K” means 10^16
>80,000 Processors
>640,000 Cores
10 MW power consumption
SPARC64 VIIIfx CPU
16 GB/node, 2 GB/core
Direct water cooling
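A quick plausibility check of these figures, assuming one SPARC64 VIIIfx processor per node and 8 cores per processor (both hold for the K computer, but they are stated here as assumptions rather than taken from the slide):

```python
# Plausibility check for the K computer figures above.
processors = 80_000           # ">80,000 Processors" (lower bound from the slide)
cores_per_processor = 8       # SPARC64 VIIIfx core count (assumption)
mem_per_node_gb = 16

print(processors * cores_per_processor)        # 640000 -> matches ">640,000 Cores"
print(mem_per_node_gb / cores_per_processor)   # 2.0 -> the stated 2 GB/core
```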
Slide 17 LARS: Introduction and Motivation
What Kind of Know-How is Required for HPC?
Algorithms and methods
Performance Analysis
Programming (Paradigms and details of implementations)
Operation of supercomputers (network, infrastructure, service, support)
Slide 18 LARS: Introduction and Motivation
Challenges
Languages
– Fortran 95, C/C++, Java, …
– Also scripting languages!
Parallelization:
– MPI, OpenMP, CUDA/SILC, Threading
Network
– Ethernet, Infiniband, Myrinet, …
Scheduling
– Distributed components, job scheduling, process scheduling
System architecture
– Processors, memory hierarchy
Slide 19 LARS: Introduction and Motivation
From Modeling to Execution
Slide 21 LARS: Introduction and Motivation
Why is performance hard and energy efficiency harder?
– Both are vertical/cross-cutting problems:
[Diagram: Performance and Energy Efficiency cut vertically through every level, from Modeling, Mathematical Representation, Algorithm, and Design and Implementation at the top, through the Problem-oriented Language Level, Assembly Language Level, Operating System Machine Level, Instruction Set Architecture Level, Microarchitecture Level, and Digital Logic Level of the hardware, down to the Environment / Machine Room.]
Short History of X86 CPUs

CPU         | Year | Bit Width | #Transistors | Clock       | Structure   | L1 / L2 / L3
4004        | 1971 | 4         | 2,300        | 740 kHz     | 10 micron   |
8008        | 1972 | 8         | 3,500        | 500 kHz     | 10 micron   |
8086        | 1978 | 16        | 29,000       | 10 MHz      | 3 micron    |
80286       | 1982 | 16        | 134,000      | 25 MHz      | 1.5 micron  |
80386       | 1985 | 32        | 275,000      | 33 MHz      | 1 micron    |
80486       | 1989 | 32        | 1,200,000    | 50 MHz      | 0.8 micron  | 8K
Pentium I   | 1994 | 32        | 3,100,000    | 66 MHz      | 0.8 micron  | 8K
Pentium II  | 1997 | 32        | 7,500,000    | 300 MHz     | 0.35 micron | 16K/512K*
Pentium III | 1999 | 32        | 9,500,000    | 600 MHz     | 0.25 micron | 16K/512K*
Pentium IV  | 2000 | 32        | 42,000,000   | 1.5 GHz     | 0.18 micron | 8K/256K
P IV F      | 2005 | 64        |              | 2.8-3.8 GHz | 90 nm       | 16K/2MB
Core i7     | 2008 | 64        | 781,000,000  | 3.2 GHz     | 45 nm       | 32K/256K/8MB
Slide 23 LARS: Introduction and Motivation
Intel Nehalem
Released 2008
4 cores
781,000,000 transistors
45 nm technology
32 K L1 data cache, 32 K L1 instruction cache
256 K L2 cache
8 MB shared L3 cache
Hyperthreading
3.2 GHz × 4 cores × 4 FLOPS/cycle = 51.2 Gflop/s peak (see the sketch after this list)
Integrated memory controller
QPI between processors
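A minimal sketch of the peak-performance arithmetic above, written so that other processors can be plugged in; reading the 4 FLOPS/cycle as one 2-wide SSE add plus one 2-wide SSE multiply per cycle is my interpretation, not stated on the slide:

```python
# Theoretical peak = clock rate x number of cores x FP operations per core and cycle.
def peak_gflops(clock_ghz, cores, flops_per_cycle):
    return clock_ghz * cores * flops_per_cycle

# Nehalem figures from the slide: 4 cores at 3.2 GHz, 4 FLOPS per cycle and core.
print(peak_gflops(3.2, 4, 4))   # 51.2 Gflop/s, as stated on the slide
```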
Slide 24 LARS: Introduction and Motivation
Nehalem Core
[Block diagram: Instruction Fetch & L1 Cache; Branch Prediction; Instruction Decode & Microcode; Out-of-Order Scheduling & Retirement; Execution Units; Memory Ordering & Execution; L1 Data Cache; Paging; L2 Cache & Interrupt Servicing.]
Slide 25 LARS: Introduction and Motivation
Potential factors limiting performance
“Peak performance”
Floating point units
Integer units
… any other feature of the microarchitecture
Bandwidth (L1,L2,L3, main memory, other cores, other nodes)
Latency (L1,L2,L3, main memory, other cores, other nodes)
Power consumption
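To see how the first two limits interact, here is a hedged, roofline-style sketch; the memory bandwidth and the triad-like kernel are illustrative assumptions, not measurements from the lecture:

```python
# Roofline-style estimate: attainable performance is bounded by
#   min(peak_flops, arithmetic_intensity * memory_bandwidth).
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

peak = 51.2        # Gflop/s, the Nehalem peak from the previous slide
bandwidth = 25.0   # GB/s main-memory bandwidth -- an assumed, illustrative value

# A triad-like kernel a[i] = b[i] + s*c[i] performs 2 flops per 24 bytes moved (double precision).
intensity = 2 / 24
print(attainable_gflops(peak, bandwidth, intensity))   # ~2.1 Gflop/s: memory-bandwidth bound
```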
Slide 26 LARS: Introduction and Motivation
Performance development in TOP500
Slide 27 LARS: Introduction and Motivation
Does the rest of the system develop at CPU speed?
[Chart: processor vs. DRAM performance, 1980-2000, log scale from 1 to 1000. µProc improves ~60%/yr (2X/1.5 yr, "Moore's Law"); DRAM improves ~9%/yr (2X/10 yrs). The processor-DRAM memory gap (latency) grows by about 50% per year.]
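A quick check of where the roughly 50% per year figure comes from, under the simple assumption that the gap is the ratio of the two growth curves:

```python
# If CPU performance grows 60%/yr and DRAM 9%/yr, the ratio between them
# (one way to read the "processor-memory gap") grows each year by:
cpu_growth, dram_growth = 1.60, 1.09
gap_growth = cpu_growth / dram_growth

print(f"{gap_growth - 1:.0%} per year")           # ~47%, i.e. roughly the quoted 50%/year
print(f"{gap_growth ** 20:.0f}x after 20 years")  # ~2160x under this simple model
```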
Slide 28 LARS: Introduction and Motivation
Performance Trends measured by SPECint
Source: Hennessy, Patterson: "Computer Architecture: A Quantitative Approach".
Slide 29 LARS: Introduction and Motivation
CPUint2006 development 2005 - 2009
Slide 30 LARS: Introduction and Motivation
Performance Trends measured by SPECint
[Chart annotation: growth has slowed to about 23% per year as of 2009.]
Slide 31 LARS: Introduction and Motivation
CPUfp2006 development 1991 - 2009
CPU95: released 1995; 602 results between 3/1991 and 1/2001
CPUfp2000: released 2000; 1385 results between 10/1996 and 2/2007
CPUfp2006: released 2006; 1217 results between 4/1997 and 4/2009
[Chart annotations: 42%, 33%, 30% (annual improvement rates).]
Slide 32 LARS: Introduction and Motivation
Performance Trends over a 20-year life cycle
Slide 33 LARS: Introduction and Motivation
Performance Trends over a 20-year life cycle
Where is your application?
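A hedged illustration of why the annual improvement rate matters so much over a 20-year life cycle; the rates are the ballpark figures quoted in the preceding charts, used here purely as assumptions:

```python
# Compound improvement over a 20-year life cycle at different annual rates.
for annual_rate in (0.42, 0.30, 0.23):   # ballpark per-year improvements from the SPEC charts
    factor = (1 + annual_rate) ** 20
    print(f"{annual_rate:.0%}/yr -> {factor:,.0f}x over 20 years")
```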
Slide 34 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Center for Information Services and HPC
A short introduction
HPC in Germany: Gauß-Allianz
Matthias S. Müller
Members:
– GCS, HLRN (RRZN, ZIB), RWTH, TU Dresden, RZG, TU Darmstadt, DWD, DKRZ, SCC
– G-CSC, PC^2, RRZE, DFN, DESY, RRZK (Köln)
Responsibilities of ZIH
Providing infrastructure and qualified service for TU Dresden and Saxony
Research topics
– Architecture and performance analysis of High Performance Computers
– Programming methods and techniques for HPC systems
– Software tools to support programming and optimization
– Modeling algorithms of biological processes
– Mathematical models, algorithms, and efficient implementations
Role of mediator between vendors, developers, and users
Picking up and preparing new concepts, methods, and techniques
Teaching and Education
Slide 37 LARS: Introduction and Motivation
Compute Server Infrastructure
[Diagram: HPC component (6.5 TB main memory) and PC farm, each connected to its own SAN with 68 TB disk capacity, plus a petabyte tape archive (1 PB capacity); interconnect bandwidths of 8 GB/s, 4 GB/s, 4 GB/s, and 1.8 GB/s.]
HPC component
– SGI® Altix® 4700
– 2048 Montecito cores
– 6.5 TByte main memory
PC farm
– System from Linux Networx
– AMD Opteron CPUs (dual core, 2.6 GHz)
– 728 boards with 2592 cores
– InfiniBand network between the nodes
Slide 38 LARS: Introduction and Motivation
HPC-System: SGI Altix 4700 (Mars)
32 x 42U Racks
1024 sockets with Itanium2 Montecito dual-core CPUs (1.6 GHz / 9 MB L3 cache)
13 TFlop/s peak performance
11.9 TFlop/s Linpack
6.5 TB shared memory
Slide 39 LARS: Introduction and Motivation
Linux Networx PC-Farm (Deimos)
– 26 water-cooled racks (Knürr)
– 1296 AMD Opteron x85 dual-core CPUs (2.6 GHz)
– 728 compute nodes with 2 (384), 4 (232), or 8 (112) cores
– 2 master and 11 Lustre servers
– 2 GB memory per core
– 68 TB SAN disk (RAID 6)
– Local scratch disks (70, 150, 290 GB)
– 2 4x InfiniBand fabrics (MPI + I/O)
– OS: SuSE SLES 10
– Batch system: LSF
– Compilers: Pathscale, PGI, Intel, GNU
– ISV codes: Ansys100, CFX, Fluent, Gaussian, LS-DYNA, Matlab, MSC
Slide 40 LARS: Introduction and Motivation
Computer Rooms – Extension to the Building
Slide 41 LARS: Introduction and Motivation
Performance of Supercomputers at ZIH
[Chart: performance of the supercomputers at ZIH over the years, 0.0001 to 10,000 TFLOPS on a log scale, against the TOP500 rank-1, rank-10, and rank-500 trend lines: VP200-EX, 472 MFlops (rank 500); Cray T3E, 28 GFlops (rank 237); SGI Origin 2000, 16.5 GFlops (rank 236); SGI Origin 3800, 85.4 GFlops (rank 351); PC farm, 10.88 TFlops (rank 79); SGI Altix, 11.9 TFlops (rank 49); HRSK-II stage 1; HRSK-II stage 2; RWTH, 300 TFlops.]
Matthias S. Müller
Slide 43
Procurement of the analysis component of the data center
Michael Kluge
Megware 2011: 75 kW, 50 Tflop/s peak, 4 racks, 5760 cores, 12 TiB main memory; 90 nodes with 64 cores per node; heterogeneous memory configuration of 64-512 GiB per node
Linux Networx 2006: 250 kW, 13 Tflop/s peak, 26 racks, 2576 cores, 5.5 TiB main memory
Slide 44
Analysis component - network topology
Michael Kluge
[Diagram: InfiniBand network topology with 2 QDR spine switches and 5 QDR leaf switches serving groups of 20 nodes each, plus 2 login and 2 management nodes; the existing Deimos SDR fabric (724 nodes) and Lustre storage attach to the same network; scratch storage; Ethernet links to the TU Dresden /HOME and /BACKUP file systems and to the TU Dresden campus Ethernet (10 Gbit/s) / DFN uplink.]
Slide 45
4-socket AMD Opteron 6276 node
64 cores
– 2.3 GHz (turbo up to 3.3 GHz)
– Dual-core modules: 16 KiB L1 per core, 2 MiB L2 per module, one FPU shared by both cores
– 16 MiB (2x 8 MiB) L3 per processor; 12 MiB usable, since 4 MiB are reserved for HT Assist
2 chips per processor
– Connected via HyperTransport
– 8 NUMA nodes in a 4-socket system
Half-width HT 3.0 links between sockets
– 12.8 GB/s each (6.4 GB/s per direction)
128 GiB DDR3-1333, 4 channels per socket
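A hedged cross-check against the comparison table on the next slide, assuming that "chip" there means a full Opteron 6276 package (2 dies, 8 Bulldozer modules, 16 cores) and that each shared FPU can retire 8 double-precision FLOPS per cycle; these assumptions are mine, not stated on the slide:

```python
# Peak floating-point performance of one Opteron 6276 package and the full system.
clock_ghz = 2.3
modules_per_package = 8        # 2 dies x 4 dual-core modules per package (assumption)
flops_per_module_cycle = 8     # two 128-bit FMA pipes per shared FPU (assumption)

gflops_per_package = clock_ghz * modules_per_package * flops_per_module_cycle
print(gflops_per_package)                    # 147.2 -> matches "147 GF/s per chip" in the table
print(gflops_per_package * 4 * 90 / 1000)    # ~53 TF/s for 90 four-socket nodes, as in the table
```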
Daniel Molka
Slide 46
Analysis component - key figures
Michael Kluge
Per node              | Megware 2011      | LNXI 2006
Cores                 | 64 (4 chips)      | 2-8 (1-4 chips)
Core type             | Opteron 6276      | Opteron 285
Main memory           | 64-512 GiB        | 4-32 GiB
Peak FP performance   | 147 GF/s per chip | 10.4 GF/s per chip
Peak memory bandwidth | 35 GB/s per chip  | 6.4 GB/s per chip
Interconnect          | QDR; 4 GB/s; 1 µs | SDR; 1 GB/s; 1.7 µs
Operating system      | SuSE SLES         | SuSE SLES

Total system          | Megware 2011 | LNXI 2006
Peak FP performance   | 53 TF/s      | 13 TF/s
Peak memory bandwidth | 12 TB/s      | 8.2 TB/s
Bisection bandwidth   | 120 GB/s     | 60 GB/s
Batch system          | LSF          | LSF
Center for Information Services and High Performance Computing (ZIH)
HRSK-II
Plans for the next HPC infrastructure in Dresden
Architecture of future system (HRSK-II)
[Diagram: HRSK-II concept with a throughput component and an HPC component built from system blocks, a flexible agile storage system (SAN layer with SATA, SAS, and flash disk pools; server areas 1 and 2; HPC servers), archive and home storage, Ethernet connections to the ZIH backbone, a 30 Gbit/s cluster uplink, and the HRSK-II router with links to HTW, Chemnitz, Leipzig, Freiberg, Erlangen, and Potsdam.]
Power Consumption Monitoring
[Diagram: HRSK-II concept with the throughput component, the HPC component, and the flexible agile storage system; blocks 1 to N of the ZIH infrastructure with HPC, batch system, and login nodes; I/O analyses and energy-efficiency analyses feed optimization and control loops acting on accesses and power consumption.]
High Precision
High Frequency
From complete system down to single nodes
Combination of technologies to meet varieties of demands
Users are transparently mapped to different storage pools
Storage infrastructure supports continuous monitoring
Analysis of usage statistics and patterns allows optimization of data location and targeted user support
Characterization of different performance demands:
Basics of storage concept: FASS
Name         | Minimum Size | BW (GiB/s, sustained) | Random I/O ops/s | File creates/s
Checkpoint   | 1-2 PiB      | 70-100                | ca. 1,000        | ca. 5,000
Scratch      | 3-5 PiB      | ca. 20                | ca. 10,000       | ca. 5,000
Transactions | Main memory  | 70-100                | ca. 100,000      | ca. 50,000
Architecture of the storage concept (FASS)
[Diagram: the flexible storage system (FASS) sits between the HPC and throughput components (batch system, login nodes) and the ZIH infrastructure; users A through Z are mapped via switches 1 and 2 onto servers 1 to N and file systems backed by SSD, SAS, and SATA storage; transaction, checkpoint, scratch, and export pools; access statistics feed an analysis, optimization, and control loop.]
Location
1st phase in the current machine room
– 3,500,000
– <100 m²
– <300 kW
2nd phase in a new location
– 11,500,000
– <600 m²
– <2.5 MW
Operation timeline
[Gantt chart spanning the years 2006 through 2019.]
Phase     | Date
RFP       | September 2011
Contract  | March 2012
1st phase | June 2012
2nd phase | October 2013
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Research at ZIH
Selected Projects and Activities
Research areas at ZIH
Software tools to support programming and optimization
Programming methods and techniques for high performance computers
Grid computing
Mathematical methods, algorithms, and efficient implementations
Architecture and performance analysis of high performance computers
Algorithms and methods for modeling biological processes
Slide 55 LARS: Introduction and Motivation
Software tools …
Vampir
– Visualization and analysis of parallel applications
Marmot/MUST
– Detection of incorrect usage of the MPI communication library
ParBench
– Analysis of multiprogramming behavior
BenchIT
– Execution/archiving/presentation of benchmarks and their results
Screenshots: Marmot for Windows
Slide 56 LARS: Introduction and Motivation
Vampir: Framework
Slide 57 LARS: Introduction and Motivation
Vampir: Timelines
Slide 58 LARS: Introduction and Motivation
Vampir: Summaries
Slide 59 LARS: Introduction and Motivation
BenchIT
BenchIT measurement core
Command line interface
GUI
Website
Slide 60 LARS: Introduction and Motivation
Cluster Challenge 2008
Challenge:
– 6 students
– 44 hours
– 1 (self-assembled) cluster with a maximum power draw of 3.1 kW
– 5 scientific applications
Goal:
– Maximum job throughput within the competition time
Field of competitors:
Purdue University with SiCortex, University of Alberta with SGI, TUD/IU with IBM & Myricom, Taiwan with HP, Arizona State with Cray/MS, Colorado with Aspen Systems, MIT with Dell
Slide 61 LARS: Introduction and Motivation
Cluster Challenge 2008
Slide 62 LARS: Introduction and Motivation
Cluster Challenge 2008
Hardware optimizations
– 10G Myrinet interconnect (~120 W for switch + host adapters)
– Optimal DIMM configuration for the applications (16 GB per node)
– Booting from USB sticks and using the local disks only when necessary
– Measuring the power consumption profiles of the applications in order to choose the "right" total number of nodes
Software optimizations
– Use of commercial compilers where sensible (significant effort)
– Tracing the applications in order to understand and optimize communication
Throughput optimizations
– Using the power consumption and runtime estimates for optimal utilization of the cluster
Result: 1st place
Slide 63 LARS: Introduction and Motivation
Cluster Challenge 2008
Slide 64 LARS: Introduction and Motivation
Infrastructure
High performance computers:
Workstations:
Slide 66 LARS: Introduction and Motivation
International collaboration
Tracing
ParMA
VI HPS
Open MPI
Slide 67 LARS: Introduction and Motivation
Future prospects
In the many-core era, parallel computing becomes ever more important
Contacts with international partners
Industry contacts: IBM, SUN, Cray, SGI, NEC, Intel, AMD, …
Possible stays abroad or industry internships
– Examples of stays abroad:
• LLNL, CA, U.S.A.
• BSC, Barcelona, Spain
• Eugene, OR, U.S.A.
• ORNL, U.S.A.
– Examples of internships:
• Cray
• IBM
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Thank you!
Hope to see you next time…