TRANSCRIPT
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Performance Analysis of Computer Systems
Introduction
Organization
Lecture: Every Wednesday in INF E001 from 13:00 to 14:30
Labs: Every Thursday in INF E069 from 13:00 to 14:30
First Exercise: October 20th
– You need an account in the PC pools!!
All slides will be in English
Ten minute summary of last lecture at the beginning of each lecture
List of attendees
Slide 2 LARS: Introduction and Motivation
Class Material on the Web
Slides will be put on the web prior to or shortly after each class
The slides from last year are still online.
– http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/lehre/ws1011/lars
Be aware of updates for this term.
– http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/lehre/ws1012/lars
Slide 3 LARS: Introduction and Motivation
Class Outline (tentative)
15 lectures with corresponding exercises
Class structure
– Introduction and motivation
– Performance requirements, metrics, and common evaluation mistakes
– Workload types, selection, and characterization
– Commonly used benchmarks
– Monitoring techniques
– Capacity planning for future systems
– Performance data presentation
– Summarizing measured data
– Regression models
– Experimental design
– Performance simulation and prediction
– Introduction to queuing theory
Slide 4 LARS: Introduction and Motivation
Literature
Raj Jain: The Art of Computer Systems Performance Analysis
John Wiley & Sons, Inc., 1991 (ISBN: 0-471-50336-3)
Rainer Klar, Peter Dauphin, Frank Hartleb, Richard Hofmann, Bernd Mohr, Andreas Quick, Markus Siegle: Messung und Modellierung paralleler und verteilter Rechensysteme, B.G. Teubner Verlag, Stuttgart, 1995 (ISBN: 3-519-02144-7)
Dongarra, Gentzsch, Eds.: Computer Benchmarks, Advances in Parallel Computing 8, North Holland, 1993 (ISBN: 0-444-81518-x)
Slide 5 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Introduction and Motivation
Why is Performance Analysis Important?
Overview
Development of hardware performance
Implications on application performance
Compute power at Technische Universität Dresden
Research at ZIH
Some advertising
Slide 7 LARS: Introduction and Motivation
Moore’s Law: 2X Transistors / “year”
“Cramming More Components onto Integrated Circuits”
Gordon Moore, Electronics, 1965
# of transistors per cost-effective integrated circuit doubles every N months (18 ≤ N ≤ 24)
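As a rough illustration of what this doubling rate implies, here is a minimal sketch; the 4004's 2,300 transistors and the 2008 end point are illustrative choices, not taken from this slide:

```python
# Minimal sketch: extrapolate transistor counts under Moore's law.
# The doubling period N (in months) is a parameter; 18 and 24 bound the range quoted above.

def transistors(start_count, years, doubling_months):
    """Transistor count after `years` years, doubling every `doubling_months` months."""
    return start_count * 2 ** (years * 12 / doubling_months)

if __name__ == "__main__":
    # Illustrative starting point: the Intel 4004 (1971) with about 2,300 transistors.
    for n in (18, 24):
        estimate = transistors(2_300, 2008 - 1971, n)
        print(f"N = {n} months -> ~{estimate:.2e} transistors in 2008")
```

With N = 24 months the estimate lands in the same order of magnitude as the Core i7's 781 million transistors listed later in this lecture.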
Slide 8 LARS: Introduction and Motivation
Performance Development in TOP500
Slide 9 LARS: Introduction and Motivation
Extrapolation to Exascale
Matthias S. Müller
[Chart: TOP500 performance development extrapolated to exascale; axis from 100 Mflop/s to 1 Eflop/s.]
Sources: Erich Strohmaier: Highlights of the 37th TOP500 List, ISC'11; John Shalf (NERSC, LBNL)
Slide 11 LARS: Introduction and Motivation
Number of Cores per System is Increasing Rapidly
Total # of Cores in Top15
[Chart: total number of processor cores in the Top15 systems, June 1993 to December 2008; y-axis from 0 to 1,200,000 processors.]
Slide 12 LARS: Introduction and Motivation
Number of Cores per System is Increasing Rapidly
Slide 13 LARS: Introduction and Motivation
Cray XT5 (Jaguar) at Oak Ridge National Laboratory
Slide 14 LARS: Introduction and Motivation
IBM Roadrunner at Los Alamos National Laboratory
First computer to surpass the 1 Petaflop/s (10^15 FLOPS) barrier
Installed at Los Alamos National Laboratories
Hybrid Architecture
13,824 AMD Opteron cores
116,640 IBM PowerXCell 8i cores
Cost: $120 million
Slide 15 LARS: Introduction and Motivation
IBM BlueGene/P (JUGENE) at Research Centre Jülich
Number five in TOP 500
Installed at Forschungszentrum Jülich
72 Racks with 32 node cards x 32 compute cards (total 73728)
294,912 PowerPC 450, 850 MHz
144 TB main memory
Slide 16 LARS: Introduction and Motivation
K Computer System
Nr. 1 System in TOP500 (June 2011)
“K” means 10^16
>80,000 Processors
>640,000 Cores
10 MW power consumption
SPARC64 VIIIfx CPU
16 GB/node, 2 GB/core
Direct water cooling
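A quick plausibility check of these figures, assuming one SPARC64 VIIIfx processor per node and 8 cores per processor (both hold for the K computer, but they are stated here as assumptions rather than taken from the slide):

```python
# Plausibility check for the K computer figures above.
processors = 80_000           # ">80,000 Processors" (lower bound from the slide)
cores_per_processor = 8       # SPARC64 VIIIfx core count (assumption)
mem_per_node_gb = 16

print(processors * cores_per_processor)        # 640000 -> matches ">640,000 Cores"
print(mem_per_node_gb / cores_per_processor)   # 2.0 -> the stated 2 GB/core
```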
Slide 17 LARS: Introduction and Motivation
What Kind of Know-How is Required for HPC?
Algorithms and methods
Performance Analysis
Programming (Paradigms and details of implementations)
Operation of supercomputers (network, infrastructure, service, support)
Slide 18 LARS: Introduction and Motivation
Challenges
Languages
– Fortran 95, C/C++, Java, …
– Also scripting languages!
Parallelization:
– MPI, OpenMP, CUDA/SILC, Threading
Network
– Ethernet, Infiniband, Myrinet, …
Scheduling
– Distributed components, job scheduling, process scheduling
System architecture
– Processors, memory hierarchy
Slide 19 LARS: Introduction and Motivation
From Modeling to Execution
Slide 21 LARS: Introduction and Motivation
Why is performance hard and energy efficiency harder?
– Both are vertical/cross-cutting problems:
[Diagram: Performance and Energy Efficiency cut vertically through every level, from Modeling, Mathematical Representation, Algorithm, and Design and Implementation at the top, through the Problem-oriented Language Level, Assembly Language Level, Operating System Machine Level, Instruction Set Architecture Level, Microarchitecture Level, and Digital Logic Level of the hardware, down to the Environment / Machine Room.]
Short History of X86 CPUs

CPU         | Year | Bit Width | #Transistors | Clock       | Structure   | L1 / L2 / L3
4004        | 1971 | 4         | 2,300        | 740 kHz     | 10 micron   |
8008        | 1972 | 8         | 3,500        | 500 kHz     | 10 micron   |
8086        | 1978 | 16        | 29,000       | 10 MHz      | 3 micron    |
80286       | 1982 | 16        | 134,000      | 25 MHz      | 1.5 micron  |
80386       | 1985 | 32        | 275,000      | 33 MHz      | 1 micron    |
80486       | 1989 | 32        | 1,200,000    | 50 MHz      | 0.8 micron  | 8K
Pentium I   | 1994 | 32        | 3,100,000    | 66 MHz      | 0.8 micron  | 8K
Pentium II  | 1997 | 32        | 7,500,000    | 300 MHz     | 0.35 micron | 16K/512K*
Pentium III | 1999 | 32        | 9,500,000    | 600 MHz     | 0.25 micron | 16K/512K*
Pentium IV  | 2000 | 32        | 42,000,000   | 1.5 GHz     | 0.18 micron | 8K/256K
P IV F      | 2005 | 64        |              | 2.8-3.8 GHz | 90 nm       | 16K/2MB
Core i7     | 2008 | 64        | 781,000,000  | 3.2 GHz     | 45 nm       | 32K/256K/8MB
Slide 23 LARS: Introduction and Motivation
Intel Nehalem
Released 2008
4 cores
781,000,000 transistors
45 nm technology
32 K L1 data cache, 32 K L1 instruction cache
256 K L2 cache
8 MB shared L3 cache
Hyperthreading
3.2 GHz × 4 cores × 4 FLOPS/cycle = 51.2 Gflop/s peak (see the sketch after this list)
Integrated memory controller
QPI between processors
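A minimal sketch of the peak-performance arithmetic above, written so that other processors can be plugged in; reading the 4 FLOPS/cycle as one 2-wide SSE add plus one 2-wide SSE multiply per cycle is my interpretation, not stated on the slide:

```python
# Theoretical peak = clock rate x number of cores x FP operations per core and cycle.
def peak_gflops(clock_ghz, cores, flops_per_cycle):
    return clock_ghz * cores * flops_per_cycle

# Nehalem figures from the slide: 4 cores at 3.2 GHz, 4 FLOPS per cycle and core.
print(peak_gflops(3.2, 4, 4))   # 51.2 Gflop/s, as stated on the slide
```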
Slide 24 LARS: Introduction and Motivation
Nehalem Core
[Block diagram: Instruction Fetch & L1 Cache; Branch Prediction; Instruction Decode & Microcode; Out-of-Order Scheduling & Retirement; Execution Units; Memory Ordering & Execution; L1 Data Cache; Paging; L2 Cache & Interrupt Servicing.]
Slide 25 LARS: Introduction and Motivation
Potential factors limiting performance
“Peak performance”
Floating point units
Integer units
… any other feature of the microarchitecture
Bandwidth (L1,L2,L3, main memory, other cores, other nodes)
Latency (L1,L2,L3, main memory, other cores, other nodes)
Power consumption
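To see how the first two limits interact, here is a hedged, roofline-style sketch; the memory bandwidth and the triad-like kernel are illustrative assumptions, not measurements from the lecture:

```python
# Roofline-style estimate: attainable performance is bounded by
#   min(peak_flops, arithmetic_intensity * memory_bandwidth).
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

peak = 51.2        # Gflop/s, the Nehalem peak from the previous slide
bandwidth = 25.0   # GB/s main-memory bandwidth -- an assumed, illustrative value

# A triad-like kernel a[i] = b[i] + s*c[i] performs 2 flops per 24 bytes moved (double precision).
intensity = 2 / 24
print(attainable_gflops(peak, bandwidth, intensity))   # ~2.1 Gflop/s: memory-bandwidth bound
```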
Slide 26 LARS: Introduction and Motivation
Performance development in TOP500
Slide 27 LARS: Introduction and Motivation
Does the rest of the system develop at CPU speed?
[Chart: processor vs. DRAM performance, 1980-2000, log scale from 1 to 1000. µProc improves ~60%/yr (2X/1.5 yr, "Moore's Law"); DRAM improves ~9%/yr (2X/10 yrs). The processor-DRAM memory gap (latency) grows by about 50% per year.]
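A quick check of where the roughly 50% per year figure comes from, under the simple assumption that the gap is the ratio of the two growth curves:

```python
# If CPU performance grows 60%/yr and DRAM 9%/yr, the ratio between them
# (one way to read the "processor-memory gap") grows each year by:
cpu_growth, dram_growth = 1.60, 1.09
gap_growth = cpu_growth / dram_growth

print(f"{gap_growth - 1:.0%} per year")           # ~47%, i.e. roughly the quoted 50%/year
print(f"{gap_growth ** 20:.0f}x after 20 years")  # ~2160x under this simple model
```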
Slide 28 LARS: Introduction and Motivation
Performance Trends measured by SPECint
Source: Hennessy, Patterson: "Computer Architecture: A Quantitative Approach".
Slide 29 LARS: Introduction and Motivation
CPUint2006 development 2005 - 2009
Slide 30 LARS: Introduction and Motivation
Performance Trends measured by SPECint
[Chart annotation: growth has slowed to about 23% per year as of 2009.]
Slide 31 LARS: Introduction and Motivation
CPUfp2006 development 1991 - 2009
CPU95: released 1995; 602 results between 3/1991 and 1/2001
CPUfp2000: released 2000; 1385 results between 10/1996 and 2/2007
CPUfp2006: released 2006; 1217 results between 4/1997 and 4/2009
[Chart annotations: 42%, 33%, 30% (annual improvement rates).]
Slide 32 LARS: Introduction and Motivation
Performance Trends over a 20-year life cycle
Slide 33 LARS: Introduction and Motivation
Performance Trends over a 20-year life cycle
Where is your application?
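A hedged illustration of why the annual improvement rate matters so much over a 20-year life cycle; the rates are the ballpark figures quoted in the preceding charts, used here purely as assumptions:

```python
# Compound improvement over a 20-year life cycle at different annual rates.
for annual_rate in (0.42, 0.30, 0.23):   # ballpark per-year improvements from the SPEC charts
    factor = (1 + annual_rate) ** 20
    print(f"{annual_rate:.0%}/yr -> {factor:,.0f}x over 20 years")
```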
Slide 34 LARS: Introduction and Motivation
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Center for Information Services and HPC
A short introduction
HPC in Germany: Gauß-Allianz
Matthias S. Müller
Members:
– GCS, HLRN (RRZN, ZIB), RWTH, TU Dresden, RZG, TU Darmstadt, DWD, DKRZ, SCC
– G-CSC, PC^2, RRZE, DFN, DESY, RRZK (Köln)
Responsibilities of ZIH
Providing infrastructure and qualified service for TU Dresden and Saxony
Research topics
– Architecture and performance analysis of High Performance Computers
– Programming methods and techniques for HPC systems
– Software tools to support programming and optimization
– Modeling algorithms of biological processes
– Mathematical models, algorithms, and efficient implementations
Role of mediator between vendors, developers, and users
Picking up and preparing new concepts, methods, and techniques
Teaching and Education
Slide 37 LARS: Introduction and Motivation
Compute Server Infrastructure
[Diagram: HPC component (6.5 TB main memory) and PC farm, each connected to its own SAN with 68 TB disk capacity, plus a petabyte tape archive (1 PB capacity); interconnect bandwidths of 8 GB/s, 4 GB/s, 4 GB/s, and 1.8 GB/s.]
HPC component
– SGI® Altix® 4700
– 2048 Montecito cores
– 6.5 TByte main memory
PC farm
– System from Linux Networx
– AMD Opteron CPUs (dual core, 2.6 GHz)
– 728 boards with 2592 cores
– InfiniBand network between the nodes
Slide 38 LARS: Introduction and Motivation
HPC-System: SGI Altix 4700 (Mars)
32 x 42U Racks
1024 sockets with Itanium2 Montecito dual-core CPUs (1.6 GHz / 9 MB L3 cache)
13 TFlop/s peak performance
11.9 TFlop/s Linpack
6.5 TB shared memory
Slide 39 LARS: Introduction and Motivation
Linux Networx PC-Farm (Deimos)
– 26 water-cooled racks (Knürr)
– 1296 AMD Opteron x85 dual-core CPUs (2.6 GHz)
– 728 compute nodes with 2 (384), 4 (232), or 8 (112) cores
– 2 master and 11 Lustre servers
– 2 GB memory per core
– 68 TB SAN disk (RAID 6)
– Local scratch disks (70, 150, 290 GB)
– 2 4x InfiniBand fabrics (MPI + I/O)
– OS: SuSE SLES 10
– Batch system: LSF
– Compilers: Pathscale, PGI, Intel, GNU
– ISV codes: Ansys100, CFX, Fluent, Gaussian, LS-DYNA, Matlab, MSC
Slide 40 LARS: Introduction and Motivation
Computer Rooms – Extension to the Building
Slide 41 LARS: Introduction and Motivation
Performance of Supercomputers at ZIH
[Chart: performance of the supercomputers at ZIH over the years, 0.0001 to 10,000 TFLOPS on a log scale, against the TOP500 rank-1, rank-10, and rank-500 trend lines: VP200-EX, 472 MFlops (rank 500); Cray T3E, 28 GFlops (rank 237); SGI Origin 2000, 16.5 GFlops (rank 236); SGI Origin 3800, 85.4 GFlops (rank 351); PC farm, 10.88 TFlops (rank 79); SGI Altix, 11.9 TFlops (rank 49); HRSK-II stage 1; HRSK-II stage 2; RWTH, 300 TFlops.]
Matthias S. Müller
Slide 43
Procurement of the analysis component of the data center
Michael Kluge
Megware 2011: 75 kW, 50 Tflop/s peak, 4 racks, 5760 cores, 12 TiB main memory; 90 nodes with 64 cores per node; heterogeneous memory configuration of 64-512 GiB per node
Linux Networx 2006: 250 kW, 13 Tflop/s peak, 26 racks, 2576 cores, 5.5 TiB main memory
Slide 44
Analysis component - network topology
Michael Kluge
[Diagram: InfiniBand network topology with 2 QDR spine switches and 5 QDR leaf switches serving groups of 20 nodes each, plus 2 login and 2 management nodes; the existing Deimos SDR fabric (724 nodes) and Lustre storage attach to the same network; scratch storage; Ethernet links to the TU Dresden /HOME and /BACKUP file systems and to the TU Dresden campus Ethernet (10 Gbit/s) / DFN uplink.]
Slide 45
4-socket AMD Opteron 6276 node
64 cores
– 2.3 GHz (turbo up to 3.3 GHz)
– Dual-core modules: 16 KiB L1 per core, 2 MiB L2 per module, one FPU shared by both cores
– 16 MiB (2x 8 MiB) L3 per processor; 12 MiB usable, since 4 MiB are reserved for HT Assist
2 chips per processor
– Connected via HyperTransport
– 8 NUMA nodes in a 4-socket system
Half-width HT 3.0 links between sockets
– 12.8 GB/s each (6.4 GB/s per direction)
128 GiB DDR3-1333, 4 channels per socket
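A hedged cross-check against the comparison table on the next slide, assuming that "chip" there means a full Opteron 6276 package (2 dies, 8 Bulldozer modules, 16 cores) and that each shared FPU can retire 8 double-precision FLOPS per cycle; these assumptions are mine, not stated on the slide:

```python
# Peak floating-point performance of one Opteron 6276 package and the full system.
clock_ghz = 2.3
modules_per_package = 8        # 2 dies x 4 dual-core modules per package (assumption)
flops_per_module_cycle = 8     # two 128-bit FMA pipes per shared FPU (assumption)

gflops_per_package = clock_ghz * modules_per_package * flops_per_module_cycle
print(gflops_per_package)                    # 147.2 -> matches "147 GF/s per chip" in the table
print(gflops_per_package * 4 * 90 / 1000)    # ~53 TF/s for 90 four-socket nodes, as in the table
```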
Daniel Molka
Slide 46
Analysis component - key figures
Michael Kluge
Per node              | Megware 2011      | LNXI 2006
Cores                 | 64 (4 chips)      | 2-8 (1-4 chips)
Core type             | Opteron 6276      | Opteron 285
Main memory           | 64-512 GiB        | 4-32 GiB
Peak FP performance   | 147 GF/s per chip | 10.4 GF/s per chip
Peak memory bandwidth | 35 GB/s per chip  | 6.4 GB/s per chip
Interconnect          | QDR; 4 GB/s; 1 µs | SDR; 1 GB/s; 1.7 µs
Operating system      | SuSE SLES         | SuSE SLES

Total system          | Megware 2011 | LNXI 2006
Peak FP performance   | 53 TF/s      | 13 TF/s
Peak memory bandwidth | 12 TB/s      | 8.2 TB/s
Bisection bandwidth   | 120 GB/s     | 60 GB/s
Batch system          | LSF          | LSF
Center for Information Services and High Performance Computing (ZIH)
HRSK-II
Plans for the next HPC infrastructure in Dresden
Architecture of future system (HRSK-II)
[Diagram: HRSK-II concept with a throughput component and an HPC component built from system blocks, a flexible agile storage system (SAN layer with SATA, SAS, and flash disk pools; server areas 1 and 2; HPC servers), archive and home storage, Ethernet connections to the ZIH backbone, a 30 Gbit/s cluster uplink, and the HRSK-II router with links to HTW, Chemnitz, Leipzig, Freiberg, Erlangen, and Potsdam.]
Power Consumption Monitoring
[Diagram: HRSK-II concept with the throughput component, the HPC component, and the flexible agile storage system; blocks 1 to N of the ZIH infrastructure with HPC, batch system, and login nodes; I/O analyses and energy-efficiency analyses feed optimization and control loops acting on accesses and power consumption.]
High Precision
High Frequency
From complete system down to single nodes
Combination of technologies to meet varieties of demands
Users are transparently mapped to different storage pools
Storage infrastructure supports continuous monitoring
Analysis of usage statistics and patterns allows optimization of data location and targeted user support
Characterization of different performance demands:
Basics of storage concept: FASS
Name         | Minimum Size | BW (GiB/s, sustained) | Random I/O ops/s | File creates/s
Checkpoint   | 1-2 PiB      | 70-100                | ca. 1,000        | ca. 5,000
Scratch      | 3-5 PiB      | ca. 20                | ca. 10,000       | ca. 5,000
Transactions | Main memory  | 70-100                | ca. 100,000      | ca. 50,000
Architecture of the storage concept (FASS)
[Diagram: the flexible storage system (FASS) sits between the HPC and throughput components (batch system, login nodes) and the ZIH infrastructure; users A through Z are mapped via switches 1 and 2 onto servers 1 to N and file systems backed by SSD, SAS, and SATA storage; transaction, checkpoint, scratch, and export pools; access statistics feed an analysis, optimization, and control loop.]
Location
1st phase in the current machine room
– 3,500,000
– <100 m²
– <300 kW
2nd phase in a new location
– 11,500,000
– <600 m²
– <2.5 MW
Operation timeline
[Gantt chart spanning the years 2006 through 2019.]
Phase     | Date
RFP       | September 2011
Contract  | March 2012
1st phase | June 2012
2nd phase | October 2013
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Research at ZIH
Selected Projects and Activities
Research areas at ZIH
Software tools to support programming and optimization
Programming methods and techniques for high performance computers
Grid computing
Mathematical methods, algorithms, and efficient implementations
Architecture and performance analysis of high performance computers
Algorithms and methods for modeling biological processes
Slide 55 LARS: Introduction and Motivation
Software tools …
Vampir
– Visualization and analysis of parallel applications
Marmot/MUST
– Detection of incorrect usage of the MPI communication library
ParBench
– Analysis of multiprogramming behavior
BenchIT
– Execution/archiving/presentation of benchmarks and their results
Screenshots: Marmot for Windows
Slide 56 LARS: Introduction and Motivation
Vampir: Framework
Slide 57 LARS: Introduction and Motivation
Vampir: Timelines
Slide 58 LARS: Introduction and Motivation
Vampir: Summaries
Slide 59 LARS: Introduction and Motivation
BenchIT
BenchIT measurement core
Command line interface
GUI
Website
Slide 60 LARS: Introduction and Motivation
Cluster Challenge 2008
Challenge:
– 6 students
– 44 hours
– 1 (self-assembled) cluster with a maximum power draw of 3.1 kW
– 5 scientific applications
Goal:
– Maximum job throughput within the competition time
Field of competitors:
Purdue University with SiCortex, University of Alberta with SGI, TUD/IU with IBM & Myricom, Taiwan with HP, Arizona State with Cray/MS, Colorado with Aspen Systems, MIT with Dell
Slide 61 LARS: Introduction and Motivation
Cluster Challenge 2008
Slide 62 LARS: Introduction and Motivation
Cluster Challenge 2008
Hardware optimizations
– 10G Myrinet interconnect (~120 W for switch + host adapters)
– Optimal DIMM configuration for the applications (16 GB per node)
– Booting from USB sticks and using the local disks only when necessary
– Measuring the power consumption profiles of the applications in order to choose the "right" total number of nodes
Software optimizations
– Use of commercial compilers where sensible (significant effort)
– Tracing the applications in order to understand and optimize communication
Throughput optimizations
– Using the power consumption and runtime estimates for optimal utilization of the cluster
Result: 1st place
Slide 63 LARS: Introduction and Motivation
Cluster Challenge 2008
Slide 64 LARS: Introduction and Motivation
Infrastructure
High performance computers:
Workstations:
Slide 66 LARS: Introduction and Motivation
International collaboration
Tracing
ParMA
VI HPS
Open MPI
Slide 67 LARS: Introduction and Motivation
Future prospects
In the many-core era, parallel computing becomes ever more important
Contacts with international partners
Industry contacts: IBM, SUN, Cray, SGI, NEC, Intel, AMD, …
Possible stays abroad or industry internships
– Examples of stays abroad:
• LLNL, CA, U.S.A.
• BSC, Barcelona, Spain
• Eugene, OR, U.S.A.
• ORNL, U.S.A.
– Examples of internships:
• Cray
• IBM
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Thank you!
Hope to see you next time…