center for information services and high ... - tu dresden

Nöthnitzer Straße 46

Room 1026

Tel. +49 351 - 463 - 35048

Holger Brunst ([email protected])

Matthias S. Mueller ([email protected])

Center for Information Services and High Performance Computing (ZIH)

Performance Analysis of Computer Systems

3. Nov. 2011

Holger Brunst, Matthias Müller: Performance Analysis

Summary of Previous Lecture (1)

  Remarks: Doherty (1970)

Performance is the degree to which a computing system meets expectations

of the persons involved in it.

  Main objective: Get highest performance for a given cost

  System:

An arbitrary collection of hardware, software, and firmware:

e.g. CPU, database, network of computers

  Metric:

A criterion used to evaluate the performance of a system:

e.g. response time, throughput, FLOPS

  Workload:

The overall sum of user requests to a system

e.g.: CPU workload: Collection of instructions to execute


Summary of Previous Lecture (2)

  Discussion of performance analysis examples and questions

–  Selection of technique, metric, and workload

–  Correctness of performance measurements

–  Measurement and simulation design

  The art of performance analysis

–  Successful evaluation cannot be produced mechanically

–  Evaluation requires detailed knowledge of the system to be modeled

10 steps for systematic performance evaluation

1.  State goals

2.  List services and outcomes

3.  Select metrics

4.  List parameters that affect performance

5.  Select factors to study

6.  Select technique for evaluation

7.  Select workload

8.  Design experiments

9.  Analyze and interpret data

10.  Present results



Summary of Previous Lecture: Questions

  What does performance mean?

  What are the main reasons to do a performance analysis?

  What are the main tasks?

  What’s a system in performance analysis terminology?

  What do the terms metric and workload stand for?

  What’s a performance parameter?

  What’s a performance factor?






Parallel Metrics


Excursion on Speedup and Efficiency Metrics

  Comparison of sequential and parallel algorithms

  Speedup:

–  n is the number of processors

–  T1 is the execution time of the sequential algorithm

–  Tn is the execution time of the parallel algorithm with n processors

  Efficiency:

–  Its value estimates how well the p processors are utilized in solving a given problem (p is the number of processors, called n above)

–  Usually between zero and one. Exception: Super linear speedup (later)

$$S_n = \frac{T_1}{T_n}$$

$$E_p = \frac{S_p}{p}$$
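The two definitions above can be sketched in a few lines of Python. The function names and the timings in the usage lines are illustrative assumptions, not measurements from the lecture.

```python
def speedup(t1: float, tn: float) -> float:
    """Speedup S_n = T_1 / T_n (sequential time over parallel time)."""
    if t1 <= 0 or tn <= 0:
        raise ValueError("execution times must be positive")
    return t1 / tn


def efficiency(t1: float, tn: float, n: int) -> float:
    """Efficiency E_n = S_n / n for n processors."""
    if n < 1:
        raise ValueError("need at least one processor")
    return speedup(t1, tn) / n


# Hypothetical timings: 120 s sequential, 20 s on 8 processors.
s8 = speedup(120.0, 20.0)        # 6.0
e8 = efficiency(120.0, 20.0, 8)  # 0.75
print(s8, e8)
```

An efficiency of 0.75 here would mean that, on average, a quarter of the processor time is not contributing to the speedup.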


Amdahl’s Law

  Find the maximum expected improvement to an overall system when only part of the system is improved

  Serial execution time = s+p

  Parallel execution time = s+p/n

–  Normalizing with respect to serial time (s+p) = 1 results in:

•  Sn = 1/(s+p/n)

–  Drops off rapidly as serial fraction increases

–  Maximum speedup possible = 1/s, independent of n, the number of processors!

  Bad news: If an application has only 1% serial work (s = 0.01), then you will never see a speedup greater than 100. So why do we build systems with more than 100 processors?

  What is wrong with this argument?

$$S_n = \frac{s + p}{s + \frac{p}{n}}$$
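To make the 1/s limit concrete, the normalized formula Sn = 1/(s+p/n) can be evaluated for growing n. This is a minimal sketch; the function name and the chosen processor counts are assumptions for illustration.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Amdahl speedup 1 / (s + p/n) with serial time normalized to s + p = 1."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("serial fraction s must lie in [0, 1]")
    if n < 1:
        raise ValueError("need at least one processor")
    p = 1.0 - s  # parallel fraction
    return 1.0 / (s + p / n)


# With s = 0.01 the speedup approaches, but never reaches, 1/s = 100.
for n in (10, 100, 1000, 10**6):
    print(n, round(amdahl_speedup(0.01, n), 2))
```

Note that at n = 100 the speedup is only about 50: half of the machine is already lost to the 1% serial part.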


Scaled Speedup (Gustafson-Barsis’ Law)

  Amdahl’s speedup equation assumes p is independent of n, in other words

the problem size remains the same

  Gustafson-Barsis’ law states that any sufficiently large problem can be

efficiently parallelized

  More realistic to assume “runtime” remains the same, NOT the problem size

  If the problem size scales up, does the serial part also increase?

  Parallel execution time = s+p

  Serial execution time = s+np

–  Normalizing with respect to parallel execution time results in:

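–  $S_{s,n} = n + (1-n)\,s = 1 + p\,(n-1)$

  Without normalization, the scaled speedup is serial time over parallel time:

$$S_{s,n} = \frac{s + pn}{s + p}$$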
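The scaled speedup can be evaluated the same way as Amdahl's formula. The sketch below assumes s and p are measured on the parallel run and normalized to s + p = 1; the function name is illustrative.

```python
def scaled_speedup(s: float, n: int) -> float:
    """Gustafson-Barsis scaled speedup n + (1 - n) * s, with s + p = 1."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("serial fraction s must lie in [0, 1]")
    if n < 1:
        raise ValueError("need at least one processor")
    return n + (1 - n) * s


# Same 1% serial fraction as in the Amdahl example, but with the problem
# size growing with n: the speedup keeps growing almost linearly.
for n in (10, 100, 1000):
    print(n, round(scaled_speedup(0.01, n), 2))
```

Under this assumption a 1% serial fraction no longer caps the speedup at 100, which is one answer to the question of why machines with more than 100 processors are built.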


Efficiency and Serial Fraction

  Strong scalability vs. weak scalability

  En = Sn/n, does not tell the whole story

–  is it necessarily bad if efficiency drops as you increase n for a given

problem size?

  s is supposed to be a constant

–  this assumes work is load balanced

–  no overhead for synchronizing the processors

  Experimentally measure the serial fraction

–  if s does not remain constant, what can we discern?
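One way to measure the serial fraction experimentally is to solve Amdahl's formula Sn = 1/(s+(1-s)/n) for s, which gives s = (1/Sn - 1/n)/(1 - 1/n). This is commonly known as the Karp-Flatt metric; the slides do not name a specific method, so treat it as one possible approach. The measured speedups below are hypothetical.

```python
def serial_fraction(speedup: float, n: int) -> float:
    """Experimentally determined serial fraction from a measured speedup on n processors."""
    if n < 2:
        raise ValueError("need at least two processors")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)


# Hypothetical measured speedups for 2, 4, and 8 processors.
measured = {2: 1.82, 4: 3.08, 8: 4.71}
for n, s_n in measured.items():
    print(n, round(serial_fraction(s_n, n), 3))
```

If the computed value stays roughly constant, the limited speedup is explained by a fixed serial part. If it grows with n, overheads such as load imbalance or synchronization are probably increasing with the processor count.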


Superlinear/Superunitary Speedup

  Work in algorithm = Wreal+Wovhd

  What is Wovhd?

  Super-unitary speedup possible if total work done by n processors is strictly

less than that done by a single processor

  Reasons for super-unitary speedup

–  Memory and cache effects

–  Dividing up resource management overheads

–  Hiding latency for remote operations

–  Randomized algorithms

  In the literature, superlinear speedup is sometimes also referred to as super-unitary speedup, which might be mathematically more correct






System under Test

System under Test


[Figure: components of the system under test]

Application (C++, Fortran, C)

MPI, OpenMP

Compiler

Runtime

OS (Linux, Windows)

Hardware

Code Size of HPC Software relative to other Systems

Software                     Lines of Code   Person Years
Windows NT 3.1               ~4,500,000      900
Linux Kernel 2.6.0           ~5,200,000      1040
Lustre                       >500,000        100
Open MPI 1.3.3               ~525,000        105
Open64 compiler 4.2.1        ~1,139,000      227
HPCC                         ~50,000         10
VampirServer+VampirClient    ~300,000        60
VampirTrace                  ~80,000         16
Marmot                       ~65,000         13

Compare different Compilers with SPEC OMPM2001

Code Tuning: different compiler flags

Result of disk performance tests

[Figure: histogram "Disktest on 622 Nodes". x-axis: disk speed [MB/s]; y-axis: number of nodes in each class. avg: 45.94, max: 66.30, min: 2.10]

Result of one SPEC OMPM application

–  Histogram of 320.equake runtime on dual CPU nodes

–  Sharp distribution indicates a healthy execution environment

Result of one SPEC OMPM application

–  Histogram of 310.wupwise runtime on dual CPU nodes

–  Shows huge variation in runtime

–  Problem identified as BIOS bug






Workload types, selection and characterization


Types of Workloads

  Test workload:

–  Any workload used in performance studies

–  Real or synthetic

  Real workload:

–  Observed on a system being used for normal operation

–  Cannot be repeated

–  May contain sensitive data

  Synthetic workload:

–  Should be representative for a real workload

–  Often smaller in size


Historical examples for test workloads

  Addition instruction

  Instruction mixes

  Kernels

  Synthetic programs

  Application benchmarks


Popular benchmarks: Eratosthenes sieve algorithm

  Algorithm to find prime numbers

  Kernel

  Simple

  An algorithm is always independent of a computer language or specific

implementation

  Not very representative of today's use of computers
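For reference, a straightforward version of the sieve kernel might look as follows. This is a plain Python sketch of the algorithm, not the code of any particular historical benchmark distribution.

```python
def sieve(limit: int) -> list[int]:
    """Return all primes <= limit using the sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    i = 2
    while i * i <= limit:
        if is_prime[i]:
            # Cross out every multiple of i, starting at i*i
            # (smaller multiples were crossed out by smaller primes).
            for multiple in range(i * i, limit + 1, i):
                is_prime[multiple] = False
        i += 1
    return [k for k, flag in enumerate(is_prime) if flag]


print(sieve(30))
```

As a workload, the kernel exercises little more than integer arithmetic, simple loops, and array accesses, which is why it says little about typical applications.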


Popular benchmarks: Ackermann’s Function

  Ackermann(m,n) := n+1 if m=0

Ackermann(m-1,1) if m>0 and n=0

Ackermann(m-1, Ackermann(m,n-1)) otherwise

  Used to assess the efficiency of procedure calls

  Ackermann(3,n) requires

(512*4**(n-1)-15*2**(n+3)+9*n+37)/3 calls and

a stack size 2**(n+3)-4
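The call-count formula can be checked for small n with an instrumented implementation. The sketch below uses the standard argument order Ackermann(m,n) and rewrites 512*4**(n-1) as 128*4**n so that the arithmetic stays in integers for n=0; only the number of calls is checked, not the stack size.

```python
def ackermann_counted(m: int, n: int) -> tuple[int, int]:
    """Return (value, number of calls) for Ackermann(m, n)."""
    calls = 0

    def ack(m: int, n: int) -> int:
        nonlocal calls
        calls += 1
        if m == 0:
            return n + 1
        if n == 0:
            return ack(m - 1, 1)
        return ack(m - 1, ack(m, n - 1))

    value = ack(m, n)
    return value, calls


def predicted_calls(n: int) -> int:
    """Calls needed for Ackermann(3, n) according to the formula on the slide."""
    return (128 * 4**n - 15 * 2**(n + 3) + 9 * n + 37) // 3


# Keep n small: the recursion depth grows like 2**(n+3).
for n in range(4):
    value, calls = ackermann_counted(3, n)
    print(n, value, calls, predicted_calls(n))
```

Because almost all of the time is spent entering and leaving procedure calls, the run time divided by the call count gives an estimate of the cost of one call.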


Popular benchmarks: Whetstone

  Used at British Central Computer Agency

  11 modules

  Representative of 949 ALGOL programs

  Available in ALGOL, FORTRAN, PL/I and other languages

  See Curnow and Wichmann (1975)

  Results in KWHIPS (Kilo Whetstone Instructions Per Second)

  Workload characteristics:

–  Floating point intensive

–  Cache friendly

–  No I/O


Popular benchmarks: LINPACK

  Developed by Jack Dongarra (1983) at ANL (now ICL, UTK)

  Solves a dense system of linear equations

  Algorithmic definition of the benchmark

  Reference implementation available (HPL)

  Makes heavy use of BLAS

  One fixed dataset: 100x100

  Used as the benchmark for the TOP500 list

  Many vendors have their own hand-tuned implementation


Popular benchmarks: Dhrystone

  Developed in 1984 by Reinhold Weicker at Siemens

  Represents systems programming environments

  Available in C, Pascal and Ada

  Results are in Dhrystone Instructions Per Second (DIPS)

  Includes ground rules for building and executing Dhrystone (run rules)


Popular Benchmarks: Lawrence Livermore Loops

  24 separate tests

  Largely vectorizable

  Assembled at LLNL (see McMahon 1986)


Popular Benchmarks: Transaction Processing (TPC-C)

  Successor of the Debit-Credit Benchmark

  TPC-C is an on-line transaction processing benchmark

  Results report performance (tpmC) and price/performance ($/tpmC)

  System reported has to be available to the customer (at that price)

  Running the benchmarks requires a costly setup:


SPEC groups and benchmarks

  Open Systems Group (desktop systems, high-end workstations and servers)

–  CPU (CPU benchmarks)

–  JAVA (Java client and server side benchmarks)

–  MAIL (mail server benchmarks)

–  SFS (file server benchmarks)

–  WEB (web server benchmarks)

  High Performance Group (HPC systems)

–  OMP (OpenMP benchmark)

–  HPC (HPC application benchmark)

–  MPI (MPI application benchmark)

  Graphics Performance Groups (Graphics)

–  Apc (Graphics application benchmarks)

–  Opc (OpenGL performance benchmarks)






Workload Selection


System under Study

  Seems to be an easy thing to define

  Be aware of different abstraction layers

  Example: ISO/OSI reference model for computer networks (layer 7 at the top):

7.  Application (mail, FTP)

6.  Presentation (Data compression, ..)

5.  Session (Dialogs)

4.  Transport (Messages)

3.  Network (Packets)

2.  Data link (Frames)

1.  Physical (Bits)


Level of Detail of the workload description

  Examples:

–  Most frequent request (e.g. Addition)

–  Frequency of request type (instruction mix)

–  Time-stamped sequence of requests

–  Average resource demand (e.g. 20 I/O requests per second)

–  Distribution of resource demands (not only the average, but also

probability distribution)
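The levels of detail listed above can all be derived from the most detailed one, the time-stamped sequence of requests. The following sketch uses a small hypothetical trace; the record layout and all function names are assumptions for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical trace: (timestamp in seconds, request type, I/O requests issued)
trace = [
    (0.0, "add", 0), (0.1, "load", 2), (0.2, "add", 0), (0.4, "store", 3),
    (0.5, "add", 0), (0.7, "load", 1), (0.9, "branch", 0), (1.0, "add", 0),
]


def most_frequent_request(trace):
    """Coarsest description: the single most frequent request type."""
    return Counter(kind for _, kind, _ in trace).most_common(1)[0][0]


def request_mix(trace):
    """Relative frequency of each request type (instruction mix)."""
    counts = Counter(kind for _, kind, _ in trace)
    return {kind: count / len(trace) for kind, count in counts.items()}


def average_demand(trace):
    """Average resource demand per request."""
    return mean(io for _, _, io in trace)


def demand_distribution(trace):
    """Empirical probability of each observed resource demand."""
    counts = Counter(io for _, _, io in trace)
    return {io: count / len(trace) for io, count in counts.items()}


print(most_frequent_request(trace))
print(request_mix(trace))
print(average_demand(trace))
print(demand_distribution(trace))
```

Each step down the list discards information: the mix loses the ordering of requests, and the average loses the spread that the distribution still shows.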


Representativeness

  Benchmarks are not an end in themselves; they should represent real workloads.

  Different characteristics to consider:

–  Arrival rate of requests

–  Resource demands

–  Resource usage profile (sequence and amounts of resources used by an

application)

  To be representative a test workload has to follow the user behavior in a

timely fashion!!!



SPEC Benchmarks

Lecture: Performance Analysis


Outline

  What is SPEC?

  Who is SPEC?

  Some SPEC benchmarks:

–  SPEC CPU

–  SPEC HPC

–  SPEC OMP

–  SPEC MPI

  Summary






What and who is SPEC?


What is SPEC?

  The Standard Performance Evaluation Corporation (SPEC) is a non-profit

corporation formed to establish, maintain and endorse a standardized set of

relevant benchmarks that can be applied to the newest generation of high-

performance computers. SPEC develops suites of benchmarks and also

reviews and publishes submitted results from our member organizations and

other benchmark licensees.

  For more details see http://www.spec.org


SPEC Members

  SPEC Members:

  3DLabs * Acer Inc. * Advanced Micro Devices * Apple Computer, Inc. * ATI Research * Azul Systems, Inc. * BEA Systems * Borland * Bull S.A. * CommuniGate Systems * Dell * EMC * Exanet * Fabric7 Systems, Inc. * Freescale Semiconductor, Inc. * Fujitsu Limited * Fujitsu Siemens * Hewlett-Packard * Hitachi Data Systems * Hitachi Ltd. * IBM * Intel * ION Computer Systems * JBoss * Microsoft * Mirapoint * NEC - Japan * Network Appliance * Novell * NVIDIA * Openwave Systems * Oracle * P.A. Semi * Panasas * PathScale * The Portland Group * S3 Graphics Co., Ltd. * SAP AG * SGI * Sun Microsystems * Super Micro Computer, Inc. * Sybase * Symantec Corporation * Unisys * Verisign * Zeus Technology *

  SPEC Associates:

  California Institute of Technology * Center for Scientific Computing (CSC) * Defence Science and Technology Organisation - Stirling * Dresden University of Technology * Duke University * JAIST * Kyushu University * Leibniz Rechenzentrum - Germany * National University of Singapore * New South Wales Department of Education and Training * Purdue University * Queen's University * Rightmark * Stanford University * Technical University of Darmstadt * Texas A&M University * Tsinghua University * University of Aizu - Japan * University of California - Berkeley * University of Central Florida * University of Illinois - NCSA * University of Maryland * University of Modena * University of Nebraska, Lincoln * University of New Mexico * University of Pavia * University of Stuttgart * University of Texas at Austin * University of Texas at El Paso * University of Tsukuba * University of Waterloo * VA Austin Automation Center *


SPEC groups

  Open Systems Group (desktop systems, high-end workstations and servers)

–  CPU (CPU benchmarks)

–  JAVA (Java client and server side benchmarks)

–  MAIL (mail server benchmarks)

–  SFS (file server benchmarks)

–  WEB (web server benchmarks)

  High Performance Group (HPC systems)

–  OMP (OpenMP benchmark)

–  HPC (HPC application benchmark)

–  MPI (MPI application benchmark)

  Graphics Performance Groups (Graphics)

–  Apc (Graphics application benchmarks)

–  Opc (OpenGL performance benchmarks)


SPEC HPG = SPEC High-Performance Group

  Founded in 1994

  Mission: To establish, maintain, and endorse a suite of

benchmarks that are representative of real-world high-

performance computing applications.

  SPEC/HPG includes members from both industry and academia.

  Benchmark products:

–  SPEC OMP (OMPM2001, OMPL2001)

–  SPEC HPC2002 released at SC 2002

–  SPEC MPI (under development)


Currently active SPEC HPG Members

  Fujitsu

  HP

  IBM

  Intel

  SGI

  SUN

  UNISYS

  Purdue University

  Technische Universität Dresden


HPG (High Performance Group) Benchmark Suites

[Figure: timeline of the SPEC HPG benchmark suites. Labels: Founding of SPEC HPG, HPC96, OMP2001, OMPL2001, HPC2002, MPI2007. Date ticks: Jan 1994, 1996, June 2001, June 2002, Jan 2003, 2007]






Overview and Positioning


Where is SPEC Relative to Other Benchmarks?

  There are many metrics, each one has its purpose

(ordered from close to the computer hardware to close to the applications)

Raw machine performance: Tflops

Microbenchmarks: Stream

Algorithmic benchmarks: Linpack

Compact Apps/Kernels: NAS benchmarks

Application Suites: SPEC

User-specific applications: Custom benchmarks


Why do we need benchmarks?

  Identify problems: measure machine properties

  Time evolution: verify that we make progress

  Coverage:

Help the vendors to have representative codes:

–  Increase competition by transparency

–  Drive future development (see SPEC CPU2000)

  Relevance:

Help the customers to choose the right computer


Comparison of different benchmark classes

              Coverage   Relevance   Identify problems   Time evolution
Micro         0          0           ++                  +
Algorithmic   -          0           +                   ++
Kernels       0          0           +                   +
SPEC          +          +           +                   +
Apps          -          ++          0                   0






SPEC CPU2006

From John Henning’s talk at the SPEC Workshop, June 2007, Dresden


SPEC CPU2006 History

  Released August 2006

  Replaces CPU2000 (retired February 2007)

  5th CPU benchmark

–  SPECmark (later called “CPU89”)

–  SPEC92 (later called “CPU92”)

–  CPU95

–  CPU2000

–  CPU2006

  Note: these updates are required to stay representative

  Question to the audience: What kind of application would you add?


CINT 2006

Benchmark | Language | Application Area | Brief Description
400.perlbench | C | Programming Language | Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
401.bzip2 | C | Compression | Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc | C | C Compiler | Based on gcc Version 3.2, generates code for Opteron.
429.mcf | C | Combinatorial Optimization | Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
445.gobmk | C | Artificial Intelligence: Go | Plays the game of Go, a simply described but deeply complex game.
456.hmmer | C | Search Gene Sequence | Protein sequence analysis using profile hidden Markov models (profile HMMs).
458.sjeng | C | Artificial Intelligence: Chess | A highly-ranked chess program that also plays several chess variants.
462.libquantum | C | Physics / Quantum Computing | Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
464.h264ref | C | Video Compression | A reference implementation of H.264/AVC, encodes a video stream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
471.omnetpp | C++ | Discrete Event Simulation | Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
473.astar | C++ | Path-finding Algorithms | Pathfinding library for 2D maps, including the well known A* algorithm.
483.xalancbmk | C++ | XML Processing | A modified version of Xalan-C++, which transforms XML documents to other document types.


CFP 2006 (part I)

Benchmark | Language | Application Area | Brief Description
410.bwaves | Fortran | Fluid Dynamics | Computes 3D transonic transient laminar viscous flow.
416.gamess | Fortran | Quantum Chemistry | Implements a wide range of quantum chemical computations. The SPEC workload does self-consistent field calculations using the Restricted Hartree-Fock method, Restricted open-shell Hartree-Fock, and Multi-Configuration Self-Consistent Field.
433.milc | C | Physics / QCD | A gauge field generating program for lattice gauge theory with dynamical quarks.
434.zeusmp | Fortran | Physics / CFD | ZEUS-MP is a computational fluid dynamics code developed at the Laboratory for Computational Astrophysics (NCSA, University of Illinois at Urbana-Champaign) for the simulation of astrophysical phenomena.
435.gromacs | C, Fortran | Biochemistry | Molecular dynamics, i.e. simulate Newtonian equations of motion for hundreds to millions of particles. The test case simulates protein Lysozyme in a solution.
436.cactusADM | C, Fortran | Physics / General Relativity | Solves the Einstein evolution equations using a staggered-leapfrog numerical method.
437.leslie3d | Fortran | Fluid Dynamics | Computational Fluid Dynamics (CFD) using Large-Eddy Simulations with Linear-Eddy Model in 3D. Uses MacCormack Predictor-Corrector time integration.
444.namd | C++ | Biology / Molecular Dynamics | Simulates biomolecular systems. Test case has 92,224 atoms of apolipoprotein A-I.
447.dealII | C++ | FE Analysis | deal.II is a C++ library targeted at adaptive finite elements and error estimation. The test case solves a Helmholtz-type equation with non-constant coefficients.


CFP 2006 (part II)

Benchmark | Language | Application Area | Brief Description
450.soplex | C++ | Linear Programming, Optimization | Solves a linear program using a simplex algorithm and sparse linear algebra. Test cases include railroad planning and military airlift models.
453.povray | C++ | Image Ray-tracing | Image rendering. The test case is a 1280x1024 anti-aliased image of a landscape with some abstract objects with textures using a Perlin noise function.
454.calculix | C, Fortran | Structural Mechanics | Finite element code for 3D structural applications. Uses the SPOOLES solver library.
459.GemsFDTD | Fortran | Electromagnetics | Solves Maxwell equations in 3D using finite-difference time-domain (FDTD) method.
465.tonto | Fortran | Quantum Chemistry | An open source quantum chemistry package, using an object-oriented design in Fortran 95. The test case places a constraint on a molecular Hartree-Fock wavefunction calculation to better match experimental X-ray diffraction data.
470.lbm | C | Fluid Dynamics | Implements the "Lattice-Boltzmann Method" to simulate incompressible fluids in 3D.
481.wrf | C, Fortran | Weather | Weather modeling from scales of meters to thousands of kilometers. The test case is from a 30km area over 2 days.
482.sphinx3 | C | Speech Recognition | A widely-known speech recognition system from Carnegie Mellon University.


Code growth


Metrics

  Speed

–  SPECint_base2006 (Required Base result)

–  SPECint2006 (Optional Peak result)

–  SPECfp_base2006 (Required Base result)

–  SPECfp2006 (Optional Peak result)

  Throughput

–  SPECint_rate_base2006 (Required Base result)

–  SPECint_rate2006 (Optional Peak result)

–  SPECfp_rate_base2006 (Required Base result)

–  SPECfp_rate2006 (Optional Peak result)


Speed Metric for Single Benchmark

  For each benchmark in suite, compute ratio vs. time on a reference system

–  A 1997 Sun system with 296 MHz UltraSPARC II

–  Similar but not identical to CPU2000 ref machine

  Example:

–  400.perlbench on a year 2006 iMac took 948 seconds

–  On the reference system, took 9770 seconds

–  SPECratio = 10.3 (9770/948)

–  If your workload looks like perl, you might find that this modern iMac

runs around 10x faster than a state-of-the-1997-art workstation.
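The SPECratio from the example is simply the reference time divided by the measured time. A minimal sketch, with an illustrative function name:

```python
def spec_ratio(reference_seconds: float, measured_seconds: float) -> float:
    """SPECratio: how many times faster than the reference system a run was."""
    if reference_seconds <= 0 or measured_seconds <= 0:
        raise ValueError("run times must be positive")
    return reference_seconds / measured_seconds


# 400.perlbench: 9770 s on the reference system, 948 s on the 2006 iMac.
print(f"{spec_ratio(9770, 948):.1f}")  # 10.3
```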


Overall Speed Metric

  To obtain the overall speed metrics: geometric mean of the individual

SPECratios

  Why geometric mean?

  Because this is the best answer to the question

  “Without knowing how much time I will spend in text processing vs. network

mapping vs. compiling vs. video compression, please tell me about how

much faster this machine will be than the reference system.”
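A useful property of the geometric mean of ratios is that the comparison between two machines does not depend on the choice of reference system, because the reference times cancel. The sketch below illustrates this with made-up run times; all numbers and names are hypothetical.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed via logarithms."""
    ratios = list(ratios)
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("need at least one positive ratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical run times in seconds for three benchmarks.
reference = [1000.0, 2000.0, 4000.0]
machine_a = [100.0, 250.0, 400.0]
machine_b = [125.0, 200.0, 500.0]

score_a = geometric_mean(r / t for r, t in zip(reference, machine_a))
score_b = geometric_mean(r / t for r, t in zip(reference, machine_b))

# The same comparison computed directly from the two machines' times,
# without any reference system.
direct = geometric_mean(b / a for a, b in zip(machine_a, machine_b))
print(score_a / score_b, direct)
```

An arithmetic mean of the ratios would not have this property: changing the reference system could then change which machine looks faster.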


Motivation for Throughput Metric

  Differs from speed

  Stove analogy:

–  One big flame cooks one big pot with one hogshead in one hour

–  6 little flames cook 6 little pots, each holding one firkin, in 15 minutes

–  Which is better?

  Well, the big flame does ~250 liters/hour; each little flame does only ~40 * 4 = 160

liters/hour, but the six little flames together do ~6 * 160 = 960 liters/hour


Throughput vs. Speed

  Big flame does ~250 liters/hour; each little flame does only ~40 * 4 = 160

liters/hour

  Alternatives:

–  If I only need to heat up an UNOPENED container holding 1 gallon of

soup, supper can be served most quickly if I put it on the big flame

–  If I need to heat up one butt of soup (=2 hogsheads), and if I can open

the container, I'd be better off using many small flames

  In IT business:

–  Processing one image in Photoshop or Gimp vs.

–  Rendering the next movie with thousands of pictures


CPU2006 Throughput Metric

  Formula:

the number of copies run * reference time for the benchmark / elapsed time

in seconds

  Example:

Sun Fire E25K runs 144 copies of 400.perlbench in 1066 seconds:

144 * 9770 / 1066 = 1320
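The throughput formula can be written down directly. A minimal sketch, with an illustrative function name:

```python
def spec_rate(copies: int, reference_seconds: float, elapsed_seconds: float) -> float:
    """Rate metric for one benchmark: copies * reference time / elapsed time."""
    if copies < 1:
        raise ValueError("need at least one copy")
    if reference_seconds <= 0 or elapsed_seconds <= 0:
        raise ValueError("times must be positive")
    return copies * reference_seconds / elapsed_seconds


# Sun Fire E25K: 144 copies of 400.perlbench in 1066 seconds.
print(round(spec_rate(144, 9770, 1066)))  # 1320
```

With a single copy the formula reduces to the SPECratio of the speed metric, so the rate can be read as the SPECratio scaled by the number of copies that finish in the elapsed time.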


Summary of Metrics

  Two different kinds of metrics

–  speed (single application turnaround)

–  rate (throughput)

  Run rules make the difference between base and peak

–  Base: conservative optimization, less freedom

–  Peak: more aggressive optimization, more freedom

  Two benchmark sets: SPECint and SPECfp

⇒ 2^3 = 8 different metrics

  If you look at the single application results you get:

⇒ 2*2*(12+17)=116 different metrics
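The eight metric names follow from three independent two-way choices (benchmark set, speed or rate, base or peak), so they can be enumerated mechanically. The sketch below rebuilds the names listed on the Metrics slide; the function name is an illustrative assumption.

```python
from itertools import product


def cpu2006_metric_names() -> list[str]:
    """Enumerate the 2 * 2 * 2 overall SPEC CPU2006 metric names."""
    names = []
    for suite, mode, tuning in product(("int", "fp"), ("", "_rate"), ("_base", "")):
        # suite: benchmark set, mode: speed ("") or throughput ("_rate"),
        # tuning: base ("_base") or peak ("")
        names.append(f"SPEC{suite}{mode}{tuning}2006")
    return names


names = cpu2006_metric_names()
print(len(names))
print(names)

# Per-benchmark results: speed/rate * base/peak * (12 integer + 17 floating point)
print(2 * 2 * (12 + 17))
```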


Example for Run Rules

  Base does not allow feedback directed optimization (still legal in peak)

  An unlimited number of flags may be set in base,

–  Why? Because flag counting is not worth arguing about.

–  For example, is -fast:np27 one flag, two, or three? Prove it.

–  What if it's -fast_np27 ?

–  What if it’s -fast np27 or -fast -np27 ?


SPEC CPU2000 Result






Thank You!
