commercial system performance analysis at sun · workload characterization miss rate analysis...

47
Commercial System Performance Analysis at Sun Brian O'Krafka Sun Labs Performance and Architecture Group

Upload: others

Post on 19-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

  • Commercial S ystemPerformance Anal ysis at Sun

    Brian O'KrafkaSun Labs Performance and

    Architecture Group

  • 4/28/03

    Commercial System Performance Analysis at Sun

    Introduction

    System ModelingTypical System ModelModeling Environment

    Workload characterizationMiss rate analysisProcessor abstraction using blocking factors

  • 4/28/03

    The ProblemTimely, accurate performance projections for Sun systems

    Why?early exploration of broad design spaceengineering tradeoffs (alternative processors, cache sizes, interconnect topology, bus sizing, ...)validation of more detailed modelsportfolio management (business planning)

    Scope:single cache coherent multiprocessor systems"well-behaved" commercial workloads

  • 4/28/03

    Constraintsworkloads are a moving target

    traces are very hard to get (esp. instruction traces)

    never have traces for all configurations of interest

    need broad design space exploration early in project

    need good accuracy

    detailed, low-level system models aren't a complete solution:detail is unavailable early in projectdon't have traces to drive configurations of interestdifficult to write and debug

    many, many configurations

  • 4/28/03

    Queueing System Model

    Mem Timer

    CPU Timer

    System Timer

    HWTradeoffs

    System Perf. Projections

    SW Optimization Ideas

    Contingency

    PerformanceModels

    ModelMiss Rates

    Performance Analysis Process

    Physical HW Workload Data Collection

    Counters BusTraces

    InstrTraces

    SW Pathlength & Scaling

    BenchmarkResults

    SamplingScheme

    OSMiddlewareApps

    CompilerTarget Benchmarks

    FullBenchmark

    ApproximateBenchmark

    BlackMagic

    L1/TLBCache Sim

    L2/L3Cache Sim

    "Raw"Miss Rates

    "Cooked"Miss Rates Extrapolate

    Miss Rate Analysis

    - Designer Intent- Informal Spec

    RTL

    forexecution-driven

    simulation

    fortrace-drivensimulation

    for probability-driven simulation

  • 4/28/03

    System Modelin g:Typical ModelSysmodel Environment

  • P16 The SPARC Microprocessor Design ©2002 Sun Microsystems, Inc.

    Figure 3-1: Architecture of the Sun Fire V1280 Server

    The SPARC Microprocessor Design

    The Version 9 Architecture

    The SPARC architecture has been implemented in processors used in a range of systems from

    laptops to supercomputers. SPARC International member companies have implemented

    numerous compatible microprocessors since the SPARC platform was first announced — more

    than any other RISC (reduced instruction set computing) microprocessor family. As a result, the

    SPARC architecture boasts the support of thousands of compatible software and hardware

    products. SPARC Version 9 maintains upwards binary compatibility for application software

    developed for previous SPARC architecture implementations, including microSPARC®,

    TurboSPARC®, SuperSPARC®, and previous versions of UltraSPARC.

    The SPARC V9 architecture represents a significant advance for the microprocessor industry. It

    provides 64-bit data and addressing, fault tolerance features, fast context switching, support for

    advanced compiler optimizations, efficient design for superscalar processors, and a clean

    structure for emerging operating systems. And all of this has been accomplished with 100-percent

    binary compatibility for existing applications.

    The UltraSPARC III Cu Processor

    The UltraSPARC III Cu processor is part of a third generation of UltraSPARC pipeline-based

    products. In addition to using a new process technology, the UltraSPARC III Cu processor provides

    a higher clock frequency, reduced on-chip latencies, support for greater amounts of level-one and

    level-two cache, and an integrated external memory controller. Other features include support for

    FireplaneFireplane

    Power DistributionBoard

    Power DistributionBoard

    IB_SSCPCI CardSystem ControllerSystem Controller

    SCSI G/Bit Ethernet Alarms

    I/O ControllerFan TrayFan Tray

    System IndicatorsSystem

    Indicators

    DiskDisk

    DATDAT

    DVD-ROMDVD-ROM

    SCCRSCCR

    10/100 Ethernet

    RJ45 Serial LOM/Console

    Rj45 Serial Reserved

    System Board(CPU/Memory)

    Level 2Repeater

    PowerSupply

    X3 X2

    X4

    Power FeedsAC or DC

    X6 X2

    I/O Chassis

  • P20 CPU/Memory Boards ©2002 Sun Microsystems, Inc.

    a critical design point since misses from large caches tend to cluster, with adjacent misses

    impacting performance in systems without high memory transfer rates.

    • System Interface Unit (SIU)

    Charged with the task of all other off-processor communications (memory, other processors,

    and I/O devices) the SIU can handle up to 15 pending transactions with support for full out-of-

    order data delivery on each transaction, enabling memory banks in a MP system service a

    request as soon as a bank is available. All processor interfaces use error detection and/or

    correction codes to quickly detect errors. In the event of an error on the system bus, an

    independent 8-bit-wide back door bus allows the use of automated diagnostics to isolate the

    problem.

    CPU/Memory BoardsThe Sun Fire V1280 server can accommodate up to twelve UltraSPARC III Cu processors populated

    on three CPU/memory boards. Each board includes four processors, all cache, and main memory.

    Memory can be added to a board after initial installation by trained service personnel. In addition,

    while all of the processors on a single CPU/memory board must be the same speed, other

    CPU/memory boards within the system may use processors clocked at a different speed. This

    mixed-speed CPU support results in better investment protection when upgrading by precluding

    the need to replace all of the existing processors in a system.

    The block diagram of the CPU/memory board used in the Sun Fire V1280 server is shown in

    Figure 3-4. Address and control paths are illustrated with dashed lines, and data path with solid

    lines. The interconnect components on the left connect to the Sun Fireplane interconnect switch

    boards. The bandwidths shown are the peak at each point on the board. (Electronically, the

    CPU/memory boards are identical to the Uniboard used in the Sun Fire 3800-6800 servers.

    However, they are physically incompatible and are not interchangeable.)

    Figure 3-4: CPU/Memory Board Block Diagram

    Board Data Switch

    Data pathController

    Address Repeater

    CPU Data Switch

    CPU and E-Cache

    Memory(8 GB)

    CPU and E-Cache

    2.4 GB/sec.

    Memory(8 GB)

    2.4 GB/sec.

    CPU Data Switch

    CPU and E-Cache

    Memory(8 GB)

    CPU and E-Cache

    2.4 GB/sec.

    Memory(8 GB)

    2.4 GB/sec.

    4.8 GB/sec.

    4.8 GB/sec.

    4.8 GB/sec.

    150 MillionAddresses/sec.

  • LD: remote clean (0.000201449)

    0% 100% 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280

    "Core: L1 miss" (1)

    "L2 clock align" (2)

    ec_addr (4)9%

    "L2 miss" (10)

    sysq_addr_from_core (1)0%

    sysq_addr_to_sysbus (8)2%

    chip_inq (6)50%

    "switch" (6)

    a_sw_outq (6)53%

    "L2AR" (6)

    a_sw_outq (6)53%

    "switch" (6)

    chip_inq (6)50%

    ...ysq_addr_from_sysbus (8)59%

    "Remote BIU:" (16)

    mcQ (12.5)4%

    mbus_a (12)4%

    "DRAM read latency" (43.2)

    mbus_d (12)6%

    "MC: MC to BIU" (4)

    ...: MC data to sysBus" (4)

    sysq_data_to_sysbus (6)1%

    cpu_to_l1dx (6)13%

    "switch" (6)

    l2dxq (6)24%

    "switch" (12)

    l2dxq (6)24%

    "switch" (6)

    cpu_to_l1dx (6)13%

    ...ysq_data_from_sysbus (6)1%

    "Local BIU: sysBus to L2" (4)

    sysq_data_to_core (1)0%

    "l2slice: ld_reload" (3)

    dbus (1)12%

    "Core: Data to ld/st" (1)

  • 4/28/03

    Approach: Queueing Models

    easy to write and modify

    fast (analytic solution)

    good for gross tradeoffs of cache structures, bus bandwidths, topologies

    can be accurate: 5 to 10% (with good workload data!)

    can support a wide range of modeling detail: can approach that of simulation models

    validation aid for simulation models

    fosters a "crisp" understanding of what is important in a systemmodel

  • 4/28/03

    Sysmodel: A Solver/Analyzer for Queueing Network Models

    no suitable vendor offerings for analytic modeling

    manually writing analytic model equations is cumbersome and error-prone: it is much better to have a tool to take a high level description and create the equations automatically

    with a carefully designed high level description language, it is possible to automatically generate a simulation model from the same description that is used for the analytic model

    a good infrastructure around an analyzer/simulator adds a great deal of leverage:

    automatic miss rate loading from workload generatorconcise specification of configurations to analyzeautomatic timing diagram generationsophisticated table and chart generation

  • 4/28/03

    Sysmodel Modeling Environment

    Model Descr

    Builder

    SW Pathlength/

    Missrate Files

    Workload Database

    Workload Generator

    LoadableModule

    Analyzer/ Simulator

    Results:Tables, Plots

    Debug/Docum. Aids:- Timing Diagrams

    - Algebraic Util. Eqns.- Topology Plots

    ConfigurationsFile

    ��

  • 4/28/03

    Sysmodel Characteristics

    natural representation of memory hierarchy models: textual description of timing diagrams (cache coherence transactions)

    built on top of (embedded in) an existing programming language (ie. C or java):

    exploit existing compiler, debugger, IDE, libraries, etc.less work than creating an entirely new language

    modularity:support for hierarchically interconnected blocksit should not be necessary to recompile all components for a minor changeto only one component

    "first-class" parameter support:all parameters registered in a simulator databaseenforces better model structurebetter documentationused in tables of results

    ��

  • 4/28/03

    Structural Primitives

    BlockDef(subblock1) {In(in5, in5_type);In(in6, in6_type);Out(out3, out3_type);Resource(q3, 1);Resource(d3, 1e9);Resource(d4, 1e9);

    }

    BlockDef(subblock2) { ...}

    DefBlock(topblock, STRUCTPARM n) {int i;In(in1, in1_type);Out1D(out1, out1_type, 2);Resource(q1, 1);Net1D(net1, out3_type, n);...InstBlock1D(subblock1, subblock1, n);InstBlock(subblock2, subblock2);for(i=0; i

  • 4/28/03

    Serengeti Model Structure

    L2AR

    addr [nbrds](6:1/8B/0D/1)

    L2DX

    data [nbrds](6:1/32B/0D/1)

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    L1AR

    addr [nchips](6:1/8B/0D/1)

    L1DX

    data [nchips](6:1/32B/0D/1)

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    nchips

    board

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    L1AR

    addr [nchips](6:1/8B/0D/1)

    L1DX

    data [nchips](6:1/32B/0D/1)

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    nchips

    board

    ...

    nbrds

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    L1AR

    addr [nchips](6:1/8B/0D/1)

    L1DX

    data [nchips](6:1/32B/0D/1)

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    BIU

    L2EC addr (4:1/3B/0D/1)

    EC data (4:1/32B/2D/1)EC SRAM

    core

    InfL1CPI(1.3)

    mcDRAM addr (900:75/3B/1D/1)

    DRAM data (900:75/64B/1D/1)chip

    nchips

    board

    nbrds

    nbrds

    900MHzd64k32b4wi32k32b4w

    u8m512b4sb2w

    ��

  • 4/28/03

    Queueing Network Routing

    L2Itvn?

    Start:InfL1CPI

    L1Miss?

    Done

    No:1-P(L1Miss)

    Yes: P(L1Miss)

    L2Miss?

    Done

    No:1-P(L2Miss) Delay:

    L2Hit

    No:1-P(L2Itvn)

    Yes: P(L2Miss)

    Yes: P(L2Itvn)

    Done

    Queue:AddrBus

    Queue:Memory

    Queue:DataBus

    Delay:RetData

    Done

    Queue:AddrBus

    Delay:L2Access

    Queue:DataBus

    Delay:RetData

    Routing() {Start(CPU, InfL1CPI);Switch(); {

    Case(1-P_L1Miss); {Done();

    } EndCase();Default(); {

    Switch(); {Case(1-P_L2Miss); { Delay(T_L2Hit);} EndCase();Default(); { Switch(); { Case(P_L2Itvn); { /* L2 Intervention */ Queue(AddrBus); Queue(Memory); Queue(DataBus); Delay(RetData); } EndCase(); Default(); { /* go to memory */ Queue(AddrBus); Queue(Memory); Queue(DataBus); Delay(RetData); } EndDefault(); } EndSwitch();} EndDefault();

    } EndSwitch();} EndDefault();

    } EndSwitch();}

    ��

  • 4/28/03

    Sketch of Serengeti Routing

    Fork

    L1Miss?

    Done

    No:1-P(L1Miss)

    L2Miss? Done

    No:1-P(L2Miss)

    Delay:L2Hit

    Yes: P(L2Miss)

    Done

    Queue:AddrBus

    Queue:Memory

    Queue:DataBus

    Delay:RetData

    Queue:AddrBus

    Delay:L2Access

    Delay:InfL1CPI

    Yes: P(L1Miss)

    Queue:AddrBus

    Delay:L2Access

    Mutex:Data

    Wait:Data

    Delay:L2Access

    Queue:DataBus

    Delay:RetData

    Mutex:Data

    ...

    Queue:DataBus

    Delay:RetData

    Mutex:Data

    Parent Thread

    Delay:Wait forResponse

    Delay:Wait forResponse

    ��

  • 4/28/03

    MVA Solvercompute rates and classes from transaction flow graph:

    assign visit rates to each node in graph

    identify customer classes at each queueing center

    find subgraphs for each Send/Receive and Mutex/WaitMutex construct

    apply "classical" mean value analysis for product-form queueingnetworks, plus these approximations:

    Schweitzer's approximation to break recursion

    Reiser's approximation for deterministic service

    Heidelberger & Trivedi approximation for asynchronous background traffic

    a simple heuristic for Fork/Join (Send/Receive) behavior

    ��

  • 4/28/03

    Workload Characterization:Miss Rate Analysis

    ��

  • 4/28/03

    How Hard Can This Be?

    HP L.A.

    Tek L.A.

    Bus/InstrTraces

    cache sim

    per-instrmiss rates

    SW InstrTracer

    Simics

    ��

  • 4/28/03

    Workload Characterization:Miss Rate Analysis

    estimation of miss rates for cache configurations that we cannot directly simulate(ie. more processors than available in traces)

    includes additional manipulation of cache simulator output to:account for miss rate increases due to increased multiprogramming level ("MPL-effect")quickly estimate miss rates for cache configurations that haven't been simulated (but could be if we had the time)

    multidimensional curve fitting with dimensions of:cache sizeline sizeassociativitysector sizesharing# threads

    automatic fitting tools with seamless interface to sysmodel

    L1/TLBCache Sim

    L2/L3Cache Sim

    "Raw"Miss Rates

    "Cooked"Miss Rates Extrapolate

    Miss RateAnalysis

    ��

  • 4/28/03

    How Hard Can This Be?HP L.A.

    Tek L.A.

    RawTraces

    Refine/filter

    validate

    cache sim

    warminganalysis

    surface fit

    miss rate query

    Traces instd. fmt.

    cpustat busstat

    miss countsover time

    per-instrmiss rates

    full miss ratesurface

    .mr files plot

    plot

    manualinspection

    otherreference

    data

    validationreport

    - remove bad records- ecc- response matching

    lane,mosher,bj

    msp qmr

    bj

    mpcssumo

    bj

    ilya1 ilya2 qmr

    idle time adjustment

    ��

  • 4/28/03

    Miss Rate Surface FittingHastie and Tibshirani "Generalized Additive Models": Curve Fitting on Steroids

    10 miss rate components to model (clean, cache-to-cache, writeback, each per load/store/i-fetch)

    separate multivariate model per rate

    get sparse set of (mri, ci): mri is rate observed for cache configurationci = (size, line, subline, nthreads, sharing, assoc)

    fit an additive model with low order interactions:

    f(x) = Σp=1,P fp(xp) + Σp

  • 0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    1 2 4 8 16 32 64

    LD C

    lean

    Mis

    s R

    ate

    (%/in

    str)

    Sharing

    LD Clean Miss Rate vs. Sharing(Associativity=2, LineSize=64, SubLine=64, Protocol=mosi, Detail=lru_wt)

    siz=1M,thrd=1,src=simsiz=2M,thrd=1,src=simsiz=1M,thrd=4,src=simsiz=2M,thrd=4,src=simsiz=1M,thrd=8,src=simsiz=2M,thrd=8,src=sim

    siz=1M,thrd=1,src=fitsiz=2M,thrd=1,src=fitsiz=8M,thrd=1,src=fitsiz=1M,thrd=2,src=fitsiz=2M,thrd=2,src=fitsiz=8M,thrd=2,src=fitsiz=1M,thrd=4,src=fitsiz=2M,thrd=4,src=fitsiz=8M,thrd=4,src=fitsiz=1M,thrd=8,src=fitsiz=2M,thrd=8,src=fitsiz=8M,thrd=8,src=fit

    siz=1M,thrd=64,src=fitsiz=2M,thrd=64,src=fitsiz=8M,thrd=64,src=fit

    0

    0.05

    0.1

    0.15

    0.2

    0.25

    0.3

    0.35

    0.4

    0.45

    0.5

    0.55

    1 2 4 8 16 32 64

    LD W

    riteb

    ack

    Rat

    e (%

    /inst

    r)

    Sharing

    LD Writeback Rate vs. Sharing(Associativity=2, LineSize=64, SubLine=64, Protocol=mosi, Detail=lru_wt)

    siz=1M,thrd=1,src=simsiz=2M,thrd=1,src=simsiz=1M,thrd=4,src=simsiz=2M,thrd=4,src=simsiz=1M,thrd=8,src=simsiz=2M,thrd=8,src=sim

    siz=1M,thrd=1,src=fitsiz=2M,thrd=1,src=fitsiz=8M,thrd=1,src=fitsiz=1M,thrd=2,src=fitsiz=2M,thrd=2,src=fitsiz=8M,thrd=2,src=fitsiz=1M,thrd=4,src=fitsiz=2M,thrd=4,src=fitsiz=8M,thrd=4,src=fitsiz=1M,thrd=8,src=fitsiz=2M,thrd=8,src=fitsiz=8M,thrd=8,src=fit

    siz=1M,thrd=64,src=fitsiz=2M,thrd=64,src=fitsiz=8M,thrd=64,src=fit

    0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    1 2 4 8 16 32 64

    LD M

    iss

    Rat

    e (%

    /inst

    r)

    Sharing

    LD Miss Rate vs. Sharing(Associativity=2, LineSize=64, SubLine=64, Protocol=mosi, Detail=lru_wt)

    siz=1M,thrd=1,src=simsiz=2M,thrd=1,src=simsiz=1M,thrd=4,src=simsiz=2M,thrd=4,src=simsiz=1M,thrd=8,src=simsiz=2M,thrd=8,src=sim

    siz=1M,thrd=1,src=fitsiz=2M,thrd=1,src=fitsiz=8M,thrd=1,src=fitsiz=1M,thrd=2,src=fitsiz=2M,thrd=2,src=fitsiz=8M,thrd=2,src=fitsiz=1M,thrd=4,src=fitsiz=2M,thrd=4,src=fitsiz=8M,thrd=4,src=fitsiz=1M,thrd=8,src=fitsiz=2M,thrd=8,src=fitsiz=8M,thrd=8,src=fit

    siz=1M,thrd=64,src=fitsiz=2M,thrd=64,src=fitsiz=8M,thrd=64,src=fit

    0

    0.02

    0.04

    0.06

    0.08

    0.1

    0.12

    0.14

    0.16

    0.18

    1 2 4 8 16 32 64

    LD M

    odifi

    ed C

    -to-

    C T

    rans

    fer

    Rat

    e (%

    /inst

    r)

    Sharing

    LD Modified C-to-C Transfer Rate vs. Sharing(Associativity=2, LineSize=64, SubLine=64, Protocol=mosi, Detail=lru_wt)

    siz=1M,thrd=1,src=simsiz=2M,thrd=1,src=simsiz=1M,thrd=4,src=simsiz=2M,thrd=4,src=simsiz=1M,thrd=8,src=simsiz=2M,thrd=8,src=sim

    siz=1M,thrd=1,src=fitsiz=2M,thrd=1,src=fitsiz=8M,thrd=1,src=fitsiz=1M,thrd=2,src=fitsiz=2M,thrd=2,src=fitsiz=8M,thrd=2,src=fitsiz=1M,thrd=4,src=fitsiz=2M,thrd=4,src=fitsiz=8M,thrd=4,src=fitsiz=1M,thrd=8,src=fitsiz=2M,thrd=8,src=fitsiz=8M,thrd=8,src=fit

    siz=1M,thrd=64,src=fitsiz=2M,thrd=64,src=fitsiz=8M,thrd=64,src=fit

  • 0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    1 2 4 8

    Cle

    an M

    iss

    Rat

    e (%

    /inst

    r)

    Size (MB)

    Clean Miss Rate vs. Size (MB)(Associativity=2, LineSize=64, SubLine=64, Sharing=1, Protocol=mosi, Detail=lru_wt)

    src=sim,thrd=1src=fit,thrd=1src=fit,thrd=2

    src=sim,thrd=4src=fit,thrd=4

    src=sim,thrd=8src=fit,thrd=8

    src=sim,thrd=16src=fit,thrd=16src=fit,thrd=32src=fit,thrd=64

    src=fit,thrd=128src=fit,thrd=256

    0.05

    0.1

    0.15

    0.2

    0.25

    0.3

    1 2 4 8

    ST

    Upg

    rade

    Rat

    e (%

    /inst

    r)

    Size (MB)

    ST Upgrade Rate vs. Size (MB)(Associativity=2, LineSize=64, SubLine=64, Sharing=1, Protocol=mosi, Detail=lru_wt)

    src=sim,thrd=1src=fit,thrd=1src=fit,thrd=2

    src=sim,thrd=4src=fit,thrd=4

    src=sim,thrd=8src=fit,thrd=8

    src=sim,thrd=16src=fit,thrd=16src=fit,thrd=32src=fit,thrd=64

    src=fit,thrd=128src=fit,thrd=256

    0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    1 2 4 8

    Tot

    al M

    iss

    (w/o

    Upg

    rade

    s) R

    ate

    (%/in

    str)

    Size (MB)

    Total Miss (w/o Upgrades) Rate vs. Size (MB)(Associativity=2, LineSize=64, SubLine=64, Sharing=1, Protocol=mosi, Detail=lru_wt)

    src=sim,thrd=1src=fit,thrd=1src=fit,thrd=2

    src=sim,thrd=4src=fit,thrd=4

    src=sim,thrd=8src=fit,thrd=8

    src=sim,thrd=16src=fit,thrd=16src=fit,thrd=32src=fit,thrd=64

    src=fit,thrd=128src=fit,thrd=256

    0

    0.05

    0.1

    0.15

    0.2

    0.25

    0.3

    1 2 4 8

    Mod

    ified

    C-t

    o-C

    Tra

    nsfe

    r R

    ate

    (%/in

    str)

    Size (MB)

    Modified C-to-C Transfer Rate vs. Size (MB)(Associativity=2, LineSize=64, SubLine=64, Sharing=1, Protocol=mosi, Detail=lru_wt)

    src=sim,thrd=1src=fit,thrd=1src=fit,thrd=2

    src=sim,thrd=4src=fit,thrd=4

    src=sim,thrd=8src=fit,thrd=8

    src=sim,thrd=16src=fit,thrd=16src=fit,thrd=32src=fit,thrd=64

    src=fit,thrd=128src=fit,thrd=256

  • 0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    1 2 4 8 16 32 64 128 256

    Cle

    an M

    iss

    Rat

    e (%

    /inst

    r)

    Threads

    Clean Miss Rate vs. Threads(Associativity=2, LineSize=64, SubLine=64, Sharing=1, Protocol=mosi, Detail=lru_wt)

    src=sim,siz=1Msrc=fit,siz=1M

    src=sim,siz=2Msrc=fit,siz=2M

    src=sim,siz=4Msrc=fit,siz=4Msrc=fit,siz=8M

    0.05

    0.1

    0.15

    0.2

    0.25

    0.3

    1 2 4 8 16 32 64 128 256

    ST

    Upg

    rade

    Rat

    e (%

    /inst

    r)

    Threads

    ST Upgrade Rate vs. Threads(Associativity=2, LineSize=64, SubLine=64, Sharing=1, Protocol=mosi, Detail=lru_wt)

    src=sim,siz=1Msrc=fit,siz=1M

    src=sim,siz=2Msrc=fit,siz=2M

    src=sim,siz=4Msrc=fit,siz=4Msrc=fit,siz=8M

    0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    1 2 4 8 16 32 64 128 256

    Tot

    al M

    iss

    (w/o

    Upg

    rade

    s) R

    ate

    (%/in

    str)

    Threads

    Total Miss (w/o Upgrades) Rate vs. Threads(Associativity=2, LineSize=64, SubLine=64, Sharing=1, Protocol=mosi, Detail=lru_wt)

    src=sim,siz=1Msrc=fit,siz=1M

    src=sim,siz=2Msrc=fit,siz=2M

    src=sim,siz=4Msrc=fit,siz=4Msrc=fit,siz=8M

    0

    0.05

    0.1

    0.15

    0.2

    0.25

    0.3

    1 2 4 8 16 32 64 128 256

    Mod

    ified

    C-t

    o-C

    Tra

    nsfe

    r R

    ate

    (%/in

    str)

    Threads

    Modified C-to-C Transfer Rate vs. Threads(Associativity=2, LineSize=64, SubLine=64, Sharing=1, Protocol=mosi, Detail=lru_wt)

    src=sim,siz=1Msrc=fit,siz=1M

    src=sim,siz=2Msrc=fit,siz=2M

    src=sim,siz=4Msrc=fit,siz=4Msrc=fit,siz=8M

  • 4/28/03

    Workload Characterization:Processor Abstraction

    ��

  • Overview

    How to model memory level parallelism?

    Processor with no memory level parallelism:

    Cache miss 1: Processorblocks waiting for data

    Data available

    Processor with memory parallelism:

    Cache miss 1 Cache miss 2 Processor blocks:out of resources

    Cache miss 2

  • Model

    Simple processor model (h stands for hit rate, p for penalty, and n for number of cache levels):

    Model for processor with parallelism (f stands for blocking factor):

    CPI=CPI ���i =0

    n�1

    hi pi

    CPI=CPI���i =0

    n�1

    hi f i pi

  • Questions

    Does blocking factor depend on

    latency?

    cache level?

    access type (data/instruction load/store)?

    miss rates?

  • Results

    0

    0.05

    0.1

    0.15

    0.2

    0.25

    0.3

    0.35

    0.4

    0.45

    0.5

    0.55

    Blocking factor for 1 thread CPU

    Test case

    Blo

    ckin

    g fa

    ctor

  • Results

    Average block factors

    1 thread CPU: 0.42

    2 thread CPU: 0.45

    Infinite L1 CPI:

    1 thread CPU: 1.47 (1.485 measured)

    2 thread CPU: 1.93 (1.94 measrured)

    Average error in CPI:

    1 thread CPU: 6.27%

    2 thread CPU: 6.05%

  • Results

    -17

    -16

    -15

    -14

    -13

    -12

    -10

    -9 -8 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 150

    0.5

    1

    1.5

    2

    2.5

    3

    3.5

    4

    4.5

    5

    5.5

    6

    Blocking factor error histogram 2 threads

    Percent error

    Test

    cou

    nt

  • 4/28/03

    Observationsgood news for probabilistic modeling of out-of-order processors:

    blocking factor is relatively insensitive to broad variations in miss penalties and cache sizes

    as latency goes to infinity, blocking factor does NOT go to 1

    additional work shows that making blocking factors a (weak)function of latency reduces errors substantially

    ongoing work to understand this further

    ��

  • 4/28/03

    Validation:Model vs. Measurement

    ��

  • 4/28/03

    Serengeti Model vs. Measurementall configurations:

    900MHz Cheetah+ processor, 225MHz L2, 150MHz centerplaneu8m64b1w L2Large pages enabled

    24w Serengeti configurations:Oracle 902, Solaris 5.9 Build 58, log_parallelism = 2data from PAE memory scaling runs

    8w Serengeti configuration:Oracle 902 process model, s9

    Name # Procs tpmC(K)SGA(GB) CPI

    Pathlength(Minstr) IO's/Trans

    8w 8 43.5 14 5.033 0.8821 9.2424w_48 24 102 48 5.720 0.9847 9.1124w_96 24 130 96 5.285 0.8368 5.9124w_180 24 142 180 4.996 0.8113 4.68

    ��

  • 4/28/03

    4.72

    3 (-

    6.3%

    )

    5.25

    8 (-

    8.0%

    )

    4.85

    (-8

    .2%

    )

    4.63

    1 (-

    7.3%

    )

    5.04

    2 5.7

    2

    5.28

    5

    4.99

    5

    8w 24w48gb 24w96gb 24w180gb0

    1

    2

    3

    4

    5

    6

    7

    CP

    I

    Measured Model

    TpmC CPI: Model vs. Measured

    ��

  • 4/28/03

    Observations

    good correlation (within 8%)

    highest utilization is about 50% (system address busses),so system experiences low queueing congestion

    why is model optimistic?are miss rates from counters correct?are static latencies correct?bursty memory accesses?kernel cage effects??

    ��

  • 4/28/03

    Conclusion

    ��

  • 4/28/03

    ConclusionGoal: timely, accurate performance projections for Sun systems

    changing workloadspaucity of tracesneed for early design space explorationmany, many configurations to analyze

    hierarchical models are effectivedetailed timers for processorsqueueing network models for systempoint models for miss rate analysis

    modeling accuracy and breadth of design space determined byworkload characterization methodology:

    bus traces and instruction tracesthorough validationautomatic surface fitting

    ��

  • 4/28/03

    Future Directions

    extend techniques to clustered, multi-tiered systems

    extend techniques to large floating-point computation architectures

    refine processor abstractions

    ��

  • 4/28/03

    Backup Slides

    ��

  • 4/28/03

    Tpcc Bus Trace Validation Resultsu512k64b1w MOESI

    3.04

    2.94

    2.85

    2.76 2.

    873.1

    3

    3.04

    2.94

    2.85 2.

    963.02

    2.96

    2.83

    2.78

    2.7

    3.02

    2.96

    2.83

    2.78

    2.7

    1w 4w 8w 16w 8ws150

    0.5

    1

    1.5

    2

    2.5

    3

    3.5

    %/in

    str

    Trace

    Resim

    cpustat

    busstat

    Total Miss Rate

    0.75

    0.75

    0.73

    0.69

    6

    0.730.75

    0.74

    7

    0.73

    3

    0.69

    9

    0.73

    2

    0.75 0.

    804

    0.77

    7

    0.73

    3

    0.71

    3

    0.75

    1

    0.75

    0.73

    1

    0.69

    8

    0.65

    8

    1w 4w 8w 16w 8ws150

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    %/in

    str

    Trace

    Resim

    cpustat

    busstat

    Writeback Rate0

    0.18

    6

    0.20

    1

    0.22

    6

    0.20

    1

    0.02

    0.18

    8

    0.20

    3

    0.22

    9

    0.20

    3

    1w 4w 8w 16w 8ws150

    0.05

    0.1

    0.15

    0.2

    0.25

    %/in

    str

    Trace

    Resim

    cpustat

    busstat

    Cache-to-Cache Transfer Rate

    0

    0.04

    8

    0.05

    9 0.0

    7 3

    0.04

    5

    0

    0.04

    8

    0.06

    0.07

    38

    0.04

    5

    1w 4w 8w 16w 8ws150

    0.01

    0.02

    0.03

    0.04

    0.05

    0.06

    0.07

    0.08

    %/in

    str

    Trace

    Resim

    cpustat

    busstat

    Upgrade Rate

    ��

  • 4/28/03

    Sketch of Solver Algorithm/* prepare graph */

    - assign visit rates to each node

    - identify customer classes at each queueing center

    - find subgraphs for each Send/Receive and

    Mutex/WaitMutex construct

    /* solve the network for response time R (= cpi) */

    Rnew = 10; /* a guess */

    repeat {Rold = Rnewfor (each queue i) {

    Ri = Si + Si*Qi(N-1)Qi = N/Rold * Vi * Ri /* Little's Law: L = λ*W */

    }- traverse routing graph and compute Rnew

    } until (|Rnew - Rold| < δ)

    /* traversing routing graph to compute Rnew: */

    - using depth-first search, compute Rthread for each thread as follows:

    - at visit nodes (to queue i):Rthread += Vi * Ri

    - at Receive(min) nodes:Rthread = max(Rthread, min{Rsend,j} )

    - at Receive(max) nodes:Rthread = max(Rthread, max{Rsend,j} )

    - at WaitMutex nodes:Rthread = max(Rthread, Σj Vmutex,j*Rmutex,j )

    - Rnew is Rthread for root thread

    ��

  • 4/28/03

    Basic MVA/* single class */

    Rnew = 10; /* a guess */repeat {

    Rold = Rnewfor (each queue i) {

    Ri = Si + Si*Qi(N-1)Qi = N/Rold * Vi * Ri /* Little's Law: L = λ*W */

    }Rnew = Σi Ri

    } until (|Rnew - Rold| < δ)

    Qi(N-1) here means Qi in same network, but with 1

    fewer customer.

    /* Schweitzer's approx. to break recursion */

    Assume Qi(N-1)/(N-1) O Qi(N)/N

    Hence: Qi(N-1) O (N-1)/N * Qi(N) K (N-1)/N * Qi

    Ref: P. Schweitzer, "Approximate Analysis of Multiclass Closed Networks of Queues", JACM, 1981.

    /* multi-class */

    Rnew = 10; /* a guess */repeat {

    Rold = Rnewfor (each queue i) {

    for (each class c) {

    Ri,c = Si,c + ΣkJc Si,k*Qi,k(N-1c)Qi,c = Nc/Rold * Vi,c * Ri,c /* L = λ*W */

    }}

    Rnew = Σi Σc Ri,c} until (|Rnew - Rold| < δ)

    /* multiclass approx. to break recursion */

    Qi,k(N-1c) O (Nc-1)/Nc * Qi,c(N) + ΣkJc Qi,k(N)

    ��

  • 4/28/03

    Additional MVA Approximations/* Deterministic Service */

    instead of:

    Ri = Si + (N-1)/N * Si*Qi

    use:

    Ri = Si + (N-1)/N * [(Qi - Bi)*Si + Bi * Si /2]

    where Bi is the probab

    ility that an arriving customer

    finds the server busy, approximated by:

    Bi = (Ui - Ui /N)/(1 - Ui /N)

    Ref: M. Reiser, "A Queueing Network Analysis of Computer Communication Networks with Window Flow

    Control", IEEE Trans. Comm., 1979.

    /* Asynchronous Background Traffic */

    - same as non-background traffic, except do not includeRi,c for background classes in summation for Rnew

    Ref: P. Heidelberger and K. S. Trivedi, "Queueing Network Models for Parallel Processing with Asynchronous

    Tasks", IEEE Trans. on Computers, 1982.

    ��

  • 4/28/03

    Additional MVA Approximations/* Send/Receive (Fork/Join) */

    Assume E(min(A,B)) = min(E(A), E(B))

    or E(max(A,B)) = max(E(A), E(B))

    - at Receive(min) nodes:Rthread = max(Rthread, min{Rsend,j} )

    - at Receive(max) nodes:Rthread = max(Rthread, max{Rsend,j} )

    Receive(max):Local Response

    Fork Fork

    Delay:LocResp

    Send:Local Response

    Delay:LocResp

    Send:Local Response

    Why do we expect this to be a reasonable approximation?

    in multiprocessor models, service distributions are usually deterministic, and utilizations are often not extreme

    The starcat model (TBD) will be the first real test of this approximation.

    Simulation feature of sysmodel will be used for validation.

    ��