
  • Multiprocessors

    (based on D. Patterson’s lectures and Hennessy/Patterson’s book)

  • 2 Parallel Computers

    Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”

    Almasi and Gottlieb, Highly Parallel Computing, 1989

    Questions about parallel computers:
    • How large a collection?
    • How powerful are the processing elements?
    • How do they cooperate and communicate?
    • How are data transmitted?
    • What type of interconnection?
    • What are the HW and SW primitives for the programmer?
    • Does it translate into performance?

  • 3 Parallel Processors “Religion”

    The dream of computer architects since the 1950s: replicate processors to add performance vs. design a faster processor. This led to innovative organizations tied to particular programming models, since “uniprocessors can’t keep going”

    • e.g., uniprocessors must stop getting faster due to the limit of the speed of light: 1972, … , 1989

    • Borders on religious fervor: you must believe!
    • Fervor damped somewhat when 1990s companies went out of business:

    Thinking Machines, Kendall Square, ...

    The argument now is the “pull” of the opportunity of scalable performance, not the “push” of a uniprocessor performance plateau?

  • 4 What Level of Parallelism?

    Bit-level parallelism: 1970 to ~1985

    • 4-bit, 8-bit, 16-bit, 32-bit microprocessors

    Instruction-level parallelism (ILP): ~1985 through today

    • Pipelining
    • Superscalar
    • VLIW
    • Out-of-order execution
    • Limits to benefits of ILP?

    Process-level or thread-level parallelism; mainstream for general-purpose computing?

    • Servers are parallel
    • High-end desktop dual-processor PC soon?? (or just sell the socket?)

  • 5 Why Multiprocessors?

    Microprocessors are the fastest CPUs

    • Collecting several is much easier than redesigning one

    Complexity of current microprocessors
    • Do we have enough ideas to sustain 1.5X/yr?
    • Can we deliver such complexity on schedule?

    Slow (but steady) improvement in parallel software (scientific apps, databases, OS)

    Emergence of embedded and server markets driving microprocessors in addition to desktops
    • Embedded functional parallelism, producer/consumer model
    • Server figure of merit is tasks per hour vs. latency

  • 6 Parallel Processing Intro

    Long-term goal of the field: scale the number of processors to the size of the budget and the desired performance

    Machines today: Sun Enterprise 10000 (8/00)

    • 64 400 MHz UltraSPARC® II CPUs, 64 GB SDRAM memory, 868 18GB disks, tape
    • $4,720,800 total
    • 64 CPUs 15%, 64 GB DRAM 11%, disks 55%, cabinet 16% ($10,800 per processor or ~0.2% per processor)
    • Minimal E10K - 1 CPU, 1 GB DRAM, 0 disks, tape ~$286,700
    • $10,800 (4%) per CPU, plus $39,600 board/4 CPUs (~8%/CPU)

    Machines today: Dell Workstation 220 (2/01)
    • 866 MHz Intel Pentium® III (in minitower)
    • 0.125 GB RDRAM memory, 1 10GB disk, 12X CD, 17” monitor, nVIDIA GeForce2 GTS 32MB DDR graphics card, 1 yr service
    • $1,600; for an extra processor, add $350 (~20%)

  • 7 Whither Supercomputing?

    Linpack (dense linear algebra) for vector supercomputers vs. microprocessors: “Attack of the Killer Micros”

    • (see Chapter 1, Figure 1-10, page 22 of [CSG99])
    • 100 x 100 vs. 1000 x 1000

    MPPs vs. supercomputers when Linpack is rewritten to get peak performance

    • (see Chapter 1, Figure 1-11, page 24 of [CSG99])

    1997, 500 fastest machines in the world: 319 MPPs, 73 bus-based shared memory (SMP), 106 parallel vector processors (PVP)

    • (see Chapter 1, Figure 1-12, page 24 of [CSG99])

    2000, 381 of 500 fastest: 144 IBM SP (~cluster), 121 Sun (bus SMP), 62 SGI (NUMA SMP), 54 Cray (NUMA SMP)

    [CSG99] = Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler, Jaswinder Pal Singh, with Anoop Gupta. San Francisco: Morgan Kaufmann, 1999.

  • 8 Popular Flynn Categories (e.g., ~RAID levels for MPPs)

    SISD (Single Instruction Single Data)
    • Uniprocessors

    MISD (Multiple Instruction Single Data)
    • ???; multiple processors on a single data stream

    SIMD (Single Instruction Multiple Data)
    • Examples: Illiac-IV, CM-2
    • Simple programming model
    • Low overhead
    • Flexibility
    • All custom integrated circuits
    • (Phrase reused by Intel marketing for media instructions ~ vector)

    MIMD (Multiple Instruction Multiple Data)
    • Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
    • Flexible
    • Use off-the-shelf micros

    MIMD is the current winner: concentrate on it as the major design emphasis

  • 9 Major MIMD Styles

    Centralized shared memory ("Uniform Memory Access" time or "Shared Memory Processor")

    Decentralized memory (memory module with CPU)
    • Get more memory bandwidth, lower memory latency
    • Drawback: longer communication latency
    • Drawback: software model more complex

  • 10 Decentralized Memory Versions

    Shared memory with "Non-Uniform Memory Access" time (NUMA)

    Message-passing "multicomputer" with a separate address space per processor
    • Can invoke software with Remote Procedure Call (RPC)
    • Often via a library, such as MPI: Message Passing Interface
    • Also called "synchronous communication" since communication causes synchronization between the 2 processes

  • 11 Performance Metrics: Latency and Bandwidth

    Bandwidth
    • Need high bandwidth in communication
    • Match limits in network, memory, and processor
    • Challenge is the link speed of the network interface vs. the bisection bandwidth of the network

    Latency
    • Affects performance, since the processor may have to wait
    • Affects ease of programming, since it requires more thought to overlap communication and computation
    • Overhead to communicate is a problem in many machines

    Latency hiding
    • How can a mechanism help hide latency?
    • Increases programming system burden
    • Examples: overlap message send with computation, prefetch data, switch to other tasks

  • 12 Parallel Architecture

    Parallel architecture extends traditional computer architecture with a communication architecture

    • abstractions (HW/SW interface)
    • organizational structure to realize the abstraction efficiently

  • 13 Parallel Framework

    Layers:
    • (see Chapter 1, Figure 1-13, page 27 of [CSG99])
    • Programming Model:
      • Multiprogramming: lots of jobs, no communication
      • Shared address space: communicate via memory
      • Message passing: send and receive messages
      • Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
    • Communication Abstraction:
      • Shared address space: e.g., load, store, atomic swap
      • Message passing: e.g., send, receive library calls
      • Debate over this topic (ease of programming, scaling) => many hardware designs 1:1 with a programming model

  • 14Shared Address Model Summary

    Each processor can name every physical location in the machine

    Each process can name all data it shares with other processes

    Data transfer via load and store

    Data size: byte, word, ... or cache blocks

    Uses virtual memory to map virtual addresses to local or remote physical addresses

    Memory hierarchy model applies: communication now moves data to the local processor cache (just as a load moves data from memory to cache)

    • Latency, BW, scalability when communicating?

  • 15 Shared Address/Memory Multiprocessor Model

    Communicate via load and store
    • Oldest and most popular model

    Based on timesharing: processes on multiple processors vs. sharing a single processor

    Process: a virtual address space and ~1 thread of control
    • Multiple processes can overlap (share), but ALL threads share a process's address space

    Writes to the shared address space by one thread are visible to reads by other threads

    • Usual model: share code, private stack, some shared heap, some private heap (see the sketch below)
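
    A minimal sketch of this model (not from the slides; names are illustrative): two POSIX threads share one address space and communicate through an ordinary variable, with a mutex ordering the accesses.

      /* compile with: cc -pthread shared.c */
      #include <pthread.h>
      #include <stdio.h>

      static int shared_counter = 0;                 /* shared data, visible to both threads */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      static void *worker(void *arg)
      {
          for (int i = 0; i < 100000; i++) {
              pthread_mutex_lock(&lock);             /* synchronize the shared access */
              shared_counter++;                      /* communicate via ordinary load/store */
              pthread_mutex_unlock(&lock);
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t t1, t2;
          pthread_create(&t1, NULL, worker, NULL);
          pthread_create(&t2, NULL, worker, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          printf("counter = %d\n", shared_counter);  /* writes of both threads are visible here */
          return 0;
      }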

  • 16 SMP Interconnect

    Processors to memory AND to I/O

    Bus based: all memory locations have equal access time, so SMP = “Symmetric MP”

    • Sharing limits BW as processors and I/O are added
    • (see Chapter 1, Figs 1-17, pages 32-33 of [CSG99])

  • 17 Message Passing Model

    Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations

    • Essentially NUMA but integrated at the I/O devices vs. the memory system

    Send specifies local buffer + receiving process on the remote computer
    Receive specifies sending process on the remote computer + local buffer to place the data

    • Usually send includes a process tag and receive has a rule on the tag: match 1, match any
    • Synch: when send completes, when buffer free, when request accepted, receive waits for send

    Send+receive => memory-memory copy, where each side supplies its local address, AND does pairwise synchronization! (A minimal MPI sketch follows.)
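
    A minimal MPI sketch of this model (illustrative, not from the slides): rank 0 sends a local buffer to a receiving process, and the receive names the sending process plus a matching tag.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank, data = 42;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 0) {
              /* send: local buffer + receiving process (rank 1) + tag 0 */
              MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
          } else if (rank == 1) {
              /* receive: sending process (rank 0) + tag rule + local buffer */
              MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              printf("rank 1 received %d\n", data);
          }
          MPI_Finalize();
          return 0;
      }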

  • 18 Data Parallel Model

    Operations can be performed in parallel on each element of a large regular data structure, such as an array

    1 control processor broadcasts to many PEs (see Ch. 1, Fig. 1-25, page 45 of [CSG99])

    • When computers were large, could amortize the control portion over many replicated PEs

    Condition flag per PE so that PEs can be skipped

    Data distributed in each memory

    Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE

    Data parallel programming languages lay out data to processors

  • 19 Data Parallel Model

    Vector processors have similar ISAs, but no data placement restriction

    SIMD led to data parallel programming languages

    Advancing VLSI led to single-chip FPUs and whole fast µProcs (SIMD less attractive)

    The SIMD programming model led to the Single Program Multiple Data (SPMD) model

    • All processors execute an identical program

    Data parallel programming languages are still useful; do communication all at once: “bulk synchronous” phases in which all processors communicate after a global barrier (see the SPMD sketch below)
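
    A minimal SPMD / bulk-synchronous sketch (illustrative, not from the slides): every process runs this identical program on its own slice of the data, then all processes exchange results at a single global communication step.

      #include <mpi.h>

      #define N 1024

      int main(int argc, char **argv)
      {
          int rank, nprocs;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

          /* local computation phase on this process's portion of the data */
          double local_sum = 0.0;
          for (int i = rank; i < N; i += nprocs)
              local_sum += (double)i;

          /* global "bulk synchronous" phase: all contribute, all receive the result */
          double global_sum = 0.0;
          MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

          MPI_Finalize();
          return 0;
      }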

  • 20 Advantages of the shared-memory communication model

    Compatibility with SMP hardware

    Ease of programming when communication patterns are complex or vary dynamically during execution

    Ability to develop apps using the familiar SMP model, with attention only on performance-critical accesses

    Lower communication overhead, better use of BW for small items, due to implicit communication and memory mapping to implement protection in hardware, rather than through the I/O system

    HW-controlled caching to reduce remote communication by caching of all data, both shared and private

  • 21 Advantages of the message-passing communication model

    The hardware can be simpler (esp. vs. NUMA)

    Communication is explicit => simpler to understand; in shared memory it can be hard to know when you are communicating and when not, and how costly it is

    Explicit communication focuses attention on the costly aspect of parallel computation, sometimes leading to improved structure in a multiprocessor program

    Synchronization is naturally associated with sending messages, reducing the possibility for errors introduced by incorrect synchronization

    Easier to use sender-initiated communication, which may have some advantages in performance

  • 22 Communication Models

    Shared memory
    • Processors communicate with a shared address space
    • Easy on small-scale machines
    • Advantages:
      • Model of choice for uniprocessors, small-scale MPs
      • Ease of programming
      • Lower latency
      • Easier to use hardware-controlled caching

    Message passing
    • Processors have private memories, communicate via messages
    • Advantages:
      • Less hardware, easier to design
      • Focuses attention on costly non-local operations

    Can support either SW model on either HW base

  • 23 Amdahl’s Law and Parallel Computers

    Amdahl’s Law (FracX: original fraction to be sped up):
    Speedup = 1 / [FracX/SpeedupX + (1 - FracX)]

    A portion is sequential => limits parallel speedup
    • Speedup
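
    The formula as a small C helper (illustrative, not from the slides): FracX is the fraction that can be sped up, SpeedupX the speedup applied to it.

      #include <stdio.h>

      static double amdahl_speedup(double frac_x, double speedup_x)
      {
          return 1.0 / (frac_x / speedup_x + (1.0 - frac_x));
      }

      int main(void)
      {
          /* 90% of the work spread perfectly over 100 processors still
             gives less than 10x overall speedup */
          printf("%.2f\n", amdahl_speedup(0.90, 100.0));   /* ~9.17 */
          return 0;
      }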

  • 24 Small-Scale Shared Memory

    Caches serve to:
    • Increase bandwidth versus bus/memory
    • Reduce latency of access
    • Valuable for both private data and shared data

    What about cache consistency?

    Time | Event                 | $ A | $ B | X (memory)
    0    |                       |     |     | 1
    1    | CPU A reads X         | 1   |     | 1
    2    | CPU B reads X         | 1   | 1   | 1
    3    | CPU A stores 0 into X | 0   | 1   | 0

  • 25 What Does Coherency Mean?

    Informally:
    • “Any read must return the most recent write”
    • Too strict and too difficult to implement

    Better:
    • “Any write must eventually be seen by a read”
    • All writes are seen in proper order (“serialization”)

    Two rules to ensure this:
    • “If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart”
    • Writes to a single location are serialized: seen in one order
      • Latest write will be seen
      • Otherwise could see writes in illogical order (could see an older value after a newer value)

  • 26 Potential HW Coherency Solutions

    Snooping solution (snoopy bus):
    • Send all requests for data to all processors
    • Processors snoop to see if they have a copy and respond accordingly
    • Requires broadcast, since caching information is at the processors
    • Works well with a bus (natural broadcast medium)
    • Dominates for small-scale machines (most of the market)

    Directory-based schemes (discussed later):
    • Keep track of what is being shared in 1 centralized place (logically)
    • Distributed memory => distributed directory for scalability (avoids bottlenecks)
    • Send point-to-point requests to processors via network
    • Scales better than snooping
    • Actually existed BEFORE snooping-based schemes

  • 27 Basic Snoopy Protocols

    Write invalidate protocol:
    • Multiple readers, single writer
    • Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
    • Read miss:
      • Write-through: memory is always up-to-date
      • Write-back: snoop in caches to find the most recent copy

    Write broadcast protocol (typically write-through):
    • Write to shared data: broadcast on bus, processors snoop and update any copies
    • Read miss: memory is always up-to-date

    Write serialization: the bus serializes requests!
    • Bus is a single point of arbitration

  • 28 Basic Snoopy Protocols

    Write invalidate versus broadcast:
    • Invalidate requires one transaction per write-run
    • Invalidate uses spatial locality: one transaction per block
    • Broadcast has lower latency between write and read

  • 29 Snooping Cache Variations

    Basic protocol: Exclusive, Shared, Invalid

    Berkeley protocol: Owned Exclusive, Owned Shared, Shared, Invalid
    • Owner can update via bus invalidate operation
    • Owner must write back when replaced in cache

    Illinois protocol: Private Dirty, Private Clean, Shared, Invalid
    • If read is sourced from memory, then Private Clean; if read is sourced from another cache, then Shared
    • Can write in cache if held Private Clean or Dirty

    MESI protocol: Modified (private, != memory), Exclusive (private, = memory), Shared (shared, = memory), Invalid

  • 30 An Example Snoopy Protocol

    Invalidation protocol, write-back cache

    Each block of memory is in one state:
    • Clean in all caches and up-to-date in memory (Shared)
    • OR dirty in exactly one cache (Exclusive)
    • OR not in any caches

    Each cache block is in one state (track these):
    • Shared: block can be read
    • OR Exclusive: cache has the only copy, it is writable, and dirty
    • OR Invalid: block contains no data

    Read misses cause all caches to snoop the bus
    Writes to a clean line are treated as misses
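
    A small C sketch of the per-block state this protocol tracks (illustrative, not from the slides; the bus call is a hypothetical placeholder).

      enum block_state {
          INVALID,     /* block contains no data */
          SHARED,      /* read-only copy; memory is up to date */
          EXCLUSIVE    /* only copy: writable and dirty */
      };

      struct cache_block {
          enum block_state state;
          unsigned long    tag;
          /* ... data ... */
      };

      /* A write to a line that is not Exclusive is treated as a miss:
         a write miss goes on the bus (invalidating other copies) before
         the block moves to Exclusive and the store completes. */
      static void cpu_write(struct cache_block *b)
      {
          if (b->state != EXCLUSIVE) {
              /* place_write_miss_on_bus(b);  -- hypothetical bus helper */
              b->state = EXCLUSIVE;
          }
          /* perform the store into the cache block */
      }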

  • 31 Snoopy-Cache State Machine I

    State machine for CPU requests for each cache block (states: Invalid, Shared (read only), Exclusive (read/write)):

    • Invalid -> Shared on CPU read: place read miss on bus
    • Invalid -> Exclusive on CPU write: place write miss on bus
    • Shared: CPU read hit stays in Shared
    • Shared -> Shared on CPU read miss: place read miss on bus
    • Shared -> Exclusive on CPU write: place write miss on bus
    • Exclusive: CPU read hit and CPU write hit stay in Exclusive
    • Exclusive -> Shared on CPU read miss: write back block, place read miss on bus
    • Exclusive -> Exclusive on CPU write miss: write back cache block, place write miss on bus

  • 32 Snoopy-Cache State Machine II

    State machine for bus requests for each cache block (Appendix E gives details of bus requests):

    • Shared -> Invalid on a write miss for this block
    • Exclusive -> Shared on a read miss for this block: write back block (abort memory access)
    • Exclusive -> Invalid on a write miss for this block: write back block (abort memory access)

  • 33 Snoopy-Cache State Machine III

    Combined state machine for CPU requests and for bus requests for each cache block: the union of the transitions in the two previous diagrams (CPU read/write hits and misses with their read-miss/write-miss bus actions and write-backs, plus the bus-induced read-miss and write-miss transitions).

  • 34 Example

    Step sequence (assumes A1 and A2 map to the same cache block; initial cache state is Invalid):

    P1: Write 10 to A1
    P1: Read A1
    P2: Read A1
    P2: Write 20 to A1
    P2: Write 40 to A2

    step | P1 (state addr value) | P2 (state addr value) | Bus (action proc addr value) | Memory (addr value)
    (table filled in on the following slides)

  • 35 Example

    P1 writes 10 to A1: a write miss goes on the bus and P1's copy becomes Exclusive.

    step               | P1          | P2          | Bus           | Memory
    P1: Write 10 to A1 | Excl. A1 10 |             | WrMs P1 A1    |

    (Assumes A1 and A2 map to the same cache block.)

  • 36 Example

    P1's read of A1 hits in its own cache; no bus traffic.

    step               | P1          | P2          | Bus           | Memory
    P1: Write 10 to A1 | Excl. A1 10 |             | WrMs P1 A1    |
    P1: Read A1        | Excl. A1 10 |             |               |

    (Assumes A1 and A2 map to the same cache block.)

  • 37 Example

    P2's read miss makes P1 write back the dirty block; both caches end up with A1 Shared and memory is updated to 10.

    step               | P1          | P2          | Bus           | Memory
    P1: Write 10 to A1 | Excl. A1 10 |             | WrMs P1 A1    |
    P1: Read A1        | Excl. A1 10 |             |               |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |
                       | Shar. A1 10 |             | WrBk P1 A1 10 | A1 10
                       |             | Shar. A1 10 | RdDa P2 A1 10 | A1 10

    (Assumes A1 and A2 map to the same cache block.)

  • 38 Example

    P2's write miss invalidates P1's copy; P2 now holds A1 Exclusive with value 20.

    step               | P1          | P2          | Bus           | Memory
    P1: Write 10 to A1 | Excl. A1 10 |             | WrMs P1 A1    |
    P1: Read A1        | Excl. A1 10 |             |               |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |
                       | Shar. A1 10 |             | WrBk P1 A1 10 | A1 10
                       |             | Shar. A1 10 | RdDa P2 A1 10 | A1 10
    P2: Write 20 to A1 | Inv.        | Excl. A1 20 | WrMs P2 A1    | A1 10

    (Assumes A1 and A2 map to the same cache block.)

  • 39 Example

    P2's write to A2 misses and forces write-back of the dirty A1 block (value 20) before P2 holds A2 Exclusive.

    step               | P1          | P2          | Bus           | Memory
    P1: Write 10 to A1 | Excl. A1 10 |             | WrMs P1 A1    |
    P1: Read A1        | Excl. A1 10 |             |               |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |
                       | Shar. A1 10 |             | WrBk P1 A1 10 | A1 10
                       |             | Shar. A1 10 | RdDa P2 A1 10 | A1 10
    P2: Write 20 to A1 | Inv.        | Excl. A1 20 | WrMs P2 A1    | A1 10
    P2: Write 40 to A2 |             |             | WrMs P2 A2    | A1 10
                       |             | Excl. A2 40 | WrBk P2 A1 20 | A1 20

    (Assumes A1 and A2 map to the same cache block, but A1 != A2.)

  • 40 Implementation Complications

    Write races:
    • Cannot update the cache until the bus is obtained
      • Otherwise, another processor may get the bus first, and then write the same cache block!
    • Two-step process:
      • Arbitrate for the bus
      • Place miss on the bus and complete the operation
    • If a miss occurs to the block while waiting for the bus, handle the miss (an invalidate may be needed) and then restart.
    • Split transaction bus:
      • Bus transactions are not atomic: can have multiple outstanding transactions for a block
      • Multiple misses can interleave, allowing two caches to grab the block in the Exclusive state
      • Must track and prevent multiple misses for one block

    Must support interventions and invalidations

  • 41 Implementing Snooping Caches

    Multiple processors must be on the bus, with access to both addresses and data

    Add a few new commands to perform coherency, in addition to read and write

    Processors continuously snoop on the address bus
    • If an address matches a tag, either invalidate or update

    Since every bus transaction checks cache tags, this could interfere with the CPU just to check:
    • Solution 1: duplicate set of tags for L1 caches just to allow checks in parallel with the CPU
    • Solution 2: the L2 cache already duplicates the tags, provided L2 obeys inclusion with the L1 cache
      • block size, associativity of L2 affects L1

  • 42 Implementing Snooping Caches

    The bus serializes writes; getting the bus ensures no one else can perform a memory operation

    On a miss in a write-back cache, another cache may have the desired copy and it is dirty, so it must reply

    Add an extra state bit to the cache to determine shared or not

    Add a 4th state (MESI)

  • 43Fundamental Issues

    3 Issues to characterize parallel machines

    1) Naming

    2) Synchronization

    3) Performance: Latency and Bandwidth (covered earlier)

  • 44 Fundamental Issue #1: Naming

    Naming: how to solve a large problem fast
    • what data is shared
    • how it is addressed
    • what operations can access data
    • how processes refer to each other

    Choice of naming affects the code produced by a compiler: via load, where one just remembers the address, or by keeping track of the processor number and local virtual address for message passing

    Choice of naming affects replication of data: via load in a cache memory hierarchy, or via SW replication and consistency

  • 45Fundamental Issue #1: Naming

    Global physical address space: any processor can generate, address and access it in a single operation

    • memory can be anywhere: virtual addr. translation handles it

    Global virtual address space: if the address space of each process can be configured to contain all shared data of the parallel program

    Segmented shared address space: locations are named uniformly for all processes of the parallel program

  • 46Fundamental Issue #2: Synchronization

    To cooperate, processes must coordinate

    Message passing is implicit coordination with transmission or arrival of data

    Shared address => additional operations to explicitly coordinate: e.g., write a flag, awaken a thread, interrupt a processor

  • 47 Summary: Parallel Framework

    Layers:
    • Programming Model:
      • Multiprogramming: lots of jobs, no communication
      • Shared address space: communicate via memory
      • Message passing: send and receive messages
      • Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
    • Communication Abstraction:
      • Shared address space: e.g., load, store, atomic swap
      • Message passing: e.g., send, receive library calls
      • Debate over this topic (ease of programming, scaling) => many hardware designs 1:1 with a programming model

    Layers (top to bottom): Programming Model, Communication Abstraction, Interconnection SW/OS, Interconnection HW

  • 48 Review: Multiprocessor

    Basic issues and terminology

    Communication: shared memory, message passing

    Parallel application:
    • Commercial workload: OLTP, DSS, Web index search
    • Multiprogramming and OS
    • Scientific/Technical

    Amdahl’s Law: Speedup

  • 49 Larger MPs

    Separate memory per processor
    Local or remote access via the memory controller
    1 cache coherency solution: non-cached pages
    Alternative: a directory per cache that tracks the state of every block in every cache

    • Which caches have copies of the block, dirty vs. clean, ...

    Info per memory block vs. per cache block?
    • PLUS: in memory => simpler protocol (centralized/one location)
    • MINUS: in memory => directory is ƒ(memory size) vs. ƒ(cache size)

    Prevent the directory from becoming a bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of its blocks

  • 50Distributed Directory MPs

  • 51 Directory Protocol

    Similar to the snoopy protocol: three states
    • Shared: ≥ 1 processors have data, memory is up-to-date
    • Uncached: no processor has it; not valid in any cache
    • Exclusive: 1 processor (owner) has data; memory is out-of-date

    In addition to cache state, must track which processors have data when in the shared state (usually a bit vector, 1 if the processor has a copy); a small sketch of such a directory entry follows

    Keep it simple(r):
    • Writes to non-exclusive data => write miss
    • Processor blocks until the access completes
    • Assume messages are received and acted upon in the order sent
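
    A small C sketch of one directory entry (illustrative, not from the slides): the three states plus a bit vector with one bit per processor.

      #include <stdint.h>

      enum dir_state { UNCACHED, SHARED_ST, EXCLUSIVE_ST };

      struct dir_entry {
          enum dir_state state;
          uint64_t       sharers;   /* bit i set => processor i has a copy */
      };

      /* record processor p as holding a copy of this block */
      static inline void add_sharer(struct dir_entry *e, int p)
      {
          e->sharers |= (uint64_t)1 << p;
      }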

  • 52 Directory Protocol

    No bus and don’t want to broadcast:
    • interconnect is no longer a single arbitration point
    • all messages have explicit responses

    Terms: typically 3 processors are involved
    • Local node: where a request originates
    • Home node: where the memory location of an address resides
    • Remote node: has a copy of the cache block, whether exclusive or shared

    Example messages on next slide: P = processor number, A = address

  • 53 Directory Protocol Messages

    Message type     | Source         | Destination    | Msg content
    Read miss        | Local cache    | Home directory | P, A
    • Processor P reads data at address A; make P a read sharer and arrange to send data back
    Write miss       | Local cache    | Home directory | P, A
    • Processor P writes data at address A; make P the exclusive owner and arrange to send data back
    Invalidate       | Home directory | Remote caches  | A
    • Invalidate a shared copy at address A
    Fetch            | Home directory | Remote cache   | A
    • Fetch the block at address A and send it to its home directory
    Fetch/Invalidate | Home directory | Remote cache   | A
    • Fetch the block at address A and send it to its home directory; invalidate the block in the cache
    Data value reply | Home directory | Local cache    | Data
    • Return a data value from the home memory (read miss response)
    Data write-back  | Remote cache   | Home directory | A, Data
    • Write back a data value for address A (invalidate response)

  • 54 State Transition Diagram for an Individual Cache Block in a Directory-Based System

    States are identical to the snoopy case; transactions are very similar.

    Transitions caused by read misses, write misses, invalidates, data fetch requests

    Generates read miss and write miss messages to the home directory.

    Write misses that were broadcast on the bus for snooping => explicit invalidate and data fetch requests.

    Note: on a write, the cache block is bigger than the data written, so the full cache block must be read

  • 55 CPU-Cache State Machine

    State machine for CPU requests for each memory block (Invalid state if the block is only in memory):

    • Invalid -> Shared (read only) on CPU read: send read miss message
    • Invalid -> Exclusive (read/write) on CPU write: send write miss message to home directory
    • Shared: CPU read hit stays in Shared
    • Shared -> Shared on CPU read miss: send read miss message
    • Shared -> Exclusive on CPU write: send write miss message to home directory
    • Shared -> Invalid on Invalidate
    • Exclusive: CPU read hit and CPU write hit stay in Exclusive
    • Exclusive -> Shared on Fetch: send data write-back message to home directory
    • Exclusive -> Invalid on Fetch/Invalidate: send data write-back message to home directory
    • Exclusive -> Shared on CPU read miss: send data write-back message and read miss to home directory
    • Exclusive -> Exclusive on CPU write miss: send data write-back message and write miss to home directory

  • 56 State Transition Diagram for Directory

    Same states and structure as the transition diagram for an individual cache

    2 actions: update the directory state and send messages to satisfy requests

    Tracks all copies of a memory block.

    Also indicates an action that updates the sharing set, Sharers, as well as sending a message.

  • 57 Directory State Machine

    State machine for directory requests for each memory block (Uncached state if the block is only in memory):

    • Uncached -> Shared (read only) on read miss: Sharers = {P}; send Data Value Reply
    • Uncached -> Exclusive (read/write) on write miss: Sharers = {P}; send Data Value Reply msg
    • Shared -> Shared on read miss: Sharers += {P}; send Data Value Reply
    • Shared -> Exclusive on write miss: send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply msg
    • Exclusive -> Uncached on Data Write Back: Sharers = {} (write back block)
    • Exclusive -> Shared on read miss: Sharers += {P}; send Fetch; send Data Value Reply msg to remote cache (write back block)
    • Exclusive -> Exclusive on write miss: Sharers = {P}; send Fetch/Invalidate; send Data Value Reply msg to remote cache

  • 58 Example Directory Protocol

    A message sent to the directory causes two actions:
    • Update the directory
    • More messages to satisfy the request

    Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that block are:
    • Read miss: the requesting processor is sent the data from memory and the requestor is made the only sharing node; the state of the block is made Shared.
    • Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.

    Block is Shared => the memory value is up-to-date:
    • Read miss: the requesting processor is sent back the data from memory and the requesting processor is added to the sharing set.
    • Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.

  • 59 Example Directory Protocol

    Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:

    • Read miss: the owner processor is sent a data fetch message, causing the state of the block in the owner’s cache to transition to Shared and causing the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy). State is Shared.

    • Data write-back: the owner processor is replacing the block and hence must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached, and the Sharers set is empty.

    • Write miss: the block has a new owner. A message is sent to the old owner causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive. (A small C sketch of the read-miss case follows.)
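
    A C sketch of the home directory's read-miss handling described above (illustrative; the message sends are stub functions, and the sharer bit vector follows the earlier directory-entry sketch).

      #include <stdint.h>
      #include <stdio.h>

      enum dir_state { UNCACHED, SHARED_ST, EXCLUSIVE_ST };
      struct dir_entry { enum dir_state state; uint64_t sharers; };

      /* placeholder message sends (hypothetical helpers; just print) */
      static void send_fetch(int p)            { printf("Ftch -> P%d\n", p); }
      static void send_data_value_reply(int p) { printf("DaRp -> P%d\n", p); }

      /* Home directory handling of a read miss from processor p */
      static void dir_read_miss(struct dir_entry *e, int p)
      {
          if (e->state == EXCLUSIVE_ST) {
              int owner = __builtin_ctzll(e->sharers);  /* single owner bit (GCC/Clang builtin) */
              send_fetch(owner);           /* owner writes the block back to memory */
          }
          e->sharers |= (uint64_t)1 << p;  /* requester joins the sharing set */
          e->state = SHARED_ST;
          send_data_value_reply(p);        /* memory (now up to date) supplies the data */
      }

      int main(void)
      {
          struct dir_entry e = { UNCACHED, 0 };
          dir_read_miss(&e, 2);            /* P2 read-misses an uncached block */
          return 0;
      }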

  • 60 Example

    Step sequence (A1 and A2 map to the same cache block):

    P1: Write 10 to A1
    P1: Read A1
    P2: Read A1
    P2: Write 20 to A1
    P2: Write 40 to A2

    step | P1 (state addr value) | P2 (state addr value) | Bus (action proc addr value) | Directory (addr state {procs} value) | Memory (value)
    (table filled in on the following slides)

  • 61 Example

    P1 writes 10 to A1: the write miss goes to the home directory, which makes A1 Exclusive with Sharers = {P1} and replies with the data.

    step               | P1          | P2          | Bus           | Directory        | Memory
    P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Ex {P1}       |
                       | Excl. A1 10 |             | DaRp P1 A1 0  |                  |

    (A1 and A2 map to the same cache block.)

  • 62 Example

    P1's read of A1 hits in its own cache; no bus or directory traffic.

    step               | P1          | P2          | Bus           | Directory        | Memory
    P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Ex {P1}       |
                       | Excl. A1 10 |             | DaRp P1 A1 0  |                  |
    P1: Read A1        | Excl. A1 10 |             |               |                  |

    (A1 and A2 map to the same cache block.)

  • 63 Example

    P2's read miss goes to the home directory, which fetches the block from owner P1 (writing it back to memory), adds P2 to Sharers, and replies with the data.

    step               | P1          | P2          | Bus           | Directory        | Memory
    P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Ex {P1}       |
                       | Excl. A1 10 |             | DaRp P1 A1 0  |                  |
    P1: Read A1        | Excl. A1 10 |             |               |                  |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |                  |
                       | Shar. A1 10 |             | Ftch P1 A1 10 |                  | 10
                       |             | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10

    (A1 and A2 map to the same cache block.)

  • 64 Example

    P2's write miss makes the directory invalidate P1's copy; A1 becomes Exclusive with Sharers = {P2}.

    step               | P1          | P2          | Bus           | Directory        | Memory
    P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Ex {P1}       |
                       | Excl. A1 10 |             | DaRp P1 A1 0  |                  |
    P1: Read A1        | Excl. A1 10 |             |               |                  |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |                  |
                       | Shar. A1 10 |             | Ftch P1 A1 10 |                  | 10
                       |             | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
    P2: Write 20 to A1 |             | Excl. A1 20 | WrMs P2 A1    |                  | 10
                       | Inv.        |             | Inval. P1 A1  | A1 Excl. {P2}    | 10

    (A1 and A2 map to the same cache block.)

  • 65 Example

    P2's write to A2 misses and forces write-back of the dirty A1 block (value 20); A1 becomes Uncached in the directory with memory updated to 20, and A2 becomes Exclusive {P2}.

    step               | P1          | P2          | Bus           | Directory        | Memory
    P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Ex {P1}       |
                       | Excl. A1 10 |             | DaRp P1 A1 0  |                  |
    P1: Read A1        | Excl. A1 10 |             |               |                  |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |                  |
                       | Shar. A1 10 |             | Ftch P1 A1 10 |                  | 10
                       |             | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
    P2: Write 20 to A1 |             | Excl. A1 20 | WrMs P2 A1    |                  | 10
                       | Inv.        |             | Inval. P1 A1  | A1 Excl. {P2}    | 10
    P2: Write 40 to A2 |             |             | WrMs P2 A2    | A2 Excl. {P2} 0  |
                       |             |             | WrBk P2 A1 20 | A1 Unca. {} 20   | 20
                       |             | Excl. A2 40 | DaRp P2 A2 0  | A2 Excl. {P2} 0  |

    (A1 and A2 map to the same cache block.)

  • 66 Implementing a Directory

    We assume operations are atomic, but they are not; reality is much harder; must avoid deadlock when running out of buffers in the network (see Appendix E)

    Optimizations:
    • On a read miss or write miss in Exclusive: send the data directly to the requestor from the owner vs. first to memory and then from memory to the requestor

  • 67 Synchronization

    Why synchronize? Need to know when it is safe for different processes to use shared data

    Issues for synchronization:
    • An uninterruptible instruction to fetch and update memory (atomic operation);
    • User-level synchronization operations using this primitive;
    • For large-scale MPs, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization

  • 68 Uninterruptible Instruction to Fetch and Update Memory

    Atomic exchange: interchange a value in a register for a value in memory

    0 => synchronization variable is free
    1 => synchronization variable is locked and unavailable
    • Set register to 1 and swap
    • New value in the register determines success in getting the lock:
      0 if you succeeded in setting the lock (you were first)
      1 if another processor had already claimed access
    • Key is that the exchange operation is indivisible

    Test-and-set: tests a value and sets it if the value passes the test

    Fetch-and-increment: returns the value of a memory location and atomically increments it
    • 0 => synchronization variable is free

  • 69 Uninterruptible Instruction to Fetch and Update Memory

    Hard to have read and write in 1 instruction: use 2 instead

    Load linked (or load locked) + store conditional
    • Load linked returns the initial value
    • Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise

    Example doing atomic swap with LL & SC:

    try:  mov  R3,R4      ; mov exchange value
          ll   R2,0(R1)   ; load linked
          sc   R3,0(R1)   ; store conditional
          beqz R3,try     ; branch if store fails (R3 = 0)
          mov  R4,R2      ; put load value in R4

    Example doing fetch & increment with LL & SC:

    try:  ll   R2,0(R1)   ; load linked
          addi R2,R2,#1   ; increment (OK if reg-reg)
          sc   R2,0(R1)   ; store conditional
          beqz R2,try     ; branch if store fails (R2 = 0)
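
    For comparison (not from the slides), the same two primitives expressed with C11 <stdatomic.h>; compilers typically lower these to LL/SC or an atomic exchange on the target ISA.

      #include <stdatomic.h>

      atomic_int lock_var;   /* 0 = free, 1 = locked */
      atomic_int counter;

      int atomic_swap_one(void)
      {
          return atomic_exchange(&lock_var, 1);   /* returns the old value */
      }

      int fetch_and_increment(void)
      {
          return atomic_fetch_add(&counter, 1);   /* returns the value before the add */
      }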

  • 70 User-Level Synchronization: Operation Using this Primitive

    Spin locks: processor continuously tries to acquire, spinning around a loop trying to get the lock

            li   R2,#1
    lockit: exch R2,0(R1)   ; atomic exchange
            bnez R2,lockit  ; already locked?

    What about an MP with cache coherency?
    • Want to spin on a cached copy to avoid full memory latency
    • Likely to get cache hits for such variables

    Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic

    Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

    try:    li   R2,#1
    lockit: lw   R3,0(R1)   ; load var
            bnez R3,lockit  ; not free => spin
            exch R2,0(R1)   ; atomic exchange
            bnez R2,try     ; already locked?
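
    The same "test and test&set" spin lock as a C11 sketch (illustrative): spin on an ordinary load, which hits in the local cache, and only attempt the exchange once the lock looks free.

      #include <stdatomic.h>

      void spin_lock(atomic_int *lock)
      {
          for (;;) {
              while (atomic_load(lock) != 0)
                  ;                               /* spin on the cached copy */
              if (atomic_exchange(lock, 1) == 0)
                  return;                         /* exchange won the lock */
          }
      }

      void spin_unlock(atomic_int *lock)
      {
          atomic_store(lock, 0);
      }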

  • 71 Another MP Issue: Memory Consistency Models

    What is consistency? When must a processor see the new value? e.g., it seems that

    P1:  A = 0;             P2:  B = 0;
         .....                   .....
         A = 1;                  B = 1;
    L1:  if (B == 0) ...    L2:  if (A == 0) ...

    Is it impossible for both if statements L1 and L2 to be true?
    • What if write invalidate is delayed and the processor continues?

    Memory consistency models: what are the rules for such cases?

    Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved => the assignments happen before the ifs above

    • SC: delay all memory accesses until all invalidates are done
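
    The same example as a C11 sketch (illustrative): with the default memory_order_seq_cst on these atomics, the execution behaves as one interleaving of the four accesses, so the two threads cannot both read 0.

      #include <stdatomic.h>

      atomic_int A, B;              /* both initially 0 */

      void *p1(void *arg)
      {
          atomic_store(&A, 1);
          if (atomic_load(&B) == 0) { /* L1 */ }
          return arg;
      }

      void *p2(void *arg)
      {
          atomic_store(&B, 1);
          if (atomic_load(&A) == 0) { /* L2 */ }
          return arg;
      }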

  • 72 Memory Consistency Model

    Schemes for faster execution than sequential consistency

    Not really an issue for most programs; they are synchronized
    • A program is synchronized if all accesses to shared data are ordered by synchronization operations

      write (x)
      ...
      release (s) {unlock}
      ...
      acquire (s) {lock}
      ...
      read (x)

    Only those programs willing to be nondeterministic are not synchronized: a "data race": outcome = f(proc. speed)

    Several relaxed models for memory consistency exist, since most programs are synchronized; they are characterized by their attitude towards RAR, WAR, RAW, WAW to different addresses
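
    A sketch of a synchronized program in C (illustrative, not from the slides): every access to the shared x is ordered by a lock's acquire/release pair, so there is no data race.

      #include <pthread.h>

      int x;                                   /* shared data */
      pthread_mutex_t s = PTHREAD_MUTEX_INITIALIZER;

      void writer(void)
      {
          pthread_mutex_lock(&s);              /* acquire (s) */
          x = 42;                              /* write (x)   */
          pthread_mutex_unlock(&s);            /* release (s) */
      }

      int reader(void)
      {
          pthread_mutex_lock(&s);              /* acquire (s) */
          int v = x;                           /* read (x)    */
          pthread_mutex_unlock(&s);            /* release (s) */
          return v;
      }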

  • 73Summary

    Caches contain all information on state of cached memory blocks

    Snooping and Directory Protocols similar; bus makes snooping easier because of broadcast (snooping => uniform memory access)

    Directory has extra data structure to keep track of state of all cache blocks

    Distributing directory => scalable shared address multiprocessor => Cache coherent, Non uniform memory access

  • 74Review

    Caches contain all information on state of cached memory blocks

    Snooping and Directory Protocols similar

    Bus makes snooping easier because of broadcast (snooping => Uniform Memory Access)

    Directory has extra data structure to keep track of state of all cache blocks

    Distributing directory => scalable shared address multiprocessor => Cache coherent, Non Uniform Memory Access (NUMA)

  • 75 Parallel App: Commercial Workload

    Online transaction processing workload (OLTP) (like TPC-B or -C)
    Decision support system (DSS) (like TPC-D)
    Web index search (AltaVista)

    Benchmark   | % Time User Mode | % Time Kernel | % Time I/O (CPU Idle)
    OLTP        | 71%              | 18%           | 11%
    DSS (range) | 82-94%           | 3-5%          | 4-13%
    DSS (avg)   | 87%              | 4%            | 9%
    AltaVista   | > 98%            | < 1%          | < 1%

  • 76 Alpha 4100 SMP

    4 CPUs

    Alpha 21164 @ 300 MHz

    L1$ 8KB direct mapped, write through

    L2$ 96KB, 3-way set associative

    L3$ 2MB (off chip), direct mapped

    Memory latency 80 clock cycles

    Cache-to-cache 125 clock cycles

  • 77 OLTP Performance as L3$ Size Varies

    (Chart data omitted. The slide's charts show, for the Alpha 4100 SMP commercial workloads:
    • OLTP normalized execution time vs. L3 cache size (1, 2, 4, 8 MB), broken into instruction execution, L2/L3 cache access, memory access, PAL code, and idle time;
    • fraction of execution time for OLTP, DSS, and AltaVista, broken into instruction execution, L2 access, L3 access, memory access, and other stalls;
    • L3 miss breakdown: memory cycles per instruction vs. cache size, split into true sharing, false sharing, cold, capacity/conflict, and instruction misses;
    • memory cycles per instruction vs. processor count (1-8), and misses per 1,000 instructions vs. block size (32-256 bytes), with the same miss categories.)

  • 78 L3 Miss Breakdown

    [Chart: memory cycles per instruction lost to L3 misses, broken into True Sharing, False Sharing, Cold, Capacity/Conflict, and Instruction components, for L3 cache sizes of 1, 2, 4, and 8 MB. The total falls from about 3.2 cycles per instruction at 1 MB to about 1.0 at 8 MB; capacity/conflict and instruction misses shrink rapidly with cache size, while the true-sharing component stays nearly constant (about 0.65-0.71) and dominates at 8 MB.]
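    The True Sharing / False Sharing split above is worth making concrete. The sketch below is not from the lecture; it is a hypothetical C/pthreads example with an assumed 64-byte block size. Two threads update different words that happen to share a cache block, so the block ping-pongs between caches even though no data is logically shared; padding each counter into its own block removes those misses. True-sharing misses, by contrast, come from threads touching the same word and cannot be removed by changing the data layout.

        #include <pthread.h>
        #include <stdio.h>

        #define ITERS 10000000L

        /* Two counters in the same (assumed 64-byte) cache block: each thread
         * touches only its own counter, yet every write invalidates the other
         * processor's copy of the block -> false-sharing misses.
         * volatile keeps the compiler from hoisting the counters into registers. */
        struct unpadded { volatile long a, b; };

        /* Same counters, padded so each sits in its own block: no coherence
         * traffic between the two threads. */
        struct padded { volatile long a; char pad[64 - sizeof(long)]; volatile long b; };

        static struct unpadded u;
        static struct padded   p;

        static void *bump_ua(void *arg) { for (long i = 0; i < ITERS; i++) u.a++; return arg; }
        static void *bump_ub(void *arg) { for (long i = 0; i < ITERS; i++) u.b++; return arg; }
        static void *bump_pa(void *arg) { for (long i = 0; i < ITERS; i++) p.a++; return arg; }
        static void *bump_pb(void *arg) { for (long i = 0; i < ITERS; i++) p.b++; return arg; }

        int main(void) {
            pthread_t t1, t2;

            /* false sharing: different words, same cache block */
            pthread_create(&t1, NULL, bump_ua, NULL);
            pthread_create(&t2, NULL, bump_ub, NULL);
            pthread_join(t1, NULL); pthread_join(t2, NULL);

            /* padded version: same logic, no false sharing */
            pthread_create(&t1, NULL, bump_pa, NULL);
            pthread_create(&t2, NULL, bump_pb, NULL);
            pthread_join(t1, NULL); pthread_join(t2, NULL);

            printf("%ld %ld %ld %ld\n", u.a, u.b, p.a, p.b);
            return 0;
        }

    Compiled with -pthread and run on a multiprocessor, the padded phase typically finishes noticeably faster; the difference is exactly the false-sharing component that the breakdown above isolates.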

  • 79 Memory CPI as CPU count increases

    [Chart: memory cycles per instruction broken into True Sharing, False Sharing, Cold, Conflict/Capacity, and Instruction components as the processor count grows from 1 to 8. The total rises from about 1.2 cycles per instruction at 1 processor to about 2.9 at 8; the true-sharing component grows from 0 to about 1.2, so coherence (communication) misses account for most of the increase.]

  • 80 OLTP performance as L3$ size varies

    [Chart: OLTP execution time, normalized, broken into Instruction Execution, L2/L3 Cache Access, Memory Access, PAL Code, and Idle, for L3 cache sizes of 1, 2, 4, and 8 MB. Total time drops from about 102 at 1 MB to about 80 at 2 MB, with little further gain at 4 and 8 MB; nearly all of the improvement comes from the Memory Access component (about 48 down to 11), while Idle time grows from about 8 to 25.]

  • 81 NUMA Memory Performance for Scientific Apps on SGI Origin 2000

    Show average cycles per memory reference in 4 categories (a worked sketch follows this list):

    Cache Hit

    Miss to local memory

    Remote miss to home

    3-network hop miss to remote cache
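    As a back-of-the-envelope illustration (not a measurement from the study), the average is just a weighted sum of the four category costs. The latencies below are assumptions taken loosely from the directory-based model that accompanies these charts (hit about 1 cycle, local miss about 85 cycles, remote miss to home 125-150 cycles, 3-hop miss 140-170 cycles); the miss fractions are made up for illustration.

        #include <stdio.h>

        /* Average cycles per memory reference = sum over categories of
         * (fraction of references in the category) x (cycles for that category). */
        int main(void) {
            double f_hit = 0.96, f_local = 0.03, f_home = 0.005, f_3hop = 0.005; /* illustrative fractions */
            double c_hit = 1.0,  c_local = 85.0, c_home = 150.0,  c_3hop = 170.0; /* assumed cycle costs   */

            double avg = f_hit * c_hit + f_local * c_local
                       + f_home * c_home + f_3hop * c_3hop;

            printf("average cycles per memory reference = %.2f\n", avg); /* about 5.1 */
            return 0;
        }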

  • 82 SGI Origin 2000

    A pure NUMA

    2 CPUs per node

    Scales up to 2048 processors

    Designed for scientific computation vs. commercial processing

    Scalable bandwidth is crucial to Origin

  • 83 Parallel App: Scientific/Technical

    FFT Kernel: 1D complex-number FFT

    • 2 matrix-transpose phases => all-to-all communication
    • Sequential time for n data points: O(n log n)
    • Example is a 1-million-point data set

    LU Kernel: dense matrix factorization

    • Blocking helps the cache miss rate; 16x16 blocks (see the sketch below)
    • Sequential time for an n x n matrix: O(n^3)
    • Example is a 512 x 512 matrix
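    The blocking point for LU is easiest to see in code. The sketch below is a minimal, hypothetical C version of the trailing-matrix update that dominates LU (it is not the SPLASH-2 source); it assumes row-major double-precision matrices whose dimension is a multiple of the block size. Working on 16x16 blocks keeps the three active blocks (about 6 KB) resident in the cache instead of streaming full-length rows through it.

        #include <stddef.h>

        #define B 16  /* 16x16 blocks of doubles: one block is 2 KB */

        /* One B x B block of the rank-B trailing update A[i][j] -= L[i][k] * U[k][j],
         * the dominant work in blocked LU factorization. n is the matrix dimension,
         * assumed to be a multiple of B; matrices are row-major. */
        static void update_block(double *A, const double *L, const double *U,
                                 size_t n, size_t ib, size_t jb, size_t kb) {
            for (size_t i = ib; i < ib + B; i++)
                for (size_t k = kb; k < kb + B; k++) {
                    double lik = L[i * n + k];
                    for (size_t j = jb; j < jb + B; j++)
                        A[i * n + j] -= lik * U[k * n + j];
                }
        }

        /* Update the whole trailing submatrix for factorization step kb, one
         * B x B block at a time, so each block of A, L, and U is loaded into
         * the cache once and reused B times before being evicted. */
        void trailing_update(double *A, const double *L, const double *U,
                             size_t n, size_t kb) {
            for (size_t ib = kb + B; ib < n; ib += B)
                for (size_t jb = kb + B; jb < n; jb += B)
                    update_block(A, L, U, n, ib, jb, kb);
        }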

  • 84 FFT Kernel

    [Chart: average cycles per memory reference for FFT on the SGI Origin 2000, split into Hit, Miss to local memory, Remote miss to home, and 3-hop miss to remote cache, for 8, 16, 32, and 64 processors. The local-miss contribution falls from about 3.3 cycles per reference at 8 processors to about 1.2 at 64, while the 3-hop contribution grows slightly, reflecting FFT's all-to-all communication. The underlying model assumes a 512 KB cache, 64-byte blocks, and miss penalties of roughly 85 cycles (local), 125-150 cycles (remote miss to home), and 140-170 cycles (3-hop miss).]

  • 85 LU Kernel

    [Chart: average cycles per memory reference for LU on the SGI Origin 2000, split into Hit, Miss to local memory, Remote miss to home, and 3-hop miss to remote cache, for 8, 16, 32, and 64 processors. The miss contributions are much smaller than for FFT: the local-miss component falls from about 0.49 cycles per reference at 8 processors to about 0.27 at 64, and the 3-hop component stays around 0.25-0.37.]