
  • Multiprocessors

    (based on D. Patterson’s lectures and Hennessy/Patterson’s book)

  • 2 Parallel Computers

    Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”

    Almasi and Gottlieb, Highly Parallel Computing, 1989

    Questions about parallel computers:
    • How large a collection?
    • How powerful are the processing elements?
    • How do they cooperate and communicate?
    • How are data transmitted?
    • What type of interconnection?
    • What are the HW and SW primitives for the programmer?
    • Does it translate into performance?

  • 3 Parallel Processors “Religion”

    The dream of computer architects since the 1950s: replicate processors to add performance vs. design a faster processor. This led to innovative organizations tied to particular programming models, since “uniprocessors can’t keep going”

    • e.g., uniprocessors must stop getting faster due to the limit of the speed of light: 1972, … , 1989

    • Borders on religious fervor: you must believe!
    • Fervor damped somewhat when 1990s companies went out of business:

    Thinking Machines, Kendall Square, ...

    The argument now is the “pull” of the opportunity of scalable performance, not the “push” of a uniprocessor performance plateau?

  • 4 What Level of Parallelism?

    Bit-level parallelism: 1970 to ~1985

    • 4-bit, 8-bit, 16-bit, 32-bit microprocessors

    Instruction-level parallelism (ILP): ~1985 through today

    • Pipelining
    • Superscalar
    • VLIW
    • Out-of-order execution
    • Limits to benefits of ILP?

    Process-level or thread-level parallelism; mainstream for general-purpose computing?

    • Servers are parallel
    • High-end desktop dual-processor PC soon?? (or just sell the socket?)

  • 5 Why Multiprocessors?

    Microprocessors are the fastest CPUs

    • Collecting several is much easier than redesigning one

    Complexity of current microprocessors
    • Do we have enough ideas to sustain 1.5X/yr?
    • Can we deliver such complexity on schedule?

    Slow (but steady) improvement in parallel software (scientific apps, databases, OS)

    Emergence of embedded and server markets driving microprocessors in addition to desktops
    • Embedded functional parallelism, producer/consumer model
    • Server figure of merit is tasks per hour vs. latency

  • 6 Parallel Processing Intro

    Long-term goal of the field: scale the number of processors to the size of the budget and the desired performance

    Machines today: Sun Enterprise 10000 (8/00)

    • 64 400 MHz UltraSPARC® II CPUs, 64 GB SDRAM memory, 868 18GB disks, tape
    • $4,720,800 total
    • 64 CPUs 15%, 64 GB DRAM 11%, disks 55%, cabinet 16% ($10,800 per processor or ~0.2% per processor)
    • Minimal E10K - 1 CPU, 1 GB DRAM, 0 disks, tape ~$286,700
    • $10,800 (4%) per CPU, plus $39,600 board/4 CPUs (~8%/CPU)

    Machines today: Dell Workstation 220 (2/01)
    • 866 MHz Intel Pentium® III (in minitower)
    • 0.125 GB RDRAM memory, 1 10GB disk, 12X CD, 17” monitor, nVIDIA GeForce2 GTS 32MB DDR graphics card, 1 yr service
    • $1,600; for an extra processor, add $350 (~20%)

  • 7 Whither Supercomputing?

    Linpack (dense linear algebra) for vector supercomputers vs. microprocessors: “Attack of the Killer Micros”

    • (see Chapter 1, Figure 1-10, page 22 of [CSG99])
    • 100 x 100 vs. 1000 x 1000

    MPPs vs. supercomputers when Linpack is rewritten to get peak performance

    • (see Chapter 1, Figure 1-11, page 24 of [CSG99])

    1997, 500 fastest machines in the world: 319 MPPs, 73 bus-based shared memory (SMP), 106 parallel vector processors (PVP)

    • (see Chapter 1, Figure 1-12, page 24 of [CSG99])

    2000, 381 of 500 fastest: 144 IBM SP (~cluster), 121 Sun (bus SMP), 62 SGI (NUMA SMP), 54 Cray (NUMA SMP)

    [CSG99] = Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler, Jaswinder Pal Singh, with Anoop Gupta. San Francisco: Morgan Kaufmann, 1999.

  • 8 Popular Flynn Categories (e.g., ~RAID levels for MPPs)

    SISD (Single Instruction Single Data)
    • Uniprocessors

    MISD (Multiple Instruction Single Data)
    • ???; multiple processors on a single data stream

    SIMD (Single Instruction Multiple Data)
    • Examples: Illiac-IV, CM-2
    • Simple programming model
    • Low overhead
    • Flexibility
    • All custom integrated circuits
    • (Phrase reused by Intel marketing for media instructions ~ vector)

    MIMD (Multiple Instruction Multiple Data)
    • Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
    • Flexible
    • Use off-the-shelf micros

    MIMD is the current winner: concentrate on it as the major design emphasis

  • 9 Major MIMD Styles

    Centralized shared memory ("Uniform Memory Access" time or "Shared Memory Processor")

    Decentralized memory (memory module with CPU)
    • Get more memory bandwidth, lower memory latency
    • Drawback: longer communication latency
    • Drawback: software model more complex

  • 10 Decentralized Memory Versions

    Shared memory with "Non-Uniform Memory Access" time (NUMA)

    Message-passing "multicomputer" with a separate address space per processor
    • Can invoke software with Remote Procedure Call (RPC)
    • Often via a library, such as MPI: Message Passing Interface
    • Also called "synchronous communication" since communication causes synchronization between the 2 processes

  • 11 Performance Metrics: Latency and Bandwidth

    Bandwidth
    • Need high bandwidth in communication
    • Match limits in network, memory, and processor
    • Challenge is the link speed of the network interface vs. the bisection bandwidth of the network

    Latency
    • Affects performance, since the processor may have to wait
    • Affects ease of programming, since it requires more thought to overlap communication and computation
    • Overhead to communicate is a problem in many machines

    Latency hiding
    • How can a mechanism help hide latency?
    • Increases programming system burden
    • Examples: overlap message send with computation, prefetch data, switch to other tasks

  • 12 Parallel Architecture

    Parallel architecture extends traditional computer architecture with a communication architecture

    • abstractions (HW/SW interface)
    • organizational structure to realize the abstraction efficiently

  • 13 Parallel Framework

    Layers:
    • (see Chapter 1, Figure 1-13, page 27 of [CSG99])
    • Programming Model:
      • Multiprogramming: lots of jobs, no communication
      • Shared address space: communicate via memory
      • Message passing: send and receive messages
      • Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
    • Communication Abstraction:
      • Shared address space: e.g., load, store, atomic swap
      • Message passing: e.g., send, receive library calls
      • Debate over this topic (ease of programming, scaling) => many hardware designs 1:1 with a programming model

  • 14Shared Address Model Summary

    Each processor can name every physical location in the machine

    Each process can name all data it shares with other processes

    Data transfer via load and store

    Data size: byte, word, ... or cache blocks

    Uses virtual memory to map virtual addresses to local or remote physical addresses

    Memory hierarchy model applies: communication now moves data to the local processor cache (just as a load moves data from memory to cache)

    • Latency, BW, scalability when communicating?

  • 15 Shared Address/Memory Multiprocessor Model

    Communicate via load and store
    • Oldest and most popular model

    Based on timesharing: processes on multiple processors vs. sharing a single processor

    Process: a virtual address space and ~1 thread of control
    • Multiple processes can overlap (share), but ALL threads share a process's address space

    Writes to the shared address space by one thread are visible to reads by other threads

    • Usual model: share code, private stack, some shared heap, some private heap (see the sketch below)
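
    A minimal sketch of this model (not from the slides; names are illustrative): two POSIX threads share one address space and communicate through an ordinary variable, with a mutex ordering the accesses.

      /* compile with: cc -pthread shared.c */
      #include <pthread.h>
      #include <stdio.h>

      static int shared_counter = 0;                 /* shared data, visible to both threads */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      static void *worker(void *arg)
      {
          for (int i = 0; i < 100000; i++) {
              pthread_mutex_lock(&lock);             /* synchronize the shared access */
              shared_counter++;                      /* communicate via ordinary load/store */
              pthread_mutex_unlock(&lock);
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t t1, t2;
          pthread_create(&t1, NULL, worker, NULL);
          pthread_create(&t2, NULL, worker, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          printf("counter = %d\n", shared_counter);  /* writes of both threads are visible here */
          return 0;
      }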

  • 16 SMP Interconnect

    Processors to memory AND to I/O

    Bus based: all memory locations have equal access time, so SMP = “Symmetric MP”

    • Sharing limits BW as processors and I/O are added
    • (see Chapter 1, Figs 1-17, pages 32-33 of [CSG99])

  • 17 Message Passing Model

    Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations

    • Essentially NUMA but integrated at the I/O devices vs. the memory system

    Send specifies local buffer + receiving process on the remote computer
    Receive specifies sending process on the remote computer + local buffer to place the data

    • Usually send includes a process tag and receive has a rule on the tag: match 1, match any
    • Synch: when send completes, when buffer free, when request accepted, receive waits for send

    Send+receive => memory-memory copy, where each side supplies its local address, AND does pairwise synchronization! (A minimal MPI sketch follows.)
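
    A minimal MPI sketch of this model (illustrative, not from the slides): rank 0 sends a local buffer to a receiving process, and the receive names the sending process plus a matching tag.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank, data = 42;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 0) {
              /* send: local buffer + receiving process (rank 1) + tag 0 */
              MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
          } else if (rank == 1) {
              /* receive: sending process (rank 0) + tag rule + local buffer */
              MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              printf("rank 1 received %d\n", data);
          }
          MPI_Finalize();
          return 0;
      }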

  • 18 Data Parallel Model

    Operations can be performed in parallel on each element of a large regular data structure, such as an array

    1 control processor broadcasts to many PEs (see Ch. 1, Fig. 1-25, page 45 of [CSG99])

    • When computers were large, could amortize the control portion over many replicated PEs

    Condition flag per PE so that PEs can be skipped

    Data distributed in each memory

    Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE

    Data parallel programming languages lay out data to processors

  • 19 Data Parallel Model

    Vector processors have similar ISAs, but no data placement restriction

    SIMD led to data parallel programming languages

    Advancing VLSI led to single-chip FPUs and whole fast µProcs (SIMD less attractive)

    The SIMD programming model led to the Single Program Multiple Data (SPMD) model

    • All processors execute an identical program

    Data parallel programming languages are still useful; do communication all at once: “bulk synchronous” phases in which all processors communicate after a global barrier (see the SPMD sketch below)
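
    A minimal SPMD / bulk-synchronous sketch (illustrative, not from the slides): every process runs this identical program on its own slice of the data, then all processes exchange results at a single global communication step.

      #include <mpi.h>

      #define N 1024

      int main(int argc, char **argv)
      {
          int rank, nprocs;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

          /* local computation phase on this process's portion of the data */
          double local_sum = 0.0;
          for (int i = rank; i < N; i += nprocs)
              local_sum += (double)i;

          /* global "bulk synchronous" phase: all contribute, all receive the result */
          double global_sum = 0.0;
          MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

          MPI_Finalize();
          return 0;
      }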

  • 20 Advantages of the shared-memory communication model

    Compatibility with SMP hardware

    Ease of programming when communication patterns are complex or vary dynamically during execution

    Ability to develop apps using the familiar SMP model, with attention only on performance-critical accesses

    Lower communication overhead, better use of BW for small items, due to implicit communication and memory mapping to implement protection in hardware, rather than through the I/O system

    HW-controlled caching to reduce remote communication by caching of all data, both shared and private

  • 21 Advantages of the message-passing communication model

    The hardware can be simpler (esp. vs. NUMA)

    Communication is explicit => simpler to understand; in shared memory it can be hard to know when you are communicating and when not, and how costly it is

    Explicit communication focuses attention on the costly aspect of parallel computation, sometimes leading to improved structure in a multiprocessor program

    Synchronization is naturally associated with sending messages, reducing the possibility for errors introduced by incorrect synchronization

    Easier to use sender-initiated communication, which may have some advantages in performance

  • 22 Communication Models

    Shared memory
    • Processors communicate with a shared address space
    • Easy on small-scale machines
    • Advantages:
      • Model of choice for uniprocessors, small-scale MPs
      • Ease of programming
      • Lower latency
      • Easier to use hardware-controlled caching

    Message passing
    • Processors have private memories, communicate via messages
    • Advantages:
      • Less hardware, easier to design
      • Focuses attention on costly non-local operations

    Can support either SW model on either HW base

  • 23 Amdahl’s Law and Parallel Computers

    Amdahl’s Law (FracX: original fraction to be sped up):
    Speedup = 1 / [FracX/SpeedupX + (1 - FracX)]

    A portion is sequential => limits parallel speedup
    • Speedup
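
    The formula as a small C helper (illustrative, not from the slides): FracX is the fraction that can be sped up, SpeedupX the speedup applied to it.

      #include <stdio.h>

      static double amdahl_speedup(double frac_x, double speedup_x)
      {
          return 1.0 / (frac_x / speedup_x + (1.0 - frac_x));
      }

      int main(void)
      {
          /* 90% of the work spread perfectly over 100 processors still
             gives less than 10x overall speedup */
          printf("%.2f\n", amdahl_speedup(0.90, 100.0));   /* ~9.17 */
          return 0;
      }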

  • 24 Small-Scale Shared Memory

    Caches serve to:
    • Increase bandwidth versus bus/memory
    • Reduce latency of access
    • Valuable for both private data and shared data

    What about cache consistency?

    Time | Event                 | $ A | $ B | X (memory)
    0    |                       |     |     | 1
    1    | CPU A reads X         | 1   |     | 1
    2    | CPU B reads X         | 1   | 1   | 1
    3    | CPU A stores 0 into X | 0   | 1   | 0

  • 25 What Does Coherency Mean?

    Informally:
    • “Any read must return the most recent write”
    • Too strict and too difficult to implement

    Better:
    • “Any write must eventually be seen by a read”
    • All writes are seen in proper order (“serialization”)

    Two rules to ensure this:
    • “If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart”
    • Writes to a single location are serialized: seen in one order
      • Latest write will be seen
      • Otherwise could see writes in illogical order (could see an older value after a newer value)

  • 26 Potential HW Coherency Solutions

    Snooping solution (snoopy bus):
    • Send all requests for data to all processors
    • Processors snoop to see if they have a copy and respond accordingly
    • Requires broadcast, since caching information is at the processors
    • Works well with a bus (natural broadcast medium)
    • Dominates for small-scale machines (most of the market)

    Directory-based schemes (discussed later):
    • Keep track of what is being shared in 1 centralized place (logically)
    • Distributed memory => distributed directory for scalability (avoids bottlenecks)
    • Send point-to-point requests to processors via network
    • Scales better than snooping
    • Actually existed BEFORE snooping-based schemes

  • 27 Basic Snoopy Protocols

    Write invalidate protocol:
    • Multiple readers, single writer
    • Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
    • Read miss:
      • Write-through: memory is always up-to-date
      • Write-back: snoop in caches to find the most recent copy

    Write broadcast protocol (typically write-through):
    • Write to shared data: broadcast on bus, processors snoop and update any copies
    • Read miss: memory is always up-to-date

    Write serialization: the bus serializes requests!
    • Bus is a single point of arbitration

  • 28 Basic Snoopy Protocols

    Write invalidate versus broadcast:
    • Invalidate requires one transaction per write-run
    • Invalidate uses spatial locality: one transaction per block
    • Broadcast has lower latency between write and read

  • 29 Snooping Cache Variations

    Basic protocol: Exclusive, Shared, Invalid

    Berkeley protocol: Owned Exclusive, Owned Shared, Shared, Invalid
    • Owner can update via bus invalidate operation
    • Owner must write back when replaced in cache

    Illinois protocol: Private Dirty, Private Clean, Shared, Invalid
    • If read is sourced from memory, then Private Clean; if read is sourced from another cache, then Shared
    • Can write in cache if held Private Clean or Dirty

    MESI protocol: Modified (private, != memory), Exclusive (private, = memory), Shared (shared, = memory), Invalid

  • 30 An Example Snoopy Protocol

    Invalidation protocol, write-back cache

    Each block of memory is in one state:
    • Clean in all caches and up-to-date in memory (Shared)
    • OR dirty in exactly one cache (Exclusive)
    • OR not in any caches

    Each cache block is in one state (track these):
    • Shared: block can be read
    • OR Exclusive: cache has the only copy, it is writable, and dirty
    • OR Invalid: block contains no data

    Read misses cause all caches to snoop the bus
    Writes to a clean line are treated as misses
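
    A small C sketch of the per-block state this protocol tracks (illustrative, not from the slides; the bus call is a hypothetical placeholder).

      enum block_state {
          INVALID,     /* block contains no data */
          SHARED,      /* read-only copy; memory is up to date */
          EXCLUSIVE    /* only copy: writable and dirty */
      };

      struct cache_block {
          enum block_state state;
          unsigned long    tag;
          /* ... data ... */
      };

      /* A write to a line that is not Exclusive is treated as a miss:
         a write miss goes on the bus (invalidating other copies) before
         the block moves to Exclusive and the store completes. */
      static void cpu_write(struct cache_block *b)
      {
          if (b->state != EXCLUSIVE) {
              /* place_write_miss_on_bus(b);  -- hypothetical bus helper */
              b->state = EXCLUSIVE;
          }
          /* perform the store into the cache block */
      }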

  • 31 Snoopy-Cache State Machine I

    State machine for CPU requests for each cache block (states: Invalid, Shared (read only), Exclusive (read/write)):

    • Invalid -> Shared on CPU read: place read miss on bus
    • Invalid -> Exclusive on CPU write: place write miss on bus
    • Shared: CPU read hit stays in Shared
    • Shared -> Shared on CPU read miss: place read miss on bus
    • Shared -> Exclusive on CPU write: place write miss on bus
    • Exclusive: CPU read hit and CPU write hit stay in Exclusive
    • Exclusive -> Shared on CPU read miss: write back block, place read miss on bus
    • Exclusive -> Exclusive on CPU write miss: write back cache block, place write miss on bus

  • 32 Snoopy-Cache State Machine II

    State machine for bus requests for each cache block (Appendix E gives details of bus requests):

    • Shared -> Invalid on a write miss for this block
    • Exclusive -> Shared on a read miss for this block: write back block (abort memory access)
    • Exclusive -> Invalid on a write miss for this block: write back block (abort memory access)

  • 33 Snoopy-Cache State Machine III

    Combined state machine for CPU requests and for bus requests for each cache block: the union of the transitions in the two previous diagrams (CPU read/write hits and misses with their read-miss/write-miss bus actions and write-backs, plus the bus-induced read-miss and write-miss transitions).

  • 34 Example

    Step sequence (assumes A1 and A2 map to the same cache block; initial cache state is Invalid):

    P1: Write 10 to A1
    P1: Read A1
    P2: Read A1
    P2: Write 20 to A1
    P2: Write 40 to A2

    step | P1 (state addr value) | P2 (state addr value) | Bus (action proc addr value) | Memory (addr value)
    (table filled in on the following slides)

  • 35 Example

    P1 writes 10 to A1: a write miss goes on the bus and P1's copy becomes Exclusive.

    step               | P1          | P2          | Bus           | Memory
    P1: Write 10 to A1 | Excl. A1 10 |             | WrMs P1 A1    |

    (Assumes A1 and A2 map to the same cache block.)

  • 36 Example

    P1's read of A1 hits in its own cache; no bus traffic.

    step               | P1          | P2          | Bus           | Memory
    P1: Write 10 to A1 | Excl. A1 10 |             | WrMs P1 A1    |
    P1: Read A1        | Excl. A1 10 |             |               |

    (Assumes A1 and A2 map to the same cache block.)

  • 37 Example

    P2's read miss makes P1 write back the dirty block; both caches end up with A1 Shared and memory is updated to 10.

    step               | P1          | P2          | Bus           | Memory
    P1: Write 10 to A1 | Excl. A1 10 |             | WrMs P1 A1    |
    P1: Read A1        | Excl. A1 10 |             |               |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |
                       | Shar. A1 10 |             | WrBk P1 A1 10 | A1 10
                       |             | Shar. A1 10 | RdDa P2 A1 10 | A1 10

    (Assumes A1 and A2 map to the same cache block.)

  • 38 Example

    P2's write miss invalidates P1's copy; P2 now holds A1 Exclusive with value 20.

    step               | P1          | P2          | Bus           | Memory
    P1: Write 10 to A1 | Excl. A1 10 |             | WrMs P1 A1    |
    P1: Read A1        | Excl. A1 10 |             |               |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |
                       | Shar. A1 10 |             | WrBk P1 A1 10 | A1 10
                       |             | Shar. A1 10 | RdDa P2 A1 10 | A1 10
    P2: Write 20 to A1 | Inv.        | Excl. A1 20 | WrMs P2 A1    | A1 10

    (Assumes A1 and A2 map to the same cache block.)

  • 39 Example

    P2's write to A2 misses and forces write-back of the dirty A1 block (value 20) before P2 holds A2 Exclusive.

    step               | P1          | P2          | Bus           | Memory
    P1: Write 10 to A1 | Excl. A1 10 |             | WrMs P1 A1    |
    P1: Read A1        | Excl. A1 10 |             |               |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |
                       | Shar. A1 10 |             | WrBk P1 A1 10 | A1 10
                       |             | Shar. A1 10 | RdDa P2 A1 10 | A1 10
    P2: Write 20 to A1 | Inv.        | Excl. A1 20 | WrMs P2 A1    | A1 10
    P2: Write 40 to A2 |             |             | WrMs P2 A2    | A1 10
                       |             | Excl. A2 40 | WrBk P2 A1 20 | A1 20

    (Assumes A1 and A2 map to the same cache block, but A1 != A2.)

  • 40 Implementation Complications

    Write races:
    • Cannot update the cache until the bus is obtained
      • Otherwise, another processor may get the bus first, and then write the same cache block!
    • Two-step process:
      • Arbitrate for the bus
      • Place miss on the bus and complete the operation
    • If a miss occurs to the block while waiting for the bus, handle the miss (an invalidate may be needed) and then restart.
    • Split transaction bus:
      • Bus transactions are not atomic: can have multiple outstanding transactions for a block
      • Multiple misses can interleave, allowing two caches to grab the block in the Exclusive state
      • Must track and prevent multiple misses for one block

    Must support interventions and invalidations

  • 41 Implementing Snooping Caches

    Multiple processors must be on the bus, with access to both addresses and data

    Add a few new commands to perform coherency, in addition to read and write

    Processors continuously snoop on the address bus
    • If an address matches a tag, either invalidate or update

    Since every bus transaction checks cache tags, this could interfere with the CPU just to check:
    • Solution 1: duplicate set of tags for L1 caches just to allow checks in parallel with the CPU
    • Solution 2: the L2 cache already duplicates the tags, provided L2 obeys inclusion with the L1 cache
      • block size, associativity of L2 affects L1

  • 42 Implementing Snooping Caches

    The bus serializes writes; getting the bus ensures no one else can perform a memory operation

    On a miss in a write-back cache, another cache may have the desired copy and it is dirty, so it must reply

    Add an extra state bit to the cache to determine shared or not

    Add a 4th state (MESI)

  • 43Fundamental Issues

    3 Issues to characterize parallel machines

    1) Naming

    2) Synchronization

    3) Performance: Latency and Bandwidth (covered earlier)

  • 44 Fundamental Issue #1: Naming

    Naming: how to solve a large problem fast
    • what data is shared
    • how it is addressed
    • what operations can access data
    • how processes refer to each other

    Choice of naming affects the code produced by a compiler: via load, where one just remembers the address, or by keeping track of the processor number and local virtual address for message passing

    Choice of naming affects replication of data: via load in a cache memory hierarchy, or via SW replication and consistency

  • 45Fundamental Issue #1: Naming

    Global physical address space: any processor can generate, address and access it in a single operation

    • memory can be anywhere: virtual addr. translation handles it

    Global virtual address space: if the address space of each process can be configured to contain all shared data of the parallel program

    Segmented shared address space: locations are named uniformly for all processes of the parallel program

  • 46Fundamental Issue #2: Synchronization

    To cooperate, processes must coordinate

    Message passing is implicit coordination with transmission or arrival of data

    Shared address => additional operations to explicitly coordinate: e.g., write a flag, awaken a thread, interrupt a processor

  • 47 Summary: Parallel Framework

    Layers:
    • Programming Model:
      • Multiprogramming: lots of jobs, no communication
      • Shared address space: communicate via memory
      • Message passing: send and receive messages
      • Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
    • Communication Abstraction:
      • Shared address space: e.g., load, store, atomic swap
      • Message passing: e.g., send, receive library calls
      • Debate over this topic (ease of programming, scaling) => many hardware designs 1:1 with a programming model

    Layers (top to bottom): Programming Model, Communication Abstraction, Interconnection SW/OS, Interconnection HW

  • 48 Review: Multiprocessor

    Basic issues and terminology

    Communication: shared memory, message passing

    Parallel application:
    • Commercial workload: OLTP, DSS, Web index search
    • Multiprogramming and OS
    • Scientific/Technical

    Amdahl’s Law: Speedup

  • 49 Larger MPs

    Separate memory per processor
    Local or remote access via the memory controller
    1 cache coherency solution: non-cached pages
    Alternative: a directory per cache that tracks the state of every block in every cache

    • Which caches have copies of the block, dirty vs. clean, ...

    Info per memory block vs. per cache block?
    • PLUS: in memory => simpler protocol (centralized/one location)
    • MINUS: in memory => directory is ƒ(memory size) vs. ƒ(cache size)

    Prevent the directory from becoming a bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of its blocks

  • 50Distributed Directory MPs

  • 51 Directory Protocol

    Similar to the snoopy protocol: three states
    • Shared: ≥ 1 processors have data, memory is up-to-date
    • Uncached: no processor has it; not valid in any cache
    • Exclusive: 1 processor (owner) has data; memory is out-of-date

    In addition to cache state, must track which processors have data when in the shared state (usually a bit vector, 1 if the processor has a copy); a small sketch of such a directory entry follows

    Keep it simple(r):
    • Writes to non-exclusive data => write miss
    • Processor blocks until the access completes
    • Assume messages are received and acted upon in the order sent
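
    A small C sketch of one directory entry (illustrative, not from the slides): the three states plus a bit vector with one bit per processor.

      #include <stdint.h>

      enum dir_state { UNCACHED, SHARED_ST, EXCLUSIVE_ST };

      struct dir_entry {
          enum dir_state state;
          uint64_t       sharers;   /* bit i set => processor i has a copy */
      };

      /* record processor p as holding a copy of this block */
      static inline void add_sharer(struct dir_entry *e, int p)
      {
          e->sharers |= (uint64_t)1 << p;
      }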

  • 52 Directory Protocol

    No bus and don’t want to broadcast:
    • interconnect is no longer a single arbitration point
    • all messages have explicit responses

    Terms: typically 3 processors are involved
    • Local node: where a request originates
    • Home node: where the memory location of an address resides
    • Remote node: has a copy of the cache block, whether exclusive or shared

    Example messages on next slide: P = processor number, A = address

  • 53 Directory Protocol Messages

    Message type     | Source         | Destination    | Msg content
    Read miss        | Local cache    | Home directory | P, A
    • Processor P reads data at address A; make P a read sharer and arrange to send data back
    Write miss       | Local cache    | Home directory | P, A
    • Processor P writes data at address A; make P the exclusive owner and arrange to send data back
    Invalidate       | Home directory | Remote caches  | A
    • Invalidate a shared copy at address A
    Fetch            | Home directory | Remote cache   | A
    • Fetch the block at address A and send it to its home directory
    Fetch/Invalidate | Home directory | Remote cache   | A
    • Fetch the block at address A and send it to its home directory; invalidate the block in the cache
    Data value reply | Home directory | Local cache    | Data
    • Return a data value from the home memory (read miss response)
    Data write-back  | Remote cache   | Home directory | A, Data
    • Write back a data value for address A (invalidate response)

  • 54 State Transition Diagram for an Individual Cache Block in a Directory-Based System

    States are identical to the snoopy case; transactions are very similar.

    Transitions caused by read misses, write misses, invalidates, data fetch requests

    Generates read miss and write miss messages to the home directory.

    Write misses that were broadcast on the bus for snooping => explicit invalidate and data fetch requests.

    Note: on a write, the cache block is bigger than the data written, so the full cache block must be read

  • 55 CPU-Cache State Machine

    State machine for CPU requests for each memory block (Invalid state if the block is only in memory):

    • Invalid -> Shared (read only) on CPU read: send read miss message
    • Invalid -> Exclusive (read/write) on CPU write: send write miss message to home directory
    • Shared: CPU read hit stays in Shared
    • Shared -> Shared on CPU read miss: send read miss message
    • Shared -> Exclusive on CPU write: send write miss message to home directory
    • Shared -> Invalid on Invalidate
    • Exclusive: CPU read hit and CPU write hit stay in Exclusive
    • Exclusive -> Shared on Fetch: send data write-back message to home directory
    • Exclusive -> Invalid on Fetch/Invalidate: send data write-back message to home directory
    • Exclusive -> Shared on CPU read miss: send data write-back message and read miss to home directory
    • Exclusive -> Exclusive on CPU write miss: send data write-back message and write miss to home directory

  • 56 State Transition Diagram for Directory

    Same states and structure as the transition diagram for an individual cache

    2 actions: update the directory state and send messages to satisfy requests

    Tracks all copies of a memory block.

    Also indicates an action that updates the sharing set, Sharers, as well as sending a message.

  • 57 Directory State Machine

    State machine for directory requests for each memory block (Uncached state if the block is only in memory):

    • Uncached -> Shared (read only) on read miss: Sharers = {P}; send Data Value Reply
    • Uncached -> Exclusive (read/write) on write miss: Sharers = {P}; send Data Value Reply msg
    • Shared -> Shared on read miss: Sharers += {P}; send Data Value Reply
    • Shared -> Exclusive on write miss: send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply msg
    • Exclusive -> Uncached on Data Write Back: Sharers = {} (write back block)
    • Exclusive -> Shared on read miss: Sharers += {P}; send Fetch; send Data Value Reply msg to remote cache (write back block)
    • Exclusive -> Exclusive on write miss: Sharers = {P}; send Fetch/Invalidate; send Data Value Reply msg to remote cache

  • 58 Example Directory Protocol

    A message sent to the directory causes two actions:
    • Update the directory
    • More messages to satisfy the request

    Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that block are:
    • Read miss: the requesting processor is sent the data from memory and the requestor is made the only sharing node; the state of the block is made Shared.
    • Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.

    Block is Shared => the memory value is up-to-date:
    • Read miss: the requesting processor is sent back the data from memory and the requesting processor is added to the sharing set.
    • Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.

  • 59 Example Directory Protocol

    Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:

    • Read miss: the owner processor is sent a data fetch message, causing the state of the block in the owner’s cache to transition to Shared and causing the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy). State is Shared.

    • Data write-back: the owner processor is replacing the block and hence must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached, and the Sharers set is empty.

    • Write miss: the block has a new owner. A message is sent to the old owner causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive. (A small C sketch of the read-miss case follows.)
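
    A C sketch of the home directory's read-miss handling described above (illustrative; the message sends are stub functions, and the sharer bit vector follows the earlier directory-entry sketch).

      #include <stdint.h>
      #include <stdio.h>

      enum dir_state { UNCACHED, SHARED_ST, EXCLUSIVE_ST };
      struct dir_entry { enum dir_state state; uint64_t sharers; };

      /* placeholder message sends (hypothetical helpers; just print) */
      static void send_fetch(int p)            { printf("Ftch -> P%d\n", p); }
      static void send_data_value_reply(int p) { printf("DaRp -> P%d\n", p); }

      /* Home directory handling of a read miss from processor p */
      static void dir_read_miss(struct dir_entry *e, int p)
      {
          if (e->state == EXCLUSIVE_ST) {
              int owner = __builtin_ctzll(e->sharers);  /* single owner bit (GCC/Clang builtin) */
              send_fetch(owner);           /* owner writes the block back to memory */
          }
          e->sharers |= (uint64_t)1 << p;  /* requester joins the sharing set */
          e->state = SHARED_ST;
          send_data_value_reply(p);        /* memory (now up to date) supplies the data */
      }

      int main(void)
      {
          struct dir_entry e = { UNCACHED, 0 };
          dir_read_miss(&e, 2);            /* P2 read-misses an uncached block */
          return 0;
      }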

  • 60 Example

    Step sequence (A1 and A2 map to the same cache block):

    P1: Write 10 to A1
    P1: Read A1
    P2: Read A1
    P2: Write 20 to A1
    P2: Write 40 to A2

    step | P1 (state addr value) | P2 (state addr value) | Bus (action proc addr value) | Directory (addr state {procs} value) | Memory (value)
    (table filled in on the following slides)

  • 61 Example

    P1 writes 10 to A1: the write miss goes to the home directory, which makes A1 Exclusive with Sharers = {P1} and replies with the data.

    step               | P1          | P2          | Bus           | Directory        | Memory
    P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Ex {P1}       |
                       | Excl. A1 10 |             | DaRp P1 A1 0  |                  |

    (A1 and A2 map to the same cache block.)

  • 62 Example

    P1's read of A1 hits in its own cache; no bus or directory traffic.

    step               | P1          | P2          | Bus           | Directory        | Memory
    P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Ex {P1}       |
                       | Excl. A1 10 |             | DaRp P1 A1 0  |                  |
    P1: Read A1        | Excl. A1 10 |             |               |                  |

    (A1 and A2 map to the same cache block.)

  • 63 Example

    P2's read miss goes to the home directory, which fetches the block from owner P1 (writing it back to memory), adds P2 to Sharers, and replies with the data.

    step               | P1          | P2          | Bus           | Directory        | Memory
    P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Ex {P1}       |
                       | Excl. A1 10 |             | DaRp P1 A1 0  |                  |
    P1: Read A1        | Excl. A1 10 |             |               |                  |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |                  |
                       | Shar. A1 10 |             | Ftch P1 A1 10 |                  | 10
                       |             | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10

    (A1 and A2 map to the same cache block.)

  • 64 Example

    P2's write miss makes the directory invalidate P1's copy; A1 becomes Exclusive with Sharers = {P2}.

    step               | P1          | P2          | Bus           | Directory        | Memory
    P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Ex {P1}       |
                       | Excl. A1 10 |             | DaRp P1 A1 0  |                  |
    P1: Read A1        | Excl. A1 10 |             |               |                  |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |                  |
                       | Shar. A1 10 |             | Ftch P1 A1 10 |                  | 10
                       |             | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
    P2: Write 20 to A1 |             | Excl. A1 20 | WrMs P2 A1    |                  | 10
                       | Inv.        |             | Inval. P1 A1  | A1 Excl. {P2}    | 10

    (A1 and A2 map to the same cache block.)

  • 65 Example

    P2's write to A2 misses and forces write-back of the dirty A1 block (value 20); A1 becomes Uncached in the directory with memory updated to 20, and A2 becomes Exclusive {P2}.

    step               | P1          | P2          | Bus           | Directory        | Memory
    P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Ex {P1}       |
                       | Excl. A1 10 |             | DaRp P1 A1 0  |                  |
    P1: Read A1        | Excl. A1 10 |             |               |                  |
    P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |                  |
                       | Shar. A1 10 |             | Ftch P1 A1 10 |                  | 10
                       |             | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
    P2: Write 20 to A1 |             | Excl. A1 20 | WrMs P2 A1    |                  | 10
                       | Inv.        |             | Inval. P1 A1  | A1 Excl. {P2}    | 10
    P2: Write 40 to A2 |             |             | WrMs P2 A2    | A2 Excl. {P2} 0  |
                       |             |             | WrBk P2 A1 20 | A1 Unca. {} 20   | 20
                       |             | Excl. A2 40 | DaRp P2 A2 0  | A2 Excl. {P2} 0  |

    (A1 and A2 map to the same cache block.)

  • 66 Implementing a Directory

    We assume operations are atomic, but they are not; reality is much harder; must avoid deadlock when running out of buffers in the network (see Appendix E)

    Optimizations:
    • On a read miss or write miss in Exclusive: send the data directly to the requestor from the owner vs. first to memory and then from memory to the requestor

  • 67 Synchronization

    Why synchronize? Need to know when it is safe for different processes to use shared data

    Issues for synchronization:
    • An uninterruptible instruction to fetch and update memory (atomic operation);
    • User-level synchronization operations using this primitive;
    • For large-scale MPs, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization

  • 68 Uninterruptible Instruction to Fetch and Update Memory

    Atomic exchange: interchange a value in a register for a value in memory

    0 => synchronization variable is free
    1 => synchronization variable is locked and unavailable
    • Set register to 1 and swap
    • New value in the register determines success in getting the lock:
      0 if you succeeded in setting the lock (you were first)
      1 if another processor had already claimed access
    • Key is that the exchange operation is indivisible

    Test-and-set: tests a value and sets it if the value passes the test

    Fetch-and-increment: returns the value of a memory location and atomically increments it
    • 0 => synchronization variable is free

  • 69 Uninterruptible Instruction to Fetch and Update Memory

    Hard to have read and write in 1 instruction: use 2 instead

    Load linked (or load locked) + store conditional
    • Load linked returns the initial value
    • Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise

    Example doing atomic swap with LL & SC:

    try:  mov  R3,R4      ; mov exchange value
          ll   R2,0(R1)   ; load linked
          sc   R3,0(R1)   ; store conditional
          beqz R3,try     ; branch if store fails (R3 = 0)
          mov  R4,R2      ; put load value in R4

    Example doing fetch & increment with LL & SC:

    try:  ll   R2,0(R1)   ; load linked
          addi R2,R2,#1   ; increment (OK if reg-reg)
          sc   R2,0(R1)   ; store conditional
          beqz R2,try     ; branch if store fails (R2 = 0)
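
    For comparison (not from the slides), the same two primitives expressed with C11 <stdatomic.h>; compilers typically lower these to LL/SC or an atomic exchange on the target ISA.

      #include <stdatomic.h>

      atomic_int lock_var;   /* 0 = free, 1 = locked */
      atomic_int counter;

      int atomic_swap_one(void)
      {
          return atomic_exchange(&lock_var, 1);   /* returns the old value */
      }

      int fetch_and_increment(void)
      {
          return atomic_fetch_add(&counter, 1);   /* returns the value before the add */
      }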

  • 70 User-Level Synchronization: Operation Using this Primitive

    Spin locks: processor continuously tries to acquire, spinning around a loop trying to get the lock

            li   R2,#1
    lockit: exch R2,0(R1)   ; atomic exchange
            bnez R2,lockit  ; already locked?

    What about an MP with cache coherency?
    • Want to spin on a cached copy to avoid full memory latency
    • Likely to get cache hits for such variables

    Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic

    Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

    try:    li   R2,#1
    lockit: lw   R3,0(R1)   ; load var
            bnez R3,lockit  ; not free => spin
            exch R2,0(R1)   ; atomic exchange
            bnez R2,try     ; already locked?
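
    The same "test and test&set" spin lock as a C11 sketch (illustrative): spin on an ordinary load, which hits in the local cache, and only attempt the exchange once the lock looks free.

      #include <stdatomic.h>

      void spin_lock(atomic_int *lock)
      {
          for (;;) {
              while (atomic_load(lock) != 0)
                  ;                               /* spin on the cached copy */
              if (atomic_exchange(lock, 1) == 0)
                  return;                         /* exchange won the lock */
          }
      }

      void spin_unlock(atomic_int *lock)
      {
          atomic_store(lock, 0);
      }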

  • 71 Another MP Issue: Memory Consistency Models

    What is consistency? When must a processor see the new value? e.g., it seems that

    P1:  A = 0;             P2:  B = 0;
         .....                   .....
         A = 1;                  B = 1;
    L1:  if (B == 0) ...    L2:  if (A == 0) ...

    Is it impossible for both if statements L1 and L2 to be true?
    • What if write invalidate is delayed and the processor continues?

    Memory consistency models: what are the rules for such cases?

    Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved => the assignments happen before the ifs above

    • SC: delay all memory accesses until all invalidates are done
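
    The same example as a C11 sketch (illustrative): with the default memory_order_seq_cst on these atomics, the execution behaves as one interleaving of the four accesses, so the two threads cannot both read 0.

      #include <stdatomic.h>

      atomic_int A, B;              /* both initially 0 */

      void *p1(void *arg)
      {
          atomic_store(&A, 1);
          if (atomic_load(&B) == 0) { /* L1 */ }
          return arg;
      }

      void *p2(void *arg)
      {
          atomic_store(&B, 1);
          if (atomic_load(&A) == 0) { /* L2 */ }
          return arg;
      }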

  • 72 Memory Consistency Model

    Schemes for faster execution than sequential consistency

    Not really an issue for most programs; they are synchronized
    • A program is synchronized if all accesses to shared data are ordered by synchronization operations

      write (x)
      ...
      release (s) {unlock}
      ...
      acquire (s) {lock}
      ...
      read (x)

    Only those programs willing to be nondeterministic are not synchronized: a "data race": outcome = f(proc. speed)

    Several relaxed models for memory consistency exist, since most programs are synchronized; they are characterized by their attitude towards RAR, WAR, RAW, WAW to different addresses
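
    A sketch of a synchronized program in C (illustrative, not from the slides): every access to the shared x is ordered by a lock's acquire/release pair, so there is no data race.

      #include <pthread.h>

      int x;                                   /* shared data */
      pthread_mutex_t s = PTHREAD_MUTEX_INITIALIZER;

      void writer(void)
      {
          pthread_mutex_lock(&s);              /* acquire (s) */
          x = 42;                              /* write (x)   */
          pthread_mutex_unlock(&s);            /* release (s) */
      }

      int reader(void)
      {
          pthread_mutex_lock(&s);              /* acquire (s) */
          int v = x;                           /* read (x)    */
          pthread_mutex_unlock(&s);            /* release (s) */
          return v;
      }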

  • 73Summary

    Caches contain all information on state of cached memory blocks

    Snooping and Directory Protocols similar; bus makes snooping easier because of broadcast (snooping => uniform memory access)

    Directory has extra data structure to keep track of state of all cache blocks

    Distributing directory => scalable shared address multiprocessor => Cache coherent, Non uniform memory access

  • 74Review

    Caches contain all information on state of cached memory blocks

    Snooping and Directory Protocols similar

    Bus makes snooping easier because of broadcast (snooping => Uniform Memory Access)

    Directory has extra data structure to keep track of state of all cache blocks

    Distributing directory => scalable shared address multiprocessor => Cache coherent, Non Uniform Memory Access (NUMA)

  • 75 Parallel App: Commercial Workload

    Online transaction processing workload (OLTP) (like TPC-B or -C)
    Decision support system (DSS) (like TPC-D)
    Web index search (AltaVista)

    Benchmark   | % Time User Mode | % Time Kernel | % Time I/O (CPU Idle)
    OLTP        | 71%              | 18%           | 11%
    DSS (range) | 82-94%           | 3-5%          | 4-13%
    DSS (avg)   | 87%              | 4%            | 9%
    AltaVista   | > 98%            | < 1%          | < 1%

  • 76 Alpha 4100 SMP

    4 CPUs

    Alpha 21164 @ 300 MHz

    L1$ 8KB direct mapped, write through

    L2$ 96KB, 3-way set associative

    L3$ 2MB (off chip), direct mapped

    Memory latency 80 clock cycles

    Cache-to-cache 125 clock cycles

  • 77 OLTP Performance as L3$ Size Varies

    (Chart data omitted. The slide's charts show, for the Alpha 4100 SMP commercial workloads:
    • OLTP normalized execution time vs. L3 cache size (1, 2, 4, 8 MB), broken into instruction execution, L2/L3 cache access, memory access, PAL code, and idle time;
    • fraction of execution time for OLTP, DSS, and AltaVista, broken into instruction execution, L2 access, L3 access, memory access, and other stalls;
    • L3 miss breakdown: memory cycles per instruction vs. cache size, split into true sharing, false sharing, cold, capacity/conflict, and instruction misses;
    • memory cycles per instruction vs. processor count (1-8), and misses per 1,000 instructions vs. block size (32-256 bytes), with the same miss categories.)

  • 78 L3 Miss Breakdown

    [Chart: memory cycles per instruction lost to L3 misses, broken into True Sharing, False Sharing, Cold, Capacity/Conflict, and Instruction components, for L3 cache sizes of 1, 2, 4, and 8 MB. The total falls from about 3.2 cycles per instruction at 1 MB to about 1.0 at 8 MB; capacity/conflict and instruction misses shrink rapidly with cache size, while the true-sharing component stays nearly constant (about 0.65-0.71) and dominates at 8 MB.]
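    The True Sharing / False Sharing split above is worth making concrete. The sketch below is not from the lecture; it is a hypothetical C/pthreads example with an assumed 64-byte block size. Two threads update different words that happen to share a cache block, so the block ping-pongs between caches even though no data is logically shared; padding each counter into its own block removes those misses. True-sharing misses, by contrast, come from threads touching the same word and cannot be removed by changing the data layout.

        #include <pthread.h>
        #include <stdio.h>

        #define ITERS 10000000L

        /* Two counters in the same (assumed 64-byte) cache block: each thread
         * touches only its own counter, yet every write invalidates the other
         * processor's copy of the block -> false-sharing misses.
         * volatile keeps the compiler from hoisting the counters into registers. */
        struct unpadded { volatile long a, b; };

        /* Same counters, padded so each sits in its own block: no coherence
         * traffic between the two threads. */
        struct padded { volatile long a; char pad[64 - sizeof(long)]; volatile long b; };

        static struct unpadded u;
        static struct padded   p;

        static void *bump_ua(void *arg) { for (long i = 0; i < ITERS; i++) u.a++; return arg; }
        static void *bump_ub(void *arg) { for (long i = 0; i < ITERS; i++) u.b++; return arg; }
        static void *bump_pa(void *arg) { for (long i = 0; i < ITERS; i++) p.a++; return arg; }
        static void *bump_pb(void *arg) { for (long i = 0; i < ITERS; i++) p.b++; return arg; }

        int main(void) {
            pthread_t t1, t2;

            /* false sharing: different words, same cache block */
            pthread_create(&t1, NULL, bump_ua, NULL);
            pthread_create(&t2, NULL, bump_ub, NULL);
            pthread_join(t1, NULL); pthread_join(t2, NULL);

            /* padded version: same logic, no false sharing */
            pthread_create(&t1, NULL, bump_pa, NULL);
            pthread_create(&t2, NULL, bump_pb, NULL);
            pthread_join(t1, NULL); pthread_join(t2, NULL);

            printf("%ld %ld %ld %ld\n", u.a, u.b, p.a, p.b);
            return 0;
        }

    Compiled with -pthread and run on a multiprocessor, the padded phase typically finishes noticeably faster; the difference is exactly the false-sharing component that the breakdown above isolates.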

  • 79 Memory CPI as CPU count increases

    [Chart: memory cycles per instruction broken into True Sharing, False Sharing, Cold, Conflict/Capacity, and Instruction components as the processor count grows from 1 to 8. The total rises from about 1.2 cycles per instruction at 1 processor to about 2.9 at 8; the true-sharing component grows from 0 to about 1.2, so coherence (communication) misses account for most of the increase.]

  • 80 OLTP performance as L3$ size varies

    [Chart: OLTP execution time, normalized, broken into Instruction Execution, L2/L3 Cache Access, Memory Access, PAL Code, and Idle, for L3 cache sizes of 1, 2, 4, and 8 MB. Total time drops from about 102 at 1 MB to about 80 at 2 MB, with little further gain at 4 and 8 MB; nearly all of the improvement comes from the Memory Access component (about 48 down to 11), while Idle time grows from about 8 to 25.]

  • 81 NUMA Memory Performance for Scientific Apps on SGI Origin 2000

    Show average cycles per memory reference in 4 categories (a worked sketch follows this list):

    Cache Hit

    Miss to local memory

    Remote miss to home

    3-network hop miss to remote cache
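    As a back-of-the-envelope illustration (not a measurement from the study), the average is just a weighted sum of the four category costs. The latencies below are assumptions taken loosely from the directory-based model that accompanies these charts (hit about 1 cycle, local miss about 85 cycles, remote miss to home 125-150 cycles, 3-hop miss 140-170 cycles); the miss fractions are made up for illustration.

        #include <stdio.h>

        /* Average cycles per memory reference = sum over categories of
         * (fraction of references in the category) x (cycles for that category). */
        int main(void) {
            double f_hit = 0.96, f_local = 0.03, f_home = 0.005, f_3hop = 0.005; /* illustrative fractions */
            double c_hit = 1.0,  c_local = 85.0, c_home = 150.0,  c_3hop = 170.0; /* assumed cycle costs   */

            double avg = f_hit * c_hit + f_local * c_local
                       + f_home * c_home + f_3hop * c_3hop;

            printf("average cycles per memory reference = %.2f\n", avg); /* about 5.1 */
            return 0;
        }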

  • 82 SGI Origin 2000

    A pure NUMA

    2 CPUs per node

    Scales up to 2048 processors

    Designed for scientific computation vs. commercial processing

    Scalable bandwidth is crucial to Origin

  • 83 Parallel App: Scientific/Technical

    FFT Kernel: 1D complex-number FFT

    • 2 matrix-transpose phases => all-to-all communication
    • Sequential time for n data points: O(n log n)
    • Example is a 1-million-point data set

    LU Kernel: dense matrix factorization

    • Blocking helps the cache miss rate; 16x16 blocks (see the sketch below)
    • Sequential time for an n x n matrix: O(n^3)
    • Example is a 512 x 512 matrix
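    The blocking point for LU is easiest to see in code. The sketch below is a minimal, hypothetical C version of the trailing-matrix update that dominates LU (it is not the SPLASH-2 source); it assumes row-major double-precision matrices whose dimension is a multiple of the block size. Working on 16x16 blocks keeps the three active blocks (about 6 KB) resident in the cache instead of streaming full-length rows through it.

        #include <stddef.h>

        #define B 16  /* 16x16 blocks of doubles: one block is 2 KB */

        /* One B x B block of the rank-B trailing update A[i][j] -= L[i][k] * U[k][j],
         * the dominant work in blocked LU factorization. n is the matrix dimension,
         * assumed to be a multiple of B; matrices are row-major. */
        static void update_block(double *A, const double *L, const double *U,
                                 size_t n, size_t ib, size_t jb, size_t kb) {
            for (size_t i = ib; i < ib + B; i++)
                for (size_t k = kb; k < kb + B; k++) {
                    double lik = L[i * n + k];
                    for (size_t j = jb; j < jb + B; j++)
                        A[i * n + j] -= lik * U[k * n + j];
                }
        }

        /* Update the whole trailing submatrix for factorization step kb, one
         * B x B block at a time, so each block of A, L, and U is loaded into
         * the cache once and reused B times before being evicted. */
        void trailing_update(double *A, const double *L, const double *U,
                             size_t n, size_t kb) {
            for (size_t ib = kb + B; ib < n; ib += B)
                for (size_t jb = kb + B; jb < n; jb += B)
                    update_block(A, L, U, n, ib, jb, kb);
        }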

  • 84 FFT Kernel

    [Chart: average cycles per memory reference for FFT on the SGI Origin 2000, split into Hit, Miss to local memory, Remote miss to home, and 3-hop miss to remote cache, for 8, 16, 32, and 64 processors. The local-miss contribution falls from about 3.3 cycles per reference at 8 processors to about 1.2 at 64, while the 3-hop contribution grows slightly, reflecting FFT's all-to-all communication. The underlying model assumes a 512 KB cache, 64-byte blocks, and miss penalties of roughly 85 cycles (local), 125-150 cycles (remote miss to home), and 140-170 cycles (3-hop miss).]

  • 85 LU Kernel

    [Chart: average cycles per memory reference for LU on the SGI Origin 2000, split into Hit, Miss to local memory, Remote miss to home, and 3-hop miss to remote cache, for 8, 16, 32, and 64 processors. The miss contributions are much smaller than for FFT: the local-miss component falls from about 0.49 cycles per reference at 8 processors to about 0.27 at 64, and the 3-hop component stays around 0.25-0.37.]