Memory Hierarchies and Thread Performance
CSE 160, Chien, Spring 2005 (Lecture #6)

    Lecture #6, Slide 1 CSE 160 Chien, Spring 2005

    Memory Hierarchies and Thread Performance

    • Last Time
      » Parallelism = Performance
      » Threads and Nodes as Basis for Large-scale Parallelism

    • Today
      » Caches and Performance
      » Threads and Caches
      » Parallel Machine Structure

    • Reminders/Announcements
      » Homework #2 went out Tuesday (Due April 21)
      » Homework #1 already returned (see Sagnik)

    Lecture #6, Slide 2 CSE 160 Chien, Spring 2005

    The Processor-Memory Gap

    [Figure: relative performance (log scale, 10^0 to 10^5) versus year, 1980 to 2005; the Processor curve climbs steeply while the Memory (DRAM) curve stays nearly flat, opening the gap.]

    Lecture #6, Slide 3 CSE 160 Chien, Spring 2005

    Caches

    • Caches are Critical to “span the gap”
      » Exploit Locality; keep the needed data in a small fast memory
      » Avoid accessing the slow memory

    • If Successful
      » Processor can execute faster; memory requests are OFTEN satisfied quickly
      » Memory system can be much lower bandwidth; many fewer requests actually go to memory

    Lecture #6, Slide 4 CSE 160 Chien, Spring 2005

    A Typical Memory Hierarchy

    [Figure: processor (P) at the top with on-chip L1 and L2 caches, an off-chip L3 cache below, and main Memory at the bottom; levels near the processor are small with expensive $/bit, levels near memory are large with cheap $/bit.]

    Lecture #6, Slide 5 CSE 160 Chien, Spring 2005

    Other Attributes of A Memory Hierarchy

    [Figure: the same hierarchy (P, on-chip caches, off-chip cache, main memory), annotated with low latency / high bandwidth near the processor and high latency / low bandwidth at main memory.]

    Lecture #6, Slide 6 CSE 160 Chien, Spring 2005

    Finding Cache Data

    • For fast access, must be able to find data quickly
    • Lookup: discard a few address bits, access the fast memory
    • Tags indicate the data that’s really there
    • Why does this work?
      » Memory location + tag -> complete address

    [Figure: Access + Lookup: a Memory Address is split into a Cache Tag and a Cache Index; the index selects an entry in the cache’s Data and Tags arrays.]

    Lecture #6, Slide 7 CSE 160 Chien, Spring 2005

    Cache Access

    • Part of Memory Address applied to cache
    • Tags checked against remaining address bits
    • If match -> Hit!, use data
    • No match -> Miss, retrieve data from memory
    • This works pretty well... but there are some complications...

    [Figure: memory address lines index the Data and Tags arrays; a comparator checks the stored tag against the remaining address bits to produce Hit?]

    Lecture #6, Slide 8 CSE 160 Chien, Spring 2005

    Multiword Cache Entries

    • Cache entries are several contiguous words of memory
    • Shared tag, involves only contiguous addresses
      » reduces tag overhead
      » exploits spatial locality
    • A few address lines select the word from the line on a “hit”.... Which ones?

    [Figure: cache organized as multiword Cache Blocks with one tag per block.]

    Lecture #6, Slide 9 CSE 160 Chien, Spring 2005

    Accessing a Sample Cache

    • 64 KB cache, direct-mapped, 32-byte cache block size

    [Figure: 32-bit address split into a 16-bit tag (bits 31-16), an 11-bit index (bits 15-5), and a 5-bit word offset (bits 4-0); 64 KB / 32 bytes = 2 K cache blocks/sets (rows 0 to 2047), each holding a valid bit, a tag, and 256 bits (32 bytes) of data; the tag compare produces hit/miss.]
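    A minimal sketch of this lookup in C, assuming exactly the geometry above (32-byte blocks, so a 5-bit offset; 2 K blocks, so an 11-bit index; 16 tag bits). The structure and function names are illustrative, not from the slides:

        #include <stdbool.h>
        #include <stdint.h>

        #define BLOCK_BITS 5                  /* 32-byte blocks -> 5 offset bits */
        #define INDEX_BITS 11                 /* 2 K blocks     -> 11 index bits */
        #define NUM_BLOCKS (1u << INDEX_BITS)

        struct cache_line {
            bool     valid;
            uint16_t tag;                     /* bits 31..16 of the address */
            uint8_t  data[1 << BLOCK_BITS];   /* 32 bytes of cached data    */
        };

        static struct cache_line cache[NUM_BLOCKS];

        /* Direct-mapped: each address maps to exactly one candidate line. */
        bool dm_lookup(uint32_t addr)
        {
            uint32_t index = (addr >> BLOCK_BITS) & (NUM_BLOCKS - 1);
            uint16_t tag   = (uint16_t)(addr >> (BLOCK_BITS + INDEX_BITS));
            return cache[index].valid && cache[index].tag == tag;  /* hit? */
        }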

    Lecture #6, Slide 10 CSE 160 Chien, Spring 2005

    Accessing a Sample Cache

    • 32 KB cache, 2-way set-associative, 16-byte block size

    [Figure: 32-bit address split into an 18-bit tag (bits 31-14), a 10-bit index (bits 13-4), and a 4-bit word offset (bits 3-0); 32 KB / 16 bytes / 2 ways = 1 K cache sets (rows 0 to 1023), each way holding a valid bit, a tag, and data; two tag compares in parallel produce hit/miss.]
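    The same lookup for this 2-way configuration, again only an illustrative sketch: the index now selects a set rather than a single block, and the tags of both ways are checked (in hardware, in parallel; here, in a loop):

        #include <stdbool.h>
        #include <stdint.h>

        #define OFFSET_BITS 4                 /* 16-byte blocks -> 4 offset bits */
        #define SET_BITS    10                /* 1 K sets       -> 10 index bits */
        #define NUM_SETS    (1u << SET_BITS)
        #define WAYS        2

        struct way {
            bool     valid;
            uint32_t tag;                     /* 18 tag bits (31..14) */
            uint8_t  data[1 << OFFSET_BITS];
        };

        static struct way sets[NUM_SETS][WAYS];

        /* 2-way set-associative: check every way in the selected set. */
        bool sa_lookup(uint32_t addr)
        {
            uint32_t set = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
            uint32_t tag = addr >> (OFFSET_BITS + SET_BITS);
            for (int w = 0; w < WAYS; w++)
                if (sets[set][w].valid && sets[set][w].tag == tag)
                    return true;              /* hit in way w */
            return false;                     /* miss: pick a victim way, refill */
        }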

    Lecture #6, Slide 11 CSE 160 Chien, Spring 2005

    Example – HP/Compaq/DEC Alpha 21164 Caches

    [Figure: 21164 CPU core feeding split Instruction and Data caches, a Unified L2 Cache on chip, and an Off-Chip L3 Cache.]

    • ICache and DCache -- 8 KB, DM, 32-byte lines
    • L2 cache -- 96 KB, 3-way SA, 32-byte lines
    • L3 cache -- 1 MB, DM, 32-byte lines

    Lecture #6, Slide 12 CSE 160 Chien, Spring 2005

    Example

    • 2.4 GHz Dual Xeon Server
      » Dual Processors
      » L1 Cache – 8KB, 2-3 clocks, ~10GB/s
      » L2 Cache – 512KB, 6GB/s
      » Memory System – 100’s of clocks, 2GB/s shared
    • Caches must be effective for this system to work well.
    • What is the ratio, just on bandwidth alone? Block sizes…
    • What type of working set is possible for an application? (how much locality)

    This is comparable to the FWGrid Opteron Nodes!
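    A rough worked answer to the bandwidth question, using only the numbers above: each processor’s L1 can supply ~10 GB/s, but the two processors share 2 GB/s of memory bandwidth, roughly 1 GB/s each, so the per-processor L1-to-memory bandwidth ratio is about 10:1. On bandwidth alone, an application would need to satisfy roughly 90% of its accesses from cache to avoid being memory-bound, and block sizes can make the effective ratio worse, since each miss moves a whole block whether or not all of it is used.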

    Lecture #6, Slide 13 CSE 160 Chien, Spring 2005

    Caches and Single Thread Performance

    • So, why does this matter?
      » Single thread performance is tied to cache performance
      » High locality, high cache hit rate is essential
      » Low hit rate => low thread performance
      » Single thread performance is the building block for high parallel performance
    • For good parallel performance, single threads must make good use of cache hierarchies (see the sketch below)

    [Figure: 16 x Threads stacked on a Single Thread Performance base labeled OK / Good / Excellent; aggregate performance scales with the per-thread base.]
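    A minimal C illustration of why locality drives single-thread performance (this example is mine, not from the slides): both loops compute the same sum, but the row-major traversal walks memory sequentially, taking roughly one miss per cache line, while the column-major traversal strides across rows and can miss on nearly every access.

        #include <stddef.h>

        #define N 1024

        /* Row-major traversal: consecutive accesses hit the same cache line. */
        double sum_rows(double a[N][N])
        {
            double s = 0.0;
            for (size_t i = 0; i < N; i++)
                for (size_t j = 0; j < N; j++)
                    s += a[i][j];
            return s;
        }

        /* Column-major traversal: each access strides N*8 bytes, defeating
           spatial locality; typically several times slower on large arrays. */
        double sum_cols(double a[N][N])
        {
            double s = 0.0;
            for (size_t j = 0; j < N; j++)
                for (size_t i = 0; i < N; i++)
                    s += a[i][j];
            return s;
        }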

    Lecture #6, Slide 14 CSE 160 Chien, Spring 2005

    Multithreaded, Hyperthreading, or Simultaneous MT Processors

    • Typically share the same memory hierarchy
      » Same caches, same “ports” to access the cache
      » No increased capacity, no increased bandwidth
      » Try to get higher throughput out of the processor
    • Collective use of the Memory Hierarchy must have high locality
      » Capacity: Threads should have small “working sets”; overlapping working sets even better
      » Bandwidth: Not too many misses, or miss bursts at different times

    [Figure: a single processor (P) with its Cache, L2 Cache, and L3 Cache shared by all hardware threads.]

    Lecture #6, Slide 15 CSE 160 Chien, Spring 2005

    Multi-Processors: Cache Coherence Problem

    • Multiple Processors Sharing a single Memory System
      » Separate cache hierarchy for each processor
    • Copies of a datum may exist in multiple Caches
      » Writes take a finite amount of time to propagate
      » Caches might contain a value and respond while another processor is writing

    [Figure: two processors, each with its own Cache / L2 Cache / L3 Cache hierarchy, attached to one Memory; copies of Obj A sit in both hierarchies at once.]

    This is the structure of the FWGrid Opteron Nodes!

    Lecture #6, Slide 16 CSE 160 Chien, Spring 2005

    Coherence Thru Write-back Caches

    • First write captures an exclusive copy of the cache block
    • Successive writes can be buffered in the cache, reducing the required bus bandwidth; key for scalability
    • Cache must store exclusive and modified state, complicates the cache
    • Eventually, the dirty, modified block must be written back into the main memory (or supplied to another cache)
    • Basic scheme used for many machines...

    [Trace: Proc issues Write A, Load A, Write A, Load A, Write A; the Bus sees only the first Write A and, eventually, Flush A.]

    Lecture #6, Slide 17 CSE 160 Chien, Spring 2005

    Basic Write-back Cache Protocol

    • Invalid, Shared, Dirty (exclusive); A/B -- Get A, do B
    • Dirty state is obtained through an ownership bus transaction; exiting this state requires a flush to write the data back
    • BusRd gets a read copy, BusRdX gets a write copy
    • Work out all of the transactions. Just about the simplest. (A C sketch of the state machine follows.)

    [FSM: states I (Invalid), S (Shared), D (Dirty).
      I -> S on PrRd/BusRd; I -> D on PrWr/BusRdX;
      S -> D on PrWr/BusRdX; S -> I on BusRdX/--; S self-loops on PrRd/-- and BusRd/--;
      D -> S on BusRd/Flush; D -> I on BusRdX/Flush; D self-loops on PrRd/-- and PrWr/--.]
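    A minimal sketch of this 3-state protocol as a transition function in C; the enum names and the returned-action style are my own framing of the slide’s “A/B -- Get A, do B” notation, with bus actions returned rather than performed:

        /* MSI write-back protocol sketch: states and events from the slide. */
        typedef enum { INVALID, SHARED, DIRTY } state_t;
        typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } event_t;
        typedef enum { NONE, BUSRD, BUSRDX, FLUSH } action_t;

        /* Returns the bus action ("A/B -- Get A, do B") and updates the state. */
        action_t transition(state_t *s, event_t e)
        {
            switch (*s) {
            case INVALID:
                if (e == PR_RD) { *s = SHARED; return BUSRD;  }
                if (e == PR_WR) { *s = DIRTY;  return BUSRDX; }
                return NONE;              /* bus events ignore invalid lines */
            case SHARED:
                if (e == PR_WR)   { *s = DIRTY;   return BUSRDX; }
                if (e == BUS_RDX) { *s = INVALID; return NONE;   }
                return NONE;              /* PrRd and BusRd: stay shared */
            case DIRTY:
                if (e == BUS_RD)  { *s = SHARED;  return FLUSH; }
                if (e == BUS_RDX) { *s = INVALID; return FLUSH; }
                return NONE;              /* PrRd and PrWr hit locally */
            }
            return NONE;
        }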

    Lecture #6, Slide 18 CSE 160 Chien, Spring 2005

    Illinois Protocol (MESI)

    • Extend 3-state protocol with VE (valid, exclusive)
    • Idea: Optimize BusRd and BusRdX into one transaction; keep track of private read copies.
    • Natural extension of the protocol in other ways.
    • Used in SGI Challenge and MOST multiprocessors…

    Lecture #6, Slide 19 CSE 160 Chien, Spring 2005

    Illinois Protocol (FSM)

    • Other transitions identical to 3-state protocol
    • “remembers” whether we have the only copy

    [FSM: states I, S, VE, D; only the transitions that differ from the 3-state protocol are drawn:
      I -> VE on PrRd/BusRd when no sharer responds (PrRd/BusRd(S) goes to S as before);
      VE -> D on PrWr/-- (no bus transaction needed for a private copy);
      VE -> S on BusRd/Flush; VE -> I on BusRdX/Flush; VE self-loops on PrRd/--.]

    Lecture #6, Slide 20 CSE 160 Chien, Spring 2005

    Complications -- Life ain’t so simple...

    • Write buffers
      » Delay write visibility; how does this affect the memory consistency model?
      » Aggregation: order can be changed as well. Flush instructions if necessary (see the sketch below)
    • Split transaction busses
      » Operations are not atomic; there may be many outstanding at the same time (again out of order)
      » No problem if the operations are not related (as above)
    • Cache designs require simultaneous access
      » tag replication or dual porting
      » multi-level caches can alleviate this problem
    • Reality: cache controllers compete with processors as the “really hard things to get right” in systems design
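    A small C11 sketch of the “flush if necessary” point: without the release/acquire pair, the store to data may still sit in a write buffer when flag becomes visible, so another thread could see flag set but read stale data. This uses standard C11 atomics, not anything specific to the machines in these slides:

        #include <stdatomic.h>

        int data;                       /* ordinary (non-atomic) payload */
        atomic_int flag = 0;            /* signals that data is ready    */

        void producer(void)
        {
            data = 42;                  /* may linger in a write buffer... */
            /* release store: earlier stores become visible before flag */
            atomic_store_explicit(&flag, 1, memory_order_release);
        }

        int consumer(void)
        {
            /* acquire load pairs with the release store above */
            while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
                ;                       /* spin until flag is set */
            return data;                /* guaranteed to read 42 */
        }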

    Lecture #6, Slide 21 CSE 160 Chien, Spring 2005

    Caches and Multiple Processors

    • So, why does this matter?
      » Coordination and data sharing amongst threads can be expensive
      » Cross-processor – hundreds of cycles; all the way down and all the way back up
      » Ideally: Threads operate on independent data.
        – Allow execution without interference in a single processor
        – Allow execution with high efficiency on separate processors

    [Figure: two processors, each with its own Cache / L2 Cache / L3 Cache hierarchy, sharing one Memory.]

    Lecture #6, Slide 22 CSE 160 Chien, Spring 2005

    Parallel Machine Taxonomy

    Lecture #6, Slide 23 CSE 160 Chien, Spring 2005

    Flynn’s Taxonomy

    • Flynn (1966) classified machines by data and control streams
      » Single Instruction, Single Data (SISD)
      » Single Instruction, Multiple Data (SIMD)
      » Multiple Instruction, Single Data (MISD)
      » Multiple Instruction, Multiple Data (MIMD)

    Lecture #6, Slide 24 CSE 160 Chien, Spring 2005

    SISD

    • SISD
      » Model of serial von Neumann machine
      » Logically, single control processor

    [Figure: single processor P connected to memory M.]

    Lecture #6, Slide 25 CSE 160 Chien, Spring 2005

    SIMD

    • SIMD
      » All processors execute the same program in lockstep
      » Data that each processor sees may be different
      » Single control processor
      » Individual processors can be turned on/off at each cycle (“masking”)
      » CM-2, MasPar are some examples

    Lecture #6, Slide 26 CSE 160 Chien, Spring 2005

    CM2 Architecture

    • CM2 had 8K-64K 1-bit custom processors
    • Data Vault provides peripheral mass storage
    • Programs had normal sequential control flow, but all operations happened in parallel, so CM HW supported a data-parallel programming model

    [Figure: four 16K-processor banks (Seq 0 to Seq 3) joined by a Nexus to four Frontends; the CM I/O System connects three DataVaults and a Graphic Display.]

    Lecture #6, Slide 27 CSE 160 Chien, Spring 2005

    MIMD

    • All processors execute their own set of instructions
    • Processors operate on separate data streams
    • No centralized clock implied
    • MIMD machines may be shared memory or message-passing
    • SP-2, T3E, Origin 2K, Tera MTA, Clusters, Crays, etc.
    • => Most machines built from traditional processors have this structure

    Lecture #6, Slide 28 CSE 160 Chien, Spring 2005

    Models for Communication

    • Parallel program = program composed of tasks (processes) which communicate to accomplish an overall computational goal
    • Two prevalent models for communication:
      » Message passing (MP)
      » Shared memory (SM)

    Lecture #6, Slide 29 CSE 160 Chien, Spring 2005

    Message Passing Communication

    • Processes in a message passing program communicate by passing messages
    • Basic message passing primitives (see the MPI sketch below)
      » Send(parameter list)
      » Receive(parameter list)
      » Parameters depend on the software and can be complex

    [Figure: process A sends a message to process B.]
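    A minimal MPI version of these primitives in C, illustrative only; MPI_Send and MPI_Recv are the standard blocking calls, and the count/type/tag/communicator parameters show why the slide says the parameter lists can be complex:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, value;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (rank == 0) {                   /* process A */
                value = 42;
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {            /* process B */
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("B received %d from A\n", value);
            }
            MPI_Finalize();
            return 0;
        }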

    Lecture #6, Slide 30 CSE 160 Chien, Spring 2005

    Flavors of message passing

    • Synchronous: used for routines that return when the message transfer has been completed
      » Synchronous send waits until the complete message can be accepted by the receiving process before sending the message (send suspends until receive)
      » Synchronous receive will wait until the message it is expecting arrives (receive suspends until message sent)
      » Also called blocking

    [Figure: A issues a request to send, B returns an acknowledgement, then the message is transferred.]

    Lecture #6, Slide 31 CSE 160 Chien, Spring 2005

    Asynchronous (Non-blocking) Message Passing

    • Nonblocking sends return whether or not the message has been received
      » If the receiving processor is not ready, the message may be stored in a message buffer
      » Message buffer used to hold messages being sent by A prior to being accepted by receive in B
      » MPI (see the sketch below):
        – routines that use a message buffer and return after their local actions complete are blocking (even though message transfer may not be complete)
        – routines that return immediately are non-blocking

    [Figure: A deposits the message in a message buffer; B receives it later.]
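    A sketch of the non-blocking flavor in C, again illustrative: MPI_Isend and MPI_Irecv return immediately, and a later MPI_Waitall confirms completion, so computation can overlap the transfer.

        #include <mpi.h>

        /* Non-blocking exchange sketch: post both operations, compute, wait. */
        void exchange(double *sendbuf, double *recvbuf, int n, int peer)
        {
            MPI_Request reqs[2];

            MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

            /* ... overlap: do work that does not touch either buffer ... */

            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* both now complete */
        }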

    Lecture #6, Slide 32 CSE 160 Chien, Spring 2005

    Architectural support for Message Passing

    • Interconnection network should provide connectivity, low latency, high bandwidth
    • Many interconnection networks developed over the last 2 decades
      » Small-size: non-blocking (Xbar)
      » Medium-size: non-blocking (Xbar)
      » Large-scale: mesh, torus, multi-stage networks

    [Figure: Basic Message Passing Multicomputer: processors, each with local memory, attached to an Interconnection Network.]

    Lecture #6, Slide 33 CSE 160 Chien, Spring 2005

    Shared Memory Communication

    • Processes in a shared memory program communicate by accessing shared variables and data structures
    • Basic shared memory primitives
      » Read to a shared variable (or object)
      » Write to a shared variable (or object)

    [Figure: Basic Shared Memory Multiprocessor Architecture: processors and memories attached to an Interconnection Network.]

    Lecture #6, Slide 34 CSE 160 Chien, Spring 2005

    Accessing Shared Objects

    • Conflicts may arise if multiple processes want to write to a shared variable at the same time.
    • Programmer, language, and/or architecture must provide means of resolving conflicts (see the sketch below)

    [Figure: processes A and B each apply +1 to shared variable x; each runs read x, compute x+1, write x, so an update can be lost.]
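    A minimal C/pthreads sketch of this conflict and one resolution; without the mutex, two threads can interleave read x / compute x+1 / write x and lose increments. The names are mine:

        #include <pthread.h>
        #include <stdio.h>

        static int x = 0;                          /* shared variable */
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        static void *increment(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 100000; i++) {
                pthread_mutex_lock(&lock);         /* resolve the conflict */
                x = x + 1;                         /* read, compute, write */
                pthread_mutex_unlock(&lock);
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t a, b;                        /* processes A and B */
            pthread_create(&a, NULL, increment, NULL);
            pthread_create(&b, NULL, increment, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            printf("x = %d\n", x);                 /* 200000 with the lock */
            return 0;
        }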

    Lecture #6, Slide 35 CSE 160 Chien, Spring 2005

    Architectural Support for Shared Memory

    • 4 basic types of interconnection media:
      » Bus (not really used any more)
      » Crossbar switch
      » Multistage network
      » Interconnection network with distributed shared memory

    Lecture #6, Slide 36 CSE 160 Chien, Spring 2005

    Limited Scalable Media

    • Crossbar
      » Crossbar switch connects m processors and n memories with distinct paths between each processor/memory pair
      » Crossbar provides uniform access to shared memory (UMA)
      » O(mn) switches required for m processors and n memories
      » Crossbar scalable in terms of performance but not in terms of cost; used for basic switching mechanism in SP2

    [Figure: 5x5 crossbar connecting P1-P5 to M1-M5.]

    Lecture #6, Slide 37 CSE 160 Chien, Spring 2005

    Multistage Networks

    • Multistage networks provide more scalable performance than a bus but are less costly to scale than a crossbar
    • Typically max{log n, log m} stages connect n processors and m shared memories
    • “Omega” networks (butterfly, shuffle-exchange) commonly used for multistage networks
    • Multistage networks used for CM-5 (fat-tree connects processor/memory pairs), BBN Butterfly (butterfly), IBM RP3 (omega)

    [Figure: processors P1-P5 connected through Stage 1 ... Stage k to memories M1-M5.]

    Lecture #6, Slide 38 CSE 160 Chien, Spring 2005

    Omega Networks

    • Butterfly multistage
      » Used for BBN Butterfly, TC2000
    • Shuffle multistage
      » Used for RP3, SP2 high performance switch

    [Figure: 4-input butterfly and shuffle multistage diagrams, inputs 1-4 routed through switches A-D.]

    Lecture #6, Slide 39 CSE 160 Chien, Spring 2005

    Fat-tree Interconnect

    • Bandwidth is increased towards the root
    • Used for the data network of the CM-5 (MIMD MPP)
      » 4 leaf nodes, internal nodes have 2 or 4 children
    • To route from leaf A to leaf B, pick a random switch C in the least common ancestor fat node of A and B; take the unique tree route from A to C and from C to B

    [Figure: binary fat-tree in which all internal nodes have two children.]

    Lecture #6, Slide 40 CSE 160 Chien, Spring 2005

    Distributed Shared Memory

    • Memory is physically distributed but programmed as shared memory
      » Programmers find the shared memory paradigm desirable
      » Shared memory distributed among processors; accesses may be sent as messages
      » Access to local memory and global shared memory creates NUMA (non-uniform memory access) architectures
      » BBN Butterfly is a NUMA shared memory multiprocessor

    [Figure: processor-and-local-memory (PM) nodes attached to an Interconnection Network (the BBN Butterfly interconnect); together the local memories form the shared memory.]

    Lecture #6, Slide 41 CSE 160 Chien, Spring 2005

    Using both Shared Memory and Message Passing together

    • Clusters of SMPs may be effectively programmed using both SM and MP (see the sketch below)
      » Shared Memory used within a multiple-processor machine/node
      » Message Passing used between nodes
    • FWGrid has this structure
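    A hedged sketch of the hybrid style in C, combining the MPI and pthreads pieces shown earlier: shared memory threads within a node, message passing between nodes. Function and variable names are illustrative:

        #include <mpi.h>
        #include <pthread.h>

        #define THREADS_PER_NODE 2

        static double partial[THREADS_PER_NODE];   /* shared within the node */

        static void *node_work(void *arg)
        {
            int t = *(int *)arg;
            partial[t] = t + 1.0;                  /* stand-in for real work */
            return NULL;
        }

        int main(int argc, char **argv)
        {
            int rank, ids[THREADS_PER_NODE];
            pthread_t th[THREADS_PER_NODE];
            double local = 0.0, global = 0.0;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* Shared memory inside the node: threads fill a shared array. */
            for (int t = 0; t < THREADS_PER_NODE; t++) {
                ids[t] = t;
                pthread_create(&th[t], NULL, node_work, &ids[t]);
            }
            for (int t = 0; t < THREADS_PER_NODE; t++) {
                pthread_join(th[t], NULL);
                local += partial[t];
            }

            /* Message passing between nodes: combine the per-node sums. */
            MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            MPI_Finalize();
            return 0;
        }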

    Lecture #6, Slide 42 CSE 160 Chien, Spring 2005

    Summary

    • Single Thread Performance is the building block for parallel performance
      » Threads should have high data locality and share caches efficiently for MT, HT, SMT to work well
    • Threads run on multiprocessors must be decoupled to achieve good parallelism
      » High data locality, but moderate sharing of data
    • Parallel Machine Taxonomy
      » SISD (traditional), SIMD, MIMD
      » Shared and Distributed Memory
