NPTEL Online - IIT Bombay
Course Name: Parallel Computer Architecture
Department: Computer Science and Engineering, IIT Kanpur
Instructor: Dr. Mainak Chaudhuri



Module 1: "Multi-core: The Ultimate Dose of Moore's Law"
Lecture 1: "Evolution of Processor Architecture"

The Lecture Contains:
Multi-core: The Ultimate Dose of Moore's Law (a gentle introduction to the multi-core landscape as a tale of four decades of glory and success)
Mind-boggling Trends in Chip Industry
Agenda
Unpipelined Microprocessors
Pipelining
Pipelining Hazards
Control Dependence
Data Dependence
Structural Hazard
Out-of-order Execution
Multiple Issue
Out-of-order Multiple Issue


Module 1: "Multi-core: The Ultimate Dose of Moore's Law"
Lecture 1: "Evolution of Processor Architecture"

Mind-boggling Trends in Chip Industry: A long history since 1971, starting with the introduction of the Intel 4004 (http://www.intel4004.com/). Today we talk about more than one billion transistors on a chip: Intel Montecito (in the market since July 2006) has 1.7B transistors. Die size has increased steadily (what is a die?): Intel Prescott: 112 mm^2, Intel Pentium 4EE: 237 mm^2, Intel Montecito: 596 mm^2. The minimum feature size has shrunk from 10 micron in 1971 to 0.045 micron today.

Agenda: Unpipelined microprocessors; pipelining: the simplest form of ILP; out-of-order execution: more ILP; multiple issue: drink more ILP; scaling issues and Moore's Law; why multi-core; TLP and de-centralized design; tiled CMP and shared cache; implications on software; research directions.

Unpipelined Microprocessors: Typically an instruction enjoys five phases in its life: instruction fetch from memory, instruction decode and operand register read, execute, data memory access, and register write. Unpipelined execution would take a long single cycle or multiple short cycles, and only one instruction is inside the processor at any point in time.


Module 1: "Multi-core: The Ultimate Dose of Moore's Law"
Lecture 1: "Evolution of Processor Architecture"

Pipelining: One simple observation: exactly one piece of hardware is active at any point in time. Why not fetch a new instruction every cycle? Then five instructions are in five different phases and throughput increases five times (ideally). The bottom line: if consecutive instructions are independent, they can be processed in parallel. This is the first form of instruction-level parallelism (ILP).
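To make the "five times (ideally)" claim concrete, a standard back-of-the-envelope count (a generic textbook calculation, not something worked out on this slide) compares an unpipelined machine that spends 5 cycles per instruction against a 5-stage pipeline that retires one instruction per cycle once full:

\[
\text{Speedup}(n) \;=\; \frac{5n}{5 + (n-1)} \;\longrightarrow\; 5 \quad \text{as } n \to \infty
\]

For n = 100 instructions this already gives 500/104, i.e. about 4.8x.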

Pipelining Hazards: Instruction dependence limits achievable parallelism: control and data dependence (aka hazards). A finite amount of hardware also limits achievable parallelism: structural hazards. Control dependence: on average, every fifth instruction is a branch (coming from if-else, for, do-while, etc.). Branches execute in the third phase, which introduces bubbles unless you are smart.

Control Dependence:

What do you fetch in the X and Y slots? Options: nothing, fall-through, or learn past history and predict (today the best predictors achieve on average 97% accuracy for SPEC2000).
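As an illustration of "learn past history and predict", here is a minimal sketch of the classic 2-bit saturating-counter scheme (real predictors that reach the quoted accuracies are far more sophisticated); the table size, indexing, and function names are assumptions for illustration, not part of the lecture.

    #include <stdbool.h>
    #include <stdint.h>

    #define PRED_TABLE_SIZE 4096            /* assumed table size (power of two) */

    /* One 2-bit saturating counter per entry:
       0,1 => predict not-taken; 2,3 => predict taken. */
    static uint8_t counters[PRED_TABLE_SIZE];

    static inline unsigned index_of(uint64_t branch_pc)
    {
        return (branch_pc >> 2) & (PRED_TABLE_SIZE - 1);  /* drop the byte offset */
    }

    /* Prediction made in the fetch stage, before the branch resolves. */
    bool predict_taken(uint64_t branch_pc)
    {
        return counters[index_of(branch_pc)] >= 2;
    }

    /* Training done when the branch resolves (e.g., in the execute stage). */
    void train(uint64_t branch_pc, bool taken)
    {
        uint8_t *c = &counters[index_of(branch_pc)];
        if (taken  && *c < 3) (*c)++;       /* saturate at 3 */
        if (!taken && *c > 0) (*c)--;       /* saturate at 0 */
    }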


Module 1: "Multi-core: The Ultimate Dose of Moore's Law" Lecture 1: "Evolution of Processor Architecture"

Data Dependence:

Take three bubbles? Back-to-back dependence is too frequent. Solution: hardware bypass paths. Allow the ALU to bypass the produced value in time: not always possible.

Need a live bypass! (Requires some negative time travel: not yet feasible in the real world.) No option but to take one bubble. Bigger problems: load latency is often high; you may not find the data in the cache.

Structural Hazard:

The usual solution is to put in more resources.


Module 1: "Multi-core: The Ultimate Dose of Moore's Law" Lecture 1: "Evolution of Processor Architecture"

Out-of-order Execution:


Module 1: "Multi-core: The Ultimate Dose of Moore's Law" Lecture 1: "Evolution of Processor Architecture"

Multiple Issue:

Out-of-order Multiple Issue: Some hardware nightmares: complex issue logic to discover independent instructions; increased pressure on the cache. The impact of a cache miss is much bigger now in terms of lost opportunity, so various speculative techniques are in place to ignore the slow and stupid memory. There is also an increased impact of control dependence: the front end must feed the processor with multiple correct instructions every cycle, and one cycle of bubble means lost opportunity of multiple instructions. Complex logic is needed to verify all this.


Module 6: "Fundamentals of Parallel Computers"
Lecture 10: "Communication Architecture"

Fundamentals of Parallel Computers
Agenda
Communication architecture
Layered architecture
Shared address
Message passing
Convergence
Data parallel arch.
[From Chapter 1 of Culler, Singh, Gupta]


Module 6: "Fundamentals of Parallel Computers"
Lecture 10: "Communication Architecture"

Agenda: Convergence of parallel architectures; fundamental design issues; ILP vs. TLP.

Communication architecture: Historically, parallel architectures were tied to programming models. Diverse designs made it impossible to write portable parallel software, but the driving force was the same: the need for fast processing. Today parallel architecture is seen as an extension of microprocessor architecture with a communication architecture, which defines the basic communication and synchronization operations and provides a hw/sw implementation of those.

Layered architecture: A parallel architecture can be divided into several layers: parallel applications; programming models (shared address, message passing, multiprogramming, data parallel, dataflow etc.); compiler + libraries; operating system support; communication hardware; physical communication medium. Communication architecture = user/system interface + hw implementation (roughly defined by the last four layers). The compiler and OS provide the user interface to communicate between and synchronize threads.


Module 6: "Fundamentals of Parallel Computers"
Lecture 10: "Communication Architecture"

Shared address: Communication takes place through a logically shared portion of memory. The user interface is normal load/store instructions. Load/store instructions generate virtual addresses; the VAs are translated to PAs by the TLB or page table; the memory controller then decides where to find this PA. The actual communication is hidden from the programmer. The general communication hardware consists of multiple processors connected over some medium so that they can talk to memory banks and I/O devices. The architecture of the interconnect may vary depending on projected cost and target performance.

Communication medium: The interconnect could be a crossbar switch so that any processor can talk to any memory bank in one hop (provides latency and bandwidth advantages). Scaling a crossbar becomes a problem: cost is proportional to the square of the size. Instead, one could use a scalable switch-based network; latency increases and bandwidth decreases because now multiple processors contend for switch ports. From the mid-80s the shared bus became popular, leading to the design of SMPs. The Pentium Pro Quad was the first commodity SMP. The Sun Enterprise server provided a highly pipelined wide shared bus for scalability reasons; it also distributed the memory to each processor, but there was no local bus on the boards, i.e. the memory was still symmetric (must use the shared bus). NUMA or DSM architectures provide a better solution to the scalability problem; the symmetric view is replaced by local and remote memory, and each node (containing processor(s) with caches, memory controller and router) gets connected via a scalable network (mesh, ring etc.); examples include Cray/SGI T3E, SGI Origin 2000, Alpha GS320, Alpha/HP GS1280 etc.


Module 6: "Fundamentals of Parallel Computers"
Lecture 10: "Communication Architecture"

Message passing: Very popular for large-scale computing. The system architecture looks exactly the same as DSM, but there is no shared memory. The user interface is via send/receive calls to the message layer, which is integrated into the I/O system instead of the memory system. Send specifies a local data buffer that needs to be transmitted; send also specifies a tag. A matching receive at the destination node with the same tag reads in the data from a kernel-space buffer to user memory. Effectively, this provides a memory-to-memory copy. Actual implementation of the message layer: initially it was very topology dependent. A node could talk only to its neighbors through FIFO buffers. These buffers were small in size, and therefore while sending a message the send would occasionally block waiting for the receive to start reading the buffer (synchronous message passing). Soon the FIFO buffers got replaced by DMA (direct memory access) transfers so that a send can initiate a transfer from memory to I/O buffers and finish immediately (the DMA happens in the background); the same applies to the receiving end also. The parallel algorithms were designed specifically for certain topologies: a big problem. To improve the usability of machines, the message layer started providing support for arbitrary source and destination (not just nearest neighbors). Essentially this involved storing a message in intermediate hops and forwarding it to the next node on the route. Later this store-and-forward routing got moved to hardware where a switch could handle all the routing activities. This was further improved to pipelined wormhole routing so that the time taken to traverse the intermediate hops became small compared to the time it takes to push the message from processor to network (limited by node-to-network bandwidth). Examples include IBM SP2 and Intel Paragon. Each node of Paragon had two i860 processors, one of which was dedicated to servicing the network (send/recv. etc.).


Module 6: "Fundamentals of Parallel Computers"
Lecture 10: "Communication Architecture"

Convergence: Shared address and message passing are two distinct programming models, but the architectures look very similar. Both have a communication assist or network interface to initiate messages or transactions. In shared memory this assist is integrated with the memory controller; in message passing this assist normally used to be integrated with the I/O, but the trend is changing. There are message passing machines where the assist sits on the memory bus, or machines where DMA over the network is supported (direct transfer from source memory to destination memory). Finally, it is possible to emulate send/recv. on shared memory through shared buffers and flags, and it is possible to emulate a shared virtual memory on message passing machines through modified page fault handlers.
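As a concrete illustration of emulating send/recv. on shared memory, a minimal single-slot mailbox built from a shared buffer and a flag might look like the sketch below. The names, the single-slot design, and the message size are illustrative assumptions, and a real implementation on a weakly ordered machine would also need memory barriers around the flag updates.

    #include <string.h>

    #define MSG_BYTES 64

    /* One-slot mailbox allocated in shared memory: 'full' acts as the flag. */
    struct mailbox {
        volatile int full;        /* 0: empty, 1: message present */
        char data[MSG_BYTES];     /* shared buffer                */
    };

    /* "send": copy into the shared buffer, then raise the flag. */
    void mbox_send(struct mailbox *m, const void *buf, int n)
    {
        while (m->full);          /* wait until the previous message is consumed */
        memcpy(m->data, buf, n);
        m->full = 1;              /* set the flag last, after the data is in place */
    }

    /* "receive": wait for the flag, copy out, then clear the flag. */
    void mbox_recv(struct mailbox *m, void *buf, int n)
    {
        while (!m->full);         /* spin until a message arrives */
        memcpy(buf, m->data, n);
        m->full = 0;              /* hand the slot back to the sender */
    }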

Data parallel arch.: An array of processing elements (PEs). Each PE operates on a data element within a large matrix. The operation is normally specified by a control processor: essentially, single-instruction-multiple-data (SIMD) architectures, so the parallelism is exposed at the data level. Processor arrays were outplayed by vector processors in the mid-70s: vector processors provide a more general framework to operate on large matrices in a controlled fashion, with no need to design a specialized processor array in a certain topology. Advances in VLSI circuits in the mid-80s led to the design of large arrays of single-bit PEs; also, arbitrary communication (rather than just nearest neighbor) was made possible. Gradually, this architecture evolved into SPMD (single-program-multiple-data): all processors execute the same copy of a program in a more controlled fashion, but parallelism is expressed by partitioning the data. This is essentially the same as the way shared memory or message passing machines are used for running parallel applications.


Module 6: "Fundamentals of Parallel Computers"
Lecture 11: "Design Issues in Parallel Computers"

Fundamentals of Parallel Computers
Dataflow architecture
Systolic arrays
A generic architecture
Design issues
Naming
Operations
Ordering
Replication
Communication cost
ILP vs. TLP
[From Chapter 1 of Culler, Singh, Gupta]


Module 6: "Fundamentals of Parallel Computers"
Lecture 11: "Design Issues in Parallel Computers"

Dataflow architecture: Express the program as a dataflow graph. The logical processor at each node is activated when both operands are available. The mapping of logical nodes to PEs is specified by the program. On finishing an operation, a message or token is sent to the destination processor. Arriving tokens are matched against a token store, and a match triggers the operation.

Systolic arrays: Replace the pipeline within a sequential processor by an array of PEs.

Each PE may have a small instruction and data memory and may carry out a different operation. Data proceeds through the array at regular heartbeats (hence the name). The dataflow may be multi-directional or optimized for specific algorithms. The interconnect is optimized for the specific application (not necessarily a linear topology). A practical implementation is iWARP: it uses general purpose processors as PEs, with dedicated channels between PEs for direct register-to-register communication.

A generic architecture: In all the architectures we have discussed thus far, a node essentially contains processor(s) + caches, memory and a communication assist (CA). CA = network interface (NI) + communication controller. The nodes are connected over a scalable network. The main difference remains in the architecture of the CA, and even under a particular programming model (e.g., shared memory) there are a lot of choices in the design of the CA. Most innovations in parallel architecture take place in the communication assist (also called the communication controller or node controller).


Module 6: "Fundamentals of Parallel Computers"
Lecture 11: "Design Issues in Parallel Computers"

Design issues: We need to understand the architectural components that affect software (compiler, library, program), the user/system interface and the hw/sw interface. How do programming models efficiently talk to the communication architecture? How do we implement efficient primitives in the communication layer? In a nutshell, what issues of a parallel machine will affect the performance of parallel applications? Naming, operations, ordering, replication, communication cost.

Naming: How are the data in a program referenced? In sequential programs a thread can access any variable in its virtual address space. In shared memory programs a thread can access any private or shared variable (the same load/store model as sequential programs). In message passing programs a thread can access local data directly. Clearly, naming requires some support from hw and the OS: we need to make sure that the accessed virtual address gets translated to the correct physical address.

Operations: What operations are supported to access data? For sequential and shared memory models load/store are sufficient. For message passing models send/receive are needed to access remote data. For shared memory, hw (essentially the CA) needs to make sure that a load/store operation gets correctly translated to a message if the address is remote. For message passing, the CA or the message layer needs to copy data from local memory and initiate a send, or copy data from the receive buffer to the user area in local memory.

Ordering: How are the accesses to the same data ordered? For the sequential model, it is the program order: true dependence order. For shared memory, within a thread it is the program order; across threads it is some valid interleaving of accesses as expected by the programmer and enforced by synchronization operations (locks, point-to-point synchronization through flags, global synchronization through barriers). Ordering issues are very subtle and important in the shared memory model (some microprocessor re-ordering tricks may easily violate correctness when used in a shared memory context). For message passing, ordering across threads is implied through point-to-point send/receive pairs (producer-consumer relationship) and mutual exclusion is inherent (no shared variable).


Module 6: "Fundamentals of Parallel Computers"
Lecture 11: "Design Issues in Parallel Computers"

Replication: How is shared data locally replicated? This is very important for reducing communication traffic. In microprocessors data is replicated in the cache to reduce memory accesses. In message passing, replication is explicit in the program and happens through receive (a private copy is created). In shared memory a load brings the data into the cache hierarchy so that subsequent accesses can be fast; this is totally hidden from the program, and therefore the hardware must provide a layer that keeps track of the most recent copies of the data (this layer is central to the performance of shared memory multiprocessors and is called the cache coherence protocol).

Communication cost: Three major components of the communication architecture affect performance. Latency: the time to do an operation (e.g., load/store or send/recv.). Bandwidth: the rate of performing an operation. Overhead or occupancy: how long the communication layer is occupied doing an operation. Latency is already a big problem for microprocessors and an even bigger problem for multiprocessors due to remote operations; one must optimize the application or the hardware to hide or lower latency (algorithmic optimizations, prefetching, or overlapping computation with communication). Bandwidth: how many ops in unit time, e.g. how many bytes are transferred per second. Local BW is provided by heavily banked memory or a faster and wider system bus. Communication BW has two components: (1) node-to-network BW (also called network link BW), which measures how fast bytes can be pushed into the router from the CA, and (2) within-network bandwidth, which is affected by the scalability of the network and the architecture of the switch or router. Linear cost model: transfer time = T_0 + n/B, where T_0 is the start-up overhead, n is the number of bytes transferred and B is the BW. This is not sufficient, since overlap of computation and communication is not considered; it also does not count how the transfer is done (pipelined or not). Better model: communication time for n bytes = overhead + CA occupancy + network latency + size/BW + contention, i.e. T(n) = O_v + O_c + L + n/B + T_c. Overhead and occupancy may be functions of n. Contention depends on the queuing delay at various components along the communication path, e.g. the waiting time at the communication assist or controller, the waiting time at the router, etc. Overall communication cost = frequency of communication x (communication time - overlap with useful computation). The frequency of communication depends on various factors such as how the program is written or the granularity of communication supported by the underlying hardware.
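A small sketch of the cost model above in code form; the structure-field names are only a direct transcription of the symbols in the formula, and any numbers you plug in are your own measurements, not values from the lecture.

    /* Communication time for an n-byte transfer, per the model
       T(n) = O_v + O_c + L + n/B + T_c discussed above. */
    typedef struct {
        double O_v;   /* software overhead per message (seconds)   */
        double O_c;   /* communication-assist occupancy (seconds)  */
        double L;     /* network latency (seconds)                 */
        double B;     /* bandwidth (bytes per second)              */
        double T_c;   /* contention / queuing delay (seconds)      */
    } comm_params;

    double comm_time(double n_bytes, comm_params p)
    {
        return p.O_v + p.O_c + p.L + n_bytes / p.B + p.T_c;
    }

    /* Overall cost charged to the program: frequency of communication times
       the part of the communication time not hidden by useful computation. */
    double comm_cost(double messages, double n_bytes, double overlap, comm_params p)
    {
        double exposed = comm_time(n_bytes, p) - overlap;
        return messages * (exposed > 0.0 ? exposed : 0.0);
    }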

ILP vs. TLP: Microprocessors enhance the performance of a sequential program by extracting parallelism from an instruction stream (called instruction-level parallelism). Multiprocessors enhance the performance of an explicitly parallel program by running multiple threads in parallel (called thread-level parallelism). TLP provides parallelism at a much larger granularity compared to ILP. In multiprocessors ILP and TLP work together: within a thread ILP provides a performance boost, while across threads TLP provides speedup over a sequential version of the parallel program.


Module 7: "Parallel Programming"
Lecture 12: "Steps in Writing a Parallel Program"

Parallel Programming
Prolog: Why bother?
Agenda
Ocean current simulation
Galaxy simulation
Ray tracing
Writing a parallel program
Some definitions
Decomposition of Iterative Equation Solver
Static assignment
Dynamic assignment
Decomposition types
Orchestration
Mapping
An example
Sequential program
[From Chapter 2 of Culler, Singh, Gupta]


Module 7: "Parallel Programming"
Lecture 12: "Steps in Writing a Parallel Program"

Prolog: Why bother? As an architect, why should you be concerned with parallel programming? Understanding program behavior is very important in developing high-performance computers. An architect designs machines that will be used by software programmers, so one needs to understand the needs of a program. It helps in making design trade-offs and cost/performance analysis, i.e. what hardware feature is worth supporting and what is not. Normally an architect needs to have a fairly good knowledge of compilers and operating systems.

Agenda: Parallel application case studies; steps in writing a parallel program; an example.

Ocean current simulation: Regular structure, scientific computing, important for weather forecasting. We want to simulate the eddy currents along the walls of an ocean basin over a period of time. Discretize the 3-D basin into 2-D horizontal grids, and discretize each 2-D grid into points. One time step involves solving the equation of motion for each grid point. There is enough concurrency within and across grids. After each time step the processors synchronize.

Galaxy simulation: Simulate the interaction of many stars evolving over time. We want to compute the force between every pair of stars for each time step: essentially O(n^2) computations (massive parallelism). Hierarchical methods take advantage of the inverse-square law: if a group of stars is far enough away, it is possible to approximate the group entirely by a single star at its center of mass. There are essentially four subparts in each step: divide the galaxy into zones until further division does not improve accuracy, compute the center of mass for each zone, compute forces, and update star positions based on the forces. There is a lot of concurrency across stars.


Module 7: "Parallel Programming"
Lecture 12: "Steps in Writing a Parallel Program"

Ray tracing: We want to render a scene using ray tracing. Generate rays through pixels in the image plane; the rays bounce off objects following reflection/refraction laws, and new rays get generated: a tree of rays grows from each root ray. We need to correctly simulate the paths of all rays. The outcome is the color and opacity of the objects in the scene: thus you render the scene. There is concurrency across ray trees and subtrees.

Writing a parallel program: Start from a sequential description. Identify work that can be done in parallel. Partition work and/or data among threads or processes: decomposition and assignment. Add the necessary communication and synchronization: orchestration. Map threads to processors: mapping. How good is the parallel program? Measure speedup = sequential execution time / parallel execution time, which ideally equals the number of processors (for example, a program taking 100 seconds sequentially and 30 seconds on four processors has a speedup of 3.33 against an ideal of 4).

Some definitions: Task: an arbitrary piece of sequential work; concurrency is only across tasks; fine-grained vs. coarse-grained tasks control the granularity of parallelism (a spectrum of grain: from one instruction to the whole sequential program). Process/thread: the logical entity that performs a task; communication and synchronization happen between threads. Processor: the physical entity on which one or more processes execute.

Decomposition of Iterative Equation Solver: Find concurrent tasks and divide the program into tasks. The level or grain of concurrency needs to be decided here. Too many tasks may lead to too much overhead communicating and synchronizing between tasks; too few tasks may lead to idle processors. Goal: just enough tasks to keep the processors busy. The number of tasks may vary dynamically: new tasks may get created as the computation proceeds (new rays in ray tracing). The number of available tasks at any point in time is an upper bound on the achievable speedup.


Module 7: "Parallel Programming"
Lecture 12: "Steps in Writing a Parallel Program"

Static assignment: Given a decomposition it is possible to assign tasks statically. For example, some computation on an array of size N can be decomposed statically by assigning a range of indices to each process: for k processes, P0 operates on indices 0 to (N/k)-1, P1 operates on N/k to (2N/k)-1, ..., Pk-1 operates on (k-1)N/k to N-1. For regular computations this works great: simple and low-overhead. What if the nature of the computation depends on the index? For certain index ranges you do some heavy-weight computation while for others you do something simple. Is there a problem?
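A minimal sketch of the static block assignment just described; the helper names (my_block, update) are illustrative assumptions, and the last process simply absorbs any remainder, matching the (k-1)N/k to N-1 range above.

    /* Static block assignment of N array elements to k processes.
       Process pid works on indices [pid*(N/k), (pid+1)*(N/k) - 1]. */
    void my_block(int pid, int k, int N, int *lo, int *hi)
    {
        *lo = pid * (N / k);
        *hi = (pid + 1) * (N / k) - 1;
        if (pid == k - 1) *hi = N - 1;   /* last process absorbs the remainder */
    }

    /* Every process runs the same loop over its own range. */
    void process_my_part(int pid, int k, int N, float *a)
    {
        int lo, hi, i;
        my_block(pid, k, N, &lo, &hi);
        for (i = lo; i <= hi; i++)
            a[i] = a[i] * 2.0f;          /* placeholder for the real per-index work */
    }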

Dynamic assignment: Static assignment may lead to load imbalance depending on how irregular the application is. Dynamic decomposition/assignment solves this issue by allowing a process to dynamically choose any available task whenever it is done with its previous task. Normally in this case you decompose the program in such a way that the number of available tasks is larger than the number of processes. Same example: divide the array into portions each with 10 indices, so you have N/10 tasks; an idle process grabs the next available task. This provides better load balance, since long tasks can execute concurrently with smaller ones. Dynamic assignment comes with its own overhead: you now need to maintain a shared count of the number of available tasks, and the update of this variable must be protected by a lock. One needs to be careful that this lock contention does not outweigh the benefits of dynamic decomposition. In more complicated applications a task may not just operate on an index range, but could manipulate a subtree or a complex data structure; normally a dynamic task queue is maintained where each task is probably a pointer to the data, and the task queue gets populated as new tasks are discovered.
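A minimal sketch of the shared-counter style of dynamic assignment described above, written with the LOCK/UNLOCK-style macros that appear later in these lectures; the macro usage, the 10-index task granularity, and all identifiers are assumptions for illustration, not the lecture's code.

    /* Shared state, allocated once in shared memory (e.g., via G_MALLOC). */
    struct work_pool {
        LOCKDEC (task_lock);   /* protects next_task                  */
        int next_task;         /* index of the next unclaimed task    */
        int num_tasks;         /* total number of tasks, e.g. N/10    */
    };

    /* Grab the next available task; returns -1 when the pool is empty. */
    int grab_task (struct work_pool *pool)
    {
        int t = -1;
        LOCK (pool->task_lock);
        if (pool->next_task < pool->num_tasks)
            t = pool->next_task++;        /* claim one task while holding the lock */
        UNLOCK (pool->task_lock);
        return t;
    }

    /* Worker loop run by every process. */
    void worker (struct work_pool *pool, float *a, int N)
    {
        int t, i, lo, hi;
        while ((t = grab_task (pool)) != -1) {
            lo = t * 10;                           /* each task covers 10 indices */
            hi = (lo + 10 < N) ? lo + 10 : N;
            for (i = lo; i < hi; i++)
                a[i] = a[i] * 2.0f;                /* placeholder per-index work */
        }
    }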

Decomposition types: Decomposition by data: the most commonly found decomposition technique. The data set is partitioned into several subsets and each subset is assigned to a process; the type of computation may or may not be identical on each subset. Very easy to program and manage. Computational decomposition: not so popular, tricky to program and manage. All processes operate on the same data, but probably carry out different kinds of computation. More common in systolic arrays, pipelined graphics processing units (GPUs) etc.


Orchestration: Involves structuring communication and synchronization among processes, organizing data structures to improve locality, and scheduling tasks. This step normally depends on the programming model and the underlying architecture. The goal is to reduce communication and synchronization costs, maximize locality of data reference, schedule tasks to maximize concurrency (do not schedule dependent tasks in parallel), and reduce the overhead of parallelization and concurrency management (e.g., management of the task queue, overhead of initiating a task, etc.).


Module 7: "Parallel Programming"
Lecture 12: "Steps in Writing a Parallel Program"

Mapping: At this point you have a parallel program; you just need to decide which and how many processes go to each processor of the parallel machine. This could be specified by the program: pin particular processes to a particular processor for the whole life of the program (the processes cannot migrate to other processors). It could instead be controlled entirely by the OS: schedule processes on idle processors. Various scheduling algorithms are possible, e.g. round robin: process #k goes to processor #k. A NUMA-aware OS normally takes multiprocessor-specific metrics into account in scheduling. How many processes per processor? Most common is one-to-one.

An example: Iterative equation solver, the main kernel in the Ocean simulation. Update each 2-D grid point via Gauss-Seidel iterations: A[i,j] = 0.2*(A[i,j] + A[i,j+1] + A[i,j-1] + A[i+1,j] + A[i-1,j]). Pad the n by n grid to (n+2) by (n+2) to avoid corner problems, and update only the interior n by n grid. One iteration consists of updating all n^2 points in place and accumulating the difference from the previous value at each point. If the accumulated difference (averaged over the grid) is less than a threshold, the solver is said to have converged to a stable grid equilibrium.

Sequential program:

int n;
float **A, diff;

begin main()
    read (n);        /* size of grid */
    Allocate (A);
    Initialize (A);
    Solve (A);
end main

begin Solve (A)
    int i, j, done = 0;
    float temp;
    while (!done)
        diff = 0.0;
        for i = 0 to n-1
            for j = 0 to n-1
                temp = A[i,j];
                A[i,j] = 0.2*(A[i,j] + A[i,j+1] + A[i,j-1] + A[i-1,j] + A[i+1,j]);
                diff += fabs (A[i,j] - temp);
            endfor
        endfor
        if (diff/(n*n) < TOL) then done = 1;
    endwhile
end Solve


Module 7: "Parallel Programming"
Lecture 13: "Parallelizing a Sequential Program"

Parallel Programming
Decomposition of Iterative Equation Solver
Assignment
Shared memory version
Mutual exclusion
LOCK optimization
More synchronization
Message passing
Major changes
Message passing
Message Passing Grid Solver
MPI-like environment
[From Chapter 2 of Culler, Singh, Gupta]


Module 7: "Parallel Programming"
Lecture 13: "Parallelizing a Sequential Program"

Decomposition of Iterative Equation Solver: Look for concurrency in loop iterations. In this case the iterations are really dependent: iteration (i, j) depends on iterations (i, j-1) and (i-1, j).

Each anti-diagonal can be computed in parallel, but we must synchronize after each anti-diagonal (or use point-to-point synchronization). Alternative: red-black ordering (a different update pattern): update all red points first, synchronize globally with a barrier, and then update all black points. This may converge faster or slower compared to the sequential program, and the converged equilibrium may also be different if there are multiple solutions. The Ocean simulation uses this decomposition. We will instead ignore the loop-carried dependence and go ahead with a straightforward loop decomposition: allow updates to all points in parallel. This is yet another different update order and may affect convergence; an update to a point may or may not see the new updates to its nearest neighbors (this parallel algorithm is non-deterministic).

while (!done)
    diff = 0.0;
    for_all i = 0 to n-1
        for_all j = 0 to n-1
            temp = A[i, j];
            A[i, j] = 0.2*(A[i, j] + A[i, j+1] + A[i, j-1] + A[i-1, j] + A[i+1, j]);
            diff += fabs (A[i, j] - temp);
        end for_all
    end for_all
    if (diff/(n*n) < TOL) then done = 1;
end while

This offers concurrency across elements: the degree of concurrency is n^2. Make the j loop sequential to have a row-wise decomposition: degree-n concurrency.


Module 7: "Parallel Programming"
Lecture 13: "Parallelizing a Sequential Program"

Assignment: Possible static assignment: block row decomposition: process 0 gets rows 0 to (n/p)-1, process 1 gets rows n/p to (2n/p)-1, etc. Another static assignment: cyclic row decomposition: process 0 gets rows 0, p, 2p, ...; process 1 gets rows 1, p+1, 2p+1, ... Dynamic assignment: grab the next available row, work on it, grab a new row, ... Static block row assignment minimizes nearest-neighbor communication by assigning contiguous rows to the same process.
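As a small sketch (the loop form and the update_row helper are illustrative assumptions, not the lecture's code), the two static assignments differ only in the row loop each process executes:

    /* Block row decomposition: process pid owns n/p contiguous rows. */
    void update_block_rows (int pid, int p, int n)
    {
        int i;
        for (i = pid * (n / p); i < (pid + 1) * (n / p); i++)
            update_row (i);      /* assumed helper that updates row i of the grid */
    }

    /* Cyclic row decomposition: process pid owns rows pid, pid+p, pid+2p, ... */
    void update_cyclic_rows (int pid, int p, int n)
    {
        int i;
        for (i = pid; i < n; i += p)
            update_row (i);
    }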

Shared memory version:

/* include files */
MAIN_ENV;
int P, n;
void Solve ();
struct gm_t {
    LOCKDEC (diff_lock);
    BARDEC (barrier);
    float **A, diff;
} *gm;

int main (int argc, char **argv)
{
    int i;
    MAIN_INITENV;
    gm = (struct gm_t*) G_MALLOC (sizeof (struct gm_t));
    LOCKINIT (gm->diff_lock);
    BARINIT (gm->barrier);
    n = atoi (argv[1]);
    P = atoi (argv[2]);
    gm->A = (float**) G_MALLOC ((n+2)*sizeof (float*));
    for (i = 0; i < n+2; i++) {
        gm->A[i] = (float*) G_MALLOC ((n+2)*sizeof (float));
    }
    Initialize (gm->A);
    for (i = 1; i < P; i++) {    /* starts at 1 */
        CREATE (Solve);
    }
    Solve ();
    WAIT_FOR_END (P-1);
    MAIN_END;
}

void Solve (void)
{
    int i, j, pid, done = 0;
    float temp, local_diff;
    GET_PID (pid);
    while (!done) {
        local_diff = 0.0;
        if (!pid) gm->diff = 0.0;
        BARRIER (gm->barrier, P);    /* why? */
        for (i = pid*(n/P); i < (pid+1)*(n/P); i++) {
            for (j = 0; j < n; j++) {
                temp = gm->A[i][j];
                gm->A[i][j] = 0.2*(gm->A[i][j] + gm->A[i][j-1] + gm->A[i][j+1]
                                   + gm->A[i+1][j] + gm->A[i-1][j]);
                local_diff += fabs (gm->A[i][j] - temp);
            } /* end for */
        } /* end for */
        LOCK (gm->diff_lock);
        gm->diff += local_diff;
        UNLOCK (gm->diff_lock);
        BARRIER (gm->barrier, P);
        if (gm->diff/(n*n) < TOL) done = 1;
        BARRIER (gm->barrier, P);    /* why? */
    } /* end while */
}


Module 7: "Parallel Programming"
Lecture 13: "Parallelizing a Sequential Program"

Mutual exclusion: Use LOCK/UNLOCK around critical sections. Updates to the shared variable diff must be sequential. Heavily contended locks may degrade performance. Try to minimize the use of critical sections: they are sequential anyway and will limit speedup. This is the reason for using a local_diff instead of accessing gm->diff every time. Also, minimize the size of the critical section, because the longer you hold the lock, the longer the waiting time for other processors at lock acquire.

LOCK optimization: Suppose each processor updates a shared variable holding a global cost value, only if its local cost is less than the global cost: found frequently in minimization problems.

LOCK (gm->cost_lock);
if (my_cost < gm->cost) {
    gm->cost = my_cost;
}
UNLOCK (gm->cost_lock);
/* May lead to heavy lock contention if everyone tries to update at the same time */

if (my_cost < gm->cost) {
    LOCK (gm->cost_lock);
    if (my_cost < gm->cost) {    /* make sure */
        gm->cost = my_cost;
    }
    UNLOCK (gm->cost_lock);
}
/* This works because gm->cost is monotonically decreasing */

More synchronization: Global synchronization through barriers, often used to separate computation phases. Point-to-point synchronization: a process directly notifies another about a certain event on which the latter was waiting; the producer-consumer communication pattern. Semaphores are used for concurrent programming on uniprocessors through P and V functions. On shared memory multiprocessors this is normally implemented through flags (busy wait or spin):

P0: A = 1; flag = 1;
P1: while (!flag); use (A);


Module 7: "Parallel Programming"
Lecture 13: "Parallelizing a Sequential Program"

Message passing: What is different from shared memory? There is no shared variable: communication is exposed through send/receive. There is no lock or barrier primitive; synchronization must be implemented through send/receive. Grid solver example: P0 allocates and initializes matrix A in its local memory, then sends the block rows, n, and P to each processor, i.e. P1 waits to receive rows n/P to 2n/P-1, etc. (this is one-time). Within the while loop, the first thing that every processor does is send its first and last rows to the upper and lower processors (corner cases need to be handled). Then each processor waits to receive the neighboring two rows from the upper and lower processors. At the end of the loop each processor sends its local_diff to P0, and P0 sends back the done flag.

Major changes


Module 7: "Parallel Programming"
Lecture 13: "Parallelizing a Sequential Program"

Message passing: This algorithm is deterministic. It may converge to a different solution compared to the shared memory version if there are multiple solutions: why? There is a fixed, specific point in the program (at the beginning of each iteration) when the neighboring rows are communicated; this is not true for shared memory.

Message Passing Grid Solver: MPI-like environment. MPI stands for Message Passing Interface: a C library that provides a set of message passing primitives (e.g., send, receive, broadcast, etc.) to the user. PVM (Parallel Virtual Machine) is another well-known platform for message passing programming. Background in MPI is not necessary for understanding this lecture. You only need to know that when you start an MPI program, every thread runs the same main function. We will assume that we pin one thread to one processor, just as we did in shared memory. Instead of using the exact MPI syntax we will use some macros that call the MPI functions.

MAIN_ENV;
/* define message tags */
#define ROW 99
#define DIFF 98
#define DONE 97

int main(int argc, char **argv)
{
    int pid, P, done, i, j, N;
    float tempdiff, local_diff, temp, **A;
    MAIN_INITENV;
    GET_PID(pid);
    GET_NUMPROCS(P);
    N = atoi(argv[1]);
    tempdiff = 0.0;
    done = 0;
    A = (float **) malloc ((N/P+2) * sizeof(float *));
    for (i=0; i < N/P+2; i++) {
        A[i] = (float *) malloc (sizeof(float) * (N+2));
    }
    initialize(A);
    while (!done) {
        local_diff = 0.0;
        /* MPI_CHAR means raw byte format */
        if (pid) { /* send my first row up */
            SEND(&A[1][1], N*sizeof(float), MPI_CHAR, pid-1, ROW);
        }
        if (pid != P-1) { /* recv last row */
            RECV(&A[N/P+1][1], N*sizeof(float), MPI_CHAR, pid+1, ROW);
        }
        if (pid != P-1) { /* send last row down */
            SEND(&A[N/P][1], N*sizeof(float), MPI_CHAR, pid+1, ROW);
        }
        if (pid) { /* recv first row from above */
            RECV(&A[0][1], N*sizeof(float), MPI_CHAR, pid-1, ROW);
        }
        for (i=1; i >= 1) { flag[pid - mask][MAX-1] = 1; } }


Convince yourself that this works. Take 8 processors and arrange them on the leaves of a tree of depth 3. You will find that only odd nodes move up at every level during acquire (implemented in the first for loop). The even nodes just set the flags (the first statement in the if condition): they bail out of the first loop with mask=1. The release is initiated by the last processor in the last for loop; only odd nodes execute this loop (7 wakes up 3, 5, 6; 5 wakes up 4; 3 wakes up 1, 2; 1 wakes up 0). Each processor will need at most log(P) + 1 flags. Avoid false sharing: allocate each processor's flags on a separate chunk of cache lines. With some memory wastage (possibly worth it), allocate each processor's flags on a separate page and map that page locally in that processor's physical memory. This avoids remote misses in a DSM multiprocessor; it does not matter in bus-based SMPs.
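The barrier code itself is truncated in this transcript, so here is a hedged reconstruction of a tournament-style tree barrier that matches the description above and the surviving code fragment; the flag array layout, MAX, and the variable names are assumptions inferred from that fragment, not the lecture's exact code.

    #define MAX   8                      /* log2(P) + 1 flag slots per processor */
    volatile int flag[128][MAX];         /* one row per processor; slot MAX-1 is the release flag */

    void tree_barrier (int pid, int P)   /* P assumed to be a power of two */
    {
        int mask, round;

        /* Acquire: at each level the partner with (pid & mask) == 0 announces its
           arrival and drops out; the other partner waits for it and moves up. */
        for (mask = 1, round = 0; mask < P; mask <<= 1, round++) {
            if ((pid & mask) == 0) {
                flag[pid][round] = 1;            /* announce arrival to my partner     */
                while (!flag[pid][MAX - 1]);     /* wait for the release to reach me   */
                flag[pid][MAX - 1] = 0;
                break;                           /* bail out; mask records my level    */
            }
            while (!flag[pid - mask][round]);    /* wait for my partner at this level  */
            flag[pid - mask][round] = 0;
        }

        /* Release: the overall winner (pid == P-1) starts it; every woken node then
           wakes the partners of the levels below the one at which it dropped out.
           For P = 8: 7 wakes 3, 5, 6; then 3 wakes 1, 2; 5 wakes 4; 1 wakes 0. */
        for (mask >>= 1; mask >= 1; mask >>= 1) {
            flag[pid - mask][MAX - 1] = 1;
        }
    }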


Module 11: "Synchronization"
Lecture 23: "Barriers and Speculative Synchronization"

Hardware support: Read broadcast. It is possible to reduce the number of bus transactions from P-1 to 1 in the best case: a processor seeing a read miss to the flag location (possibly from a fellow processor) backs off and does not put its read miss on the bus; every processor picks up the read reply from the bus and the release completes with one bus transaction. This needs special hardware/compiler support to recognize these flag addresses and resort to read broadcast.

Hardware barrier: Useful if the frequency of barriers is high. Need a couple of wired-AND bus lines: one for odd barriers and one for even barriers. A processor arrives at the barrier, asserts its input line and waits for the wired-AND line output to go HIGH. Not very flexible: it assumes that all processors will always participate in all barriers. A bigger problem: what if multiple processes belonging to the same parallel program are assigned to each processor? No SMP supports it today. However, it is possible to provide flexible hardware barrier support in the memory controller of DSM multiprocessors: the memory controller can recognize accesses to a special barrier counter or barrier flag, combine them in memory and reply to the processors only when the barrier is complete (no retry due to a failed lock).

Speculative synch.: Speculative synchronization. The basic idea is to introduce speculation into the execution of critical sections. Assume that no other processor will have conflicting data accesses in the critical section and hence don't even try to acquire the lock; just venture into the critical section and start executing. Note the difference between this and speculative execution of a critical section due to speculation on the branch following an SC: there you still contend for the lock, generating network transactions. Martinez and Torrellas, ASPLOS 2002; Rajwar and Goodman, ASPLOS 2002. We will discuss Martinez and Torrellas.

Why is it good? In many cases the compiler/user inserts synchronization conservatively: it is hard to know the exact access pattern, and the addresses accessed may depend on the input. Take a simple example of a hash table: when the hash table is updated by two processes you really do not know which bins they will insert into, so you conservatively make the hash table access a critical section.


For certain input values it may happen that the processes could actually update the hash table concurrently


Module 11: "Synchronization"
Lecture 23: "Barriers and Speculative Synchronization"

How does it work? Speculative locks: Every processor comes to the critical section and tries to acquire the lock. One of them succeeds and the rest fail. The successful processor becomes the safe thread. The failed ones don't retry but venture into the critical section speculatively as if they had the lock; at this point a speculative thread also takes a checkpoint of its register state in case a rollback is needed. The safe thread executes the critical section as usual. The speculative threads are allowed to consume values produced by the safe thread, but not by the speculative threads. All stores from a speculative thread are kept inside its cache hierarchy in a special speculative modified state; these lines cannot be sent to memory until they are known to be safe; if such a line is replaced from the cache, either it can be kept in a small buffer or the thread can be stalled. Speculative locks (continued): If a speculative thread receives a request for a cache line that is in speculative M state, that means there is a data race inside the critical section, and by design the receiver thread is rolled back to the beginning of the critical section. Why can't the requester thread be rolled back? In summary, the safe thread is never squashed, and the speculative threads are not squashed if there is no cross-thread data race. If a speculative thread finishes executing the critical section without getting squashed, it still must wait for the safe thread to finish the critical section before committing the speculative state (i.e. changing speculative M lines to M); why? Speculative locks (continued): Upon finishing the critical section, a speculative thread can continue executing beyond the CS, but it still remains in speculative mode. When the safe thread finishes the CS, all speculative threads that have already completed the CS can commit in some non-deterministic order and revert to normal execution. The speculative threads that are still inside the critical section remain speculative; a dedicated hardware unit elects one of them the lock owner and that one becomes the safe, non-speculative thread; the process continues. Clearly, under favorable conditions speculative synchronization can reduce lock contention enormously.


Module 11: "Synchronization"
Lecture 23: "Barriers and Speculative Synchronization"

Why is it correct? In a non-speculative setting there is no order in which the threads execute the CS; even if there is an order, that must be enforced by the program itself. In speculative synchronization some threads are considered safe (depending on time of arrival) and there is exactly one safe thread at a time in a CS. The speculative threads behave as if they complete the CS in some order after the safe thread(s). A read from a thread (speculative or safe) after a write from another speculative thread to the same cache line triggers a squash: it may not be correct to consume the speculative value. The same applies to write after write.

Performance concerns: Maintaining a safe thread guarantees forward progress: otherwise, if all threads were speculative, cross-thread races might repeatedly squash all of them. False sharing? What if two bins of a hash table belong to the same cache line? Two threads are really not accessing the same address, but the speculative thread will still suffer a squash. It is possible to maintain per-word speculative state.

Speculative flags and barriers: Speculative flags are easy to support: just continue past an unset flag in speculative mode. The thread that sets the flag is always safe; the thread(s) that read the flag will speculate. Speculative barriers come for free, since barriers use locks and flags. However, since the critical section in a barrier accesses a counter, multiple threads venturing into the CS are guaranteed to have conflicts, so just speculate on the flag and let the critical section be executed conventionally.

Speculative flags and branch prediction:

P0: A = 1; flag = 1;
P1: while (!flag); print A;

Assembly of P1's code:

Loop: lw   register, flag_addr
      beqz register, Loop

What if I pass a hint via the compiler (say, a single bit in each branch instruction) to the branch predictor asking it to always predict not-taken for this branch? Isn't it achieving the same effect as a speculative flag, but with a much simpler technique? No.


Module 12: "Multiprocessors on a Snoopy Bus"
Lecture 24: "Write Serialization in a Simple Design"

Multiprocessors on a Snoopy Bus
Agenda
Correctness goals
A simple design
Cache controller
Snoop logic
Writebacks
A simple design
Inherently non-atomic
Write serialization
Fetch deadlock
Livelock
Starvation
More on LL/SC
Multi-level caches
[From Chapter 6 of Culler, Singh, Gupta]


Module 12: "Multiprocessors on a Snoopy Bus"
Lecture 24: "Write Serialization in a Simple Design"

Agenda: The goal is to understand what influences the performance, cost and scalability of SMPs, and the details of the physical design of SMPs. There are at least three goals of any design: correctness, performance, and low hardware complexity. Performance gains are normally achieved by pipelining memory transactions and having multiple outstanding requests. These performance optimizations occasionally introduce new protocol races involving transient states, leading to correctness issues in terms of coherence and consistency.

Correctness goals: Must enforce coherence and write serialization. Recall that write serialization guarantees that all writes to a location are seen in the same order by all processors. Must obey the target memory consistency model: if sequential consistency is the goal, the system must provide write atomicity and detect write completion correctly (write atomicity extends the definition of write serialization to any location, i.e. it guarantees that the positions of writes within the total order seen by all processors are the same). Must be free of deadlock, livelock and starvation. Starvation confined to a part of the system is not as problematic as deadlock and livelock; however, system-wide starvation leads to livelock.

A simple design: Start with a rather naive design. Each processor has a single level of data and instruction caches. The cache allows exactly one outstanding miss at a time, i.e. a cache miss request is blocked if another is already outstanding (this serializes all bus requests from a particular processor). The bus is atomic, i.e. it handles one request at a time.

Cache controller: Must be able to respond to bus transactions as necessary; this is handled by the snoop logic. The snoop logic should have access to the cache tags. A single set of tags cannot allow concurrent accesses by the processor-side and the bus-side controllers: when the snoop logic accesses the tags, the processor must remain locked out from accessing the tags. Possible enhancements: two read ports in the tag RAM allow concurrent reads; duplicate copies are also possible; multiple banks also reduce the contention. In all cases, updates to tags must still be atomic or must be applied to both copies in the case of duplicate tags; however, tag updates are a lot less frequent compared to reads.


Module 12: "Multiprocessors on a Snoopy Bus"
Lecture 24: "Write Serialization in a Simple Design"

Snoop logic: A couple of decisions need to be taken while designing the snoop logic: how long should the snoop decision take, and how should processors convey the snoop decision? Snoop latency (three design choices): It is possible to set an upper bound in terms of number of cycles; advantage: no change in memory controller hardware; disadvantage: potentially large snoop latency (Pentium Pro, Sun Enterprise servers). Alternatively, the memory controller samples the snoop results every cycle until all caches have completed the snoop (SGI Challenge uses this approach: the memory controller fetches the line from memory, but stalls if all caches haven't yet snooped). Or maintain a bit per memory line to indicate if it is in M state in some cache. Conveying the snoop result: For MESI the bus is augmented with three wired-OR snoop result lines (shared, modified, valid); the valid line is active low. The original Illinois MESI protocol requires cache-to-cache transfer even when the line is in S state; this may complicate the hardware enormously due to the involved priority mechanism. Commercial MESI protocols normally allow cache-to-cache sharing only for lines in M state. SGI Challenge and Sun Enterprise allow cache-to-cache transfers only in M state; Challenge updates memory when going from M to S, while Enterprise exercises a MOESI protocol.

Writebacks: Writebacks are essentially evictions of modified lines, caused by a miss mapping to the same cache index. This needs two bus transactions: one for the miss and one for the writeback. The miss should definitely be given first priority, since it directly impacts the forward progress of the program. A writeback buffer (WBB) is needed to hold the evicted line until the bus can be acquired a second time by this cache. In the meantime a new request from another processor may be launched for the evicted line: the evicting cache must provide the line from the WBB and cancel the pending writeback (this needs an address comparator on the WBB).

A simple design


Module 12: "Multiprocessors on a Snoopy Bus"
Lecture 24: "Write Serialization in a Simple Design"

Inherently non-atomic: Even though the bus is atomic, a complete protocol transaction involves quite a few steps which together form a non-atomic transaction: issuing the processor request, looking up the cache tags, arbitrating for the bus, the snoop action in the other cache controllers, and the refill in the requesting cache controller at the end. Different requests from different processors may be in different phases of a transaction; this makes a protocol transition inherently non-atomic. Consider an example: P0 and P1 have cache line C in shared state, and both proceed to write the line. Both cache controllers look up the tags, put a BusUpgr into the bus request queue, and start arbitrating for the bus. P1 gets the bus first and launches its BusUpgr. P0 observes the BusUpgr, and now it must invalidate C in its cache and change the request type to BusRdX. So every cache controller needs to do an associative lookup of the snoop address against its pending request queue and, depending on the request type, take appropriate actions. One way to reason about correctness is to introduce transient states: it is possible to think of the last problem as the line C being in a transient S→M state; on observing a BusUpgr or BusRdX, this state transitions to I→M, which is also transient. The line C goes to the stable M state only after the transaction completes. These transient states are not really encoded in the state bits of a cache line, because at any point in time there will be only a small number of outstanding requests from a particular processor (today the maximum I know of is 16). These states are really determined by the state of an outstanding line and the state of the cache controller.

Write serialization: An atomic bus makes it rather easy, but optimizations are possible. Consider a processor write to a shared cache line: is it safe to continue with the write and change the state to M even before the bus transaction is complete? After the bus transaction is launched it is totally safe, because the bus is atomic and hence the position of the write is committed in the total order; therefore there is no need to wait any further (note that the exact point in time when the other caches invalidate the line is not important). If the processor decides to proceed even before the bus transaction is launched (very much possible with out-of-order execution), the cache controller must take the responsibility of squashing and re-executing offending instructions so that the total order is consistent across the system.


Fetch deadlock: Just a fancy name for a pretty intuitive deadlock. Suppose P0's cache controller is waiting to get the bus for launching a BusRdX to cache line A, and P1 has a modified copy of cache line A. P1 has launched a BusRd to cache line B and is awaiting completion, and P0 has a modified copy of cache line B. If both keep waiting without responding to snoop requests, the deadlock cycle is pretty obvious. So every controller must continue to respond to snoop requests while waiting for the bus for its own requests. Normally the cache controller is designed as two separate independent logic units, namely the inbound unit (handles snoop requests) and the outbound unit (handles its own requests and arbitrates for the bus).


Module 12: "Multiprocessors on a Snoopy Bus"
Lecture 24: "Write Serialization in a Simple Design"

Livelock: Consider the following example. P0 and P1 try to write to the same cache line. P0 gets exclusive ownership, fills the line in its cache and notifies the load/store unit (or retirement unit) to retry the store. While all this is happening, P1's request appears on the bus and P0's cache controller modifies the tag state to I before the store can retry. This can easily lead to a livelock. Normally this is avoided by giving the load/store unit higher priority for tag access (i.e. the snoop logic cannot modify the tag arrays when there is a processor access pending in the same clock cycle). This is even rarer in a multi-level cache hierarchy (more later).

Starvation: Some amount of fairness is necessary in the bus arbiter. An FCFS policy is possible for granting the bus, but that needs some buffering in the arbiter to hold already placed requests. Most machines implement an aging scheme which keeps track of the number of times a particular request is denied; when the count crosses a threshold, that request becomes the highest priority (this too needs some storage).
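A minimal sketch of the aging idea, assuming a fixed set of requesters, an illustrative denial threshold, and a simple fixed-priority fallback; real arbiters differ in their base policy.

#include <stdbool.h>

#define NUM_REQ    8     /* number of requesters (assumption)             */
#define AGE_LIMIT  4     /* denials before a request becomes top priority */

static int deny_count[NUM_REQ];

/* Pick a winner among the currently requesting agents.  A request that
   has been denied AGE_LIMIT times is granted unconditionally; otherwise
   fall back to a simple fixed-priority (lowest index) choice. */
int arbitrate(const bool requesting[NUM_REQ])
{
    int winner = -1;
    for (int i = 0; i < NUM_REQ; i++) {
        if (!requesting[i]) continue;
        if (deny_count[i] >= AGE_LIMIT) { winner = i; break; }
        if (winner < 0) winner = i;
    }
    for (int i = 0; i < NUM_REQ; i++)
        if (requesting[i] && i != winner)
            deny_count[i]++;               /* age the losers */
    if (winner >= 0) deny_count[winner] = 0;
    return winner;                          /* -1 means nobody requested */
}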

More on LL/SC: We have seen that both LL and SC may suffer from cache misses (a read miss followed by an upgrade miss). Is it possible to save one transaction? What if the cache controller is designed in such a way that it can recognize LL instructions and launch a BusRdX instead of a BusRd? This is called Read-for-Ownership (RFO); it is also used by the Intel atomic xchg instruction. Nice idea, but you have to be careful: by doing this you have just enormously increased the probability of a livelock, since before the SC executes there is a high probability that another LL will take the line away. A possible solution is to buffer incoming snoop requests until the SC completes (buffer space is proportional to P); this may introduce new deadlock cycles (especially for modern non-atomic busses).
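For reference, the classic LL/SC retry loop looks as follows; load_linked and store_conditional are hypothetical wrappers around the machine's LL/SC instructions (e.g., MIPS ll/sc), not a real C API.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical wrappers around the hardware LL/SC pair (assumptions). */
uint32_t load_linked(volatile uint32_t *addr);
bool     store_conditional(volatile uint32_t *addr, uint32_t value);

/* Atomic increment built from LL/SC.  The LL typically takes a read miss
   (BusRd) and the SC an upgrade miss (BusUpgr); the Read-for-Ownership
   idea above would issue BusRdX already at the LL, saving the second
   transaction at the cost of a higher livelock risk. */
void atomic_inc(volatile uint32_t *counter)
{
    uint32_t v;
    do {
        v = load_linked(counter);
    } while (!store_conditional(counter, v + 1));
}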

Multi-level caches: We have talked about multi-level caches and the associated inclusion property. Multiprocessors create new problems related to multi-level caches. A bus snoop result may be relevant to inner levels of the cache hierarchy, e.g., bus transactions are not visible to the first-level cache controller. Similarly, modifications made in the first-level cache may not be visible to the second-level cache controller, which is responsible for handling bus requests. The inclusion property makes it easier to maintain coherence: since the L1 cache is a subset of the L2 cache, a snoop miss in the L2 cache need not be sent to the L1 cache.


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 25: "Protocols for Split-transaction Buses"

The Lecture Contains:
Recap of inclusion
Inclusion and snoop
L2 to L1 interventions
Invalidation acks?
Intervention races
Tag RAM design
Exclusive cache levels
Split-transaction bus
New issues
SGI Powerpath-2 bus
Bus interface logic
Snoop results
[From Chapter 6 of Culler, Singh, Gupta]


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 25: "Protocols for Split-transaction Buses" Recap of inclusionA processor read Looks up L1 first and in case of miss goes to L2, and finally may need to launch a BusRd request if it misses in L2 Finally, the line is in S state in both L1 and L2 A processor write Looks up L1 first and if it is in I state sends a ReadX request to L2 which may have the line in M state In case of L2 hit, the line is filled in M state in L1 In case of L2 miss, if the line is in S state in L2 it launches BusUpgr; otherwise it launches BusRdX; finally, the line is in state M in both L1 and L2 If the line is in S state in L1, it sends an upgrade request to L2 and either there is an L2 hit or L2 just conveys the upgrade to bus (Why cant it get changed to BusRdX?) L1 cache replacement Replacement of a line in S state may or may not be conveyed to L2 Replacement of a line in M state must be sent to L2 so that it can hold the most up-todate copy The line is in I state in L1 after replacement, the state of line remains unchanged in L2 L2 cache replacement Replacement of a line in S state may or may not generate a bus transaction; it must send a notification to the L1 caches so that they can invalidate the line to maintain inclusion Replacement of a line in M state first asks the L1 cache to send all the relevant L1 lines (these are the most up-to-date copies) and then launches a BusWB The state of line in both L1 and L2 is I after replacement Replacement of a line in E state from L1? Replacement of a line in E state from L2? Replacement of a line in O state from L1? Replacement of a line in O state from L2? In summary A line in S state in L2 may or may not be in L1 in S state A line in M state in L2 may or may not be in L1 in M state; Why? Can it be in S state? A line in I state in L2 must not be present in L

Inclusion and snoop:
BusRd snoop: Look up the L2 cache tag. If in I state, no action; if in S state, no action; if in M state, assert the wired-OR M line, send a read intervention to the L1 data cache, the L1 data cache sends the lines back, the L2 controller launches the line on the bus, and both the L1 and L2 copies go to S state.
BusRdX snoop: Look up the L2 cache tag. If in I state, no action; if in S state, invalidate and also notify L1; if in M state, assert the wired-OR M line, send a readX intervention to the L1 data cache, the L1 data cache sends the lines back, the L2 controller launches the line on the bus, and both the L1 and L2 copies go to I state.
BusUpgr snoop: Similar to BusRdX, but without the cache line flush.
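The three snoop cases fit naturally into a switch on the observed bus transaction. The sketch below is illustrative C; the helper names for the L1 intervention, the wired-OR line, and the flush are assumptions, not a real interface.

typedef enum { ST_I, ST_S, ST_M } state_t;
typedef enum { BUS_RD, BUS_RDX, BUS_UPGR } bus_req_t;

/* Hypothetical helpers; names are assumptions for illustration only. */
state_t l2_state(unsigned long line_addr);
void    set_l2_state(unsigned long line_addr, state_t s);
void    assert_wired_or_M(void);                 /* drive the modified line */
void    l1_intervention(unsigned long line_addr, int exclusive);
void    flush_line_on_bus(unsigned long line_addr);
void    notify_l1_invalidate(unsigned long line_addr);

/* L2 snoop handler following the three cases above. */
void l2_snoop(unsigned long line_addr, bus_req_t req)
{
    state_t s = l2_state(line_addr);
    if (s == ST_I) return;                       /* no action */

    switch (req) {
    case BUS_RD:
        if (s == ST_M) {                         /* in S: no action */
            assert_wired_or_M();
            l1_intervention(line_addr, 0);       /* read intervention */
            flush_line_on_bus(line_addr);
            set_l2_state(line_addr, ST_S);       /* both L1 and L2 -> S */
        }
        break;
    case BUS_RDX:
        if (s == ST_M) {
            assert_wired_or_M();
            l1_intervention(line_addr, 1);       /* readX intervention */
            flush_line_on_bus(line_addr);
        }
        notify_l1_invalidate(line_addr);
        set_l2_state(line_addr, ST_I);           /* both L1 and L2 -> I */
        break;
    case BUS_UPGR:                               /* like BusRdX, no flush */
        notify_l1_invalidate(line_addr);
        set_l2_state(line_addr, ST_I);
        break;
    }
}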


L2 to L1 interventions: There are two types of interventions. One is the read/readX intervention that requires a data reply; the other is a plain invalidation that does not need a data reply. Data interventions can be eliminated by making the L1 cache write-through, but that introduces too much write traffic to L2. One possible solution is to have a store buffer that can handle the stores in the background obeying the available bandwidth, so that the processor can proceed independently; this can easily violate sequential consistency unless the store buffer also becomes a part of the snoop logic. Useless invalidations can be eliminated by introducing an inclusion bit in the L2 cache state.


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 25: "Protocols for Split-transaction Buses" Invalidation acks?On a BusRdX or BusUpgr in case of a snoop hit in S state L2 cache sends invalidation to L1 caches Does the snoop logic wait for an invalidation acknowledgment from L1 cache before the transaction can be marked complete? Do we need a two-phase mechanism? What are the issues?

Intervention races: Writebacks introduce new races in a multi-level cache hierarchy. Suppose L2 sends a read intervention to L1 and in the meantime L1 decides to replace that line (due to some conflicting processor access). The intervention will naturally miss the up-to-date copy. When the writeback arrives at L2, L2 realizes that the intervention race has occurred (extra hardware is needed to implement this logic; what hardware?). When the intervention reply arrives from L1, L2 can apply the newly received writeback and launch the line on the bus. Exactly the same situation may arise even in a uniprocessor if a dirty replacement from L2 misses the line in L1 because L1 just replaced that line too.

Tag RAM design: A multi-level cache hierarchy reduces tag contention. L1 tags are mostly accessed by the processor because the L2 cache acts as a filter for external requests; L2 tags are mostly accessed by the system because hopefully the L1 cache can absorb most of the processor traffic. Still, some machines maintain duplicate tags at all levels or at the outermost level only.

Exclusive cache levels: The AMD K7 (Athlon XP) and K8 (Athlon64, Opteron) architectures chose to have exclusive levels of caches instead of inclusive ones. This definitely provides much better utilization of on-chip caches since there is no duplication, but it complicates many issues related to coherence. The uniprocessor protocol is to refill requested lines directly into L1 without placing a copy in L2; only on an L1 eviction is the line put into L2; on an L1 miss, look up L2 and in case of an L2 hit move the line from L2 into L1 (this may require replacing multiple L1 lines to accommodate the full L2 line; it is not clear what K8 does: it is possible to maintain an inclusion bit per L1 line sector in the L2 cache). For multiprocessors, one solution could be to have one snoop engine per cache level and a tournament logic that selects the successful snoop result.


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 25: "Protocols for Split-transaction Buses" Split-transaction busAtomic bus leads to underutilization of bus resources Between the address is taken off the bus and the snoop responses are available the bus stays idle Even after the snoop result is available the bus may remain idle due to high memory access latency Split-transaction bus divides each transaction into two parts: request and response Between the request and response of a particular transaction there may be other requests and/or responses from different transactions Outstanding transactions that have not yet started or have completed only one phase are buffered in the requesting cache controllers

New issues: A split-transaction bus introduces new protocol races. Suppose P0 and P1 have a line in S state and both issue a BusUpgr, say, in consecutive cycles. Because the snoop response arrives later, both P0 and P1 may think that they have ownership. Flow control is also important since buffer space is finite. Should responses be delivered in order or out of order? Out-of-order response may better tolerate variable memory latency by servicing other requests. The Pentium Pro uses in-order response; the SGI Challenge and Sun Enterprise use out-of-order response, i.e., no ordering is enforced.

SGI Powerpath-2 bus: Used in the SGI Challenge. Conflicts are resolved by not allowing multiple bus transactions to the same cache line. The bus allows eight outstanding requests at any point in time. Flow control on buffers is provided by negative acknowledgments (NACKs): the bus has a dedicated NACK line which remains asserted if the buffer holding outstanding transactions is full; a NACKed transaction must be retried. The request order determines the total order of memory accesses, but the responses may be delivered in a different order depending on their completion times. In subsequent slides we call this design Powerpath-2 since it is loosely based on that bus.
Logically there are two separate buses: a request bus for launching the command type (BusRd, BusWB, etc.) and the involved address, and a response bus for providing the data response, if any. Since responses may arrive in an order different from the request order, a 3-bit tag is assigned to each request; responses launch this tag on the tag bus along with the data reply so that the address bus may be left free for other requests. The data bus is 256 bits wide while a cache line is 128 bytes, so one data response phase needs four bus cycles (32 bytes per cycle), along with one additional hardware turnaround cycle.
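A sketch of what one entry of such a request table might hold, assuming the eight outstanding requests and 3-bit tags described above; the exact field set is an illustrative assumption.

#include <stdint.h>
#include <stdbool.h>

#define MAX_OUTSTANDING 8          /* eight outstanding requests on the bus */

typedef struct {
    bool     valid;
    uint8_t  tag;                  /* 3-bit tag launched with the response  */
    uint64_t line_addr;            /* address of the outstanding request    */
    uint8_t  cmd;                  /* BusRd, BusRdX, BusUpgr, BusWB, ...    */
    bool     will_snoop_response;  /* we also want the data when it returns */
} req_table_entry_t;

/* Every bus agent keeps an identical copy of this table so that responses,
   identified only by the 3-bit tag, can be matched back to the original
   request and conflicting requests to the same line can be detected. */
static req_table_entry_t request_table[MAX_OUTSTANDING];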


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 25: "Protocols for Split-transaction Buses" SGI Powerpath-2 bus

Essentially two main buses and various control wires for snoop results, flow control, etc.
Address bus: five-cycle arbitration, used during the request phase.
Data bus: five-cycle arbitration and five-cycle transfer, used during the response phase.
Three different transactions may be in one of these three phases at any point in time.

Forming a total order: After the decode cycle of the request phase, every cache controller takes the appropriate coherence action, i.e., a BusRd downgrades an M line to S, a BusRdX invalidates the line. If a cache controller does not get the tags due to contention with the processor, it simply lengthens the ack phase beyond one cycle. Thus the total order is formed during the request phase itself, i.e., the position of each request in the total order is determined at that point.
BusWB case: A BusWB needs only the request phase; however, it needs both the address and data lines together and must arbitrate for both together.
BusUpgr case: A BusUpgr consists only of the request phase, with no response or acknowledgment. As soon as the ack phase of the address arbitration is completed by the issuing node, the upgrade has sealed a position in the total order and hence is marked complete by sending a completion signal to the issuing processor by its local bus controller (each node has its own bus controller to handle bus requests).


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 25: "Protocols for Split-transaction Buses" Bus interface logic

A request table entry is freed when the response is observed on the bus.
Snoop results: There are three snoop wires: shared, modified, and inhibit (all wired-OR). The inhibit wire helps in holding off snoop responses until the data response is launched on the bus. Although the request phase determines who will source the data, i.e., some cache or memory, the memory controller does not know it. The cache with a modified copy keeps the inhibit line asserted until it gets the data bus and flushes the data; this prevents the memory controller from sourcing the data. Otherwise the memory controller arbitrates for the data bus. When the data appears, all cache controllers appropriately assert the shared and modified lines. Why not launch snoop results as soon as they are available?


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 26: "Case Studies"

The Lecture Contains:
Conflict resolution
Path of a cache miss
Write serialization
Write atomicity and SC
Another example
In-order response
Multi-level caches
Dependence graph
Multiple outstanding requests
SGI Challenge
Sun Enterprise
Sun Gigaplane bus
[From Chapter 6 of Culler, Singh, Gupta]


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 26: "Case Studies" Conflict resolutionUse the pending request table to resolve conflicts Every processor has a copy of the table Before arbitrating for the address bus every processor looks up the table to see if there is a match In case of a match the request is not issued and is held in a pending buffer Flow control is needed at different levels Essentially need to detect if any buffer is full SGI Challenge uses a separate NACK line for each of address and data phases Before the phases reach the ack cycle any cache controller can assert the NACK line if it runs out of some critical buffer; this invalidates the transaction and the requester must retry (may use back-off and/or priority) Sun Enterprise requires the receiver to generate the retry when it has buffer space (thus only one retry)

Path of a cache miss: Assume a read miss. Look up the request table; in case of a match with a BusRd, just mark the entry indicating that this processor will snoop the response from the bus and that it will also assert the shared line. In case of a request table hit with a BusRdX, the cache controller must hold on to the request until the conflict resolves. In case of a request table miss, the requester arbitrates for the address bus; while arbitrating, if a conflicting request arrives, the controller must put a NOP transaction within the slot it is granted and hold on to the request until the conflict resolves.
Suppose the requester succeeds in putting the request on the address/command bus. The other cache controllers snoop the request, register it in their request tables (the requester also does this), and take appropriate coherence actions within their own cache hierarchies, while main memory also starts fetching the cache line. If a cache holds the line in M state, it should source it on the bus during the response phase; it keeps the inhibit line asserted until it gets the data bus, then it lowers the inhibit line and asserts the modified line; at this point the memory controller aborts its data fetch/response and instead fields the line from the data bus for writing back. If the memory fetches the line even before the snoop is complete, the inhibit line will not allow the memory controller to launch the data on the bus. After the inhibit line is lowered, depending on the state of the modified line, memory cancels its data response. If no one has the line in M state, the requester grabs the response from memory.
A store miss is similar. The only difference is that even if a cache has the line in M state, the memory controller does not write the response back. Also, any pending BusUpgr to the same cache line must be converted to a BusRdX.

Write serialization: In a split-transaction bus setting, the request table provides sufficient support for write serialization. Requests to the same cache line are not allowed to proceed at the same time. A read to a line after a write to the same line can be launched only after the write response phase has completed; this guarantees that the read will see the new value. A write after a read to the same line can be started only after the read response has completed; this guarantees that the value of the read cannot be altered by the value written.


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 26: "Case Studies" Write atomicity and SCSequential consistency (SC) requires write atomicity i.e. total order of all writes seen by all processors should be identical Since a BusRdX or BusUpgr does not wait until the invalidations are actually applied to the caches, you have to be careful P0: A=1; B=1; P1: print B; print A Under SC (A, B) = (0, 1) is not allowed Suppose to start with P1 has the line containing A in cache, but not the line containing B The stores of P0 queue the invalidation of A in P1s cache controller P1 takes read miss for B, but the response of B is re-ordered by P1s cache controller so that it overtakes the invalidaton (thought it may be better to prioritize reads)

Another example

P0: A=1; print B;
P1: B=1; print A;
Under SC, (A, B) = (0, 0) is not allowed. The same problem arises if P0 executes both its instructions first, then P1 executes the write of B (which, let us assume, generates an upgrade so that it is marked complete as soon as the address arbitration phase finishes), and then the upgrade completion is re-ordered with the pending invalidation of A. So the reason these two cases fail is that the new values are made visible before older invalidations are applied. One solution is to have a strict FIFO queue between the bus controller and the cache hierarchy. But it is sufficient as long as replies do not overtake invalidations; beyond that, the bus responses can be re-ordered without violating write atomicity and hence SC (e.g., if there are only read and write responses in the queue, it sometimes may make sense to prioritize read responses).

In-order response: In-order response can simplify quite a few things in the design. The fully associative request table can be replaced by a FIFO queue. Conflicting requests where one is a write can actually be allowed now (multiple reads were allowed even before, although only the first one actually appears on the bus). Consider a BusRdX followed by a BusRd from two different processors. With in-order response it is guaranteed that the BusRdX response will be granted the data bus before the BusRd response (which may not be true for out-of-order response, and hence such a conflict is disallowed there). So when the cache controller generating the BusRdX sees the BusRd, it only notes that it should source the line for this request after its own write has completed.
The performance penalty may be huge, essentially because of the memory. Consider a situation where three requests are pending to cache lines A, B, C in that order, where A and B map to the same memory bank while C is in a different bank. Although the response for C may be ready long before that of B, it cannot get the bus.


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 26: "Case Studies" Multi-level cachesSplit-transaction bus makes the design of multi-level caches a little more difficult The usual design is to have queues between levels of caches in each direction How do you size the queues? Between processor and L1 one buffer is sufficient (assume one outstanding processor access), L1-to-L2 needs P+1 buffers (why?), L2-toL1 needs P buffers (why?), L1 to processor needs one buffer With smaller buffers there is a possibility of deadlock: suppose the L1-to-L2 and L2-to-L1 have one queue entry each, there is a request in L1-to-L2 queue and there is also an intervention in L2-to-L1 queue; clearly L1 cannot pick up the intervention because it does not have space to put the reply in L1-to-L2 queue while L2 cannot pick up the request because it might need space in L2-to-L1 queue in case of an L2 hit Formalizing the deadlock with dependence graph There are four types of transactions in the cache hierarchy: 1. Processor requests (outbound requests), 2. Responses to processor requests (inbound responses), 3. Interventions (inbound requests), 4. Intervention responses (outbound responses) Processor requests need space in L1-to-L2 queue; responses to processors need space in L2-to-L1 queue; interventions need space in L2-to-L1 queue; intervention responses need space in L1-to-L2 queue Thus a message in L1-to-L2 queue may need space in L2-to-L1 queue (e.g. a processor request generating a response due to L2 hit); also a message in L2-to-L1 queue may need space in L1-to-L2 queue (e.g. an intervention response) This creates a cycle in queue space dependence graph

Dependence graph: Represent each queue by a vertex in the graph, so the number of vertices equals the number of queues. A directed edge from vertex u to vertex v is present if a message at the head of queue u may generate another message which requires space in queue v. In our case we have two queues, L2-to-L1 and L1-to-L2; the graph is not a DAG, hence deadlock is possible.

Multi-level caches: In summary, the L2 cache controller refuses to drain the L1-to-L2 queue if there is no space in the L2-to-L1 queue. This is rather conservative, because the message at the head of the L1-to-L2 queue may not need space in the L2-to-L1 queue, e.g., in case of an L2 miss or if it is an intervention reply; but after popping the head of the L1-to-L2 queue it is impossible to backtrack if the message does need space in the L2-to-L1 queue. Similarly, the L1 cache controller refuses to drain the L2-to-L1 queue if there is no space in the L1-to-L2 queue. How do we break this cycle? Observe that responses to processor requests are guaranteed not to generate any more messages, and intervention requests do not generate new requests but can only generate replies.
Solving the queue deadlock: Introduce one more queue in each direction, i.e., have a pair of queues in each direction: an L1-to-L2 processor request queue and an L1-to-L2 intervention response queue, and similarly an L2-to-L1 intervention request queue and an L2-to-L1 processor response queue. Now the L2 cache controller can serve the L1-to-L2 processor request queue as long as there is space in the L2-to-L1 processor response queue, and there is no constraint on the L1 cache controller for draining the L2-to-L1 processor response queue. Similarly, the L1 cache controller can serve the L2-to-L1 intervention request queue as long as there is space in the L1-to-L2 intervention response queue, and the L1-to-L2 intervention response queue will drain as soon as the bus is granted.
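A sketch of the deadlock-free draining discipline with the four queues just described (processor request, processor reply, intervention request, intervention reply); the queue handles and helper functions are illustrative assumptions.

#include <stdbool.h>

/* Opaque queue handles for the four L1<->L2 queues (illustrative). */
typedef struct queue queue_t;
extern queue_t *l1_to_l2_proc_req;    /* PR: processor requests        */
extern queue_t *l1_to_l2_intv_reply;  /* IY: intervention replies      */
extern queue_t *l2_to_l1_proc_reply;  /* PY: replies to the processor  */
extern queue_t *l2_to_l1_intv_req;    /* IR: interventions from snoops */

bool queue_empty(queue_t *q);
bool queue_has_space(queue_t *q);
void serve_one(queue_t *src);         /* pop the head and process it   */

/* L2 side: a processor request may generate a reply on an L2 hit, so it
   is served only if the processor reply queue has space.  Intervention
   replies head for the bus and never need L2-to-L1 space, so they drain
   unconditionally. */
void l2_controller_step(void)
{
    if (!queue_empty(l1_to_l2_intv_reply))
        serve_one(l1_to_l2_intv_reply);
    else if (!queue_empty(l1_to_l2_proc_req) &&
             queue_has_space(l2_to_l1_proc_reply))
        serve_one(l1_to_l2_proc_req);
}

/* L1 side: an intervention may generate an intervention reply, so it is
   served only if that queue has space.  Processor replies sink into the
   L1 cache/processor and are always drained. */
void l1_controller_step(void)
{
    if (!queue_empty(l2_to_l1_proc_reply))
        serve_one(l2_to_l1_proc_reply);
    else if (!queue_empty(l2_to_l1_intv_req) &&
             queue_has_space(l1_to_l2_intv_reply))
        serve_one(l2_to_l1_intv_req);
}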


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 26: "Case Studies" Dependence graphNow we have four queues Processor request (PR) and intervention reply (IY) are L1 to L2 Processor reply (PY) and intervention request (IR) are L2 to L1

It is possible to combine PR and IY into a supernode of the graph and still be cycle-free, which leads to one L1-to-L2 queue. Similarly, it is possible to combine IR and PY into a supernode, which leads to one L2-to-L1 queue. But you cannot do both: that leads to a cycle, as already discussed. Bottom line: at least three queues are needed for a two-level cache hierarchy.

Multiple outstanding requests: Today all processors allow multiple outstanding cache misses. We have already discussed the issues related to out-of-order execution, and not much needs to be added on top of that to support multiple outstanding misses. For a multi-level cache hierarchy the queue depths may be made bigger for performance reasons, and various other buffers such as the writeback buffer need to be made bigger.


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 26: "Case Studies" SGI Challenge

Supports 36 MIPS R4400 processors (4 per board) or 18 MIPS R8000 processors (2 per board). The A-chip has the address bus interface and the request table, the CC-chip handles coherence through the duplicate set of tags, and each D-chip handles 64 bits of data, so that four D-chips together interface to the 256-bit wide data bus.

Sun Enterprise

Supports up to 30 UltraSPARC processors, with 2 processors and 1 GB of memory per board. The memory bus is 64 bytes wide, hence two memory cycles are needed to transfer the entire 128-byte cache line.

Sun Gigaplane bus: Split-transaction, 256 bits of data, 41 bits of address, clocked at 83.5 MHz (compare to 47.6 MHz for the SGI Powerpath-2). Supports 16 boards and 112 outstanding transactions (up to 7 from each board). The snoop result is available 5 cycles after the request phase. Memory fetches data speculatively. Uses the MOESI protocol.


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 27: "Scalable Snooping and AMD Hammer Protocol"

Special Topics:
Virtually indexed caches
Virtual indexing
TLB coherence
TLB shootdown
Snooping on a ring
Scaling bandwidth
AMD Opteron
Opteron servers
AMD Hammer protocol
[From Chapter 6 of Culler, Singh, Gupta]


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 27: "Scalable Snooping and AMD Hammer Protocol" Virtually indexed cachesRecall that to have concurrent accesses to TLB and cache, L1 caches are often made virtually indexed Can read the physical tag and data while the TLB lookup takes place Later compare the tag for hit/miss detection How does it impact the functioning of coherence protocols and snoop logic? Even for uniprocessor the synonym problem Two different virtual addresses may map to the same physical page frame One simple solution may be to flush all cache lines mapped to a page frame at the time of replacement But this clearly prevents page sharing between two processes

Virtual indexing: Software normally employs page coloring to solve the synonym issue. Two virtual pages are allowed to point to the same physical page frame only if the two virtual addresses have at least the lower k bits in common, where k equals the cache line block offset plus log2(number of cache sets). This guarantees that in a virtually indexed cache, lines from both pages will map to the same index range.
What about the snoop logic? Putting the virtual address on the bus requires a VA-to-PA translation in the snoop path so that physical tags can be generated (this adds extra latency to the snoop and also requires a duplicate set of translations). Putting the physical address on the bus requires a reverse translation to generate the virtual index (this requires an inverted page table).
Dual tags (Goodman, 1987) are a hardware solution to avoid synonyms in shared memory. Maintain virtual and physical tags, with each corresponding tag pair pointing to the other; assume no page coloring. Use the virtual address to look up the cache (i.e., virtual index and virtual tag) from the processor side; if it hits, everything is fine. If it misses, use the physical address to look up the physical tag, and if that hits, follow the physical-tag-to-virtual-tag pointer to find the index. If the virtual tag misses and the physical tag hits, the synonym problem has occurred, i.e., two different VAs are mapped to the same PA; in this case invalidate the cache line pointed to by the physical tag, replace the line at the virtual index of the current virtual address, place the contents of the invalidated line there, and update the physical tag pointer to point to the new virtual index.
With this scheme (Goodman, 1987) the physical address is always used for snooping, which obviates the need for a TLB in the memory controller. The physical tag is used to look up the cache for the snoop decision; in case of a snoop hit, the pointer stored with the physical tag is followed to get the virtual index, and then the cache block can be accessed if needed (e.g., in M state). Note that even though there are two different types of tags, the state of a cache line is the same and does not depend on which type of tag is used to access the line.
Multi-level cache hierarchy: Normally the L1 cache is designed to be virtually indexed while the other levels are physically indexed. L2 sends interventions to L1 by communicating the PA, and L1 must determine the virtual index from that to access its cache: dual tags are sufficient for this purpose.
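A small sketch of the page-coloring constraint, assuming a 32 KB direct-mapped virtually indexed cache with 64-byte lines purely for illustration: k = block offset bits + log2(number of sets) = 6 + 9 = 15, and two virtual addresses may share a physical frame only if they agree in those low k bits.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative cache geometry (assumption): 32 KB, 64 B lines, direct-mapped,
   hence 512 sets. */
#define BLOCK_OFFSET_BITS 6
#define SET_INDEX_BITS    9
#define COLOR_BITS        (BLOCK_OFFSET_BITS + SET_INDEX_BITS)
#define COLOR_MASK        ((1u << COLOR_BITS) - 1)

/* Two virtual addresses are allowed to be synonyms (map to the same
   physical frame) only if they have the same color, i.e., they fall in
   the same index range of the virtually indexed cache. */
bool coloring_allows_synonym(uint64_t va1, uint64_t va2)
{
    return (va1 & COLOR_MASK) == (va2 & COLOR_MASK);
}

/* Virtual index used to look up the cache while the TLB translates. */
uint64_t virtual_index(uint64_t va)
{
    return (va >> BLOCK_OFFSET_BITS) & ((1u << SET_INDEX_BITS) - 1);
}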


Module 12: "Multiprocessors on a Snoopy Bus" Lecture 27: "Scalable Snooping and AMD Hammer Protocol" TLB coherenceA page table entry (PTE) may be held in multiple processors in shared memory because all of them access the same shared page A PTE may get modified when the page is swapped out and/or access permissions are changed Must tell all processors having this PTE to invalidate How to do it efficiently? No TLB: virtually indexed virtually tagged L1 caches On L1 miss directly access PTE in memory and bring it to cache; then use normal cache coherence because the PTEs also reside in the shared memory segment On page replacement the page fault handler can flush the cache line containing the replaced PTE Too impractical: fully virtual caches are rare, still uses a TLB for upper levels (Alpha 21264 instruction cache) Hardware solution Extend snoop logic to handle TLB coherence PowerPC family exercises a tlbie instruction (TLB invalidate entry) When OS modifies a PTE it puts a tlbie instruction on bus Snoop logic picks it up and invalidates the TLB entry if present in all processors This is well suited for bus-based SMPs, but not for DSMs because broadcast in a large-scale machine is not good

TLB shootdown: A popular TLB coherence solution. It is invoked by an initiator (the processor which modifies the PTE) by sending an interrupt