TRANSCRIPT
Microthreaded model and DRISC processors
Managing concurrency dynamically
A seminar given to IFIP 10.3 on 9/5/2007
Chris Jesshope, Professor of Computer Systems Engineering
University of Amsterdam
Background - 10 years of research
This work started in 1996 as a latency-tolerant processor architecture called DRISC*, designed for executing data-parallel languages on multiprocessors
It has evolved over 10 years into a self-similar concurrency model called SVP - or Microthreading+ - with implementations at the ISA and system level
* A Bolychevsky, C R Jesshope and V B Muchnick (1996) Dynamic scheduling in RISC architectures, IEE Trans. E, Computers and Digital Techniques, 143, pp 309-317
+ C R Jesshope (2006) Microthreading - a model for distributed instruction-level concurrency, Parallel Processing Letters, 16(2), pp 209-228
- C R Jesshope (2007) A model for the design and programming of multicores, submitted to "Advances in Parallel Computing", L. Grandinetti (Ed.), IOS Press, Amsterdam, http://staff.science.uva.nl/~jesshope/papers/Multicores.pdf
Current and proposed projects
The NWO Microgrids project is evaluating homogeneous reconfigurable multi-cores based on microthreaded microprocessors (4 years from 01/09/05)
SVP has been adopted in the EU AETHER project as a model for self-adaptive computation based on FPGAs (3 years from 01/01/06)
The APPLE-CORE FP7 proposal will target the C and SAC languages to SVP and will implement prototypes of microthreaded microprocessors (we hope)
UvA’s multi-core mission
- Managing 10^2 to 10^5 processors per chip
- Operands from large distributed register files
- Processors tolerant to significant latency (hundreds of processor cycles)
- On-chip COMA distributed shared memory
- Support for a range of architectural paradigms (homogeneous / heterogeneous / FPGA / SIMD)
To do all of this we need a programming model supporting concurrency as a core concept
Programming models
Sequential programming has advantages:
1. sequential programs are deterministic and safely composable, i.e. using the well-understood concept of hierarchy (calling functions)
2. source code is universally compatible and can be compiled to any sequential ISA without modification
3. binary-code compatibility is important in commodity processors, although this is not scalable in current processors
Our aim is to gain the same benefits from a concurrent programming model for multi-cores
Microthread or SVP model
Blocking threads with data-driven instruction execution
Concurrency trees - hierarchical composition
Concurrent composition - build programs concurrently: nodes represent threads; leaf nodes perform computation; branching at a node represents concurrent subordinate threads (Program A, Program B, Program A||B)
Threads may have dependencies, but only between nodes at one level
Blocking threads
Threads at different levels run concurrently: A creates Bi for all i in some set; dependencies are defined between threads; A continues until a sync
The identifiable events are: when A creates B; when A writes a value used by B, etc.; when Bi completes for all i
[Figure: thread A creates subordinate threads B0…Bn; a dependency chain links B0 through Bn; the family completes at a barrier sync.]
What does this mean?
Terminology and concepts
- Family of threads: all threads at one level
- Unit of work: a sub-tree, i.e. all of a thread's subordinate threads; may be considered as a job or a task
- Place: where a unit of work executes - one or more processors, FPGA cells, etc.
Safe composition
A family of threads is created dynamically as an ordered set defined on an index sequence; each thread in the family has access to a unique value in the index sequence - its index in the family
Restrictions are placed on the communication between threads - these are blocking reads: the creating thread may write to the first thread in index sequence, and any created thread may write to the thread whose index is next in sequence to its own
Communication in a family is therefore acyclic, and deadlock cannot be induced by composition - i.e. by one thread creating a subordinate family of threads - subject to resources
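As a plain sequential C sketch (not real SVP code), the restricted communication pattern looks like this: the parent seeds the value read by thread 0, and each thread may pass a value only to its successor, so a family can carry a dependency chain such as a running sum with no possibility of a cycle. The function name and calling convention here are invented for illustration.

```c
#include <assert.h>

/* Simulate one family of n threads under the SVP communication rule:
   the creating thread writes the shared value seen by thread 0, and
   thread i may write only to thread i+1.  Each "thread" adds its
   local value to its predecessor's shared value.                    */
int run_family(int parent_shared, const int local[], int out[], int n)
{
    int shared = parent_shared;      /* written by the creating thread */
    for (int i = 0; i < n; i++) {    /* index sequence i = 0..n-1      */
        shared += local[i];          /* read predecessor's shared,
                                        write own shared               */
        out[i] = shared;
    }
    return shared;                   /* only defined at the barrier
                                        sync, when the family completes */
}
```

Because each thread depends only on its immediate predecessor, composing such families (a thread creating a subordinate family) cannot introduce a communication cycle.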
Thread distribution
A create operation distributes a parameterised family of threads to processing resources - deterministically; the number of threads and processors is defined at runtime
Processors may be one or more homogeneous processors, a dedicated unit, configured FPGA cells, etc.
Communication deadlock is avoided, but resource deadlock can occur: processors have a finite set of registers for synchronising contexts. This can be statically analysed for some programs, but not for unbounded recursion of creates - this is solved by delegating a unit of work to a new place
Registers as synchronisers
Efficient* implementations of microthreads synchronise in shared registers (as i-structures): this avoids a memory round-trip latency in synchronising, and single-cycle synchronisation is possible
Families of threads communicate and synchronise on shared memory: a family's output to memory is not defined until the family completes (the synchronising event) - i.e. a bulk synchronisation or barrier
* we focus on direct implementations of the model at the level of ISA instructions
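Register synchronisation as an i-structure can be sketched in software with a full/empty flag and a blocking read. This is a minimal POSIX-threads analogue of the hardware mechanism, not the single-cycle implementation; the type and function names (`ireg_t` etc.) are invented for illustration.

```c
#include <pthread.h>
#include <stdbool.h>

/* An i-structure "register": reads of an empty register suspend the
   reader; a write fills the register and reschedules any readers.   */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    bool full;
    long value;
} ireg_t;

void ireg_init(ireg_t *r)
{
    pthread_mutex_init(&r->m, NULL);
    pthread_cond_init(&r->cv, NULL);
    r->full = false;
}

void ireg_write(ireg_t *r, long v)
{
    pthread_mutex_lock(&r->m);
    r->value = v;
    r->full = true;                 /* the data write completes ...   */
    pthread_cond_broadcast(&r->cv); /* ... and reschedules readers    */
    pthread_mutex_unlock(&r->m);
}

long ireg_read(ireg_t *r)
{
    pthread_mutex_lock(&r->m);
    while (!r->full)                /* empty: suspend on the register */
        pthread_cond_wait(&r->cv, &r->m);
    long v = r->value;
    pthread_mutex_unlock(&r->m);
    return v;
}
```

In hardware the suspension costs nothing beyond parking the thread reference in the register, which is exactly what avoids the memory round-trip.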
Putting it all together
[Figure: families of threads, indexed (i = 0, 2, 4, 6, …) and dynamically defined, with synchronised dependencies between threads; threads create subordinate families; units of work are delegated to places (resources), each a set of processor (P) and memory (M) pairs.]
Concurrency management - kill or squeeze a unit of work
Squeeze is a preemption or retraction of a concurrent unit of work
System issues
Threads are dynamic, share memory, and can be executed anywhere
Shared and/or distributed memory implementations are possible; a place can be on the same chip or on another
Deterministic distribution of families can be used to optimise data locality
Implementation of SVP in conventional processors
Dynamic RISC - DRISC
DRISC processor
Can apply microthreading to any base ISA: just add concurrency-control instructions, provide control structures for threads and families, and provide a large synchronising register file
Have backward compatibility to the base ISA: old binaries run as threads under the model
New binaries are schedule invariant and use from 1 to Nmax processors
Synchronous vs Asynchronous register updates
An instruction in a microthreaded pipeline updates registers either:
- synchronously, when the register is set at the writeback stage of the pipeline, or
- asynchronously, when the register is set to empty at the writeback stage and some activity concurrent to the pipeline's operation writes a value to the register file asynchronously
Some instructions do one or the other depending on machine state, e.g. load word depends on an L1 cache hit
Regular ISA + concurrency control
Add just five new instructions:
- cre - creates a family of microthreads; this is asynchronous and may set more than one register; the events are when the family is identified and when it completes; a Thread Control Block (TCB) in memory contains the parameters
- brk - terminates the family of the executing thread; a return value is read from the first register specifier
- kill & sqze - terminate & preempt a family specified by a family id; the family identifier is read from the second register specifier
DRISC pipeline
1. Threads are created, one per clock period, with a context in synchronising memory
2. Instructions are issued from the head of the active queue and read synchronising memory
3. If data is available it is sent for processing; otherwise the thread suspends on the empty register
4. Suspended threads are rescheduled when data is written and re-execute the blocked instruction

[Figure: the pipeline, fed by a queue of active threads and a thread instruction buffer (TIB), reading and writing synchronising memory; fixed-delay operations complete in the pipeline while variable-delay operations (e.g. loads) complete asynchronously.]

Note the potential for power efficiency:
a) if a thread is inactive its TIB line is turned off
b) if the queue is empty the processor turns off
c) the queue length measures local load and can be used to adjust the local clock rate
Processor control structures required
- A large synchronising register file (RF), plus a register-file map for register allocation
- A thread table (TT) to store a thread's state: PC, RF base addresses, queue link field, etc.
- A thread instruction buffer (TIB): an active thread is associated with a line in the TIB
- A family table (FT) to store family information
Thread and family identifiers are indices into the TT and FT respectively - i.e. they are direct-access structures
We do not require branch predictors, large data caches or complex issue logic
Synchronising memory
Registers provide the synchronising memory in a microthreaded pipeline
The state of a register is stored with its data, and its ports adapt according to that state
In state T-cont the register contains a TT address; in state RR-cont the register contains a remote RF address

[Figure: register state diagram. A register is initialised empty and a data write takes it to full. A local read finding no data takes it to T-cont, from which a data write completes and reschedules the suspended thread; a remote read finding no data takes it to RR-cont, from which a data write completes the remote read. Remote reads and the completing writes are asynchronous pipeline operations.]
Memory references
To provide latency tolerance, loads and stores are decoupled from the pipeline's operation; n.b. the datapath cache may be very small, e.g. 1 KByte
The ISA's load instruction is synchronous on an L1 D-cache hit and asynchronous on an L1 D-cache miss
In the latter case the target register is written empty by the pipeline and overwritten asynchronously by the memory subsystem when it provides the data
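The hit/miss split can be sketched as follows. This is a hypothetical single-threaded simulation, not the hardware: `memory_completes` stands in for the asynchronous memory subsystem, and all names are invented for illustration.

```c
#include <stdbool.h>

/* A register with a full/empty state, as in synchronising memory. */
typedef struct { bool full; long value; } reg_t;

/* An outstanding L1 D-cache miss awaiting completion. */
typedef struct {
    bool   pending;   /* is a miss outstanding?           */
    reg_t *target;    /* register to complete             */
    long   data;      /* data the memory will deliver     */
} miss_t;

/* Issue a load; 'hit' models the L1 D-cache lookup result. */
void load(reg_t *target, long mem_value, bool hit, miss_t *miss)
{
    if (hit) {
        target->value = mem_value;   /* synchronous writeback        */
        target->full  = true;
    } else {
        target->full  = false;       /* written empty by the pipeline;
                                        a read would now suspend     */
        miss->pending = true;
        miss->target  = target;
        miss->data    = mem_value;
    }
}

/* Later, the memory subsystem completes the miss asynchronously. */
void memory_completes(miss_t *miss)
{
    if (miss->pending) {
        miss->target->value = miss->data; /* asynchronous register write */
        miss->target->full  = true;       /* wakes any suspended thread  */
        miss->pending = false;
    }
}
```

The pipeline never stalls on the miss; only the thread that reads the empty register is suspended, which is the source of the latency tolerance.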
Register-to-register operations
Single-cycle operations are synchronous and scheduled every clock cycle using bypassing
Multi-cycle operations can be either synchronous or asynchronous
Variable-cycle operations (e.g. a shared FPU) are scheduled asynchronously: the writeback sets the register empty and any dependent instruction is blocked
Sharing registers between threads
Each thread has an identified context in the register file (≤ 31 registers + R31 == 0 with the Alpha ISA); registers are shared between threads' contexts to support the distributed-shared register file, and sharing is restricted: on the same processor sharing is performed by mapping; on adjacent processors sharing is performed by local communication
There are sub-classes of variables managed in the context:
- global - visible to all threads in a family (read only)
- local - to one thread only (read/write)
- shared/dependent - written by one thread and read by its neighbour

[Figure: register windows (≤ 31 registers each) of the creating thread and threads 1…n, showing global scalars, each thread's locals and shareds, and each thread's read-only view of its neighbour's shareds.]
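One way to picture sharing by mapping is as overlapping windows in a single register file. The layout below (G globals per family, S shareds and L locals per thread) is purely illustrative - the real DRISC allocation differs - but it shows how a thread's "neighbour's shared" window is not separate storage: it aliases the predecessor's shareds.

```c
/* Illustrative register-file layout for one family on one processor:
   [ G globals | thread 0: S shareds, L locals | thread 1: ... ]      */
enum { G = 4, S = 2, L = 8 };

int globals_base(int family_base)        { return family_base; }

int shareds_base(int family_base, int i) /* thread i's own shareds */
{
    return family_base + G + i * (S + L);
}

int locals_base(int family_base, int i)  /* thread i's locals */
{
    return shareds_base(family_base, i) + S;
}

/* Thread i reads its neighbour's shareds through thread i-1's
   window (for i == 0 this would map to the creating thread's
   shareds instead) - sharing purely by address mapping.          */
int neighbour_shareds_base(int family_base, int i)
{
    return shareds_base(family_base, i - 1);
}
```

On adjacent processors the same logical aliasing is realised by local communication over the register-sharing network rather than by address arithmetic.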
Create
Create performs the following actions autonomously:
1. writes the TCB address to the create buffer at the execute stage
2. sets two targets (e.g. Ra and Ra+1) to empty at the WB stage
3. when the family is allocated an FT slot it (optionally) writes the fid to Ra+1 using the asynchronous port; the family may now be killed or squeezed
4. when the family completes it (optionally) writes the return value to the target specified in the TCB using the asynchronous port
5. finally, when the family's memory writes have completed, it writes the return code to Ra using the asynchronous port and cleans itself up - i.e. releasing the FT slot
Squeeze and Kill
kill and squeeze are asynchronous and very powerful! To provide security, a pseudo-random number is generated by the processor and kept in the FT and as part of the fid; these must match to enable the operations
kill and squeeze traverse down through the create tree from the node the signal was sent to; for squeeze this is to a user-defined level
The concurrency tree is captured implicitly by a parent in the FT, i.e. families are located in related FTs that have the same fid as a parent; these children then propagate the signal in turn
Thread state
Threads are held in an indexed table; the table index is the thread's reference and is used to build queues on that table
Thread state in the TT is encoded by the queue a thread is currently in:
- empty - not allocated
- active - head/tail in the family table
- suspended - a degenerate queue (head = tail) stored in the register the thread is suspended on
- waiting - head/tail in an I-cache line
N.b. no thread will execute unless its instructions are in the cache
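The queue-on-table idea can be sketched as follows. The field names and sizes are hypothetical, but the point is structural: a queue is only a head/tail pair of TT indices plus one link field per entry, so enqueue and dequeue are O(1) with no storage beyond the table itself.

```c
/* A queue threaded through the thread table (TT): each TT entry
   carries a link field, and a queue is just head/tail indices.
   A thread's state is encoded by which queue its index sits in.  */
enum { NTHREADS = 16, NIL = -1 };

typedef struct { int link; } tt_entry_t;
typedef struct { int head, tail; } queue_t;

tt_entry_t tt[NTHREADS];

void q_init(queue_t *q) { q->head = q->tail = NIL; }

void q_push(queue_t *q, int tid)      /* enqueue at the tail, O(1) */
{
    tt[tid].link = NIL;
    if (q->head == NIL) q->head = tid;
    else                tt[q->tail].link = tid;
    q->tail = tid;
}

int q_pop(queue_t *q)                 /* dequeue from the head, O(1) */
{
    int tid = q->head;
    if (tid != NIL) {
        q->head = tt[tid].link;
        if (q->head == NIL) q->tail = NIL;
    }
    return tid;
}
```

The degenerate suspended "queue" (head = tail = one thread) needs only a single TT index, which is why it fits inside the register the thread is suspended on.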
Thread state transition
[Figure: thread state transitions. A thread that executes, context switches and reads data successfully stays active; one that reads data unsuccessfully becomes suspended. When data is written and the PC hits the I-cache a suspended thread becomes active again; when the PC misses the I-cache it becomes waiting, and it becomes active when the cache line returns.]
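The transitions above can be captured as a small transition function; the states and events are named after the labels in the diagram, but the encoding is a sketch, not the hardware representation.

```c
/* Thread states, encoded by the queue a thread currently sits in. */
typedef enum { T_EMPTY, T_ACTIVE, T_SUSPENDED, T_WAITING } tstate_t;

/* Scheduling events from the diagram. */
typedef enum {
    EV_READ_OK,       /* executes, context switches, reads data successfully   */
    EV_READ_BLOCK,    /* executes, context switches, reads data unsuccessfully */
    EV_WRITE_HIT,     /* data written & PC hits the I-cache                    */
    EV_WRITE_MISS,    /* data written & PC misses the I-cache                  */
    EV_LINE_RETURNS   /* cache line returns                                    */
} tevent_t;

tstate_t step(tstate_t s, tevent_t e)
{
    switch (s) {
    case T_ACTIVE:
        if (e == EV_READ_OK)    return T_ACTIVE;    /* stays schedulable     */
        if (e == EV_READ_BLOCK) return T_SUSPENDED; /* suspends on register  */
        break;
    case T_SUSPENDED:
        if (e == EV_WRITE_HIT)  return T_ACTIVE;
        if (e == EV_WRITE_MISS) return T_WAITING;   /* must fetch its line   */
        break;
    case T_WAITING:
        if (e == EV_LINE_RETURNS) return T_ACTIVE;
        break;
    default:
        break;
    }
    return s;  /* no transition defined for this state/event pair */
}
```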
Microgrids of microthreaded microprocessors
Family distribution to clusters
Source code: for i = 1, n {…}
Binary code: create i = 1, n {…}
The create distributes the family over a cluster of pipelines using a deterministic global schedule; hardware schedulers hold the thread queues at each pipeline. Microthreads are scheduled to pipelines dynamically and instructions are executed according to dataflow constraints.

[Figure: a family created with i = 1, n distributed over processors P0…P3 connected by a register-sharing ring network, each processor executing a subset of the indices i = 1…12.]
SEP - dynamic processor allocation
The microgrid concept defines a pool of bare processors allocated dynamically by the SEP to threads at any level in the concurrency tree, in order to delegate units of work. A cluster of processors is configured as a ring and is known as a place, identified by the address of the root processor. Microthreaded binary code can be executed anywhere and on any number of processors.
[Figure: the SEP partitioning a pool of µT processors into clusters of various sizes.]
Delegation across CMP
Coherent shared memory spans the CMP.

[Figure: a unit of work delegated from one cluster to another (via configuration switches) - a delegate request is sent to the remote cluster and a response is returned.]
Example Chip Architecture
[Figure: a level-0 tile contains four pipelines (Pipe 0…3) sharing an FPU and data-diffusion memory; level-1 tiles are composed of level-0 tiles. Interconnect: a coherency network (64-byte-wide ring / ring of rings), a register-sharing network (8-byte-wide ring) and a delegation network (1-bit-wide grid).]
The big picture - where are we?
[Figure: status of the tool chain - components: an FPGA microthreaded processor; a µTC-to-ISA compiler (Alpha + µT ISA); C to µTC (sequential), SAC to µTC (data parallel) and Snet to µTC (streaming) compilers; a microthreaded CMP simulator/emulator; an Alpha + µT ISA assembler/loader; and hand-assembled kernels. Components are marked as existing today, in development, or to be developed.]
Discussion
Microthreading provides a unified model of concurrency on a scale from CMPs to grids
Programs in the model are composed concurrently, with restrictions to allow safe composition
It reflects the problems of future silicon implementations
We have developed a language, µTC, that captures this concurrency
Conclusions
Microthreaded processors are both computationally and power efficient: code is schedule invariant and dynamically distributed, and instructions are dynamically interleaved
Control structures are distributed and scalable - small compared to an FPU
They can manage code fragments (threads) as small as a few instructions: context switch, signal and reschedule a thread on every clock cycle