TRANSCRIPT
Microthreaded model and DRISC processors
Managing concurrency dynamically
A seminar given to IFIP 10.3 on 9/5/2007
Chris Jesshope, Professor of Computer Systems Engineering
University of Amsterdam
Background - 10 years of research
This work started in 1996 as a latency-tolerant processor architecture called DRISC*, designed for executing data-parallel languages on multiprocessors
It has evolved over 10 years into a self-similar concurrency model called SVP - or Microthreading+ - with implementations at the ISA and system level
* A Bolychevsky, C R Jesshope and V B Muchnick (1996) Dynamic scheduling in RISC architectures, IEE Trans. E, Computers and Digital Techniques, 143, pp 309-317
+ C R Jesshope (2006) Microthreading - a model for distributed instruction-level concurrency, Parallel Processing Letters, 16(2), pp 209-228
- C R Jesshope (2007) A model for the design and programming of multicores, submitted to "Advances in Parallel Computing", L. Grandinetti (Ed.), IOS Press, Amsterdam, http://staff.science.uva.nl/~jesshope/papers/Multicores.pdf
Current and proposed projects
The NWO Microgrids project is evaluating homogeneous reconfigurable multi-cores based on microthreaded microprocessors (4 years from 01/09/05)
SVP has been adopted in the EU AETHER project as a model for self-adaptive computation based on FPGAs (3 years from 01/01/06)
The APPLE-CORE FP7 proposal will target the C and SAC languages to SVP and will implement prototypes of microthreaded microprocessors (we hope)
UvA’s multi-core mission
- Managing 10^2 to 10^5 processors per chip
- Operands from large distributed register files
- Processors tolerant to significant latency (hundreds of processor cycles)
- On-chip COMA distributed shared memory
- Support for a range of architectural paradigms (homogeneous / heterogeneous / FPGA / SIMD)
To do all of this we need a programming model supporting concurrency as a core concept
Programming models
Sequential programming has advantages:
1. sequential programs are deterministic and safely composable, i.e. using the well-understood concept of hierarchy (calling functions)
2. source code is universally compatible and can be compiled to any sequential ISA without modification
3. binary-code compatibility is important in commodity processors, although this is not scalable in current processors
Our aim is to gain the same benefits from a concurrent programming model for multi-cores
Microthread or SVP model
Blocking threads with data-driven instruction execution
Concurrency trees - hierarchical composition
Concurrent composition - build programs concurrently: nodes represent threads; leaf nodes perform computation; branching at a node represents concurrent subordinate threads (Program A, Program B, Program A||B)
Threads may have dependencies, but only between nodes at one level
Blocking threads
Threads at different levels run concurrently: A creates Bi for all i in some set; dependencies are defined between threads; A continues until a sync
The identifiable events are: when A creates B; when A writes a value used by B, etc.; when Bi completes for all i
[Figure: thread A creates subordinate threads B0…Bn; a dependency chain links B0 through Bn; the family completes at a barrier sync.]
What does this mean?
Terminology and concepts
- Family of threads: all threads at one level
- Unit of work: a sub-tree, i.e. all of a thread's subordinate threads; may be considered as a job or a task
- Place: where a unit of work executes - one or more processors, FPGA cells, etc.
Safe composition
A family of threads is created dynamically as an ordered set defined on an index sequence; each thread in the family has access to a unique value in the index sequence - its index in the family
Restrictions are placed on the communication between threads - these are blocking reads: the creating thread may write to the first thread in index sequence, and any created thread may write to the thread whose index is next in sequence to its own
Communication in a family is therefore acyclic, and deadlock cannot be induced by composition - i.e. by one thread creating a subordinate family of threads - subject to resources
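As a plain sequential C sketch (not real SVP code), the restricted communication pattern looks like this: the parent seeds the value read by thread 0, and each thread may pass a value only to its successor, so a family can carry a dependency chain such as a running sum with no possibility of a cycle. The function name and calling convention here are invented for illustration.

```c
#include <assert.h>

/* Simulate one family of n threads under the SVP communication rule:
   the creating thread writes the shared value seen by thread 0, and
   thread i may write only to thread i+1.  Each "thread" adds its
   local value to its predecessor's shared value.                    */
int run_family(int parent_shared, const int local[], int out[], int n)
{
    int shared = parent_shared;      /* written by the creating thread */
    for (int i = 0; i < n; i++) {    /* index sequence i = 0..n-1      */
        shared += local[i];          /* read predecessor's shared,
                                        write own shared               */
        out[i] = shared;
    }
    return shared;                   /* only defined at the barrier
                                        sync, when the family completes */
}
```

Because each thread depends only on its immediate predecessor, composing such families (a thread creating a subordinate family) cannot introduce a communication cycle.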
Thread distribution
A create operation distributes a parameterised family of threads to processing resources - deterministically; the number of threads and processors is defined at runtime
Processors may be one or more homogeneous processors, a dedicated unit, configured FPGA cells, etc.
Communication deadlock is avoided, but resource deadlock can occur: processors have a finite set of registers for synchronising contexts. This can be statically analysed for some programs, but not for unbounded recursion of creates - this is solved by delegating a unit of work to a new place
Registers as synchronisers
Efficient* implementations of microthreads synchronise in shared registers (as i-structures): this avoids a memory round-trip latency in synchronising, and single-cycle synchronisation is possible
Families of threads communicate and synchronise on shared memory: a family's output to memory is not defined until the family completes (the synchronising event) - i.e. a bulk synchronisation or barrier
* we focus on direct implementations of the model at the level of ISA instructions
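Register synchronisation as an i-structure can be sketched in software with a full/empty flag and a blocking read. This is a minimal POSIX-threads analogue of the hardware mechanism, not the single-cycle implementation; the type and function names (`ireg_t` etc.) are invented for illustration.

```c
#include <pthread.h>
#include <stdbool.h>

/* An i-structure "register": reads of an empty register suspend the
   reader; a write fills the register and reschedules any readers.   */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    bool full;
    long value;
} ireg_t;

void ireg_init(ireg_t *r)
{
    pthread_mutex_init(&r->m, NULL);
    pthread_cond_init(&r->cv, NULL);
    r->full = false;
}

void ireg_write(ireg_t *r, long v)
{
    pthread_mutex_lock(&r->m);
    r->value = v;
    r->full = true;                 /* the data write completes ...   */
    pthread_cond_broadcast(&r->cv); /* ... and reschedules readers    */
    pthread_mutex_unlock(&r->m);
}

long ireg_read(ireg_t *r)
{
    pthread_mutex_lock(&r->m);
    while (!r->full)                /* empty: suspend on the register */
        pthread_cond_wait(&r->cv, &r->m);
    long v = r->value;
    pthread_mutex_unlock(&r->m);
    return v;
}
```

In hardware the suspension costs nothing beyond parking the thread reference in the register, which is exactly what avoids the memory round-trip.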
Putting it all together
[Figure: families of threads, indexed (i = 0, 2, 4, 6, …) and dynamically defined, with synchronised dependencies between threads; threads create subordinate families; units of work are delegated to places (resources), each a set of processor (P) and memory (M) pairs.]
Concurrency management - kill or squeeze a unit of work
Squeeze is a preemption or retraction of a concurrent unit of work
System issues
Threads are dynamic, share memory, and can be executed anywhere
Shared and/or distributed memory implementations are possible; a place can be on the same chip or on another
Deterministic distribution of families can be used to optimise data locality
Implementation of SVP in conventional processors
Dynamic RISC - DRISC
DRISC processor
Can apply microthreading to any base ISA: just add concurrency-control instructions, provide control structures for threads and families, and provide a large synchronising register file
Have backward compatibility to the base ISA: old binaries run as threads under the model
New binaries are schedule invariant and use from 1 to Nmax processors
Synchronous vs Asynchronous register updates
An instruction in a microthreaded pipeline updates registers either:
- synchronously, when the register is set at the writeback stage of the pipeline, or
- asynchronously, when the register is set to empty at the writeback stage and some activity concurrent to the pipeline's operation writes a value to the register file asynchronously
Some instructions do one or the other depending on machine state, e.g. load word depends on an L1 cache hit
Regular ISA + concurrency control
Add just five new instructions:
- cre - creates a family of microthreads; this is asynchronous and may set more than one register; the events are when the family is identified and when it completes; a Thread Control Block (TCB) in memory contains the parameters
- brk - terminates the family of the executing thread; a return value is read from the first register specifier
- kill & sqze - terminate & preempt a family specified by a family id; the family identifier is read from the second register specifier
DRISC pipeline
1. Threads are created, one per clock period, with a context in synchronising memory
2. Instructions are issued from the head of the active queue and read synchronising memory
3. If data is available it is sent for processing; otherwise the thread suspends on the empty register
4. Suspended threads are rescheduled when data is written and re-execute the blocked instruction

[Figure: the pipeline, fed by a queue of active threads and a thread instruction buffer (TIB), reading and writing synchronising memory; fixed-delay operations complete in the pipeline while variable-delay operations (e.g. loads) complete asynchronously.]

Note the potential for power efficiency:
a) if a thread is inactive its TIB line is turned off
b) if the queue is empty the processor turns off
c) the queue length measures local load and can be used to adjust the local clock rate
Processor control structures required
- A large synchronising register file (RF), plus a register-file map for register allocation
- A thread table (TT) to store a thread's state: PC, RF base addresses, queue link field, etc.
- A thread instruction buffer (TIB): an active thread is associated with a line in the TIB
- A family table (FT) to store family information
Thread and family identifiers are indices into the TT and FT respectively - i.e. they are direct-access structures
We do not require branch predictors, large data caches or complex issue logic
Synchronising memory
Registers provide the synchronising memory in a microthreaded pipeline
The state of a register is stored with its data, and its ports adapt according to that state
In state T-cont the register contains a TT address; in state RR-cont the register contains a remote RF address

[Figure: register state diagram. A register is initialised empty and a data write takes it to full. A local read finding no data takes it to T-cont, from which a data write completes and reschedules the suspended thread; a remote read finding no data takes it to RR-cont, from which a data write completes the remote read. Remote reads and the completing writes are asynchronous pipeline operations.]
Memory references
To provide latency tolerance, loads and stores are decoupled from the pipeline's operation; n.b. the datapath cache may be very small, e.g. 1 KByte
The ISA's load instruction is synchronous on an L1 D-cache hit and asynchronous on an L1 D-cache miss
In the latter case the target register is written empty by the pipeline and overwritten asynchronously by the memory subsystem when it provides the data
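The hit/miss split can be sketched as follows. This is a hypothetical single-threaded simulation, not the hardware: `memory_completes` stands in for the asynchronous memory subsystem, and all names are invented for illustration.

```c
#include <stdbool.h>

/* A register with a full/empty state, as in synchronising memory. */
typedef struct { bool full; long value; } reg_t;

/* An outstanding L1 D-cache miss awaiting completion. */
typedef struct {
    bool   pending;   /* is a miss outstanding?           */
    reg_t *target;    /* register to complete             */
    long   data;      /* data the memory will deliver     */
} miss_t;

/* Issue a load; 'hit' models the L1 D-cache lookup result. */
void load(reg_t *target, long mem_value, bool hit, miss_t *miss)
{
    if (hit) {
        target->value = mem_value;   /* synchronous writeback        */
        target->full  = true;
    } else {
        target->full  = false;       /* written empty by the pipeline;
                                        a read would now suspend     */
        miss->pending = true;
        miss->target  = target;
        miss->data    = mem_value;
    }
}

/* Later, the memory subsystem completes the miss asynchronously. */
void memory_completes(miss_t *miss)
{
    if (miss->pending) {
        miss->target->value = miss->data; /* asynchronous register write */
        miss->target->full  = true;       /* wakes any suspended thread  */
        miss->pending = false;
    }
}
```

The pipeline never stalls on the miss; only the thread that reads the empty register is suspended, which is the source of the latency tolerance.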
Register-to-register operations
Single-cycle operations are synchronous and scheduled every clock cycle using bypassing
Multi-cycle operations can be either synchronous or asynchronous
Variable-cycle operations (e.g. a shared FPU) are scheduled asynchronously: the writeback sets the register empty and any dependent instruction is blocked
Sharing registers between threads
Each thread has an identified context in the register file (≤ 31 registers + R31 == 0 with the Alpha ISA); registers are shared between threads' contexts to support the distributed-shared register file, and sharing is restricted: on the same processor sharing is performed by mapping; on adjacent processors sharing is performed by local communication
There are sub-classes of variables managed in the context:
- global - visible to all threads in a family (read only)
- local - to one thread only (read/write)
- shared/dependent - written by one thread and read by its neighbour

[Figure: register windows (≤ 31 registers each) of the creating thread and threads 1…n, showing global scalars, each thread's locals and shareds, and each thread's read-only view of its neighbour's shareds.]
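One way to picture sharing by mapping is as overlapping windows in a single register file. The layout below (G globals per family, S shareds and L locals per thread) is purely illustrative - the real DRISC allocation differs - but it shows how a thread's "neighbour's shared" window is not separate storage: it aliases the predecessor's shareds.

```c
/* Illustrative register-file layout for one family on one processor:
   [ G globals | thread 0: S shareds, L locals | thread 1: ... ]      */
enum { G = 4, S = 2, L = 8 };

int globals_base(int family_base)        { return family_base; }

int shareds_base(int family_base, int i) /* thread i's own shareds */
{
    return family_base + G + i * (S + L);
}

int locals_base(int family_base, int i)  /* thread i's locals */
{
    return shareds_base(family_base, i) + S;
}

/* Thread i reads its neighbour's shareds through thread i-1's
   window (for i == 0 this would map to the creating thread's
   shareds instead) - sharing purely by address mapping.          */
int neighbour_shareds_base(int family_base, int i)
{
    return shareds_base(family_base, i - 1);
}
```

On adjacent processors the same logical aliasing is realised by local communication over the register-sharing network rather than by address arithmetic.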
Create
Create performs the following actions autonomously:
1. writes the TCB address to the create buffer at the execute stage
2. sets two targets (e.g. Ra and Ra+1) to empty at the WB stage
3. when the family is allocated an FT slot it (optionally) writes the fid to Ra+1 using the asynchronous port; the family may now be killed or squeezed
4. when the family completes it (optionally) writes the return value to the target specified in the TCB using the asynchronous port
5. finally, when the family's memory writes have completed, it writes the return code to Ra using the asynchronous port and cleans itself up - i.e. releasing the FT slot
Squeeze and Kill
kill and squeeze are asynchronous and very powerful! To provide security, a pseudo-random number is generated by the processor and kept in the FT and as part of the fid; these must match to enable the operations
kill and squeeze traverse down through the create tree from the node the signal was sent to; for squeeze this is to a user-defined level
The concurrency tree is captured implicitly by a parent in the FT, i.e. families are located in related FTs that have the same fid as a parent; these children then propagate the signal in turn
Thread state
Threads are held in an indexed table; the table index is the thread's reference and is used to build queues on that table
Thread state in the TT is encoded by the queue a thread is currently in:
- empty - not allocated
- active - head/tail in the family table
- suspended - a degenerate queue (head = tail) stored in the register the thread is suspended on
- waiting - head/tail in an I-cache line
N.b. no thread will execute unless its instructions are in the cache
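The queue-on-table idea can be sketched as follows. The field names and sizes are hypothetical, but the point is structural: a queue is only a head/tail pair of TT indices plus one link field per entry, so enqueue and dequeue are O(1) with no storage beyond the table itself.

```c
/* A queue threaded through the thread table (TT): each TT entry
   carries a link field, and a queue is just head/tail indices.
   A thread's state is encoded by which queue its index sits in.  */
enum { NTHREADS = 16, NIL = -1 };

typedef struct { int link; } tt_entry_t;
typedef struct { int head, tail; } queue_t;

tt_entry_t tt[NTHREADS];

void q_init(queue_t *q) { q->head = q->tail = NIL; }

void q_push(queue_t *q, int tid)      /* enqueue at the tail, O(1) */
{
    tt[tid].link = NIL;
    if (q->head == NIL) q->head = tid;
    else                tt[q->tail].link = tid;
    q->tail = tid;
}

int q_pop(queue_t *q)                 /* dequeue from the head, O(1) */
{
    int tid = q->head;
    if (tid != NIL) {
        q->head = tt[tid].link;
        if (q->head == NIL) q->tail = NIL;
    }
    return tid;
}
```

The degenerate suspended "queue" (head = tail = one thread) needs only a single TT index, which is why it fits inside the register the thread is suspended on.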
Thread state transition
[Figure: thread state transitions. A thread that executes, context switches and reads data successfully stays active; one that reads data unsuccessfully becomes suspended. When data is written and the PC hits the I-cache a suspended thread becomes active again; when the PC misses the I-cache it becomes waiting, and it becomes active when the cache line returns.]
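The transitions above can be captured as a small transition function; the states and events are named after the labels in the diagram, but the encoding is a sketch, not the hardware representation.

```c
/* Thread states, encoded by the queue a thread currently sits in. */
typedef enum { T_EMPTY, T_ACTIVE, T_SUSPENDED, T_WAITING } tstate_t;

/* Scheduling events from the diagram. */
typedef enum {
    EV_READ_OK,       /* executes, context switches, reads data successfully   */
    EV_READ_BLOCK,    /* executes, context switches, reads data unsuccessfully */
    EV_WRITE_HIT,     /* data written & PC hits the I-cache                    */
    EV_WRITE_MISS,    /* data written & PC misses the I-cache                  */
    EV_LINE_RETURNS   /* cache line returns                                    */
} tevent_t;

tstate_t step(tstate_t s, tevent_t e)
{
    switch (s) {
    case T_ACTIVE:
        if (e == EV_READ_OK)    return T_ACTIVE;    /* stays schedulable     */
        if (e == EV_READ_BLOCK) return T_SUSPENDED; /* suspends on register  */
        break;
    case T_SUSPENDED:
        if (e == EV_WRITE_HIT)  return T_ACTIVE;
        if (e == EV_WRITE_MISS) return T_WAITING;   /* must fetch its line   */
        break;
    case T_WAITING:
        if (e == EV_LINE_RETURNS) return T_ACTIVE;
        break;
    default:
        break;
    }
    return s;  /* no transition defined for this state/event pair */
}
```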
Microgrids of microthreaded microprocessors
Family distribution to clusters
Source code: for i = 1, n {…}
Binary code: create i = 1, n {…}
The create distributes the family over a cluster of pipelines using a deterministic global schedule; hardware schedulers hold the thread queues at each pipeline. Microthreads are scheduled to pipelines dynamically and instructions are executed according to dataflow constraints.

[Figure: a family created with i = 1, n distributed over processors P0…P3 connected by a register-sharing ring network, each processor executing a subset of the indices i = 1…12.]
SEP - dynamic processor allocation
The microgrid concept defines a pool of bare processors allocated dynamically by the SEP to threads at any level in the concurrency tree, in order to delegate units of work. A cluster of processors is configured as a ring and is known as a place, identified by the address of the root processor. Microthreaded binary code can be executed anywhere and on any number of processors.
[Figure: the SEP partitioning a pool of µT processors into clusters of various sizes.]
Delegation across CMP
Coherent shared memory spans the CMP.

[Figure: a unit of work delegated from one cluster to another (via configuration switches) - a delegate request is sent to the remote cluster and a response is returned.]
Example Chip Architecture
[Figure: a level-0 tile contains four pipelines (Pipe 0…3) sharing an FPU and data-diffusion memory; level-1 tiles are composed of level-0 tiles. Interconnect: a coherency network (64-byte-wide ring / ring of rings), a register-sharing network (8-byte-wide ring) and a delegation network (1-bit-wide grid).]
The big picture - where are we?
[Figure: status of the tool chain - components: an FPGA microthreaded processor; a µTC-to-ISA compiler (Alpha + µT ISA); C to µTC (sequential), SAC to µTC (data parallel) and Snet to µTC (streaming) compilers; a microthreaded CMP simulator/emulator; an Alpha + µT ISA assembler/loader; and hand-assembled kernels. Components are marked as existing today, in development, or to be developed.]
Discussion
Microthreading provides a unified model of concurrency on a scale from CMPs to grids
Programs in the model are composed concurrently, with restrictions to allow safe composition
It reflects the problems of future silicon implementations
We have developed a language, µTC, that captures this concurrency
Conclusions
Microthreaded processors are both computationally and power efficient: code is schedule invariant and dynamically distributed, and instructions are dynamically interleaved
Control structures are distributed and scalable - small compared to an FPU
They can manage code fragments (threads) as small as a few instructions: context switch, signal and reschedule a thread on every clock cycle