November 18, 2005 PACL and ASC Processor Research Overview
Research Overview
Parallel and Associative Computing Group and the ASC Processor Group
Kent State University
Dr. Johnnie Baker, Dr. Robert Walker, and Dr. Jerry Potter (Emeritus), Michael Scherger, Wittaya Chantamas, Hong Wang
Sabegh Singh Virdi, Shannon Steinfadt, Kevin Schaffer
Department of Computer Science
Kent State University
Kent, Ohio
[Figure: research areas of the two groups]
- Groups: Parallel and Associative Research Group; ASC Processor Research Group
- Areas shown: Associative Models of Computation; Parallel Runtime Environments; Parallel and Associative System Software; Parallel and Associative Applications; Associative and Parallel Algorithms
- Processor work shown: FPGA-Based ASC Processor; MASC Processor; Structure Codes, ASC-centric Implementations; Pipelined ASC w/ Reconfigurable Network; Multithreaded ASC Processor
Presentation Outline
- Short Overview of Associative Models
  - The Single Instruction Stream ASC Model
  - The Multiple-Instruction Stream MASC Model
- Architectural Modeling and Runtime Environments
  - MASC Runtime Environments – Michael Scherger
  - Supporting Multiple Instruction Streams using the Manager-Worker Paradigm – Wittaya Chantamas
- ASC Processor Design
  - Scalable Pipelined ASC Processor with Reconfigurable PE Network to Support MASC – Hong Wang
Associative Models of Computation
- Associative Computer: A SIMD computer with certain additional hardware features.
  - Features can be supported (less efficiently) in software by a traditional SIMD
  - The name “associative” is due to its ability to locate items in the memory of PEs by content rather than location.
  - Uses associative features to simulate an associative memory
- The ASC model (for ASsociative Computing) identifies the properties assumed for an associative computer.
The Associative Computing (ASC) Model
[Figure: ASC model. Labels: Instruction Stream; Broadcast / Reduction Network; Cell Network; cells, each a PE with Memory.]
Associative Properties of the ASC Model
- Broadcast data in constant time
- Constant time global reduction of
  - Boolean values using AND/OR
  - Integer values using MAX/MIN
- Constant time associative search
- Responder processing
  - An IS can detect if a data test is satisfied by any of its cells in constant time (i.e., any-responders)
  - An IS can select one arbitrary responder in constant time (i.e., pick-one)
- Above properties supported in hardware with broadcast and reduction networks
- Reference: M. Jin, J. Baker, and K. Batcher, Timings of Associative Operations on the MASC model, Workshop of Massively Parallel Processing, IPDPS ’01.
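The properties above can be sketched in software. This is only a minimal illustration: the cell records, field names, and function names are hypothetical, and the Python loops stand in for operations that the ASC model assumes the broadcast and reduction networks perform in constant time.

```python
from functools import reduce

# Hypothetical cells: each entry models one PE's local memory.
cells = [
    {"id": 0, "shape": "circle", "size": 3},
    {"id": 1, "shape": "rect", "size": 7},
    {"id": 2, "shape": "circle", "size": 5},
    {"id": 3, "shape": "triangle", "size": 2},
]

def broadcast(cells, field, value):
    """The IS sends one value to every cell."""
    for cell in cells:
        cell[field] = value

def search(cells, predicate):
    """Associative search: every cell tests its own memory; returns the responder mask."""
    return [predicate(cell) for cell in cells]

def any_responders(mask):
    """Global OR reduction over the responder mask."""
    return reduce(lambda a, b: a or b, mask, False)

def pick_one(mask):
    """Select one responder (here the lowest index); None if there are none."""
    for index, responds in enumerate(mask):
        if responds:
            return index
    return None

def global_max(cells, field, mask):
    """MAX reduction of an integer field over the responders only."""
    values = [cell[field] for cell, responds in zip(cells, mask) if responds]
    return max(values) if values else None

mask = search(cells, lambda cell: cell["shape"] == "circle")
broadcast(cells, "flag", 1)
```

Here `mask` marks the two circle cells as responders, and `pick_one`, `any_responders`, and `global_max` operate on that mask rather than on memory addresses.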
The MASC Model
[Figure: MASC model. Labels: multiple Instruction Streams; Instruction Stream Network; Broadcast / Reduction Network; Cell Network; cells, each a PE with Memory.]
The MASC Model
- MASC (i.e., Multiple ASC) is a multiple ASC model
  - Multiple SIMD model with more than one Instruction Stream (IS)
  - Each IS can execute a separate data-parallel task
  - These threads execute to completion without interaction or interruption
- Dynamically reconfigurable
  - Each cell listens to only one IS
  - Cells can switch ISs, based on a data test
  - Cells can switch between being active, inactive, or idle
- Each IS with its cells satisfies the ASC model
- Job/functional parallelism is used to control the ISs
WEBSITE FOR PAPERS
http://www.cs.kent.edu/~parallel
Follow pointer to “papers”
MASC Runtime Environment
- Designed extensions to the existing ASC instruction set to support multiple instruction streams
  - ISGEN compiler extension
  - Reference: Scherger, Michael, Jerry Potter, and Johnnie Baker, “Multiple Instruction Stream Control for an Associative Model of Parallel Computation”, Proc. of the 16th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing), April 2003.
- Developed a prototype MASC runtime environment using a cluster (proof of concept for multiple instruction streams)
Parallel if-then-else with Instruction Stream Commands

    MI_REGION_BEGIN A
    if (parallel conditional expression) then
        MI_BEGIN A0
            <body_1>    /* 15 instructions */
        MI_END A0
    else
        MI_BEGIN A1
            <body_2>    /* 10 instructions */
        MI_END A1
    endif;
    MI_REGION_END A
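A rough way to see what the two MI regions buy is to count steps. This sketch assumes one instruction per step and ignores any fork/join overhead; both are simplifying assumptions, and the function names are hypothetical rather than part of the MASC instruction set.

```python
def simd_steps(then_len, else_len):
    """Single IS: the two masked branch bodies are issued one after the other."""
    return then_len + else_len

def masc_steps(then_len, else_len):
    """Two ISs: each body runs on its own IS, so the longer body dominates."""
    return max(then_len, else_len)

# Body lengths taken from the comments in the listing above.
single_is = simd_steps(15, 10)
two_is = masc_steps(15, 10)
```

Under these assumptions the branch costs 25 steps on one instruction stream and 15 steps when the bodies run on two.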
Shape Example
[Figure: flowchart. Tests "Circle?", "Rectangle?", and "Triangle?" lead to "Compute Area of Circle", "Compute Area of Rectangle", and "Compute Area of Triangle".]
Runtime Environment
[Figure: the shape flowchart (Circle?, Rectangle?, Triangle?) annotated with the MI_REGION_BEGIN / MI_REGION_END and MI_BEGIN / MI_END markers for regions A, B, and C.]

Instruction stream activity, in order:
- IS 0: compare shape == circle; nonresponders -> IS 1; compute circle area; noop; noop; noop; noop
- IS 1: noop; compare shape == rect; nonresponders -> IS 2; compute rectangle area; noop; noop; listen IS 0
- IS 2: noop; noop; noop; compare shape == triangle; compute triangle area; listen IS 1

Code with instruction stream commands:

    MI_REGION_BEGIN A
    compare if shape is a circle
    MI_BEGIN A0
        5 instructions to compute area of a circle
    MI_END A0
    MI_BEGIN A1
        MI_REGION_BEGIN B
        compare if shape is a rectangle
        MI_BEGIN A1-B0
            3 instructions to compute area of a rectangle
        MI_END A1-B0
        MI_BEGIN A1-B1
            MI_REGION_BEGIN C
            compare if shape is a triangle
            MI_BEGIN A1-B1-C0
                5 instructions to compute area of a triangle
            MI_END A1-B1-C0
            MI_REGION_END C
        MI_END A1-B1
        MI_REGION_END B
    MI_END A1
    MI_REGION_END A
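The shape example can be mimicked in a few lines. This is a sketch under stated assumptions: the cell fields (`r`, `w`, `h`, `b`) and function names are hypothetical, and a sequential loop stands in for three instruction streams working on disjoint sets of cells at the same time.

```python
import math

# Hypothetical cell records, one shape per cell.
cells = [
    {"shape": "circle", "r": 1.0},
    {"shape": "rect", "w": 2.0, "h": 3.0},
    {"shape": "triangle", "b": 4.0, "h": 5.0},
    {"shape": "rect", "w": 1.0, "h": 1.0},
]

def assign_streams(cells):
    """Cascade of data tests: responders stay, non-responders move to the next IS."""
    for cell in cells:
        if cell["shape"] == "circle":
            cell["is"] = 0
        elif cell["shape"] == "rect":
            cell["is"] = 1
        elif cell["shape"] == "triangle":
            cell["is"] = 2
        else:
            cell["is"] = None      # no test matched: the cell stays idle

# One area routine per instruction stream.
AREA_BY_STREAM = {
    0: lambda c: math.pi * c["r"] ** 2,
    1: lambda c: c["w"] * c["h"],
    2: lambda c: 0.5 * c["b"] * c["h"],
}

def compute_areas(cells):
    assign_streams(cells)
    for cell in cells:
        stream = cell["is"]
        cell["area"] = None if stream is None else AREA_BY_STREAM[stream](cell)
    return [cell["area"] for cell in cells]

areas = compute_areas(cells)
streams = [cell["is"] for cell in cells]
```

Each cell ends up listening to exactly one stream, which matches the "nonresponders -> IS n" hand-offs in the activity list above.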
Outline
- A review of the MASC Computational Model using the manager/worker paradigm and a work pool of tasks
- Design and implementation of a MASC back-end compiler for the ASC language (an ongoing project)
- An overview of the MASC emulator (the next project)
MASC Computational Model
- Two types of ISs
  - one manager IS
    - fork and join tasks
    - manage work pool
  - a few worker ISs
    - execute tasks
- A work pool of tasks
[Figure: Manager-IS (ID 0) and Worker-ISs (ID 1, ID 2) on an Instruction Stream Network; Broadcast/Reduction Networks; cells joined by a Cell Network.]
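The manager/worker idea can be illustrated with a small simulation. This is a sketch only: the function name, the task tuples, and the one-step-per-instruction timing are assumptions for illustration, not the MASC runtime's actual interface.

```python
from collections import deque

def run_work_pool(tasks, num_workers):
    """Manager IS (ID 0) hands tasks from a work pool to idle worker ISs.

    tasks: list of (name, length_in_steps).
    Returns ({worker_id: [task names]}, total steps until all workers are idle).
    """
    if num_workers < 1:
        raise ValueError("at least one worker IS is required")
    pool = deque(tasks)                                    # the work pool
    remaining = {w: 0 for w in range(1, num_workers + 1)}  # steps left per worker
    history = {w: [] for w in remaining}
    steps = 0
    while pool or any(remaining.values()):
        # Fork: give a task to every idle worker while tasks remain.
        for worker in remaining:
            if remaining[worker] == 0 and pool:
                name, length = pool.popleft()
                remaining[worker] = length
                history[worker].append(name)
        # Every busy worker executes one step of its task.
        for worker in remaining:
            if remaining[worker] > 0:
                remaining[worker] -= 1
        steps += 1
    return history, steps  # join: pool drained and all workers idle

history, steps = run_work_pool([("then", 15), ("else", 10)], num_workers=2)
```

With two workers the two tasks overlap; with a single worker the same pool is drained sequentially.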
MASC Directive
- Concurrent data-parallel execution of different paths in a branch can be achieved by using the directive
    /* .masc fork */
- The user has tight control
  - Not all paths in branches will be executed concurrently
  - Only those in branches with directives will
- Considered a comment by the ASC compiler (shows in the .lst file, does not show in the .iob file)
  - No need for a new ASC compiler in order to run an ASC program on a MASC system
    main test
    int parallel b[$], c[$], d[$];
    logical parallel BCD[$];
    associate b[$], c[$], d[$] with BCD[$];

    read b[$] c[$] d[$] in BCD[$];
    b[$] = c[$] + 2;
    c[$] = d[$] - 3;

    /* will be no fork here */
    if (b[$] .lt. c[$]) then
        b[$] = c[$];
        d[$] = 4;
    else
        c[$] = b[$];
        b[$] = d[$];
    endif;
    c[$] = d[$];
    d[$] = c[$];
    end;

Structure codes: M1000000, W1100000, M1110000

Listing for a structure code:

    .MI_BEGIN W1100000
    beg_of_stmt 1c00 6 0
    beg_read 5a00 SYSOT BCD B,C,D,
    …
    beg_of_stmt 1c00 20 0
    mvpa_ 4812 C D
    .MI_END W1100000
    main test
    int parallel b[$], c[$], d[$];
    logical parallel BCD[$];
    associate b[$], c[$], d[$] with BCD[$];

    read b[$] c[$] d[$] in BCD[$];
    b[$] = c[$] + 2;
    c[$] = d[$] - 3;

    /*.MASC FORK */
    if (b[$] .lt. c[$]) then
        b[$] = c[$];
        d[$] = 4;
    else
        c[$] = b[$];
        b[$] = d[$];
    endif;
    c[$] = d[$];
    d[$] = c[$];
    end;

Structure codes: M1000000, W1100000, M1110000, W1111000, W1112000, W111X100, M111X110

Listing for a structure code:

    .MI_BEGIN W1112000
    beg_of_stmt 1c00 16 0
    mvpa_ 4812 B C
    beg_of_stmt 1c00 17 0
    mvpa_ 4812 D B
    .MI_END W1112000
A MASC Emulator
- Software that emulates the exact behavior of MASC hardware on a PC
  - Thus allows an ASC program to run on a PC as if the program were run on a MASC system
- A modified version of the existing ASC emulator with built-in performance monitoring
- The manager/worker paradigm and the work pool idea will be implemented in the emulator
  - MASC runtime system
Outline of Talk
- ASC Processor (Work Mostly Complete)
  - Pipelined Architecture
  - Reconfigurable PE Interconnection Network
  - Processor and Network Performance
- MASC Architecture (Work in Progress)
  - Implementation of Task Manager and Instruction Stream
  - Sample Code
  - Architecture and Sample Execution
- Conclusion
ASC Processor’s Pipelined Architecture
- We have implemented a pipelined SIMD Associative (ASC) Processor using Altera FPGAs
- Five single-clock-cycle pipeline stages are split between the SIMD Control Unit (CU) and the PEs
  - In the Control Unit
    - Instruction Fetch (IF)
    - Part of Instruction Decode (ID)
  - In the Scalar PE (SPE) and in each Parallel PE (PPE)
    - Rest of Instruction Decode (ID)
    - Execute (EX)
    - Memory Access (MEM)
    - Data Write Back (WB)
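To make the five-stage overlap concrete, the sketch below builds the schedule of an idealized pipeline. It assumes no stalls or hazards, which the slides do not claim, and the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(num_instructions):
    """Return {cycle: {stage: instruction index}} for an ideal pipeline without stalls."""
    schedule = {}
    for instr in range(num_instructions):
        # Instruction i enters IF in cycle i and advances one stage per cycle.
        for offset, stage in enumerate(STAGES):
            schedule.setdefault(instr + offset, {})[stage] = instr
    return schedule

def total_cycles(num_instructions):
    """Cycles until the last instruction leaves WB: n + (stages - 1)."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1

schedule = pipeline_schedule(3)
```

In cycle 2 of this schedule, instruction 0 is in EX while instructions 1 and 2 are in ID and IF, so once the pipeline is full one instruction completes per clock.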
Pipelined ASC Processor with Reconfigurable Interconnection Network
[Figure: block diagram. Labels: Control Unit (CU); Sequential PE (SPE); Parallel PE (PPE) Array; Instruction Memory; Decoder; Register File; Data Memory; IF/ID, ID/EX, EX/MEM, and MEM/WB Latches; Immediate Data; Broadcast Register Data.]
Processing Element (PE)
[Figure: PE datapath. Labels: Register File; Data Switch; Comparator; Mask; ID/EX, EX/MEM, and MEM/WB Latches; Data Memory; MUX.]
- Comparator implements associative search: pushes ‘1’ onto the top of the mask stack for responders, ‘0’ otherwise
- A ‘0’ on top of the mask stack disables the ID/EX Latch
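A small software model of the mask stack may help. It is a sketch with hypothetical names, and it assumes that a PE starts enabled and that a disabled PE pushes ‘0’ on nested searches; neither detail is stated on the slide.

```python
class ProcessingElement:
    """Hypothetical PE with a one-bit-wide mask stack."""

    def __init__(self, value):
        self.value = value
        self.mask_stack = [1]          # assumption: the PE starts enabled

    def enabled(self):
        # A 0 on top of the mask stack disables the PE (its ID/EX latch).
        return self.mask_stack[-1] == 1

    def compare_push(self, predicate):
        # Responders push 1, all other PEs push 0.
        responds = self.enabled() and predicate(self.value)
        self.mask_stack.append(1 if responds else 0)

    def pop(self):
        self.mask_stack.pop()

    def execute(self, operation):
        # Disabled PEs ignore the broadcast instruction.
        if self.enabled():
            self.value = operation(self.value)

pes = [ProcessingElement(v) for v in [1, 5, 8, 12]]
for pe in pes:
    pe.compare_push(lambda v: v > 4)        # outer search
for pe in pes:
    pe.compare_push(lambda v: v % 2 == 0)   # nested search
for pe in pes:
    pe.execute(lambda v: v * 10)            # runs only on responders
for pe in pes:
    pe.pop()
    pe.pop()
values = [pe.value for pe in pes]
```

Only the PEs holding 8 and 12 respond to both searches, so only they execute the multiply.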
Pipelined ASC Processor’s Performance
- Our pipelined ASC Processor has been implemented on an Altera APEX20KC1000 FPGA with 70 8-bit PEs
  - Other 8-bit processor cores implemented on this FPGA / speed grade have clock speeds ranging from 30 to 106 MHz, typically 60-68 MHz
- Our pipelined ASC Processor has a clock speed of 56.4 MHz, comparable with these other processors
  - With the 5-stage pipeline, our ASC Processor can approach a peak performance of 300 MHz
Reconfigurable PE Interconnection Network
- Our pipelined ASC Processor also has a reconfigurable PE interconnection network
- Reconfigurable PE network allows arbitrary PEs in the PE Array to be connected, without the restriction of physical adjacency, via
  - Linear array (currently implemented), or
  - 2D mesh (to be implemented soon)
- Each PE in the PE Array can
  - Choose to stay in the PE interconnection network, or
  - Choose to stay out of the PE interconnection network, so that it is bypassed by any inter-PE communication
Reconfigurable Network Implementation
[Figure: Data Switch. Labels: Register File; Register Data (from SPE); Immediate Data (from CU); Left Neighbor; Right Neighbor; Top of Mask Stack; Comparator & ID/EX Latch.]
- Data switch
  - Passes register, broadcast, and immediate data to the PE and to its two neighbors
  - Routes data from the PE’s neighbors to its EX stage
- Reconfigurable network: supports Bypass Mode to remove non-responder PEs from the network
  - Will be needed by MASC Processor
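Bypass Mode on the linear array can be sketched as a shift in which PEs outside the network are skipped. The function name and the choice of `None` for "nothing received" are assumptions for illustration.

```python
def shift_right(values, in_network):
    """Each in-network PE receives the value of its nearest in-network left neighbor.

    PEs with in_network[i] == False are bypassed: they neither send nor receive.
    A PE with no in-network PE to its left receives None.
    """
    received = [None] * len(values)
    travelling = None                  # value currently passing along the array
    for i, (value, active) in enumerate(zip(values, in_network)):
        if active:
            received[i] = travelling   # take what arrived from the left
            travelling = value         # and send our own value onward
        # A bypassed PE just forwards `travelling` unchanged.
    return received

result = shift_right([10, 20, 30, 40], [True, False, False, True])
```

Here PE 3 receives the value of PE 0 even though the two are not physically adjacent, because PEs 1 and 2 are bypassed.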
ASC Processor’s Network Performance
- Performance of ASC Processor degrades as number of PEs is increased with Bypass Mode present
  - Due to the long path from the first PE to the last PE in the PE array
- 4-PE ASC Processor requires 2152 LEs and runs at 56.4 MHz with Bypass Mode present
  - When the number of PEs is increased to 50, the clock frequency drops to 22 MHz
- In the future we hope to reduce this delay using a pipelined or other multi-hop architecture
[Figure: state diagrams. Task Manager states: IDLE, Task_Allocation, Wait_For_IS, Join. Instruction Stream states: IDLE, Task_Execution, Call_TM.]
MASC PE Structure
[Figure: MASC PE. Labels: PE; ID Register; IS_TM_Chooser with inputs IS1, IS2, TM1, TM2.]
[Figure: the Task Manager (IDLE, Task_Allocation, Wait_For_IS, Join) and Instruction Stream (IDLE, Task_Execution, Call_TM) state diagrams, annotated with TM ID and IS ID.]
Assembly Code Example

        .
        .
    101 Parallel_Select_Start Mem(110)
    102 Pcase Condition1 Mem(104)
    103 Pcase Condition2 Mem(107)
    104 Case1
    105 …
    106 Parallel_Case_End
    107 Case 2
    108 …
    109 Parallel_Case_End
    110 Parallel_Select_End (note: this does not trigger JOIN; lack of tasks does)
        .
        .
[Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 through PE5.]
[Figure: Task Managers TM0-TM2, Instruction Streams IS0-IS2, PEs PE0-PE5.]
Originally, all PEs listen to IS0.
[Figure: Task Managers TM0-TM2, Instruction Streams IS0-IS2, PEs PE0-PE5.]
When Parallel Select is met, the Task Manager takes over the PEs.

    101 Parallel_Select_Start Mem(110)
[Figure: Task Managers TM0-TM2, Instruction Streams IS0-IS2, PEs PE0-PE5.]
The TM then calls IS0 to perform the 1st task.

    102 Pcase Condition1 Mem(104)
    104 Case1
    105 …
[Figure: Task Managers TM0-TM2, Instruction Streams IS0-IS2, PEs PE0-PE5.]
The TM then calls IS1 to perform the 2nd task.

    2nd task:
    103 Pcase Condition2 Mem(107)
    107 Case 2
    108 …

    1st task (still running):
    102 Pcase Condition1 Mem(104)
    104 Case1
    105 …
[Figure: Task Managers TM0-TM2, Instruction Streams IS0-IS2, PEs PE0-PE5.]
The 2nd task finishes and gives control back to the TM.

    2nd task:
    107 Case 2
    108 …
    109 Parallel_Case_End

    1st task (still running):
    102 Pcase Condition1 Mem(104)
    104 Case1
    105 …
[Figure: Task Managers TM0-TM2, Instruction Streams IS0-IS2, PEs PE0-PE5.]
The 1st task finishes and gives control back to the TM.

    104 Case1
    105 …
    106 Parallel_Case_End
[Figure: Task Managers TM0-TM2, Instruction Streams IS0-IS2, PEs PE0-PE5.]
Control goes back to the last finished IS, which is IS0.

    110 Parallel_Select_End
        .
        .
[Figure: Task Managers TM0-TM2, Instruction Streams IS0-IS2, PEs PE0-PE5.]
IS1 meets a nested parallel select code.
[Figure: Task Managers TM0-TM2, Instruction Streams IS0-IS2, PEs PE0-PE5, and a Common Register. Labels: A = 2; C = A; B = A.]
TM1 allocates the two tasks to IS1 and IS2.
Conclusion
- We have implemented a SIMD associative ASC Processor (on an FPGA) that combines the parallelism of SIMD architectures with the search capabilities of associative computing
  - Performance is improved by adding a 5-stage pipeline, split between the Control Unit and the PEs
  - Additional functionality is provided by a reconfigurable PE interconnection network
- Future work will include
  - Support for multiple Control Units (in progress)
  - Performance improvement to support more efficient broadcast to a large number of PEs