Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1
Coordinator of the European Project
Permanent Staff Researcher (part time)
Istituto Nazionale di Fisica Nucleare
Roma – Italy
4 generations of APE Massive Parallel Processors based on custom VLIW DSP
(part time) Technology Director and Founder
ATMEL Roma, corporate design center of ATMEL for Advanced DSP
2 generations of RISC + custom mAgic VLIW DSP MPSoC
SHAPES: a tiled scalable Software Hardware Architecture Platform for Embedded Systems
HW overview
P.S. Paolucci (INFN and Atmel Roma) for the SHAPES Consortium
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 2
Deep Sub-micron Architectures…
~160 MGate available on a 100 mm2 chip (45nm CMOS, 2008)
Increasing GATES/CHIP
Design Complexity Management: embedded processors use a few million gates only, IP reuse possible and needed
WIRING threatens Moore’s law:
Wiring delay increases on new CMOS silicon generations
The full chip cannot be reached in a single clock cycle
Classic monolithic processor architectures do not scale
Locally Synchronous, Globally Asynchronous needed
Communication Centric SW and HW Architecture needed
… PROPOSED SOLUTION: TILED ARCHITECTURE. BY SIMPLE GEOMETRIC DEMONSTRATION: IF LOGIC COMPLEXITY INSIDE EACH TILE IS CONSTANT, THEN THE LENGTH OF INTRA-TILE WIRES SCALES DOWN AS THE TILE ITSELF, AND ON-CHIP AND OFF-CHIP INTER-TILE WIRES STAY SHORT (~ FIRST NEIGHBOURS)
QUEST FOR THE BEST TILE, ON-CHIP AND OFF-CHIP INTERCONNECT. BUT HOW TO PROGRAM? AN EXPLICIT PARALLEL PROGRAMMING PARADIGM AND CULTURE ARE NEEDED
POWER DISSIPATION density approaches prohibitive values if a higher clock speed is used; much better Oper/Watt at moderate clock + parallelism (the parallel architecture of the human brain performs an excellent job at 50 Hz!... room for improvement)
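The geometric argument can be checked numerically. The sketch below is only an illustration: it takes the ~160 MGate on 100 mm2 at 45nm figure from this slide and the 4230 equivalent Kgate tile complexity quoted on the Tile Complexity slide (DNP excluded), and it assumes ideal scaling, i.e. gate density growing as 1/feature_size^2. The function names and the list of technology nodes are hypothetical.

```python
import math

# Figures quoted on the slides.
GATES_PER_CHIP = 160e6   # ~160 MGate on one chip
CHIP_AREA_MM2 = 100.0    # 100 mm2 die
REF_NODE_NM = 45.0       # 45nm CMOS
TILE_GATES = 4.23e6      # 4230 equivalent Kgate per tile, DNP not included


def gate_density(node_nm):
    """Gates per mm2, ASSUMING ideal scaling: density ~ 1 / feature_size**2."""
    ref_density = GATES_PER_CHIP / CHIP_AREA_MM2
    return ref_density * (REF_NODE_NM / node_nm) ** 2


def tile_side_mm(node_nm, tile_gates=TILE_GATES):
    """Side of a square tile that holds a constant number of gates."""
    return math.sqrt(tile_gates / gate_density(node_nm))


def tiles_per_chip(node_nm, chip_area_mm2=CHIP_AREA_MM2, tile_gates=TILE_GATES):
    """How many constant-complexity tiles fit in the given die area."""
    return int(gate_density(node_nm) * chip_area_mm2 // tile_gates)


if __name__ == "__main__":
    for node in (90.0, 65.0, 45.0, 32.0):
        print(f"{node:>4.0f} nm: tile side {tile_side_mm(node):.2f} mm, "
              f"{tiles_per_chip(node)} tiles on {CHIP_AREA_MM2:.0f} mm2")
```

Under these assumptions the tile side, and therefore the longest intra-tile wire, shrinks linearly with the feature size, while the number of tiles per chip grows quadratically.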
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 3
Tiled HW Architecture
Communication Centric, not Processor Centric
Homogeneous SW interface for on-chip and off-chip scalable connection and I/O
3D first-neighbour Toroidal System Eng. (3DT) for Off-Chip communication
Virtual tunnelling on packet switching NoC (Network on Chip) and off-chip 3DT
Parallelism Aware System SW: Manage memory distribution, capture real time constraints
Explicit parallel programming/Network of Actors
[Figure labels: two groups of nodes numbered 0 to 15; an array of Tiles, each paired with OFF-CHIP MEM; one Tile expanded into RISC, DSP, DNP, Multi-Layer BUS, POT and DXM, with NoC and 3DT off-chip communication links; FPGA, ADC sensor, DAC actuator and ADC/DAC at the system boundary]
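A 3D first-neighbour toroidal connection gives every tile six off-chip neighbours, with wrap-around at the edges of the lattice. The following is a minimal sketch of that addressing, assuming tiles are identified by (x, y, z) coordinates on a lattice of size dims; the coordinate scheme and the function name are illustrative assumptions, not taken from the slides.

```python
def torus_neighbours(coord, dims):
    """Return the six first neighbours of a tile on a 3D torus.

    coord: (x, y, z) position of the tile (hypothetical addressing scheme).
    dims:  (nx, ny, nz) size of the lattice; indices wrap around each axis.
    """
    x, y, z = coord
    nx, ny, nz = dims
    return {
        "X+": ((x + 1) % nx, y, z),
        "X-": ((x - 1) % nx, y, z),
        "Y+": (x, (y + 1) % ny, z),
        "Y-": (x, (y - 1) % ny, z),
        "Z+": (x, y, (z + 1) % nz),
        "Z-": (x, y, (z - 1) % nz),
    }


if __name__ == "__main__":
    # A corner tile of a 4x4x4 lattice: its "minus" neighbours wrap around.
    for port, neighbour in torus_neighbours((0, 0, 0), (4, 4, 4)).items():
        print(port, neighbour)
```

Because of the wrap-around, every tile sees the same neighbourhood, which is what allows identical tiles and identical links everywhere in the system.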
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 4
HW GLOSSARY

AT THE CHIP LEVEL
MTC: Multiple Tile Chip (composed of multiple Tiles)
NOC: Network On Chip (connecting Tiles)
3DT: 3 Dim Toroidal Connection (outside the chip)

FUNDAMENTAL TYPE OF TILE
RDT includes:
RISC (includes on chip memories RDM and RPM) +
DSP (includes on chip memories DDM and DPM) +
DNP + DXM (off-chip mem) + POT (e.g. DAC/ADC conv)

POSSIBLE TILE VARIANTS (subset of RDT)
RET := RDT minus DSP
DET := RDT minus RISC
DDT := DET minus DXM

INSIDE THE TILE
Multilayer Bus Matrix: sustains multiple simultaneous transfers
RISC: max one per tile
DSP: one or more per tile
DNP: Distributed Network Processor (always one per tile)
DDM: on-chip Distributed Data Mem (inside the DSP)
DPM: on-chip Distributed Progr Mem (inside the DSP)
DXM: Distributed eXternal Mem Interface (max one per tile, outside the RISC and DSP)
POT: Peripherals On Tile
RDM: Risc (tightly coupled) Data Memory
RPM: Risc (tightly coupled) Program Memory
RCM: Risc Cache Memory
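The tile variants are defined by subtraction from the full RDT, which maps directly onto set operations. This is a small sketch of the glossary definitions only; the block names are the glossary acronyms, while the validity check is a hypothetical helper that encodes the per-tile limits stated above (at most one RISC, exactly one DNP, at most one DXM).

```python
# Building blocks of the fundamental tile, named as in the glossary.
RDT = frozenset({"RISC", "DSP", "DNP", "DXM", "POT"})

# Variants exactly as defined on the slide.
RET = RDT - {"DSP"}    # RDT minus DSP
DET = RDT - {"RISC"}   # RDT minus RISC
DDT = DET - {"DXM"}    # DET minus DXM

VARIANTS = {"RDT": RDT, "RET": RET, "DET": DET, "DDT": DDT}


def is_valid_tile(counts):
    """Check per-tile block counts against the glossary limits.

    counts: dict mapping block name to how many instances the tile holds.
    Hypothetical helper: only the limits are taken from the slide.
    """
    return (
        counts.get("RISC", 0) <= 1      # RISC: max one per tile
        and counts.get("DNP", 0) == 1   # DNP: always one per tile
        and counts.get("DXM", 0) <= 1   # DXM: max one per tile
        and counts.get("DSP", 0) >= 0   # DSP: one or more where present
    )


if __name__ == "__main__":
    for name, blocks in VARIANTS.items():
        print(name, sorted(blocks))
```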
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 5
Different Types of Tiles
[Figure: three tile block diagrams, each built around a Multi-Layer BUS with a DNP (NoC and 3DT ports), POT (POT Pads) and DXM (DXM Mem Bus); the processor blocks differ per variant]
RDT: RISC + DSP Elementary Tile
RET: RISC Elementary Tile
DET: DSP Elementary Tile
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 6
Distributed Network Processor
DNP: Interface
DNP
BUS Master (to read from intra-tile memories)
BUS Slave (to receive commands from RISC & DSP)
3DT X+ (to forward/receive inter-tile off chip packets)
3DT X-
3DT Y+
3DT Y-
3DT Z+
3DT Z-
NoC (to forward/receive inter-tile on-chip packets)
BUS Master (simultaneous intra-tile memory write)
Collective communication
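With six 3DT ports per DNP, a packet travelling off-chip moves one first neighbour at a time. The sketch below counts the minimum number of such hops between two tiles, assuming (x, y, z) tile coordinates and assuming the packet may take the shorter way around each ring of the torus. The slides do not state the DNP routing algorithm, so this is a lower bound under those assumptions, not a description of the actual router.

```python
def ring_distance(a, b, n):
    """Shortest distance between positions a and b on a ring of n nodes."""
    d = abs(a - b) % n
    return min(d, n - d)


def torus_hops(src, dst, dims):
    """Minimum number of first-neighbour hops between two tiles on a 3D torus.

    src, dst: (x, y, z) tile coordinates (hypothetical addressing scheme).
    dims:     (nx, ny, nz) lattice size.
    """
    return sum(ring_distance(a, b, n) for a, b, n in zip(src, dst, dims))


if __name__ == "__main__":
    dims = (4, 4, 4)
    print(torus_hops((0, 0, 0), (3, 0, 0), dims))  # wrap-around on X
    print(torus_hops((0, 0, 0), (2, 2, 2), dims))  # farthest tile on 4x4x4
```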
7
Atmel Roma vs SHAPES and CASTNESS
Pier Stanislao Paolucci – INFN and Atmel Roma – SHAPES HW Overview
mAgicV IP Architecture (Fully C programmable Gigaflops VLIW DSP)
[Block diagram labels:]
Multiple DSP Address Generation Unit: 4-address/cycle, 16 multi-field Address Register File
10-float ops/cycle
Data Register File: 8R+8W, 128x40
VLIW Program Memory (DPM): 2-port, 8Kx128-bit
Flow Controller, VLIW Decoder: Instruction Decoder, Condition Generation, Status Register, Program Counter, VLIW Decompressor
Data Memory System (DDM): 6-access/cycle, 2x8Kx40
AHB Slave, e.g. DMA Target
AHB Master DMA Engine
External signals: AHB MST, AHB SLV, RST, CLOCKS, IRQ OUT, IRQ IN, DBG
8
Atmel Roma vs SHAPES and CASTNESS
Pier Stanislao Paolucci – INFN and Atmel Roma – SHAPES HW Overview
Silicon Floorplan of Diopsis 940: RISC + mAgicV VLIW DSP
[Floorplan labels: DSP Reg File; DSP Data Mem (DDM); DSP Prog Mem (DPM); DSP Logic; ARM926; ARM RDM; Peripherals; AMBA Multilayer]
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 9
The tile: Diopsis + DNP
[Block diagram labels, grouped around the Multi-layer Bus MATRIX:]
RISC: ICE, JTAG, RCM Instr Cache, RCM Data Cache, MMU, RDM IF, BIU, Master ports (I, D)
mAgicV DSP™: JTAG; Multiple DSP Addr Gen, 4-addr/cycle; 10-float ops/cycle; Data Regs, 16-port 256x40; DPM, 2-port; DDM, 6-access/cycle; DSP AHB Master; DSP AHB Slave
DNP: DNP AHB Master (two), DNP AHB Slave; ports X+, X-, Y+, Y-, Z+, Z-, C+; NoC (NI)
Other blocks: ROM, RDM SRAM, PDMA, Bridge, APB, PERIPHERALS, DXM Interface (AHB EBI), DXM
10
Atmel Roma vs SHAPES and CASTNESS
Pier Stanislao Paolucci – INFN and Atmel Roma – SHAPES HW Overview
Tile Complexity
mAgicV DSP: 915 Kgates + 1 Mbit Prog Mem + 640 Kbit Data Mem
ARM926 & peripherals <2 equivalent Mgate (including 640 Kbit mem)
Tile Complexity 4230 equivalent Kgate + DNP gate count
- including on chip memories
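The memory sizes quoted here can be cross-checked against the mAgicV organisation given on the architecture slide (2-port 8Kx128-bit program memory, 2x8Kx40 data memory). The sketch below does that arithmetic, assuming K = 1024. It also derives the clock that 10 float ops/cycle would need for 1 Gigaflops; that clock value is an inference from the quoted figures, since the slides do not state the clock frequency. Function names are hypothetical.

```python
KBIT = 1024
MBIT = 1024 * KBIT


def memory_bits(words, width_bits, banks=1):
    """Total capacity in bits of `banks` memories of `words` x `width_bits`."""
    return banks * words * width_bits


# Organisation quoted on the mAgicV architecture slide (K taken as 1024).
DPM_BITS = memory_bits(8 * 1024, 128)           # 8Kx128-bit program memory
DDM_BITS = memory_bits(8 * 1024, 40, banks=2)   # 2x8Kx40 data memory


def implied_clock_hz(target_flops, ops_per_cycle):
    """Clock needed to reach target_flops at a given number of ops per cycle."""
    return target_flops / ops_per_cycle


if __name__ == "__main__":
    print(f"DPM: {DPM_BITS / MBIT:.0f} Mbit")
    print(f"DDM: {DDM_BITS / KBIT:.0f} Kbit")
    print(f"Clock for 1 GFlops at 10 ops/cycle: "
          f"{implied_clock_hz(1e9, 10) / 1e6:.0f} MHz")
```

Both memory figures agree with the 1 Mbit and 640 Kbit quoted on this slide.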
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 11
Spidergon topology
• It’s a family of regular/symmetric topologies
• We look for a complexity/performance trade-off
• Low degree (router cost)
• Low number of links (wire cost)
• Symmetry (homogeneous building blocks; simple routing)
• Low diameter (performance)
• Good scalability (small network size granularity)
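The trade-off listed above can be made concrete with a small graph model. The sketch assumes the usual description of Spidergon: an even number N of nodes on a bidirectional ring, each with one extra "across" link to the node N/2 positions away. The slide itself does not spell out the link structure, so treat that as an assumption. The code measures degree, link count and diameter by breadth-first search.

```python
from collections import deque


def spidergon(n):
    """Adjacency sets of an n-node Spidergon (assumed: ring + across links)."""
    if n < 4 or n % 2:
        raise ValueError("n must be an even number >= 4")
    return {
        i: {(i + 1) % n, (i - 1) % n, (i + n // 2) % n}
        for i in range(n)
    }


def link_count(adj):
    """Number of bidirectional links in the network."""
    return sum(len(neigh) for neigh in adj.values()) // 2


def diameter(adj):
    """Largest shortest-path distance between any two nodes (BFS from each)."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    net = spidergon(16)
    print("degree:", {len(v) for v in net.values()})
    print("links:", link_count(net))
    print("diameter:", diameter(net))
```

For comparison, a plain 16-node bidirectional ring has degree 2 and 16 links but diameter 8; the across links halve the diameter at the cost of one extra router port per node.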
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 12
STNoC key components
Network on Chip is a set of on-chip routers (up to layer 3), Network Interfaces (NI) (layer 4) and physical Link
[Figure: STNoC. IP blocks attach through an NI (Shell and Kernel) to a router; routers are joined by Phy Links]
13
Atmel Roma vs SHAPES and CASTNESS
Pier Stanislao Paolucci – INFN and Atmel Roma – SHAPES HW Overview
Industrial Interest for Multi Tile Systems-on-Chip
Large silicon area is needed for high Average Selling Price/unit:
Multiple tile -> the way to design chips of area >= 50 mm2 on future technologies
-> Industrial interest of semiconductor manufacturers in avoiding a low-profit, commodity-like industry
Embedded Systems versus Classical Computing:
$ and # of embedded processors per person increasing faster than conventional processors per person
# of (phones, games, PDAs, cars, home, medical, wearable) vs PC
Collision/convergence on architectures is going to happen:
Because of changes on key driving markets
Because full systems can be integrated on a chip
Because of deep submicron technological facts: WIRING, COMPLEXITY, POWER
This time, …we are not in 1980, when a simpler solution was achievable through higher clock rate and monolithic architectures…we need multi-processor parallelism.
European Tiles HW research background exists to help: for example, INFN solved the wiring problem on cubic meters using tiles + first neighbours 3D toroidal point-to-point physical links. Today we face the same problem also for on-chip communications.
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 14
HW Background: Istituto Nazionale Fisica Nucleare APE family of Massive Parallel custom Very Long Instruction Word Floating-Point Procs. + 3D first neighbour toroidal communication for Numerical Physics Simulations

                          APE            APE100         APEmille       apeNEXT
                          (1984-1988)    (1988-1993)    (1994-1999)    (2000-2005)
Architecture              SIMD           SIMD           SIMD           SIMD++
# nodes                   16             2048           2048           4096
Topology                  flexible 1D    rigid 3D       flexible 3D    flexible 3D
Aggregated memory         256 MB         8 GB           64 GB          1 TB
# registers (w.size)      64 (x32)       128 (x32)      512 (x32)      512 (x64)
LOW Clock frequency       8 MHz          25 MHz         66 MHz         200 MHz
Comp. Power/node          64 Mflops      50 Mflops      528 Mflops     1600 Mflops
Aggregated Comp. Power    1 GFlops       100 GFlops     1 TFlops       7 TFlops
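The aggregated computing power in the table should be roughly the number of nodes times the power per node. The sketch below cross-checks the quoted figures; the dictionary layout and function names are illustrative, and the numbers are copied from the table as quoted.

```python
# (nodes, Mflops per node, quoted aggregate in GFlops), as given in the table.
APE_FAMILY = {
    "APE":      (16,   64,   1),
    "APE100":   (2048, 50,   100),
    "APEmille": (2048, 528,  1000),
    "apeNEXT":  (4096, 1600, 7000),
}


def aggregate_gflops(nodes, mflops_per_node):
    """Aggregate peak power in GFlops, assuming all nodes run at peak."""
    return nodes * mflops_per_node / 1000.0


def relative_gap(name):
    """Fractional difference between computed and quoted aggregate power."""
    nodes, per_node, quoted = APE_FAMILY[name]
    return (aggregate_gflops(nodes, per_node) - quoted) / quoted


if __name__ == "__main__":
    for name, (nodes, per_node, quoted) in APE_FAMILY.items():
        computed = aggregate_gflops(nodes, per_node)
        print(f"{name:9s} computed {computed:8.1f} GFlops, "
              f"quoted {quoted} GFlops ({relative_gap(name):+.1%})")
```

The computed values land within about ten percent of the quoted ones, which is consistent with the table rounding its aggregate figures.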
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 15
APENext (2005) 2048 processor system, VLIW processors manufactured by ATMEL
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 16
APEmille (1999) – 1 TFlops
2048 VLSI processing nodes
SIMD, synchronous communications
Fully integrated ”Host computer”, 64 PCs cPCI based
Computing node
“Processing Board” (PB): 8 nodes, 4 GFlops
“Torre”: 32 PB, 128 GFlops
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 17
APE100 (1993) - 100 GFlops
PB (8 nodes) ~ 400 MFlops
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 18
…from INFN APE to Atmel Roma MPSoC tile…
1997- 2001
Spin-off from INFN and Creation of IPITEC start-up (Intellectual Property Initiative for Tools and Embedded Cores) – (P.S. Paolucci, B. Altieri)
2002-2004
mAgic VLIW DSP synthesizable core
IPITEC becomes ATMEL Roma Advanced DSP Products ATMEL
2003: Diopsis 740: 1 Gigaflops mAgic DSP + Arm7
2007: Diopsis 940: 1 Gigaflops mAgicV DSP (fully C programmable) + Arm9
2007-2009: Shapes Tile
Diopsis 740 tile: A gigaflops VLIW+RISC SoC Tile - HotChips 15 Conference – Stanford (2003)
…toward SHAPES tile
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 19
SW challenges from Tiled Architectures
Facilitate expression of parallelism: e.g. Network of Actors
Express real time constraints in a formal manner, a feature missing in classical languages. This is a key cultural point!!!
Avoid destroying information about available algorithm parallelism
Compilation chain must be fully aware of key architectural parameters: bandwidth, computational power, pipeline and latencies
Exploit memory locality – efficient management of Distributed Memories – get rid of classical caches
Manage long delays between distant tiles
Reduce Hot Spots in communications
Reduce RTOS overhead (time and memory footprint) - Networked RTOS? Minimal RTOS on each tile?
Introduce Hardware dependent Software and Hardware Abstraction Layers
Capture scalability in a library of characterized SW/HW components
Support for (semi)-automation of iterative design over HW, SW, Appl
Monitor quality and real-time constraints on real HW and Simulators
Simulation speed of multi-tiled architectures
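To make "Network of Actors" concrete, here is a minimal single-threaded sketch: each actor owns an input FIFO, fires on one token at a time, and forwards its result to the actors connected downstream. It is only an illustration of the programming style. The Actor class, the scheduler and the example pipeline are hypothetical and say nothing about the actual SHAPES programming model or tools.

```python
from collections import deque


class Actor:
    """A node that consumes one token per firing and may emit one token."""

    def __init__(self, name, fire):
        self.name = name
        self.fire = fire        # function: token -> token, or None for no output
        self.inbox = deque()    # FIFO channel feeding this actor
        self.outputs = []       # downstream actors

    def connect(self, other):
        """Send this actor's output to `other`; returns `other` for chaining."""
        self.outputs.append(other)
        return other

    def step(self):
        """Fire once if a token is available; report whether anything happened."""
        if not self.inbox:
            return False
        result = self.fire(self.inbox.popleft())
        if result is not None:
            for target in self.outputs:
                target.inbox.append(result)
        return True


def run(actors):
    """Round-robin scheduler: keep firing until no actor has pending tokens."""
    progress = True
    while progress:
        progress = False
        for actor in actors:
            if actor.step():
                progress = True


def demo():
    collected = []
    scale = Actor("scale", lambda x: 2 * x)
    offset = Actor("offset", lambda x: x + 1)
    sink = Actor("sink", lambda x: collected.append(x))  # append returns None
    scale.connect(offset).connect(sink)
    scale.inbox.extend([1, 2, 3])
    run([scale, offset, sink])
    return collected


if __name__ == "__main__":
    print(demo())
```

Because actors interact only through their channels, the same description could in principle be spread over several tiles, with each channel becoming an on-chip or off-chip transfer.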
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview20
Research Lines
System SW
• ETH Zurich - Distributed Operation Layer; manage application parallelism
• RWTH Aachen Univ. - Simulation of Heterogeneous Multi Proc. Systems
• TIMA Lab and THALES - Hardware dependent Software Layer and RTOS
• TARGET Compiler Tech. - Retargetable VLIW Compilers
System HW
• ATMEL Roma - Tile:
– Evolution of (Diopsis™: mAgicV VLIW DSP™ + RISC) + INFN DNP™
• INFN Roma – DNP™ Distributed Network Processor + 3D Toroidal Eng.:
– Evolution of APE Massive Parallel Processors
• STMicroelectronics + Univ. of Cagliari and Pisa:
– Evolution of Spidergon™ Packet Switching Network on Chip
Parallel application benchmarking
• Fraunhofer IDMT, IGD - Audio Wave Field Synthesis and Graphic Algorithm
• PIE, MedCom - Ultrasound scanner
• INFN - Physical Modelling
Scalable Software Hardware Architecture Platform for Embedded Systems (2006-2009)
FP6-IST – Future Emerging Technologies - Advanced Computer Architecture Project
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview21
SW Environment – Summary of Working Principles
Model Based Application Description
– Network of Interacting Components incl. non-functional constraints, analytical predictions and run-time profiling
Distributed Operation Layer
– Maps components on Processing and Networking Resources
– Stepwise approach to semi-automated mapping:
• By hand, assisted by simulation, run-time profiling and analytical models
• By algorithms for automated multi-objective randomised search
Target Applications
– Extensive inherent parallelism
– Optimised compilation on tiles and comms network
[Figure: SW tool flow. Labels: application specs; Model Compiler; component interaction, properties and constraints; component source code; Distributed Operation Layer; hardware platform specification; mapping information; Mapping; Memory mapping; Simulator; trace information; HdS Generator; HdS source code; Compiler; component binary; HdS binary; RTOS; OS services binary; glue binary; Link; Dispatch]
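The mapping step can be pictured with a toy randomised search. The sketch below is a deliberately simplified, single-objective stand-in for the multi-objective search named on the slide: it places one component per tile on a ring of tiles and repeatedly swaps two components, keeping a swap whenever the total traffic-weighted hop count does not get worse. The component names, the traffic table, the ring layout and the cost function are all hypothetical.

```python
import random


def ring_hops(n_tiles):
    """Return a function giving the shortest hop count on a ring of n tiles."""
    def hops(i, j):
        d = abs(i - j) % n_tiles
        return min(d, n_tiles - d)
    return hops


def comm_cost(mapping, traffic, hops):
    """Sum over component pairs of traffic volume times hop distance."""
    return sum(w * hops(mapping[a], mapping[b]) for (a, b), w in traffic.items())


def random_search(components, traffic, iterations=2000, seed=0):
    """Swap-based random search; one component per tile, tiles on a ring."""
    rng = random.Random(seed)           # fixed seed keeps the run repeatable
    hops = ring_hops(len(components))
    best = {c: i for i, c in enumerate(components)}   # initial placement
    best_cost = comm_cost(best, traffic, hops)
    for _ in range(iterations):
        candidate = dict(best)
        a, b = rng.sample(components, 2)
        candidate[a], candidate[b] = candidate[b], candidate[a]
        cost = comm_cost(candidate, traffic, hops)
        if cost <= best_cost:           # accept ties to keep exploring
            best, best_cost = candidate, cost
    return best, best_cost


# Hypothetical pipeline a -> b -> c -> d -> e -> f, listed in scrambled order.
COMPONENTS = ["a", "d", "b", "e", "c", "f"]
TRAFFIC = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
           ("d", "e"): 1, ("e", "f"): 1}

if __name__ == "__main__":
    mapping, cost = random_search(COMPONENTS, TRAFFIC)
    print(mapping, cost)
```

A real multi-objective search would track several costs at once (for example communication, load balance and memory use) and keep a set of non-dominated mappings instead of a single best one.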
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview22
Tiled HW Architecture
Communication Centric, not Processor Centric
Homogeneous SW interface for on-chip and off-chip scalable connection and I/O
3D first-neighbour Toroidal System Eng. (3DT) for Off-Chip communication
Virtual tunnelling on packet switching NoC (Network on Chip) and off-chip 3DT
Parallelism Aware System SW: Manage memory distribution
Explicit parallel programming/Network of Actors
[Figure labels: two groups of nodes numbered 0 to 15; an array of Tiles, each paired with OFF-CHIP MEM; one Tile expanded into RISC, DSP, DNP, Multi-Layer BUS, POT and DXM, with NoC and 3DT off-chip communication links; FPGA, ADC sensor, DAC actuator and ADC/DAC at the system boundary]