pier stanislao paolucci - atmel roma and infn roma - shapes hw overview 1 coordinator of the...

22
Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto Nazionale di Fisica Nucleare Roma – Italy 4 generations of APE Massive Parallel Processors based on custom VLIW DSP (part time) Technology Director and Founder ATMEL Roma, corporate design center of ATMEL for Advanced DSP 2 generations of RISC + custom mAgic VLIW DSP MPSoC SHAPES: a tiled scalable Software Hardware SHAPES: a tiled scalable Software Hardware Architecture Platform for Embedded Systems Architecture Platform for Embedded Systems - - HW overview HW overview P.S. Paolucci (INFN and Atmel Roma) for the SHAPES Consortium P.S. Paolucci (INFN and Atmel Roma) for the SHAPES Consortium

Upload: crystal-tate

Post on 17-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1

Coordinator of the European Project

Permanent Staff Researcher (part time)

Istituto Nazionale di Fisica Nucleare

Roma – Italy

4 generations of APE Massive Parallel Processors based on custom VLIW DSP

(part time) Technology Director and Founder

ATMEL Roma, corporate design center of ATMEL for Advanced DSP

2 generations of RISC + custom mAgic VLIW DSP MPSoC

SHAPES: a tiled scalable Software Hardware SHAPES: a tiled scalable Software Hardware Architecture Platform for Embedded SystemsArchitecture Platform for Embedded Systems

--HW overviewHW overview

P.S. Paolucci (INFN and Atmel Roma) for the SHAPES ConsortiumP.S. Paolucci (INFN and Atmel Roma) for the SHAPES Consortium

Page 2: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 2

Deep Sub-micron Architectures…Deep Sub-micron Architectures… ~160 MGate available on a 100 mm2 chip (45nm CMOS, 2008) Increasing GATES/CHIP Design Complexity Management:

embedded processors use a few million gates only, IP reuse possible and needed; WIRING threatens Moore’s law:

Wiring delay increases on new CMOS silicon generations

The full chip cannot be reached in a single clock cycle

Classic monolithic processor architectures do not scale

Locally Synchronous, Globally Asynchronous needed

Communication Centric SW and HW Architecture needed

… PROPOSED SOLUTION: … TILED ARCHITECTURE…BY SIMPLE GEOMETRIC DEMONSTRATION… IF CONSTANT LOGIC COMPLEXITY INSIDE EACH TILE… THEN (LENGTH OF INTRA-TILE WIRES SCALE DOWN AS THE TILE ITSELF… AND SHORT ~ FIRST NEIGHBOURS ON-CHIP AND OFF-CHIP INTER-TILE WIRES)

QUEST OF BEST TILE, ON-CHIP AND OFF-CHIP INTERCONNECT. BUT HOW TO PROGRAM? EXPLICIT PARALLEL PROGRAMMING PARADIGM, and CULTURE NEEDED

POWER DISSIPATION density approaching prohibitive values if higher clock speed used; much better Oper/Watt at moderate clock + parallelism (the human brain parallel architecture performs an excellent job at 50 HZ!... room for improvement)

Page 3: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 3

Tiled HW ArchitectureTiled HW Architecture

Communication Centric, not Processor Centric

Homogeneous SW interface for on-chip and off-chip scalable connection and I/O

3D first-neighbour Toroidal System Eng. (3DT) for Off-Chip communication

Virtual tunnelling on packed switching NoC (Network on Chip) and off-chip 3DT

Parallelism Aware System SW: Manage memory distribution, capture real time constraints

Explicit parallel programming/Network of Actors

0 1 2 3 4

5

6

7

89101112

15

14

13

0 1 2 3 4

5

6

7

89101112

15

14

13

FPGA

DAC actuator

Tile

Tile

Tile

Tile

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

Tile

Tile

Tile

Tile

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

Tile

Tile

Tile

Tile

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

Tile

Tile

Tile

Tile

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

DSPDNP

Multi-Layer BUS

NoC

RISC

POT

3DT Off-chipcommunication

DXM

Tile

ADC sensor

DAC actuator

ADC sensor

ADC/DAC

Page 4: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 4

HW GLOSSARYHW GLOSSARYAT THE CHIP LEVEL MTC: Multiple Tile Chip (composed of

multiple Tiles) NOC: Network On Chip (connecting Tiles) 3DT: 3 Dim Toroidal Connection (outside

the chip)

FUNDAMENTAL TYPE OF TILE RDT includes:

RISC: (includes on chip memories RDM and RPM) +

DSP(includes on chip memories DDM and DPM)

DNP + DXM (off-chip mem) + POT (e.g. DAC/ADC conv)

POSSIBLE TILE VARIANTS

(subset of RDT) RET := RDT minus DSP DET:= RDT minus RISC DDT:= DET minus DXM

INSIDE THE TILE Multilayer Bus Matrix sustains multiple

simultaneous transfers RISC max one per tile DSP one or more per tile DNP: Distributed Network Processor (always

one per tile) DDM: on-chip Distributed Data Mem (inside

the DSP) DPM: on-chip Distributed Progr Mem (inside

the DSP) DXM: Distributed eXternal Mem Interface

(max one per tile, outside the RISC and DSP) POT: Peripherals On Tile RDM: Risc (tightly coupled) Data Memory RPM: Risc (tightly coupled) Program Memory RCM: Risc Cache Memory

Page 5: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 5

Different Different Types Types

of of TilesTiles

DSP

DNP

Multi-Layer BUS

NoC

RISC POT

3DT

DXM

DNP

Multi-Layer BUS

NoC

RISC POT

3DT

DXM

DNP

Multi-Layer BUS

NoC

DSP POT

3DT

DXM

RDT: RISC + DSP Elementary Tile

RET: RISC Elementary Tile DET: DSP Elementary Tile

RDT

RET DET

DXM Mem Bus POT Pads

DXM Mem Bus POT Pads

Page 6: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 6

Distributed Network ProcessorDistributed Network ProcessorDNP: InterfaceDNP: Interface

DNP

BUS Master (to read from intra-tile memories)

BUS Slave (to receive commands from RISC & DSP)

3DT X+ (to forward/receive inter-tile off chip packets)

3DT X-

3DT Y+

3DT Y-

3DT Z+

3DT Z-

NoC (to forward/receive inter-tile on-chip packets)

BUS Master (simultaneous intra-tile memory write)

Collective communication

Page 7: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

7

Atmel Roma vs SHAPES and CASTNESS

Pier Stanislao Paolucci – INFN and Atmel Roma – SHAPES HW Overview

mAgicV IP Architecture (Fully C programmable Gigaflops VLIW DSP)

4-address/cycleMultiple DSP Address

GenerationUnit

16 multi-field Address Register File10-float

ops/cycle

8R+8W 128x40Data Register File

System

2-port, 8Kx128-bit, VLIW Program Memory(DPM)

Flow Controller, VLIW Decoder

Instruction Decoder

Condition Generation

Status Register

Program Counter

VLIW Decompressor

6-access/cycleData Memory

System 2x8Kx40

(DDM)

AHB Slave,

e.g.DMA

Target

AHB MST

AHB MasterDMA

Engine

AHB SLVRST, CLOCKSIRQ OUTIRQ INDBG

Page 8: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

8

Atmel Roma vs SHAPES and CASTNESS

Pier Stanislao Paolucci – INFN and Atmel Roma – SHAPES HW Overview

Silicon Floorplan

of Diopsis

940:

RISC + mAgicV

VLIW DSP

DSP Reg File

DSP DataMem(DDM)

DSP Prog Mem (DPM)

DSP Logic

ARM926ARMRDM

Peripherals

AMBA Multilayer

Page 9: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 9

The tile:The tile:

JTAGROM

Bridge

DXM Interface(AHB EBI)

RDM SRAM

PDMA

mAgicV DSPTM JTAG

DSPAHB

Master

4-addr/cycle

Multiple DSP Addr Gen

10-float ops/cycle

16-port 256x40

Data Regs

mAgicVTM DPM2-port

DDM6-access/

cycle

DSPAHB Slave

Slave

ICE

RISC

RCM Instr Cache MMU RCM Data CacheRDM IF BIUI D I D

Master

Multi-layerBus MATRIX

APBDNPAHB

Master

DNPAHB Slave

DNPAHB

Master

DNP

X+

DXM

X-

Y+

Y-

Z+

Z-

C+

NoC(NI)

P E R I P H E R A L S

Diopsis +Diopsis + DNPDNP

Page 10: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

10

Atmel Roma vs SHAPES and CASTNESS

Pier Stanislao Paolucci – INFN and Atmel Roma – SHAPES HW Overview

Tile Complexity

mAgicV DSP: 915 Kgates + 1 Mbit Prog Mem + 640 Kbit Data Mem

ARM926 & peripherals <2 equivalent Mgate (including 640 Kbit mem)

Tile Complexity 4230 equivalent Kgate + DNP gate count

- including on chip memories

Page 11: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 11

Spidergon topology• It’s a family of regular/symmetric topologies

• We look for a complexity/performance trade-off• Low degree (router cost)• Low number of links (wire cost) • Symmetry (homogeneous building blocks; simple routing)• Low diameter (performance)• Good scalability (small network size granularity)

Page 12: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 12

STNoC key components

Network on Chip is a set of on-chip routers (up to layer 3),

Network Interfaces (NI) (layer 4) and physical Link

NI

router

IP

IP

linkNI

KernelKernelShellShellIP Router Phy Link

Ph

yL

ink

Ph

yL

ink

Phy Link

STNoC

Page 13: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

13

Atmel Roma vs SHAPES and CASTNESS

Pier Stanislao Paolucci – INFN and Atmel Roma – SHAPES HW Overview

Industrial Interest for Multi Tile Systems-on-Chip

Large silicon area is needed for high Average Selling Price/unit: Multiple tile -> the way to design chips area >= 50mm2 on future technologies -> Industrial Interest of Semiconductor Manufacturers to avoid a low profit commodities-like industry

$ and # of embedded processors / persons increasing faster than conventional processors / persons

# of (phones, games, pdas, cars, home, medical, wearable) vs PC

Collision/convergence on architectures is going to happen: Because of changes on key driving markets Because full systems can be integrated on a chip Because of deep submicron technological facts:

- WIRING, COMPLEXITY, POWER

This time, …we are not in 1980, when a simpler solution was achievable through higher clock rate and monolithic architectures…we need multi-processor parallelism.

European Tiles HW Research background exists to help, for example INFN solved the wiring problem on cubic meters using tiles + first neighbours 3D toroidal point-to-point physical links. Today we have face the same problem also for on-chip communications

Embedded Systems versus

Classical Computing

Page 14: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 14

HW Background: Istituto Nazionale Fisica Nucleare APE HW Background: Istituto Nazionale Fisica Nucleare APE family of Massive Parallel custom Very Long family of Massive Parallel custom Very Long Instruction Word Floating-Point Procs. + 3D first neighbour Instruction Word Floating-Point Procs. + 3D first neighbour toroidal communication for Numerical Physics Simulationstoroidal communication for Numerical Physics Simulations

APE(1984-1988)

APE100(1988-1993)

APEmille(1994-1999)

apeNEXT(2000-2005)

Architecture SIMD SIMD SIMD SIMD++

# nodes 16 2048 2048 4096

Topology flexible 1D rigid 3D flexible 3D flexible 3D

Aggregated memory

256 MB 8 GB 64 GB 1 TB

# registers (w.size)

64 (x32) 128 (x32) 512 (x32) 512 (x64)

LOW Clock frequency

8 MHz 25 MHz 66 MHz 200 MHz

Comp. Power/node

64 Mflops 50 Mflops 528 Mflops 1600 Mflops

Aggregated Comp. Power

1 GFlops 100 GFlops 1 TFlops 7 TFlops

Page 15: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 15

APENext (2005) 2048 processor system, VLIW APENext (2005) 2048 processor system, VLIW processors manufactured by ATMELprocessors manufactured by ATMEL

Page 16: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 16

APEmille (1999) – 1 TFlopsAPEmille (1999) – 1 TFlops

2048 VLSI processing nodes

SIMD, synchronous communications

Fully integrated ”Host computer”, 64 PCs cPCI based

Computing node “Processing Board” (PB)

8 nodes, 4GFlops “Torre”

32 PB, 128GFlops

Page 17: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 17

APE100 (1993) - 100 GFlopsAPE100 (1993) - 100 GFlops

PB (8 nodes) ~ 400 MFlops

Page 18: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 18

……from INFN APE to Atmel from INFN APE to Atmel Roma MPSoC tile… Roma MPSoC tile…

1997- 2001

Spin-off from INFN and Creation of IPITEC start-up (Intellectual Property Initiative for Tools and Embedded Cores) – (P.S. Paolucci, B. Altieri)

2002-2004

mAgic VLIW DSP synthesizable core

IPITEC becomes ATMEL Roma Advanced DSP Products ATMEL

2003: Diopsis 740: 1 Gigaflops mAgic DSP + Arm7

2007: Diopsis 940: 1 Gigaflops mAgicV DSP (fully C programmable) + Arm9

2007-2009: Shapes Tile

Diopsis 740 tile: A gigaflops VLIW+RISC SoC Tile - HotChips 15 Conference – Stanford (2003)

……toward SHAPES tiletoward SHAPES tile

Page 19: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 19

SW challenges from SW challenges from Tiled ArchitecturesTiled Architectures

Facilitate expression of parallelism: e.g. Network of Actors Express real time constraints in a formal manner, feature missing in

classical languages. This is a key cultural point!!! Avoid destroying information about available algorithm parallelism Compilation chain must fully aware of key architectural parameters:

bandwidth, computational power, pipeline and latencies Exploit memory locality – efficient management of Distributed Memories –

get rid of classical chaches Manage Long delays between distant tiles Reduce Hot Spots in communications Reduce RTOS overhead (time and memory footprint) - Networked RTOS?

Minimal RTOS on each tile? Introduce Hardware dependent Software and Hardware Abstraction Layers Capture scalability in a library of characterized SW/HW components Support for (semi)-automation of iterative design over HW, SW, Appl Monitor quality and real-time constraints on real HW and Simulators Simulation speed of multi-tiled architectures

Page 20: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview20

Research Research LinesLines

System SW•ETH Zurich - Distributed Operation Layer; manage application parallelism•RWTH Aachen Univ. - Simulation of Heterogeneous Multi Proc. Systems•TIMA Lab and THALES - Hardware dependent Software Layer and RTOS•TARGET Compiler Tech. - Retargetable VLIW Compilers

System HW•ATMEL Roma - Tile:

– Evolution of (DiopsisTM: mAgicV VLIW DSPTM + RISC) + INFN DNPTM

•INFN Roma – DNPTM Distributed Network Processor + 3D Toroidal Eng.:– Evolution of APE Massive Parallel Processors

•STMicrolectronics + Univ. of Cagliari and Pisa – – Evolution of SpidergonTM Packet Switching Network on Chip

Parallel application benchmarking•Fraunhofer IDMT,IGD - Audio Wave Field Synthesis and Graphic Algorithm •PIE, MedCom - Ultrasound scanner•INFN - Physical Modelling

Scalable Software Hardware Architecture Scalable Software Hardware Architecture Platform for Embedded Systems (2006-2009)Platform for Embedded Systems (2006-2009)FP6-IST – Future Emerging Technologies-FP6-IST – Future Emerging Technologies-Advanced Computer Architecture ProjectAdvanced Computer Architecture Project

Page 21: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview21

SW Environment – Summary of Working PrinciplesSW Environment – Summary of Working Principles

Model Based Application Description

– Network of Interacting Componentsincl. non-functional constraints, analytical predictions and run-time profiling

Distributed Operation Layer

– Maps components on Processing and Networking Resources

– Stepwise approach to semi-automated mapping:

• By hand, assisted by simulation, run-time profiling and analytical models

• By algorithms for automated multi-objective randomised search

Target Applications

– Extensive inherent parallelism Optimised compilation on tiles and comms network

Distributed Operation Layer

hardware platform specification

Simulator

trace information

Model Compiler

component interaction,

properties and constraints

componentsource code

mappinginformation

HdS Generator

HdSsource code

Compiler

componentbinary

HdSbinary

LinkDispatch

OS servicesbinary

gluebinary

Mapping

Memory mapping

RTOS

applicationspecs

Page 22: Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview 1 Coordinator of the European Project Permanent Staff Researcher (part time) Istituto

Pier Stanislao Paolucci - Atmel Roma and INFN Roma - SHAPES HW Overview22

Tiled HW ArchitectureTiled HW Architecture

Communication Centric, not Processor Centric

Homogeneous SW interface for on-chip and off-chip scalable connection and I/O

3D first-heighbour Toroidal System Eng. (3DT) for Off-Chip communication

Virtual tunnelling on packed switching NoC (Network on Chip) and off-chip 3DT

Parallelism Aware System SW: Manage memory distribution

Explicit parallel programming/Network of Actors

0 1 2 3 4

5

6

7

89101112

15

14

13

0 1 2 3 4

5

6

7

89101112

15

14

13

FPGA

DAC actuator

Tile

Tile

Tile

Tile

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

Tile

Tile

Tile

Tile

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

Tile

Tile

Tile

Tile

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

Tile

Tile

Tile

Tile

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

OFF-CHIP MEM

DSPDNP

Multi-Layer BUS

NoC

RISC

POT

3DT Off-chipcommunication

DXM

Tile

ADC sensor

DAC actuator

ADC sensor

ADC/DAC