Co-Design of Programmable Exposed Datapath Processors In 2nd Tensilica Day, Institute of Microelectronic Systems, University of Hannover. February 16, 2017. Dr. Pekka Jääskeläinen [email protected] Academy of Finland postdoctoral researcher Customized Parallel Computing group http://cpc.cs.tut.fi Faculty of Computing and Electrical Engineering Tampere University of Technology, Finland


Co-Design of Programmable Exposed Datapath Processors

In 2nd Tensilica Day, Institute of Microelectronic Systems, University of Hannover. February 16, 2017.

Dr. Pekka Jääskeläinen

[email protected]

Academy of Finland postdoctoral researcher

Customized Parallel Computing group

http://cpc.cs.tut.fi

Faculty of Computing and Electrical Engineering

Tampere University of Technology, Finland

Presentation Outline

Transport Triggered Architectures (TTA): Exposing the datapath to the programmer

TTA-Based Co-Design Environment (TCE): A toolset for custom TTA design and programming

Examples of recent TCE design cases

Questions and answers

17.2.2017 2

TRANSPORT TRIGGERED ARCHITECTURES

Exposed/Explicit Datapath

• The TTA programming model is very fine-grained: the programmer controls the datapath operand/result transfers explicitly

• Other academic and commercial architectures with "exposed datapath" features:

– FlexCore, Stanford ELM, Static Pipelining, ARM Mali, some AMD GPUs, the Mill/Belt

• The main benefit is usually reduced register file (RF) pressure and simple control hardware

– Fewer RF ports needed, less energy wasted on RF accesses

– Excel in "number crunching" scenarios (VLIW)

Brief History of Transport Triggered Architectures

• Similar/related ideas date back to the 1950s

– English Electric DEUCE (1953), GRI Computer GRI 909 (1969), New England Digital’s ABLE (1973), Tabak & Lipovski: CMOVE (1976)

• Studied at TU Delft and Eindhoven (since the late 1980s)

– Using TTAs for scalable VLIW designs was proposed by Prof. Henk Corporaal et al.

• Ongoing research in our group since the early 2000s

– Builds on Corporaal’s TTA research

– Focus on using TTAs for low-power, high-performance programmable designs, exploiting data- and thread-level parallelism in addition to instruction-level parallelism

– DSP/SDR design case studies

Traditional VLIW

[Figure: block diagram of a traditional VLIW processor. Labeled blocks: Instruction Memory, Instruction Fetch, Instruction Decode, Register File, Bypassing Network, FU-1 to FU-5, Data Memory.]

Transport Triggered Architecture

[Figure: block diagram of a TTA processor. Labeled blocks: Instruction Memory, Instruction Fetch, Instruction Decode, Bypassing Network, FU-1 to FU-5, Register File, Data Memory.]

An Example TTA Datapath

[Figure: an example TTA datapath. Labeled components: two Integer ALUs, FPU, Boolean RF, Float RF, Integer RF, two Load/Store Units, Immediate Unit, Control Unit, Instruction Memory, Data Memory. Legend: Socket, Bus, Connection.]

Reducing the RF Pressure with TTA Code Optimizations

• Operand/result transport freedom

– The programmer can control when the RF ports are accessed (there is often "fine-grained slack" in the program)

• Software (register file) bypassing

– The bypass network is controlled by the programmer: data can move directly between FUs without going through the RF

• Dead result move elimination

– Often all uses of a value can be bypassed, which eliminates the use of the RF for that value completely

• Operand sharing

– If successive operations in the same FU use the same operand, there is no need to read it again
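
The effect of software bypassing and dead result move elimination on RF traffic can be sketched on straight-line move code. The following is a minimal Python illustration under stated assumptions, not the way tcecc implements these optimizations: the FU names add and sub, the port names o1, o2 and r, the choice of o2 as the trigger port, and the register naming are all hypothetical, and scheduling, latencies and bus connectivity are ignored.

```python
def is_reg(name):
    """Register names are assumed to look like r0, r1, ... (illustrative convention)."""
    return name.startswith("r") and name[1:].isdigit()


def software_bypass(moves):
    """Redirect RF reads to the FU result port that still holds the same value."""
    alias = {}  # register -> FU result port currently holding the same value
    out = []
    for src, dst in moves:
        src = alias.get(src, src)  # read the FU result port instead of the RF
        if dst.endswith(".o2"):
            # Assumption: a move into o2 triggers the FU and replaces its old result.
            result_port = dst.split(".")[0] + ".r"
            alias = {reg: port for reg, port in alias.items() if port != result_port}
        if is_reg(dst):
            alias.pop(dst, None)  # the register is overwritten
            if src.endswith(".r"):
                alias[dst] = src  # the register now mirrors this result port
        out.append((src, dst))
    return out


def eliminate_dead_result_moves(moves, live_out):
    """Drop RF writes whose value is never read from the RF afterwards."""
    live = set(live_out)
    kept = []
    for src, dst in reversed(moves):
        if is_reg(dst):
            if dst not in live:
                continue  # no later RF read of this value: the write is dead
            live.discard(dst)
        if is_reg(src):
            live.add(src)
        kept.append((src, dst))
    kept.reverse()
    return kept


def rf_accesses(moves):
    """Return (RF reads, RF writes) for a list of (source, destination) moves."""
    reads = sum(is_reg(src) for src, _ in moves)
    writes = sum(is_reg(dst) for _, dst in moves)
    return reads, writes


# r3 = r1 + r2; r3 = r3 - 99, with every value passing through the RF.
naive = [
    ("r1", "add.o1"), ("r2", "add.o2"), ("add.r", "r3"),
    ("r3", "sub.o1"), ("99", "sub.o2"), ("sub.r", "r3"),
]
optimized = eliminate_dead_result_moves(software_bypass(naive), live_out={"r3"})

print(rf_accesses(naive))      # (3, 2)
print(rf_accesses(optimized))  # (2, 1)
```

In this toy case the intermediate sum never touches the RF: one read and one write disappear, which is the kind of saving that lets a design get by with fewer RF ports.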

Software Bypassing

Trad. processor:

add r3,r1,r2
sub r3,r3,99

TTA:

r1 -> add.o1, r2 -> add.o2
add.r -> sub.o1, 99 -> sub.o2
sub.r -> r3

[Figure: datapath with an integer RF, an immediate unit, two integer ALUs and a load/store unit.]
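
To make the transport-triggered execution model concrete, here is a minimal Python sketch of an interpreter for move code of this form. It is an illustration only: the two FUs named add and sub, the use of o2 as the trigger port, and the one-cycle latency are assumptions made for the sketch, not properties of any particular TTA.

```python
import operator

# Hypothetical machine: each FU has an operand port o1, a trigger port o2 and a result port r.
FUS = {"add": operator.add, "sub": operator.sub}


def read(src, regs, ports):
    """A move source is an integer immediate, a register, or an FU result port."""
    if isinstance(src, int):
        return src
    if src in regs:
        return regs[src]
    return ports[src]


def run(program, regs):
    """Execute a list of instructions; each instruction is a list of parallel moves."""
    regs = dict(regs)
    ports = {}
    for instruction in program:
        # All moves of one instruction read their sources in the same cycle.
        transfers = [(read(src, regs, ports), dst) for src, dst in instruction]
        triggered = []
        for value, dst in transfers:
            if "." in dst:
                ports[dst] = value
                fu, port = dst.split(".")
                if port == "o2":  # assumption: writing o2 starts the operation
                    triggered.append(fu)
            else:
                regs[dst] = value
        # Assumed one-cycle latency: results become visible to the next instruction.
        for fu in triggered:
            ports[fu + ".r"] = FUS[fu](ports[fu + ".o1"], ports[fu + ".o2"])
    return regs


program = [
    [("r1", "add.o1"), ("r2", "add.o2")],
    [("add.r", "sub.o1"), (99, "sub.o2")],
    [("sub.r", "r3")],
]
result = run(program, {"r1": 10, "r2": 5, "r3": 0})
print(result["r3"])  # -84, i.e. 10 + 5 - 99
```

There is no add or sub instruction in the program: the operations happen as a side effect of moving data into the trigger ports, and the intermediate sum goes straight from add.r to sub.o1 without visiting the RF.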

TTA-BASED CO-DESIGN ENVIRONMENT (TCE)

TTA-Based Co-Design Environment (TCE)

• The core tool is tcecc: a retargetable Clang/LLVM-based compiler

• MIT-licensed community version available at http://tce.cs.tut.fi

• Used for several custom processor design cases

– Video compression, computer vision, software defined radio, audio...

– Mostly academic case studies, but also commercial cores

• A research platform for customized processors, compilation techniques, etc.

TCE Screenshots

http://tce.cs.tut.fi

EXAMPLES OF TCE DESIGN CASES

Example TCE design case by University of Turku:

A 5.3 pJ/op Approximate TTA VLIW Tailored for Machine Learning

• Minimum energy point at 0.35 V, a near-threshold operating voltage, for ultra-low-power execution

• Features for approximate computing

– Detect errors in computation, replace with safe values

• Manufactured on 28 nm FDSOI

• About 320 µW (incl. memories) on ML workloads

• Published in Elsevier Microelectronics Journal 61 (2017) 106–113

Example TCEMC design case by Leibniz Universität Hannover:

Customized High Performance Low Power Processor for Binaural Speaker Localization

• Custom DSP targeted at hearing aid devices with support for advanced algorithms:

– Very low power consumption, high computational performance, small form factor

• 32 x int32 SIMD (1024b) datapath

• Synthesized on 28 nm FDSOI

• 12 mW at 50 MHz, 1 V

• 2-split SIMD RF, 1 write port each

– The RF accounts for only 10.5% of total power thanks to software bypassing and dead result move elimination (DRE)

• Published in 2016 IEEE International Conference on Electronics, Circuits and Systems (ICECS)
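
The figures on this slide can be combined into a rough energy budget. The short Python calculation below is back-of-the-envelope arithmetic on the slide's numbers, not a result from the paper. It assumes that all 32 SIMD lanes do useful work every cycle (a best case) and that the 10.5% RF share refers to the same 12 mW operating point.

```python
power_w = 12e-3    # total power: 12 mW (from the slide)
clock_hz = 50e6    # clock frequency: 50 MHz (from the slide)
simd_lanes = 32    # 32 x int32 SIMD datapath (from the slide)
rf_share = 0.105   # RF share of total power (from the slide)

# Energy spent per clock cycle by the whole processor.
energy_per_cycle_j = power_w / clock_hz

# Assumption: every lane performs one useful int32 operation per cycle.
energy_per_lane_cycle_j = energy_per_cycle_j / simd_lanes

# Assumption: the 10.5% share applies to the 12 mW total.
rf_power_w = rf_share * power_w

print(f"{energy_per_cycle_j * 1e12:.1f} pJ per cycle")            # 240.0 pJ per cycle
print(f"{energy_per_lane_cycle_j * 1e12:.1f} pJ per lane-cycle")  # 7.5 pJ per lane-cycle
print(f"{rf_power_w * 1e3:.2f} mW in the register file")          # 1.26 mW in the register file
```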

Summary

• TTA: a modular exposed datapath architecture for data-oriented applications

• TCE: a processor design and programming toolset based on TTA

– An open source version is available

– Various design experiments done, from high-performance multicores to ultra-low-power case studies

Thanks!

Questions?

Pekka Jääskeläinen

[email protected]

Customized Parallel Computing group

http://cpc.cs.tut.fi