A RISC Architecture Extended by an Efficient Tightly Coupled Reconfigurable Unit

Nikolaos Vassiliadis, N. Kavvadias, G. Theodoridis, S. Nikolaidis

Section of Electronics and Computers, Department of Physics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

[email protected]

Algarve, Portugal, February 22-23, 2005


Page 1: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Nikolaos Vassiliadis, N. Kavvadias, G. Theodoridis, S. Nikolaidis

Section of Electronics and Computers, Department of Physics,

Aristotle University of Thessaloniki,

54124 Thessaloniki, Greece

[email protected]@skiathos.physics.auth.gr

Algarve, Portugal, February 22-23, 2005

Page 2: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Outline

Motivations

Proposed Architecture

Software Development Environment

Demonstration

Results

Conclusions

Page 3: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Motivations: Quest for Performance and Flexibility

A large portion of the computational complexity is concentrated in small kernels covering small parts of the overall code
    => Performance is improved by accelerating these kernels

Many algorithms show relevant Instruction Level Parallelism (ILP)
    => Performance is improved by parallel execution

Traditional processors have computation clock slack
    => Performance is improved by chaining operations (Spatial Computation)

Extending Embedded Processors with Application-Specific Function Units

Reconfigurable Instruction Set Processors for Performance with Maximum Flexibility

Page 4: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Proposed Architecture

Reconfigurable Instruction Set Processor (RISP)

Core Processor
    32-bit load/store RISC architecture
    5 Pipeline Stages
    Single Issue Elaboration

Reconfigurable Logic Coupling
    Reconfigurable Function Unit (RFU) approach => Low Communication Overhead
    Tightly Coupled => RFU Fits in two RISC pipeline stages => Better Utilization of the Pipeline Stages

RFU
    1-D Array of Coarse Grain Processing Elements (PEs)
    PE Functionality Configurable at Design Time to meet Application requirements
    Exploits Instruction Level Parallelism: Spatial & Temporal Computation

Page 5: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

[Figure: block diagram of the core processor pipeline and the RFU. Core blocks: Control Logic, Register File, ALU, MUX, Multiplier, Shifter, Data Memory, Pipeline Registers. RFU blocks: Core / RFU Interface, Processing & Interconnect Layers, Configuration Layer. Signals: I_DATA_INBUS, Re, OpCode, Operands, Control Signals, Status Signals, Configuration Bits, 1st Stage Result, 2nd Stage Result, Write Back Data.]

Proposed Architecture

Core Processor
    Commonly Used Function Units
    Control Logic Properly Extended to Handle Reconfigurable Instructions
    4-Read-1-Write Register File

Core / RFU Interface
    Receives & Delivers Control and Data Signals

Tightly Coupled RFU
    Configuration, Processing and Interconnection Layers
    Operates & Delivers Results in two Concurrent Pipeline Stages

Page 6: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Standard and Reconfigurable Instructions

Re=‘0’ => Standard Instruction
    Control Logic: Configure Core Datapath
    Operands: Source 1-2 & Destination
    ReOpCode = “nop”

Re=‘1’ => Reconfigurable Instruction
    Control Logic: Configure Interface
    Operands: Source 1-4 & Destination
    ReOpCode = “OpCode”

Three Types of Reconfigurable Instructions
    Complex Computational Operations
    Complex Addressing Modes
    Complex Control Flow Operations

Each Instruction can be multicycle

32-Bit Instruction Word Format:

Re | OpCode | Source 1 | Source 2 | Destination | Source 3 | Source 4
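The field order of the 32-bit instruction word can be illustrated with a small encoder/decoder. This is only a sketch: the slide names the fields but not their widths, so the widths below (1-bit Re, 6-bit OpCode, five 5-bit register fields, most significant field first) are assumptions chosen so that they add up to 32 bits, and the names `Instruction`, `encode`, `decode` and `target` are hypothetical.

```python
from dataclasses import dataclass, astuple

# ASSUMED widths (not given on the slide), most significant field first:
# Re, OpCode, Source 1, Source 2, Destination, Source 3, Source 4
FIELD_WIDTHS = (1, 6, 5, 5, 5, 5, 5)
assert sum(FIELD_WIDTHS) == 32


@dataclass(frozen=True)
class Instruction:
    re: int       # 0 = standard instruction, 1 = reconfigurable instruction
    opcode: int
    src1: int
    src2: int
    dst: int
    src3: int     # only meaningful when re == 1
    src4: int     # only meaningful when re == 1


def encode(ins: Instruction) -> int:
    """Pack the fields into one 32-bit word, first field in the top bits."""
    word = 0
    for value, width in zip(astuple(ins), FIELD_WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word: int) -> Instruction:
    """Unpack a 32-bit word by peeling fields off the low end."""
    fields = []
    for width in reversed(FIELD_WIDTHS):
        fields.append(word & ((1 << width) - 1))
        word >>= width
    return Instruction(*reversed(fields))


def target(ins: Instruction) -> str:
    """The Re bit decides whether the core datapath or the RFU executes."""
    return "rfu" if ins.re == 1 else "core"
```

With these assumed widths, a reconfigurable instruction can name four source registers and one destination, which matches the 4-Read-1-Write register file of the core.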

Page 7: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Reconfigurable Function Unit (RFU)

Embedded RFU for Dynamic Extension of the Instruction Set

Executes Multiple-Input-Single-Output (MISO) Reconfigurable Instructions

1-D Array of Coarse Grain Reconfigurable Blocks

Comprised of Three Layers
    Processing Layer
    Interconnection Layer
    Configuration Layer

Page 8: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

RFU-Processing Layer

PE Basic Structure
    Configurable PE functionality for the targeted application
    Unregistered Output => Spatial Computation
    Registered Output => Temporal Computation
    Floating PEs => Can operate in both core pipeline stages on demand
    Local Memory for Read Only Values
    Execute Long Chains of Operations in one processor cycle

[Figure: PE basic structure. Inputs: Operand1, Operand2, Function Sel, Spatial-Temporal Sel. A MUX selects between the direct PE output and the PE Register to produce the Result.]
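The difference between the unregistered (spatial) and registered (temporal) PE output can be sketched with a small behavioural model. This is an illustration under assumptions, not the authors' VHDL: the function names in `FUNCTIONS`, the class `ProcessingElement` and its `step` method are hypothetical, and the 32-bit width is taken from the granularity reported in the synthesis slide.

```python
import operator

MASK = 0xFFFFFFFF  # 32-bit granularity

# Hypothetical function set; the real PE functionality is fixed at design time.
FUNCTIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "shl": lambda a, b: a << (b & 31),
}


class ProcessingElement:
    def __init__(self, function: str, temporal: bool):
        self.function = FUNCTIONS[function]  # "Function Sel"
        self.temporal = temporal             # "Spatial-Temporal Sel"
        self.register = 0                    # PE register

    def step(self, op1: int, op2: int) -> int:
        """Evaluate one clock cycle and return the value seen on Result."""
        value = self.function(op1, op2) & MASK
        if self.temporal:
            # Registered output: Result shows the previous value,
            # the new value becomes visible in the next cycle.
            out, self.register = self.register, value
            return out
        # Unregistered output: Result is available in the same cycle,
        # so it can be chained into another PE (spatial computation).
        return value


# Chaining two spatial PEs within one cycle: (2 + 3) << 1
adder = ProcessingElement("add", temporal=False)
shifter = ProcessingElement("shl", temporal=False)
chained = shifter.step(adder.step(2, 3), 1)
```

In this model a chain of spatial PEs resolves within a single `step`, while a temporal PE inserts one cycle of delay, which is how a result can be handed from the first to the second pipeline stage.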

Page 9: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

RFU-Interconnection Layer

1-D Array of PEs

Operands from Register File

Constant Values from Local Memory

Input Network

Operand Select

Output Network => Delivers Results to corresponding pipeline stages

[Figure: interconnection layer. The Input Network routes Operands and Constants as 1st Stage Operands and 2nd Stage Operands; each PE Basic Structure has an Operand Select producing Operand1 and Operand2 and a PE Result; a Feedback Network returns PE results to the inputs; the Output Network delivers the 1st Stage Result and 2nd Stage Result.]

Page 10: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

RFU-Configuration Layer

Configuration Bits Local Storage Structure

Multi-Context Configuration Layer

Coarse Grain => Small Number of Configuration Bits => Negligible Overhead to Download new Contexts

[Figure: configuration layer. An External Configuration Memory feeds a Configuration Controller, which fills the Configuration Bits Local Storage (Configuration 0, Configuration 1, Configuration 2, Configuration 3) and outputs the Configuration Bits.]
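The idea of a multi-context configuration layer can be sketched as a small local store in front of an external configuration memory: selecting a resident context costs nothing, and only a miss triggers a download. The class `ConfigurationLayer`, its `select` method and the first-in-first-out replacement policy are assumptions made for this sketch; the slides do not state how contexts are replaced.

```python
class ConfigurationLayer:
    """Behavioural sketch of multi-context configuration storage."""

    def __init__(self, external_memory: dict, slots: int = 4):
        self.external_memory = external_memory  # context id -> configuration bits
        self.slots = slots                      # number of local contexts
        self.local = {}                         # resident contexts (insertion ordered)
        self.downloads = 0                      # downloads from external memory

    def select(self, context_id):
        """Return the configuration bits for a context, downloading on a miss."""
        if context_id not in self.local:
            if len(self.local) >= self.slots:
                # ASSUMED policy: evict the context that was loaded first.
                oldest = next(iter(self.local))
                del self.local[oldest]
            self.local[context_id] = self.external_memory[context_id]
            self.downloads += 1
        return self.local[context_id]


# Example: six contexts in external memory, two local slots.
external = {i: i * 10 for i in range(6)}
layer = ConfigurationLayer(external, slots=2)
for ctx in (0, 1, 0, 1, 0):
    layer.select(ctx)
```

In the example only the first use of each context causes a download; the following switches between contexts 0 and 1 are served from local storage.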

Page 11: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Architecture Synthesis & Evaluation

A Hardware Model (VHDL) was Designed for Evaluation Purposes

Configuration                          Value
Granularity                            32 bits
Number of Processing Elements          8
Processing Elements Functionality      ALU, Shifter, Multiplier
Configuration Contexts                 16 words of 134 bits
Local Memory Size                      8 constants of 32 bits
Number of Provided Local Operands      4

Component                              Area (mm2)
Processor Core                         0.134
RFU Processing Layer                   0.186
RFU Interconnection Layer              0.125
RFU Configuration Layer                0.137
RFU Total                              0.448

The Model was Synthesized with STM 0.13um Process

The RFU Area Overhead is 3.3x the Area of the Core Processor

No Caches were taken into account

No Overhead to Core Critical Path
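The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers from the slide; the variable names are illustrative.

```python
# Areas in mm^2, as reported for the STM 0.13um synthesis.
core_area = 0.134
rfu_layers = {
    "processing": 0.186,
    "interconnection": 0.125,
    "configuration": 0.137,
}

rfu_total = sum(rfu_layers.values())   # should match the reported 0.448
area_ratio = rfu_total / core_area     # reported as 3.3x

# Configuration storage: 16 contexts of 134 bits each.
config_bits = 16 * 134
config_bytes = config_bits / 8

print(f"RFU total area: {rfu_total:.3f} mm^2")
print(f"RFU / core area ratio: {area_ratio:.1f}x")
print(f"Configuration storage: {config_bits} bits ({config_bytes:.0f} bytes)")
```

The three layer areas add up to the reported RFU total, the ratio rounds to the quoted 3.3x, and the full set of contexts amounts to 2144 bits, which supports the claim that coarse-grain contexts are cheap to store and download.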

Page 12: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Software Development Environment

[Figure: compilation flow]

Application Code (C)
    => Front-End Compilation (MachSUIF + Machine Independent Optimizations)
Application CDFG (SUIFvm)
    => Application Analysis (Static: Count Instructions; Dynamic: Estimate Frequencies)
Weighted Application CDFG (SUIFvm)
    => Instruction Generation (Instruction Generation: MaxMISO; Instruction Selection: Max Gain)
Application CDFG (SUIFvm + Instruction Extensions)
    => Mapping + Code Generation (RFU: MaxMISO Mapping; Core: Code Generation)
Executable Code
    => Simulation-Profiling (Revises for Finer Results)
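The instruction generation step relies on MaxMISO identification, that is, partitioning a data-flow graph into maximal multiple-input-single-output subgraphs. The sketch below shows one common way to do this on a toy graph; it is not the authors' MachSUIF pass. The graph encoding (a dictionary from each operation to the operations that feed it, with every operation present as a key) and the function name `max_misos` are assumptions, and values that are live outside the graph are ignored for brevity.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def max_misos(preds: dict) -> list:
    """Partition a DAG into MaxMISOs.

    preds maps each operation node to the operation nodes producing its
    inputs. Register and constant inputs are not nodes.
    """
    # Build the successor sets (who consumes each node's result).
    succs = {node: set() for node in preds}
    for node, inputs in preds.items():
        for p in inputs:
            succs[p].add(node)

    # Visit consumers before producers, so every sink starts a MISO.
    order = list(TopologicalSorter(preds).static_order())
    assigned = set()
    result = []
    for root in reversed(order):
        if root in assigned:
            continue
        miso = {root}
        worklist = [root]
        while worklist:
            node = worklist.pop()
            for p in preds[node]:
                # A producer joins only if ALL its consumers are already
                # inside, which keeps the subgraph single-output.
                if p not in miso and p not in assigned and succs[p] <= miso:
                    miso.add(p)
                    worklist.append(p)
        assigned |= miso
        result.append(miso)
    return result


# Toy graph: "a" feeds both "b" and "c", so it cannot be absorbed by either.
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"]}
partitions = max_misos(graph)
```

For the toy graph the result is three MaxMISOs: {b, d}, {c, e} and {a} on its own, because the result of "a" is consumed by two different subgraphs. A selection step such as the Max Gain criterion named on the slide would then rank these candidates using the profiling weights.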

Page 13: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Demonstration-RFU Elaboration

Largest MaxMISO for a Quantization Kernel

Execution on the Core => six cycles

Execution on the Core+RFU => one cycle

Performance Improvements

Reduced Instruction Memory Accesses

[Figure: data-flow graph of the MaxMISO with nodes ADD, SUB, NEG, SHIFT, NEG, SHIFT and register / constant inputs. Annotations: Map to PEs, ILP + Spatial Computation, 1st Execution Stage; Map to PEs, ILP + Spatial Computation, 2nd Execution Stage; Temporal Computation; Deliver Result in 2nd Pipeline Stage.]

Page 14: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Results

Speed-Ups for Several Kernels: Core vs. Core+RFU

Kernel      CRC    FIR    FFT    QUANT  VLC
Speed-Up    1.6x   1.8x   2.8x   1.9x   1.7x

[Figure: bar chart of Normalized Energy Consumption (scale 0 to 1) for Core and Core+RFU on the CRC, FIR, FFT, QUANT and VLC kernels.]

Energy Consumption Dominated by Memory Accesses
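The reported speed-ups can be summarized with a short calculation. The values are those listed on the slide; the summary statistics (range and arithmetic mean) are derived here and do not appear in the original presentation.

```python
# Speed-ups of Core+RFU over the Core alone, as reported per kernel.
speedups = {"CRC": 1.6, "FIR": 1.8, "FFT": 2.8, "QUANT": 1.9, "VLC": 1.7}

best_kernel = max(speedups, key=speedups.get)
worst_kernel = min(speedups, key=speedups.get)
mean_speedup = sum(speedups.values()) / len(speedups)

print(f"Best:  {best_kernel} ({speedups[best_kernel]}x)")
print(f"Worst: {worst_kernel} ({speedups[worst_kernel]}x)")
print(f"Arithmetic mean: {mean_speedup:.2f}x")
```

The speed-ups range from 1.6x (CRC) to 2.8x (FFT), with an arithmetic mean of about 1.96x over the five kernels.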

Page 15: A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Conclusions

A RISC Processor Enhanced by a Run-Time Reconfigurable Function Unit

1-D Reconfigurable Array of Coarse Grain Processing Elements

Multiple-Input-Single-Output Reconfigurable Instructions

Specific Software Development Environment

Low Cost Performance and Energy Consumption Improvements

Next Step => Expand to VLIW Elaboration to Boost Achieved Speed-Ups