TRANSCRIPT
Towards Optimal Custom Instruction Processors
Wayne Luk, Kubilay Atasu, Rob Dimond and Oskar Mencer
Department of Computing, Imperial College London
HOT CHIPS 18
Overview
1. background: extensible processors
2. design flow: C to custom processor silicon
3. instruction selection: bandwidth/area constraints
4. application-specific processor synthesis
5. results: 3x area delay product reduction
6. current and future work + summary
1. Instruction-set extensible processors
● base processor + custom logic
– partition data-flow graphs into custom instructions
[Diagram: base datapath with Register File and ALU, data in/out, extended with custom logic]
Previous work
● many techniques, e.g.
– Atasu et al. (DAC 03)
– Goodwin and Petkov (CASES 03)
– Clark et al. (MICRO 03, HOT CHIPS 04)
● current challenges
– optimality and robustness of heuristics
– complete tool chain: application to silicon
– research infrastructure for custom processor design
2. Custom processor research at Imperial
● focus on effective optimization techniques
– e.g. Integer Linear Programming (ILP)
● complete tool-chain
– high-level descriptions to custom processor silicon
● open infrastructure for research in
– custom processor synthesis
– automatic customization techniques
● current tools
– optimizing compiler (Trimaran) for custom CPUs
– custom processor synthesis tool
Application to custom processor flow
Application Source (C) → Template Generation → Template Selection (under an Area Constraint) → Generate Custom Unit / Generate Base CPU → Processor Description → ASIC Tools → Area, Timing
Custom instruction model
[Diagram: the custom unit reads the Register File through input ports into Input Registers, computes through Pipeline Registers, and writes back through Output Registers and output ports]
3. Optimal instruction identification
● minimize schedule length of program data flow graphs (DFGs)
● subject to constraints
– convexity: ensure feasible schedules
– fixed processor critical path: pipeline for multi-cycle instructions
– fixed data bandwidth: limited by register file ports
● steps: based on Integer Linear Programming (ILP)
a. template generation
b. template selection
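The convexity constraint above can be checked directly on a small graph. A minimal sketch, using brute-force reachability rather than the talk's ILP encoding; the toy DFG and node names are purely illustrative:

```python
# Convexity check for a candidate custom-instruction template.
# A template S is convex iff no path between two nodes of S
# passes through a node outside S; otherwise the custom
# instruction could not be scheduled as an atomic operation.

def descendants(dag, start):
    """All nodes reachable from `start` (excluding `start` itself)."""
    seen, stack = set(), [start]
    while stack:
        for nxt in dag.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_convex(dag, template):
    template = set(template)
    for u in template:
        # any path that leaves the template at `mid`...
        for mid in descendants(dag, u) - template:
            # ...must not re-enter the template
            if descendants(dag, mid) & template:
                return False
    return True

# toy DFG: a -> b -> d and a -> c -> d
dfg = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(is_convex(dfg, {"a", "b", "d"}))       # False: path a->c->d escapes
print(is_convex(dfg, {"a", "b", "c", "d"}))  # True
```

The first template fails because the result of `c` would be needed both inside and outside the atomic instruction, so no feasible schedule exists.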
a. Template generation
1. Solve ILP for DFG to generate a template
2. Collapse template to a single DFG node
3. Repeat while (objective > 0)
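The three steps above can be sketched as a loop. In this sketch the ILP solve is replaced by a greedy stand-in that fuses an edge u→v whenever v has no other predecessor (such a fusion is trivially convex); the toy graph and the one-template-per-iteration gain are assumptions, not the talk's actual model:

```python
# Iterative template generation: "solve" for a template,
# collapse it into a single DFG node, repeat while profitable.

def predecessors(dfg, node):
    return [u for u, succs in dfg.items() if node in succs]

def pick_template(dfg):
    """Stand-in 'solve': a fusable edge, or None (objective == 0)."""
    for u, succs in dfg.items():
        for v in succs:
            if predecessors(dfg, v) == [u]:
                return (u, v)
    return None

def generate(dfg):
    templates = []
    while (edge := pick_template(dfg)) is not None:  # step 3: objective > 0
        u, v = edge
        fused = u + "+" + v                          # step 2: collapse
        succs = [s for s in dfg.pop(v) if s != u]
        for s in dfg.pop(u):
            if s != v and s not in succs:
                succs.append(s)
        dfg[fused] = succs
        for n in dfg:                                # redirect edges to the fused node
            dfg[n] = [fused if s in (u, v) else s for s in dfg[n]]
        templates.append(fused)
    return templates

# toy DFG: an xor -> shift -> add chain collapses step by step
dfg = {"xor": ["shift"], "shift": ["add"], "add": []}
print(generate(dfg))  # ['xor+shift', 'xor+shift+add']
```

Each iteration yields one template; collapsing it lets the next solve discover larger templates built on top of it, which is why the loop runs until the objective reaches zero.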
b. Template selection
● determine isomorphism classes
– find templates that can be implemented using the same instruction
– calculate speed-up potential of each class
● solve Knapsack problem using ILP
– maximize speedup within area constraint
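The selection step maps onto a 0/1 knapsack: each isomorphism class has an area cost and a speed-up potential, and the budget is the area constraint. A sketch using dynamic programming in place of the talk's ILP, with made-up class names, areas and cycle savings:

```python
# Template selection as 0/1 knapsack: maximize cycles saved
# subject to a total area budget (areas in adder-equivalents).

def select_templates(classes, area_budget):
    """classes: list of (name, area, cycles_saved) per isomorphism class."""
    # best[a] = (cycles saved, chosen classes) achievable within area a
    best = [(0, [])] * (area_budget + 1)
    for name, area, saved in classes:
        for a in range(area_budget, area - 1, -1):  # descending: each class used once
            cand = (best[a - area][0] + saved, best[a - area][1] + [name])
            if cand[0] > best[a][0]:
                best[a] = cand
    return best[area_budget]

classes = [("aes_round", 30, 70), ("sbox", 10, 25), ("mixcol", 25, 40)]
print(select_templates(classes, 40))  # (95, ['aes_round', 'sbox'])
```

With a budget of 40, the solver skips the locally attractive `mixcol` class because `aes_round` plus `sbox` saves more cycles for the same area.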
Optimizing compilation flow
Application in C/C++ → Impact Front-end → CDFG Formation → a) Template Generation (under data bandwidth constraints) → b) Template Selection (under area constraints, with gain and area estimates fed back from Synopsys synthesis of the generated VHDL) → MDES Generation and Instruction Replacement → Scheduling, Reg. Allocation (Elcor Backend) → Assembly Code and Statistics
4. Application-specific processor synthesis
● design space exploration framework
– Processor Component Library
– specialized structural description
● prototype: MIPS integer instruction set
– custom instructions
– flexible micro-architecture
● evaluate using actual implementation
– timing and area
Processor synthesis flow
[Diagram: custom data paths from the compiler are combined with the Processor Component Library (pipeline description, parameters) by merging, adding state registers and generating the processor interface (data in/out, stall control), yielding a Custom Processor with FE / EX / M / W pipeline stages]
Implementation
● based on Python scripts
– structural meta-language for processors
– combine RTL (Verilog/VHDL) IP blocks
– module generators for custom units
● generate 100s of designs automatically
– ASIC processor cores
– complete system on FPGA: CPU + memory + I/O
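A module generator in this spirit might look like the following sketch: a Python function that emits a Verilog custom unit with the input/output registers of the custom instruction model. The port names, the stall interface and the placeholder XOR datapath are assumptions for illustration, not the actual tool's output:

```python
# Sketch of a module generator: emit Verilog for a registered
# custom unit with n_in operands (datapath is a placeholder XOR).

def custom_unit(name, n_in, width=32):
    ins = [f"in{i}" for i in range(n_in)]
    ports = ",\n  ".join(
        ["input clk", "input stall"]
        + [f"input [{width - 1}:0] {p}" for p in ins]
        + [f"output reg [{width - 1}:0] result"])
    regs = "\n".join(f"  reg [{width - 1}:0] {p}_r;" for p in ins)
    loads = "\n".join(f"      {p}_r <= {p};" for p in ins)
    datapath = " ^ ".join(f"{p}_r" for p in ins)  # placeholder logic
    return (f"module {name} (\n  {ports}\n);\n"
            f"{regs}\n"
            f"  always @(posedge clk) begin\n"
            f"    if (!stall) begin\n"          # stall control from the CPU
            f"{loads}\n"                        # input registers
            f"      result <= {datapath};\n"    # output register
            f"    end\n"
            f"  end\n"
            f"endmodule\n")

print(custom_unit("aes_step", 4))
```

Because the generator is ordinary Python, sweeping operand counts or widths across hundreds of design points is a simple loop, which is what makes the "generate 100s of designs automatically" workflow practical.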
5. Results
● cryptography benchmarks: C source
– AES decrypt, AES encrypt, DES, MD5, SHA
● 4/5 stage pipelined MIPS base processor
– 0.225 mm² area, 200 MHz clock speed
– single issue processor
– register file with 2 input ports, 1 output port
● processors synthesized to 130nm library
– Synopsys DC and Cadence SoC Encounter
– also synthesize to Xilinx FPGA for testing
AES Decryption Processor
130nm CMOS, 200 MHz, 0.307 mm²
35% area cost (mostly one instruction)
76% cycle reduction
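These two figures combine into the talk's headline area-delay result: cycles fall to 24% of baseline while area grows from 0.225 to 0.307 mm², at the same 200 MHz clock. A quick sanity check:

```python
# Area-delay product ratio from the slide's AES-decrypt numbers.
base_area, custom_area = 0.225, 0.307  # mm^2, base vs custom processor
cycle_ratio = 1 - 0.76                 # 76% cycle reduction -> 24% of cycles
adp_ratio = (custom_area / base_area) * cycle_ratio  # clock unchanged at 200 MHz
print(round(1 / adp_ratio, 1))  # -> 3.1, consistent with the ~3x claim
```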
Execution time
[Chart: normalised number of cycles (0–120) vs area constraint (0–70 ripple carry adders) for AES decrypt, AES encrypt, DES, MD5 and SHA; custom instructions use 4 inputs and 1, 2 or 4 outputs; best cycle reductions annotated at 76% (AES decrypt), 63% and 43%]
Register file in all cases: 2 input ports, 1 output port
Timing
● 48% of designs meet timing at 200MHz without manual optimization
[Scatter plot: slack (-3.5 to 0.5 ns) vs cell area (50000–120000 µm²)]
Area (for maximum speedup)
[Bar chart: cell area and chip area (up to ~500000 µm²) for aesdec, aesenc, des, md5, sha and the base CPU; area overheads 35% (aesdec), 28% (aesenc), 42% (des), 93% (md5), 23% (sha)]
6. Current and future work
● support memory access in custom instructions
– automate data partitioning for memory access
– automate SIMD load/store instructions for state registers
● use architectural techniques e.g. shadow registers
– improve bandwidth without additional register file ports
● study trade-offs for VLIW style
– multiple register file ports
– multiple issue and custom instructions
● extend compiler: e.g. ILP model for cyclic graphs
– adapt software pipelining for hardware
Summary
● complete flow from C to custom processor
● automatic instruction set extension
– based on integer linear programming
– optimize schedule length under constraints
● application-specific processor synthesis
– complete flow: permits real hardware evaluation
● up to 76% reduction in execution cycles
– 3x area delay product reduction
● max speedup: 23% to 93% area overhead