Erno Salminen - Oct. 2007
TKT-2431 SoC Design
Lec 8 - Optimization
Erno Salminen
Department of Computer Systems
Tampere University of Technology
Fall 2008
Copyright notice
Part of the slides adapted from a slide set by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley
http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml
Part of the figures from: J. Heikkinen, J. Sertamo, T. Rautiainen and J. Takala, "Design of Transport Triggered Architecture Processor for Discrete Cosine Transform", in Proc. 15th Ann. IEEE Int. ASIC/SOC Conf., Rochester, NY, U.S.A., Sept. 25-28 2002, pp. 87-91
At first
Make sure that simple things work before even trying more complex ones
Outline
Determine bottlenecks - Amdahl's law
Methods
Architectural choices
Algorithm modifications, assembly coding
Custom processors
HW accelerators
Foreword
"Premature optimization is the root of all evil" - Donald Knuth [quoting Hoare]
Sutter, Alexandrescu:
1st rule: Don't optimize
2nd rule (for experts only): Don't do it yet. Measure twice, optimize once.
Focus on making code as clear and readable as possible
Optimizations make design and code more complex
Optimize only when a performance bottleneck has been proven (and identified)
System bottlenecks (1)
[H. Meyr, Application Specific Instruction-Set Processors for Wireless Communications, Tampere SoC, Nov. 2004]
[Berkeley Design Technology Inc., Alternatives to DSPs: What and Why?, Tampere SoC, Nov. 2003]
Determine what's taking time
Or area, power, memory
System bottlenecks (2)
Concentrate optimization on bottlenecks
No use optimizing a part that takes a small fraction, say 3%, of the execution time
Trivial Matlab example
Removed one unnecessary #include from m files
12x speedup
Locating the bottleneck took a few hours
Fixing the bottleneck took 1 minute
System may be refined into smaller blocks to determine the bottlenecks in logic area or propagation delay
Otherwise, it is difficult to determine the relation between an HDL source line and the schematic
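A minimal sketch of "determine what's taking time" in software: time each stage of a program separately and report its share of the total. The function name `stage_time_shares` and the two example stages are hypothetical and only illustrate the idea; a real project would more likely use a profiler.

```python
import time


def stage_time_shares(stages, repeats=3):
    """Return {stage name: share of total measured time} for name -> callable."""
    totals = {}
    for name, func in stages.items():
        start = time.perf_counter()
        for _ in range(repeats):
            func()
        totals[name] = time.perf_counter() - start
    total = sum(totals.values())
    if total == 0.0:
        raise ValueError("stages ran too fast to measure; increase repeats")
    return {name: elapsed / total for name, elapsed in totals.items()}


if __name__ == "__main__":
    # Hypothetical stages: one heavy loop and one nearly free step.
    shares = stage_time_shares({
        "heavy": lambda: sum(range(200_000)),
        "light": lambda: None,
    })
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {share:.1%} of measured time")
```

Only the stage with the dominant share is worth optimizing; the others bound the achievable gain.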
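Amdahl's Law
$$\mathrm{ExTime}_{new} = \mathrm{ExTime}_{old} \times \left[ (1 - \mathrm{Fraction}_{enhanced}) + \frac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}} \right]$$
$$\mathrm{Speedup}_{overall} = \frac{\mathrm{ExTime}_{old}}{\mathrm{ExTime}_{new}} = \frac{1}{(1 - \mathrm{Fraction}_{enhanced}) + \dfrac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}}}$$
[Figure: execution time before and after the enhancement]
[H. Corporaal, course material Adv. Computer architectures, Univ. Delft, 2001]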
NOTE! Very important!
Amdahl's Law Example
Floating point instructions improved to run 2x faster, but only 10% of actual instructions are FP
$$\mathrm{ExTime}_{new} = \mathrm{ExTime}_{old} \times (0.9 + 0.1/2) = 0.95 \times \mathrm{ExTime}_{old}$$
$$\mathrm{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$
$$\text{Max. } \mathrm{Speedup}_{overall} = \frac{1}{1 - \mathrm{Fraction}_{enhanced}}$$
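The floating-point example can be checked with a few lines of code. This is a small sketch; the function names `amdahl_speedup` and `amdahl_limit` are made up here, and, as on the slide, the 10% share of instructions is treated as a 10% share of execution time.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is accelerated."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be within [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


def amdahl_limit(fraction_enhanced):
    """Upper bound on overall speedup (enhanced part takes zero time)."""
    if fraction_enhanced >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    print(round(amdahl_speedup(0.10, 2.0), 3))  # 1.053, the FP example
    print(round(amdahl_limit(0.10), 3))         # 1.111, even with infinitely fast FP
```

The same bound explains why a part taking 3% of the execution time is not worth optimizing: `amdahl_limit(0.03)` is only about 1.03.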
Architectural choices
Architectural choices
[Figure: log Flexibility versus log Efficiency (increasing speed, decreasing power and area). Design points shown: general purpose microprocessor (microprocessor, data+instr mem); SW programmable DSP (microprocessor with MAC and address generator, data+instr mem); hardware reconfigurable processor (microprocessor with co-processors, data+instr mem); FPGA; direct mapped HW (std. cell ASIC, full custom ASIC). Dream solution: exists only in marketing material...]
[Figure, two panels (figure data by T. Noll, RWTH Aachen): general-purpose CPU, DSP, FPGA/ASIP, std-cell ASIC, full custom ASIC]
Heinrich Meyr, Future Wireless Communication Systems…, VTC, 2005.
http://www.ieeevtc.org/vtc2005spring/presentations/2020_presentations/HMeyr.pdf
Architectural choices (2)
Area and energy efficiencies of comparable MPEG-4 encoder implementations (the bigger the better)
[Figure: area efficiency [Mpixels/s/mm2] and energy efficiency [Mpixels/s/W]; the "dream solution" is marked]
Values include RAM.
[O. Silven and K. Jyrkkä, Observations on Power-Efficiency Trends in Mobile Communication Devices, EURASIP Journal on Embedded Systems, Vol 2007, Article ID 56976, 10 pages, 2007.]
ASIC versus PLD/FPGA Design Starts
[Figure: ASIC design starts per year, 2001-2004, axis 0-6000. Source: Gartner Group]
[Figure: PLD/FPGA design starts per year, 2001-2004, axis 0-600000. Source: Gartner Group]
"ASIC design starts will decline 12.3 percent to 4,345 this year following the precipitous 36 percent drop in design starts in 2001" (B. Lewis, Gartner Dataquest, 10/28/02)
PLD/FPGAs are becoming more and more the driving force in microelectronics technology, CAD tools and System-on-Chip design.
Algorithmic modifications, assembly language
Algorithm manipulation
Accelerated function should give identical results with original
Additional conversion functions may destroy all speedup
Do not perform over-accurate calculation
Single/double prec. floating-point vs. fixed point
SW emulation of floating point operations is s-l-o-w
E.g. floating point DCT 200 kcycles, fixed point 15 kcycles
HW FPUs are big: ~5.7 mm2 @ 0.35 um [Brunelli, TreSoc04], ~120 kgates (compare to RISC core ~50 kgates)
Fixed point is less accurate
Word width optimization
Especially on HW
On CPU, smallest is not necessarily fastest
Using type char may require additional shift/AND/OR instructions
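To illustrate the fixed-point trade-off, here is a small sketch of Q15 arithmetic (16-bit values with 15 fractional bits). The Q15 format and the helper names are assumptions chosen for illustration; the slides do not say which fixed-point format the DCT used. A multiplication becomes one integer multiply and one shift, at the cost of a small quantization error.

```python
Q = 15            # number of fractional bits (assumed Q15 format)
SCALE = 1 << Q    # 32768


def to_q15(x):
    """Quantize x in [-1, 1) to a Q15 integer, saturating at the range ends."""
    value = int(round(x * SCALE))
    return max(-SCALE, min(SCALE - 1, value))


def from_q15(value):
    """Convert a Q15 integer back to a float."""
    return value / SCALE


def q15_mul(a, b):
    """Multiply two Q15 values: integer multiply, then shift back by Q bits."""
    return (a * b) >> Q


if __name__ == "__main__":
    exact = 0.3 * 0.7
    approx = from_q15(q15_mul(to_q15(0.3), to_q15(0.7)))
    print(f"float: {exact:.6f}  Q15: {approx:.6f}  error: {abs(exact - approx):.2e}")
```

The error stays within a few least significant bits, which is often acceptable; whether it is acceptable for a given algorithm has to be checked against the "identical results" requirement above.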
Example: Sorting
Simplest algorithms have O(n^2) execution time
More complex O(n log n)
Require recursion, advanced data structures, and multiple arrays
Recursion may lead to stack overflow
Multiple arrays require big memory
[Figure: sorting execution times in two plots; one with bubble, selection, insertion and shell sort (time axis up to 900), one with heap, merge and quick sort (time axis up to 0.7)]
Fig: http://linux.wku.edu/~lamonml/algor/sort/sort.html
P.S. Avoid light-colored lines (e.g. yellow); use markers
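The gap between the two algorithm classes can be made concrete by counting comparisons. The sketch below uses insertion sort and merge sort as representatives; the function names are made up for this example. Note that the merge sort shows exactly the costs listed above: recursion and extra arrays.

```python
def insertion_sort_comparisons(values):
    """Sort a copy with insertion sort; return (sorted list, comparison count)."""
    data = list(values)
    comparisons = 0
    for i in range(1, len(data)):
        key = data[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if data[j] <= key:
                break
            data[j + 1] = data[j]   # shift larger element one slot right
            j -= 1
        data[j + 1] = key
    return data, comparisons


def merge_sort_comparisons(values):
    """Sort a copy with recursive merge sort; return (sorted list, comparisons)."""
    data = list(values)
    if len(data) <= 1:
        return data, 0
    mid = len(data) // 2
    left, left_count = merge_sort_comparisons(data[:mid])
    right, right_count = merge_sort_comparisons(data[mid:])
    merged = []                      # extra array needed for merging
    comparisons = left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


if __name__ == "__main__":
    worst_case = list(range(256, 0, -1))   # reversed input
    _, slow = insertion_sort_comparisons(worst_case)
    _, fast = merge_sort_comparisons(worst_case)
    print(f"insertion sort: {slow} comparisons, merge sort: {fast} comparisons")
```

For small arrays the simple algorithm may still win in practice, because its loop is short and needs no extra memory.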
Algorithm: Sacrificing quality
[Ramchan Woo, Tampere SoC, Nov. 2004]
Decrease data width
Assembly coding (1)
Try assembly only if everything else fails
Keep also the high-level language (HLL) version to allow portability and reuse
Sometimes required with special instructions
Such as interrupt handling, MMX, processor mode (user/supervisor)
Speedup with RISC processors not that great
Usually only one execution unit
(Few) instructions, simple addressing
Decent compilers available
Assembly coding (2)
DSPs most likely benefit from assembly
Tight loops
Complex micro-architecture is difficult for compiler
"Latest Compilers fall short of hand-optimized performance substantially even for DSP Kernels"
[Naji S. Ghazal et al., Retargetable Estimation for DSP Architecture Selection, Tampere SoC, Nov. 1999]
Optimization impact
RISC = estimated number of required basic "RISC" operations
fm = fitting coefficient = measured_cycles / estim_RISC_ops
N.O. = no optimization
H.O. = hand optimized
[O. Lehtoranta, PhD Thesis, TUT 2006]
Assembly example: vector copy, B[] = A[]
First version
start_copy:
    ld r1, [r2]       // r2 is src addr, A[i]
    st [r3], r1       // r3 is dst addr, B[i]
    inc r2
    inc r3
    dec r4            // r4 is data amount, one data copied
    cmp r4, 0         // is enough copied?
    bneq start_copy   // loop back if needed
Load causes pipeline stall if next instruction depends on loaded value
Second version
    ld r1, [r2]
    inc r2
    st [r3], r1
    and so on ...
Increment does not depend on r1, so the stall is avoided
Load could also be performed just before the branch
Load delay is then hidden by the pipeline stall of the branch
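The effect of reordering can be checked with a toy model. The sketch below assumes a pipeline where an instruction that reads a register loaded by the immediately preceding `ld` costs one stall cycle; the one-cycle penalty, the tuple encoding and the function name are assumptions for illustration, not properties of any particular CPU.

```python
def count_load_use_stalls(program, load_delay=1):
    """Count stall cycles caused by using a loaded register in the next instruction.

    Each instruction is (opcode, destination register or None, source registers).
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, destination, _ = previous
        if opcode == "ld" and destination in current[2]:
            stalls += load_delay
    return stalls


# First version: the store uses r1 right after the load.
FIRST_VERSION = [
    ("ld", "r1", ("r2",)),
    ("st", None, ("r3", "r1")),
    ("inc", "r2", ("r2",)),
    ("inc", "r3", ("r3",)),
    ("dec", "r4", ("r4",)),
    ("cmp", None, ("r4",)),
    ("bneq", None, ()),
]

# Second version: an independent increment is placed between load and store.
SECOND_VERSION = [
    ("ld", "r1", ("r2",)),
    ("inc", "r2", ("r2",)),
    ("st", None, ("r3", "r1")),
    ("inc", "r3", ("r3",)),
    ("dec", "r4", ("r4",)),
    ("cmp", None, ("r4",)),
    ("bneq", None, ()),
]

if __name__ == "__main__":
    print("first version stalls per iteration:", count_load_use_stalls(FIRST_VERSION))
    print("second version stalls per iteration:", count_load_use_stalls(SECOND_VERSION))
```

Both versions execute the same seven instructions; only their order differs, which is the kind of scheduling a decent compiler can usually do by itself on a simple RISC.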
Assembly example: delayed branch
Fig 2. ’Normal’ branch
Fig 3. Delayed branch
Two instr. (i3 +i4) following the branch are also executed
Addr Instruction
a1 i1: MR=MR+MX0*MY0 (SS);
a2 i2: IF COND JUMP aa1;
a3 i3
a4 i4
a5 i5
a6 i6
a7 i7
... ...
aa1 ii1
[http://www.analog.com/UploadedFiles/Application_Notes/587795865ee_123.pdf]
Fig 2 ('normal' branch): four-cycle stall. Fig 3 (delayed branch): two-cycle stall.
Custom processors (ASIPs)
Custom processors
Allow using C/C++ compilation
ASIP = Application Specific Instruction set Processor
Extend CPU with application (domain) specific instructions
MAC, sum with clipping, DCT etc.
Extension tightly coupled with CPU pipeline
Optimize internal communication within CPU
Remove unnecessary instructions
Otherwise configure CPU (num of registers, data width...)
Custom processor performance (1)
Tensilica Xtensa
Kernel speed-up 6x - 100x
Depends heavily on application
Base CPU ~20 000 gates
HW overhead 20% - 150%
[Monica Lam, Compiler Technology for Configurable Processors, Tampere SoC, Nov. 2001.]
Disclaimer: heavy marketing content
Custom processor performance (2)
[Yasmin Oz et al., Galois Field Instruction Set Accelerator in the StarCore SC140 DSP, Tampere SoC, Nov. 2001.]
Reed-Solomon decoding cycle count
[Figure: runtime on SC140 and GFISA; speedup = t(SC140) / t(GFISA): 22.1, 14.5, 6.3, 1.0]
SC140 = original StarCore DSP
GFISA = special instructions for Galois field operations added
HW overhead ~10%
Special ISA does not help every algorithm!
Custom processor performance (3)
Beneficial also for energy
Note: E = P * t
[Figure annotations: 6.1x speedup, 8.0x speedup]
[H. Meyr, Application Specific Instruction-Set Processors for Wireless Communications, TreSoC 2004]
Transport Triggered Architecture (TTA)
Application-specific processor
More flexible than HW
Still allows programmability
Almost the same performance as ASIC
MOVE design framework allows (semi)automatic exploration
Number of execution units
Connections between units
Many trade-offs between area and performance
Many proposed custom CPUs use manual exploration
Resembles VLIW
Everything scheduled at compile-time
Designer gives C code and restrictions to exploration tool
Tools generate synthesizable VHDL
TTA (2)
C compiler automatically configured to new micro-architecture
Distinctive factor compared to many CPUs
One instruction: move, e.g. "Add r2, r3, r3":
    move reg[2] -> ALU.op1
    move reg[3] -> ALU.trig
    move ALU.result -> reg[2]
TTA allows more freedom in code scheduling than traditional CPUs
But suffers from larger code size
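The "only moves" idea can be sketched with a toy interpreter. The sketch assumes a single adder unit with an operand port, a trigger port and a result port, where writing the trigger port starts the addition; the class and function names are invented for this example and do not come from the MOVE framework.

```python
class AddUnit:
    """Toy function unit: writing the trigger port performs the addition."""

    def __init__(self):
        self.op1 = 0
        self.result = 0

    def write(self, port, value):
        if port == "op1":
            self.op1 = value                   # plain operand, no side effect
        elif port == "trig":
            self.result = self.op1 + value     # the transport triggers the operation
        else:
            raise ValueError(f"unknown port: {port}")


def run_moves(moves, registers):
    """Execute (source, destination) moves between registers and the adder ports."""
    alu = AddUnit()
    for source, destination in moves:
        value = alu.result if source == "ALU.result" else registers[source]
        if destination.startswith("ALU."):
            alu.write(destination.split(".", 1)[1], value)
        else:
            registers[destination] = value
    return registers


if __name__ == "__main__":
    # The three moves from the slide: they compute reg[2] = reg[2] + reg[3].
    moves = [("r2", "ALU.op1"), ("r3", "ALU.trig"), ("ALU.result", "r2")]
    print(run_moves(moves, {"r2": 5, "r3": 7}))
```

One conventional add turns into three moves, which hints at where the larger code size comes from, while the compiler gains the freedom to schedule each move separately.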
TTA (3)
Better area and performance than general purpose RISC
Special function unit (SFU)
added manually
increases area
decreases ex. time
For certain algorithms, the same cycle counts as ASIC may be achieved
ASIC has higher frequency
Currently, developed also at TUT
Interested students may do project work on TTA
Area vs. runtime trade-off
TTA's cycle count smaller than RISC, close to ASIC
TTA's area between ASIC and RISC
ASIC has highest frequency
(memory excluded)
RC4 exploration [Hämäläinen, Euromicro DSD, 2005]
HW accelerators
HW accelerators (1)
Favor: highest performance, smallest area and power
Against: longest design time, narrow application domain
Do not require code memory like programmable processors (CPU, ASIP, DSP)
Example: 8x8 DCT
# | Type | Technology [um] | Cycles | Area | Speedup (in cycle count) | Freq [MHz] | Max perf [blocks/s] | Perf/area [blocks/s/gates]
A | RISC (ARM9) | 0.18 | 2660 | 190 kilogates + mem | 1.0 | 160 | 60 k | 0.32
B | ASIP (TTA+SFU) | 0.13 | 538 | 56 kilogates + 34 kilogates mem | 4.9 | 250 | 464 k | 5.16
C | HW (by student) | 0.18 | 250 | 44 kilogates | 10.6 | 182 | 728 k | 16.55
D | HW (by PhD) | 0.11 | 94 | 39 kilogates + control logic | 29.3 | 253 | 2691 k | 69.01
D: [J. Nikara, Application-Specific Parallel Structures for Discrete Cosine Transform and Variable Length Decoding, PhD thesis, TUT, June 2004]
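The derived columns of the DCT table can be recomputed from cycles, clock frequency and gate count. The sketch below makes two assumptions: row D is read as 94 cycles and 39 kilogates (the transcript fuses them into "9439"), and area counts only the listed gates (memory of row A and control logic of row D are left out, as in the table). Dividing frequency by cycles gives tens of thousands to millions of blocks per second, which suggests that the performance column is in thousands of blocks per second.

```python
# Rows of the 8x8 DCT comparison; D's values are an assumed reading of "9439".
ROWS = {
    "A": {"cycles": 2660, "freq_mhz": 160, "kilogates": 190},
    "B": {"cycles": 538, "freq_mhz": 250, "kilogates": 56 + 34},
    "C": {"cycles": 250, "freq_mhz": 182, "kilogates": 44},
    "D": {"cycles": 94, "freq_mhz": 253, "kilogates": 39},
}


def derive(row, baseline_cycles=2660):
    """Recompute speedup, blocks per second and blocks/s per gate for one row."""
    blocks_per_second = row["freq_mhz"] * 1e6 / row["cycles"]
    return {
        "speedup": baseline_cycles / row["cycles"],
        "blocks_per_second": blocks_per_second,
        "perf_per_gate": blocks_per_second / (row["kilogates"] * 1e3),
    }


if __name__ == "__main__":
    for name, row in ROWS.items():
        d = derive(row)
        print(f"{name}: speedup {d['speedup']:.1f}, "
              f"{d['blocks_per_second'] / 1e3:.0f} k blocks/s, "
              f"{d['perf_per_gate']:.2f} blocks/s/gate")
```

With these readings the perf/area column is reproduced for all four rows; the recomputed cycle-count speedup of row D comes out as about 28.3 rather than the listed 29.3.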
HW accelerators (2)
Regular, data-flow type functions most suitable for HW
Communication between CPU and HW critical
Delay, mutual exclusion, pipelining
[Figure: execution timelines for three cases: CPU only; CPU + HW v.1, where the CPU waits for the HW; CPU + HW v.2, where CPU and HW work as a pipeline. Annotations: "4x speedup", "CPU communication overhead reduces the overall speedup"]
HW accelerator (5)
[Figure: CPU 1 and CPU 2, each with I+D mem, and accelerators accel 1, accel 2 and accel 3; components attach to an on-chip network through network IFs. Labels: local, private acc.; remote, shared acc.]
HW accelerators (3)
Orig SW:
for i=0:N loop
    load r1, [r2]
    add, sub, mul, cmp, beq, other processing
    st r1, [r3]
end loop
Measured SW ex. time includes loading input values and storing the results
SW + HW, straightforward polling:
start_hw()
while (hw_ready==0) {}
for i=0:N loop
    load r1, [r2]
end loop
Even if HW does processing much faster, data transfers from CPU to HW must be taken into account
polling = busy wait
SW + HW, pipelined:
start_hw()
other_function_x();
while (hw_ready==0) {}
for i=0:N loop
    load r1, [r2]
end loop
Function X executed in parallel with HW. Less time wasted in polling (but still polling)
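The three variants can be compared with a back-of-the-envelope cycle model. All numbers and function names below are hypothetical; the model only assumes that the CPU must still move every result word itself and that, in the pipelined case, the other function overlaps with the HW run.

```python
def sw_only_cycles(items, sw_cycles_per_item):
    """Everything computed on the CPU."""
    return items * sw_cycles_per_item


def hw_polling_cycles(items, hw_cycles, transfer_cycles_per_item):
    """CPU starts HW, busy-waits until it is ready, then reads the results."""
    return hw_cycles + items * transfer_cycles_per_item


def sequential_total(items, hw_cycles, transfer_cycles_per_item, other_cycles):
    """Polling version followed by the other function: nothing overlaps."""
    return hw_polling_cycles(items, hw_cycles, transfer_cycles_per_item) + other_cycles


def pipelined_total(items, hw_cycles, transfer_cycles_per_item, other_cycles):
    """Other function runs while HW computes; CPU polls only for the remainder."""
    return max(hw_cycles, other_cycles) + items * transfer_cycles_per_item


if __name__ == "__main__":
    items, sw_per_item = 64, 20          # hypothetical workload
    hw_cycles = sw_only_cycles(items, sw_per_item) // 4   # HW itself is 4x faster
    transfer, other = 7, 300             # cycles per word, cycles of function X
    sw = sw_only_cycles(items, sw_per_item)
    polled = hw_polling_cycles(items, hw_cycles, transfer)
    print(f"SW only: {sw}, SW+HW polling: {polled}, speedup {sw / polled:.2f}")
    print("with function X, sequential:", sequential_total(items, hw_cycles, transfer, other))
    print("with function X, pipelined: ", pipelined_total(items, hw_cycles, transfer, other))
```

With these made-up numbers the 4x faster HW yields well under 2x overall, because the data transfers dominate; overlapping function X with the HW run recovers the time otherwise spent in the busy wait.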
HW accelerators (4)
Polling vs. interrupts
Interrupts allow more efficient parallel execution
CPU controlled transfers vs. DMA
CPU transfers all the data, time O(n), 7 cycles/word
start_copy:
    ld r1, [r2]       // r2 is src addr
    st [r3], r1       // r3 is dst addr
    inc r2
    inc r3
    dec r4            // r4 is data amount, one data copied
    cmp r4, 0         // is enough copied?
    bneq start_copy   // loop back if needed
CPU just inits DMA controller, time O(1), DMA 1 cycle/word
start_dma:
    st #DMA_SRC_ADDR, r1
    st #DMA_DST_ADDR, r2
    st #DMA_AMOUNT, r4
    do_other_stuff()
    ...
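The difference between the two transfer styles is easy to quantify. The sketch below uses the figures from the slide (7 cycles/word for the CPU loop, 1 cycle/word for DMA) and assumes that initializing the DMA controller costs one cycle per store, i.e. 3 cycles; that setup cost and the function names are assumptions.

```python
def cpu_copy(words, cycles_per_word=7):
    """CPU-controlled copy: the CPU is busy for the whole transfer."""
    busy = words * cycles_per_word
    return {"cpu_busy_cycles": busy, "elapsed_cycles": busy}


def dma_copy(words, setup_cycles=3, cycles_per_word=1):
    """DMA copy: the CPU only writes source, destination and amount registers."""
    return {
        "cpu_busy_cycles": setup_cycles,
        "elapsed_cycles": setup_cycles + words * cycles_per_word,
    }


if __name__ == "__main__":
    for words in (16, 1000):
        cpu, dma = cpu_copy(words), dma_copy(words)
        freed = cpu["cpu_busy_cycles"] - dma["cpu_busy_cycles"]
        print(f"{words} words: CPU copy {cpu['elapsed_cycles']} cycles, "
              f"DMA {dma['elapsed_cycles']} cycles, {freed} CPU cycles freed")
```

The CPU cost of the DMA variant stays constant while the CPU loop grows linearly with the amount of data, so the benefit increases with transfer size.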
HW optimization (1)
Reuse benefits from configurability and many parameters
Run-time configurability is often costly
Good for simulation-based testing
Convert input signals into generics for synthesis
Turn unwanted features off to save area and power
Perhaps increases the max freq also
if enable_g = '1' then <code>;
Example: config memory inside bus wrapper
2 generics
1. we = write enable
2. re = read enable
[Figure: configuration memory area [gates], axis 0-1400, versus memory type (we=0, re=0; we=0, re=1; we=1, re=0; we=1, re=1; labels rom and ram), with series for no slots, 1 slot and 2 slots]
optimize according to application
HW optimization (2)
Try to design HW so that propagation delay is not (linearly) dependent on data width
Scalable solution
Bad example: if data < 55 then data <= data+1;
Better: if data /= 55 then data <= data+1;
Turn on boundary optimization
Logic in different entities optimized together
E.g. inverters can be removed
Restricted value set in output
[Figure: block A drives block B over a 4b signal; if the output of block A uses < 16 of all possible values, this can be optimized]
Note: combinatorial outputs not recommended
HW optimization (3)
Minimize the data width of signals
Remove unnecessary flip-flops (4-6 eq. gates each)
i.e. those with constant output
DC: set compile_seqmap_propagate_constants true
Optimizes also the logic after the flip-flop
By default, synthesis does NOT remove any registers
All signals that are assigned in sequential process (clk, rst_n) produce a flip-flop
[Figure: flip-flops with constant output (always 1, always 0) and the propagated constant]
HW optimization (4)
Do not 'reset' registers when value is not needed
e.g. if valid_in = '0' then data_r <= (others => '0');
Unnecessary input MUX
Good for visualization in simulation though
if dbg_enable_g = '1' then reg <= dbg_value;
Easy to see when these are valid
Validity determined according to signal empty
[Figure: real logic with a "debug value" fed through an unnecessary mux]
HW optim: Aim at "fast enough"
Do not overoptimize HW, if performance limit is known
100 frames/sec encoder is not better than 25 fps enc, if camera restricts the frame rate anyway
Minimizing critical path causes large area
Requires larger drive strength for gates
They also have higher leakage currents
Minimizing cycle count needs many parallel sub-blocks (e.g. ALUs)
Consider the integration overheads also
[Figure: area versus speed; a: speed as [1/cycles], b: speed as [MHz]]
"Fast enough": Real data
[Wei, J.; Rowen, C.; Implementing low-power configurable processors - practical options and tradeoffs, Design Automation Conference, 2005. Proceedings. 42nd, 13-17 June 2005, Page(s): 706 - 711]
C/C++ based HW design
"Do not need HW designers anymore as SW designer can do everything"
Not exactly true...
SystemC
Good for simulation
Many problems with synthesis currently
HW oriented SystemC cannot be compiled as SW anymore
Catapult C by Mentor
Promising approach
Pure C:
No timing
Interfaces defined in synthesis tool
Best idea in Catapult C
Same description can be compiled and synthesized
Block level design - not for large systems
Not practical in large scale currently
C/C++ based HW design (2)
[Ramani, Haggard, Southeastern Symposium on System Theory, 2001]
Conclusion
Remember Amdahl's law - concentrate on appropriate parts of the system
ASIPs provide great improvements but still allow programmability
Communication between components has great impact on performance
Use interrupts and DMA controllers
Pipeline SW and HW