© 2016 Synopsys, Inc. All rights reserved.
May 9, 2016
Case study:
Performance-efficient Implementation of
Robust Header Compression (ROHC)
using an Application-Specific Processor
Gert Goossens, Patrick Verbist,
Erik Brockmeyer, Luc De Coster
Synopsys
Agenda
1. Robust Header Compression (ROHC) in network processing
2. Application-Specific Processor (ASIP) methodology
3. Accelerating control processing in ROHC
4. Accelerating data processing in ROHC
5. Conclusions
ROHC in Network Processing
ROHC compressor
• 1.2 Mpackets/s
• 600 MHz clock → 500 cycles/packet
− Header Parser: ~100 cycles/packet
− Encoder+Context+CRC: ~400 cycles/packet
• Optimize for worst-case control path
High Performance Streaming Data (IP/UDP/RTP Protocol)
Uncompressed packet: IP Header (20-40 bytes) | UDP Header (8 bytes) | RTP Header (12 bytes) | Payload (Video/Audio…)
Compressed packet: ROHC Header | Payload (Video/Audio…)
ROHC Compressor → Radio or Cable Link → ROHC Decompressor
Header Parser
Header Field Encoder
Packet Modification
Buffer
Feedback Buffer
Context Processor
CRC
Context Mem
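The 500 cycles/packet budget on this slide is the clock rate divided by the packet rate, split into roughly 100 cycles for parsing and 400 for the remaining blocks. A minimal sketch of that arithmetic in C; the function name is illustrative and does not come from the slides:

```c
#include <stdint.h>

/* Cycles available per packet = clock frequency / packet rate.
 * Hypothetical helper for illustration; integer division rounds down. */
static uint32_t cycles_per_packet(uint64_t clock_hz, uint64_t packets_per_s)
{
    return (uint32_t)(clock_hz / packets_per_s);
}
```

With the slide's figures, `cycles_per_packet(600000000, 1200000)` evaluates to 500.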
ROHC Implementation
• Blocks requiring efficient control-flow: tiny microprocessor with efficient branching and logic operations
• Blocks requiring efficient control-flow and data processing: tiny microprocessor with hardware-accelerated instructions
ASIP technology enables the design of such processors
ASIPs in SoC Design
ASIP architectural optimization space
• Parallelism
– Instruction-level parallelism: orthogonal instruction set (VLIW), encoded instruction set
– Data-level parallelism: vector processing (SIMD)
– Task-level parallelism: multi-core, multi-threading
• Specialization
– Applic.-specific data types: integer, fractional, floating-point, bits, complex, vector…
– Applic.-specific instructions
  – App.-spec. data processing: any exotic operator; single or multi-cycle
  – App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect…
  – App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication; relative or absolute, address range, delay slots
– Connectivity & storage matching application’s data-flow: distributed regs, sub-ranges; multiple mem’s, sub-ranges
• Pipeline
– Pipeline depth
– Hazards: HW/SW stall, bypass
Micro-processor → Extensible Processor → Application-Specific uP / DSP → Programmable Datapath → Hardwired Datapath
“ASIP Designer” Tool-Suite
Accelerated Control Processing
Customization of a 16-bit CPU: “Strip Down & Beef Up”
• Architectural exploration with ASIP Designer
• Starting point: “Tmicro” CPU
– 16-bit gen.-purpose CPU (already leaner than 32-bit)
– Variable-length instructions: arithmetic (16), move (16, 32), load/store (16, 32), control (16, 32, 48)
• End point: “Tnano” ASIP
– 16-bit stripped CPU
– Fixed-length instructions: arithmetic, move, load/store, control (16)
  – No multi-word decoding overhead
  – Improved clock frequency
– Add compact control instructions to accelerate ROHC code
  – Predicated execution (Selection)
  – Field extraction (Masking)
  – Shortcut logic instructions
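Field extraction is the operation a header parser repeats most often: shift a header word and mask out a bit field. The sketch below shows that pattern in plain C, using the first byte of an IPv4 header (4-bit version, 4-bit header length in 32-bit words) as the example. The helper names are hypothetical; the Tnano instruction itself is not shown in this transcript.

```c
#include <stdint.h>

/* Extract `width` bits (1..16) starting at bit position `lsb`
 * (0 = least significant bit). Hypothetical helper. */
static inline uint16_t extract_field(uint16_t word, unsigned lsb, unsigned width)
{
    uint32_t mask = ((uint32_t)1 << width) - 1u;
    return (uint16_t)((word >> lsb) & mask);
}

/* First IPv4 header byte: version in the upper nibble. */
static inline unsigned ipv4_version(uint8_t first_byte)
{
    return extract_field(first_byte, 4, 4);
}

/* First IPv4 header byte: header length in 32-bit words in the lower
 * nibble, converted here to bytes. */
static inline unsigned ipv4_header_bytes(uint8_t first_byte)
{
    return extract_field(first_byte, 0, 4) * 4u;
}
```

On a general-purpose CPU each call costs a shift plus an AND; a dedicated field-extraction instruction would fold both into one operation.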
Accelerated Control Processing
Control Path Balancing
• Example: Control-Flow Graph of Header Parser
(Figure: control-flow graph with the longest and shortest control paths labeled)
• Improve control path balancing by
– C source code refactoring
– User control over code hoisting
– Predicated execution in tail of long control paths
Accelerated Control Processing
If-Else, No Predication: Tmicro (gen.-purp. CPU)
• nML: conditional jump instruction, 2-cycle branch penalty
• C: condition at tail of long control path
• Machine code: conditional jump with branch penalty; one of two delay slots filled, one ‘nop’ left
Accelerated Control Processing
Predication: Tnano (optimized ASIP)
• nML: select instruction
• C: condition at tail of long control path
• Machine code:
– Conditional code executes always
– Result is used selectively
– No branch penalty
• nML: predication threshold
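The slide's code listings did not survive in this transcript. As a stand-in, the sketch below contrasts the two shapes in plain C: an if-else that a conventional compiler would typically turn into a conditional jump, and the predicated form in which both arms are computed and one result is selected. Function names and the arithmetic in each arm are illustrative assumptions.

```c
#include <stdint.h>

/* Branching form: only one arm executes, at the cost of a conditional jump. */
static uint16_t pick_branch(int cond, uint16_t a, uint16_t b)
{
    uint16_t r;
    if (cond)
        r = (uint16_t)(a + 1);
    else
        r = (uint16_t)(b << 1);
    return r;
}

/* Predicated form: both arms always execute, then one result is selected.
 * This is legal only because neither arm has side effects. */
static uint16_t pick_select(int cond, uint16_t a, uint16_t b)
{
    uint16_t t = (uint16_t)(a + 1);   /* "then" arm, computed unconditionally */
    uint16_t f = (uint16_t)(b << 1);  /* "else" arm, computed unconditionally */
    return cond ? t : f;              /* maps naturally onto a select instruction */
}
```

Predication trades extra executed instructions for the removed branch penalty, so it pays off only for short arms; a predication threshold is presumably the knob that bounds how much code the compiler will convert this way.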
Accelerated Control Processing
If-Else with Multiple Tests: Tmicro (gen.-purp. CPU)
• nML: stand-alone compare instruction
• C: “if-else” with multiple tests
• Machine code: multiple compare and c-jump instructions; slow in worst case
Accelerated Control Processing
If-Else with Multiple Tests: Tnano (optimized ASIP)
• nML: “compare + shortcut-logic” instruction
    CND &= Rj==Ri
    CND |= Rj!=Ri
• C: “if-else” with multiple tests
• Machine code:
– Multiple “compare + shortcut-logic” instructions
– Single c-jump
Worst case is always faster!
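The `CND &= Rj==Ri` and `CND |= Rj!=Ri` forms above fold each comparison into a running condition flag, so a compound test needs only one conditional jump at the end. A rough C equivalent, assuming a made-up three-test condition (the actual ROHC test from the slide is not in this transcript):

```c
#include <stdbool.h>
#include <stdint.h>

/* Short-circuit form: a conventional compiler may emit a compare plus a
 * conditional jump for each individual test. */
static bool test_branchy(uint16_t a, uint16_t b, uint16_t c, uint16_t d,
                         uint16_t e, uint16_t f)
{
    if ((a == b && c == d) || e != f)
        return true;
    return false;
}

/* Accumulating form: each line mirrors one "compare + shortcut-logic"
 * instruction updating a condition flag; a single jump would follow. */
static bool test_accumulated(uint16_t a, uint16_t b, uint16_t c, uint16_t d,
                             uint16_t e, uint16_t f)
{
    bool cnd = (a == b);  /* CND  = Rj==Ri */
    cnd &= (c == d);      /* CND &= Rj==Ri */
    cnd |= (e != f);      /* CND |= Rj!=Ri */
    return cnd;
}
```

The accumulating form always evaluates every test, which is safe here because the comparisons have no side effects, and it makes the execution time independent of which test fails.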
Accelerated Control Processing
Results – Header Parser

                                    Tmicro CPU      Tnano ASIP
Rohc_parse program code size        347 x 16-bit    227 x 16-bit (-35%)
Rohc_parse cycle count per packet   191             87 (-55%)
Clock frequency (28nm HPM)          800 MHz         1 GHz (+25%)
Gate count (core only, 28nm HPM)    14K gates       5.4K gates (-61%)
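The cycle-count and clock-frequency rows combine into a wall-clock time per packet. This figure is derived here from the table and is not quoted on the slide; the helper name is illustrative.

```c
/* Time to process one packet, in nanoseconds:
 * cycles / (MHz * 1e6) seconds = cycles * 1000 / MHz nanoseconds. */
static double ns_per_packet(unsigned cycles, double clock_mhz)
{
    return cycles * 1000.0 / clock_mhz;
}
```

With the table's values, Tmicro needs 191 cycles at 800 MHz, about 239 ns per packet, while Tnano needs 87 cycles at 1 GHz, 87 ns per packet, or roughly 2.7x faster for the header parser.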
Accelerated Data Processing
• Implementation styles
– Software on processor: too slow?
– Hardware co-processors: (manual) design effort, synchronization challenge?
– Hardware-accelerated instructions in ASIP instruction set: well supported by tools, potential for resource sharing!
• CRC
• WLSB encoder
• Scaled / Timer-Based RTP Timestamp Compression
• …
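Of the functions listed, scaled RTP timestamp compression is the simplest to illustrate. The sketch below assumes the relation TS = TS_SCALED * TS_STRIDE + TS_OFFSET from the ROHC specification (RFC 3095); the type and function names are hypothetical, and the timer-based variant is not covered.

```c
#include <stdint.h>

/* Scaled timestamp and its residue, for a given stride. */
typedef struct {
    uint32_t scaled;  /* TS_SCALED = TS / TS_STRIDE   */
    uint32_t offset;  /* TS_OFFSET = TS mod TS_STRIDE */
} ts_parts;

/* Split a timestamp; `stride` must be non-zero. */
static ts_parts ts_scale(uint32_t ts, uint32_t stride)
{
    ts_parts p;
    p.scaled = ts / stride;
    p.offset = ts % stride;
    return p;
}

/* Inverse operation, as a decompressor would apply it. */
static uint32_t ts_unscale(ts_parts p, uint32_t stride)
{
    return p.scaled * stride + p.offset;
}
```

For a stream whose timestamp advances by a fixed stride per packet (for example 160 ticks for 20 ms audio frames at 8 kHz), the offset stays constant and the scaled value increments by one, which leaves far fewer bits to transmit.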
Accelerated Data Processing
WLSB Encoder: SW Implementation, Tmicro (gen.-purp. CPU)
• nML: general-purpose ALU: add, sub, shift, mask…
• C: software implementation of WLSB encoder: for-loop with called function
• Machine code:
– 30 instructions for called function
– 6-packet test program: 2110 cycles
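The C listing from this slide is not in the transcript. The sketch below is an independent illustration of a software WLSB encoder with the same shape (a for-loop over the reference window that calls a helper), following the interpretation-interval definition in RFC 3095 with a fixed offset p. All names are hypothetical, and values are assumed to be 16 bits wide.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal k such that v lies in [v_ref - p, v_ref - p + 2^k - 1],
 * with 16-bit wraparound. Returns 0..16. */
static unsigned wlsb_g(uint16_t v_ref, uint16_t v, uint16_t p)
{
    uint16_t d = (uint16_t)(v - v_ref + p);  /* distance from interval start */
    unsigned k = 0;
    while (k < 16 && d > (uint16_t)((1u << k) - 1u))
        k++;
    return k;
}

/* Number of LSBs needed so that every reference in the window decodes v. */
static unsigned wlsb_k(const uint16_t *window, size_t n, uint16_t v, uint16_t p)
{
    unsigned k = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned ki = wlsb_g(window[i], v, p);
        if (ki > k)
            k = ki;
    }
    return k;
}

/* The k least significant bits of v that would be transmitted. */
static uint16_t wlsb_lsbs(uint16_t v, unsigned k)
{
    return k >= 16 ? v : (uint16_t)(v & ((1u << k) - 1u));
}
```

For a window of {100, 101, 102}, a new value of 103 and p = 0, the encoder needs k = 2 bits and sends the LSBs 3. The inner while-loop and the per-reference call are the kind of work that costs many cycles on a general-purpose ALU.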
Accelerated Data Processing
WLSB Encoder: HW-Accelerated Instruction, Tnano (optimized ASIP)
• nML (ISA view): WLSB encoder instruction, calling hardware primitive
• C: intrinsic function call to WLSB encoder instruction
• Machine code:
– Called function replaced by single instruction
– 6-packet test program: 267 cycles (7.9x speedup)
• nML (behavioral view):
– WLSB hardware primitive in bit-accurate C code
– Auto-translated to RTL
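The primitive itself is not reproduced in this transcript. As an assumption-laden sketch of what a loop-free, hardware-friendly formulation could look like, the code below computes the bit count with a fixed four-step search (the software analogue of a priority encoder) and uses only the window minimum and maximum, as in the W-LSB description in RFC 3095. Names, the 16-bit width, and the fixed offset p are all illustrative choices.

```c
#include <stdint.h>

/* Bit length of d: 0 for d == 0, 16 for d >= 0x8000.
 * Fixed number of steps, no data-dependent loop. */
static unsigned bitlen16(uint16_t d)
{
    unsigned x = d;
    unsigned k = 0;
    if (x & 0xFF00u) { k += 8; x >>= 8; }
    if (x & 0x00F0u) { k += 4; x >>= 4; }
    if (x & 0x000Cu) { k += 2; x >>= 2; }
    if (x & 0x0002u) { k += 1; x >>= 1; }
    return k + (x & 1u);
}

/* Hypothetical primitive: minimal number of LSBs for value v, given the
 * smallest and largest reference values in the window and a fixed offset p. */
static unsigned wlsb_k_prim(uint16_t v_min, uint16_t v_max, uint16_t v, uint16_t p)
{
    unsigned k_lo = bitlen16((uint16_t)(v - v_min + p));
    unsigned k_hi = bitlen16((uint16_t)(v - v_max + p));
    return k_lo > k_hi ? k_lo : k_hi;
}
```

Called from application code, a function of this shape plays the role of the intrinsic: on the ASIP it would map to one instruction, while in plain C it remains an ordinary function that can be tested on a host.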
Accelerated Data Processing
Results: Adding HW-Accelerated Instructions

                                          Tmicro CPU      Tnano ASIP      Tnano ASIP w/ WLSB instr
WLSB 6-packet test program code size      134 x 16-bit    126 x 16-bit    84 x 16-bit (-33%)
WLSB 6-packet test program cycle count    2122            2110            267 (-87%)
Clock frequency (28nm HPM)                800 MHz         1 GHz           1 GHz (0%)
Gate count (core only, 28nm HPM)          14K gates       5.4K gates      6.3K gates (+16%)
Conclusions
• Application-Specific Processors (ASIPs)
– Enable acceleration of control and data processing, similar to fixed-function hardware
– Retain the flexibility of a software-programmable processor
• ASIP Designer enables rapid ASIP design
– Architectural exploration: Compiler-in-the-Loop
– SDK generation
– RTL generation
• Benefits illustrated with the Robust Header Compression (ROHC) case study