
© 2016 Synopsys, Inc. All rights reserved.

May 9, 2016

Case study: Performance-efficient Implementation of Robust Header Compression (ROHC) using an Application-Specific Processor

Gert Goossens, Patrick Verbist, Erik Brockmeyer, Luc De Coster

Synopsys


Agenda

1. Robust Header Compression (ROHC) in network processing

2. Application-Specific Processor (ASIP) methodology

3. Accelerating control processing in ROHC

4. Accelerating data processing in ROHC

5. Conclusions


ROHC in Network Processing

ROHC compressor

• 1.2 Mpackets/s

• 600 MHz clock → 500 cycles/packet

− Header Parser: ~100 cycles/packet

− Encoder+Context+CRC: ~400 cycles/packet

• Optimize for worst-case control path
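The per-packet budget above follows from dividing the clock rate by the packet rate. A minimal sketch of that arithmetic in C; the function names `cycles_per_packet` and `budget_fits` are hypothetical and chosen only for illustration:

```c
#include <stdint.h>

/* Cycles available per packet for a given clock and packet rate. */
static uint32_t cycles_per_packet(uint32_t clock_hz, uint32_t packets_per_s)
{
    return clock_hz / packets_per_s;
}

/* Returns 1 if the worst-case cycle counts of the two stages fit the budget. */
static int budget_fits(uint32_t budget, uint32_t parser_cycles,
                       uint32_t encoder_cycles)
{
    return parser_cycles + encoder_cycles <= budget;
}
```

With the slide's numbers, `cycles_per_packet(600000000u, 1200000u)` yields 500, and the stage budgets of about 100 and 400 cycles fill it exactly. Because the budget is a hard per-packet limit, it is the worst-case control path that has to fit, not the average.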

High-performance streaming data (IP/UDP/RTP protocol)

Uncompressed packet: IP header (20-40 bytes) | UDP header (8 bytes) | RTP header (12 bytes) | Payload (video/audio…)

Compressed packet: ROHC header | Payload (video/audio…)

Link: ROHC Compressor → radio or cable link → ROHC Decompressor

[Figure: ROHC compressor block diagram with blocks Header Parser, Header Field Encoder, Packet Modification, Buffer, Feedback Buffer, Context Processor, CRC, Context Mem]


ROHC Implementation

[Figure: ROHC compressor block diagram (Header Parser, Header Field Encoder, Packet Modification, Buffer, Feedback Buffer, Context Processor, CRC, Context Mem), with blocks marked according to the two categories below]

• Blocks requiring efficient control flow: tiny microprocessor with efficient branching and logic operations

• Blocks requiring efficient control flow and data processing: tiny microprocessor with hardware-accelerated instructions

ASIP technology enables the design of such processors




ASIPs in SoC Design: ASIP Architectural Optimization Space

Parallelism

• Instruction-level parallelism: orthogonal instruction set (VLIW), encoded instruction set

• Data-level parallelism: vector processing (SIMD)

• Task-level parallelism: multi-core, multi-threading

Specialization

• Application-specific data types: integer, fractional, floating-point, bits, complex, vector…

• Application-specific instructions

– Application-specific data processing: any exotic operator; single or multi-cycle

– Application-specific memory addressing: direct, indirect, post-modification, indexed, stack indirect…

– Application-specific control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication; relative or absolute, address range, delay slots

• Connectivity & storage matching application’s data-flow: distributed registers, sub-ranges; multiple memories, sub-ranges

Pipeline: pipeline depth; hazards (HW/SW stall, bypass)

Architecture range: microprocessor, extensible processor, application-specific µP / DSP, programmable datapath, hardwired datapath


“ASIP Designer” Tool-Suite




Accelerated Control Processing

Customization of a 16-bit CPU: “Strip Down & Beef Up”

• Architectural exploration with ASIP Designer

• Starting point: “Tmicro” CPU

– 16-bit general-purpose CPU (already leaner than 32-bit)

– Variable-length instructions: arithmetic (16), move (16, 32), load/store (16, 32), control (16, 32, 48)

• End point: “Tnano” ASIP

– 16-bit stripped CPU

– Fixed-length instructions: arithmetic, move, load/store, control (16)

– No multi-word decoding overhead

– Improved clock frequency

– Add compact control instructions to accelerate ROHC code

   – Predicated execution (Selection)

   – Field extraction (Masking)

   – Shortcut logic instructions
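Field extraction (masking) is the kind of operation a header parser repeats for every packet. The sketch below shows the shift-and-mask sequence that a general-purpose CPU needs several instructions for, and that a dedicated field-extraction instruction could fold into one. The function name `extract_field` and its parameters are assumptions made for illustration, not taken from the Tnano instruction set:

```c
#include <stdint.h>

/* Extract `width` bits (1..16) starting at bit `lsb` from a 16-bit word. */
static uint16_t extract_field(uint16_t word, unsigned lsb, unsigned width)
{
    uint32_t mask = ((uint32_t)1 << width) - 1u;  /* `width` low bits set */
    return (uint16_t)((word >> lsb) & mask);
}
```

For example, the first 16-bit word of a typical IPv4 header is 0x4500: `extract_field(0x4500, 12, 4)` gives the version (4) and `extract_field(0x4500, 8, 4)` gives the header length in 32-bit words (5).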


Accelerated Control Processing: Control Path Balancing

• Example: control-flow graph of Header Parser

[Figure: control-flow graph with the longest and shortest control paths marked]

• Improve control path balancing by

– C source code refactoring

– User control over code hoisting

– Predicated execution in tail of long control paths


Accelerated Control Processing: If-Else, No Predication

Tmicro (gen.-purp. CPU)

nML: Conditional jump instruction, 2-cycle branch penalty

C: Condition at tail of long control path

Machine code: Conditional jump with branch penalty: one of two delay slots filled, one ‘nop’ left


Accelerated Control Processing: Predication

Tnano (optimized ASIP)

nML: Select instruction

C: Condition at tail of long control path

Machine code:

• Conditional code executes always

• Result is used selectively → no branch penalty

nML: Predication Threshold
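The slide's code panels are not part of this transcript, so the following is only a sketch of the idea in portable C. `pick_branch` is the if-else form, which a CPU without a select instruction implements with a conditional jump. `pick_select` computes both results unconditionally and then keeps one, which is the pattern a select instruction implements in hardware. Both function names and the arithmetic in the two arms are hypothetical:

```c
#include <stdint.h>

/* If-else form: maps to a conditional jump (with its branch penalty). */
static uint16_t pick_branch(uint16_t a, uint16_t b, int cond)
{
    uint16_t r;
    if (cond)
        r = (uint16_t)(a + 1u);
    else
        r = (uint16_t)(b << 1);
    return r;
}

/* Predicated form: both arms always execute, the result is chosen afterwards. */
static uint16_t pick_select(uint16_t a, uint16_t b, int cond)
{
    uint16_t if_true  = (uint16_t)(a + 1u);
    uint16_t if_false = (uint16_t)(b << 1);
    /* mask is 0xFFFF when cond is non-zero, 0x0000 otherwise */
    uint16_t mask = (uint16_t)(0u - (unsigned)(cond != 0));
    return (uint16_t)((if_true & mask) | (if_false & (uint16_t)~mask));
}
```

The two functions return the same value for every input. The predicated form does more work on the path not taken, which is why it pays off mainly for short conditional code at the tail of a long control path.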


Accelerated Control Processing: If-Else with Multiple Tests

Tmicro (gen.-purp. CPU)

nML: Stand-alone compare instruction

C: “If-else” with multiple tests

Machine code: Multiple compare and c-jump instructions → slow in worst case


Accelerated Control Processing: If-Else with Multiple Tests

Tnano (optimized ASIP)

nML: “Compare + shortcut-logic” instruction

CND &= Rj==Ri

CND |= Rj!=Ri

C: “If-else” with multiple tests

Machine code:

• Multiple “compare + shortcut-logic”

• Single c-jump

Worst case is always faster!
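The effect of the “compare + shortcut-logic” instructions can be mimicked in C by accumulating the condition in a flag and branching once. This is a sketch under assumptions: the function names and the particular test `(a == b && c == d) || e != f` are invented for illustration, and the slide's own C and machine code are not in the transcript.

```c
#include <stdint.h>

/* Short-circuit form: typically one compare and one conditional jump per test. */
static int test_branchy(uint16_t a, uint16_t b, uint16_t c,
                        uint16_t d, uint16_t e, uint16_t f)
{
    if ((a == b && c == d) || e != f)
        return 1;
    return 0;
}

/* Accumulated form: mirrors CND &= Rj==Ri and CND |= Rj!=Ri, one jump at the end. */
static int test_accumulated(uint16_t a, uint16_t b, uint16_t c,
                            uint16_t d, uint16_t e, uint16_t f)
{
    int cnd = (a == b);
    cnd &= (c == d);   /* compare, then AND into the condition flag */
    cnd |= (e != f);   /* compare, then OR into the condition flag */
    if (cnd)
        return 1;
    return 0;
}
```

The two forms agree because the operands have no side effects. The accumulated form always evaluates every comparison, so its cost is the same on every path, and only one conditional jump remains.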


Accelerated Control Processing: Results – Header Parser

                                      Tmicro CPU     Tnano ASIP
Rohc_parse program code size          347 x 16-bit   227 x 16-bit (-35%)
Rohc_parse cycle count per packet     191            87 (-55%)
Clock frequency (28nm HPM)            800 MHz        1 GHz (+25%)
Gate count (core only, 28nm HPM)      14K gates      5.4K gates (-61%)




Accelerated Data Processing

• Implementation styles

– Software on processor: too slow?

– Hardware co-processors: (manual) design effort, synchronization challenge?

– Hardware-accelerated instructions in ASIP instruction set: well supported by tools, potential for resource sharing!

[Figure: ROHC compressor block diagram (Header Parser, Header Field Encoder, Packet Modification, Buffer, Feedback Buffer, Context Processor, CRC, Context Mem), annotated with data-processing functions: CRC, WLSB encoder, Scaled / Timer-Based RTP Timestamp Compression, …]


Accelerated Data Processing: WLSB Encoder: SW Implementation

Tmicro (gen.-purp. CPU)

nML: General-purpose ALU: add, sub, shift, mask…

C: Software implementation of WLSB encoder: for-loop with called function

Machine code:

• 30 instructions for called function

• 6-packet test program: 2110 cycles
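The deck's C source is not included in the transcript. As a rough idea of what a software W-LSB encoder looks like, the sketch below follows the interval definition in RFC 3095: a value v can be sent with k low-order bits if it lies in [v_ref - p, v_ref - p + 2^k - 1] for every reference value in the sliding window. The for-loop with a called function mirrors the structure named on the slide, but all function names, the 16-bit field width, and the parameter layout are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest k such that v lies in [v_ref - p, v_ref - p + 2^k - 1] (mod 2^16). */
static unsigned lsb_bits_needed(uint16_t v, uint16_t v_ref, uint16_t p)
{
    uint16_t offset = (uint16_t)(v - v_ref + p);  /* distance from interval start */
    unsigned k = 0;
    while (k < 16 && offset >= ((uint32_t)1 << k))
        k++;
    return k;
}

/* W-LSB: take the largest k over all reference values in the window. */
static unsigned wlsb_bits_needed(uint16_t v, const uint16_t *window,
                                 size_t n, uint16_t p)
{
    unsigned k = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned ki = lsb_bits_needed(v, window[i], p);
        if (ki > k)
            k = ki;
    }
    return k;
}

/* The k least-significant bits of v that would be transmitted. */
static uint16_t wlsb_encode(uint16_t v, unsigned k)
{
    return (uint16_t)(v & (((uint32_t)1 << k) - 1u));
}
```

With a window of {100, 103}, p = 0 and v = 105, `wlsb_bits_needed` returns 3 and `wlsb_encode(105, 3)` returns 1. The inner function is a chain of subtract, shift, and compare steps, which is the sort of code that costs many general-purpose instructions per call.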


Accelerated Data Processing: WLSB Encoder: HW-Accelerated Instruction

Tnano (optimized ASIP)

nML (ISA view): WLSB encoder instruction, calling hardware primitive

C: Intrinsic function call to WLSB encoder instruction

Machine code:

• Called function replaced by single instruction

• 6-packet test program: 267 cycles (7.9x speedup)

nML (behavioral view):

• WLSB hardware primitive in bit-accurate C code

• Auto-translated to RTL


Accelerated Data Processing: Results: Adding HW-Accelerated Instructions

                                          Tmicro CPU     Tnano ASIP     Tnano ASIP w/ WLSB instr
WLSB 6-packet test program code size      134 x 16-bit   126 x 16-bit   84 x 16-bit (-33%)
WLSB 6-packet test program cycle count    2122           2110           267 (-87%)
Clock frequency (28nm HPM)                800 MHz        1 GHz          1 GHz (0%)
Gate count (core only, 28nm HPM)          14K gates      5.4K gates     6.3K gates (+16%)
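The percentages in the last column compare the Tnano ASIP with and without the WLSB instruction. A small sketch for re-deriving the speedups from the table, including the effect of the clock frequency; the helper names are hypothetical:

```c
#include <stdint.h>

/* Execution time in nanoseconds for a cycle count at a clock given in MHz. */
static double exec_time_ns(uint32_t cycles, uint32_t clock_mhz)
{
    return (double)cycles * 1000.0 / (double)clock_mhz;
}

/* Speedup of `improved` over `baseline` (both in the same unit). */
static double speedup(double baseline, double improved)
{
    return baseline / improved;
}
```

In cycles, `speedup(2110, 267)` is about 7.9, matching the figure quoted for the hardware-accelerated instruction. In wall-clock time against the starting point, Tmicro needs `exec_time_ns(2122, 800)` = 2652.5 ns and the extended Tnano needs `exec_time_ns(267, 1000)` = 267 ns, a ratio of roughly 9.9, for a core that is still less than half the gate count of Tmicro (6.3K versus 14K gates).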




Conclusions

• Application-Specific Processors (ASIPs)

– Enable acceleration of control and data processing, similar to fixed-function hardware

– Retain the flexibility of a software-programmable processor

• ASIP Designer enables quick ASIP design

– Architectural exploration: Compiler-in-the-Loop

– SDK generation

– RTL generation

• Benefits illustrated with the Robust Header Compression (ROHC) case study