
Traffic Separation

Challenges of Embedded Vision

Contributions

High-Performance Power-Efficient Solutions for Embedded Vision Computing

PhD dissertation of Hamed Tabkhi; PhD adviser: Prof. Gunar Schirner

Department of Electrical and Computer Engineering,

Northeastern University, Boston (MA), USA

{tabkhi, schirner}@ece.neu.edu

C) Communication-Centric Arch. Template

A) Streaming vs Algorithm-Intrinsic

Function-Level Processor

Insight: Not all traffic is equal!

Streaming: input/output stream (independent of algorithm selection)

Algorithm-intrinsic: generated by the algorithm itself (algorithm dependent)

F) Experimental Results

A) Flexibility/Efficiency

A) Embedded Vision

Application areas:
- Advanced Driver Assistance Systems (ADAS)
- Security / video surveillance
- Robotics

Rapidly growing market:
- ADAS alone: 13x over 5 years
- 2011: $10B -> 2016: $130B

B) Market Requirements
- Complex, adaptive advanced algorithms
- Diversity of scenes (e.g. indoor, outdoor)
- High resolution (1080p) and rate (60 fps)
- Significant computation (~50 GOPS)
- Huge bandwidth (~10 GB/s)
- Very low power (~1 Watt)

E) Current Approaches
- HW solutions exist for filters; mid-processing remains stuck in SW
- SW is flexible, but inefficient
- SW cannot handle adaptive algorithms: inefficient execution, cannot handle the heavy traffic, low resolution / quality

Problems:
(1) How to realize an individual adaptive vision algorithm?
(2) How to construct a single larger vision flow on the platform?
(3) How to support many vision flows on the same platform?

Contributions:
1) Traffic Separation (addresses problems 1 & 2)
- Manages traffic of adaptive algorithms
- Simplifies chaining of vision algorithms
2) Function-Level Processor (addresses problem 3)
- Offers function-level flexibility with efficiency close to custom HW

C) Coarse-Grained Vision Pipeline
- Pre-Processing (vision filters): high but regular compute, limited traffic
- Mid-Processing (adaptive): high compute, high traffic
- Post-Processing (intelligent / control): limited compute / traffic

[Figure: coarse-grained pipeline (Pre-Processing -> Mid-Processing -> Post-Processing) and communication-centric architecture template — the adaptive vision algorithm and precision adjustment sit in the computation clock domain; read/write DMAs, input/output interfaces, and the operational stream interconnect to system memory sit in the communication clock domain; a control unit (CU) and asynchronous FIFOs bridge the two domains.]

Architecture support for traffic separation:
- Streaming clock domain (computation): algorithm execution, autonomous quality adjustment
- Operational clock domain (communication): dedicated DMAs, stream access to memory
- Asynchronous FIFOs bridging the clock domains
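The clock-domain split can be sketched in software with two threads and bounded queues standing in for the asynchronous FIFOs; the kernel, queue depths, and names are illustrative, not the actual RTL structure.

```python
# Model of the two clock domains: the main thread plays the communication
# domain (feeding/draining pixels through "async FIFOs"), a worker thread
# plays the computation domain running a placeholder vision kernel.
import queue
import threading

SENTINEL = None  # end-of-stream marker

def computation(in_fifo, out_fifo):
    while True:
        pixel = in_fifo.get()          # pop from the input async FIFO
        if pixel is SENTINEL:
            out_fifo.put(SENTINEL)
            return
        out_fifo.put(pixel // 2)       # placeholder per-pixel kernel

in_fifo = queue.Queue(maxsize=8)       # bounded, like a hardware FIFO
out_fifo = queue.Queue(maxsize=8)
t = threading.Thread(target=computation, args=(in_fifo, out_fifo))
t.start()

for p in [10, 20, 30]:                 # stream pixels in
    in_fifo.put(p)
in_fifo.put(SENTINEL)

results = []
while (r := out_fifo.get()) is not SENTINEL:
    results.append(r)                  # stream results out
t.join()
# results == [5, 10, 15]
```

The bounded queues also capture the back-pressure an async FIFO provides: either domain stalls when its FIFO is full or empty.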

D) System-Level Benefits

E) Vision SoC Solution on Zynq

[Figure: Vision SoC on the Zynq programmable logic — a PixelStream component (Smoothing, MoG background subtraction with Gaussian parameters and FG mask, Morphology via erosion/dilation) and an ObjectStream component (object detection via component labeling, mean-shift object tracking producing object histograms and new positions); HDMI input/output with packing/unpacking units and async FIFOs bridge the HDMI and AXI clock domains; AXI Video DMAs 0/1 with read/write channels on AXI buses 0/1 connect through the AXI interconnect to the memory controller and processor subsystem; video overlay on the HDMI output.]

[Figure: a vision algorithm with streaming input/output frames and an algorithm-intrinsic scene model; example frames show the original scene and the extracted foreground (FG).]

[Charts: stacked 0–100% traffic breakdowns with legend entries Communication, Computation, and Stream Pixels; segment values include 18%, 19%, 67% in the first chart and 16%, 28%, 4%, 24%, 26% plus a static share in the second.]

[Figure: system-level benefits — vision algorithms 0–2 chained between SystemIn and SystemOut with precision adjustment and async FIFOs; DMAs move AXI data in the AXI clock domain; host processor, system memory, and FLP-PVP with cache and DMA configuration.]

B) Programming Abstraction

C) Function-Set Architecture

[Figure: mapping example — a low-pass filter (convolution) and color/illumination extraction, each with a DMA, mapped onto multiple ILP-BFDSP cores with caches.]

E) System-Level Integration

[Charts: experimental results comparing ILP, ILP+ACC, and FLP — operations [GOPs] (0–24), number of ILP cores (0–12), off-chip traffic [GB/s] (0–1.6), and power [W] (0–3), each split into communication and computation.]

D) Adaptive Vision Algorithms

Complex scene analysis:
- Track multiple objects

Machine-learning principles:
- Keep a model of the scene
- e.g. MoG background subtraction, optical flow, SVM

Observation: Not all traffic is equal!
- Algorithm-intrinsic traffic dominates: 60x in MoG, 20x in optical flow
- Streaming traffic is fixed; algorithm-intrinsic traffic is adjustable

Observed traffic separation in:
- Mixture of Gaussians (MoG)
- Kanade-Lucas-Tomasi (KLT) optical flow
- Component labeling
- Mean-shift object tracker
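As a concrete illustration of an adaptive algorithm and its algorithm-intrinsic state, here is a minimal per-pixel Mixture-of-Gaussians update. The constants, the foreground rule, and the three-Gaussian mixture are simplified assumptions for the sketch, not the dissertation's exact formulation; note that the per-pixel parameters (weight, mean, variance per Gaussian) are read and written every frame, which is exactly the algorithm-intrinsic traffic that dominates.

```python
# One-pixel MoG update: the pixel is *streaming* data (one value in, one
# foreground bit out); the Gaussian parameters are *algorithm-intrinsic*
# state carried across frames. Constants are illustrative.
ALPHA = 0.05          # learning rate
MATCH_SIGMA = 2.5     # match threshold in standard deviations

def mog_update(pixel, gaussians):
    """Update one pixel's mixture; return (is_foreground, new_gaussians).

    gaussians: list of (weight, mean, var) tuples, dominant first.
    """
    matched = None
    for i, (w, mean, var) in enumerate(gaussians):
        if (pixel - mean) ** 2 <= (MATCH_SIGMA ** 2) * var:
            matched = i
            break
    if matched is None:
        # No Gaussian explains the pixel: replace the least-probable one.
        gaussians[-1] = (0.05, float(pixel), 100.0)
        is_fg = True
    else:
        w, mean, var = gaussians[matched]
        mean += ALPHA * (pixel - mean)
        var += ALPHA * ((pixel - mean) ** 2 - var)
        gaussians[matched] = (w + ALPHA * (1.0 - w), mean, max(var, 1.0))
        # Foreground if the matched Gaussian carries little weight.
        is_fg = gaussians[matched][0] < 0.2
    # Renormalize weights and keep the dominant Gaussian first.
    total = sum(w for w, _, _ in gaussians)
    gaussians = sorted(((w / total, m, v) for w, m, v in gaussians),
                       reverse=True)
    return is_fg, gaussians
```

Feeding a stable pixel keeps it background; an outlier pixel is flagged foreground and seeds a new low-weight Gaussian.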

B) Optimization: Compression for Algorithm-Intrinsic Data

[Figure: precision adjustment — only the N most significant bits (MSBs) of each 32-bit word are kept for storage; on read-back, the word is restored to 32 bits by zero-filling the discarded LSBs.]

[Figure: MoG background subtraction wrapped by precision adjustment — ParametersIn expanded from N-bit to 32-bit on read, ParametersOut compressed from 32-bit to N-bit on write; PixelIn and PixelOut stream through unmodified.]

Precision adjustment on algorithm-intrinsic data accesses:
- Bandwidth/quality trade-off with a Pareto front (blue line)
- Quality evaluated with MS-SSIM
- Significant bandwidth reduction in MoG: simple scene 63%, medium scene 59%, complex scene 56%
- Same trade-off observed for optical flow
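The N-bit MSB truncation behind these numbers can be sketched as a pair of shifts; the function names and the choice N=12 are illustrative (storing 12 of 32 bits saves 62.5% of the parameter bandwidth, in the range of the reductions reported above).

```python
# Precision-adjustment sketch: keep the N MSBs of a 32-bit word on write,
# zero-fill the discarded LSBs on read-back. N trades bandwidth vs quality.

def compress(word32: int, n: int) -> int:
    """Keep the N most significant bits of a 32-bit word (right-aligned)."""
    return (word32 & 0xFFFFFFFF) >> (32 - n)

def expand(word_n: int, n: int) -> int:
    """Restore to 32 bits by shifting the N MSBs up and zero-filling LSBs."""
    return (word_n << (32 - n)) & 0xFFFFFFFF

# Example with N=12: the 20 LSBs of a 32-bit parameter are discarded.
x = 0xDEADBEEF
y = expand(compress(x, 12), 12)
# y == 0xDEA00000 — the top 12 bits of x survive the round trip
```

The round trip is lossy by design; sweeping N traces the bandwidth/quality Pareto front described above.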

[Chart: memory bandwidth [GB/s] (0–8) for complex, medium, and simple scenes — original vs tuned parameters.]

Pipeline construction of multiple vision algorithms:
- Streaming data: point-to-point connections, hidden from memory
- Algorithm-intrinsic data: routed to the communication interface, with dedicated precision adjustment
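The point-to-point streaming connections can be modeled as chained iterators: samples flow stage to stage without ever landing in a shared buffer, mirroring how streaming traffic stays off system memory. The stage kernels below are toy stand-ins, not the poster's actual 2D filters.

```python
# Chained vision stages as generators: streaming data is passed
# point-to-point; each stage keeps its own (algorithm-intrinsic) state.

def smoothing(pixels):
    prev = 0                       # per-stage state, local to the stage
    for p in pixels:
        yield (p + prev) // 2      # toy 1D low-pass stand-in for a 5x5 CNV
        prev = p

def threshold(pixels, t=128):
    for p in pixels:
        yield 1 if p > t else 0    # stand-in for the MoG foreground decision

def pipeline(pixels):
    # Point-to-point chaining: smoothing's output feeds threshold directly.
    return threshold(smoothing(pixels))

fg_mask = list(pipeline([100, 200, 255, 50]))
# fg_mask == [0, 1, 1, 1]
```

Because the stages are lazy, adding or reordering algorithms in the flow is just re-wiring the chain, which is the chaining simplification the traffic-separation contribution targets.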

[Figure: object-tracking vision flow — HDMI in -> Smoothing (CNV) -> MoG -> Morphology in HW, Component Labeling and Histogram Checking in SW, Video Overlay -> HDMI out in HW.]

Object-tracking vision flow:
- Smoothing: 1x CNV on 8-bit data, 5x5 window
- Mixture of Gaussians
- Morphology (dilation, erosion, erosion): 3x CNVs on 1-bit data, 15x15 window
- Component labeling
- Histogram checking
- Video overlay

Implementation results:
- 1080p at 30 Hz, or 768p at 60 Hz
- Limitation: on-chip memory
- Great performance / power efficiency: 40 GOPs in 1.7 Watt
- 30x faster than SW-only execution on a desktop machine

Instruction-Level Processors (ILPs): high flexibility, low efficiency

Custom HW Accelerators (HWACCs): low flexibility, high efficiency

[Chart: efficiency [GOPs/Watt] vs flexibility (application, instruction, function granularity) — HWACCs sit at application-specific flexibility with high efficiency; ILPs (control processors, DSPs, GPUs) at instruction-level flexibility with low efficiency; the FLP targets function-level flexibility with near-HWACC efficiency.]

Insight: Mismatch in granularity!
- Programming granularity: how to compose a program?
- Architecture granularity: how to execute a program?

[Figure: abstraction levels — instruction-level primitives (Add, Sub, For) vs function-level primitives (Filter, CNV, Sort); the compiler bridges programming and architecture, and a function-level architecture matches the function-level abstraction.]

Function-Level Processor:
- Matches abstractions at function-level granularity
- Architecture for function-level programming
- Increases efficiency
- Maintains flexibility
- Simplifies application composition

[Figure: the applications of a market (domain) each use an overlapping subset of functions A–K; the union of these subsets forms the domain's function set.]

Streaming applications are composed of functions:
- e.g. OpenCV, OpenSDR

Requirements:
- Compute inside the FLP as much as possible
- Identify common functionality and composition rules
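Deriving a domain's function set from its applications can be sketched with plain set operations; the application and function names below are illustrative, not the dissertation's actual benchmark set.

```python
# Sketch: a market's function set is the union of the functions its
# applications use; widely shared functions are prime FLP candidates.

apps = {
    "background_subtraction": {"CNV", "MoG", "Threshold"},
    "object_tracking":        {"CNV", "MoG", "Labeling", "MeanShift"},
    "optical_flow":           {"CNV", "Gradient", "Solve2x2"},
}

# Union: everything the FLP's function blocks must cover for this market.
function_set = set().union(*apps.values())

# Intersection: functionality common to every application.
common = set.intersection(*apps.values())
# common == {"CNV"}
```

The same bookkeeping extends naturally to composition rules, e.g. recording which function pairs appear adjacent in application pipelines.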

D) FLP Architecture

FLP components:
- Optimized Function Blocks (FBs)
- MUX-based interconnect
- Separation of data traffic
- Autonomous control / synchronization
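The MUX-based interconnect can be modeled as a routing table: each function block's input MUX selects one upstream producer, so "programming" the FLP amounts to writing that table. Block names and the per-sample kernels below are illustrative.

```python
# Sketch of a MUX-based interconnect: routing maps each function block to
# the producer its input MUX selects ('IN' is the system input interface).

def run_flp(routing, blocks, stream_in):
    """routing: {block: source}, listed in topological order.
    blocks: {block: per-sample function}. Returns all block outputs."""
    outputs = {"IN": stream_in}
    for name, src in routing.items():
        # The MUX selects 'src'; the block processes that stream.
        outputs[name] = [blocks[name](x) for x in outputs[src]]
    return outputs

blocks = {"CNV": lambda x: x + 1, "Threshold": lambda x: int(x > 2)}
routing = {"CNV": "IN", "Threshold": "CNV"}   # IN -> CNV -> Threshold
out = run_flp(routing, blocks, [1, 2, 3])["Threshold"]
# out == [0, 1, 1]
```

Re-targeting the FLP to another application of the market changes only `routing`, not the function blocks, which is the flexibility/efficiency point of the architecture.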

[Figure: FLP architecture — function blocks (Function0..FunctionN, including an arithmetic unit) connected through a MUX-based interconnect with forward and backward MUXes; input and output encoders/formatters at the system interfaces; parameter buffers/caches and an operational (algorithm-intrinsic) buffer/cache, each served by dedicated DMAs; the FLP streaming-pipe controller sequences the blocks; streaming data and operational data travel on separate paths.]

Selected Publications
- H. Tabkhi, M. Sabbagh, and G. Schirner, "Power-efficient real-time solution for adaptive vision algorithms," IET Computers & Digital Techniques, vol. 9, no. 1, pp. 16–26, 2015.
- H. Tabkhi, R. Bushey, and G. Schirner, "Algorithm and architecture co-design of Mixture of Gaussian (MoG) background subtraction for embedded vision," in IEEE 47th Asilomar Conference on Signals, Systems and Computers, Nov 2013, pp. 1815–1820.
- H. Tabkhi, R. Bushey, and G. Schirner, "Function-level processor (FLP): A high performance, minimal bandwidth, low power architecture for market-oriented MPSoCs," IEEE Embedded Systems Letters, vol. 6, no. 4, pp. 65–68, Dec 2014.
- H. Tabkhi, R. Bushey, and G. Schirner, "Function-level processor (FLP): Raising efficiency by operating at function granularity for market-oriented MPSoC," in IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP), June 2014, pp. 121–130.

[Figure: FLP system integration — the FLP (function blocks FB0..FBN with MUX interconnect, DMAs, and local scratch-pad memories) and an ILP with its control unit share a streaming communication fabric, shared memory, DMA, control bus, and system I/O; an interrupt line connects to the interrupt controller.]

FLP pairs with ILP cores to create complete control and analytic processing:
- FLP for pre-/mid-processing
- ILPs for post-processing (control and intelligence)

Pipeline Vision Processor (PVP):
- The FLP is a generalization of the PVP
- Result of joint work with the PVP's chief architect

Results on 10 selected vision applications:
- Computation: FLP-PVP delivers <= 22.5 GOPs; matching it requires 2 ILP cores with accelerators (ILP+ACC) or 7 ILP cores alone
- Off-chip communication: FLP generates 5x less traffic than ILP and 3x less than ILP+ACC
- Power: FLP consumes 18x less than ILP and 5x less than ILP+ACC

FLP principles:
- Target stream-processing applications
- Compute contiguously inside the FLP
- Limited ILP interaction
