lec wolf slides
Post on 29-May-2018
-
8/9/2019 Lec Wolf Slides
1/55
Smart Cameras as High-Performance Embedded Systems
Wayne Wolf, Tiehan Lv,
Burak Ozer, Jason Fritts
-
Outline
Problem and methodology.
Ozer, Lv, Wolf: smart camera system.
Ozer, Lv, Wolf: Optimizing the smart camera software.
Wolf, Lv, Ozer: Architectures for smart cameras.
Xu, Wolf: Wave pipelining for NoCs.
-
What is video processing?
Initial steps operate on pixels, are dominated by data.
Later steps operate on other types of data:
Smaller data volumes.
Wider variety of data types.
More variation in control flow, run time.
-
Multimedia requirements
Complex algorithms:
multiple phases;
data and control.
Today's applications: compression.
Tomorrow's applications: analysis.
-
The multimedia processing funnel
[Figure: funnel with axes "Data volume" and "Data abstraction"; stages labeled pixel processing, edge extraction, principal component analysis, hidden Markov models.]
-
Questions
Measurement:
What do we measure?
On what implementation do we measure it?
How accurate do our measurements have to be?
Architecture:
What uniprocessor architecture is best?
Do we need a multiprocessor?
How do we balance programmability with other goals?
-
The Parapet Project
Goal: design SoC networks for real-time distributed vision.
The best way to get a good design example is to create our own.
Video is a high-performance, low-power, cost-sensitive application.
Vision is an important problem.
-
Parapet goals
Algorithms (gesture recognition).
How do we adapt algorithms to the needs of real-time embedded video?
Distributed systems.
Communicating cameras.
Embedded software.
Middleware, code optimization.
SoC architecture.
Heterogeneous multiprocessors.
-
Smart cameras for smart rooms
Coordinated cameras track subject:
-
Why multiple cameras?
Helps with occlusion.
Can synthesize new views:
Pan/focus of attention.
Zoom/resolution.
Aperture.
-
Why distributed?
Centralized server is impractical:
Network bandwidth.
Network power consumption.
Latency for system adjustments (lighting, etc.)
Smart cameras play to the strengths of VLSI---they increase volume.
-
Ozer et al: human activity recognition algorithm
[Figure: flow diagram. Region extraction -> Contour following -> Ellipse fitting -> Graph matching -> HMMs (head, body, hand1, hand2) -> Gesture classifier.]
-
Real-time analysis
-
[Figure panels: Original, Region finding, Ellipse fitting.]
-
Tuning the smart camera software
Initial C/Trimedia code was a direct translation from Matlab.
Goals:
Increase frame rate.
Reduce latency.
Identify bottlenecks for next-generation architecture.
-
Real-time vs. just fast
Real-time computing adheres to constraints:
Must perform at a given rate.
To satisfy the rate, must minimize variations in processing time.
-
Stage times before optimization
[Figure: bar chart of processing time (%) by stage (Region, Contour, Super, Match); y-axis 0% to 60%.]
-
Smart camera CPU times
[Figure: processing time (msec) vs. frame number (0 to 45), two panels.
Skin detection: y-axis 26.5 to 29.5 msec; annotated 29.5 ms.
Contour following: y-axis 65.6 to 67.2 msec; annotated 67.2 ms.]
-
Smart camera CPU times, contd.
[Figure: processing time (msec) vs. frame number (0 to 45), two panels.
Superellipse fitting: y-axis 50 to 350 msec; annotated 250 ms.
Graph matching: y-axis 0 to 35 msec; annotated 35 ms.]
-
Normalized standard deviation of stage times
[Figure: bar chart of normalized standard deviation (0 to 0.9) for region, contour, super, match, and all.]
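The slide does not define "normalized standard deviation". A minimal sketch, assuming it means the standard deviation of per-frame stage times divided by their mean (the coefficient of variation); the function name and the sample times are hypothetical, not measurements from the slides:

```python
from statistics import mean, pstdev

def normalized_std(times_ms):
    """Population standard deviation divided by the mean (assumed definition)."""
    m = mean(times_ms)
    if m == 0:
        raise ValueError("mean processing time must be nonzero")
    return pstdev(times_ms) / m

# Hypothetical per-frame times in msec, for illustration only.
steady_stage = [66.0, 66.2, 65.9, 66.1]
bursty_stage = [50.0, 250.0, 60.0, 300.0]

print(normalized_std(steady_stage))  # small: stage time barely varies
print(normalized_std(bursty_stage))  # large: stage time is data dependent
```

Under this definition a stage with a low value is easier to schedule against a fixed frame rate than a stage with a high one, regardless of its absolute running time.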
-
Optimizations
Change the algorithm.
Change the program structure.
Change the instructions.
-
Algorithmic changes
Superellipses were expensive to fit and overkill.
Replaced with ellipse fitting.
Improved adjacency algorithm.
-
Region finding
Operates on 3 x 3 window.
Roughly linear in frame size.
Sequential algorithm---window moves one pixel per step.
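A minimal sketch of a sequential 3 x 3 window scan, to show why the cost is roughly linear in frame size. The slides do not say what is computed in each window; the majority rule below, and the name `scan_3x3`, are hypothetical stand-ins:

```python
def scan_3x3(mask):
    """Slide a 3x3 window over a binary mask, one pixel per step.

    Returns (output, steps). A pixel is kept when at least 5 of the 9
    window pixels are set; this rule is an assumption for illustration.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    steps = 0
    for y in range(1, h - 1):          # skip the one-pixel border
        for x in range(1, w - 1):
            total = sum(mask[y + dy][x + dx]
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = 1 if total >= 5 else 0
            steps += 1                 # one window position per step
    return out, steps

frame = [[1] * 5 for _ in range(4)]    # 4 x 5 frame, all pixels set
result, steps = scan_3x3(frame)
print(steps)                           # (4 - 2) * (5 - 2) window positions
```

Each window position does a constant amount of work, so the step count grows with the number of interior pixels.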
-
Program changes
Contour fitting is very control intensive:
Compares local configurations of bits.
Transformed into data-parallel operations for VLIW.
[Figure: control-oriented vs. data-oriented. The neighborhood bits 0 1 1 0 1 0 1 1 are packed into the value 107, which addresses a table.]
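The data-oriented form can be sketched as follows: instead of a chain of branches on individual neighbor bits, the comparison results are packed into one integer that indexes a precomputed table. The bit order (first bit most significant) is an assumption chosen because it reproduces the 107 in the figure; the table contents here are a hypothetical placeholder, since the slides do not list the real entries:

```python
def pack_bits(bits):
    """Pack 0/1 values into an integer, first bit most significant."""
    index = 0
    for b in bits:
        index = (index << 1) | (b & 1)
    return index

# Hypothetical table with one entry per 8-bit neighborhood pattern.
# A real table would hold the next contour-following action; this one
# simply stores the number of set bits so the lookup has something to return.
TABLE = [bin(i).count("1") for i in range(256)]

def lookup(bits):
    """Replace a branch chain with a single indexed load."""
    return TABLE[pack_bits(bits)]

neighborhood = [0, 1, 1, 0, 1, 0, 1, 1]
print(pack_bits(neighborhood))  # 107
print(lookup(neighborhood))     # 5 (entry of the placeholder table)
```

On a VLIW machine the eight comparisons that produce the bits are independent, so they can issue in parallel, and the only data-dependent step left is the table load.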
-
Instruction changes
Trimedia provides library of intrinsic functions that map onto Trimedia instruction sequences.
Goal: eliminate branches.
Special instructions.
Loop unrolling.
-
Before and after stage times
[Figure: bar chart of processing time (ms, 0 to 60) for Region, Contour, SuperFit, Match, and Total; series Original and Optimized.]
-
Results
Before: 5 frames/sec.
After: 31 frames/sec w/o HMM, 25 frames/sec with HMM.
Latency approx. 100 ms.
Smaller variation in frame processing time.
Intel port runs well.
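The frame rates above translate into per-frame time budgets by simple arithmetic. A short sketch (the helper name `frame_period_ms` is hypothetical; the rates are the ones quoted above):

```python
def frame_period_ms(frames_per_sec):
    """Average time available per frame, in milliseconds."""
    return 1000.0 / frames_per_sec

before = frame_period_ms(5)          # 200.0 ms per frame
after_no_hmm = frame_period_ms(31)   # about 32.3 ms per frame
after_hmm = frame_period_ms(25)      # 40.0 ms per frame

speedup_no_hmm = 31 / 5              # 6.2x higher frame rate
speedup_hmm = 25 / 5                 # 5.0x higher frame rate

print(before, round(after_no_hmm, 1), after_hmm)
print(speedup_no_hmm, speedup_hmm)
```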
-
Multiprocessor architectures for video
Interested in high-speed video processing.
150 frames/sec.
Want reasonably low-power operation for pervasive applications.
-
High-speed smart cameras
High frame rates provide better motion capture.
Frame rate of 150 frames/sec is considered desirable.
Stanford CMOS camera can digitize at 10,000 frames/sec.
-
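Why heterogeneous architectures make sense
[Figure: the multimedia processing funnel with axes "Data volume" and "Data abstraction"; stages labeled pixel processing, edge extraction, principal component analysis, hidden Markov models; processor labels VLIW and RISC.]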
-
Algorithm flow
[Figure: flow diagram. Region extraction -> Contour following -> Ellipse fitting -> Graph matching -> HMMs (head, body, hand1, hand2) -> Gesture classifier.]
-
Memory structure
Feed-forward data communication.
Not much global memory required.
Allocating memory depends on data volumes, access patterns, flexibility.
-
Average processing time by stage
[Figure: bar chart of clock cycles/frame (0 to 1.80E+06) by algorithm stage: region, contour, ellipse, match, hmm.]
-
Average IPC by stage
[Figure: bar chart of IPC (0 to 6) by algorithm stage: region, contour, ellipse, match, hmm.]
-
Tiehan's VLIW implementation
Unroll loop to perform multiple comparisons in parallel.
Pack results into bit vector to address results table.
Register file, cache provide for reuse of pixel values.
-
Contour crawler machine
Hardware implementation of VLIW code:
[Figure: block diagram with a memory, three blocks labeled Q, and an FSM.]
-
Crawler and memory
Crawler performance depends on memory system.
Access patterns vary in 2 dimensions:
[Figure: pixel X surrounded by its eight neighbors, numbered 1 to 8.]
-
Memory system design
Want to minimize number of partitions to reduce row/column overhead.
Only memory organization that allows for all parallel accesses is one-word partition.
Assume we fetch one row or column at a time---3 fetches/cycle.
-
Single contour crawler
Assuming row/column access pattern, crawler is faster than VLIW by a relatively small constant.
-
Multiple crawlers
Assuming we can patch together contours, we can start multiple crawlers.
Multiple crawler performance is limited by memory.
Multiple crawlers' memory accesses can conflict.
-
Full-frame SIMD
Can build a large SIMD array with one processor per pixel.
Area*delay:
Speed is roughly constant.
PE is probably about the same size as the crawler.
Not clear it is worth the silicon.
-
Heterogeneous system
Region:
Stream processor with current algorithm.
Stream processor + RISC for others.
Contour: Crawler.
Ellipse: Superscalar/RISC.
Graph: RISC.
-
Stage pipelining
Significant amount of time in non-streaming stages.
[Figure: bar chart of processing time (ms, 0 to 60) for Region, Contour, SuperFit, Match, and Total; series Original and Optimized.]
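A minimal sketch of why stage pipelining helps and why its benefit is bounded. It assumes each stage runs on its own processor, stages overlap perfectly, and communication cost is ignored; the stage times are hypothetical and are not the values from the chart:

```python
def pipeline_speedup(stage_times_ms):
    """Throughput gain of an ideal stage pipeline over sequential execution.

    Sequentially, one frame takes the sum of all stage times. In an ideal
    pipeline a new frame completes every max(stage time), so the slowest
    stage sets the frame rate.
    """
    sequential_period = sum(stage_times_ms)
    pipelined_period = max(stage_times_ms)
    return sequential_period / pipelined_period

# Hypothetical times (ms) for region, contour, ellipse, match.
print(pipeline_speedup([12, 5, 8, 5]))  # 2.5
print(pipeline_speedup([5, 5, 5, 5]))   # 4.0, the balanced best case
```

The gain approaches the number of stages only when the stages are balanced; one dominant stage caps it well below that.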
-
Heterogeneous vs. VLIW
VLIW:
Off-the-shelf IP.
Easy to program.
10 mm^2 in 0.13 micron.
Heterogeneous:
Requires more design of blocks, memory.
Pipelineable for 2.3X speed-up.
-
Heterogeneous multiprocessor size
stage                        PE           area (mm^2)
background                   MIPS32 4Km   0.9
contour                      custom       0.001
ellipse, graph               MIPS64 5Kf   5
total frame processor                     5.901
classification               MIPS64 5Kf   5
number of frame processors                3
grand total                               22.703
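The totals in the table follow from the per-stage areas. A short check, assuming (as the numbers imply) that the grand total is three frame processors plus one classification processor; the variable names are illustrative:

```python
# Per-stage PE areas in mm^2, taken from the table.
frame_processor_areas = {
    "background (MIPS32 4Km)": 0.9,
    "contour (custom)": 0.001,
    "ellipse, graph (MIPS64 5Kf)": 5.0,
}
classification_area = 5.0        # MIPS64 5Kf
num_frame_processors = 3

frame_total = sum(frame_processor_areas.values())
grand_total = num_frame_processors * frame_total + classification_area

print(round(frame_total, 3))     # 5.901
print(round(grand_total, 3))     # 22.703
```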
-
Thoughts on networking
Don't want to send all the video all of the time.
Need to send bursts, possibly lower frame rate or resolution.
Distributed control requires low latency.
Kodak is developing a low-cost wireless network for ad-hoc camera networks.
Consider 802.11 too expensive.
-
Networks-on-chips
SoC trends:
Lots of transistors.
Designs built from multiple IP blocks.
Structured as heterogeneous multiprocessors.
Why not bundle interconnect as IP?
[Figure: block diagram with four CPU blocks and two mem blocks.]
-
Advantages of NoC-based design
More structured wiring---eases physical design problems.
Encourages design reuse of interconnect.
Structures and simplifies applications by providing communication primitives.
Leverages networked multiprocessor design knowledge.
-
Challenges in NoC-based design
What networks, protocols, etc?
Physical layer---rich but crummy interconnect.
Application-specific design: what is worth modifying and what should be left alone?
-
Wave pipelining history
L. Cotton introduced wave pipelining in 1969.
Used in Cray-1.
Pipelined memory access in Rambus.
Difficult to design.
Multiple data elements in flight.
[Figure: two datapaths through stages F1, F2, F3 between buffers, each carrying Data1, Data2, Data3; one labeled "Pipelining with latch", the other "Wave pipelining".]
-
Simulation model
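Cadence Spectre, 10000 um interconnection.
Inverters are evenly spaced.
Worst case and best case.
0.25 um Al, (width, space) = (1.2 um, 2.0 um) [KAH98]
[Figure: inverters driving loads through a wire model; the wire model is an RC ladder with series resistors R/3, R/3, R/3 and shunt capacitors C/6, C/3, C/3, C/6.]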
-
Simulation results (I)
When inverter size is 100 um: d = 379 ps, t = 226 ps.
When inverter size reduces to 40 um: N = 3.28, t = 325 ps.
On one end, throughput is 70% higher; on the other end, about 60% of inverter area is saved; in the middle, throughput increases while area is saved.
[Figure: delays (ps, 0 to 700) vs. inverter size (120 down to 30 um); curves d and t.]
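The two end points can be reproduced with simple ratios. This is a sketch under stated assumptions, since the slides do not define the symbols: d is taken as the end-to-end delay (the period of a conventional link), t as the minimum wave-pipelined period, and inverter area is taken as proportional to inverter size:

```python
d_100_ps = 379.0    # delay with 100 um inverters
t_100_ps = 226.0    # wave-pipelined period with 100 um inverters
t_40_ps = 325.0     # wave-pipelined period with 40 um inverters

# Throughput is 1/period, so the gain over the 100 um conventional link
# is the ratio of periods minus one.
gain_same_size = d_100_ps / t_100_ps - 1      # about 0.68
gain_small_inverters = d_100_ps / t_40_ps - 1 # about 0.17

# Area saving when inverters shrink from 100 um to 40 um (assumed linear).
area_saving = 1 - 40 / 100                    # 0.6

print(round(gain_same_size, 3))
print(round(gain_small_inverters, 3))
print(round(area_saving, 2))
```

Under these assumptions the first ratio is consistent with the "70% higher" throughput figure and the last with the "about 60%" area saving; the 40 um design still shows a higher throughput than the 100 um conventional link.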
-
Simulation results (II)
When inverter size is 100 um: Et = 20.5 pJ/bit, Ew = 17.1 pJ/bit.
When inverter size reduces to 40 um: Ew = 9.88 pJ/bit.
Wave pipelining could save energy.
[Figure: average energy per bit (pJ/bit, 0 to 25) vs. inverter size (120 down to 30 um); curves Ew and Et.]
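The energy figures can be turned into relative savings. The slides do not define Et and Ew; this sketch assumes Et is the energy per bit of the conventional link and Ew that of the wave-pipelined link, and compares both Ew values against Et at 100 um:

```python
et_100 = 20.5   # pJ/bit, assumed conventional link, 100 um inverters
ew_100 = 17.1   # pJ/bit, assumed wave-pipelined link, 100 um inverters
ew_40 = 9.88    # pJ/bit, assumed wave-pipelined link, 40 um inverters

saving_same_size = 1 - ew_100 / et_100       # about 0.17
saving_small_inverters = 1 - ew_40 / et_100  # about 0.52

print(round(saving_same_size, 3))
print(round(saving_small_inverters, 3))
```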
-
Summary
Multimedia applications are already more complex and will become more so:
multiple algorithms;
complex control and data.
Instruction-level parallelism helps, but isn't everything.