IMAGE AND VISION PROCESSING
ON TEGRA K1 Elif Albuz
IMAGE AND VISION USE CASES
Driven by using camera as a sensor
Augmented Reality
Face, Body and Gesture Tracking
Computational Photography and
Videography
3D Scene/Object Reconstruction
MOBILE VISION COMPUTING
Pro
cess
ing D
em
ands
Time
Photography Input = 2D Camera
Processors = ISP + CPU Product = Static Images
Computational Photography Input = MEMS + 2D Camera
Processors = ISP + CPU + GPU Result = Enhance Images and Videos
Mobile Vision Computing Input = MEMS + Depth Camera Processors = ISP + CPU + GPU
Result = Data for advanced user interface and environment modeling
VISION COMPUTING APP CATEGORIES Vision Computing
3D Reconstruction (constructs 3D geometry)
Environmental Feature Tracking Face and gesture tracking Object Reconstruction
Scene Reconstruction
Tracking (constructs positions and motions)
Indoor/Outdoor Positional Tracking Body Modeling
Facial Modeling
Body Tracking
3D Grid with SFM Ped/car detection & tracking
AUGMENTED REALITY
High-Quality Reflections, Refractions, and Caustics in Augmented Reality and their Contribution to Visual Coherence
P. Kán, H. Kaufmann, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Vienna, Austria
HYPER-REALISM
Ray-tracing and light-field
calculations running today on CUDA laptop PC – 50+ Watts
Ongoing research to use depth cameras to reconstruct global
illumination model in real-time
Need on mobile devices at 100x less power = 0.5W
SIMULTANEOUS LOCALIZATION AND MAPPING
WHAT IS NEEDED? Accelerated image & vision processing
— Tegra K1: CPU, GPU, ISP
Camera flexibility and handling of new sensor types
— Android HAL V3, V4L
Low latency routing of image streams to GPU
— EGLStreams
Integrated handling of various sensors and camera
— Global Time Stamps
Effective image & vision programming frameworks
— VisionWorks
TEGRA K1 PLATFORM Desktop GPGPU
features and tools on mobile
GPGPU Compute
CUDA Tools and Libraries
Advanced Graphics
THREE GENERATIONS OF REFINEMENT
Tesla - 2006
Fermi - 2010
Kepler - 2012
TEGRA K1: A MAJOR LEAP FORWARD FOR MOBILE & EMBEDDED APPLICATIONS KEPLER GPU, 192 CORES CUDA 12GB/S BANDWIDTH VIDEO IMAGE COMPOSITOR (VIC)
HD Video Processor 1080p24/30 Video Decode 1080p24/30 Video Encode H.264 | MPEG4 | VC1 | MPEG2 VP8
Kepler GeForce®
GPU w/CUDA
OpenGL-ES nextgen
192 Stream Processors
2D Graphics/Scaling
DAP x5 (12S/TDM)
HDMI eDP/LVDS
ARM
7
Audio
Pro
cess
or Image Processor
25MP Sensor Support ISP 1080p60 Enhanced JPEG Engine
PCIe* G2 x4 + x1
CSI x4 + x4
SATA2 x1 USB 2.0 x3
Security Engine
Display x2
NOR Flash
UART x4 I2C x5
DDR3 Ctlr 64b
800+ MHz
SPI x4 SDIO/MMC x4
28 nm HPM 23x23mm, 0.7mm pitch HS-FCBGA
USB 3.0* x2
Quad Cortex-A15
4x Cores (1+ GHz) NEON SIMD 2 MB L2 (Shared) ARM Trust Zone
Shadow LP C-A15 CPU
CPU
Quad-core A15, NEON
Unified memory, access from CPU and GPU
low-power Shadow core
HD Video Processor 1080p24/30 Video Decode 1080p24/30 Video Encode H.264 | MPEG4 | VC1 | MPEG2 VP8
Kepler GeForce®
GPU w/CUDA
OpenGL-ES nextgen
192 Stream Processors
2D Graphics/Scaling
DAP x5 (12S/TDM)
HDMI eDP/LVDS
ARM
7
Audio
Pro
cess
or Image Processor
25MP Sensor Support ISP 1080p60 Enhanced JPEG Engine
PCIe* G2 x4 + x1
CSI x4 + x4
SATA2 x1 USB 2.0 x3
Security Engine
Display x2
NOR Flash
UART x4 I2C x5
DDR3 Ctlr 64b
800+ MHz
SPI x4 SDIO/MMC x4
28 nm HPM 23x23mm, 0.7mm pitch HS-FCBGA
USB 3.0* x2
Quad Cortex-A15
4x Cores (1+ GHz) NEON SIMD 2 MB L2 (Shared) ARM Trust Zone
Shadow LP C-A15 CPU
4 +1
IMAGE PROCESSOR
2 ISPs, independently programmable
25Mp camera support
(upto 250Mp)
1.2Gp throughput
GPGPU interoperability
HD Video Processor 1080p24/30 Video Decode 1080p24/30 Video Encode H.264 | MPEG4 | VC1 | MPEG2 VP8
Kepler GeForce®
GPU w/CUDA
OpenGL-ES nextgen
192 Stream Processors
2D Graphics/Scaling
DAP x5 (12S/TDM)
HDMI eDP/LVDS
ARM
7
Audio
Pro
cess
or Image Processor
25MP Sensor Support ISP 1080p60 Enhanced JPEG Engine
PCIe* G2 x4 + x1
CSI x4 + x4
SATA2 x1 USB 2.0 x3
Security Engine
Display x2
NOR Flash
UART x4 I2C x5
DDR3 Ctlr 64b
800+ MHz
SPI x4 SDIO/MMC x4
28 nm HPM 23x23mm, 0.7mm pitch HS-FCBGA
USB 3.0* x2
Quad Cortex-A15
4x Cores (1+ GHz) NEON SIMD 2 MB L2 (Shared) ARM Trust Zone
Shadow LP C-A15 CPU
x2
VIDEO ENCODE/DECODE 2X 1920X1080@30FPS CUVID/CUVENC VIDEO ENCODE/DECODE INTERFACE MOTION ESTIMATION ONLY MODE
HD Video Processor 1080p24/30 Video Decode 1080p24/30 Video Encode H.264 | MPEG4 | VC1 | MPEG2 VP8
Kepler GeForce®
GPU w/CUDA
OpenGL-ES nextgen
192 Stream Processors
2D Graphics/Scaling
DAP x5 (12S/TDM)
HDMI eDP/LVDS
ARM
7
Audio
Pro
cess
or Image Processor
25MP Sensor Support ISP 1080p60 Enhanced JPEG Engine
PCIe* G2 x4 + x1
CSI x4 + x4
SATA2 x1 USB 2.0 x3
Security Engine
Display x2
NOR Flash
UART x4 I2C x5
DDR3 Ctlr 64b
800+ MHz
SPI x4 SDIO/MMC x4
28 nm HPM 23x23mm, 0.7mm pitch HS-FCBGA
USB 3.0* x2
Quad Cortex-A15
4x Cores (1+ GHz) NEON SIMD 2 MB L2 (Shared) ARM Trust Zone
Shadow LP C-A15 CPU
HD DEC/ENC PROCESSOR
VIDEO ENCODER Dedicated hardware accelerator
Current frame
Ref. frame
Compressed video (h.264, ..)
Encoder Motion Vectors/Track info
KEPLER Architecture 192 CUDA Cores, SM3.2 ISA Compatible to GeForce, Quadro, Tesla 64kb L1 Cache and Shared Memory 128kb L2 Cache 128 kb Register File
HD Video Processor 1080p24/30 Video Decode 1080p24/30 Video Encode H.264 | MPEG4 | VC1 | MPEG2 VP8
Kepler GeForce®
GPU w/CUDA
OpenGL-ES nextgen
192 Stream Processors
2D Graphics/Scaling
DAP x5 (12S/TDM)
HDMI eDP/LVDS
ARM
7
Audio
Pro
cess
or Image Processor
25MP Sensor Support ISP 1080p60 Enhanced JPEG Engine
PCIe* G2 x4 + x1
CSI x4 + x4
SATA2 x1 USB 2.0 x3
Security Engine
Display x2
NOR Flash
UART x4 I2C x5
DDR3 Ctlr 64b
800+ MHz
SPI x4 SDIO/MMC x4
28 nm HPM 23x23mm, 0.7mm pitch HS-FCBGA
USB 3.0* x2
Quad Cortex-A15
4x Cores (1+ GHz) NEON SIMD 2 MB L2 (Shared) ARM Trust Zone
Shadow LP C-A15 CPU
GPU
WHAT IS NEEDED? Accelerated image & vision processing
— Tegra K1: CPU, GPU, ISP
Camera flexibility and handling of new sensor types
— Android HAL V3, V4L, Camera API
Low latency routing of image streams to GPU
— EGLStreams
Integrated handling of various sensors and camera
— Global Time Stamps
Effective image & vision programming frameworks
— VisionWorks
Camera ISP (Image Signal Processor)
— Little or no programmability
— Data flows thru compact hardware pipe
— Scan-line-based - no global memory
— Best perf/watt
CAMERA IMAGE PROCESSING
~760 math Ops
~42K vals = 670Kb
300MHz ~250Gops
VISUAL SENSOR REVOLUTION Single RGB sensors just the start of mobile visual revolution
— IR sensors – LEAP Motion, eye-trackers
— Active illumination depth sensors – TOF and structured light
Multi-sensors: Stereo pairs -> Plenoptic array -> Depth cameras
— Stereo pair can enable object scaling and enhanced depth extraction
— Plenoptic Field processing needs FFTs and ray-casting
Hybrid visual sensing solutions
— Different sensors mixed for different distances and lighting conditions
Dual Camera LG Electronics
Plenoptic Array Pelican imaging
Capri Structured Light 3D Camera PrimeSense
Camera HAL v1 focused on simplifying basic camera apps
— Difficult or impossible to do much else
— New features require proprietary driver extensions
— Extensions not portable - restricted growth of third party app ecosystem
Camera HAL v3 is a fundamentally different API
— Flexible primitives for building sophisticated use-cases
— Interface is clean and easily extensible
— Apps can have more control, and more responsibility
Enables sophisticated camera applications
Faster time to market and higher quality
ANDROID CAMERA HAL V3
KHRONOS CAMERA API Specification available 2015
Also FCAM-Based
— Will be available for any OS
— FCAM running on NVIDIA Linux today
No global state
— State travels with image requests
— Every stage in the pipeline may have different state
— Enables fast, deterministic state changes
Synchronize devices
— Lens, flash, sound capture, gyro…
— Devices can schedule Actions
— E.g. to be triggered on exposure change
KHRONOS CAMERA API REQUIREMENTS Application control over ISP processing (including 3A)
— Including multiple, re-entrant ISPs
Control multiple sensors with synch and alignment
— E.g. Stereo pairs, Plenoptic arrays, TOF/structured light depth cameras
Enhanced per frame detailed control
— Format flexibility, Region of Interest (ROI) selection
Global timing & synchronization
— E.g. Between cameras and MEMS sensors
Flexible processing/streaming
— Multiple input and output streams
— RAW, Bayer or YUV Processing
— Streaming of rows (not just frames)
Enable new camera
functionality not available
on current platforms and
align with future platform
directions for easy adoption
TEGRA K1 CAMERA DATAFLOW
Flexible routing of sensor data to ISPs and unified memory
Enables sensor processing by any combination of GPUs, CPUs and ISPs
Kepler
GPGPU
Camera
One
Camera
Two
Cam
era
Routin
g
Unified
Memory
CPU CPU
CPU CPU
ISP-A
ISP-B
Camera ISP (Image Signal Processor)
— Little or no programmability
CPU
— Single processor or Neon SIMD - running fast
— Makes heavy use of general memory
— Non-optimal performance and power
GPU
— Programmable and flexible
— Many way parallelism - run at lower frequency
— Efficient image caching close to processors
— BUT cycles frames in and out of memory
COMPUTATIONAL PHOTOGRAPHY
~760 math Ops
~42K vals = 670Kb
300MHz ~250Gops
WHAT IS NEEDED? Accelerated image & vision processing
— Tegra K1: CPU, GPU, ISP
Camera flexibility and handling of new sensor types
— Android HAL V3, V4L
Low latency routing of image streams to GPU
— EGLStreams
Integrated handling of various sensors and camera
— Global Time Stamps
Effective image & vision programming frameworks
— VisionWorks
EGL 1.5 RELEASED EGL 1.5 brings functionality from multiple
extensions into core
— Increased reliability and portability
EGLImages
— Sharing textures and renderbuffers
Context Robustness
— Defending against malicious code
EGLSync objects
— Improved OpenGL /OpenCL interop
Platform extensions
— Standardized interactions for multiple OS e.g. Android and 64-bit platforms
sRGB colorspace rendering
Applications
OS and Display
Platforms
Application Portability EGL abstracts graphics context
management, surface and
buffer binding and rendering
synchronization
API Interop EGL provides efficient
transfer of data and events
between Khronos APIs
WHAT IS NEEDED? Accelerated image & vision processing
— Tegra K1: CPU, GPU, ISP
Camera flexibility and handling of new sensor types
— Android HAL V3, V4L
Low latency routing of image streams to GPU
— EGLStreams
Integrated handling of various sensors and camera
— Global Time Stamps
Effective image & vision programming frameworks
— VisionWorks
HOW MANY SENSORS ARE IN A SMARTPHONE? Light
Proximity
2 cameras
3 microphones
Touch
Position
— GPS
— WiFi (fingerprint)
— Cellular (tri-lateration)
— NFC, Bluetooth (beacons)
Accelerometer
Magnetometer
Gyroscope
Pressure
Temperature
Humidity
27
19
STREAMINPUT SENSOR ABSTRACTION API
Apps Need Sophisticated Access to Sensor Data Without coding to specific
sensor hardware
Apps request semantic sensor information StreamInput defines possible requests, e.g.
Read Physical or Virtual Sensors e.g. “Game Quaternion”
Context detection e.g. “Am I in an elevator?”
StreamInput processing graph provides
optimized sensor data stream High-value, smart sensor fusion middleware can connect
to apps in a portable way
Apps can gain ‘magical’ situational awareness
Advanced Sensors Everywhere Multi-axis motion/position, quaternions,
context-awareness, gestures, activity monitoring, health and environmental sensors
Sensor Discoverability
Sensor Code Portability
WHAT IS NEEDED? Accelerated image & vision processing
— Tegra K1: CPU, GPU, ISP
Camera flexibility and handling of new sensor types
— Android HAL V3, V4L
Low latency routing of image streams to GPU
— EGLStreams
Integrated handling of various sensors and camera
— Global Time Stamps
Effective image & vision programming frameworks
— VisionWorks
MOBILE & EMBEDDED DEVELOPERS NEED HELP!
Control, coordinate and synchronize a diverse
array of mobile sensors
Write maintainable code for a heterogeneous mix of CPUs, GPUs and DSPs
Handle a diverse selection of emerging depth camera
technologies
Create fluid 60Hz experiences on battery-powered mobile devices
Leverage dedicated vision hardware for minimized power
Write code that is deployable across multiple devices,
platforms and OS
DESKTOP TO MOBILE DEVELOPMENT Image & Vision Processing code development starts at desktop
PC
Tegra K1 enables easy migration through libraries and tools available across platforms
TEGRA K1 CUDA DEVELOPMENT
CUDA-Aware Editor
Automated CPU to GPU code refactoring
Semantic highlighting of CUDA code
Integrated code samples & docs
Nsight Debugger
Simultaneously debug of CPU and GPU
Inspect variables across CUDA threads
Use breakpoints & single-step debugging
Nsight Profiler
Quickly identifies performance issues
Integrated expert system
Source line correlation
Cross platform development
Native memcheck, GDB, nvprof
VisionWorks
OpenCV
CUDA LIBRARIES
NPP CUFFT
CUBLAS CUDA Math Lib
NPP LIBRARY (CUDA) Data exchange &
initialization
— Set, Convert, CopyConstBorder, Copy, Transpose, SwapChannels
Arithmetic & Logical Ops
— Add, Sub, Mul, Div, AbsDiff
Threshold & Compare
— Threshold, Compare
Color Conversion
— RGB To YCbCr (& vice versa),
ColorTwist, LUT_Linear
JPEG
— DCTQuantInv/Fwd, QuantizationTable
Functions
— FilterBox, Row, Column, Max, Min, Median, Dilate, Erode, SumWindowColumn/Row
Geometry Transforms
— Mirror, WarpAffine / Back/ Quad, WarpPerspective / Back / Quad, Resize
Statistics
— Mean, StdDev, NormDiff, MinMax,
Histogram, SqrIntegral,
RectStdDev
Computer Vision
— ApplyHaarClassifier,
— GraphCuts
OPENCV LIBRARY Initially developed by Intel for single-core x86 CPUs
— Version 2.4.5 >900 functions (x the datatypes)
— OpenCV4Tegra - Accelerated CUDA+NEON+GLSL+TBB multithreading
General Image
Processing
Segmentation Machine Learning,
Detection Image Pyramids Transforms Fitting
Image processing
Video, Stereo, and 3D
Camera Calibration Features Depth Maps Optical Flow Inpainting Tracking
OpenCV
OPENCV-GPU VALUE ADD ON LOGAN
Speedup
0
1
2
3
4
5
6
7
core filter imgproc objdetect
Jetson Kepler GPU /Quadcore A15 Public OCV for Mobile/Embedded
Average speedup for different function categories with Logan GPU compared to public source code.
VISIONWORKS MOTIVATION
Advanced
Silicon
+ = Widespread vision
processing in embedded,
mobile and automotive
devices and applications
VisionWorks Simplify vision programming
Fully optimized and accelerated
Modular and Extensible
VISIONWORKS
ADAS – Advanced Driver Assistance Systems
Computational Photography
Augmented Reality Robotics
Power Efficient Computer Vision
Powered with CUDA
Supported on Tegra K1 Linux and Android
ACCELERATING • Advanced Driver Assistance • Computational Photography
• Augmented Reality • Robotics
• Deep Learning and more…
Version 0.10 is available for registered partners!
VisionWorks Primitives
VISIONWORKS SOFTWARE STACK
Application Code
Tegra K1
CUDA
Classifier Corner
Detection
3rd Party
Primitives
Sample Pipelines
Object
Detection
3rd Party
Pipelines …
…
SfM
NVIDIA supplied vision
primitives using CUDA and
Tegra processing resources
Customers and developers
can create their own
primitives e.g. using CUDA
OpenVX enables power
efficient and flexible
chaining of primitives
NVIDIA provides sample
pipelines for common use
cases
Applications use combination
of direct primitives, the
OpenVX framework and
supplied pipelines
Framework
OPENVX – POWER EFFICIENT VISION ACCELERATION
Khronos, open, cross-vendor vision API
— Focus on mobile and embedded systems
Foundational API for vision acceleration
— Useful for middleware or by applications
— Enables diverse efficient implementations
Complementary to OpenCV
— Which is great for prototyping
Open source sample
implementation
Hardware vendor
implementations
OpenCV open
source library VisionWorks
Sample Pipelines
Application
OpenCV
OPENVX GRAPHS – THE KEY TO EFFICIENCY Directed graphs for processing power and efficiency
— Each Node can be implemented in software or accelerated hardware
— Nodes may be fused to eliminate memory transfers
— Processing can be tiled to keep data entirely in local memory/cache
EGLStreams route data from camera and to application
Can extend with “VisionWorks” nodes using CUDA
OpenVX Node
VisionWorks Node
OpenVX Node
VisionWorks Node
Application Native
Camera
Control
Example OpenVX Graph
OPENVX AND OPENCV ARE COMPLEMENTARY
Governance Community driven open source
with no formal specification
Defined and
implemented by Khronos
Portability APIs can vary depending on processor Tegra K1 mobile platforms
Scope Very wide
1000s of imaging and vision functions
Multiple camera APIs/interfaces
Tight focus on hardware accelerated functions
for mobile vision
Use external camera API
Efficiency Memory-based architecture
Each operation reads and writes
memory
Graph-based execution
Optimizable computation, data transfer
Use Case Rapid experimentation Production development & deployment
OPENVX 1.0 FUNCTION OVERVIEW Core data structures
— Images and Image Pyramids
— Processing Graphs, Kernels, Parameters
Image Processing
— Arithmetic, Logical, and statistical operations
— Multichannel Color and BitDepth Extraction and Conversion
— 2D Filtering and Morphological operations
— Image Resizing and Warping
Core Computer Vision
— Pyramid computation
— Integral Image computation
Feature Extraction and Tracking
— Histogram Computation and Equalization
— Canny Edge Detection
— Harris and FAST Corner detection
— Sparse Optical Flow
OpenVX 1.0 defines
framework for
creating, managing and
executing graphs
Focused set of widely
used functions that are
readily accelerated
Implementers can add
functions as extensions
Widely used extensions
adopted into future
versions of the core
OpenVX Specification
Evolution
VISIONWORKS PRIMITIVES – JAN 2014
Sobel
Convolve
Bilateral Filter
Integral Image
Integral Histogram
Corner Harris
Corner FAST
Image Pyramid
Optical Flow PyrLK
Optical Flow Farneback
Warp Perspective
Hough Lines
Fast NLM Denoising
Stereo Block Matching
IME (Iterative Motion
Estimation)
HOG (Histogram of
Oriented Gradients)
Soft Cascade Detector
Object Tracker
TLD Object Tracker
SLAM
Path Estimator
MedianFlow Estimator
OPENVX AND CUDA ARE COMPLEMENTARY
Use Case GPGPU Programming Domain targeted Vision processing
Architecture Language-based Library-based
- no separate compiler required
Target Hardware
‘Exposed’ architected memory model – programmer manages memory
Abstracted node and memory model - diverse implementations can be optimized
for power and performance
Precision Full IEEE floating point mandated Minimal floating point requirements –
optimized for vision operators
Ease of Use General-purpose math and other
libraries Fully implemented vision operators and
framework ‘out of the box’
Use CUDA to build new VisionWorks OpenVX Nodes
VISIONWORKS SAMPLE PIPELINES (V0.10)
Structure From Motion/SLAM
Pedestrian Detection Vehicle detection Object tracking
Dense optical flow Active Shape Model Denoising
VISIONWORKS – LOOKING FORWARD
Enable multi-camera applications
3D sensors
Conformance with OpenVX once specification finalized
TEGRA K1 DEVELOPMENT PLATFORMS
JETSON X3 (TK1 PRO)
gigE, usb3.0, HDMI, CANBUS
running Vibrante Linux
AUTOMOTIVE GRADE
Coming to Android K1
& Other Linux
Devices soon..
QUESTIONS?
BACKUP