Bridging the Semantic Gap
ECE 417 Spring 2013, University of Illinois
Mert Dikmen
Semantic Gap
Computer Representation <-> Semantic Gap <-> Natural Language Representation
Semantic Gap (example labels, one per slide): Green; Corner; Roof; Ski Slope; Resort; Fun, Holiday, Beautiful...
Semantic Gap in Multimedia
Retrieval: Given a description, retrieve all "relevant" content from a database
Parsing: Given an input, formulate a natural language description
Subtasks
Detection (find "things")
Segmentation (find the boundaries of "things")
Recognition (assign category)
Multimedia Analysis Competitions and Evaluations
Moderate-size dataset
Training set with labels
Evaluation set without labels
Constrained problem
Detect well-defined actions
Detect words or concepts
Well-defined metric
Challenges
Algorithm design
Computation
Star Challenge
PART I: visual data processing
What is Star Challenge?
Competition to Develop World's Next-Generation Multimedia Search Technology
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
A real-world computer vision task which requires large amounts of computation power
But low rewards
56 teams from 17 countries -> Round 1 -> 8 teams -> Round 2 -> 7 teams -> Round 3 -> 5 teams -> Grand Final in Singapore
No rewards; only one team can win US$100,000
But we have a team with no fears...
Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar, Sean, Jui-Ting, Zhen, Huazhong, Xi,
Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao
Let's go over our experience and stories...
Outline
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks
AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of language
Metric: Mean Average Precision
AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of spoken language
AT3
Query: no queries
Target: extract all recurrent segments which are at least 1 second in length
Metric: F-measure
Data Set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3
Xiaodan will talk about this part...
3 Video Retrieval Tasks
VT1
Query: single image (20 queries)
Target: (short) video segments; all the similar segments
Criteria: "visually similar"
Metric: Mean Average Precision
Data Set: 20 categories, multiple labels possible
VT2
Query: short video shot (<10s) (20 queries)
Target: (long) video segments; all the similar segments
Criteria: perceptually similar
Data Set: 10 categories, multiple labels possible
VT3
Query: videos with sound (3~10s), on the order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data Set: 10 (20) categories, including one "others" category
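Mean Average Precision is the metric for most of the retrieval tasks above. The sketch below is one common way to compute it, not the official scoring script of the competition. It assumes each query's result list is given as booleans in rank order, and it normalizes by the number of relevant items that appear in the returned list (some evaluations divide by the total number of relevant items in the database instead).

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: booleans in rank order, True where the returned
    item is relevant (hypothetical input format for illustration).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # precision measured at the rank of each relevant item
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precisions."""
    return sum(average_precision(q) for q in queries) / len(queries)


if __name__ == "__main__":
    results = [[True, False, True], [False, True]]
    print(round(mean_average_precision(results), 4))
```

A list with relevant items at ranks 1 and 3 scores (1/1 + 2/3) / 2, so ranking relevant items earlier raises the score even when the set of returned items is the same.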
20 VT1 Categories
100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices, including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/Ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total: 32508 pseudo key frames, 8486 real key frames
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288
Computation Power
Work Stations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on Trusted-ILLIAC
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extracting all the features after one load
Features for Round 2 - VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, Texture, etc.
Semantic Feature
Features for Round 2 - VT2
Character Detector
Harris corner
Morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
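The query expansion steps above can be sketched with a nearest-neighbor search. This is an illustrative reading, not the team's code: it assumes one feature vector per image, a single category label per development image, squared Euclidean distance, and scoring each evaluation image by its distance to the closest member of the expanded query. The function and argument names are hypothetical.

```python
import numpy as np


def expanded_query_search(query_feat, category, dev_feats, dev_labels,
                          eval_feats, top_k=20):
    """Rank evaluation images against a query expanded with same-category
    development images (names and distance choice are assumptions)."""
    # Step 1: expand the query with all development images of its category.
    same_category = dev_feats[dev_labels == category]
    expanded = np.vstack([query_feat[None, :], same_category])

    # Step 2: squared Euclidean distance from every evaluation image
    # to every member of the expanded query.
    diff = eval_feats[:, None, :] - expanded[None, :, :]
    dists = np.sum(diff ** 2, axis=2)

    # An evaluation image is scored by its nearest expanded-query member.
    score = dists.min(axis=1)
    return np.argsort(score)[:top_k]


if __name__ == "__main__":
    dev_feats = np.array([[0.0, 0.0], [10.0, 10.0]])
    dev_labels = np.array([101, 102])
    eval_feats = np.array([[0.0, 0.5], [10.0, 10.0], [1.0, 1.0], [5.0, 5.0]])
    query = np.array([1.0, 1.0])
    print(expanded_query_search(query, 101, dev_feats, dev_labels,
                                eval_feats, top_k=2))
```

The distances between the evaluation and development sets do not depend on the query, which is presumably why step 0 computes that matching once in advance.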
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
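Step 2 above can be illustrated with the standard mean-only MAP adaptation of a diagonal-covariance GMM. The slides do not say which parameters were adapted or what relevance factor was used, so mean-only adaptation and `relevance=16.0` are assumptions here, and the function name is hypothetical. The kernel and within-class covariance normalization of step 3 are not shown.

```python
import numpy as np


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Adapt UBM means to the patches of one image (mean-only MAP).

    patches: (N, D) patch descriptors of one image
    weights: (K,), means: (K, D), variances: (K, D) describe the UBM
    relevance: relevance factor (assumed value, not from the slides)
    """
    # Log-density of every patch under every diagonal Gaussian: (N, K)
    diff = patches[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))

    # Posterior probability of each component for each patch
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Sufficient statistics: soft counts and per-component patch means
    n_k = post.sum(axis=0)
    patch_mean = (post.T @ patches) / np.maximum(n_k, 1e-10)[:, None]

    # Interpolate between the data mean and the UBM mean
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * patch_mean + (1.0 - alpha) * means


if __name__ == "__main__":
    ubm_means = np.array([[0.0, 0.0], [10.0, 10.0]])
    adapted = map_adapt_means(np.ones((16, 2)), np.array([0.5, 0.5]),
                              ubm_means, np.ones((2, 2)))
    print(adapted)
```

Components that explain many patches move toward the image's data, while components with few patches stay close to the UBM. Stacking the adapted means gives one fixed-length vector per image, which is what makes a pair-wise image distance possible.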
VT1 Performance (2 in 8)
Category: MAP
101 Crowd (>10 people): 0.8419
102 Building with sky as backdrop, clearly visible: 0.977
103 Mobile devices, including handphone/PDA: 0.028
107 Person using computer, both visible: 0.2281
109 Company trademark, including billboard, logo: 0.96
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
115 Food on dishes, plates: 0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side: 0.9783
117 Traffic scene, many cars, trucks, road visible: 0.2901
VT2 Performance (1 in 8)
Category: MAP
202 Talking face with introductory caption: 0.8432
206 Static or minute camera movement, people(s) walking, legs visible: 0.0581
207 Large camera movement, panning left/right, top/down of a scene: 0.7789
208 Movie ending credit: 0.2782
209 Woman monologue (Zhen): 0.9756
Performance of Round 3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Retrieval Target (Video, R=20): VT2 only / AT1 + VT2
202 Face with introductory caption: 1 / 0.03
209 Woman monologue: 0.35 / 0.1
201 People entering door: NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task: Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
Surveillance Event Detection: List of Events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real-time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x,y,t) space
Already available (GPUs)
If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
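The ~5 days figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
# Figures from the slide
HOURS_UPLOADED_PER_MINUTE = 20      # hours of video uploaded each minute
FACEBOOK_PHOTOS = 15e9              # photos in the Facebook image DB

# Assumption (not stated on the slide): 30 frames per second
FRAMES_PER_SECOND = 30

# Frames arriving every minute of wall-clock time
frames_per_minute = HOURS_UPLOADED_PER_MINUTE * 3600 * FRAMES_PER_SECOND

# Wall-clock time until the uploaded frame count matches the photo count
minutes = FACEBOOK_PHOTOS / frames_per_minute
days = minutes / (60 * 24)

print(f"{frames_per_minute:,} frames per minute, {days:.2f} days")
```

At 25 frames per second the same calculation gives a little under 6 days, so the estimate is not very sensitive to the assumed frame rate.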
Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Diagram: Video -> Interest Points -> Features (Motion, Shape) -> Vector Quantization -> Histogram -> Classifier -> Event Label]
Event labels:
• Running
• Pointing
• Object Put
• Cell To Ear
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
[Figure: example detections for Laptev, Dollar, RSMB]
More is Good
[Figure: Interest Point Detection Rates]
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
\[
H_\tau(x,y,t) =
\begin{cases}
\tau & \text{if } D(x,y,t) = 1 \\
\max\bigl(0,\; H_\tau(x,y,t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
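The motion history update can be written as a single per-pixel operation, which is part of why it maps well to parallel hardware. The sketch below assumes the binary motion mask D is already available (for example from thresholded frame differencing, which is not shown); the function name is hypothetical.

```python
import numpy as np


def update_mhi(mhi, motion_mask, tau):
    """One time step of the motion history image.

    mhi: H_tau at time t-1, array of shape (rows, cols)
    motion_mask: boolean array, True where D(x, y, t) = 1
    tau: value assigned to pixels that moved at time t
    """
    # Pixels without motion decay by one, floored at zero
    decayed = np.maximum(0, mhi - 1)
    # Pixels with motion are reset to tau
    return np.where(motion_mask, tau, decayed)


if __name__ == "__main__":
    previous = np.array([[0, 3], [5, 1]])
    moving = np.array([[True, False], [False, False]])
    print(update_mhi(previous, moving, tau=5))
```

Every pixel depends only on its own previous value and its own mask entry, so all pixels can be updated independently.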
[Figure: bar chart of milliseconds per frame (axis 0 to 300) comparing CUDA (feature + distance + argmin), CUDA (distance + argmin), CUDA (distance), and C. Bar labels: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
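The three steps above can be sketched for image gradients as follows. The cell size of 8 pixels, 9 unsigned orientation bins, 2x2-cell blocks, and L2 normalization are common choices assumed here, not values taken from the slides, and the function names are hypothetical. For the optical flow variant, the flow components would replace `gx` and `gy`.

```python
import numpy as np


def cell_histograms(image, cell=8, bins=9):
    """Magnitude-weighted orientation histogram for each cell."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi)
    angle = np.mod(np.arctan2(gy, gx), np.pi)
    bin_idx = np.minimum((angle / np.pi * bins).astype(int), bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            # Each pixel votes for its direction bin with its magnitude
            hist[r, c] = np.bincount(bin_idx[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    return hist


def normalize_blocks(hist, eps=1e-6):
    """L2-normalize every 2x2 block of neighboring cells."""
    rows, cols, bins = hist.shape
    out = np.zeros((rows - 1, cols - 1, 4 * bins))
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = hist[r:r + 2, c:c + 2].ravel()
            out[r, c] = block / np.sqrt(np.sum(block ** 2) + eps)
    return out


if __name__ == "__main__":
    ramp = np.tile(np.arange(16.0), (16, 1))   # brightness grows left to right
    print(normalize_blocks(cell_histograms(ramp)).shape)
```

Each cell and each block is computed from a small local neighborhood, which is the locality the slides point to as a reason these features suit GPUs.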
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of the associated data points
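The two alternating steps of Lloyd's algorithm can be sketched as below. This is a minimal CPU version for illustration: the fixed iteration count, random initialization from the data, and the function name are assumptions, and it forms the full distance matrix in memory, which only works for small inputs.

```python
import numpy as np


def lloyd_kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd's algorithm with squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)

    def assign_points(current_centers):
        # Squared distance from every point to every center: (N, k)
        d2 = ((data[:, None, :] - current_centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(iters):
        # Step 1: for each data point find the closest center
        assign = assign_points(centers)
        # Step 2: move each center to the mean of its associated points
        for j in range(k):
            members = data[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)

    return centers, assign_points(centers)


if __name__ == "__main__":
    points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
    centers, labels = lloyd_kmeans(points, k=2)
    print(centers, labels)
```

After clustering, each feature is represented by the index of its closest center, which is the vector quantization the slide refers to.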
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
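With 1 million features and 1000 centers, the full distance matrix has 10^9 entries (about 4 GB in single precision), which is why it cannot simply be stored. The sketch below shows the idea of reducing as the computation proceeds, on the CPU with NumPy rather than CUDA: distances are computed one chunk at a time and immediately reduced with argmin, so only the nearest-center index per feature is kept. The chunk size and function name are assumptions.

```python
import numpy as np


def nearest_center_chunked(features, centers, chunk=4096):
    """Index of the closest center for every feature, without ever
    holding the full (features x centers) distance matrix."""
    n = len(features)
    nearest = np.empty(n, dtype=np.int64)
    center_sq = (centers ** 2).sum(axis=1)

    for start in range(0, n, chunk):
        block = features[start:start + chunk]
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is the
        # same for every center, so it can be dropped before the argmin.
        partial = center_sq[None, :] - 2.0 * (block @ centers.T)
        # Reduce right away: keep one index per feature, discard distances
        nearest[start:start + chunk] = partial.argmin(axis=1)

    return nearest


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(50, 5))
    cents = rng.normal(size=(7, 5))
    print(nearest_center_chunked(feats, cents, chunk=8))
```

The output shrinks from one value per feature-center pair to one value per feature, which is the role of the reduction operator mentioned above.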
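Clustering Helps
Pairwise Distance Implementation
[Figure: bar chart comparing CUDA and C, axis 0 to 6000]
Given
\[
A = [\,a_1 \;\cdots\; a_m\,], \qquad B = [\,b_1 \;\cdots\; b_n\,]
\]
Compute
\[
\begin{bmatrix}
d(a_1,b_1) & d(a_1,b_2) & \cdots & d(a_1,b_n) \\
d(a_2,b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m,b_1) & \cdots & & d(a_m,b_n)
\end{bmatrix}
\]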
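The slides leave the distance d unspecified. Assuming squared Euclidean distance, the whole m x n matrix can be obtained from one matrix product, which is the form that maps well to optimized CPU and GPU libraries. This NumPy sketch stores the vectors as rows of `A` and `B`; the function name is hypothetical.

```python
import numpy as np


def pairwise_sq_dist(A, B):
    """Squared Euclidean distance between every row of A (m x d)
    and every row of B (n x d); returns an (m x n) matrix."""
    a_sq = (A ** 2).sum(axis=1)[:, None]      # (m, 1)
    b_sq = (B ** 2).sum(axis=1)[None, :]      # (1, n)
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = a_sq - 2.0 * (A @ B.T) + b_sq
    # Round-off can produce tiny negative values; clamp them to zero
    return np.maximum(d2, 0.0)


if __name__ == "__main__":
    A = np.array([[0.0, 0.0], [1.0, 0.0]])
    B = np.array([[0.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
    print(pairwise_sq_dist(A, B))
```

Every output entry is independent of the others and the arithmetic is dominated by the matrix product, so the computation is both data-parallel and compute-bound.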
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Figure, three slides: inputs A and B, shared memory, output C]
Timings on TRECVid 2008 System
[Figure: bar chart, log scale from 1 to 10000 milliseconds, comparing GPU + CPU against CPU for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Bar labels: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Rate of Detection (Low / Medium / High)
Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
Results (2009), with 2008 results in parentheses
Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering, Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet Large Scale Visual Recognition
10000 Classes
Semantic Gap Semantic Gap
Computer
Representation
Semantic Gap
Natural
Language
Representation
Semantic Gap Semantic Gap
Green
Semantic Gap Semantic Gap
Corner
Semantic Gap Semantic Gap
Roof
Semantic Gap Semantic Gap
Ski Slope
Semantic Gap Semantic Gap
Resort
Semantic Gap Semantic Gap
Fun
Holiday
Beautifulhellip
Semantic Gap in Multimedia Semantic Gap in Multimedia
Retrieval Given a description retrieve all ldquorelevantrdquo content from a
database
Parsing Given an input formulate a natural language description
Subtasks
Detection (find ldquothingsrdquo)
Segmentation (find the boundaries of ldquothingsrdquo)
Recognition (assign category)
Retrieval Given a description retrieve all ldquorelevantrdquo content from a
database
Parsing Given an input formulate a natural language description
Subtasks
Detection (find ldquothingsrdquo)
Segmentation (find the boundaries of ldquothingsrdquo)
Recognition (assign category)
Multimedia Analysis Competitions and
Evaluations
Multimedia Analysis Competitions and
Evaluations
Moderate size dataset
Training set with labels
Evaluation set without labels
Constrained problem
Detect well defined actions
Detect words or concepts
Well defined metric
Challenges
Algorithm design
Computation
Moderate size dataset
Training set with labels
Evaluation set without labels
Constrained problem
Detect well defined actions
Detect words or concepts
Well defined metric
Challenges
Algorithm design
Computation
Star Challenge Star Challenge
PART I visual data processing PART I visual data processing
What is Star Challenge What is Star Challenge
Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology
Hosted by the Agency for Science Technology and Research (ASTAR) Singapore
A real-world computer vision task which requires large amounts of computation power
Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology
Hosted by the Agency for Science Technology and Research (ASTAR) Singapore
A real-world computer vision task which requires large amounts of computation power
But low rewards But low rewards
56 teams
from 17 countries
Round 1
8 teams
Round 2
7 teams
Round 3
5 teams Grand Final
in Singapore
No rewards No rewards No rewards
No rewards
Only one team
can win
US$100000
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
But we have a team with no fearshellip But we have a team with no fearshellip
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
But we have a team with no fearshellip But we have a team with no fearshellip
Letrsquos go over our experience and
storieshellip
Letrsquos go over our experience and
storieshellip
Outlines Outlines
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks 3 Audio Retrieval Tasks Task Query Target Metric Data Set
AT1 IPA sequence segments that contain the
query IPA sequence
regardless of its languages
Mean Average Precision
25 hours
monolingual
database in
round1
13 hours
multilingual
database in
round3
AT2 an utterance spoken
by different speakers all segments that contain the
query wordphrasesentence
regardless of its spoken
languages
AT3 No queries extract all recurrent segments
which are at least 1 second in
length
F-measure
Xiaodan will talk about this parthelliphellip
3 Video Retrieval Tasks 3 Video Retrieval Tasks
Task Query Target Criteria Metric Data Set
VT1 Single
Image
20
queries
(short)
Video
Segs
All the
similar
Segs
ldquovisually
similarrdquo
Mean Average Precision
20 categories
multiple labels
possible
VT2 Short
Video
Shot
(lt10s)
20
queries
(long)
Video
Segs
All the
similar
Segs
Perceptually
Similar
10 categories
multiple labels
possible
VT3 Videos
with
sound
(3~10s)
Order
of 10K
Category
number
learning the
common
visual
characteristics
Classification accuracy 10(20)
categories
including one
ldquoothersrdquo
category
20 VT1 Categories 20 VT1 Categories
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
10 Categories for VT2 10 Categories for VT2
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3 5 Categories for VT3
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17,289 frames for VT1 in total
40,994 frames for VT2 in total
  32,508 pseudo key frames, 8,486 real key frames
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10,580 jpg files
Key frames for VT2: 64,546 files in total, including 10,580 jpg files (true key frames) + 53,966 jpg files (pseudo key frames)
Video: 352x288
Computation Power
Work Stations in IFP
  10 servers, 2~4 CPUs each, 36 CPUs in total
  IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
  Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
  Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
  Global Feature 2: 2 hours (C)
  Global Feature 1: 2 hours (C)
  Patch-based Feature 1: 2 hours (C)
  Patch-based Feature 2: 5 hours (MATLAB)
  Semantic Feature 1: 24 hours (MATLAB)
  Semantic Feature 2: 3 hours (C)
  Semantic Feature 3: 4 hours (C)
  Motion Feature 1: 24 hours (MATLAB)
  Motion Feature 2: 3 hours on Trusted-ILLIAC
Classifier Training
  Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
  Classifier 2: 20 minutes
  Classifier 3: less than 10 minutes
Possible Accelerations for Video
MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extracting all the features after one load
Features for Round 2 - VT1
Image Features
  SIFT
  HOG
  GIST
  APC
  LBP
  Color, texture, etc.
Semantic Feature
Features for Round 2 - VT2
Character Detector
  Harris corner
  Morphological operations
Optical Flow
  Lucas-Kanade on spatial intensity gradient
Gender recognition
  SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101, 103 (26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
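The query expansion steps above can be sketched in a few lines. This is only an illustration under stated assumptions: features are plain lists of numbers, "matching" is taken to be squared Euclidean distance, and an evaluation item is scored by its nearest neighbor in the expanded query. The function names and the data layout are hypothetical, not the team's actual code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve_with_expansion(query, category, dev_set, eval_set, top_k=20):
    """Rank evaluation items against a category-expanded query.

    dev_set  -- list of (feature, category) pairs from the labeled development data
    eval_set -- list of features from the unlabeled evaluation data
    Returns the indices of the top_k evaluation items, best match first.
    """
    # Step 1: expand the query with every development image of the same category.
    expanded = [query] + [feat for feat, cat in dev_set if cat == category]
    # Step 2: score each evaluation item by its nearest expanded-query image.
    scores = [min(sq_dist(item, q) for q in expanded) for item in eval_set]
    ranked = sorted(range(len(eval_set)), key=scores.__getitem__)
    return ranked[:top_k]
```

With a query far from most positives, the expansion lets development images of the same category pull in evaluation items that the single query image alone would rank poorly.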
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
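A rough sketch of the MAP estimation step, for readers who have not seen UBM adaptation before. The slides do not say which parameters were adapted; the code below assumes the common mean-only relevance-MAP form with diagonal covariances, and the relevance factor of 16 is an assumed default, not a value from the slides. All names are hypothetical.

```python
import math


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Adapt the UBM component means to the patches of one image.

    weights, means, variances describe a diagonal-covariance GMM (the UBM).
    Only the means are adapted (an assumption); weights and variances are kept.
    """
    n_comp, dim = len(weights), len(means[0])
    counts = [0.0] * n_comp                       # soft count per component
    sums = [[0.0] * dim for _ in range(n_comp)]   # posterior-weighted patch sums
    for x in patches:
        # Log of weight times Gaussian density, per component.
        log_p = []
        for k in range(n_comp):
            lp = math.log(weights[k])
            for d in range(dim):
                var = variances[k][d]
                lp -= 0.5 * (math.log(2 * math.pi * var)
                             + (x[d] - means[k][d]) ** 2 / var)
            log_p.append(lp)
        # Posterior responsibilities, computed stably in the log domain.
        peak = max(log_p)
        unnorm = [math.exp(lp - peak) for lp in log_p]
        total = sum(unnorm)
        for k in range(n_comp):
            gamma = unnorm[k] / total
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * x[d]
    # Interpolate between the data mean and the UBM mean:
    # (sum + r * mu) / (n + r) equals alpha * data_mean + (1 - alpha) * mu
    # with alpha = n / (n + r).
    return [
        [(sums[k][d] + relevance * means[k][d]) / (counts[k] + relevance)
         for d in range(dim)]
        for k in range(n_comp)
    ]
```

Components that see few patches stay close to the UBM, so every image is described by a vector of the same length, which is what makes a pair-wise image distance straightforward to define afterwards.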
VT1 Performance (2 in 8)
Category: MAP
101 Crowd (>10 people): 0.8419
102 Building with sky as backdrop, clearly visible: 0.977
103 Mobile devices, including handphone/PDA: 0.028
107 Person using computer, both visible: 0.2281
109 Company trademark, including billboard, logo: 0.96
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
115 Food on dishes, plates: 0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side: 0.9783
117 Traffic scene, many cars, trucks, road visible: 0.2901
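For reference, mean average precision (MAP) can be computed as in the sketch below. It uses the textbook definition over a ranked list of relevant/irrelevant flags; the exact variant used by the Star Challenge organizers (for example, any cutoff on the list length) is not given in the slides, so treat this as an assumption.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Average precision of one ranked result list.

    ranked_relevance -- sequence of truthy/falsy flags, best-ranked result first
    num_relevant     -- total relevant items in the database; defaults to the
                        number of relevant items found in the list
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    denom = num_relevant if num_relevant is not None else hits
    return precision_sum / denom if denom else 0.0


def mean_average_precision(result_lists):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in result_lists) / len(result_lists)
```

A value near 1 means the relevant segments are concentrated at the top of the returned list; a value near 0 means they are mostly missing or ranked late.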
VT2 Performance (1 in 8)
Category: MAP
202 Talking face with introductory caption: 0.8432
206 Static or minute camera movement, people walking, legs visible: 0.0581
207 Large camera movement, panning left/right, top/down of a scene: 0.7789
208 Movie ending credit: 0.2782
209 Woman monologue (Zhen): 0.9756
Performance of Round 3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Retrieval Target: VT2 only, AT1 + VT2
Video, R=20
202 Face with introductory caption: 1, 0.03
209 Woman monologue: 0.35, 0.1
201 People entering door: NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference. TRECVid: Video Retrieval Workshop.
Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset: surveillance footage from London Gatwick Airport, 5 stationary cameras. Training set: 100 hours. Testing set: 44 hours. Frame size: 700x540 px. Dataset size: ~350 GB. Frames: ~12 million.
Surveillance Event Detection
List of Events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real-time (or faster) performance requirements
  Surveillance
  Soft biometrics
  Multimedia indexing
Visual Computing is here
  Imaging and Photogrammetry
  Pattern Recognition and Statistical Learning
  Object Detection and Recognition
  Dynamic Vision
  Interactive and Internet Vision
Working with ViVid
Why parallel
Massive amounts of data
  20 hours of video uploaded to YouTube every minute
  15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
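The "~5 days" figure can be checked with a quick back-of-the-envelope calculation. The slides do not state a frame rate; 30 frames per second is assumed here, and the function name is just for illustration.

```python
def days_to_match(photo_count, video_hours_per_minute, fps):
    """Days of uploads needed for video frames to equal photo_count images."""
    # Frames contained in the video uploaded during one wall-clock minute.
    frames_per_minute = video_hours_per_minute * 3600 * fps
    # Scale to one day of uploads.
    frames_per_day = frames_per_minute * 60 * 24
    return photo_count / frames_per_day


# 15 billion photos, 20 hours of video per minute, assumed 30 fps.
print(round(days_to_match(15e9, 20, 30), 1))
```

At 30 fps the uploads amount to roughly 3.1 billion frames per day, so 15 billion photos are matched in a little under five days; at 25 fps the same calculation gives a little under six.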
Image/Video Processing
  Video Decoder
  2D/3D Convolution
  2D/3D Fourier Transform
  Optical Flow
Feature Extraction
  Motion Descriptor (Efros et al.)
  Motion History Descriptor
  Random Video Interest Points
  Histograms of Oriented Gradients / Optical Flow
Analysis
  Vector Quantization
  SVM Classifier Evaluation
ViVid - Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[System diagram; blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev
  3D Harris Corner Detector
  Corners
Dollar
  Space-Time Gabor
  Corners
  Periodic Motion
RSMB
  Random Sampling of the Motion Boundary
  Motion
[Figure panels: Laptev, Dollar, RSMB]
More is Good
[Figure: Interest Point Detection Rates]
Video Features
Descriptors of information relevant to the task
  Motion
  Shape
  Appearance
Computationally intensive
  Development
  Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
$$H_\tau(x,y,t) = \begin{cases} \tau & \text{if } D(x,y,t) = 1 \\ \max\bigl(0,\ H_\tau(x,y,t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
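A minimal sketch of one motion history image update: a pixel where motion is detected at time t is set to tau, and every other pixel decays by one, never going below zero. Images are represented as nested lists for simplicity, and the function name is hypothetical.

```python
def update_mhi(history, motion_mask, tau):
    """Return the motion history image at time t.

    history     -- H_tau at time t-1, as a list of rows
    motion_mask -- D at time t; truthy where motion was detected
    tau         -- value assigned to pixels that are moving right now
    """
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, m_row)]
        for h_row, m_row in zip(history, motion_mask)
    ]
```

Recently moving pixels therefore carry high values and older motion fades out, so a single image summarizes where and how recently motion occurred over the last tau frames.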
[Bar chart, milliseconds per frame (axis 0 to 300): C, CUDA (distance), CUDA (distance + argmin), CUDA (feature + distance + argmin). Extracted bar labels, decimal points lost: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
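The three steps above can be sketched as follows for image gradients. This is a simplified illustration, not the descriptor as implemented in ViVid: the cell size, bin count, unsigned orientation, hard binning and L2 normalization are all assumed choices. For the optical flow variant, the per-pixel flow vector (u, v) would take the place of (gx, gy).

```python
import math


def cell_histograms(image, cell=4, bins=8):
    """Magnitude-weighted orientation histogram for each cell x cell region."""
    rows, cols = len(image), len(image[0])
    n_cy, n_cx = rows // cell, cols // cell
    hists = [[[0.0] * bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(n_cy * cell):
        for x in range(n_cx * cell):
            # Finite differences, clamped at the image border.
            gx = image[y][min(x + 1, cols - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, rows - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, pi), mapped to a bin index.
            angle = math.atan2(gy, gx) % math.pi
            b = min(int(angle / math.pi * bins), bins - 1)
            hists[y // cell][x // cell][b] += magnitude
    return hists


def normalize_block(cells, eps=1e-6):
    """L2-normalize the concatenated histograms of neighboring cells."""
    flat = [v for hist in cells for v in hist]
    norm = math.sqrt(sum(v * v for v in flat) + eps * eps)
    return [v / norm for v in flat]
```

Every pixel is processed independently of the others until the histogram accumulation, which is the kind of locality that makes this descriptor a natural fit for a GPU.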
K-Means Clustering
Vector quantization
  Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
  For each data point, find the closest center
  Update the center to be the mean of the associated data points
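Lloyd's algorithm as described above, written out as a small sketch. Squared Euclidean distance, a fixed iteration count and the handling of empty clusters (the center is left where it is) are assumptions made for the example.

```python
def lloyd_kmeans(points, centers, iterations=10):
    """Run Lloyd's algorithm from the given initial centers."""
    for _ in range(iterations):
        # Assignment step: group each point with its closest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        centers = [
            [sum(col) / len(group) for col in zip(*group)] if group else center
            for group, center in zip(groups, centers)
        ]
    return centers
```

Once the centers are fixed, vector quantization is just the assignment step on its own: each feature is replaced by the index of its closest center.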
K-Means
Relies heavily on pairwise distance
Large data sets
  1 million features with 100-200 dimensions
  1000 centers
Cannot fit the output in GPU memory
  Will need to reduce as the computation proceeds
  Need an efficient reduction operator
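To see why the output does not fit, and what reducing as the computation proceeds means, consider the sketch below. It assumes 4-byte floating point distances and squared Euclidean distance; the chunk size and function names are illustrative. It is plain Python that mimics the access pattern, not GPU code.

```python
def full_matrix_bytes(n_features, n_centers, bytes_per_value=4):
    """Memory needed to hold every feature-to-center distance at once."""
    return n_features * n_centers * bytes_per_value


def nearest_center_chunked(features, centers, chunk_size=1024):
    """Closest-center index per feature, without storing the full matrix."""
    labels = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distances for this chunk only: len(chunk) x len(centers) values.
        dists = [
            [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centers]
            for f in chunk
        ]
        # Reduce each row to the index of its minimum before moving on.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in dists)
    return labels


# 1 million features against 1000 centers, 4 bytes per distance.
print(full_matrix_bytes(1_000_000, 1000) / 1e9, "GB")
```

Under these assumptions the full distance matrix would take 4 GB, while only one label per feature has to be kept, which is why the argmin reduction is fused with the distance computation.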
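Clustering Helps
Pairwise Distance Implementation
[Bar chart: CUDA vs. C implementations, axis 0 to 6000; bar values not recoverable]
Given $A = [a_1, \ldots, a_m]$ and $B = [b_1, \ldots, b_n]$, compute
$$\begin{bmatrix} d(a_1,b_1) & d(a_1,b_2) & \cdots & d(a_1,b_n) \\ d(a_2,b_1) & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ d(a_m,b_1) & \cdots & \cdots & d(a_m,b_n) \end{bmatrix}$$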
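A small sketch of the pairwise distance matrix for two sets of vectors. The slides leave the distance d unspecified; squared Euclidean distance is assumed here, written with the expansion ||a||^2 + ||b||^2 - 2 a.b, because the cross term is a matrix product and that form is a common way to structure the computation on parallel hardware.

```python
def pairwise_sq_dist(A, B):
    """Return the len(A) x len(B) matrix of squared Euclidean distances."""
    # Squared norms are computed once per vector and reused.
    a_sq = [sum(x * x for x in a) for a in A]
    b_sq = [sum(x * x for x in b) for b in B]
    return [
        [
            a_sq[i] + b_sq[j] - 2 * sum(x * y for x, y in zip(a, b))
            for j, b in enumerate(B)
        ]
        for i, a in enumerate(A)
    ]
```

Each output entry depends only on one row of A and one row of B, so all m x n entries can be computed independently.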
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Figure: inputs A and B, shared memory, output C]
Timings on TRECVid 2008 System
[Bar chart, milliseconds on a log scale (1 to 10000), GPU + CPU vs. CPU, for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Extracted bar labels, decimal points possibly lost: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Rate of Detection (Low / Medium / High) by dictionary size and histogramming method
Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
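As an illustration of the "Raw" and "Norm" rows, the sketch below builds a codeword histogram from the dictionary indices assigned to the features of one video segment. The slides do not define the normalization; dividing by the total count (L1 normalization) is an assumption, and the "Mt Inf" variant is not covered because the slides give no details about it.

```python
from collections import Counter


def codeword_histogram(labels, dictionary_size, normalize=False):
    """Histogram of dictionary indices, optionally normalized to sum to 1."""
    counts = Counter(labels)
    hist = [float(counts.get(k, 0)) for k in range(dictionary_size)]
    total = sum(hist)
    if normalize and total > 0:
        # Assumed L1 normalization: removes dependence on the feature count.
        hist = [v / total for v in hist]
    return hist
```

The raw histogram grows with the number of interest points found in a segment, whereas the normalized one only reflects their relative frequencies.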
Results (2009)
Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Values in parentheses are the 2008 results.
Conclusions
Some practical problems are very hard to solve
  Fusion of many different approaches
Take advantage of all available hardware
  Cloud
  GPUs
Contests/Evaluations Experience
  Working with realistic data
  Engineering, programming
  Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet Large Scale Visual Recognition
10,000 Classes
10 Categories for VT2 10 Categories for VT2
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3 5 Categories for VT3
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
Examples of Images Examples of Images
More samples More samples
Evaluation Video Data of Round2 Evaluation Video Data of Round2
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
Evaluation Video Data of Round3 Evaluation Video Data of Round3
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Computation Powers Computation Powers
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Time Cost for Video Tasks Time Cost for Video Tasks
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Possible Accelerations for Video Possible Accelerations for Video
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Features for Round2- VT1 Features for Round2- VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Features for Round2-VT2 Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE Grand Unified Feature
Extractor
GUFE Grand Unified Feature
Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Observations Observations 1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1in7) Performance of Round3 (1in7)
Task 2 (VT1)
Target Estimated MAP (R=20)
101 Crowd (gt10 people) 064
102 Building with sky as backdrop clearly visible 1
107 Person using Computer both visible 07
112 Closeup of hand eg using mouse writing etc 0527
116 Face closeup occupying about 34 of screen frontal or side 1
Task3 (AT1 + VT2)
Retrieval Target VT2 only AT1 + VT2
Video R=20
202 face with introductory caption 1 003
209 women monolog 035 01
201 People entering door NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Detection of Opposing Flow Event
[Diagram: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
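The ~5 day figure can be checked with a few lines of arithmetic. This is a sketch using the two numbers from the slide plus an assumed frame rate of 30 frames per second, which the slide does not state.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
ASSUMED_FPS = 30                 # assumption: not stated in the slide


def days_to_match(photos=FACEBOOK_PHOTOS,
                  hours_per_minute=HOURS_UPLOADED_PER_MINUTE,
                  fps=ASSUMED_FPS):
    """Days of uploads needed for the frame count to equal the photo count."""
    # Seconds of video uploaded per day of wall-clock time.
    video_seconds_per_day = hours_per_minute * 3600 * 60 * 24
    frames_per_day = video_seconds_per_day * fps
    return photos / frames_per_day


print(round(days_to_match(), 1))
```

At 30 frames per second this comes to roughly 3.1 billion frames per day, so the 15 billion photos are matched in a little under five days.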
Image / Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Diagram: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label]
Event labels: Running, Pointing, Object Put, Cell To Ear
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
[Figure panels: Laptev, Dollar, RSMB]
More is Good
[Figure: Interest Point Detection Rates]
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
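The motion history update above can be sketched in NumPy as follows. The function names are hypothetical, and computing the motion mask D by thresholded frame differencing is an assumption: the slide does not say how D is obtained.

```python
import numpy as np


def motion_mask(frame, prev_frame, threshold=15.0):
    """Binary mask D(x, y, t): 1 where the frame changed noticeably.

    Assumption: simple absolute frame differencing with a fixed threshold.
    """
    diff = np.abs(np.asarray(frame, dtype=np.float64)
                  - np.asarray(prev_frame, dtype=np.float64))
    return diff > threshold


def update_mhi(h_prev, mask, tau):
    """One step of the motion history image recurrence.

    Pixels with motion are reset to tau; all others decay by 1, floored at 0.
    """
    decayed = np.maximum(0, h_prev - 1)
    return np.where(mask, tau, decayed)
```

Each pixel is updated independently of its neighbors, which is the kind of local operation the slides describe as mapping well to GPUs.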
[Bar chart: milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin) and CUDA (feature + distance + argmin)]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on direction and magnitude
• Normalize over neighboring regions
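The three steps above can be sketched for image gradients as follows. This is a simplified illustration, not the ViVid implementation: the cell size of 8 pixels, 9 unsigned orientation bins, 2x2-cell blocks and L2 normalization are all assumed parameters, and the function names are hypothetical.

```python
import numpy as np


def cell_orientation_histograms(image, cell=8, bins=9):
    """Steps 1 and 2: per-cell histograms of gradient direction,
    weighted by gradient magnitude."""
    image = np.asarray(image, dtype=np.float64)
    gy, gx = np.gradient(image)                # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bin_idx = np.minimum((angle / np.pi * bins).astype(int), bins - 1)
    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell),
                  slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=magnitude[sl].ravel(),
                                     minlength=bins)
    return hist


def normalize_blocks(hist, eps=1e-6):
    """Step 3: L2-normalize each 2x2 block of neighboring cells."""
    rows, cols, _ = hist.shape
    blocks = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = hist[r:r + 2, c:c + 2].ravel()
            blocks.append(block / np.sqrt(np.sum(block ** 2) + eps ** 2))
    return np.concatenate(blocks) if blocks else np.zeros(0)
```

For optical flow the same code applies with the flow components in place of gx and gy.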
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
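The two alternating steps of Lloyd's algorithm can be sketched in NumPy as follows. This is a plain CPU illustration: initializing the centers by sampling k data points and running a fixed number of iterations are assumptions, and the function name is hypothetical.

```python
import numpy as np


def lloyd_kmeans(data, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and mean update."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=np.float64)
    # Assumption: initialize centers with k distinct data points.
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(iterations):
        # Step 1: for each data point, find the closest center.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its assigned points.
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    return centers, labels
```

The assignment step builds the full point-to-center distance table, which is why pairwise distance dominates the cost.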
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
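To see why the output does not fit, consider the full distance table for 1 million features and 1000 centers: it holds 10^9 values, about 4 GB in single precision. A sketch of reducing as the computation proceeds follows, written in NumPy rather than CUDA. It computes distances one chunk at a time and keeps only the argmin per feature. The chunk size and function names are assumptions for illustration.

```python
import numpy as np


def output_bytes(num_features, num_centers, bytes_per_value=4):
    """Size of the full distance table (assumption: 4-byte floats)."""
    return num_features * num_centers * bytes_per_value


def nearest_center_chunked(features, centers, chunk=4096):
    """Index of the closest center per feature, without storing all distances."""
    features = np.asarray(features, dtype=np.float32)
    centers = np.asarray(centers, dtype=np.float32)
    center_sq = (centers ** 2).sum(axis=1)
    labels = np.empty(len(features), dtype=np.int64)
    for start in range(0, len(features), chunk):
        block = features[start:start + chunk]
        # ||a||^2 is constant within a row, so it can be dropped for argmin.
        d = center_sq[None, :] - 2.0 * (block @ centers.T)
        # Reduce immediately: keep one index per feature, discard the distances.
        labels[start:start + chunk] = d.argmin(axis=1)
    return labels
```

Only a chunk-by-centers block of distances exists at any time, so memory use is set by the chunk size rather than by the number of features.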
Clustering Helps
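Pairwise Distance Implementation
[Bar chart: CUDA vs. C, axis 0 to 6000]
Given:
$$A = [a_1, \ldots, a_m], \qquad B = [b_1, \ldots, b_n]$$
Compute:
$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ d(a_m, b_1) & \cdots & \cdots & d(a_m, b_n) \end{bmatrix}$$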
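A minimal NumPy sketch of this computation follows, assuming d is the squared Euclidean distance (the slide does not specify the metric) and that the rows of A and B are the vectors a_i and b_j. It uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so the bulk of the work becomes a single matrix product. The function name is hypothetical.

```python
import numpy as np


def pairwise_sq_dist(a, b):
    """m x n table of squared Euclidean distances between rows of a and b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a_sq = (a ** 2).sum(axis=1)[:, None]   # shape (m, 1)
    b_sq = (b ** 2).sum(axis=1)[None, :]   # shape (1, n)
    d = a_sq + b_sq - 2.0 * (a @ b.T)      # shape (m, n)
    return np.maximum(d, 0.0)              # clamp tiny negatives from rounding
```

Every entry of the result is independent of the others, and the computation is dominated by arithmetic on small tiles of A and B.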
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram over three slides: shared memory, matrices A, B and C]
Timings on TRECVid 2008 System
[Bar chart, logarithmic axis from 1 to 10000 milliseconds: Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction and Pairwise Distance, comparing GPU + CPU against CPU]
Benchmark
Dictionary Building Strategies
                                        Rate of Detection
Dictionary Size  Histogramming Method   Low    Medium  High
1000             Raw                    0.681  0.804   0.844
1000             Norm                   0.708  0.799   0.840
1000             Mt Inf                 0.594  0.804   0.848
500              Raw                    0.675  0.792   0.833
500              Norm                   0.701  0.791   0.825
500              Mt Inf                 0.626  0.783   0.819
200              Raw                    0.671  0.772   0.811
200              Norm                   0.701  0.779   0.818
200              Mt Inf                 0.614  0.720   0.756
Results (2009)
(2008 results in parentheses)
Event         True Positives  False Alarm  Miss  Min DCR
Pointing      13 (57)         225 (2505)   1050  1.006
Cell To Ear   0 (8)           58 (4005)    194   1.060
Person Runs   1 (0)           38 (314)     106   0.997
Object Put    1 (21)          190 (2703)   620   1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations: Experience
Working with realistic data
Engineering / Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet Large Scale Visual Recognition
10,000 classes
Star Challenge
PART I: visual data processing
What is Star Challenge?
Competition to Develop World's Next-Generation Multimedia Search Technology
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
A real-world computer vision task which requires large amounts of computation power
But low rewards
56 teams from 17 countries -> Round 1 -> 8 teams -> Round 2 -> 7 teams -> Round 3 -> 5 teams -> Grand Final in Singapore
No rewards along the way
Only one team can win US$100,000
But we have a team with no fears...
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
Let's go over our experience and stories...
Outline
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks
Task AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of language
Metric: Mean Average Precision
Task AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of spoken language
Metric: Mean Average Precision
Task AT3
Query: no queries
Target: extract all recurrent segments which are at least 1 second in length
Metric: F-measure
Data set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3
Xiaodan will talk about this part...
3 Video Retrieval Tasks
Task VT1
Query: single image (20 queries)
Target: (short) video segments; all the similar segments
Criteria: "visually similar"
Metric: Mean Average Precision
Data set: 20 categories, multiple labels possible
Task VT2
Query: short video shot (<10 s) (20 queries)
Target: (long) video segments; all the similar segments
Criteria: perceptually similar
Metric: Mean Average Precision
Data set: 10 categories, multiple labels possible
Task VT3
Query: videos with sound (3~10 s), on the order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data set: 10 (20) categories, including one "others" category
20 VT1 Categories
100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices, including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (> 2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/Ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total: 32508 pseudo key frames + 8486 real key frames
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288
Computation Power
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on Trusted-ILLIAC
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Port MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
Features for Round 2 - VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round 2 - VT2
Character detector: Harris corner, morphological operations
Optical flow: Lucas-Kanade on spatial intensity gradient
Gender recognition: SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on NN
Feature normalization/combination
Result visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103 (26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
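The query expansion steps above can be sketched as follows. The slide does not say how the expanded query is scored against the evaluation set, so ranking each evaluation image by its squared Euclidean distance to the nearest expanded-query image is an assumption, as are the function and parameter names.

```python
import numpy as np


def expanded_query_search(dev_features, dev_labels, eval_features,
                          category, top_k=20):
    """Rank evaluation images against all development images of one category."""
    dev_features = np.asarray(dev_features, dtype=np.float64)
    eval_features = np.asarray(eval_features, dtype=np.float64)
    # Step 1: expand the query to every development image with that category.
    expanded = dev_features[np.asarray(dev_labels) == category]
    if len(expanded) == 0:
        return np.array([], dtype=int)
    # Step 2: score each evaluation image by its nearest expanded-query image
    # (assumption: squared Euclidean distance, minimum over the expanded set).
    d = ((eval_features[:, None, :] - expanded[None, :, :]) ** 2).sum(axis=2)
    score = d.min(axis=1)
    # Output: indices of the top_k best-scoring evaluation images.
    return np.argsort(score, kind="stable")[:top_k]
```

The distance table between evaluation and development features does not depend on the query, which is why it can be computed once in the preprocessing step.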
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP (maximum a posteriori) estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pairwise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1in7) Performance of Round3 (1in7)
Task 2 (VT1)
Target Estimated MAP (R=20)
101 Crowd (gt10 people) 064
102 Building with sky as backdrop clearly visible 1
107 Person using Computer both visible 07
112 Closeup of hand eg using mouse writing etc 0527
116 Face closeup occupying about 34 of screen frontal or side 1
Task3 (AT1 + VT2)
Retrieval Target VT2 only AT1 + VT2
Video R=20
202 face with introductory caption 1 003
209 women monolog 035 01
201 People entering door NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid Working with ViVid
Why parallel Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days
Image Video Processing
Video Decoder
2D3D Convolution
2D3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients Optical Flow
Analysis Vector Quantization
SVM Classifier Evaluation
ViVid ndash Video Computer Vision on Graphics Processors
Download
httpgithubcommertdikmenViVid
TRECVid 2008 System
TRECVid 2009 System TRECVid 2009 System
Features Video
Motion
Shape
Classifier
Event Label
bull Running
bull Pointing
bull Object Put
bull Cell To Ear
Vector
Quantization
Histogram
Interest Points
Video Interest Point Detectors Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
Dollar
RSMB
More is Good More is Good
Interest Point Detection Rates
Video Features Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)
Motion History Images
(Bobbick amp Davis 2001)
Motion History Images
(Bobbick amp Davis 2001)
otherwise
1t)yD(xif
1)1)ty(xHmax(0
τt)y(xH
τ
τ
133
438
51
251
0 50 100 150 200 250 300
CUDA (feature + distance + argmin)
CUDA (distance + argmin)
CUDA (distance)
C
milliseconds per frame
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
K-Means Clustering K-Means Clustering
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
K-Means K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Clustering Helps Clustering Helps
Pairwise Distance Implementation Pairwise Distance Implementation
0 2000 4000 6000
CUDA
C
)bd(a)bd(a
)bd(a
)bd(a)bd(a)bd(a
nm1m
12
n11111
n1
m1
bbB
aaA
Compute Given
CPU vs GPU CPU vs GPU
Algorithmic properties that map well to GPUs
1 Independent and highly data local
computations
2Compute bound
3Little branch divergence
Pairwise Distance Computation on the GPU Pairwise Distance Computation on the GPU
Shared
Memory
A B
C
Pairwise Distance Computation Pairwise Distance Computation
A B
C
Pairwise Distance Computation Pairwise Distance Computation
A B
C
Timings on TRECVid 2008 System Timings on TRECVid 2008 System
53 79
23
240 150
53 79
3030 4947
1
10
100
1000
10000
Fetch
Frame
Optical
Flow
Transfer to
GPU
Feature
Extraction
Pairwise
Distance
millise
co
nd
s
GPU + CPU CPU
Benchmark Benchmark
Dictionary Building Strategies Dictionary Building Strategies
Dictionary Size
Histogramming Method Rate of Detection
Low Medium High
1000 Raw 0681 0804 0844
Norm 0708 0799 0840
Mt Inf 0594 0804 0848
500 Raw 0675 0792 0833
Norm 0701 0791 0825
Mt Inf 0626 0783 0819
200 Raw 0671 0772 0811
Norm 0701 0779 0818
Mt Inf 0614 0720 0756
Results (2009) Results (2009)
True Positives False
Alarm
Miss Min DCR
Pointing 13 225 1050 1006
Cell To Ear 0 58 194 1060
Person Runs 1 38 106 0997
Object Put 1 190 620 1020
Results (2009) Results (2009)
True Positives False
Alarm
Miss Min DCR
Pointing 13 (57) 225 (2505) 1050 1006
Cell To Ear 0 (8) 58 (4005) 194 1060
Person Runs 1 (0) 38 (314) 106 0997
Object Put 1 (21) 190 (2703) 620 1020
(2008 Results)
Conclusions Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
ContestsEvaluations Experience
Working with realistic data
Engineering Programming
Tight schedule streamlined development
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
ContestsEvaluations Experience
Working with realistic data
Engineering Programming
Tight schedule streamlined development
Examples of Evaluations Examples of Evaluations
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
Semantic Gap
Roof
Semantic Gap
Ski Slope
Semantic Gap
Resort
Semantic Gap
Fun
Holiday
Beautiful…
Semantic Gap in Multimedia
Retrieval: Given a description, retrieve all “relevant” content from a database
Parsing: Given an input, formulate a natural language description
Subtasks
Detection (find “things”)
Segmentation (find the boundaries of “things”)
Recognition (assign category)
Multimedia Analysis Competitions and Evaluations
Moderate size dataset
Training set with labels
Evaluation set without labels
Constrained problem
Detect well-defined actions
Detect words or concepts
Well-defined metric
Challenges
Algorithm design
Computation
Star Challenge
PART I: visual data processing
What is Star Challenge?
Competition to Develop World’s Next-Generation Multimedia Search Technology
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
A real-world computer vision task which requires large amounts of computation power
But low rewards
56 teams from 17 countries
Round 1
8 teams
Round 2
7 teams
Round 3
5 teams
Grand Final in Singapore
No rewards
Only one team can win US$100,000
But we have a team with no fears…
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
Let’s go over our experience and stories…
Outline
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks
AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of its language
Metric: Mean Average Precision
AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of its spoken language
Metric: Mean Average Precision
AT3
Query: no queries
Target: extract all recurrent segments which are at least 1 second in length
Metric: F-measure
Data Set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3
Xiaodan will talk about this part…
3 Video Retrieval Tasks
VT1
Query: single image (20 queries)
Target: (short) video segs; all the similar segs
Criteria: “visually similar”
Metric: Mean Average Precision
Data Set: 20 categories, multiple labels possible
VT2
Query: short video shot (<10s) (20 queries)
Target: (long) video segs; all the similar segs
Criteria: perceptually similar
Metric: Mean Average Precision
Data Set: 10 categories, multiple labels possible
VT3
Query: videos with sound (3~10s), order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data Set: 10 (20) categories, including one “others” category
20 VT1 Categories
100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices, including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/Ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given, and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given, and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames, 8486 real key frames
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352×288
Computation Power
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8G 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Port MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
Features for Round 2 - VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round 2 - VT2
Character Detector
Harris corner
Morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly
e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
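The query-expansion steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the team's code: features are plain lists of floats, the distance is Euclidean, an evaluation image is scored by its distance to the closest image in the expanded query, and the function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def expand_query(query_feat, category, dev_feats, dev_labels):
    """Step 1: add every development image that carries the query's category."""
    expanded = [query_feat]
    expanded += [f for f, labels in zip(dev_feats, dev_labels) if category in labels]
    return expanded

def search(expanded, eval_feats, top_k):
    """Step 2: rank evaluation images by distance to the closest expanded-query image."""
    scored = []
    for idx, feat in enumerate(eval_feats):
        best = min(euclidean(feat, q) for q in expanded)  # assumed scoring rule
        scored.append((best, idx))
    scored.sort()
    return [idx for _, idx in scored[:top_k]]
```

Labels are modeled as sets because the task description allows multiple labels per key frame.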
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
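The MAP estimation step can be illustrated with a small sketch. The slides do not say which parameters were adapted or how, so the choices here are assumptions: only the component means are adapted, covariances are diagonal, and the relevance factor of 16 is a conventional default rather than a value from the source. All function names are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def responsibilities(x, weights, means, variances):
    """Posterior probability of each UBM component given patch x."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Shift the UBM means toward one image's patches (mean-only MAP adaptation)."""
    k, dim = len(means), len(means[0])
    counts = [0.0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for x in patches:
        post = responsibilities(x, weights, means, variances)
        for j in range(k):
            counts[j] += post[j]
            for d in range(dim):
                sums[j][d] += post[j] * x[d]
    adapted = []
    for j in range(k):
        alpha = counts[j] / (counts[j] + relevance)  # more data, more trust in the image
        adapted.append([
            alpha * (sums[j][d] / counts[j] if counts[j] > 0 else means[j][d])
            + (1.0 - alpha) * means[j][d]
            for d in range(dim)
        ])
    return adapted
```

Components that see few patches from an image stay close to the UBM, so every image ends up described by a vector of the same size.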
VT1 Performance (2 in 8)
Category: MAP
• 101 Crowd (>10 people): 0.8419
• 102 Building with sky as backdrop, clearly visible: 0.977
• 103 Mobile devices, including handphone/PDA: 0.028
• 107 Person using computer, both visible: 0.2281
• 109 Company trademark, including billboard, logo: 0.96
• 112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
• 113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
• 115 Food on dishes, plates: 0.2285
• 116 Face close-up, occupying about 3/4 of screen, frontal or side: 0.9783
• 117 Traffic scene, many cars, trucks, road visible: 0.2901
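The scores above are Mean Average Precision values. As a reminder of how this metric is commonly computed, here is a minimal sketch; the exact variant used by the challenge is not given in the slides, so the optional `cutoff` parameter and the normalization choice are assumptions.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=None):
    """Average the precision measured at each rank where a relevant item appears."""
    if cutoff is not None:
        ranked_ids = ranked_ids[:cutoff]
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the number of relevant items that could have been retrieved.
    denom = len(relevant) if cutoff is None else min(len(relevant), cutoff)
    return precision_sum / denom if denom else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A perfect ranking scores 1; pushing relevant segments further down the list lowers the score.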
VT2 Performance (1 in 8)
Category: MAP
• 202 Talking face with introductory caption: 0.8432
• 206 Static or minute camera movement, people(s) walking, legs visible: 0.0581
• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789
• 208 Movie ending credit: 0.2782
• 209 Woman monologue Zhen: 0.9756
Performance of Round 3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Video R=20
Retrieval Target                        VT2 only   AT1 + VT2
202 face with introductory caption      1          0.03
209 women monolog                       0.35       0.1
201 People entering door                NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Shot Boundary Detection
Copy Detection
Video Search
Summarization
High Level Feature Extraction
Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
Surveillance Event Detection
List of Events
1 Cell to ear
2 Embrace
3 Object Put
4 Opposing flow
5 Pointing
6 Taking picture
7 Running
8 People meeting
9 People splitting
10 Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube replicates the entire Facebook image DB every ~5 days
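The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and the photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20        # from the slide
FACEBOOK_PHOTOS = 15_000_000_000      # from the slide
FRAMES_PER_SECOND = 30                # assumption: not stated on the slide

# Seconds of video uploaded per day: hours/minute * 3600 s/hour * 1440 minutes/day.
video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
frames_per_day = video_seconds_per_day * FRAMES_PER_SECOND
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:,} frames/day, {days_to_match:.1f} days")
# prints: 3,110,400,000 frames/day, 4.8 days
```

At 24 or 25 frames per second the answer comes out closer to 6 days, so the estimate is only good to the nearest day or so.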
Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid – Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[System diagram; block labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al., 2003)
Motion History Images (Bobick & Davis, 2001)
$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
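The Motion History Image update (set a pixel to τ where motion is detected, otherwise decay its previous value by 1 down to 0) can be written in a few lines. In this sketch the images are nested lists, and the motion indicator D is obtained by thresholded frame differencing; that choice of D and the function names are assumptions, not details from the slides.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One MHI time step: moving pixels become tau, the rest decay by 1 (floor 0)."""
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]

def frame_difference_mask(prev_frame, frame, threshold):
    """Binary motion indicator D from a thresholded absolute frame difference."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(r0, r1)]
        for r0, r1 in zip(prev_frame, frame)
    ]
```

Recently moving pixels are bright and older motion fades, so a single image summarizes the last τ frames of activity.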
[Bar chart: milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Bar labels as extracted, decimal points and bar assignment lost: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient/optical flow based on the direction and magnitude
• Normalize over neighboring regions
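The three steps listed above can be sketched for the image-gradient case as follows. The cell size, the number of orientation bins, the unsigned orientation, and the 2x2-cell L2 block normalization are assumptions chosen for illustration; the slides do not give the parameters actually used. For the optical-flow variant, the flow vector at each pixel would take the place of (gx, gy).

```python
import math

def cell_histograms(image, cell=4, bins=8):
    """Magnitude-weighted orientation histogram for each cell x cell region."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi      # unsigned orientation
            b = min(int(angle / math.pi * bins), bins - 1)
            r, c = y // cell, x // cell
            if r < rows and c < cols:
                hists[r][c][b] += mag
    return hists

def block_normalize(hists, eps=1e-6):
    """Concatenate each 2x2 block of neighboring cells and L2-normalize it."""
    rows, cols = len(hists), len(hists[0])
    blocks = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            vec = hists[r][c] + hists[r][c + 1] + hists[r + 1][c] + hists[r + 1][c + 1]
            norm = math.sqrt(sum(v * v for v in vec) + eps)
            blocks.append([v / norm for v in vec])
    return blocks
```

Every pixel is processed independently of the others, which is the locality property that makes this descriptor a good fit for GPUs.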
K-Means Clustering
Vector quantization
Turns high dimensional features into discrete number of points
Given data, find representative “centers”
Lloyd’s algorithm
For each data point, find the closest center
Update the center to be the mean of the associated data points
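The two alternating steps of Lloyd's algorithm can be sketched in plain Python. The random initialization from the data, the fixed iteration count, and the handling of empty clusters are assumptions made for the example; the `assign` function on its own is the vector quantization step.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def assign(points, centers):
    """Index of the closest center for every point (the vector quantization step)."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]

def lloyd(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate the assignment step and the mean update."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)
```

Once the centers are learned, a video is described by counting how many of its features fall on each center.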
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
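To see why the output does not fit: 1 million features against 1000 centers gives 10^9 distances, which is about 4 GB if each is stored as a 4-byte float (the storage type is an assumption). The sketch below shows the idea of reducing as the computation proceeds: it keeps only a running minimum per feature, so the full distance matrix is never stored. It is plain Python for readability, not a GPU implementation, and the chunking stands in for what would be separate batches on the device.

```python
def nearest_centers_chunked(features, centers, chunk_size=1024):
    """Closest-center index per feature without storing all N x K distances."""
    labels = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]   # one batch of work
        for f in chunk:
            best_j, best_d = -1, float("inf")
            for j, c in enumerate(centers):
                d = sum((a - b) ** 2 for a, b in zip(f, c))
                if d < best_d:                       # running min/argmin reduction
                    best_j, best_d = j, d
            labels.append(best_j)
    return labels
```

Only one integer per feature survives each batch, so the memory needed grows with the number of features rather than with features times centers.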
Clustering Helps
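Pairwise Distance Implementation
[Bar chart comparing the CUDA and C implementations; axis 0 to 6000]
Given
$$A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]$$
Compute
$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
$$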
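As a small reference for the quantity being computed, the sketch below builds the full matrix of distances between every vector in A and every vector in B. The slides do not say which distance d is used; squared Euclidean distance is assumed here, written with the expansion ||a||² + ||b||² − 2 a·b so that the dominant cost is a set of dot products, which is the part that resembles a matrix product.

```python
def pairwise_sq_dists(A, B):
    """D[i][j] = squared Euclidean distance between A[i] and B[j]."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(x * x for x in b) for b in B]
    return [
        [
            # ||a||^2 + ||b||^2 - 2 a.b, clamped against tiny negative rounding errors
            max(0.0, a_norms[i] + b_norms[j] - 2.0 * sum(x * y for x, y in zip(a, b)))
            for j, b in enumerate(B)
        ]
        for i, a in enumerate(A)
    ]
```

Each output entry depends only on one row of A and one row of B, so all m × n entries can be computed independently.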
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram labels: Shared Memory, A, B, C]
Timings on TRECVid 2008 System
[Bar chart, log-scale axis from 1 to 10000 milliseconds, comparing GPU + CPU against CPU for: Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Bar labels as extracted, decimal points and bar assignment lost: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Semantic Gap Semantic Gap
Ski Slope
Semantic Gap Semantic Gap
Resort
Semantic Gap Semantic Gap
Fun
Holiday
Beautifulhellip
Semantic Gap in Multimedia Semantic Gap in Multimedia
Retrieval Given a description retrieve all ldquorelevantrdquo content from a
database
Parsing Given an input formulate a natural language description
Subtasks
Detection (find ldquothingsrdquo)
Segmentation (find the boundaries of ldquothingsrdquo)
Recognition (assign category)
Retrieval Given a description retrieve all ldquorelevantrdquo content from a
database
Parsing Given an input formulate a natural language description
Subtasks
Detection (find ldquothingsrdquo)
Segmentation (find the boundaries of ldquothingsrdquo)
Recognition (assign category)
Multimedia Analysis Competitions and
Evaluations
Multimedia Analysis Competitions and
Evaluations
Moderate size dataset
Training set with labels
Evaluation set without labels
Constrained problem
Detect well defined actions
Detect words or concepts
Well defined metric
Challenges
Algorithm design
Computation
Moderate size dataset
Training set with labels
Evaluation set without labels
Constrained problem
Detect well defined actions
Detect words or concepts
Well defined metric
Challenges
Algorithm design
Computation
Star Challenge Star Challenge
PART I visual data processing PART I visual data processing
What is Star Challenge What is Star Challenge
Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology
Hosted by the Agency for Science Technology and Research (ASTAR) Singapore
A real-world computer vision task which requires large amounts of computation power
Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology
Hosted by the Agency for Science Technology and Research (ASTAR) Singapore
A real-world computer vision task which requires large amounts of computation power
But low rewards But low rewards
56 teams
from 17 countries
Round 1
8 teams
Round 2
7 teams
Round 3
5 teams Grand Final
in Singapore
No rewards No rewards No rewards
No rewards
Only one team
can win
US$100000
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
But we have a team with no fearshellip But we have a team with no fearshellip
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
But we have a team with no fearshellip But we have a team with no fearshellip
Letrsquos go over our experience and
storieshellip
Letrsquos go over our experience and
storieshellip
Outlines Outlines
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks 3 Audio Retrieval Tasks Task Query Target Metric Data Set
AT1 IPA sequence segments that contain the
query IPA sequence
regardless of its languages
Mean Average Precision
25 hours
monolingual
database in
round1
13 hours
multilingual
database in
round3
AT2 an utterance spoken
by different speakers all segments that contain the
query wordphrasesentence
regardless of its spoken
languages
AT3 No queries extract all recurrent segments
which are at least 1 second in
length
F-measure
Xiaodan will talk about this parthelliphellip
3 Video Retrieval Tasks 3 Video Retrieval Tasks
Task Query Target Criteria Metric Data Set
VT1 Single
Image
20
queries
(short)
Video
Segs
All the
similar
Segs
ldquovisually
similarrdquo
Mean Average Precision
20 categories
multiple labels
possible
VT2 Short
Video
Shot
(lt10s)
20
queries
(long)
Video
Segs
All the
similar
Segs
Perceptually
Similar
10 categories
multiple labels
possible
VT3 Videos
with
sound
(3~10s)
Order
of 10K
Category
number
learning the
common
visual
characteristics
Classification accuracy 10(20)
categories
including one
ldquoothersrdquo
category
20 VT1 Categories 20 VT1 Categories
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
10 Categories for VT2 10 Categories for VT2
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3 5 Categories for VT3
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total (32508 pseudo key frames + 8486 real key frames)
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 jpg files in total, including 10580 true key frames + 53966 pseudo key frames
Video: 352x288
Computation Powers
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data decompression: 15 minutes
Video format conversion: 2 hours
Video segmentation (for VT2): 40 minutes
Sound track extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on Trusted-ILLIAC
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Port MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
Features for Round 2 - VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round 2 - VT2
Character detector: Harris corner, morphological operations
Optical flow: Lucas-Kanade on spatial intensity gradient
Gender recognition: SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on NN
Feature normalization/combination
Result visualization
Observations
1. Samples in the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
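The query expansion steps can be sketched roughly as below. This is a minimal illustration, not the team's implementation: the function names, the Euclidean distance, and the "distance to the nearest expanded-query member" scoring rule are all assumptions, and step 0 (precomputing the evaluation/development matching) is omitted for brevity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def expand_query(category, dev_set):
    """Step 1: collect all development features labeled with the query's category.

    dev_set is a list of (feature_vector, category) pairs (hypothetical format).
    """
    return [feat for feat, cat in dev_set if cat == category]


def search(query_feat, category, dev_set, eval_set, top_k=20):
    """Step 2: rank evaluation items against the expanded query."""
    expanded = [query_feat] + expand_query(category, dev_set)
    scored = []
    for idx, feat in enumerate(eval_set):
        # Assumed scoring rule: distance to the closest expanded-query member.
        score = min(euclidean(feat, q) for q in expanded)
        scored.append((score, idx))
    scored.sort()
    return [idx for _, idx in scored[:top_k]]


dev = [([0.0, 0.0], 101), ([10.0, 10.0], 102)]
evaluation = [[0.1, 0.0], [9.9, 10.0], [5.0, 5.0]]
ranking = search([1.0, 1.0], 102, dev, evaluation, top_k=2)
```

In this toy run the second evaluation item ranks first because it sits next to a development sample of category 102, even though the query image itself is closer to the first item.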
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
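The slide does not spell out the MAP estimation step. One common form, given here only as an assumption, is relevance-MAP adaptation of the UBM means, where the relevance factor $r$ and the mean-only adaptation are choices not stated in the source. With UBM weights $w_k$, means $\mu_k$, covariances $\Sigma_k$, and patches $x_1, \ldots, x_T$ from one image:

```latex
\gamma_t(k) = \frac{w_k \,\mathcal{N}(x_t;\, \mu_k, \Sigma_k)}
                   {\sum_{j} w_j \,\mathcal{N}(x_t;\, \mu_j, \Sigma_j)},
\qquad
n_k = \sum_{t=1}^{T} \gamma_t(k),
\qquad
E_k = \frac{1}{n_k} \sum_{t=1}^{T} \gamma_t(k)\, x_t

\hat{\mu}_k = \alpha_k E_k + (1 - \alpha_k)\, \mu_k,
\qquad
\alpha_k = \frac{n_k}{n_k + r}
```

Components that see few patches from the image ($n_k$ small) stay close to the UBM mean, so each image gets a representation of the same size regardless of how many patches it has.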
VT1 Performance (2nd of 8)
Category: MAP
101 Crowd (>10 people): 0.8419
102 Building with sky as backdrop, clearly visible: 0.977
103 Mobile devices, including handphone/PDA: 0.028
107 Person using computer, both visible: 0.2281
109 Company trademark, including billboard, logo: 0.96
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
115 Food on dishes, plates: 0.2285
116 Face close-up occupying about 3/4 of screen, frontal or side: 0.9783
117 Traffic scene, many cars, trucks, road visible: 0.2901
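For reference, mean average precision (MAP) can be computed as in the sketch below. The competition's exact definition (for example any cutoff on the number of returned results) is not given in the slides, so this uses the standard uninterpolated form over a ranked list of relevance flags; the function names are illustrative.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Average precision for one query.

    ranked_relevance: sequence of truthy/falsy flags, best-ranked result first.
    num_relevant: total relevant items in the collection; defaults to the
    number of relevant items that appear in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    denom = num_relevant if num_relevant is not None else hits
    return precision_sum / denom if denom else 0.0


def mean_average_precision(runs):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)


ap = average_precision([1, 0, 1, 0])          # (1/1 + 2/3) / 2
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```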
VT2 Performance (1st of 8)
Category: MAP
202 Talking face with introductory caption: 0.8432
206 Static or minute camera movement, people walking, legs visible: 0.0581
207 Large camera movement, panning left/right or top/down of a scene: 0.7789
208 Movie ending credit: 0.2782
209 Woman monologue (Zhen): 0.9756
Performance of Round 3 (1st of 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Retrieval target: VT2 only (video, R=20); AT1 + VT2
202 Face with introductory caption: 1; 0.03
209 Woman monologue: 0.35; 0.1
201 People entering door: N/A
We are
2nd in audio search
4th in video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference. TRECVid: Video Retrieval Workshop.
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
List of Events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
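The "lazy pull" data flow with per-frame referencing and caching can be illustrated with a small sketch. This is not ViVid's API: the `LazyStage` class and its indexing interface are hypothetical, and only show the idea that a result is computed when a downstream stage asks for a given frame, then cached.

```python
from functools import lru_cache


class LazyStage:
    """Hypothetical pipeline stage: computes a frame's result only on request."""

    def __init__(self, func, source=None, cache_size=64):
        self.func = func          # per-frame operation
        self.source = source      # upstream stage, or None for a root stage
        self.calls = 0            # number of real computations (for illustration)
        self._get = lru_cache(maxsize=cache_size)(self._compute)

    def _compute(self, frame_index):
        self.calls += 1
        # Pull the needed frame from upstream; a root stage receives the index.
        upstream = self.source[frame_index] if self.source is not None else frame_index
        return self.func(upstream)

    def __getitem__(self, frame_index):
        # Per-frame referencing: stage[i] returns the (cached) result for frame i.
        return self._get(frame_index)


# A fake "decoder" that fabricates three pixel values per frame index.
decoder = LazyStage(lambda i: [i, i + 1, i + 2])
brightness = LazyStage(lambda frame: sum(frame) / len(frame), source=decoder)

first = brightness[5]    # triggers decoding of frame 5 only
second = brightness[5]   # served from the cache, nothing is recomputed
```

Nothing is decoded until `brightness[5]` is requested, and the repeated request costs no extra decoding or feature computation.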
Motivations
Most operations are highly local
Applications with real-time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
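The ~5 day figure can be checked with a quick calculation from the two numbers on the slide. The frame rate is not stated in the source; 30 frames per second is assumed here.

```python
HOURS_UPLOADED_PER_MINUTE = 20      # from the slide
FACEBOOK_PHOTOS = 15e9              # from the slide
ASSUMED_FPS = 30                    # assumption, not stated in the source


def days_to_match(photos, hours_per_minute, fps):
    """Days of YouTube uploads needed for the frame count to equal `photos`."""
    video_hours_per_day = hours_per_minute * 60 * 24
    frames_per_day = video_hours_per_day * 3600 * fps
    return photos / frames_per_day


days = days_to_match(FACEBOOK_PHOTOS, HOURS_UPLOADED_PER_MINUTE, ASSUMED_FPS)
```

That is 28,800 hours of video per day, or about 3.1 billion frames, which gives roughly 4.8 days; at 25 frames per second the same calculation gives about 5.8 days.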
Image/Video Processing: Video Decoder, 2D/3D Convolution, 2D/3D Fourier Transform, Optical Flow
Feature Extraction: Motion Descriptor (Efros et al.), Motion History Descriptor, Random Video Interest Points, Histograms of Oriented Gradients / Optical Flow
Analysis: Vector Quantization, SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Pipeline diagram; block labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space-Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
[Figure panels: Laptev, Dollar, RSMB]
More is Good
[Figure: Interest Point Detection Rates]
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al., 2003)
Motion History Images (Bobick & Davis, 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\left(0,\; H_\tau(x, y, t-1) - 1\right) & \text{otherwise} \end{cases}$$
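The motion history update sets every pixel with detected motion to τ and lets all other pixels decay by 1 toward 0. A minimal sketch follows; the frame-differencing rule used to obtain the binary motion mask D is an assumption, since the slide does not say how D is computed.

```python
def motion_mask(prev_frame, frame, threshold):
    """Assumed motion detector D: absolute frame difference above a threshold."""
    return [
        [abs(a - b) > threshold for a, b in zip(row_prev, row_cur)]
        for row_prev, row_cur in zip(prev_frame, frame)
    ]


def update_mhi(prev_mhi, mask, tau):
    """One step of the motion history recurrence.

    H(x, y, t) = tau                        if D(x, y, t) = 1
               = max(0, H(x, y, t-1) - 1)   otherwise
    """
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, mask)
    ]


frame0 = [[10, 10], [10, 10]]
frame1 = [[90, 10], [10, 10]]          # only the top-left pixel changes
mask = motion_mask(frame0, frame1, threshold=20)
mhi = update_mhi([[0, 3], [1, 0]], mask, tau=5)
```

Recent motion ends up bright (value τ) and older motion fades, so a single image encodes where and how recently motion occurred.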
[Bar chart, milliseconds per frame (axis 0 to 300): C, CUDA (distance), CUDA (distance + argmin), CUDA (feature + distance + argmin). Extracted bar labels, with order and decimal points not recoverable: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
Partition the image window into local regions
Histogram the image gradient / optical flow based on the direction and magnitude
Normalize over neighboring regions
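The three steps can be sketched for image gradients as follows. The cell size, number of bins, unsigned orientation, central-difference gradients, and 2x2-cell L2 block normalization are all assumed parameter choices for illustration; the same histogramming applies to optical flow by replacing the gradient (gx, gy) with the flow vector at each pixel.

```python
import math


def cell_histograms(image, cell=4, bins=8):
    """Steps 1-2: split the window into cells and histogram gradient
    orientation, weighting each vote by gradient magnitude."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi     # unsigned orientation
            b = min(int(angle / math.pi * bins), bins - 1)
            r, c = y // cell, x // cell
            if r < rows and c < cols:
                hists[r][c][b] += mag
    return hists


def block_normalize(hists, eps=1e-6):
    """Step 3: L2-normalize each 2x2 block of neighboring cells."""
    rows, cols = len(hists), len(hists[0])
    descriptor = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = hists[r][c] + hists[r][c + 1] + hists[r + 1][c] + hists[r + 1][c + 1]
            norm = math.sqrt(sum(v * v for v in block) + eps)
            descriptor.extend(v / norm for v in block)
    return descriptor


# 8x8 horizontal intensity ramp: every gradient points along x.
ramp = [[float(x) for x in range(8)] for _ in range(8)]
hists = cell_histograms(ramp)
descriptor = block_normalize(hists)
```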
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete set of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
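A minimal sketch of Lloyd's algorithm is given below. The random initialization from the data points, the fixed iteration count, and the handling of empty clusters (the center is left unchanged) are choices made for this illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Assignment step: index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def lloyd_kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # assumed initialization
    for _ in range(iterations):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster is empty
                # Update step: the center becomes the mean of its members.
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)


data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = lloyd_kmeans(data, k=2)
```

Once the centers are fixed, `assign` is the vector quantization step: each feature is replaced by the index of its nearest center.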
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
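To see why the full output does not fit: 1 million features against 1000 centers gives 10^9 distances, or about 4 GB at an assumed 4 bytes per value. The sketch below shows the "reduce as computation proceeds" idea on the CPU: distances are computed one chunk of points at a time and each row is immediately reduced to the index of its minimum, so only one small block is ever held. The chunk size and function names are illustrative, and this is plain Python rather than the GPU implementation.

```python
def output_bytes(n_points, n_centers, bytes_per_value=4):
    """Size of the full pairwise distance matrix (single precision assumed)."""
    return n_points * n_centers * bytes_per_value


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centers_chunked(points, centers, chunk_size=1024):
    """Nearest-center index per point without storing the full distance matrix."""
    labels = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        # Distance block for this chunk only: len(chunk) x len(centers).
        block = [[sq_dist(p, c) for c in centers] for p in chunk]
        # Argmin reduction: keep one index per row, then discard the block.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return labels


full_size = output_bytes(10**6, 1000)
labels = nearest_centers_chunked([[0], [1], [9], [10]], [[0], [10]], chunk_size=3)
```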
Clustering Helps
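Pairwise Distance Implementation
[Bar chart: C vs. CUDA, axis 0 to 6000]
Given
$$A = [a_1, \ldots, a_m], \qquad B = [b_1, \ldots, b_n]$$
Compute
$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & & & \vdots \\ \vdots & & \ddots & \\ d(a_m, b_1) & \cdots & & d(a_m, b_n) \end{bmatrix}$$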
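A small sketch of the pairwise distance matrix, taking d to be squared Euclidean distance (the slide leaves d unspecified). The second function uses the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b, which turns the bulk of the work into dot products; it is shown as one well-known formulation, not as the method used in the talk.

```python
def sq_euclidean_direct(A, B):
    """m x n matrix with entry (i, j) = squared distance between A[i] and B[j]."""
    return [[sum((x - y) ** 2 for x, y in zip(a, b)) for b in B] for a in A]


def sq_euclidean_via_dot(A, B):
    """Same matrix via ||a||^2 + ||b||^2 - 2 * (a . b)."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(y * y for y in b) for b in B]
    return [
        [a_norms[i] + b_norms[j] - 2 * sum(x * y for x, y in zip(a, b))
         for j, b in enumerate(B)]
        for i, a in enumerate(A)
    ]


A = [[0, 0], [3, 4]]
B = [[0, 0], [1, 1], [3, 0]]
direct = sq_euclidean_direct(A, B)
via_dot = sq_euclidean_via_dot(A, B)
```

Every entry of the output depends only on one row of A and one row of B, which is why the computation parallelizes so readily.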
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram labels: Shared Memory, A, B, C]
Timings on TRECVid 2008 System
[Bar chart, milliseconds on a log scale (1 to 10000), GPU + CPU vs. CPU, for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Extracted bar labels, with assignment to bars not recoverable: 53 79, 23, 240 150, 53 79, 3030 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size  Histogramming Method  Rate of Detection: Low  Medium  High
1000             Raw                   0.681                   0.804   0.844
1000             Norm                  0.708                   0.799   0.840
1000             Mt Inf                0.594                   0.804   0.848
500              Raw                   0.675                   0.792   0.833
500              Norm                  0.701                   0.791   0.825
500              Mt Inf                0.626                   0.783   0.819
200              Raw                   0.671                   0.772   0.811
200              Norm                  0.701                   0.779   0.818
200              Mt Inf                0.614                   0.720   0.756
Results (2009), with 2008 results in parentheses
Event        True Positives  False Alarm  Miss  Min DCR
Pointing     13 (57)         225 (2505)   1050  1.006
Cell To Ear  0 (8)           58 (4005)    194   1.060
Person Runs  1 (0)           38 (314)     106   0.997
Object Put   1 (21)          190 (2703)   620   1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering, programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10,000 classes
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3 5 Categories for VT3
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
Examples of Images Examples of Images
More samples More samples
Evaluation Video Data of Round2 Evaluation Video Data of Round2
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
Evaluation Video Data of Round3 Evaluation Video Data of Round3
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Computation Powers Computation Powers
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Time Cost for Video Tasks Time Cost for Video Tasks
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Possible Accelerations for Video Possible Accelerations for Video
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Features for Round2- VT1 Features for Round2- VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Features for Round2-VT2 Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE Grand Unified Feature
Extractor
GUFE Grand Unified Feature
Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Observations Observations 1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1in7) Performance of Round3 (1in7)
Task 2 (VT1)
Target Estimated MAP (R=20)
101 Crowd (gt10 people) 064
102 Building with sky as backdrop clearly visible 1
107 Person using Computer both visible 07
112 Closeup of hand eg using mouse writing etc 0527
116 Face closeup occupying about 34 of screen frontal or side 1
Task3 (AT1 + VT2)
Retrieval Target VT2 only AT1 + VT2
Video R=20
202 face with introductory caption 1 003
209 women monolog 035 01
201 People entering door NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real-time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
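The "~5 days" figure can be checked with a short calculation. The upload rate and photo count are the ones quoted above; the frame rate is not stated on the slide, so the two values tried here (25 and 30 frames per second) are assumptions.

```python
def days_to_match(photo_count, hours_uploaded_per_minute, frames_per_second):
    """Days of uploads needed for the uploaded frame count to equal photo_count."""
    seconds_of_video_per_day = hours_uploaded_per_minute * 3600 * 60 * 24
    frames_per_day = seconds_of_video_per_day * frames_per_second
    return photo_count / frames_per_day


FACEBOOK_PHOTOS = 15e9          # from the slide
YOUTUBE_HOURS_PER_MINUTE = 20   # from the slide

for fps in (25, 30):  # assumed frame rates; the slide does not state one
    days = days_to_match(FACEBOOK_PHOTOS, YOUTUBE_HOURS_PER_MINUTE, fps)
    print(f"{fps} fps: {days:.1f} days")
# prints:
# 25 fps: 5.8 days
# 30 fps: 4.8 days
```

Either assumption lands close to the five days quoted.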
Image/Video Processing: Video Decoder, 2D/3D Convolution, 2D/3D Fourier Transform, Optical Flow
Feature Extraction: Motion Descriptor (Efros et al.), Motion History Descriptor, Random Video Interest Points, Histograms of Oriented Gradients / Optical Flow
Analysis: Vector Quantization, SVM Classifier Evaluation
ViVid – Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Figure: pipeline diagram with blocks labeled Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, and Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space-Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
[Figure: panels labeled Laptev, Dollar, RSMB]
More is Good
[Figure: Interest Point Detection Rates]
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al., 2003)
Motion History Images (Bobick & Davis, 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
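The update rule for the motion history image can be sketched directly. The slide does not define the motion indicator D, so the frame-difference threshold used for it here is an assumption, as are the function names.

```python
def update_mhi(prev_mhi, mask, tau):
    """One step of the motion history update.

    prev_mhi : 2D list, H_tau(x, y, t-1)
    mask     : 2D list of 0/1, D(x, y, t)
    tau      : history length in frames
    """
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, mask)
    ]


def motion_mask(frame, prev_frame, threshold):
    """Assumed definition of D: 1 where the absolute frame difference exceeds threshold."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row, prev_row)]
        for row, prev_row in zip(frame, prev_frame)
    ]


# A bright pixel moving one step to the right per frame on a 1x4 image.
frames = [
    [[9, 0, 0, 0]],
    [[0, 9, 0, 0]],
    [[0, 0, 9, 0]],
    [[0, 0, 0, 9]],
]
tau = 3
mhi = [[0, 0, 0, 0]]
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, motion_mask(cur, prev, threshold=4), tau)
print(mhi)  # [[1, 2, 3, 3]]
```

Recently moving pixels hold the value tau and older motion decays by one per frame, so the final image encodes both where and how recently motion occurred.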
[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
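The three steps above can be sketched for the image-gradient case. The cell size, the number of orientation bins, the unsigned orientation, and the 2x2 block normalization are assumptions made for illustration; the slides do not give these parameters. The same binning applies to optical flow if the flow components replace the gradient components.

```python
import math


def cell_histograms(image, cell=2, bins=4):
    """Per-cell orientation histograms of the image gradient, weighted by magnitude."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation in [0, pi)
            b = min(int(angle / math.pi * bins), bins - 1)
            hists[y // cell][x // cell][b] += magnitude
    return hists


def block_normalize(hists, eps=1e-6):
    """Concatenate each 2x2 block of neighboring cells and L2-normalize it."""
    rows, cols = len(hists), len(hists[0])
    descriptor = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = hists[r][c] + hists[r][c + 1] + hists[r + 1][c] + hists[r + 1][c + 1]
            norm = math.sqrt(sum(v * v for v in block) + eps)
            descriptor.extend(v / norm for v in block)
    return descriptor


# 4x4 window with a vertical edge: all gradient energy is horizontal.
window = [[0, 0, 10, 10] for _ in range(4)]
hog = block_normalize(cell_histograms(window))
```

For this window every cell puts its weight in the first orientation bin, and the single 2x2 block normalizes to unit length.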
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
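The two alternating steps of Lloyd's algorithm can be written out as a minimal sketch. The random initialization from the data, the fixed iteration count, and the handling of empty clusters are assumptions; the slides do not specify them.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations; returns (centers, assignment of each point)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # assumed initialization
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Step 1: assign each point to its closest center.
        assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 2: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, codes = lloyd(points, k=2)
```

The returned `codes` are the vector-quantized form of the data: each feature is replaced by the index of its nearest center.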
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as the computation proceeds
Need an efficient reduction operator
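The memory argument above can be made concrete: instead of storing every feature-to-center distance, compute the distances for one chunk of features, reduce each row to the index of its nearest center, and discard the rest. This is a CPU sketch of the idea only, not the CUDA implementation from the slides; the chunk size and the 4-byte value size are assumptions.

```python
def nearest_center_chunked(features, centers, chunk_size):
    """Nearest-center index per feature, never holding more than one chunk of distances."""
    nearest = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distances for this chunk only: len(chunk) x len(centers) values.
        distances = [
            [sum((x - c) ** 2 for x, c in zip(f, center)) for center in centers]
            for f in chunk
        ]
        # Reduction: keep the argmin of each row, then discard the row.
        nearest.extend(min(range(len(row)), key=row.__getitem__) for row in distances)
    return nearest


# Size of the full distance matrix for the numbers on the slide, assuming float32.
n_features, n_centers, bytes_per_value = 1_000_000, 1000, 4
full_matrix_gb = n_features * n_centers * bytes_per_value / 1e9  # 4.0 GB
reduced_mb = n_features * 4 / 1e6  # one 32-bit index per feature: 4.0 MB

features = [(0.0, 0.1), (5.0, 5.2), (0.2, 0.0), (4.9, 5.1), (9.0, 9.0)]
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
labels = nearest_center_chunked(features, centers, chunk_size=2)
```

Under these assumptions the reduced output is about a thousand times smaller than the full matrix, which is why the reduction is fused with the distance computation.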
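Clustering Helps
Pairwise Distance Implementation
[Figure: bar chart comparing the CUDA and C implementations, axis 0 to 6000]
Given

$$
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
$$

Compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
$$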
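A minimal sketch of the pairwise distance matrix follows. The slides leave the distance d generic; squared Euclidean distance is assumed here, and the second function shows one common reformulation of it (not necessarily the one used in the slides) in which most of the work becomes a matrix product.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_distances(A, B, d=squared_euclidean):
    """Return the m x n matrix D with D[i][j] = d(A[i], B[j])."""
    return [[d(a, b) for b in B] for a in A]


def pairwise_squared_euclidean_via_dot(A, B):
    """Same result for squared Euclidean distance, using |a|^2 + |b|^2 - 2 a.b.

    The a.b terms form the matrix product A B^T, which is the part that
    parallelizes well.
    """
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(y * y for y in b) for b in B]
    return [
        [a_norms[i] + b_norms[j] - 2 * sum(x * y for x, y in zip(a, b))
         for j, b in enumerate(B)]
        for i, a in enumerate(A)
    ]


A = [(0.0, 0.0), (1.0, 2.0)]              # m = 2 points
B = [(0.0, 1.0), (3.0, 4.0), (1.0, 2.0)]  # n = 3 points
D = pairwise_distances(A, B)
print(D)  # [[1.0, 25.0, 5.0], [2.0, 8.0, 0.0]]
```

Every entry of D depends only on one row of A and one row of B, so the entries can be computed independently of each other.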
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Figure: three diagrams showing input matrices A and B, output matrix C, and shared memory]
Timings on TRECVid 2008 System
[Figure: bar chart on a log-scale axis from 1 to 10000 milliseconds, comparing "GPU + CPU" and "CPU" for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Columns Low, Medium, High are grouped under "Rate of Detection".
Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
Results (2009), with 2008 results in parentheses
Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations experience
Working with realistic data
Engineering/Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet Large Scale Visual Recognition
10000 Classes
Semantic Gap
Fun
Holiday
Beautiful...
Semantic Gap in Multimedia
Retrieval: Given a description, retrieve all "relevant" content from a database
Parsing: Given an input, formulate a natural language description
Subtasks
Detection (find "things")
Segmentation (find the boundaries of "things")
Recognition (assign category)
Multimedia Analysis Competitions and Evaluations
Moderate size dataset
Training set with labels
Evaluation set without labels
Constrained problem
Detect well-defined actions
Detect words or concepts
Well-defined metric
Challenges
Algorithm design
Computation
Star Challenge
PART I: visual data processing
What is Star Challenge?
Competition to Develop World's Next-Generation Multimedia Search Technology
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
A real-world computer vision task which requires large amounts of computation power
But low rewards
[Figure: elimination funnel. 56 teams from 17 countries -> Round 1 -> 8 teams -> Round 2 -> 7 teams -> Round 3 -> 5 teams -> Grand Final in Singapore; the stages are marked "No rewards"]
Only one team can win US$100,000
But we have a team with no fears...
Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar, Sean, Jui-Ting, Zhen, Huazhong, Xi, Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao
Let's go over our experience and stories...
Outline
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks
AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of language
Metric: Mean Average Precision
AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of spoken language
AT3
Query: none
Target: extract all recurrent segments that are at least 1 second long
Metric: F-measure
Data set: 25-hour monolingual database in Round 1; 13-hour multilingual database in Round 3
Xiaodan will talk about this part...
3 Video Retrieval Tasks
VT1
Query: single image (20 queries)
Target: (short) video segments; all the similar segments
Criteria: "visually similar"
Metric: Mean Average Precision
Data set: 20 categories, multiple labels possible
VT2
Query: short video shot (<10 s) (20 queries)
Target: (long) video segments; all the similar segments
Criteria: perceptually similar
Metric: Mean Average Precision
Data set: 10 categories, multiple labels possible
VT3
Query: videos with sound (3~10 s), on the order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data set: 10 (20) categories, including one "others" category
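Since Mean Average Precision is the metric for the retrieval tasks, a small sketch of how it can be computed may help. The exact definition used by the challenge is not given in the slides; the cutoff handling and the normalization by min(cutoff, number of relevant items) below are assumed conventions.

```python
def average_precision(ranked_relevance, cutoff, n_relevant):
    """Average precision over the top `cutoff` results.

    ranked_relevance : list of booleans, True where the result at that rank is relevant
    n_relevant       : number of relevant items in the whole database
    Normalizing by min(cutoff, n_relevant) is an assumed convention.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance[:cutoff], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denominator = min(cutoff, n_relevant)
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(queries, cutoff):
    """`queries` is a list of (ranked_relevance, n_relevant) pairs, one per query."""
    return sum(average_precision(r, cutoff, n) for r, n in queries) / len(queries)


queries = [
    ([True, False, True, False], 2),   # hits at ranks 1 and 3
    ([False, True, False, False], 1),  # single hit at rank 2
]
score = mean_average_precision(queries, cutoff=4)
```

Here the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3. Results placed higher in the ranking contribute more, which is why the metric rewards good ordering and not only good recall.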
20 VT1 Categories
100 Not applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices, including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either as an IPA sequence or as a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames, 8486 real key frames
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288
Computation Power
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data decompression: 15 minutes
Video format conversion: 2 hours
Video segmentation (for VT2): 40 minutes
Sound track extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on Trusted-ILLIAC
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Port MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
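The last point, loading each item once and running every extractor on it, can be sketched as follows. The function names, the file names, and the two toy extractors are hypothetical stand-ins; the team's actual loaders and features are not shown in the slides.

```python
def extract_all(paths, load, extractors):
    """Load each item once and run every extractor on the loaded data."""
    features = {name: [] for name in extractors}
    for path in paths:
        data = load(path)  # the expensive step, done once per item
        for name, extractor in extractors.items():
            features[name].append(extractor(data))
    return features


# Toy stand-ins (hypothetical): "loading" looks up pixel values and counts calls.
fake_disk = {"a.jpg": [1, 2, 3, 4], "b.jpg": [10, 20, 30, 40]}
load_calls = []


def load(path):
    load_calls.append(path)
    return fake_disk[path]


extractors = {
    "mean": lambda px: sum(px) / len(px),
    "range": lambda px: max(px) - min(px),
}
features = extract_all(["a.jpg", "b.jpg"], load, extractors)
```

With two extractors, running them in separate passes would load every file twice; here each file is loaded once regardless of how many extractors are registered.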
Features for Round 2: VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round 2: VT2
Character detector: Harris corner, morphological operations
Optical flow: Lucas-Kanade on spatial intensity gradient
Gender recognition: SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on NN
Feature normalization/combination
Result visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
VT1 Performance (2 in 8)
Category: MAP
• 101 Crowd (>10 people): 0.8419
• 102 Building with sky as backdrop, clearly visible: 0.977
• 103 Mobile devices, including handphone/PDA: 0.028
• 107 Person using computer, both visible: 0.2281
• 109 Company trademark, including billboard, logo: 0.96
• 112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
• 113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
• 115 Food on dishes, plates: 0.2285
• 116 Face close-up occupying about 3/4 of screen, frontal or side: 0.9783
• 117 Traffic scene, many cars, trucks, road visible: 0.2901
VT2 Performance (1 in 8)
Category: MAP
• 202 Talking face with introductory caption: 0.8432
• 206 Static or minute camera movement, people(s) walking, legs visible: 0.0581
• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789
• 208 Movie ending credit: 0.2782
• 209 Woman monologue Zhen: 0.9756
Semantic Gap in Multimedia Semantic Gap in Multimedia
Retrieval Given a description retrieve all ldquorelevantrdquo content from a
database
Parsing Given an input formulate a natural language description
Subtasks
Detection (find ldquothingsrdquo)
Segmentation (find the boundaries of ldquothingsrdquo)
Recognition (assign category)
Retrieval Given a description retrieve all ldquorelevantrdquo content from a
database
Parsing Given an input formulate a natural language description
Subtasks
Detection (find ldquothingsrdquo)
Segmentation (find the boundaries of ldquothingsrdquo)
Recognition (assign category)
Multimedia Analysis Competitions and Evaluations
Moderate size dataset
Training set with labels
Evaluation set without labels
Constrained problem
Detect well defined actions
Detect words or concepts
Well defined metric
Challenges
Algorithm design
Computation
Star Challenge
PART I: visual data processing
What is Star Challenge
Competition to Develop World's Next-Generation Multimedia Search Technology
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
A real-world computer vision task which requires large amounts of computation power
But low rewards
56 teams from 17 countries -> Round 1 -> 8 teams -> Round 2 -> 7 teams -> Round 3 -> 5 teams -> Grand Final in Singapore
No rewards
Only one team can win US$100,000
But we have a team with no fears...
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
Let's go over our experience and stories...
Outline
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks
Columns: Task, Query, Target, Metric, Data Set
AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of its languages
AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of its spoken languages
AT3
Query: no queries
Target: extract all recurrent segments which are at least 1 second in length
Metric: Mean Average Precision (AT1, AT2); F-measure (AT3)
Data Set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3
Xiaodan will talk about this part...
3 Video Retrieval Tasks
Columns: Task, Query, Target, Criteria, Metric, Data Set
VT1
Query: single image; 20 queries
Target: (short) video segs; all the similar segs
Criteria: "visually similar"
Metric: Mean Average Precision
Data Set: 20 categories, multiple labels possible
VT2
Query: short video shot (<10 s); 20 queries
Target: (long) video segs; all the similar segs
Criteria: perceptually similar
Metric: Mean Average Precision
Data Set: 10 categories, multiple labels possible
VT3
Query: videos with sound (3~10 s); order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data Set: 10 (20) categories, including one "others" category
20 VT1 Categories
100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object (person, car, etc.)
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17,289 frames for VT1 in total
40,994 frames for VT2 in total
32,508 pseudo key frames, 8,486 real key frames
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10,580 jpg files
Key frames for VT2: 64,546 files in total, including 10,580 jpg files (true key frames) + 53,966 jpg files (pseudo key frames)
Video: 352x288
Computation Power
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Port MATLAB code to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
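The "one load" idea above can be sketched as follows. This is only an illustration under assumptions: the function name `extract_all_features`, the `load_frame` callback and the dictionary of extractor functions are hypothetical and are not taken from the team's code.

```python
def extract_all_features(frame_paths, load_frame, extractors):
    """Run every extractor on each frame while loading the frame only once.

    frame_paths: iterable of frame identifiers (hypothetical).
    load_frame:  callable that reads and decodes one frame (the slow step).
    extractors:  dict mapping a feature name to a function of the frame.
    """
    results = {name: [] for name in extractors}
    for path in frame_paths:
        frame = load_frame(path)              # expensive I/O, done once per frame
        for name, extract in extractors.items():
            results[name].append(extract(frame))
    return results
```

Compared with running each extractor as its own pass over the data, the number of loads drops from (frames x extractors) to (frames).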
Features for Round 2: VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, Texture, etc.
Semantic Feature
Features for Round 2: VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly
e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
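The query-expansion steps above can be sketched as follows. This is a minimal illustration, not the team's code. It assumes every image is a fixed-length feature vector, similarity is plain Euclidean distance, and an evaluation item is scored by its distance to the nearest member of the expanded query set. The slides specify none of these, and the name `retrieve_by_query_expansion` is hypothetical.

```python
import numpy as np

def retrieve_by_query_expansion(query_feat, category, dev_feats, dev_labels,
                                eval_feats, top_k=20):
    """Return indices of the top_k evaluation items for one query."""
    # Step 1: expand the query with all development samples of its category.
    same_category = dev_feats[dev_labels == category]
    expanded = np.vstack([query_feat[None, :], same_category])
    # Step 2: search the evaluation set with the expanded query; each item is
    # scored by its distance to the closest expanded-query sample (assumption).
    diff = eval_feats[:, None, :] - expanded[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))   # shape (n_eval, n_expanded)
    score = dists.min(axis=1)
    # Output: the top_k closest evaluation items.
    return np.argsort(score, kind="stable")[:top_k]
```

Because the expanded set only depends on the category, the matching between development and evaluation data can be precomputed once, which is what step 0 of the slide suggests.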
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
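Step 2 above (MAP estimation given the UBM) can be illustrated with a common mean-only adaptation rule. This is a sketch under assumptions, not the team's implementation: it assumes a diagonal-covariance UBM, adapts only the means, and uses a relevance factor of 16. The function name `map_adapt_means` is hypothetical. UBM training, the patch kernel and the within-class covariance normalization are not shown.

```python
import numpy as np

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Adapt UBM means to the patches of one image (mean-only MAP).

    patches:   (N, D) patch descriptors of one image
    weights:   (K,)   UBM mixture weights
    means:     (K, D) UBM means
    variances: (K, D) UBM diagonal variances
    """
    # Log-density of every patch under every diagonal Gaussian.
    diff = patches[:, None, :] - means[None, :, :]                 # (N, K, D)
    log_gauss = (-0.5 * (diff ** 2 / variances[None]).sum(axis=2)
                 - 0.5 * np.log(2 * np.pi * variances).sum(axis=1)[None])
    # Posterior responsibility of each component for each patch.
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                        # (N, K)
    # Soft counts and per-component data means.
    n_k = post.sum(axis=0)                                         # (K,)
    data_mean = (post.T @ patches) / np.maximum(n_k, 1e-12)[:, None]
    # Interpolate between the data mean and the UBM mean.
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * data_mean + (1.0 - alpha) * means
```

Components that see many patches move toward the image's own statistics, while components with little support stay close to the UBM. The adapted means can then be stacked into one vector per image before a distance is computed.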
VT1 Performance (2 in 8)
Category: MAP
101 Crowd (>10 people): 0.8419
102 Building with sky as backdrop, clearly visible: 0.977
103 Mobile devices including handphone/PDA: 0.028
107 Person using computer, both visible: 0.2281
109 Company trademark, including billboard, logo: 0.96
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
115 Food on dishes, plates: 0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side: 0.9783
117 Traffic scene, many cars, trucks, road visible: 0.2901
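For reference, the MAP figures above average a per-query average precision. The sketch below shows one common definition. It assumes that each ranked result list is given as 0/1 relevance flags and that average precision is normalized by the number of relevant items that appear in the list. The competition's exact definition is not stated on the slides.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of 0/1 relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank      # precision at this cut-off
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs):
    """Mean of the average precision over several queries."""
    return sum(average_precision(r) for r in runs) / len(runs)
```

For example, a list with relevant items at ranks 1 and 3 scores (1/1 + 2/3) / 2, so early mistakes cost more than late ones.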
VT2 Performance (1 in 8)
Category: MAP
202 Talking face with introductory caption: 0.8432
206 Static or minute camera movement, people(s) walking, legs visible: 0.0581
207 Large camera movement, panning left/right, top/down of a scene: 0.7789
208 Movie ending credit: 0.2782
209 Woman monologue Zhen: 0.9756
Performance of Round 3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Retrieval target (video, R=20): VT2 only; AT1 + VT2
202 Face with introductory caption: 1; 0.03
209 Woman monologue: 0.35; 0.1
201 People entering door: N/A
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
Surveillance Event Detection
List of Events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
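The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide. The frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
VIDEO_HOURS_PER_MINUTE = 20   # from the slide: hours of video uploaded per minute
FACEBOOK_PHOTOS = 15e9        # from the slide: photos on Facebook
FPS = 30                      # assumption: frames per second of uploaded video

# Seconds of video uploaded per minute, times frames per second,
# times minutes per day.
frames_per_day = VIDEO_HOURS_PER_MINUTE * 3600 * FPS * 60 * 24
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(round(days_to_match, 1))  # prints 4.8
```

At 25 frames per second the same calculation gives a little under 6 days, so the slide's estimate holds for typical frame rates.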
Image / Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid - Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Diagram blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
[Figure panels: Laptev, Dollar, RSMB]
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\ H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
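The motion history update sets a pixel to τ wherever motion is detected in the current frame and otherwise decays the previous value by one, down to zero. A minimal NumPy sketch of one update step follows. The function name `update_mhi` and the use of a boolean motion mask for D are assumptions made for illustration.

```python
import numpy as np

def update_mhi(prev_mhi, motion_mask, tau):
    """One motion-history step.

    prev_mhi:    H_tau(x, y, t-1), array of non-negative values
    motion_mask: D(x, y, t) as a boolean array (True where motion was detected)
    tau:         history length in frames
    """
    decayed = np.maximum(0, prev_mhi - 1)        # fade older motion by one step
    return np.where(motion_mask, tau, decayed)   # refresh pixels that just moved
```

Each pixel is updated independently of its neighbours, which is the kind of local operation the slides describe as a good fit for GPUs.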
[Bar chart, milliseconds per frame (axis 0 to 300). Series: CUDA (feature + distance + argmin), CUDA (distance + argmin), CUDA (distance), C. Bar labels: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
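The three steps above can be sketched in NumPy for the image-gradient case. This is a simplified illustration and not the ViVid implementation. The cell size, number of bins, unsigned orientations, hard bin assignment and 2x2-cell L2 block normalization are all assumptions. For the optical-flow variant the two gradient arrays would be replaced by the two flow components.

```python
import numpy as np

def cell_orientation_histograms(image, cell=8, bins=9):
    """Steps 1 and 2: per-cell histograms of gradient direction,
    weighted by gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)            # unsigned, in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell),
                  slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)
    return hist

def normalize_blocks(hist, eps=1e-6):
    """Step 3: L2-normalize every 2x2 block of neighboring cells."""
    rows, cols, _ = hist.shape
    blocks = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            v = hist[r:r + 2, c:c + 2].ravel()
            blocks.append(v / np.sqrt((v ** 2).sum() + eps ** 2))
    return np.concatenate(blocks) if blocks else np.zeros(0)
```

Each cell and each block is computed from a small neighbourhood only, so the loops here correspond to work that can run in parallel.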
K-Means Clustering
Vector quantization
Turns high dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update the center to be the mean of the associated data points
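The two alternating steps of Lloyd's algorithm can be written as a short NumPy sketch. The random initialization from data points, the fixed iteration count and the function name `lloyd_kmeans` are assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on an (N, D) array; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize with k distinct data points chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Step 1: for each data point, find the closest center.
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: move each center to the mean of its associated points.
        for j in range(k):
            members = data[labels == j]
            if len(members):                 # leave empty clusters unchanged
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)
```

Step 1 is a pairwise distance computation followed by an argmin, which is why the distance kernel dominates the running time.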
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
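With 1 million features and 1000 centers, the full distance matrix has 10^9 entries, about 4 GB if stored as 32-bit floats (an assumed precision). The sketch below shows the idea of reducing as the computation proceeds. It is written for the CPU in NumPy rather than CUDA, and the chunk size and the name `assign_in_chunks` are assumptions. Only one block of distances exists at a time and each block is immediately reduced to an argmin.

```python
import numpy as np

def assign_in_chunks(features, centers, chunk=4096):
    """Nearest-center index for every feature without storing the full
    (N, K) distance matrix."""
    labels = np.empty(len(features), dtype=np.int64)
    center_sq = (centers ** 2).sum(axis=1)               # ||c||^2, shape (K,)
    for start in range(0, len(features), chunk):
        block = features[start:start + chunk]
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is the
        # same for every center, so it can be dropped before the argmin.
        partial = center_sq[None, :] - 2.0 * (block @ centers.T)
        labels[start:start + chunk] = partial.argmin(axis=1)   # reduce now
    return labels
```

The output shrinks from N x K distances to N indices, which is the part that has to be kept.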
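Clustering Helps
Pairwise Distance Implementation
[Bar chart: CUDA vs C, axis 0 to 6000]
Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute
$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & & & \vdots \\ \vdots & & \ddots & \\ d(a_m, b_1) & \cdots & & d(a_m, b_n) \end{bmatrix}$$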
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram: blocks of A and B staged in Shared Memory, output C]
Timings on TRECVid 2008 System
[Bar chart, milliseconds on a log scale (1 to 10000), GPU + CPU vs CPU, for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Bar labels: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Rate of Detection: Low, Medium, High
Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
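As an illustration of the first two histogramming methods in the table, the sketch below builds a raw count histogram and a normalized one from the codeword indices of one clip. Reading "Norm" as L1 normalization is an assumption, since the slides do not define it, and the mutual-information variant is not shown. The function name `codeword_histograms` is hypothetical.

```python
import numpy as np

def codeword_histograms(codeword_ids, dictionary_size):
    """Return (raw, normalized) bag-of-words histograms for one clip.

    codeword_ids: integer index of the nearest dictionary center for each
                  feature extracted from the clip.
    """
    raw = np.bincount(codeword_ids, minlength=dictionary_size).astype(float)
    total = raw.sum()
    # Assumption: "Norm" divides by the number of features (L1 normalization).
    norm = raw / total if total > 0 else raw.copy()
    return raw, norm
```

The normalized form removes the dependence on how many interest points a clip happens to produce, which matters when detection rates differ.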
Results (2009)
2008 results in parentheses
Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Multimedia Analysis Competitions and
Evaluations
Multimedia Analysis Competitions and
Evaluations
Moderate size dataset
Training set with labels
Evaluation set without labels
Constrained problem
Detect well defined actions
Detect words or concepts
Well defined metric
Challenges
Algorithm design
Computation
Moderate size dataset
Training set with labels
Evaluation set without labels
Constrained problem
Detect well defined actions
Detect words or concepts
Well defined metric
Challenges
Algorithm design
Computation
Star Challenge Star Challenge
PART I visual data processing PART I visual data processing
What is Star Challenge What is Star Challenge
Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology
Hosted by the Agency for Science Technology and Research (ASTAR) Singapore
A real-world computer vision task which requires large amounts of computation power
Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology
Hosted by the Agency for Science Technology and Research (ASTAR) Singapore
A real-world computer vision task which requires large amounts of computation power
But low rewards But low rewards
56 teams
from 17 countries
Round 1
8 teams
Round 2
7 teams
Round 3
5 teams Grand Final
in Singapore
No rewards No rewards No rewards
No rewards
Only one team
can win
US$100000
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
But we have a team with no fearshellip But we have a team with no fearshellip
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
But we have a team with no fearshellip But we have a team with no fearshellip
Letrsquos go over our experience and
storieshellip
Letrsquos go over our experience and
storieshellip
Outlines Outlines
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks 3 Audio Retrieval Tasks Task Query Target Metric Data Set
AT1 IPA sequence segments that contain the
query IPA sequence
regardless of its languages
Mean Average Precision
25 hours
monolingual
database in
round1
13 hours
multilingual
database in
round3
AT2 an utterance spoken
by different speakers all segments that contain the
query wordphrasesentence
regardless of its spoken
languages
AT3 No queries extract all recurrent segments
which are at least 1 second in
length
F-measure
Xiaodan will talk about this parthelliphellip
3 Video Retrieval Tasks 3 Video Retrieval Tasks
Task Query Target Criteria Metric Data Set
VT1 Single
Image
20
queries
(short)
Video
Segs
All the
similar
Segs
ldquovisually
similarrdquo
Mean Average Precision
20 categories
multiple labels
possible
VT2 Short
Video
Shot
(lt10s)
20
queries
(long)
Video
Segs
All the
similar
Segs
Perceptually
Similar
10 categories
multiple labels
possible
VT3 Videos
with
sound
(3~10s)
Order
of 10K
Category
number
learning the
common
visual
characteristics
Classification accuracy 10(20)
categories
including one
ldquoothersrdquo
category
20 VT1 Categories 20 VT1 Categories
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
10 Categories for VT2 10 Categories for VT2
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3 5 Categories for VT3
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
Examples of Images Examples of Images
More samples More samples
Evaluation Video Data of Round2 Evaluation Video Data of Round2
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
Evaluation Video Data of Round3 Evaluation Video Data of Round3
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Computation Power
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Port MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
Features for Round 2 - VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round 2 - VT2
Character Detector
Harris corner
Morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly
e.g. VT1 101 103 (26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
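The query-expansion steps can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the feature vectors, the Euclidean distance, the nearest-member scoring rule, and the names `expand_query` and `search` are all assumptions made for the example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def expand_query(query_category, dev_set):
    """Step 1: collect every development image labelled with the query's category."""
    return [feat for feat, cat in dev_set if cat == query_category]

def search(query_feat, query_category, dev_set, eval_set, top_k):
    """Step 2: rank evaluation items by distance to the closest expanded-query member."""
    expanded = [query_feat] + expand_query(query_category, dev_set)
    scored = []
    for item_id, feat in eval_set:
        score = min(euclidean(feat, q) for q in expanded)
        scored.append((score, item_id))
    scored.sort()
    return [item_id for _, item_id in scored[:top_k]]

# Toy data: (feature, category) pairs for development, (id, feature) pairs for evaluation.
dev_set = [((0.0, 0.1), 101), ((0.9, 1.0), 102), ((0.1, 0.0), 101)]
eval_set = [("e1", (1.0, 1.0)), ("e2", (0.05, 0.05)), ("e3", (0.5, 0.5))]
result = search((0.2, 0.2), 101, dev_set, eval_set, top_k=2)
print(result)  # ['e2', 'e3']
```

In a real system the development-to-evaluation matching from step 0 would be precomputed once, so that each query only needs lookups rather than fresh distance computations.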
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
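As an illustration of the MAP estimation step, here is a small sketch of the commonly used mean-only adaptation of a GMM toward one image's patches. The slides do not say which parameters were adapted or how, so everything specific here is an assumption: 1-D features for brevity, adaptation of the means only, and a relevance factor of 16.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Shift each UBM mean toward the patches it explains (mean-only MAP adaptation)."""
    k = len(means)
    soft_counts = [0.0] * k   # sum of responsibilities per component
    first_order = [0.0] * k   # sum of responsibility * x per component
    for x in patches:
        likes = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
        total = sum(likes)
        if total == 0.0:
            continue  # patch is numerically unexplained by every component
        for j in range(k):
            gamma = likes[j] / total
            soft_counts[j] += gamma
            first_order[j] += gamma * x
    adapted = []
    for j in range(k):
        if soft_counts[j] == 0.0:
            adapted.append(means[j])  # no evidence: keep the UBM mean
            continue
        alpha = soft_counts[j] / (soft_counts[j] + relevance)
        data_mean = first_order[j] / soft_counts[j]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[j])
    return adapted

# Toy UBM with two components; all 16 patches sit near the first component.
adapted = map_adapt_means([1.0] * 16, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
print(adapted)
```

With 16 patches at 1.0 and a relevance factor of 16, the first mean moves halfway from 0 toward 1, while the second mean, which explains almost none of the data, stays essentially at 10.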
VT1 Performance (2 in 8)
Category: MAP
101 Crowd (>10 people): 0.8419
102 Building with sky as backdrop, clearly visible: 0.977
103 Mobile devices including handphone/PDA: 0.028
107 Person using Computer, both visible: 0.2281
109 Company Trademark including billboard, logo: 0.96
112 Closeup of hand, e.g. using mouse, writing, etc.: 0.4584
113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
115 Food on dishes, plates: 0.2285
116 Face closeup, occupying about 3/4 of screen, frontal or side: 0.9783
117 Traffic Scene, many cars, trucks, road visible: 0.2901
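For reference, Mean Average Precision can be computed along the following lines. This sketch uses the common definition (precision averaged over the ranks of the relevant items, normalized by the number of relevant items); the exact normalization and cutoff used by the Star Challenge organizers are not given in the slides, so the `cutoff` parameter and the toy data are assumptions.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=None):
    """Average of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if cutoff is not None:
        ranked_ids = ranked_ids[:cutoff]
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of the per-query average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Two toy queries: relevant items found at ranks 1 and 3, then at rank 2.
runs = [(["a", "x", "b", "y"], {"a", "b"}), (["x", "c"], {"c"})]
map_score = mean_average_precision(runs)
print(round(map_score, 4))  # 0.6667
```

A category score near 1, such as 102 or 116, means nearly every top-ranked result was relevant; a score near 0, such as 103, means relevant segments were ranked very low or missed.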
VT2 Performance (1 in 8)
Category: MAP
202 Talking face with introductory caption: 0.8432
206 Static or minute camera movement, people(s) walking, legs visible: 0.0581
207 Large camera movement, panning left/right, top/down of a scene: 0.7789
208 Movie ending credit: 0.2782
209 Woman monologue Zhen: 0.9756
Performance of Round 3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using Computer, both visible: 0.7
112 Closeup of hand, e.g. using mouse, writing, etc.: 0.527
116 Face closeup, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Retrieval Target (Video, R=20): VT2 only, AT1 + VT2
202 Face with introductory caption: 1, 0.03
209 Woman monologue: 0.35, 0.1
201 People entering door: NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
Surveillance Event Detection
List of Events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real-time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days
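The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption: not stated in the slide

# Seconds of video per wall-clock minute, converted to frames, scaled to one day.
frames_per_minute = HOURS_UPLOADED_PER_MINUTE * 3600 * FRAMES_PER_SECOND
frames_per_day = frames_per_minute * 60 * 24
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:.3g} frames/day, {days_to_match:.1f} days")
```

At 30 frames per second this gives roughly 3.1 billion frames per day, so about 4.8 days to match 15 billion photos; at 25 frames per second the answer would be closer to 5.8 days.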
Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[System diagram. Recoverable labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector; corners
Dollar: Space-Time Gabor; corners, periodic motion
RSMB: Random Sampling of the Motion Boundary; motion
[Figure: example detections for Laptev, Dollar, RSMB]
More is Good
[Figure: Interest Point Detection Rates]
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\big(0,\ H_\tau(x, y, t-1) - 1\big) & \text{otherwise} \end{cases}$$
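The Motion History Image update sets a pixel to τ wherever motion is detected in the current frame and otherwise decays its previous value by one, never going below zero. A minimal sketch of one update step, using plain nested lists and a hypothetical function name `update_mhi`:

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One MHI step: tau where the mask is set, otherwise decay by 1 (floored at 0)."""
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]

# A single-row "image" in which motion sweeps from left to right over three frames.
mhi = [[0, 0, 0]]
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
for mask in masks:
    mhi = update_mhi(mhi, mask, tau=3)
print(mhi)  # [[1, 2, 3]]
```

The resulting intensity ramp encodes both where motion happened and how recently, which is what makes the image usable as a motion descriptor. Each pixel is updated independently, so the operation parallelizes trivially.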
[Bar chart: milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
Partition the image window into local regions
Histogram the image gradient / optical flow based on the direction and magnitude
Normalize over neighboring regions
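The three steps can be sketched for a single cell and a single block as follows. This is a simplified illustration: it uses hard binning into 9 unsigned-orientation bins and plain L2 block normalization, which are assumptions (the slides give no bin count or normalization scheme), and the same code applies to optical flow if `gx`, `gy` hold the flow components instead of image gradients.

```python
import math

def cell_histogram(gx, gy, bins=9):
    """Magnitude-weighted histogram of unsigned orientation for one local region."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for row_x, row_y in zip(gx, gy):
        for dx, dy in zip(row_x, row_y):
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 180.0
            hist[int(angle / bin_width) % bins] += magnitude
    return hist

def normalize_block(cell_hists, eps=1e-6):
    """Concatenate neighboring cell histograms and L2-normalize the result."""
    flat = [v for hist in cell_hists for v in hist]
    norm = math.sqrt(sum(v * v for v in flat) + eps ** 2)
    return [v / norm for v in flat]

# A 2x2 cell: two pixels with a horizontal gradient, two with a stronger vertical one.
gx = [[1.0, 0.0], [1.0, 0.0]]
gy = [[0.0, 2.0], [0.0, 2.0]]
hist = cell_histogram(gx, gy)
descriptor = normalize_block([hist])
print(hist)
```

Here the horizontal gradients land in bin 0 with total weight 2 and the vertical ones in bin 4 with total weight 4. Practical implementations usually interpolate votes between neighboring bins and cells, which this sketch omits.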
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update the center to be the mean of the associated data points
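The two alternating steps of Lloyd's algorithm can be written compactly as below. This is a minimal sketch with a fixed iteration count and caller-supplied initial centers; the handling of empty clusters (keeping the old center) is an assumption, since the slides do not address it.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def assign(points, centers):
    """Assignment step: index of the closest center for each data point."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]

def update(points, labels, centers):
    """Update step: move each center to the mean of its associated points."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(old)  # assumption: an empty cluster keeps its center
            continue
        new_centers.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return new_centers

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        labels = assign(points, centers)
        centers = update(points, labels, centers)
    return centers, assign(points, centers)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers, labels)
```

The final `labels` list is the vector-quantized form of the data: each high-dimensional feature is replaced by the index of its nearest center.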
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
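The memory problem and the "reduce as you go" remedy can be illustrated on the CPU. With 1 million features and 1000 centers, the full distance matrix has a billion entries; assuming 4-byte floats (an assumption, the slides give no data type), that is about 4 GB. The sketch below computes distances one chunk of points at a time and immediately reduces each row to the index of its minimum, so only one small block is ever held; the function name and chunk size are illustrative.

```python
def nearest_centers(points, centers, chunk_size=2):
    """Closest-center index per point, never materializing the full distance matrix."""
    labels = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        # Distance block for this chunk only: len(chunk) x len(centers).
        block = [
            [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            for p in chunk
        ]
        # Argmin reduction: keep one index per row, then discard the block.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return labels

# Size of the full output the slide says cannot fit (float32 assumed).
full_matrix_bytes = 1_000_000 * 1000 * 4

labels = nearest_centers([(0, 0), (9, 9), (1, 1)], [(0, 0), (10, 10)])
print(labels, full_matrix_bytes / 1e9, "GB")
```

On a GPU the same idea applies per tile: the distance kernel and the argmin reduction are fused or run back to back, so only the label vector leaves the device.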
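Clustering Helps
Pairwise Distance Implementation
[Bar chart: CUDA vs. C, axis 0 to 6000]
Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute
$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ d(a_m, b_1) & \cdots & \cdots & d(a_m, b_n) \end{bmatrix}$$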
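One common way to compute such a pairwise distance matrix is to expand the squared Euclidean distance as ||a||² + ||b||² - 2 a·b, which turns the bulk of the work into a matrix product. The slides do not specify the distance function d, so squared Euclidean distance is an assumption here, as is the function name.

```python
def pairwise_sq_dist(A, B):
    """m x n matrix of squared Euclidean distances between rows of A and rows of B."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(x * x for x in b) for b in B]
    return [
        [
            a_norms[i] + b_norms[j] - 2.0 * sum(x * y for x, y in zip(A[i], B[j]))
            for j in range(len(B))
        ]
        for i in range(len(A))
    ]

A = [(0.0, 0.0), (1.0, 2.0)]
B = [(1.0, 0.0), (1.0, 2.0), (3.0, 4.0)]
D = pairwise_sq_dist(A, B)
print(D)  # [[1.0, 5.0, 25.0], [4.0, 0.0, 8.0]]
```

Every output entry depends only on one row of A and one row of B, so the entries can be computed independently of one another.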
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram, shown in three steps: matrices A and B, output C, shared memory]
Timings on TRECVid 2008 System
[Bar chart: milliseconds on a log axis (1 to 10000) for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance; series GPU + CPU and CPU. Bar labels as extracted: 53, 79, 23, 240, 150 and 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Rate of Detection: Low / Medium / High
Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
Results (2009)
2008 results in parentheses
Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering, programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10,000 classes
Star Challenge
PART I: visual data processing
What is Star Challenge
Competition to Develop World's Next-Generation Multimedia Search Technology
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
A real-world computer vision task which requires large amounts of computation power
But low rewards
56 teams from 17 countries
Round 1
8 teams
Round 2
7 teams
Round 3
5 teams
Grand Final in Singapore
No rewards for the earlier stages
Only one team can win US$100,000
But we have a team with no fears…
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
Let's go over our experience and stories…
Outline
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks
AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of its language
Metric: Mean Average Precision
AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of its spoken language
Metric: Mean Average Precision
AT3
Query: no queries
Target: extract all recurrent segments which are at least 1 second in length
Metric: F-measure
Data Set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3
Xiaodan will talk about this part…
3 Video Retrieval Tasks
VT1
Query: single image; 20 queries
Target: (short) video segs; all the similar segs
Criteria: "visually similar"
Metric: Mean Average Precision
Data Set: 20 categories, multiple labels possible
VT2
Query: short video shot (<10s); 20 queries
Target: (long) video segs; all the similar segs
Criteria: perceptually similar
Metric: Mean Average Precision
Data Set: 10 categories, multiple labels possible
VT3
Query: videos with sound (3~10s); order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data Set: 10 (20) categories, including one "others" category
20 VT1 Categories
100 Not-Applicable, none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart Overlay, including graphs, text, PowerPoint style
107 Person using Computer, both visible
108 Track and field, sports
109 Company Trademark including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic Scene, many cars, trucks, road visible
118 Boat/Ship, over sea, lake
119 PC Webpages, screen of PC visible
120 Airplane
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using Computer, both visible
112 Closeup of hand, e.g. using mouse, writing, etc.
116 Face closeup, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given, and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given, and the participants are required to solve 2.
Examples of Images Examples of Images
More samples More samples
Evaluation Video Data of Round2 Evaluation Video Data of Round2
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
Evaluation Video Data of Round3 Evaluation Video Data of Round3
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Computation Powers Computation Powers
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Time Cost for Video Tasks Time Cost for Video Tasks
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Possible Accelerations for Video Possible Accelerations for Video
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Features for Round2- VT1 Features for Round2- VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Features for Round2-VT2 Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE Grand Unified Feature
Extractor
GUFE Grand Unified Feature
Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Observations Observations 1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1in7) Performance of Round3 (1in7)
Task 2 (VT1)
Target Estimated MAP (R=20)
101 Crowd (gt10 people) 064
102 Building with sky as backdrop clearly visible 1
107 Person using Computer both visible 07
112 Closeup of hand eg using mouse writing etc 0527
116 Face closeup occupying about 34 of screen frontal or side 1
Task3 (AT1 + VT2)
Retrieval Target VT2 only AT1 + VT2
Video R=20
202 face with introductory caption 1 003
209 women monolog 035 01
201 People entering door NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid Working with ViVid
Why parallel Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days
Image Video Processing
Video Decoder
2D3D Convolution
2D3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients Optical Flow
Analysis Vector Quantization
SVM Classifier Evaluation
ViVid ndash Video Computer Vision on Graphics Processors
Download
httpgithubcommertdikmenViVid
TRECVid 2008 System
TRECVid 2009 System TRECVid 2009 System
Features Video
Motion
Shape
Classifier
Event Label
bull Running
bull Pointing
bull Object Put
bull Cell To Ear
Vector
Quantization
Histogram
Interest Points
Video Interest Point Detectors Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
Dollar
RSMB
More is Good More is Good
Interest Point Detection Rates
Video Features Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)
Motion History Images
(Bobbick amp Davis 2001)
Motion History Images
(Bobbick amp Davis 2001)
otherwise
1t)yD(xif
1)1)ty(xHmax(0
τt)y(xH
τ
τ
[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on direction and magnitude
• Normalize over neighboring regions
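The three steps above can be sketched as follows for image gradients. This is a simplified illustration, not the implementation used in the lecture: the cell size, the 9 unsigned orientation bins, and the 2x2-cell L2 block normalization are assumptions borrowed from common HOG practice. For the optical-flow variant, the flow vector at each pixel would replace (gx, gy).

```python
import math


def cell_histogram(img, top, left, size, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    hist = [0.0] * bins
    h, w = len(img), len(img[0])
    for y in range(top, top + size):
        for x in range(left, left + size):
            # Central differences, clamped at the image border.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi          # unsigned: [0, pi)
            b = min(int(angle / math.pi * bins), bins - 1)
            hist[b] += mag
    return hist


def hog(img, cell=4, bins=9, eps=1e-6):
    """Concatenate cell histograms, L2-normalizing each 2x2 block of neighboring cells."""
    rows, cols = len(img) // cell, len(img[0]) // cell
    cells = [[cell_histogram(img, r * cell, c * cell, cell, bins)
              for c in range(cols)] for r in range(rows)]
    feature = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = (cells[r][c] + cells[r][c + 1]
                     + cells[r + 1][c] + cells[r + 1][c + 1])
            norm = math.sqrt(sum(v * v for v in block) + eps)
            feature.extend(v / norm for v in block)
    return feature


# 8x8 image with a vertical edge: all gradient energy is horizontal (bin 0).
image = [[0] * 4 + [10] * 4 for _ in range(8)]
descriptor = hog(image)
print(len(descriptor))  # 36 = one 2x2 block x 4 cells x 9 bins
```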
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
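The two steps of Lloyd's algorithm can be sketched in pure Python. The initialization from randomly sampled data points, the iteration cap, and the toy data are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centers):
    """Index of the nearest center (the vector-quantization code of `point`)."""
    return min(range(len(centers)), key=lambda k: sq_dist(point, centers[k]))


def lloyd(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # initialize from the data
    for _ in range(iters):
        # Assignment step: each point goes to its closest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: each center becomes the mean of its points
        # (an empty group keeps its previous center).
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers = lloyd(points, k=2)
codes = [closest(p, centers) for p in points]
print(codes)
```

The list `codes` is the quantized representation: each feature vector is replaced by the index of its nearest center.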
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit the full distance output in GPU memory
Will need to reduce as the computation proceeds
Need an efficient reduction operator
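To see why the output does not fit, and what "reduce as the computation proceeds" can look like, here is a hedged CPU-side sketch. It assumes 4-byte floats and uses a row-wise argmin as the reduction; the chunk size and function names are illustrative, and a GPU version would perform the same reduction per tile inside the kernel.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_in_chunks(features, centers, chunk_size):
    """Nearest-center index per feature, never holding more than
    chunk_size x len(centers) distances at once."""
    labels = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance tile for this chunk only (chunk_size x n_centers).
        tile = [[sq_dist(f, c) for c in centers] for f in chunk]
        # Row-wise argmin reduction: one integer survives per feature.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in tile)
    return labels


# Size of the full distance matrix from the slide (float32 assumed).
full_bytes = 1_000_000 * 1000 * 4
print(full_bytes / 1e9, "GB")  # 4.0 GB

features = [[0.0], [0.9], [2.1], [3.0], [0.2]]
centers = [[0.0], [1.0], [3.0]]
print(assign_in_chunks(features, centers, chunk_size=2))  # [0, 1, 2, 2, 0]
```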
Clustering Helps
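Pairwise Distance Implementation
[Figure: bar chart comparing the CUDA and C implementations (axis 0 to 6000)]
Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute
$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & \ddots & & \vdots \\ \vdots & & & \\ d(a_m, b_1) & \cdots & & d(a_m, b_n) \end{bmatrix}$$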
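As an illustration of the computation, the sketch below fills the m x n matrix assuming d is the squared Euclidean distance (the slide does not say which distance is used). It uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, which turns the bulk of the work into inner products, the same shape of computation as a matrix multiply.

```python
def pairwise_sq_dist(A, B):
    """m x n matrix of squared Euclidean distances d(a_i, b_j)."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(x * x for x in b) for b in B]
    return [
        [a_norms[i] + b_norms[j] - 2 * sum(x * y for x, y in zip(a, b))
         for j, b in enumerate(B)]
        for i, a in enumerate(A)
    ]


A = [[0, 0], [3, 4]]
B = [[0, 0], [3, 0], [0, 4]]
D = pairwise_sq_dist(A, B)
print(D)  # [[0, 9, 16], [25, 16, 9]]
```

Every entry is independent of the others, which is the property that makes this computation a natural fit for a GPU.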
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Figure, three slides: inputs A and B, shared memory, and output C]
Timings on TRECVid 2008 System
[Figure: log-scale bar chart in milliseconds (axis 1 to 10000) comparing GPU + CPU against CPU for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance. Bar labels as extracted: 53, 79, 23, 240, 150 and 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Rate of detection by dictionary size and histogramming method
Dictionary Size  Histogramming Method  Low    Medium  High
1000             Raw                   0.681  0.804   0.844
1000             Norm                  0.708  0.799   0.840
1000             Mt Inf                0.594  0.804   0.848
500              Raw                   0.675  0.792   0.833
500              Norm                  0.701  0.791   0.825
500              Mt Inf                0.626  0.783   0.819
200              Raw                   0.671  0.772   0.811
200              Norm                  0.701  0.779   0.818
200              Mt Inf                0.614  0.720   0.756
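For readers unfamiliar with the histogramming step, here is a minimal sketch of turning vector-quantized features into a histogram over the dictionary. Reading "Raw" as plain counts and "Norm" as counts normalized to sum to 1 is an assumption, not something the slide defines; the "Mt Inf" variant is not covered here.

```python
def bow_histogram(codes, dictionary_size, normalize=False):
    """Count how often each dictionary entry (visual word) occurs."""
    hist = [0.0] * dictionary_size
    for c in codes:
        hist[c] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [v / total for v in hist]
    return hist


codes = [0, 2, 2, 3, 2]             # hypothetical quantized features of one clip
raw = bow_histogram(codes, 4)
norm = bow_histogram(codes, 4, normalize=True)
print(raw)   # [1.0, 0.0, 3.0, 1.0]
print(norm)  # [0.2, 0.0, 0.6, 0.2]
```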
Results (2009), with 2008 results in parentheses
Event        True Positives  False Alarms  Misses  Min DCR
Pointing     13 (57)         225 (2505)    1050    1.006
Cell To Ear  0 (8)           58 (4005)     194     1.060
Person Runs  1 (0)           38 (314)      106     0.997
Object Put   1 (21)          190 (2703)    620     1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests / Evaluations Experience
Working with realistic data
Engineering / Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification / detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10,000 Classes
What is Star Challenge
Competition to Develop World's Next-Generation Multimedia Search Technology
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
A real-world computer vision task which requires large amounts of computation power
But low rewards
56 teams from 17 countries enter Round 1
8 teams in Round 2
7 teams in Round 3
5 teams in the Grand Final in Singapore
No rewards at the intermediate stages
Only one team can win US$100,000
But we have a team with no fears…
Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar, Sean, Jui-Ting, Zhen, Huazhong, Xi,
Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao
Let's go over our experience and stories…
Outline
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks
AT1. Query: IPA sequence. Target: segments that contain the query IPA sequence, regardless of language.
AT2. Query: an utterance spoken by different speakers. Target: all segments that contain the query word/phrase/sentence, regardless of spoken language.
AT3. No queries. Target: extract all recurrent segments that are at least 1 second in length. Metric: F-measure.
Metric (AT1, AT2): Mean Average Precision
Data set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3
Xiaodan will talk about this part…
3 Video Retrieval Tasks
VT1. Query: single image (20 queries). Target: (short) video segments, all the similar segments. Criteria: "visually similar". Data set: 20 categories, multiple labels possible.
VT2. Query: short video shot (<10 s) (20 queries). Target: (long) video segments, all the similar segments. Criteria: perceptually similar. Data set: 10 categories, multiple labels possible.
VT3. Query: videos with sound (3~10 s), on the order of 10K. Target: category number. Criteria: learning the common visual characteristics. Metric: classification accuracy. Data set: 10 (20) categories, including one "others" category.
Metric (VT1, VT2): Mean Average Precision
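Since Mean Average Precision is the metric for the retrieval tasks, a short sketch of how it is commonly computed may help. This follows the standard textbook definition; the exact variant used by the Star Challenge organizers (for example, truncation at a fixed result length) is not given in the slides, and the segment ids below are made up.

```python
def average_precision(ranked, relevant):
    """AP of one ranked result list given the set of relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank        # precision at this recall point
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["s1", "s7", "s3", "s9"], {"s1", "s3"}),   # hits at ranks 1 and 3
    (["s4", "s2", "s8"], {"s2"}),               # hit at rank 2
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))  # 0.6667
```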
20 VT1 Categories
100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either as an IPA sequence or as a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task combine an IPA sequence/waveform with a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17,289 frames for VT1 in total
40,994 frames for VT2 in total
32,508 pseudo key frames + 8,486 real key frames
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10,580 jpg files
Key frames for VT2: 64,546 files in total, including 10,580 jpg files (true key frames) + 53,966 jpg files (pseudo key frames)
Video: 352×288
Computation Power
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 1: 2 hours (C)
Global Feature 2: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on Trusted-ILLIAC
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Port MATLAB code to C
Parallel computing
GPU Acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
Features for Round 2 - VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round 2 - VT2
Character Detector
Harris corner
Morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization / Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101, 103 (26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
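The query-expansion steps above can be sketched as follows. This is an illustrative reading, not the team's code: it assumes nearest-neighbor matching with squared Euclidean distance on feature vectors, and the function names, 2-D features and item ids are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def expand_query(query_feat, category, dev_set):
    """The query plus every development item labelled with the same category."""
    return [query_feat] + [feat for feat, label in dev_set if label == category]


def search(expanded, eval_set, top_k):
    """Rank evaluation items by distance to their nearest expanded-query member."""
    scored = [
        (min(sq_dist(feat, q) for q in expanded), name)
        for name, feat in eval_set.items()
    ]
    return [name for _, name in sorted(scored)[:top_k]]


# Hypothetical 2-D features; category ids follow the VT1 numbering.
dev_set = [([1.0, 1.0], 101), ([1.2, 0.8], 101), ([9.0, 9.0], 104)]
eval_set = {"e1": [1.1, 0.9], "e2": [8.8, 9.1], "e3": [0.0, 0.5], "e4": [5.0, 5.0]}

expanded = expand_query([0.1, 0.1], 101, dev_set)
print(search(expanded, eval_set, top_k=2))  # ['e1', 'e3']
```

Here `e1` is far from the query itself but close to the development images of category 101, which is exactly the kind of item that expansion is meant to recover.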
Algorithms
Motivation: use a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pairwise image distances based on a patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
VT1 Performance (2 in 8)
Category: MAP
• 101 Crowd (>10 people): 0.8419
• 102 Building with sky as backdrop, clearly visible: 0.977
• 103 Mobile devices including handphone/PDA: 0.028
• 107 Person using computer, both visible: 0.2281
• 109 Company trademark, including billboard, logo: 0.96
• 112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
• 113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
• 115 Food on dishes, plates: 0.2285
• 116 Face close-up, occupying about 3/4 of screen, frontal or side: 0.9783
• 117 Traffic scene, many cars, trucks, road visible: 0.2901
VT2 Performance (1 in 8)
Category: MAP
• 202 Talking face with introductory caption: 0.8432
• 206 Static or minute camera movement, people(s) walking, legs visible: 0.0581
• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789
• 208 Movie ending credit: 0.2782
• 209 Woman monologue (Zhen): 0.9756
Performance of Round 3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2), video R=20
Retrieval Target                     VT2 only  AT1 + VT2
202 Face with introductory caption   1         0.03
209 Woman monologue                  0.35      0.1
201 People entering door             NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task: Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
List of Events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Detection of Opposing Flow Event
[Figure: block diagram with Regional Averaging, Door Open/Close Information, Thresholding Rule, and Event Detections]
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
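The "lazy pull", "per-frame referencing" and "caches" items describe a data-flow style that can be illustrated with a small sketch. The `Node` class below is hypothetical and is not ViVid's actual API: each stage computes a frame only when a downstream stage asks for it by index, and remembers the result.

```python
class Node:
    """A processing stage that computes a frame only when asked, then caches it."""

    def __init__(self, func, *sources):
        self.func = func
        self.sources = sources
        self.cache = {}           # frame index -> result (unbounded in this sketch)
        self.evaluations = 0

    def get_frame(self, index):
        if index not in self.cache:
            # Lazy pull: request only the frames this stage needs from upstream.
            inputs = [src.get_frame(index) for src in self.sources]
            self.cache[index] = self.func(index, *inputs)
            self.evaluations += 1
        return self.cache[index]


# Hypothetical graph: decoder -> blur -> two features that both consume blur.
decoder = Node(lambda i: [i, i + 1, i + 2])              # stand-in for a video decoder
blur = Node(lambda i, frame: sum(frame) / len(frame), decoder)
feature_a = Node(lambda i, b: b * 2, blur)
feature_b = Node(lambda i, b: b + 1, blur)

print(feature_a.get_frame(5), feature_b.get_frame(5))    # 12.0 7.0
print(blur.evaluations, decoder.evaluations)             # 1 1
```

Both features share the blurred frame, so the decoder and the blur stage each run once for frame 5. A real system would bound the cache size, which this sketch does not.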
Motivations
Most operations are highly local
Applications with real-time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
But low rewards But low rewards
56 teams
from 17 countries
Round 1
8 teams
Round 2
7 teams
Round 3
5 teams Grand Final
in Singapore
No rewards No rewards No rewards
No rewards
Only one team
can win
US$100000
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
But we have a team with no fearshellip But we have a team with no fearshellip
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
But we have a team with no fearshellip But we have a team with no fearshellip
Letrsquos go over our experience and
storieshellip
Letrsquos go over our experience and
storieshellip
Outlines Outlines
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks 3 Audio Retrieval Tasks Task Query Target Metric Data Set
AT1 IPA sequence segments that contain the
query IPA sequence
regardless of its languages
Mean Average Precision
25 hours
monolingual
database in
round1
13 hours
multilingual
database in
round3
AT2 an utterance spoken
by different speakers all segments that contain the
query wordphrasesentence
regardless of its spoken
languages
AT3 No queries extract all recurrent segments
which are at least 1 second in
length
F-measure
Xiaodan will talk about this parthelliphellip
3 Video Retrieval Tasks 3 Video Retrieval Tasks
Task Query Target Criteria Metric Data Set
VT1 Single
Image
20
queries
(short)
Video
Segs
All the
similar
Segs
ldquovisually
similarrdquo
Mean Average Precision
20 categories
multiple labels
possible
VT2 Short
Video
Shot
(lt10s)
20
queries
(long)
Video
Segs
All the
similar
Segs
Perceptually
Similar
10 categories
multiple labels
possible
VT3 Videos
with
sound
(3~10s)
Order
of 10K
Category
number
learning the
common
visual
characteristics
Classification accuracy 10(20)
categories
including one
ldquoothersrdquo
category
20 VT1 Categories 20 VT1 Categories
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
10 Categories for VT2 10 Categories for VT2
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either as an IPA sequence or as a waveform; participants are required to solve 4.
2) Video search (VT1)
5 queries will be given; participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
Each query combines an IPA sequence/waveform with a video category. Participants must retrieve the data segments whose sound matches the given IPA sequence/waveform and whose video matches the given category. 3 queries will be given; participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames + 8486 real key frames
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 jpg files in total, including 10580 true key frames + 53966 pseudo key frames
Video: 352x288
Computation Power
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 cluster: 32 dual-core 2.8G 64-bit CPUs
CSL cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data decompression: 15 minutes
Video format conversion: 2 hours
Video segmentation (for VT2): 40 minutes
Sound track extraction: 30 minutes
Feature extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours (on Trusted-ILLIAC)
Classifier training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
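The "one load" idea can be sketched as a single pass in which each file is read once and handed to every extractor. The names `extract_all`, `load` and `extractors` are hypothetical and are not part of the team's code.

```python
def extract_all(paths, load, extractors):
    """Run every feature extractor on each file after loading it once.

    paths      -- iterable of file names
    load       -- callable that reads one file and returns its pixel data
    extractors -- dict mapping a feature name to a callable(image)
    """
    results = {name: [] for name in extractors}
    for path in paths:
        image = load(path)  # the expensive step happens once per file
        for name, extractor in extractors.items():
            results[name].append(extractor(image))
    return results
```

With F extractors this performs one load per file instead of F, which matters when load time dominates.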
Features for Round 2 - VT1
Image features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic feature
Features for Round 2 - VT2
Character detector
Harris corner
Morphological operations
Optical flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on NN
Feature normalization/combination
Result visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
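A minimal sketch of the query-expansion steps, assuming every image is already a fixed-length feature vector. Scoring an evaluation item by its squared Euclidean distance to the nearest expanded-query image is an assumption made for illustration; the slide only says the retrieval is NN based. All function and parameter names are hypothetical.

```python
import numpy as np

def expanded_query_search(query_feature, query_category,
                          dev_features, dev_labels, eval_features, top_k=20):
    """Return indices of the top_k evaluation items for an expanded query."""
    # Step 1: expand the query with all development images of the same category.
    same_category = dev_features[dev_labels == query_category]
    expanded = np.vstack([query_feature[None, :], same_category])
    # Step 2: distance from each evaluation item to each expanded-query image.
    diff = eval_features[:, None, :] - expanded[None, :, :]
    dist = np.sum(diff * diff, axis=2)
    # Assumed scoring rule: distance to the nearest expanded-query image.
    score = dist.min(axis=1)
    return np.argsort(score, kind="stable")[:top_k]
```

The preprocessing step on the slide corresponds to computing `dist` between the evaluation and development sets once, ahead of time, so that each query only needs a row selection and a minimum.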
Algorithms
Motivation: use a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
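Step 2 can be illustrated with the common mean-only MAP adaptation of a diagonal-covariance GMM. This is a sketch under assumptions: the slide does not say which parameters were adapted, and the relevance factor of 16 is only a conventional default. The patch kernel and within-class covariance normalization of step 3 are not covered here.

```python
import numpy as np

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Adapt UBM means to one image's patches (diagonal covariances).

    patches: (N, D); weights: (K,); means, variances: (K, D).
    """
    # Log-density of every patch under every UBM component, shape (N, K).
    diff = patches[:, None, :] - means[None, :, :]
    log_prob = -0.5 * np.sum(diff * diff / variances
                             + np.log(2.0 * np.pi * variances), axis=2)
    log_post = np.log(weights) + log_prob
    # Posterior responsibilities, normalized stably in the log domain.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Soft counts and first-order statistics per component.
    n_k = post.sum(axis=0)
    first = post.T @ patches
    # Interpolate between the image's data mean and the UBM mean.
    alpha = n_k / (n_k + relevance)
    data_mean = first / np.maximum(n_k, 1e-12)[:, None]
    return alpha[:, None] * data_mean + (1.0 - alpha)[:, None] * means
```

Components that see few patches stay close to the UBM, so every image is described by a vector of the same size regardless of how many patches it has.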
VT1 Performance (2 in 8)
Category: MAP
101 Crowd (>10 people): 0.8419
102 Building with sky as backdrop, clearly visible: 0.977
103 Mobile devices including handphone/PDA: 0.028
107 Person using computer, both visible: 0.2281
109 Company trademark including billboard, logo: 0.96
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
115 Food on dishes, plates: 0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side: 0.9783
117 Traffic scene: many cars, trucks; road visible: 0.2901
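For reference, a sketch of how average precision is commonly computed for one ranked result list, and how MAP averages it over queries. Normalizing by the number of relevant items actually retrieved is an assumption; the competition's exact definition (for example, its cut-off R) is not given on the slide.

```python
def average_precision(relevance):
    """Average precision of one ranked list (1 = relevant, 0 = not relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)
```

A list with relevant items at ranks 1 and 3 gets (1/1 + 2/3) / 2, so early mistakes cost more than late ones.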
VT2 Performance (1 in 8)
Category: MAP
202 Talking face with introductory caption: 0.8432
206 Static or minute camera movement, people walking, legs visible: 0.0581
207 Large camera movement, panning left/right, top/down of a scene: 0.7789
208 Movie ending credit: 0.2782
209 Woman monologue (Zhen): 0.9756
Performance of Round 3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Video, R=20
Retrieval target                      VT2 only   AT1 + VT2
202 Face with introductory caption    1          0.03
209 Woman monologue                   0.35       0.1
201 People entering door              NA
We are
2nd in audio search
4th in video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our task: Surveillance Event Detection
Surveillance Event Detection
The dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
Surveillance Event Detection
List of events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Detection of Opposing Flow Event
[Diagram blocks: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
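The lazy-pull data flow with per-frame caches can be sketched as below. This is an illustration of the idea only: `LazyStage` and its methods are hypothetical names and do not reproduce ViVid's actual API.

```python
class LazyStage:
    """A pipeline stage that computes a frame's result only when it is pulled."""

    def __init__(self, compute, source=None):
        self.compute = compute  # function applied to the upstream result
        self.source = source    # upstream stage, or None for the first stage
        self.cache = {}         # per-frame cache: frame index -> result
        self.calls = 0          # number of times compute actually ran

    def get(self, frame_index):
        if frame_index not in self.cache:
            if self.source is not None:
                upstream = self.source.get(frame_index)  # pull from upstream
            else:
                upstream = frame_index
            self.cache[frame_index] = self.compute(upstream)
            self.calls += 1
        return self.cache[frame_index]


# Toy chain: "decoder" -> "gradient" -> "energy"; lists stand in for frames.
decoder = LazyStage(lambda i: [i, i + 1, i + 2])
gradient = LazyStage(lambda f: [b - a for a, b in zip(f, f[1:])], source=decoder)
energy = LazyStage(lambda g: sum(v * v for v in g), source=gradient)
```

Asking `energy.get(5)` pulls frame 5 through the chain once; repeating the request is served from the cache, and frames that nobody asks for are never decoded.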
Motivations
Most operations are highly local
Applications with real-time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia indexing
Visual computing is here
Imaging and photogrammetry
Pattern recognition and statistical learning
Object detection and recognition
Dynamic vision
Interactive and Internet vision
Working with ViVid
Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube replicates the entire Facebook image DB every ~5 days
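The ~5 day figure can be checked with a few lines of arithmetic. The frame rate of 30 frames per second is an assumption; the slide does not state it.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption, not stated on the slide

# Frames contained in the video uploaded during one wall-clock minute.
frames_per_minute = HOURS_UPLOADED_PER_MINUTE * 3600 * FRAMES_PER_SECOND
# Frames uploaded per day (1440 minutes).
frames_per_day = frames_per_minute * 60 * 24
days_to_match = FACEBOOK_PHOTOS / frames_per_day
```

That is about 3.1 billion frames per day, so roughly 4.8 days to reach 15 billion.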
Image/Video Processing
Video decoder
2D/3D convolution
2D/3D Fourier transform
Optical flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector quantization
SVM classifier evaluation
ViVid - Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Diagram blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]
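The vector quantization and histogram blocks can be sketched as a bag-of-words step: each descriptor is mapped to its nearest dictionary entry and the counts form the feature passed to the classifier. Squared Euclidean distance and L1 normalization are assumptions made for illustration; the names are hypothetical.

```python
import numpy as np

def bag_of_words_histogram(descriptors, dictionary, normalize=True):
    """Quantize descriptors (N, D) against a dictionary (K, D) and count words."""
    # Squared distances via ||a||^2 - 2 a.b + ||b||^2, shape (N, K).
    d2 = (np.sum(descriptors ** 2, axis=1)[:, None]
          - 2.0 * descriptors @ dictionary.T
          + np.sum(dictionary ** 2, axis=1)[None, :])
    words = np.argmin(d2, axis=1)  # index of the nearest dictionary entry
    hist = np.bincount(words, minlength=dictionary.shape[0]).astype(float)
    if normalize and hist.sum() > 0:
        hist /= hist.sum()         # counts become relative frequencies
    return hist
```

With `normalize=False` the output is the raw count histogram, which depends on how many interest points were detected in the clip.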
Video Interest Point Detectors
Laptev: 3D Harris corner detector (corners)
Dollar: space-time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
More is Good
[Figure: Interest Point Detection Rates]
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
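A sketch of the Motion History Image update: a pixel is set to τ wherever the motion mask D fires and otherwise decays by one per frame, never dropping below zero. How D is obtained (for example, by thresholded frame differencing) is outside this sketch, and the function name is hypothetical.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau):
    """One time step of the motion history image.

    mhi         -- previous history image H(x, y, t-1)
    motion_mask -- boolean array, True where D(x, y, t) = 1
    tau         -- value assigned to pixels that moved in this frame
    """
    decayed = np.maximum(0, mhi - 1)           # older motion fades by 1
    return np.where(motion_mask, tau, decayed)
```

After several frames the image encodes where motion happened and how recently, with brighter values for more recent motion.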
[Bar chart, milliseconds per frame (axis 0 to 300): C, CUDA (distance), CUDA (distance + argmin), CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
Partition the image window into local regions
Histogram the image gradient/optical flow based on the direction and magnitude
Normalize over neighboring regions
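The three steps can be sketched for the gradient case as follows. Cell size, number of bins, unsigned orientation, hard binning, and L2 normalization over 2x2 blocks are all assumptions chosen for brevity. For optical flow, the same code applies with the flow components in place of `gx` and `gy`.

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Per-cell orientation histograms weighted by gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))  # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % np.pi   # unsigned angle in [0, pi)
    bin_index = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)
    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_index[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    return hist

def normalize_blocks(hist, eps=1e-6):
    """L2-normalize each 2x2 group of neighboring cells."""
    rows, cols, bins = hist.shape
    blocks = np.zeros((rows - 1, cols - 1, 4 * bins))
    for r in range(rows - 1):
        for c in range(cols - 1):
            v = hist[r:r + 2, c:c + 2].ravel()
            blocks[r, c] = v / np.sqrt(np.sum(v * v) + eps * eps)
    return blocks
```

Every cell is processed independently of the others, which is the kind of locality that makes the descriptor a good fit for a GPU.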
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
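A compact sketch of Lloyd's algorithm. It assumes the initial centers are supplied by the caller and runs a fixed number of iterations; a center that attracts no points is left where it is, which is one of several possible conventions.

```python
import numpy as np

def lloyd_kmeans(data, centers, iterations=10):
    """Alternate nearest-center assignment and mean update.

    data: (N, D); centers: (K, D) initial guesses.
    Returns the final centers and the labels from the last assignment step.
    """
    centers = centers.astype(float).copy()
    for _ in range(iterations):
        # Assignment: squared distance from every point to every center.
        d2 = np.sum((data[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d2, axis=1)
        # Update: each center moves to the mean of its assigned points.
        for k in range(centers.shape[0]):
            members = data[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers, labels
```

The assignment step is a pairwise distance computation followed by an argmin, and it dominates the cost for large N and K.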
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as the computation proceeds
Need an efficient reduction operator
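The memory problem is easy to quantify: 1 million features against 1000 centers gives 10^9 distances, about 4 GB in single precision. A sketch of the "reduce as the computation proceeds" idea follows, written in NumPy rather than CUDA: distances are computed one block of features at a time and immediately reduced to an argmin, so the full matrix never exists. The chunk size and function name are hypothetical.

```python
import numpy as np

def nearest_center_chunked(features, centers, chunk_size=4096):
    """Index of the closest center for each feature, one chunk at a time."""
    n = features.shape[0]
    labels = np.empty(n, dtype=np.int64)
    center_sq = np.sum(centers * centers, axis=1)
    for start in range(0, n, chunk_size):
        block = features[start:start + chunk_size]
        # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2; the ||a||^2 term is constant
        # per row, so it can be dropped without changing the argmin.
        partial = center_sq[None, :] - 2.0 * (block @ centers.T)
        # Reduce immediately: keep one index per feature, discard the block.
        labels[start:start + chunk_size] = np.argmin(partial, axis=1)
    return labels
```

Only a `chunk_size` by K block of distances is alive at any time, and the stored output shrinks from N x K values to N indices.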
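Clustering Helps
Pairwise Distance Implementation
[Bar chart: CUDA vs C, axis 0 to 6000]
Given
\[
A = [a_1, \ldots, a_m], \qquad B = [b_1, \ldots, b_n]
\]
Compute
\[
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & d(a_m, b_2) & \cdots & d(a_m, b_n)
\end{bmatrix}
\]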
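A sketch of the m x n distance matrix for the case where d is the squared Euclidean distance, which is an assumption since the slide leaves d unspecified. Expanding the square turns the bulk of the work into one matrix product, an operation with the regular, data-local access pattern that GPUs handle well.

```python
import numpy as np

def pairwise_sq_dist(A, B):
    """Matrix D with D[i, j] = ||A[i] - B[j]||^2 for A (m, dim) and B (n, dim)."""
    a_sq = np.sum(A * A, axis=1)[:, None]   # (m, 1) squared norms of rows of A
    b_sq = np.sum(B * B, axis=1)[None, :]   # (1, n) squared norms of rows of B
    D = a_sq - 2.0 * (A @ B.T) + b_sq       # expansion of the squared difference
    return np.maximum(D, 0.0)               # clip tiny negatives from rounding
```

Each output entry depends only on one row of A and one row of B, so all m x n entries can be computed independently.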
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram labels: Shared Memory, A, B, C]
Timings on TRECVid 2008 System
[Bar chart, milliseconds on a log scale (1 to 10000), GPU + CPU vs CPU, for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Rate of detection (Low / Medium / High) by dictionary size and histogramming method
Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
Results (2009)
2008 results in parentheses
Event         True Positives   False Alarms   Misses   Min DCR
Pointing      13 (57)          225 (2505)     1050     1.006
Cell To Ear   0 (8)            58 (4005)      194      1.060
Person Runs   1 (0)            38 (314)       106      0.997
Object Put    1 (21)           190 (2703)     620      1.020
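For readers unfamiliar with the DCR column, here is a hedged sketch of a detection cost rate that combines the miss probability with the false-alarm rate per hour. The default weights (miss cost 10, false-alarm cost 1, target rate 20 events per hour) are recalled from the TRECVid event detection evaluation plan and should be treated as assumptions to verify against it. The `hours` argument is hypothetical, and since the table reports the minimum DCR over detection thresholds, this function is not expected to reproduce those numbers from the counts shown.

```python
def detection_cost_rate(misses, true_positives, false_alarms, hours,
                        cost_miss=10.0, cost_fa=1.0, r_target=20.0):
    """Normalized detection cost rate: P_miss + beta * R_fa."""
    p_miss = misses / (misses + true_positives)  # fraction of events missed
    r_fa = false_alarms / hours                  # false alarms per hour of video
    beta = cost_fa / (cost_miss * r_target)      # weight of false alarms (assumed)
    return p_miss + beta * r_fa
```

A system that outputs nothing scores exactly 1, so values close to 1 in the table mean the detectors were only marginally better than staying silent.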
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/evaluations experience
Working with realistic data
Engineering, programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task: Semantic indexing (SIN)
Task: Known-item search (KIS)
Task: Interactive surveillance event detection (SED)
Task: Instance search (INS)
Task: Multimedia event detection (MED)
Task: Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person layout
Action classification
ImageNet Large Scale Visual Recognition
10000 classes
But we have a team with no fears...
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
Let's go over our experience and stories...
Outline
Problems of visual retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks
AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of language
Metric: Mean Average Precision
AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of spoken language
AT3
Query: no queries
Target: extract all recurrent segments that are at least 1 second in length
Metric: F-measure
Data set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3
Xiaodan will talk about this part...
3 Video Retrieval Tasks
VT1
Query: single image, 20 queries
Target: (short) video segs, all the similar segs
Criteria: "visually similar"
Metric: Mean Average Precision
Data set: 20 categories, multiple labels possible
VT2
Query: short video shot (<10s), 20 queries
Target: (long) video segs, all the similar segs
Criteria: perceptually similar
Data set: 10 categories, multiple labels possible
VT3
Query: videos with sound (3~10s), order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data set: 10(20) categories, including one "others" category
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
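The ~5 day figure can be checked with a few lines of arithmetic. The sketch below uses the upload rate and photo count quoted on the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
# Back-of-the-envelope check of the "~5 days" claim.
# ASSUMPTION: 30 frames per second (not stated on the slide).
FPS = 30
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide


def days_to_match(photos=FACEBOOK_PHOTOS,
                  hours_per_minute=HOURS_UPLOADED_PER_MINUTE,
                  fps=FPS):
    """Days of uploads needed for the frame count to equal the photo count."""
    frames_per_minute = hours_per_minute * 3600 * fps
    frames_per_day = frames_per_minute * 60 * 24
    return photos / frames_per_day


print(round(days_to_match(), 1))  # roughly 4.8 days under the 30 fps assumption
```

At 25 frames per second the same calculation gives a little under 6 days, so the slide's estimate holds for common frame rates.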
Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[System diagram labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label]
Event labels:
• Running
• Pointing
• Object Put
• Cell To Ear
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images
(Bobick & Davis 2001)
$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
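The motion history update can be sketched in a few lines of NumPy. This is a minimal illustration, not the ViVid implementation: the function names are hypothetical, and the motion indicator D is assumed here to be a thresholded absolute frame difference, which the slide does not specify.

```python
import numpy as np


def motion_mask_from_frames(prev_frame, frame, threshold=15):
    """ASSUMPTION: D(x, y, t) is a thresholded absolute frame difference."""
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    return diff > threshold


def update_mhi(prev_mhi, motion_mask, tau):
    """One update step: tau where motion is detected, otherwise decay by 1 (floored at 0)."""
    decayed = np.maximum(0, prev_mhi - 1)
    return np.where(motion_mask, tau, decayed)


# Tiny usage example on a 2x2 "image".
H = np.zeros((2, 2), dtype=np.int32)
moving = np.array([[True, False], [False, False]])
H = update_mhi(H, moving, tau=3)                        # pixel (0, 0) is set to tau
H = update_mhi(H, np.zeros((2, 2), dtype=bool), tau=3)  # no motion: value decays by 1
```

Recently moving pixels therefore carry the largest values, and the image fades to zero after tau frames without motion.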
[Bar chart, milliseconds per frame (axis 0 to 300): C, CUDA (distance), CUDA (distance + argmin), CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
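The three steps above can be sketched with NumPy as follows. This is an illustrative simplification rather than the code used in the system: the cell size, the number of bins, the 2x2 block layout, and the L2 normalization are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def cell_orientation_histograms(image, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude."""
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bin_index = np.minimum((angle / np.pi * bins).astype(int), bins - 1)
    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_index[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    return hist


def normalize_blocks(hist, eps=1e-6):
    """L2-normalize every 2x2 block of neighboring cells and concatenate them."""
    rows, cols, _ = hist.shape
    blocks = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = hist[r:r + 2, c:c + 2].ravel()
            blocks.append(block / np.sqrt(np.sum(block ** 2) + eps ** 2))
    return np.concatenate(blocks) if blocks else np.zeros(0)
```

For a histogram of optical flow, the same binning can be applied with the two flow components in place of `gx` and `gy`.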
K-Means Clustering
Vector quantization
Turns high dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update the center to be the mean of the associated data points
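The two alternating steps of Lloyd's algorithm can be written compactly in NumPy. This is a minimal CPU sketch for illustration; the random initialization, the fixed iteration count, and the function names are assumptions rather than details from the slides.

```python
import numpy as np


def assign_to_centers(data, centers):
    """Index of the closest center (squared Euclidean distance) for each point."""
    distances = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)


def lloyd_kmeans(data, k, iterations=20, seed=0):
    """Plain Lloyd iterations: assign points, then move centers to cluster means."""
    data = np.asarray(data, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # ASSUMPTION: centers are initialized from k distinct random data points.
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iterations):
        labels = assign_to_centers(data, centers)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return centers, assign_to_centers(data, centers)
```

The returned labels are the vector-quantized form of the data: each high dimensional feature is replaced by the index of its nearest center.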
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
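To see why the output does not fit, note that 1 million features against 1000 centers gives 10^9 distances, about 4 GB in single precision. One way around this is to compute distances in chunks and reduce each chunk to its argmin immediately. The NumPy sketch below illustrates that idea on the CPU; it is a stand-in for the GPU kernel, and the chunk size and function name are assumptions.

```python
import numpy as np


def nearest_center_chunked(features, centers, chunk=4096):
    """Nearest center index and squared distance per feature, without ever
    materializing the full (n_features x n_centers) distance matrix."""
    center_sq = (centers ** 2).sum(axis=1)
    best_index = np.empty(len(features), dtype=np.int64)
    best_dist = np.empty(len(features), dtype=np.float64)
    for start in range(0, len(features), chunk):
        block = features[start:start + chunk]
        # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, evaluated for one chunk only
        d = ((block ** 2).sum(axis=1)[:, None]
             - 2.0 * block @ centers.T
             + center_sq[None, :])
        idx = d.argmin(axis=1)  # the reduction step: keep only the minimum
        best_index[start:start + chunk] = idx
        best_dist[start:start + chunk] = np.maximum(
            d[np.arange(len(block)), idx], 0.0)  # clamp small negative round-off
    return best_index, best_dist
```

Only one chunk of distances is alive at a time, so peak memory scales with the chunk size rather than with the number of features.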
Clustering Helps
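Pairwise Distance Implementation
Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute
$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & d(a_2, b_2) & \cdots & d(a_2, b_n) \\
\vdots & \vdots & \ddots & \vdots \\
d(a_m, b_1) & d(a_m, b_2) & \cdots & d(a_m, b_n)
\end{bmatrix}
$$
[Bar chart: CUDA vs. C, horizontal axis 0 to 6000]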
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram sequence, labels: A, B, C, Shared Memory]
Timings on TRECVid 2008 System
[Bar chart, milliseconds on a log axis (1 to 10000): Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance; series GPU + CPU vs. CPU. Bar labels as extracted: 53 79, 23, 240 150, 53 79, 3030 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low      Medium   High
1000              Raw                    0.681    0.804    0.844
                  Norm                   0.708    0.799    0.840
                  Mt Inf                 0.594    0.804    0.848
500               Raw                    0.675    0.792    0.833
                  Norm                   0.701    0.791    0.825
                  Mt Inf                 0.626    0.783    0.819
200               Raw                    0.671    0.772    0.811
                  Norm                   0.701    0.779    0.818
                  Mt Inf                 0.614    0.720    0.756
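For context, the histogramming step that follows vector quantization can be sketched as below. The interpretation is an assumption: "Raw" is taken to mean plain word counts and "Norm" to mean counts scaled to sum to 1. The "Mt Inf" variant is not covered, since the slides do not define it, and the function name is hypothetical.

```python
import numpy as np


def bag_of_words_histogram(word_ids, dictionary_size, normalize=False):
    """Histogram of dictionary word indices for one video segment.

    ASSUMPTION: normalize=False corresponds to "Raw" and normalize=True
    to "Norm" (L1 normalization) in the table above.
    """
    hist = np.bincount(np.asarray(word_ids, dtype=np.int64),
                       minlength=dictionary_size).astype(np.float64)
    if normalize and hist.sum() > 0:
        hist /= hist.sum()
    return hist


raw = bag_of_words_histogram([0, 2, 2, 3], dictionary_size=5)
norm = bag_of_words_histogram([0, 2, 2, 3], dictionary_size=5, normalize=True)
```

Normalizing removes the dependence on how many interest points a segment contains, which matters when the interest point detection rate varies.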
Results (2009), with 2008 results in parentheses
Event          True Positives   False Alarm    Miss    Min DCR
Pointing       13 (57)          225 (2505)     1050    1.006
Cell To Ear    0 (8)            58 (4005)      194     1.060
Person Runs    1 (0)            38 (314)       106     0.997
Object Put     1 (21)           190 (2703)     620     1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering, Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi
Vong Xu Mert Dennis Jason Andrey Yuxiao
But we have a team with no fears…
Let's go over our experience and stories…
Outline
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks
Task AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of language
Task AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of spoken language
Task AT3
Query: no queries
Target: extract all recurrent segments that are at least 1 second in length
Metric: Mean Average Precision (AT1, AT2); F-measure (AT3)
Data set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3
Xiaodan will talk about this part…
3 Video Retrieval Tasks
Task VT1
Query: single image; 20 queries
Target: (short) video segments; all the similar segments
Criteria: "visually similar"
Metric: Mean Average Precision
Data set: 20 categories, multiple labels possible
Task VT2
Query: short video shot (<10 s); 20 queries
Target: (long) video segments; all the similar segments
Criteria: perceptually similar
Data set: 10 categories, multiple labels possible
Task VT3
Query: videos with sound (3~10 s); order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data set: 10 (20) categories, including one "others" category
20 VT1 Categories
100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices, including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object: person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either as an IPA sequence or as a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task combine an IPA sequence/waveform with a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total: 32508 pseudo key frames, 8486 real key frames
Evaluation Video Data of Round3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288
Computation Powers
Work Stations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extracting all the features after one load
Features for Round2-VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round2-VT2
Character Detector
Harris corner
Morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly
e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
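The query expansion procedure can be sketched as a nearest-neighbor search over precomputed feature vectors. This is a minimal illustration under stated assumptions: squared Euclidean distance, scoring each evaluation item by its distance to the closest expanded query, and the function and argument names are all hypothetical rather than taken from GUFE.

```python
import numpy as np


def retrieve_with_query_expansion(query_category, dev_features, dev_labels,
                                  eval_features, top_k=20):
    """Rank evaluation items against all development items of the query's category."""
    # Step 1: the expanded query is every development sample with the same category.
    expanded = dev_features[dev_labels == query_category]
    if len(expanded) == 0:
        return np.array([], dtype=np.int64)
    # Step 2: score each evaluation item by its distance to the nearest expanded query.
    # ASSUMPTION: squared Euclidean distance with min-fusion over the expanded set.
    d = ((eval_features[:, None, :] - expanded[None, :, :]) ** 2).sum(axis=2)
    score = d.min(axis=1)
    # Output: indices of the top_k closest evaluation items, best first.
    return np.argsort(score, kind="stable")[:top_k]
```

Because the development-to-evaluation distances do not depend on the query, they can be computed once in the preprocessing step and reused for every query.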
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
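Step 2 can be illustrated with the common mean-only form of MAP adaptation. The sketch below is an assumption about the general technique, not the team's implementation: it assumes a diagonal-covariance UBM, adapts only the component means, and uses a hypothetical relevance factor of 16.

```python
import numpy as np


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to one image's patches.

    patches: (N, D) patch descriptors of one image
    weights: (K,), means: (K, D), variances: (K, D) UBM parameters
    """
    # log N(x | mu_k, diag(var_k)) for every patch/component pair
    diff = patches[:, None, :] - means[None, :, :]
    log_prob = -0.5 * (np.sum(diff ** 2 / variances[None], axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :])
    # Component posteriors per patch, computed stably in the log domain
    log_post = np.log(weights)[None, :] + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Sufficient statistics: soft counts and per-component data means
    n_k = post.sum(axis=0)
    data_mean = (post.T @ patches) / np.maximum(n_k, 1e-12)[:, None]
    # Interpolate between the data mean and the UBM mean
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * data_mean + (1.0 - alpha) * means
```

Components that see few patches stay close to the UBM, so every image is described by a set of means of the same size, which is what makes pair-wise image distances straightforward to compute.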
VT1 Performance (2 in 8)
Category                                                              MAP
101 Crowd (>10 people)                                                0.8419
102 Building with sky as backdrop, clearly visible                    0.977
103 Mobile devices, including handphone/PDA                           0.028
107 Person using computer, both visible                               0.2281
109 Company trademark, including billboard, logo                      0.96
112 Close-up of hand, e.g. using mouse, writing, etc.                 0.4584
113 Business meeting (>2 people), mostly seated down, table visible   0.0644
115 Food on dishes, plates                                            0.2285
116 Face close-up occupying about 3/4 of screen, frontal or side      0.9783
117 Traffic scene, many cars, trucks, road visible                    0.2901
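For reference, mean average precision over ranked result lists can be computed as below. This is the standard textbook definition, sketched with hypothetical function names; the exact variant used by the Star Challenge organizers (for example, how the result cutoff R enters the denominator) is not given in the slides.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average of the precision values at each rank where a relevant item appears.

    ranked_relevance: iterable of 0/1 flags, best-ranked result first.
    total_relevant: number of relevant items in the database; if omitted,
    the number of relevant items found in the list is used (an assumption).
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(runs):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0
```

A list with relevant results at ranks 1 and 3 scores (1/1 + 2/3) / 2, so early mistakes cost more than late ones.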
VT2 Performance (1 in 8)
Category                                                               MAP
202 Talking face with introductory caption                             0.8432
206 Static or minute camera movement, people(s) walking, legs visible  0.0581
207 Large camera movement, panning left/right, top/down of a scene     0.7789
208 Movie ending credit                                                0.2782
209 Woman monologue Zhen                                               0.9756
Performance of Round3 (1 in 7)
Task 2 (VT1)
Target                                                              Estimated MAP (R=20)
101 Crowd (>10 people)                                              0.64
102 Building with sky as backdrop, clearly visible                  1
107 Person using computer, both visible                             0.7
112 Close-up of hand, e.g. using mouse, writing, etc.               0.527
116 Face close-up occupying about 3/4 of screen, frontal or side    1
Task 3 (AT1 + VT2), video, R=20
Retrieval target                        VT2 only    AT1 + VT2
202 Face with introductory caption      1           0.03
209 Woman monologue                     0.35        0.1
201 People entering door                N/A
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid Working with ViVid
Why parallel Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days
Image Video Processing
Video Decoder
2D3D Convolution
2D3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients Optical Flow
Analysis Vector Quantization
SVM Classifier Evaluation
ViVid ndash Video Computer Vision on Graphics Processors
Download
httpgithubcommertdikmenViVid
TRECVid 2008 System
TRECVid 2009 System TRECVid 2009 System
Features Video
Motion
Shape
Classifier
Event Label
bull Running
bull Pointing
bull Object Put
bull Cell To Ear
Vector
Quantization
Histogram
Interest Points
Video Interest Point Detectors Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
Dollar
RSMB
More is Good More is Good
Interest Point Detection Rates
Video Features Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)
Motion History Images
(Bobbick amp Davis 2001)
Motion History Images
(Bobbick amp Davis 2001)
otherwise
1t)yD(xif
1)1)ty(xHmax(0
τt)y(xH
τ
τ
133
438
51
251
0 50 100 150 200 250 300
CUDA (feature + distance + argmin)
CUDA (distance + argmin)
CUDA (distance)
C
milliseconds per frame
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
K-Means Clustering K-Means Clustering
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
K-Means K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Clustering Helps Clustering Helps
Pairwise Distance Implementation Pairwise Distance Implementation
0 2000 4000 6000
CUDA
C
)bd(a)bd(a
)bd(a
)bd(a)bd(a)bd(a
nm1m
12
n11111
n1
m1
bbB
aaA
Compute Given
CPU vs GPU CPU vs GPU
Algorithmic properties that map well to GPUs
1 Independent and highly data local
computations
2Compute bound
3Little branch divergence
Pairwise Distance Computation on the GPU Pairwise Distance Computation on the GPU
Shared
Memory
A B
C
Pairwise Distance Computation Pairwise Distance Computation
A B
C
Pairwise Distance Computation Pairwise Distance Computation
A B
C
Timings on TRECVid 2008 System Timings on TRECVid 2008 System
53 79
23
240 150
53 79
3030 4947
1
10
100
1000
10000
Fetch
Frame
Optical
Flow
Transfer to
GPU
Feature
Extraction
Pairwise
Distance
millise
co
nd
s
GPU + CPU CPU
Benchmark Benchmark
Dictionary Building Strategies Dictionary Building Strategies
Dictionary Size
Histogramming Method Rate of Detection
Low Medium High
1000 Raw 0681 0804 0844
Norm 0708 0799 0840
Mt Inf 0594 0804 0848
500 Raw 0675 0792 0833
Norm 0701 0791 0825
Mt Inf 0626 0783 0819
200 Raw 0671 0772 0811
Norm 0701 0779 0818
Mt Inf 0614 0720 0756
Results (2009) Results (2009)
True Positives False
Alarm
Miss Min DCR
Pointing 13 225 1050 1006
Cell To Ear 0 58 194 1060
Person Runs 1 38 106 0997
Object Put 1 190 620 1020
Results (2009) Results (2009)
True Positives False
Alarm
Miss Min DCR
Pointing 13 (57) 225 (2505) 1050 1006
Cell To Ear 0 (8) 58 (4005) 194 1060
Person Runs 1 (0) 38 (314) 106 0997
Object Put 1 (21) 190 (2703) 620 1020
(2008 Results)
Conclusions Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
ContestsEvaluations Experience
Working with realistic data
Engineering Programming
Tight schedule streamlined development
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
ContestsEvaluations Experience
Working with realistic data
Engineering Programming
Tight schedule streamlined development
Examples of Evaluations Examples of Evaluations
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes Pascal Visual Object Classes
Classificationdetection
Segmentation
Person Layout
Action Classification
Classificationdetection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
ImageNet
Large Scale Visual Recognition
10000 Classes
Letrsquos go over our experience and
storieshellip
Letrsquos go over our experience and
storieshellip
Outlines Outlines
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks 3 Audio Retrieval Tasks Task Query Target Metric Data Set
AT1 IPA sequence segments that contain the
query IPA sequence
regardless of its languages
Mean Average Precision
25 hours
monolingual
database in
round1
13 hours
multilingual
database in
round3
AT2 an utterance spoken
by different speakers all segments that contain the
query wordphrasesentence
regardless of its spoken
languages
AT3 No queries extract all recurrent segments
which are at least 1 second in
length
F-measure
Xiaodan will talk about this parthelliphellip
3 Video Retrieval Tasks 3 Video Retrieval Tasks
Task Query Target Criteria Metric Data Set
VT1 Single
Image
20
queries
(short)
Video
Segs
All the
similar
Segs
ldquovisually
similarrdquo
Mean Average Precision
20 categories
multiple labels
possible
VT2 Short
Video
Shot
(lt10s)
20
queries
(long)
Video
Segs
All the
similar
Segs
Perceptually
Similar
10 categories
multiple labels
possible
VT3 Videos
with
sound
(3~10s)
Order
of 10K
Category
number
learning the
common
visual
characteristics
Classification accuracy 10(20)
categories
including one
ldquoothersrdquo
category
20 VT1 Categories 20 VT1 Categories
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object (person, car, etc.)
206 Static or minute camera movement, people walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either as an IPA sequence or as a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given, and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task combine an IPA sequence/waveform with a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given, and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total: 32508 pseudo key frames + 8486 real key frames
Evaluation Video Data of Round3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352×288
Computation Powers
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours (on t-Illiac)
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Port MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
Features for Round2-VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly
e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
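The query-expansion steps above can be sketched as follows. This is a minimal illustration, not the team's implementation: the Euclidean distance, the min-over-expanded-set scoring rule, the function names, and the toy feature vectors are all assumptions, and the distances are computed on the fly instead of being precomputed as in step 0.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def expand_query(query_category, dev_set):
    """Step 1: collect features of all development images in the query's category."""
    return [feat for feat, cat in dev_set if cat == query_category]

def search(query_feat, query_category, dev_set, eval_set, top_k):
    """Step 2: rank evaluation items by distance to the nearest expanded-query member."""
    expanded = [query_feat] + expand_query(query_category, dev_set)
    scored = []
    for item_id, feat in eval_set:
        score = min(euclidean(feat, q) for q in expanded)  # hypothetical scoring rule
        scored.append((score, item_id))
    scored.sort()
    return [item_id for _, item_id in scored[:top_k]]

# Toy data (made up): (feature, category) pairs for development, (id, feature) for evaluation.
dev_set = [([0.0, 0.0], 101), ([0.2, 0.1], 101), ([5.0, 5.0], 102)]
eval_set = [("e1", [0.1, 0.1]), ("e2", [4.9, 5.2]), ("e3", [0.3, 0.0]), ("e4", [9.0, 9.0])]

results = search([1.0, 1.0], 101, dev_set, eval_set, top_k=2)
print(results)
```

Even though the query feature itself is far from every evaluation item, the labeled development images of category 101 pull the two nearby evaluation items to the top of the ranking, which is the point of expanding the query.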
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP (maximum a posteriori) estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
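Step 2 above can be illustrated with a mean-only MAP adaptation of a diagonal-covariance UBM. This is a sketch under stated assumptions: the slides do not say which parameters were adapted, so the relevance-factor update used here (a common choice in GMM-UBM systems) is an assumption, and the UBM parameters, the `relevance` value, and the patch data are invented. UBM training (step 1) and the kernel distance (step 3) are not shown.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def posteriors(x, weights, means, variances):
    """Posterior probability of each UBM component given one patch."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Shift each UBM mean toward one image's patches (mean-only MAP adaptation)."""
    k, d = len(weights), len(means[0])
    counts = [0.0] * k                       # soft count of patches per component
    sums = [[0.0] * d for _ in range(k)]     # posterior-weighted sum of patches
    for x in patches:
        post = posteriors(x, weights, means, variances)
        for c in range(k):
            counts[c] += post[c]
            for j in range(d):
                sums[c][j] += post[c] * x[j]
    # Equivalent to alpha * E[x] + (1 - alpha) * prior_mean, alpha = n / (n + relevance).
    return [[(sums[c][j] + relevance * means[c][j]) / (counts[c] + relevance)
             for j in range(d)] for c in range(k)]

# Hypothetical two-component UBM and 16 identical patches near component 0.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 10.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
patches = [[1.0, 1.0]] * 16

adapted = map_adapt_means(patches, weights, means, variances)
print(adapted)
```

Components that see many patches move toward the image's data, while components that see almost none stay at the UBM mean, so the adapted means act as a fixed-length description of one image.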
VT1 Performance (2 in 8)
Category: MAP
• 101 Crowd (>10 people): 0.8419
• 102 Building with sky as backdrop, clearly visible: 0.977
• 103 Mobile devices, including handphone/PDA: 0.028
• 107 Person using computer, both visible: 0.2281
• 109 Company trademark, including billboard, logo: 0.96
• 112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
• 113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
• 115 Food on dishes, plates: 0.2285
• 116 Face close-up, occupying about 3/4 of screen, frontal or side: 0.9783
• 117 Traffic scene, many cars, trucks, road visible: 0.2901
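The MAP column here is mean average precision. The sketch below shows one common way to compute it from ranked result lists. The exact definition used by the challenge is not given in the slides, so the normalization (dividing by the number of relevant items found within the cutoff) and the toy relevance lists are assumptions.

```python
def average_precision(ranked_relevance, cutoff=None):
    """Average of precision values taken at each rank where a relevant item appears."""
    if cutoff is not None:
        ranked_relevance = ranked_relevance[:cutoff]
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(runs, cutoff=None):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r, cutoff) for r in runs) / len(runs)

# Two made-up queries: 1 marks a relevant result at that rank, 0 an irrelevant one.
runs = [[1, 0, 1, 1, 0], [0, 1]]
ap_first = average_precision(runs[0])
mean_ap = mean_average_precision(runs)
print(ap_first, mean_ap)
```

A category with MAP near 1 means relevant segments were ranked almost entirely ahead of irrelevant ones; a value near 0 means relevant segments appeared late in the list or not at all.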
VT2 Performance (1 in 8)
Category: MAP
• 202 Talking face with introductory caption: 0.8432
• 206 Static or minute camera movement, people walking, legs visible: 0.0581
• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789
• 208 Movie ending credit: 0.2782
• 209 Woman monologue Zhen: 0.9756
Performance of Round3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Retrieval target (video, R=20): VT2 only; AT1 + VT2
202 Face with introductory caption: 1; 0.03
209 Woman monologue: 0.35; 0.1
201 People entering door: N/A
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
Surveillance Event Detection
List of Events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
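The "~5 days" estimate can be checked with a few lines of arithmetic. The upload rate and photo count are the figures quoted on the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20     # from the slide
FACEBOOK_PHOTOS = 15e9             # from the slide
FRAMES_PER_SECOND = 30             # assumption: typical video frame rate

# Seconds of video uploaded per day: hours/minute -> seconds/minute -> per day.
video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
frames_per_day = video_seconds_per_day * FRAMES_PER_SECOND
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:.3g} frames/day, {days_to_match:.1f} days")
```

At 25 frames per second the same calculation gives a little under 6 days, so the estimate is only mildly sensitive to the assumed frame rate.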
Image / Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Pipeline diagram. Blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space-Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
[Figure panels: Laptev, Dollar, RSMB]
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al., 2003)
Motion History Images (Bobick & Davis, 2001)
$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
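The motion history update can be sketched in a few lines: pixels flagged as moving are set to the duration value, and all other pixels decay by one per frame down to zero. The motion indicator D is taken here to be thresholded frame differencing, which is an assumption (the slide does not define D), and the frames, threshold, and duration are made up.

```python
def motion_mask(prev_frame, frame, threshold):
    """D(x, y, t): True where the intensity changed by more than the threshold (assumed rule)."""
    return [[abs(a - b) > threshold for a, b in zip(row_prev, row_cur)]
            for row_prev, row_cur in zip(prev_frame, frame)]

def update_mhi(history, mask, tau):
    """One step of the motion history update for every pixel."""
    return [[tau if moving else max(0, h - 1)
             for h, moving in zip(h_row, m_row)]
            for h_row, m_row in zip(history, mask)]

# A bright pixel moving one step to the right per frame in a 1x4 image.
frames = [[[9, 0, 0, 0]], [[0, 9, 0, 0]], [[0, 0, 9, 0]], [[0, 0, 0, 9]]]
tau = 3
mhi = [[0, 0, 0, 0]]
for prev_frame, frame in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, motion_mask(prev_frame, frame, threshold=1), tau)

print(mhi)
```

The resulting image encodes both where motion occurred and how recently: the most recent motion has the largest value and older motion fades, so the intensity gradient points along the direction of movement.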
[Bar chart, milliseconds per frame (axis 0 to 300). Bars: C; CUDA (distance); CUDA (distance + argmin); CUDA (feature + distance + argmin). Extracted bar values: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
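The three steps above can be sketched for the image-gradient case as follows. The cell size, the number of orientation bins, unsigned orientations, central-difference gradients, and 2x2-cell L2 block normalization are all assumptions chosen for brevity; the slides do not give the parameters that were used. For the optical-flow variant, the per-pixel flow vector would replace the gradient (gx, gy).

```python
import math

def cell_histograms(image, cell=4, bins=9):
    """Steps 1 and 2: per-cell orientation histograms weighted by gradient magnitude."""
    height, width = len(image), len(image[0])
    rows, cols = height // cell, width // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned orientation
            b = min(int(angle / (180.0 / bins)), bins - 1)
            r, c = y // cell, x // cell
            if r < rows and c < cols:
                hists[r][c][b] += magnitude
    return hists

def normalize_blocks(hists, eps=1e-6):
    """Step 3: L2-normalize each 2x2 group of neighboring cells and concatenate."""
    rows, cols = len(hists), len(hists[0])
    descriptor = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            vec = hists[r][c] + hists[r][c + 1] + hists[r + 1][c] + hists[r + 1][c + 1]
            norm = math.sqrt(sum(v * v for v in vec) + eps)
            descriptor.extend(v / norm for v in vec)
    return descriptor

# 8x8 test image with a vertical edge: dark left half, bright right half.
image = [[0 if x < 4 else 10 for x in range(8)] for _ in range(8)]
descriptor = normalize_blocks(cell_histograms(image))
print(len(descriptor), [round(v, 3) for v in descriptor[:9]])
```

For this test image every gradient is horizontal, so all of the histogram mass falls into the first orientation bin of each cell, and normalization makes the descriptor independent of the edge's contrast.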
K-Means Clustering
Vector quantization
Turns high dimensional features into discrete number of points
Given data, find representative “centers”
Lloyd's algorithm
For each data point, find the closest center
Update the center to be the mean of the associated data points
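The two alternating steps of Lloyd's algorithm can be sketched as below. The function names, the stopping rule (stop when the centers no longer change), the handling of empty clusters, and the toy points are choices made for this illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def assign_points(points, centers):
    """Assignment step: index of the closest center for each data point."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]

def lloyd(points, centers, max_iterations=100):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    centers = [list(c) for c in centers]
    for _ in range(max_iterations):
        labels = assign_points(points, centers)
        updated = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                # Update step: the center becomes the mean of its assigned points.
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)   # keep an empty cluster's center unchanged
        if updated == centers:
            break
        centers = updated
    return centers, assign_points(points, centers)

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
centers, labels = lloyd(points, [[0, 0], [10, 0]])
print(centers, labels)
```

Once the centers are fixed, `assign_points` is the vector quantizer: each high-dimensional feature is replaced by the index of its nearest center.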
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
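The memory problem and the "reduce as computation proceeds" idea can be sketched on the CPU. With 1 million features and 1000 centers, the full distance matrix has 10^9 entries, about 4 GB in single precision (the precision is an assumption). The sketch below processes the centers in blocks and keeps only a running minimum and argmin per feature, so the full matrix is never stored. The block size and data are made up, and this illustrates the idea only; it is not the CUDA implementation.

```python
def nearest_centers_chunked(points, centers, chunk=2):
    """Running argmin over blocks of centers; only O(len(points)) state is kept."""
    best_dist = [float("inf")] * len(points)
    best_index = [-1] * len(points)
    for start in range(0, len(centers), chunk):
        block = centers[start:start + chunk]
        for i, p in enumerate(points):
            for j, c in enumerate(block):
                d = sum((a - b) ** 2 for a, b in zip(p, c))
                if d < best_dist[i]:          # reduction step: keep the minimum so far
                    best_dist[i] = d
                    best_index[i] = start + j
    return best_index, best_dist

# Size of the full output for the figures quoted above, assuming 4-byte floats.
full_matrix_bytes = 1_000_000 * 1000 * 4

points = [[0, 0], [5, 5], [9, 9]]
centers = [[0, 1], [4, 4], [10, 10]]
indices, distances = nearest_centers_chunked(points, centers, chunk=2)
print(full_matrix_bytes, indices, distances)
```

Because minimum is associative, the blocks can be processed in any order or in parallel and then merged, which is what makes an efficient reduction operator the key building block.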
Clustering Helps
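Pairwise Distance Implementation
[Bar chart: CUDA vs. C (axis 0 to 6000)]
Given
$$
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
$$
Compute
$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & \cdots & \cdots & d(a_m, b_n)
\end{bmatrix}
$$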
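A plain reference version of the pairwise distance matrix is sketched below for squared Euclidean distance. It uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, which moves most of the work into dot products between rows of A and rows of B. Both the choice of squared Euclidean distance for d and the use of this expansion are assumptions; the slides do not say how d was defined or computed.

```python
def pairwise_sq_dist(A, B):
    """Return the m x n matrix D with D[i][j] = ||A[i] - B[j]||^2."""
    norms_a = [sum(x * x for x in a) for a in A]
    norms_b = [sum(x * x for x in b) for b in B]
    return [[norms_a[i] + norms_b[j] - 2 * sum(x * y for x, y in zip(a, b))
             for j, b in enumerate(B)]
            for i, a in enumerate(A)]

A = [[0, 0], [3, 4]]
B = [[0, 0], [3, 0], [0, 4]]
D = pairwise_sq_dist(A, B)
print(D)
```

Every entry of D depends only on one row of A and one row of B, so the entries are independent of each other, which is the property that makes this computation a natural fit for a GPU.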
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram labels: Shared Memory, A, B, C]
Timings on TRECVid 2008 System
[Bar chart, milliseconds (log axis, 1 to 10000). Stages: Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Series: GPU + CPU, CPU. Extracted bar values: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
                                        Rate of Detection
Dictionary Size  Histogramming Method   Low     Medium  High
1000             Raw                    0.681   0.804   0.844
1000             Norm                   0.708   0.799   0.840
1000             Mt Inf                 0.594   0.804   0.848
500              Raw                    0.675   0.792   0.833
500              Norm                   0.701   0.791   0.825
500              Mt Inf                 0.626   0.783   0.819
200              Raw                    0.671   0.772   0.811
200              Norm                   0.701   0.779   0.818
200              Mt Inf                 0.614   0.720   0.756
Results (2009)
(2008 results in parentheses)
Event         True Positives  False Alarm  Miss  Min DCR
Pointing      13 (57)         225 (2505)   1050  1.006
Cell To Ear   0 (8)           58 (4005)    194   1.060
Person Runs   1 (0)           38 (314)     106   0.997
Object Put    1 (21)          190 (2703)   620   1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering, programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
Outlines
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
3 Audio Retrieval Tasks
Task AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of its languages
Metric: Mean Average Precision
Data Set: 25-hour monolingual database in round 1; 13-hour multilingual database in round 3
Task AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of its spoken languages
Task AT3
Query: no queries
Target: extract all recurrent segments which are at least 1 second in length
Metric: F-measure
Xiaodan will talk about this part...
3 Video Retrieval Tasks
Task VT1
Query: single image; 20 queries
Target: (short) video segs; all the similar segs
Criteria: “visually similar”
Metric: Mean Average Precision
Data Set: 20 categories, multiple labels possible
Task VT2
Query: short video shot (<10s); 20 queries
Target: (long) video segs; all the similar segs
Criteria: perceptually similar
Data Set: 10 categories, multiple labels possible
Task VT3
Query: videos with sound (3~10s); order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: Classification accuracy
Data Set: 10(20) categories, including one “others” category
20 VT1 Categories 20 VT1 Categories
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
10 Categories for VT2 10 Categories for VT2
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3 5 Categories for VT3
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
Examples of Images Examples of Images
More samples More samples
Evaluation Video Data of Round2 Evaluation Video Data of Round2
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
Evaluation Video Data of Round3 Evaluation Video Data of Round3
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Computation Powers Computation Powers
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Time Cost for Video Tasks Time Cost for Video Tasks
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Possible Accelerations for Video Possible Accelerations for Video
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Features for Round2- VT1 Features for Round2- VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Features for Round2-VT2 Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE Grand Unified Feature
Extractor
GUFE Grand Unified Feature
Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Observations Observations 1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1in7) Performance of Round3 (1in7)
Task 2 (VT1)
Target Estimated MAP (R=20)
101 Crowd (gt10 people) 064
102 Building with sky as backdrop clearly visible 1
107 Person using Computer both visible 07
112 Closeup of hand eg using mouse writing etc 0527
116 Face closeup occupying about 34 of screen frontal or side 1
Task3 (AT1 + VT2)
Retrieval Target VT2 only AT1 + VT2
Video R=20
202 face with introductory caption 1 003
209 women monolog 035 01
201 People entering door NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube replicates the entire Facebook image DB every ~5 days
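The "~5 days" figure can be checked with a few lines of arithmetic. The sketch below uses the upload rate and photo count from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption, not stated on the slide


def days_to_match_photo_count(hours_per_minute=HOURS_UPLOADED_PER_MINUTE,
                              photos=FACEBOOK_PHOTOS,
                              fps=FRAMES_PER_SECOND):
    """Days of uploads needed for the frame count to equal the photo count."""
    # Frames contained in the video uploaded during one wall-clock minute.
    frames_per_upload_minute = hours_per_minute * 3600 * fps
    # Frames uploaded during one full day (60 * 24 wall-clock minutes).
    frames_per_day = frames_per_upload_minute * 60 * 24
    return photos / frames_per_day
```

Under the 30 fps assumption the result is a little under five days, which agrees with the slide's rounded figure.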
Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Figure: system block diagram with the stages Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
[Figure panels: Laptev, Dollar, RSMB]
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images
(Bobick & Davis 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\left(0,\ H_\tau(x, y, t-1) - 1\right) & \text{otherwise} \end{cases}$$
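The motion history recursion sets a pixel to τ wherever motion is detected at time t and otherwise decays its previous value by 1, never going below 0. A minimal sketch of one update step follows, using plain Python lists to stay self-contained. The function name `update_mhi` is hypothetical, and treating D as a 0/1 motion mask (for example a thresholded frame difference) is an assumption about how D is produced.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step of the motion history image recursion.

    prev_mhi:    2D list holding H_tau(x, y, t-1)
    motion_mask: 2D list of 0/1 values holding D(x, y, t)
    tau:         value assigned to pixels where motion is detected
    """
    return [
        # Moving pixels are reset to tau; others decay by 1, floored at 0.
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]
```

Calling this once per frame keeps recent motion bright and lets older motion fade, so the resulting image encodes both where and how recently motion occurred.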
[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), CUDA (feature + distance + argmin); bar labels as printed: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
Partition the image window into local regions
Histogram the image gradient / optical flow based on the direction and magnitude
Normalize over neighboring regions
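The three steps (partition into local regions, histogram by direction and magnitude, normalize over neighboring regions) can be sketched for the image-gradient case as follows. This is an illustration under assumptions, not the ViVid implementation: the cell size, the number of bins, the use of unsigned orientation, central-difference gradients, and L2 normalization over 2x2 blocks of cells are all choices made here for the example.

```python
import math


def cell_histograms(image, cell=4, bins=8):
    """Magnitude-weighted orientation histogram for each cell x cell region."""
    rows, cols = len(image), len(image[0])
    n_cy, n_cx = rows // cell, cols // cell
    hists = [[[0.0] * bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(n_cy * cell):
        for x in range(n_cx * cell):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, cols - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, rows - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation folded into [0, pi).
            angle = math.atan2(gy, gx) % math.pi
            b = min(int(angle / math.pi * bins), bins - 1)
            hists[y // cell][x // cell][b] += magnitude
    return hists


def normalize_blocks(hists, eps=1e-6):
    """Concatenate each 2x2 block of neighboring cells and L2-normalize it."""
    n_cy, n_cx = len(hists), len(hists[0])
    descriptor = []
    for cy in range(n_cy - 1):
        for cx in range(n_cx - 1):
            block = (hists[cy][cx] + hists[cy][cx + 1]
                     + hists[cy + 1][cx] + hists[cy + 1][cx + 1])
            norm = math.sqrt(sum(v * v for v in block) + eps)
            descriptor.extend(v / norm for v in block)
    return descriptor
```

For the optical flow variant, the same histogramming would be applied to per-pixel flow vectors in place of the (gx, gy) gradient pair.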
K-Means Clustering
Vector quantization
Turns high dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of the associated data points
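The two alternating steps of Lloyd's algorithm can be sketched in a few lines. This is a plain-Python illustration rather than the GPU code discussed in the talk; the random initialization from the data, the fixed iteration count, and the handling of empty clusters are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Return (centers, assignment) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    # Assumption: initialize the centers with k distinct data points.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment
```

The assignment step is where nearly all the time goes, since it evaluates the distance from every point to every center.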
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
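To see why the output does not fit: 1 million features against 1000 centers gives 10^9 distances, which is about 4 GB if each is stored as a 4-byte float (the 4-byte size is an assumption). Reducing as the computation proceeds means computing one block of distances at a time and keeping only the index of the nearest center for each feature. A sketch of that idea in plain Python, with a hypothetical `chunk_size` parameter standing in for whatever block size fits in memory:

```python
def full_matrix_bytes(n_features, n_centers, bytes_per_value=4):
    """Memory needed to hold every pairwise distance at once."""
    return n_features * n_centers * bytes_per_value


def nearest_center_chunked(features, centers, chunk_size=1024):
    """Index of the closest center per feature, one chunk at a time."""
    nearest = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance block for this chunk only: len(chunk) x len(centers).
        block = [
            [sum((f - c) ** 2 for f, c in zip(feat, cen)) for cen in centers]
            for feat in chunk
        ]
        # Reduction: keep one argmin per row, then drop the block.
        nearest.extend(
            min(range(len(centers)), key=row.__getitem__) for row in block
        )
    return nearest
```

Only `chunk_size * len(centers)` distances exist at any moment, and the stored result is one integer per feature.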
Clustering Helps
Pairwise Distance Implementation
[Figure: bar chart comparing the CUDA and C implementations; axis 0 to 6000]
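Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute
$$\begin{bmatrix} d(a_1, b_1) & \cdots & d(a_1, b_n) \\ \vdots & \ddots & \vdots \\ d(a_m, b_1) & \cdots & d(a_m, b_n) \end{bmatrix}$$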
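As a reference point for the pairwise distance problem (two sets of vectors A and B, and the m x n matrix of distances between every a_i and b_j), a direct CPU version is only a couple of lines. The sketch assumes Euclidean distance for d, which the slide leaves unspecified, and takes d as a parameter so another distance can be substituted.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(A, B, d=euclidean):
    """Return the m x n matrix whose (i, j) entry is d(A[i], B[j])."""
    return [[d(a, b) for b in B] for a in A]
```

Every entry depends only on one row of A and one row of B, so the entries can be computed independently of one another.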
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Figure: sequence of diagrams with the labels Shared Memory, A, B, C]
Timings on TRECVid 2008 System
[Figure: bar chart on a log scale (1 to 10000 milliseconds) comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance; bar labels as printed: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size | Histogramming Method | Rate of Detection: Low | Medium | High
1000 | Raw | 0.681 | 0.804 | 0.844
1000 | Norm | 0.708 | 0.799 | 0.840
1000 | Mt Inf | 0.594 | 0.804 | 0.848
500 | Raw | 0.675 | 0.792 | 0.833
500 | Norm | 0.701 | 0.791 | 0.825
500 | Mt Inf | 0.626 | 0.783 | 0.819
200 | Raw | 0.671 | 0.772 | 0.811
200 | Norm | 0.701 | 0.779 | 0.818
200 | Mt Inf | 0.614 | 0.720 | 0.756
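For context on the histogramming methods, here is a minimal sketch of turning vector-quantized features into a histogram over dictionary entries. It reads "Raw" as plain counts and "Norm" as counts divided by their sum; both readings are assumptions, since the slide does not define the labels, and the "Mt Inf" variant is not covered. The function name is hypothetical.

```python
def codeword_histogram(codeword_ids, dictionary_size, normalize=False):
    """Histogram of dictionary entries hit by a clip's quantized features."""
    hist = [0.0] * dictionary_size
    for idx in codeword_ids:
        hist[idx] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            # Divide by the total so clips with different numbers of
            # interest points become comparable.
            hist = [v / total for v in hist]
    return hist
```

The resulting fixed-length vector is what a classifier would consume, regardless of how many interest points the clip produced.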
Results (2009)
2008 results in parentheses
Event | True Positives | False Alarm | Miss | Min DCR
Pointing | 13 (57) | 225 (2505) | 1050 | 1.006
Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060
Person Runs | 1 (0) | 38 (314) | 106 | 0.997
Object Put | 1 (21) | 190 (2703) | 620 | 1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering / Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
3 Audio Retrieval Tasks
Task | Query | Target | Metric
AT1 | IPA sequence | Segments that contain the query IPA sequence, regardless of language | Mean Average Precision
AT2 | An utterance spoken by different speakers | All segments that contain the query word/phrase/sentence, regardless of spoken language | Mean Average Precision
AT3 | No queries | Extract all recurrent segments that are at least 1 second in length | F-measure
Data set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3
Xiaodan will talk about this part...
3 Video Retrieval Tasks
VT1
Query: single image (20 queries)
Target: (short) video segs, all the similar segs
Criteria: "visually similar"
Metric: Mean Average Precision
Data set: 20 categories, multiple labels possible
VT2
Query: short video shot (<10 s) (20 queries)
Target: (long) video segs, all the similar segs
Criteria: perceptually similar
Data set: 10 categories, multiple labels possible
VT3
Query: videos with sound (3~10 s), order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data set: 10 (20) categories including one "others" category
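Since the retrieval tasks are scored with Mean Average Precision, a small sketch of the metric may help. It computes average precision over the top R ranked results for one query and averages over queries. The normalization used here (dividing by the number of relevant items found within the top R) is an assumption; the challenge's exact convention is not given in these slides, and the function names are hypothetical.

```python
def average_precision(ranked_relevance, cutoff=20):
    """Average precision over the top `cutoff` results of one query.

    ranked_relevance: list of booleans in rank order, True if relevant.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance[:cutoff], start=1):
        if relevant:
            hits += 1
            # Precision measured at the rank of each relevant result.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, cutoff=20):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q, cutoff) for q in queries) / len(queries)
```

A query whose relevant results all sit at the top of the list scores 1, which is how a category can reach a MAP of exactly 1 at R=20.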
20 VT1 Categories
100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames, 8486 real key frames
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288
Computation Power
Work Stations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128 node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
MATLAB code to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
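The "one load" idea is to pay the frame loading cost once and run every feature extractor on the loaded frame, instead of having each extractor reload the data. A minimal sketch follows; `load_frame` and the entries of `extractors` are hypothetical callables supplied by the caller, not functions from any library named in the talk.

```python
def extract_all_features(frame_paths, load_frame, extractors):
    """Load each frame once and run every extractor on it.

    frame_paths: iterable of frame identifiers
    load_frame:  callable mapping a path to a decoded frame (the slow step)
    extractors:  dict mapping a feature name to a callable on a frame
    """
    features = {name: [] for name in extractors}
    for path in frame_paths:
        frame = load_frame(path)  # done once per frame, shared by all extractors
        for name, extract in extractors.items():
            features[name].append(extract(frame))
    return features
```

With N extractors this cuts the number of loads per frame from N to 1, which matters most when decoding dominates the per-feature cost.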
Features for Round 2 - VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round 2 - VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
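The query expansion procedure can be sketched as follows. The sketch makes several assumptions that the slide does not settle: images are represented by feature vectors, each development image carries a single category label, and an evaluation item is scored by its squared Euclidean distance to the nearest member of the expanded query. The function and parameter names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve_with_expansion(query, category, dev_features, dev_labels,
                            eval_features, top_k=20):
    """Return indices of the top_k evaluation items for an expanded query."""
    # Step 1: expand the query with all development items of the same category.
    expanded = [query] + [
        f for f, label in zip(dev_features, dev_labels) if label == category
    ]
    # Step 2: score each evaluation item by its nearest expanded-query member.
    scores = [
        min(squared_distance(e, q) for q in expanded) for e in eval_features
    ]
    ranked = sorted(range(len(eval_features)), key=scores.__getitem__)
    return ranked[:top_k]
```

Expansion lets an evaluation item rank highly when it resembles any labeled example of the category, even if it looks unlike the single query image.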
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
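The slide does not spell out the MAP estimation step. A common form, given here as an assumption rather than as the team's exact method, adapts only the UBM means using a relevance factor. The notation is introduced for this example: $x_1, \dots, x_T$ are the patch descriptors of one image, $w_k, \mu_k, \Sigma_k$ are the UBM weights, means and covariances, and $r$ is the relevance factor.

```latex
\gamma_t(k) = \frac{w_k \, \mathcal{N}(x_t; \mu_k, \Sigma_k)}
                   {\sum_j w_j \, \mathcal{N}(x_t; \mu_j, \Sigma_j)},
\qquad
n_k = \sum_{t=1}^{T} \gamma_t(k),
\qquad
\hat{\mu}_k = \frac{\sum_{t=1}^{T} \gamma_t(k) \, x_t + r \, \mu_k}{n_k + r}
```

Components that see few patches ($n_k \ll r$) stay close to the UBM mean, while well-populated components move toward the mean of the image's own patches.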
VT1 Performance (2 in 8)
Category | MAP
101 Crowd (>10 people) | 0.8419
102 Building with sky as backdrop, clearly visible | 0.977
103 Mobile devices including handphone/PDA | 0.028
107 Person using computer, both visible | 0.2281
109 Company trademark, including billboard, logo | 0.96
112 Close-up of hand, e.g. using mouse, writing, etc. | 0.4584
113 Business meeting (>2 people), mostly seated down, table visible | 0.0644
115 Food on dishes, plates | 0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side | 0.9783
117 Traffic scene, many cars, trucks, road visible | 0.2901
VT2 Performance (1 in 8)
Category | MAP
202 Talking face with introductory caption | 0.8432
206 Static or minute camera movement, people(s) walking, legs visible | 0.0581
207 Large camera movement, panning left/right, top/down of a scene | 0.7789
208 Movie ending credit | 0.2782
209 Woman monologue Zhen | 0.9756
3 Video Retrieval Tasks 3 Video Retrieval Tasks
Task Query Target Criteria Metric Data Set
VT1 Single
Image
20
queries
(short)
Video
Segs
All the
similar
Segs
ldquovisually
similarrdquo
Mean Average Precision
20 categories
multiple labels
possible
VT2 Short
Video
Shot
(lt10s)
20
queries
(long)
Video
Segs
All the
similar
Segs
Perceptually
Similar
10 categories
multiple labels
possible
VT3 Videos
with
sound
(3~10s)
Order
of 10K
Category
number
learning the
common
visual
characteristics
Classification accuracy 10(20)
categories
including one
ldquoothersrdquo
category
20 VT1 Categories 20 VT1 Categories
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
10 Categories for VT2 10 Categories for VT2
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3 5 Categories for VT3
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either as an IPA sequence or as a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task combine an IPA sequence/waveform with a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round 2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total: 32508 pseudo key frames + 8486 real key frames
Evaluation Video Data of Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video resolution: 352x288
Computation Powers
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Port MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after a single load
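The last point, running every extractor after a single load, can be sketched as follows. This is only an illustration under stated assumptions: `load_frame`, `mean_intensity`, and `row_energy` are hypothetical stand-ins, not the team's actual loaders or features.

```python
from typing import Callable, Dict, Iterable, List

Frame = List[List[float]]  # a grayscale image as rows of pixel values


def mean_intensity(frame: Frame) -> List[float]:
    """Toy global feature (hypothetical): average pixel value."""
    pixels = [p for row in frame for p in row]
    return [sum(pixels) / len(pixels)]


def row_energy(frame: Frame) -> List[float]:
    """Toy feature (hypothetical): sum of squared pixel values per row."""
    return [sum(p * p for p in row) for row in frame]


def extract_all(paths: Iterable[str],
                load_frame: Callable[[str], Frame],
                extractors: Dict[str, Callable[[Frame], List[float]]]
                ) -> Dict[str, Dict[str, List[float]]]:
    """Load each frame exactly once and run every extractor on it."""
    features: Dict[str, Dict[str, List[float]]] = {}
    for path in paths:
        frame = load_frame(path)  # the expensive step, paid once per frame
        features[path] = {name: fn(frame) for name, fn in extractors.items()}
    return features
```

The alternative, one pass over the data per feature, pays the load cost once for every extractor, which is the cost the slide identifies as the major issue.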
Features for Round 2 - VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round 2 - VT2
Character detector: Harris corner, morphological operations
Optical flow: Lucas-Kanade on spatial intensity gradient
Gender recognition: SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on NN
Feature normalization/combination
Result visualization
Observations
1. Samples under the same category are semantically more similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly
e.g. VT1 101 103 (26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
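The query-expansion steps can be sketched as below. This is a minimal illustration, not the team's implementation: the Euclidean distance, the nearest-member scoring rule, and the names `expand_query` and `search` are all assumptions, and distances are computed on the fly rather than in a preprocessing pass.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def expand_query(query: Vector, category: int,
                 dev_set: List[Tuple[Vector, int]]) -> List[Vector]:
    """Step 1: add every development image labeled with the query's category."""
    return [query] + [feat for feat, label in dev_set if label == category]


def search(expanded: List[Vector], eval_set: Dict[str, Vector],
           top_k: int) -> List[str]:
    """Step 2: rank evaluation items by distance to the nearest expanded-query member."""
    scored = [(min(euclidean(feat, q) for q in expanded), name)
              for name, feat in eval_set.items()]
    scored.sort()
    return [name for _, name in scored[:top_k]]
```

With this scoring rule, an evaluation frame only needs to resemble one labeled development example to rank well, which is one way to exploit the observation that samples in a category are semantically similar.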
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
VT1 Performance (2 in 8)
Category: MAP
101 Crowd (>10 people): 0.8419
102 Building with sky as backdrop, clearly visible: 0.977
103 Mobile devices, including handphone/PDA: 0.028
107 Person using computer, both visible: 0.2281
109 Company trademark, including billboard, logo: 0.96
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
115 Food on dishes, plates: 0.2285
116 Face close-up occupying about 3/4 of screen, frontal or side: 0.9783
117 Traffic scene, many cars, trucks, road visible: 0.2901
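For readers unfamiliar with the MAP column, the sketch below shows one common way to compute average precision over a ranked result list and to average it over queries. The exact definition used by the challenge is not given on the slides, so the normalization here (by the number of relevant items retrieved) is an assumption.

```python
from typing import List, Sequence


def average_precision(relevant_flags: Sequence[bool]) -> float:
    """Mean of the precision values at each rank that holds a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: List[Sequence[bool]]) -> float:
    """Average the per-query AP values."""
    return sum(average_precision(r) for r in runs) / len(runs)
```

Each entry of `relevant_flags` marks whether the result at that rank belongs to the queried category, so a list truncated to the top R results gives an AP at R.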
VT2 Performance (1 in 8)
Category: MAP
202 Talking face with introductory caption: 0.8432
206 Static or minute camera movement, people walking, legs visible: 0.0581
207 Large camera movement, panning left/right or top/down of a scene: 0.7789
208 Movie ending credit: 0.2782
209 Woman monologue (Zhen): 0.9756
Performance of Round 3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Retrieval target (video, R=20): VT2 only / AT1 + VT2
202 Face with introductory caption: 1 / 0.03
209 Woman monologue: 0.35 / 0.1
201 People entering door: NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
Surveillance Event Detection
List of Events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Parallel hardware is already available (GPUs)
If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days.
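The ~5 day figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate is not stated there, so 30 frames per second is an assumption (25 fps gives a slightly longer period).

```python
UPLOAD_HOURS_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9         # from the slide
FRAMES_PER_SECOND = 30         # assumption, not stated on the slide


def days_to_match(photos: float, upload_hours_per_minute: float, fps: float) -> float:
    """Days of YouTube uploads whose total frame count equals the photo count."""
    # hours uploaded per minute -> hours per day -> seconds of video per day
    video_seconds_per_day = upload_hours_per_minute * 60 * 24 * 3600
    frames_per_day = video_seconds_per_day * fps
    return photos / frames_per_day


days = days_to_match(FACEBOOK_PHOTOS, UPLOAD_HOURS_PER_MINUTE, FRAMES_PER_SECOND)
```

Under these assumptions the uploads amount to roughly 3.1 billion frames per day, so `days` comes out a little under 5.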
Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Figure: block diagram with stages Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, and Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space-Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
[Figure: example detections for Laptev, Dollar, and RSMB]
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al., 2003)
Motion History Images (Bobick & Davis, 2001)
\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
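The motion history update sets a pixel to τ wherever the binary motion mask D is 1 and otherwise decays the previous value by 1, floored at 0. A minimal sketch of one time step follows; the function name `update_mhi` and the nested-list image representation are illustrative assumptions.

```python
from typing import List

Grid = List[List[int]]


def update_mhi(prev_mhi: Grid, motion_mask: Grid, tau: int) -> Grid:
    """One time step of the motion history image.

    Pixels where motion_mask is 1 are set to tau; all others decay by 1,
    floored at 0.
    """
    return [[tau if d == 1 else max(0, h - 1)
             for h, d in zip(h_row, d_row)]
            for h_row, d_row in zip(prev_mhi, motion_mask)]
```

Calling this once per frame leaves recently moving pixels bright and older motion progressively darker, so a single image encodes where and how recently motion occurred.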
[Figure: bar chart, milliseconds per frame (axis 0 to 300). CUDA (feature + distance + argmin): 13.3; CUDA (distance + argmin): 43.8; CUDA (distance): 51; C: 251]
Histograms of Oriented Gradients / Optical Flow
Partition the image window into local regions
Histogram the image gradient / optical flow based on the direction and magnitude
Normalize over neighboring regions
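The three steps above can be sketched for the gradient case as follows. This is a simplified illustration, not ViVid's implementation: the cell size, the 9 unsigned orientation bins, hard bin assignment, and 2x2-cell L2 block normalization are common choices assumed here.

```python
import math
from typing import List

Image = List[List[float]]


def cell_histogram(img: Image, top: int, left: int, size: int, bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientation for one cell."""
    hist = [0.0] * bins
    h, w = len(img), len(img[0])
    for y in range(top, top + size):
        for x in range(left, left + size):
            # Central differences, clamped at the image border.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation
            hist[min(int(angle / math.pi * bins), bins - 1)] += mag
    return hist


def hog(img: Image, cell: int = 4, bins: int = 9, eps: float = 1e-6) -> List[float]:
    """Cell histograms, L2-normalized over each 2x2 block of neighboring cells."""
    rows, cols = len(img) // cell, len(img[0]) // cell
    cells = [[cell_histogram(img, r * cell, c * cell, cell, bins)
              for c in range(cols)] for r in range(rows)]
    descriptor: List[float] = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = cells[r][c] + cells[r][c + 1] + cells[r + 1][c] + cells[r + 1][c + 1]
            norm = math.sqrt(sum(v * v for v in block) + eps)
            descriptor.extend(v / norm for v in block)
    return descriptor
```

For the optical flow variant, the same histogramming would be applied to the per-pixel flow vector in place of the gradient `(gx, gy)`.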
K-Means Clustering
Vector quantization
Maps high-dimensional features to a discrete set of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
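The two alternating steps of Lloyd's algorithm can be sketched in plain Python. The random initialization from the data, the fixed iteration count, and the handling of empty clusters are assumptions made for this illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: List[Point], centers: List[List[float]]) -> List[int]:
    """Assignment step: index of the closest center for each point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def lloyd(points: List[Point], k: int, iters: int = 20, seed: int = 0
          ) -> Tuple[List[List[float]], List[int]]:
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # init from the data
    for _ in range(iters):
        labels = assign(points, centers)
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)
```

Once the centers are fixed, the returned label of a feature is its vector-quantized code, which is what the histogramming stage counts.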
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit the full distance output in GPU memory
Will need to reduce as the computation proceeds
Need an efficient reduction operator
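To make the memory argument concrete: a full distance matrix for 1 million features and 1000 centers holds 10^9 values, about 4 GB at 4 bytes each (single precision is an assumption here). The sketch below illustrates the reduce-as-you-go idea on the CPU, keeping only the argmin per feature; the chunking scheme and function names are illustrative, not the GPU kernel itself.

```python
from typing import Iterator, List, Sequence

Point = Sequence[float]


def full_matrix_bytes(n_features: int, n_centers: int, bytes_per_value: int = 4) -> int:
    """Size of the full features-by-centers distance matrix."""
    return n_features * n_centers * bytes_per_value


def chunks(items: List[Point], size: int) -> Iterator[List[Point]]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def nearest_centers(features: List[Point], centers: List[Point], chunk_size: int) -> List[int]:
    """Assign each feature to its closest center, one chunk at a time.

    Only a chunk_size-by-len(centers) block of distances exists at once; each
    row is reduced to a single index (argmin) before the next chunk is touched.
    """
    labels: List[int] = []
    for block in chunks(features, chunk_size):
        dists = [[sum((x - y) ** 2 for x, y in zip(f, c)) for c in centers]
                 for f in block]
        labels.extend(min(range(len(centers)), key=row.__getitem__) for row in dists)
    return labels
```

The output shrinks from one value per (feature, center) pair to one index per feature, which is the quantity k-means actually needs.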
Clustering Helps
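Pairwise Distance Implementation
[Figure: bar chart comparing CUDA and C, axis 0 to 6000]
Given
\[ A = \{a_1, \dots, a_m\}, \qquad B = \{b_1, \dots, b_n\} \]
Compute
\[
\begin{pmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{pmatrix}
\]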
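The slide leaves the distance d unspecified. Assuming squared Euclidean distance, the m-by-n output can be computed from precomputed norms and dot products, a common formulation because the dot-product term has the structure of a matrix multiply. The sketch below is a plain CPU illustration of that identity, not the CUDA implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def pairwise_sq_dist(A: List[Vector], B: List[Vector]) -> List[List[float]]:
    """Return D with D[i][j] = ||a_i - b_j||^2 via ||a||^2 - 2 a.b + ||b||^2."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(y * y for y in b) for b in B]
    return [[a_norms[i] - 2.0 * sum(x * y for x, y in zip(a, b)) + b_norms[j]
             for j, b in enumerate(B)]
            for i, a in enumerate(A)]
```

Every output entry is independent of the others, which is the property that makes this computation map well to a GPU.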
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Figure: input matrices A and B, output C, and shared memory]
Timings on TRECVid 2008 System
[Figure: log-scale bar chart of milliseconds per stage (Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance), comparing GPU + CPU against CPU only]
Benchmark
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
Results (2009)
2008 results in parentheses
Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering, programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet Large Scale Visual Recognition
10000 Classes
20 VT1 Categories 20 VT1 Categories
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
100 Not-Applicable None of the labels
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
103 Mobile devices including handphonePDA
104 Flag
105 Electronic chart eg stock charts airport departure chart
106 TV chart Overlay including graphs text PowerPoint style
107 Person using Computer both visible
108 Track and field sports
109 Company Trademark including billboard logo
110 Badminton court sports
111 Swimming pool sports
112 Close-up of hand eg using mouse writing etc
113 Business meeting (gt 2 people) mostly seated down table visible
114 Natural scene eg mountain trees sea no people
115 Food on dishes plates
116 Face close-up occupying about 34 of screen frontal or side
117 Traffic Scene many cars trucks road visible
118 BoatShip over sea lake
119 PC Webpages screen of PC visible
120 Airplane
10 Categories for VT2 10 Categories for VT2
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
201 People enteringexiting doorcar
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle looking outside
205 Large camera movement tracking an object person car etc
206 Static or minute camera movement people(s) walking legs visible
207 Large camera movement panning leftright topdown of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3 5 Categories for VT3
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
Examples of Images Examples of Images
More samples More samples
Evaluation Video Data of Round2 Evaluation Video Data of Round2
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
Evaluation Video Data of Round3 Evaluation Video Data of Round3
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Computation Powers Computation Powers
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Time Cost for Video Tasks Time Cost for Video Tasks
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Possible Accelerations for Video Possible Accelerations for Video
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Features for Round2- VT1 Features for Round2- VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Features for Round2-VT2 Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE Grand Unified Feature
Extractor
GUFE Grand Unified Feature
Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Observations Observations 1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1in7) Performance of Round3 (1in7)
Task 2 (VT1)
Target Estimated MAP (R=20)
101 Crowd (gt10 people) 064
102 Building with sky as backdrop clearly visible 1
107 Person using Computer both visible 07
112 Closeup of hand eg using mouse writing etc 0527
116 Face closeup occupying about 34 of screen frontal or side 1
Task3 (AT1 + VT2)
Retrieval Target VT2 only AT1 + VT2
Video R=20
202 face with introductory caption 1 003
209 women monolog 035 01
201 People entering door NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid Working with ViVid
Why parallel Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days
Image Video Processing
Video Decoder
2D3D Convolution
2D3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients Optical Flow
Analysis Vector Quantization
SVM Classifier Evaluation
ViVid ndash Video Computer Vision on Graphics Processors
Download
httpgithubcommertdikmenViVid
TRECVid 2008 System
TRECVid 2009 System TRECVid 2009 System
Features Video
Motion
Shape
Classifier
Event Label
bull Running
bull Pointing
bull Object Put
bull Cell To Ear
Vector
Quantization
Histogram
Interest Points
Video Interest Point Detectors Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
Dollar
RSMB
More is Good More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al., 2003)
Motion History Images
(Bobick & Davis, 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\ H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
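A minimal sketch of the Motion History Image update may help: a pixel is set to τ wherever motion is detected in the current frame, and otherwise its previous value decays by one, floored at zero. The function name and the list-of-lists image layout are illustrative choices, not taken from the slides.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One Motion History Image step.

    prev_mhi    : 2D list of ints, H_tau(x, y, t-1)
    motion_mask : 2D list of 0/1, D(x, y, t)
    tau         : history length in frames
    """
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]


mhi = [[0, 0, 0]]
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):   # a blob moving right
    mhi = update_mhi(mhi, mask, tau=3)
print(mhi)
```

After the three frames the most recent motion carries the largest value and older motion has faded, which is what lets a single image encode the direction of movement.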
[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin) and CUDA (feature + distance + argmin); bar labels as extracted: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on direction and magnitude
• Normalize over neighboring regions
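The three steps above can be sketched for the image-gradient case as follows. This is a simplified illustration, not the ViVid implementation: the 4-pixel cells, 9 unsigned orientation bins, hard binning and 2x2-cell L2 block normalization are all assumptions chosen to keep the example short.

```python
import math


def cell_histograms(img, cell=4, bins=9):
    """Per-cell orientation histograms, each vote weighted by gradient magnitude."""
    h, w = len(img), len(img[0])
    rows, cols = h // cell, w // cell
    hist = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]      # central differences
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned direction
            b = min(int(ang / (180.0 / bins)), bins - 1)
            r, c = y // cell, x // cell
            if r < rows and c < cols:
                hist[r][c][b] += mag
    return hist


def block_normalize(hist, eps=1e-6):
    """L2-normalize each 2x2 block of neighboring cells and concatenate."""
    rows, cols = len(hist), len(hist[0])
    out = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = hist[r][c] + hist[r][c + 1] + hist[r + 1][c] + hist[r + 1][c + 1]
            norm = math.sqrt(sum(v * v for v in block)) + eps
            out.extend(v / norm for v in block)
    return out


# 8x8 window containing a vertical edge: all gradient energy is horizontal.
window = [[0] * 4 + [10] * 4 for _ in range(8)]
descriptor = block_normalize(cell_histograms(window))
print(len(descriptor))
```

For optical flow the same binning would be applied to the flow vector at each pixel instead of the intensity gradient.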
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
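The two alternating steps of Lloyd's algorithm can be sketched in plain Python. The initialization (sampling k data points), the fixed iteration count and the handling of empty clusters are assumptions for this toy version; the slides do not specify them.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate assignment and mean update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initialize from the data
    for _ in range(iters):
        # Assignment step: index of the closest center for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center becomes the mean of its points.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:                      # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
    return centers, labels


data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = kmeans(data, k=2)
print(sorted(centers))
```

The assignment step is where almost all the time goes: it needs the distance from every point to every center, which is why the pairwise distance computation gets so much attention.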
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
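To see why the output does not fit, and what reducing as the computation proceeds means, here is a small sketch. It assumes 4-byte single-precision distances (the slide does not say), and the chunked function is a CPU illustration of the idea, not the GPU kernel itself.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_in_chunks(points, centers, chunk=2):
    """Nearest-center index per point, never holding more than
    chunk x len(centers) distances at once."""
    labels = []
    for start in range(0, len(points), chunk):
        block = points[start:start + chunk]
        # Distances for this block only: len(block) x len(centers).
        dists = [[sq_dist(p, c) for c in centers] for p in block]
        # Reduce each row to the index of its minimum (argmin), then drop the block.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in dists)
    return labels


# Size of the full distance matrix from the slide's numbers,
# ASSUMING 4-byte single-precision entries.
full_matrix_bytes = 1_000_000 * 1000 * 4
print(full_matrix_bytes / 1e9, "GB")

pts = [[0.0], [1.0], [9.0], [10.0], [4.0]]
ctr = [[0.0], [10.0]]
print(assign_in_chunks(pts, ctr))
```

Only one label per point survives the reduction, so the result shrinks from a million-by-thousand table to a million integers.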
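Clustering Helps
Pairwise Distance Implementation
[Figure: bar chart comparing CUDA and C; axis ticks 0, 2000, 4000, 6000]
Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute
$$\begin{bmatrix} d(a_1, b_1) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & & \vdots \\ \vdots & \ddots & \\ d(a_m, b_1) & \cdots & d(a_m, b_n) \end{bmatrix}$$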
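As a concrete instance of the m x n distance table, the sketch below uses squared Euclidean distance for d. That choice is an assumption, since the slide leaves d generic. Expanding the square shows that the bulk of the work is a table of dot products, which is the kind of dense, regular computation that maps well to a GPU.

```python
def pairwise_sq_dist(A, B):
    """D[i][j] = squared Euclidean distance between A[i] and B[j].

    Uses ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so the dominant cost
    is the m x n table of dot products (a matrix product A B^T).
    """
    a_norm = [sum(x * x for x in a) for a in A]
    b_norm = [sum(x * x for x in b) for b in B]
    return [
        [a_norm[i] + b_norm[j] - 2.0 * sum(x * y for x, y in zip(a, b))
         for j, b in enumerate(B)]
        for i, a in enumerate(A)
    ]


A = [[0.0, 0.0], [3.0, 4.0]]
B = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
D = pairwise_sq_dist(A, B)
print(D)
```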
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram: input matrices A and B, shared memory, output matrix C]
Timings on TRECVid 2008 System
[Figure: bar chart, milliseconds on a log scale (1 to 10000), comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction and Pairwise Distance; bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
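The sketch below shows one plausible reading of the "Raw" and "Norm" histogramming methods: descriptors are quantized to their nearest dictionary entry, and the resulting counts are either used directly or divided by the number of descriptors. That interpretation, the toy dictionary and the function names are assumptions. The "Mt Inf" variant is not covered because the slides do not define it.

```python
def quantize(descriptor, codebook):
    """Index of the nearest codeword (vector quantization)."""
    return min(range(len(codebook)),
               key=lambda j: sum((x - y) ** 2 for x, y in zip(descriptor, codebook[j])))


def word_histogram(descriptors, codebook, normalize=False):
    """Count how often each codeword is used; optionally L1-normalize."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[quantize(d, codebook)] += 1.0
    if normalize and descriptors:
        hist = [h / len(descriptors) for h in hist]
    return hist


codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]        # toy 3-word dictionary
descs = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]
print(word_histogram(descs, codebook))                  # raw counts
print(word_histogram(descs, codebook, normalize=True))  # fractions summing to 1
```

Normalizing removes the dependence on how many interest points a clip happens to produce, which may be why it helps most in the low detection rate column.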
Results (2009), with 2008 results in parentheses
Event         True Positives   False Alarms   Misses   Min DCR
Pointing      13 (57)          225 (2505)     1050     1.006
Cell To Ear   0 (8)            58 (4005)      194      1.060
Person Runs   1 (0)            38 (314)       106      0.997
Object Put    1 (21)           190 (2703)     620      1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering / Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
10 Categories for VT2
201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object (person, car, etc.)
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug
5 Categories for VT3
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Closeup of hand, e.g. using mouse, writing, etc.
116 Face closeup occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either as an IPA sequence or as a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task combine an IPA sequence/waveform with a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames, 8486 real key frames
Evaluation Video Data of Round3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288
Computation Power
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extracting all the features after one load
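The last point, extracting every feature after a single load, amounts to turning the loops inside out: iterate over frames on the outside and over feature extractors on the inside. A minimal sketch, with a hypothetical loader and toy extractors standing in for the real ones:

```python
def extract_all(paths, load, extractors):
    """Load each item once and run every extractor on the loaded data."""
    features = {name: [] for name in extractors}
    for path in paths:
        frame = load(path)                    # expensive I/O happens once per item
        for name, extract in extractors.items():
            features[name].append(extract(frame))
    return features


loads = []                                    # records every call to the loader


def fake_load(path):                          # hypothetical stand-in for decoding a key frame
    loads.append(path)
    return [len(path), 1, 2]


feats = extract_all(
    ["a.jpg", "bb.jpg"],
    fake_load,
    {"mean": lambda f: sum(f) / len(f), "max": max},
)
print(feats, len(loads))
```

With two extractors the loader still runs only once per file, so the load cost is paid once rather than once per feature type.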
Features for Round2-VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, Texture, etc.
Semantic Feature
Features for Round2-VT2
Character Detector
Harris corner
Morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly
e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
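The query expansion steps can be sketched as a small nearest-neighbor search. The data layout, the squared Euclidean distance and the rule of scoring each evaluation item by its closest expanded-query member are assumptions. The slides say only that retrieval is based on NN. The matching that step 0 precomputes is simply computed on the fly here.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def expanded_search(query, category, dev_set, eval_set, top_k):
    """dev_set: list of (feature, category); eval_set: list of (item_id, feature)."""
    # 1. Expand: the query plus every development image of the same category.
    expanded = [query] + [f for f, c in dev_set if c == category]
    # 2. Search: score each evaluation item by its distance to the nearest
    #    member of the expanded query (nearest-neighbor matching).
    scored = [(min(sq_dist(f, q) for q in expanded), item_id)
              for item_id, f in eval_set]
    scored.sort()
    # Output: ids of the top_k closest evaluation items.
    return [item_id for _, item_id in scored[:top_k]]


dev = [([0.0, 1.0], 101), ([5.0, 5.0], 102), ([1.0, 0.0], 101)]
evl = [("a", [0.1, 0.9]), ("b", [5.1, 5.0]), ("c", [0.9, 0.2]), ("d", [9.0, 9.0])]
print(expanded_search([0.5, 0.5], 101, dev, evl, top_k=2))
```

Items "a" and "c" are each close to a development image of category 101 rather than to the query itself, which is the benefit the expansion is meant to provide.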
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pairwise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
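Step 2 can be illustrated with the widely used mean-only MAP adaptation of a GMM. Whether the system adapted only the means, and what relevance factor it used, is not stated in the slides, so both are assumptions here. The example is one-dimensional to stay short.

```python
import math


def gauss(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """MAP-adapt UBM component means to one image's patches (1-D toy case)."""
    k = len(means)
    n = [0.0] * k          # soft counts per component
    s = [0.0] * k          # responsibility-weighted sums
    for x in patches:
        lik = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        for j in range(k):
            resp = lik[j] / total
            n[j] += resp
            s[j] += resp * x
    adapted = []
    for j in range(k):
        alpha = n[j] / (n[j] + relevance)          # data-dependent mixing weight
        data_mean = s[j] / n[j] if n[j] > 0 else means[j]
        adapted.append(alpha * data_mean + (1 - alpha) * means[j])
    return adapted


# Toy UBM with two components; the image's patches all lie near the first one.
adapted = map_adapt_means([1.0] * 16, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
print(adapted)
```

Components that see many patches move toward the image's data, while components that see almost none stay at the UBM mean, so every image gets a model with the same structure that can be compared component by component.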
VT1 Performance (2 in 8)
Category                                                                   MAP
• 101 Crowd (>10 people)                                                   0.8419
• 102 Building with sky as backdrop, clearly visible                       0.977
• 103 Mobile devices including handphone/PDA                               0.028
• 107 Person using computer, both visible                                  0.2281
• 109 Company trademark including billboard, logo                          0.96
• 112 Closeup of hand, e.g. using mouse, writing, etc.                     0.4584
• 113 Business meeting (>2 people), mostly seated down, table visible      0.0644
• 115 Food on dishes, plates                                               0.2285
• 116 Face closeup occupying about 3/4 of screen, frontal or side          0.9783
• 117 Traffic scene, many cars, trucks, road visible                       0.2901
VT2 Performance (1 in 8)
Category                                                                   MAP
• 202 Talking face with introductory caption                               0.8432
• 206 Static or minute camera movement, people(s) walking, legs visible    0.0581
• 207 Large camera movement, panning left/right, top/down of a scene       0.7789
• 208 Movie ending credit                                                  0.2782
• 209 Woman monologue Zhen                                                 0.9756
Performance of Round3 (1 in 7)
Task 2 (VT1)
Target                                                             Estimated MAP (R=20)
101 Crowd (>10 people)                                             0.64
102 Building with sky as backdrop, clearly visible                 1
107 Person using computer, both visible                            0.7
112 Closeup of hand, e.g. using mouse, writing, etc.               0.527
116 Face closeup occupying about 3/4 of screen, frontal or side    1
Task 3 (AT1 + VT2)
Retrieval Target (Video, R=20)        VT2 only   AT1 + VT2
202 Face with introductory caption    1          0.03
209 Woman monologue                   0.35       0.1
201 People entering door              NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
Surveillance Event Detection
List of Events
1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Detection of Opposing Flow Event
[Diagram blocks: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
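The data-flow bullets (lazy pull, per-frame referencing, caches) describe a pull-based pipeline: nothing is computed until a downstream stage asks for a particular frame, and each stage remembers what it has already produced. The class below is a hypothetical illustration of that pattern in Python. It is not ViVid's actual API.

```python
class Node:
    """A pipeline stage that computes frame results on demand and caches them."""

    def __init__(self, func, *parents):
        self.func = func
        self.parents = parents
        self.cache = {}            # frame index -> result
        self.calls = 0             # how many times func actually ran

    def get(self, frame_idx):
        if frame_idx not in self.cache:
            # Pull the same frame from upstream stages only when needed.
            inputs = [p.get(frame_idx) for p in self.parents]
            self.cache[frame_idx] = self.func(frame_idx, *inputs)
            self.calls += 1
        return self.cache[frame_idx]


# Hypothetical stages: a fake decoder and two consumers of its output.
decoder = Node(lambda t: [t, t + 1, t + 2])               # stands in for a decoded frame
brightness = Node(lambda t, frame: sum(frame) / len(frame), decoder)
peak = Node(lambda t, frame: max(frame), decoder)

print(brightness.get(7), peak.get(7), decoder.calls)
```

Both consumers ask for frame 7, but the decoder runs only once. A real implementation would also need to bound the cache size, which this sketch leaves out.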
Motivations Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid Working with ViVid
Why parallel Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days
Image Video Processing
Video Decoder
2D3D Convolution
2D3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients Optical Flow
Analysis Vector Quantization
SVM Classifier Evaluation
ViVid ndash Video Computer Vision on Graphics Processors
Download
httpgithubcommertdikmenViVid
TRECVid 2008 System
TRECVid 2009 System TRECVid 2009 System
Features Video
Motion
Shape
Classifier
Event Label
bull Running
bull Pointing
bull Object Put
bull Cell To Ear
Vector
Quantization
Histogram
Interest Points
Video Interest Point Detectors Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
Dollar
RSMB
More is Good More is Good
Interest Point Detection Rates
Video Features Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)
Motion History Images
(Bobbick amp Davis 2001)
Motion History Images
(Bobbick amp Davis 2001)
otherwise
1t)yD(xif
1)1)ty(xHmax(0
τt)y(xH
τ
τ
133
438
51
251
0 50 100 150 200 250 300
CUDA (feature + distance + argmin)
CUDA (distance + argmin)
CUDA (distance)
C
milliseconds per frame
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
K-Means Clustering K-Means Clustering
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
K-Means K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Clustering Helps Clustering Helps
Pairwise Distance Implementation Pairwise Distance Implementation
0 2000 4000 6000
CUDA
C
)bd(a)bd(a
)bd(a
)bd(a)bd(a)bd(a
nm1m
12
n11111
n1
m1
bbB
aaA
Compute Given
CPU vs GPU CPU vs GPU
Algorithmic properties that map well to GPUs
1 Independent and highly data local
computations
2Compute bound
3Little branch divergence
Pairwise Distance Computation on the GPU Pairwise Distance Computation on the GPU
Shared
Memory
A B
C
Pairwise Distance Computation Pairwise Distance Computation
A B
C
Pairwise Distance Computation Pairwise Distance Computation
A B
C
Timings on TRECVid 2008 System Timings on TRECVid 2008 System
53 79
23
240 150
53 79
3030 4947
1
10
100
1000
10000
Fetch
Frame
Optical
Flow
Transfer to
GPU
Feature
Extraction
Pairwise
Distance
millise
co
nd
s
GPU + CPU CPU
Benchmark Benchmark
Dictionary Building Strategies Dictionary Building Strategies
Dictionary Size
Histogramming Method Rate of Detection
Low Medium High
1000 Raw 0681 0804 0844
Norm 0708 0799 0840
Mt Inf 0594 0804 0848
500 Raw 0675 0792 0833
Norm 0701 0791 0825
Mt Inf 0626 0783 0819
200 Raw 0671 0772 0811
Norm 0701 0779 0818
Mt Inf 0614 0720 0756
Results (2009) Results (2009)
True Positives False
Alarm
Miss Min DCR
Pointing 13 225 1050 1006
Cell To Ear 0 58 194 1060
Person Runs 1 38 106 0997
Object Put 1 190 620 1020
Results (2009) Results (2009)
True Positives False
Alarm
Miss Min DCR
Pointing 13 (57) 225 (2505) 1050 1006
Cell To Ear 0 (8) 58 (4005) 194 1060
Person Runs 1 (0) 38 (314) 106 0997
Object Put 1 (21) 190 (2703) 620 1020
(2008 Results)
Conclusions Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
ContestsEvaluations Experience
Working with realistic data
Engineering Programming
Tight schedule streamlined development
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
ContestsEvaluations Experience
Working with realistic data
Engineering Programming
Tight schedule streamlined development
Examples of Evaluations Examples of Evaluations
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes Pascal Visual Object Classes
Classificationdetection
Segmentation
Person Layout
Action Classification
Classificationdetection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
ImageNet
Large Scale Visual Recognition
10000 Classes
5 Categories for VT3 5 Categories for VT3
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
101 Crowd (gt10 people)
102 Building with sky as backdrop clearly visible
107 Person using Computer both visible
112 Closeup of hand eg using mouse writing etc
116 Face closeup occupying about 34 of screen frontal or side
Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
1) Audio search (AT1 or AT2)
5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2
Examples of Images Examples of Images
More samples More samples
Evaluation Video Data of Round2 Evaluation Video Data of Round2
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
Evaluation Video Data of Round3 Evaluation Video Data of Round3
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Computation Powers Computation Powers
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Time Cost for Video Tasks Time Cost for Video Tasks
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Possible Accelerations for Video Possible Accelerations for Video
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Features for Round2- VT1 Features for Round2- VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Features for Round2-VT2 Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE Grand Unified Feature
Extractor
GUFE Grand Unified Feature
Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Observations Observations 1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using Computer, both visible: 0.7
112 Closeup of hand, e.g. using mouse, writing, etc.: 0.527
116 Face closeup occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Retrieval Target (Video, R=20) | VT2 only | AT1 + VT2
202 face with introductory caption | 1 | 0.03
209 women monolog | 0.35 | 0.1
201 People entering door | NA |
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task: Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
List of Events
1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Detection of Opposing Flow Event
[Diagram labels: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA C/C++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days.
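The ~5 day figure can be checked with a quick calculation. The upload rate and photo count are the slide's numbers; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption, not stated on the slide

# Frames contained in the video uploaded during one minute of wall-clock time.
frames_per_minute = HOURS_UPLOADED_PER_MINUTE * 3600 * FRAMES_PER_SECOND
frames_per_day = frames_per_minute * 60 * 24
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:.3g} frames per day, {days_to_match:.1f} days to reach 15 billion")
```

At 30 frames per second this comes to roughly 3.1 billion frames per day, so about 4.8 days; at 25 frames per second it would be closer to 5.8 days.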
Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[System diagram labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space-Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
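The motion history update is a per-pixel recurrence: a pixel is set to τ where motion is detected and otherwise decays by one per frame, floored at zero. A minimal plain-Python sketch follows; `update_mhi` is a hypothetical helper name, and how the binary motion mask D is obtained (for example by frame differencing) is left open.

```python
def update_mhi(history, motion_mask, tau):
    """One time step of the motion history image update.

    history:     2D list holding H_tau(x, y, t-1)
    motion_mask: 2D list of 0/1 values D(x, y, t)
    Returns a new 2D list holding H_tau(x, y, t).
    """
    return [
        [tau if moving else max(0, previous - 1)
         for previous, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(history, motion_mask)
    ]

h = [[0, 0, 0]]
h = update_mhi(h, [[1, 0, 0]], tau=3)   # motion at the first pixel
h = update_mhi(h, [[0, 1, 0]], tau=3)   # motion moves to the second pixel
```

Each pixel depends only on its own previous value and its own mask entry, which is why the update parallelizes trivially across pixels.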
[Bar chart, milliseconds per frame (axis 0 to 300): C, CUDA (distance), CUDA (distance + argmin), CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
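The three steps (partition, histogram by direction and magnitude, normalize over neighbors) can be sketched for image gradients in plain Python. The cell size, the number of bins, the unsigned orientation range, and the 2x2 block normalization are assumptions chosen for illustration, not parameters given on the slide; a practical implementation would be vectorized or run on the GPU.

```python
import math

def cell_histograms(image, cell=4, bins=8):
    """Magnitude-weighted orientation histograms, one per cell x cell region."""
    rows, cols = len(image), len(image[0])
    n_cy, n_cx = rows // cell, cols // cell
    hists = [[[0.0] * bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            cy, cx = y // cell, x // cell
            if cy >= n_cy or cx >= n_cx:
                continue  # pixel falls outside the last full cell
            gx = image[y][x + 1] - image[y][x - 1]   # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi     # unsigned orientation in [0, pi)
            b = min(int(angle / math.pi * bins), bins - 1)
            hists[cy][cx][b] += magnitude
    return hists

def normalize_blocks(hists, eps=1e-6):
    """Concatenate each 2x2 block of cell histograms and L2-normalize it."""
    blocks = []
    for cy in range(len(hists) - 1):
        for cx in range(len(hists[0]) - 1):
            vec = (hists[cy][cx] + hists[cy][cx + 1]
                   + hists[cy + 1][cx] + hists[cy + 1][cx + 1])
            norm = math.sqrt(sum(v * v for v in vec) + eps)
            blocks.append([v / norm for v in vec])
    return blocks

# Toy 8x8 image with a horizontal intensity ramp: every gradient points along x.
image = [[float(x) for x in range(8)] for _ in range(8)]
hists = cell_histograms(image)
descriptor = normalize_blocks(hists)
```

For optical flow the same histogramming applies with the flow components (u, v) at each pixel in place of the gradient components (gx, gy).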
K-Means Clustering
Vector quantization
Turns high dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update the center to be the mean of the associated data points
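The two steps of Lloyd's algorithm can be written down directly. The sketch below is plain Python for readability; the fixed iteration count, the caller-supplied initial centers, and the rule of leaving an empty cluster's center unchanged are assumptions, not details from the slide.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def lloyd_kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate assignment and mean update.

    points:  list of feature vectors
    centers: list of initial centers (how to initialize is left open here)
    Returns (centers, assignments).
    """
    centers = [list(c) for c in centers]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        assignments = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if no point was assigned to it
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centers, labels = lloyd_kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
```

Once the centers are fixed, the assignment step alone is the vector quantizer: each feature is replaced by the index of its nearest center.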
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as the computation proceeds
Need efficient reduction operator
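To see why the output does not fit: 1 million features against 1000 centers gives 10^9 distances, about 4 GB if each is stored as a 4-byte float (the storage width is an assumption). The sketch below illustrates the "reduce as the computation proceeds" idea in plain Python: distances are computed one chunk at a time and each chunk is immediately reduced to one argmin per feature. The function name and `chunk_size` are hypothetical, and this is not the ViVid CUDA implementation.

```python
def nearest_centers_chunked(features, centers, chunk_size=1024):
    """Index of the closest center for every feature, computed chunk by chunk.

    Only a chunk_size x len(centers) block of distances exists at any time;
    each block is reduced to one argmin per row before the next is computed.
    """
    labels = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance block for this chunk only.
        block = [
            [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centers]
            for f in chunk
        ]
        # Reduction: keep the argmin of each row and discard the block.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return labels

features = [[0.0], [1.0], [9.0], [10.0], [4.0]]
labels = nearest_centers_chunked(features, [[0.0], [10.0]], chunk_size=2)
```

The result is independent of `chunk_size`, which only trades peak memory against the number of passes.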
Clustering Helps
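Pairwise Distance Implementation
[Bar chart comparing the CUDA and C implementations; axis 0 to 6000]
Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute
$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ \vdots & & \ddots & \vdots \\ d(a_m, b_1) & \cdots & & d(a_m, b_n) \end{bmatrix}$$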
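A minimal sketch of the pairwise distance computation, assuming d is the squared Euclidean distance (the slide does not specify d). It uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, so most of the work is inner products with the structure of a matrix multiplication.

```python
def pairwise_squared_distances(A, B):
    """Return the m x n matrix D with D[i][j] = ||a_i - b_j||^2."""
    # Squared norms are computed once per vector and reused for every pair.
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(x * x for x in b) for b in B]
    return [
        [a_norms[i] + b_norms[j] - 2.0 * sum(x * y for x, y in zip(A[i], B[j]))
         for j in range(len(B))]
        for i in range(len(A))
    ]

A = [[0.0, 0.0], [3.0, 4.0]]
B = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
D = pairwise_squared_distances(A, B)
```

Every entry of the output can be computed independently of the others, which is the property that makes the operation a good fit for a GPU.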
CPU vs GPU
Algorithmic properties that map well to GPUs:
1. Independent and highly data local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram sequence over three slides. Labels: input matrices A and B, output matrix C, shared memory]
Timings on TRECVid 2008 System
[Bar chart, milliseconds on a log scale from 1 to 10000, comparing GPU + CPU against CPU for: Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size | Histogramming Method | Rate of Detection: Low | Medium | High
1000 | Raw | 0.681 | 0.804 | 0.844
1000 | Norm | 0.708 | 0.799 | 0.840
1000 | Mt Inf | 0.594 | 0.804 | 0.848
500 | Raw | 0.675 | 0.792 | 0.833
500 | Norm | 0.701 | 0.791 | 0.825
500 | Mt Inf | 0.626 | 0.783 | 0.819
200 | Raw | 0.671 | 0.772 | 0.811
200 | Norm | 0.701 | 0.779 | 0.818
200 | Mt Inf | 0.614 | 0.720 | 0.756
Results (2009), with 2008 results in parentheses
Event | True Positives | False Alarm | Miss | Min DCR
Pointing | 13 (57) | 225 (2505) | 1050 | 1.006
Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060
Person Runs | 1 (0) | 38 (314) | 106 | 0.997
Object Put | 1 (21) | 190 (2703) | 620 | 1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering, Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data of Round2
31 Mpeg videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames, 8486 real key frames
Evaluation Video Data of Round3
Video Files: 27 Mpeg1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288
Computation Powers
Work Stations in IFP
10 Servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128 node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (Matlab)
Semantic Feature 1: 24 hours (Matlab)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (Matlab)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, Matlab)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Matlab code to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Features for Round2-VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1: 101, 103 (26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1in7) Performance of Round3 (1in7)
Task 2 (VT1)
Target Estimated MAP (R=20)
101 Crowd (gt10 people) 064
102 Building with sky as backdrop clearly visible 1
107 Person using Computer both visible 07
112 Closeup of hand eg using mouse writing etc 0527
116 Face closeup occupying about 34 of screen frontal or side 1
Task3 (AT1 + VT2)
Retrieval Target VT2 only AT1 + VT2
Video R=20
202 face with introductory caption 1 003
209 women monolog 035 01
201 People entering door NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid Working with ViVid
Why parallel Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days
Image Video Processing
Video Decoder
2D3D Convolution
2D3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients Optical Flow
Analysis Vector Quantization
SVM Classifier Evaluation
ViVid ndash Video Computer Vision on Graphics Processors
Download
httpgithubcommertdikmenViVid
TRECVid 2008 System
TRECVid 2009 System TRECVid 2009 System
Features Video
Motion
Shape
Classifier
Event Label
bull Running
bull Pointing
bull Object Put
bull Cell To Ear
Vector
Quantization
Histogram
Interest Points
Video Interest Point Detectors Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
Dollar
RSMB
More is Good More is Good
Interest Point Detection Rates
Video Features Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)
Motion History Images
(Bobbick amp Davis 2001)
Motion History Images
(Bobbick amp Davis 2001)
otherwise
1t)yD(xif
1)1)ty(xHmax(0
τt)y(xH
τ
τ
133
438
51
251
0 50 100 150 200 250 300
CUDA (feature + distance + argmin)
CUDA (distance + argmin)
CUDA (distance)
C
milliseconds per frame
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
K-Means Clustering K-Means Clustering
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
K-Means K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Clustering Helps Clustering Helps
Pairwise Distance Implementation Pairwise Distance Implementation
0 2000 4000 6000
CUDA
C
)bd(a)bd(a
)bd(a
)bd(a)bd(a)bd(a
nm1m
12
n11111
n1
m1
bbB
aaA
Compute Given
CPU vs GPU CPU vs GPU
Algorithmic properties that map well to GPUs
1 Independent and highly data local
computations
2Compute bound
3Little branch divergence
Pairwise Distance Computation on the GPU Pairwise Distance Computation on the GPU
Shared
Memory
A B
C
Pairwise Distance Computation Pairwise Distance Computation
A B
C
Pairwise Distance Computation Pairwise Distance Computation
A B
C
Timings on TRECVid 2008 System Timings on TRECVid 2008 System
53 79
23
240 150
53 79
3030 4947
1
10
100
1000
10000
Fetch
Frame
Optical
Flow
Transfer to
GPU
Feature
Extraction
Pairwise
Distance
millise
co
nd
s
GPU + CPU CPU
Benchmark Benchmark
Dictionary Building Strategies Dictionary Building Strategies
Dictionary Size
Histogramming Method Rate of Detection
Low Medium High
1000 Raw 0681 0804 0844
Norm 0708 0799 0840
Mt Inf 0594 0804 0848
500 Raw 0675 0792 0833
Norm 0701 0791 0825
Mt Inf 0626 0783 0819
200 Raw 0671 0772 0811
Norm 0701 0779 0818
Mt Inf 0614 0720 0756
Results (2009) Results (2009)
True Positives False
Alarm
Miss Min DCR
Pointing 13 225 1050 1006
Cell To Ear 0 58 194 1060
Person Runs 1 38 106 0997
Object Put 1 190 620 1020
Results (2009) Results (2009)
True Positives False
Alarm
Miss Min DCR
Pointing 13 (57) 225 (2505) 1050 1006
Cell To Ear 0 (8) 58 (4005) 194 1060
Person Runs 1 (0) 38 (314) 106 0997
Object Put 1 (21) 190 (2703) 620 1020
(2008 Results)
Conclusions Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
ContestsEvaluations Experience
Working with realistic data
Engineering Programming
Tight schedule streamlined development
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
ContestsEvaluations Experience
Working with realistic data
Engineering Programming
Tight schedule streamlined development
Examples of Evaluations Examples of Evaluations
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes Pascal Visual Object Classes
Classificationdetection
Segmentation
Person Layout
Action Classification
Classificationdetection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
ImageNet
Large Scale Visual Recognition
10000 Classes
Examples of Images Examples of Images
More samples More samples
Evaluation Video Data of Round2 Evaluation Video Data of Round2
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
Evaluation Video Data of Round3 Evaluation Video Data of Round3
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Computation Powers Computation Powers
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Time Cost for Video Tasks Time Cost for Video Tasks
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Possible Accelerations for Video Possible Accelerations for Video
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Features for Round2- VT1 Features for Round2- VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Features for Round2-VT2 Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE Grand Unified Feature
Extractor
GUFE Grand Unified Feature
Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Observations Observations 1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance (1 in 8)
Category: MAP
• 202 Talking face with introductory caption: 0.8432
• 206 Static or minute camera movement, people(s) walking, legs visible: 0.0581
• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789
• 208 Movie ending credit: 0.2782
• 209 Woman monologue Zhen: 0.9756
Performance of Round3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using Computer, both visible: 0.7
112 Closeup of hand, e.g. using mouse, writing, etc.: 0.527
116 Face closeup occupying about 3/4 of screen, frontal or side: 1
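MAP scores truncated at a rank R can be computed as in the sketch below. The exact definition used by the Star Challenge organizers is not given in the slides, so the normalization by `min(R, num_relevant)` is an assumption; other conventions divide by the number of relevant items actually retrieved.

```python
def average_precision_at_r(ranked_relevance, r, num_relevant):
    """AP over the top r results; ranked_relevance[i] is True if result i is relevant."""
    hits = 0
    total = 0.0
    for k, rel in enumerate(ranked_relevance[:r], start=1):
        if rel:
            hits += 1
            total += hits / k  # precision at rank k, counted only at relevant ranks
    denom = min(r, num_relevant)
    return total / denom if denom else 0.0

def mean_average_precision(queries, r):
    """queries: list of (ranked_relevance, num_relevant) pairs, one per query."""
    scores = [average_precision_at_r(rel, r, n) for rel, n in queries]
    return sum(scores) / len(scores)

# Toy example: relevant results at ranks 1 and 3, two relevant items overall.
ap = average_precision_at_r([True, False, True], r=3, num_relevant=2)
```

A score of 1 at R=20 therefore means that every one of the top 20 results was relevant under this definition.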
Task 3 (AT1 + VT2), Video, R=20
Retrieval Target: VT2 only; AT1 + VT2
202 face with introductory caption: 1; 0.03
209 women monolog: 0.35; 0.1
201 People entering door: NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task: Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
List of Events
1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
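The data-flow idea (lazy pull, per-frame referencing, caches) can be illustrated with a small pull-based pipeline. The `Source` and `Node` classes below are hypothetical and are not ViVid's API; they only show how a request for one frame index propagates upstream and how caching avoids recomputation.

```python
class Source:
    """Stands in for a video decoder; counts how often a frame is actually read."""
    def __init__(self, frames):
        self.frames = frames
        self.reads = 0

    def get(self, index):
        self.reads += 1
        return self.frames[index]

class Node:
    """Pull-based stage: computes frame `index` only when asked, then caches it."""
    def __init__(self, parent, func):
        self.parent = parent
        self.func = func
        self.cache = {}
        self.computed = 0

    def get(self, index):
        if index not in self.cache:
            self.cache[index] = self.func(self.parent.get(index))
            self.computed += 1
        return self.cache[index]

# Toy pipeline: decode -> scale -> sum, addressed by frame index.
source = Source([[1, 2], [3, 4], [5, 6]])
scaled = Node(source, lambda frame: [2 * v for v in frame])
total = Node(scaled, sum)
first = total.get(1)   # pulls frame 1 through every stage
again = total.get(1)   # served from the cache, no upstream work
```

Nothing is computed for frames 0 and 2, which is the point of pulling on demand rather than pushing every frame through every stage.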
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
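The ~5 day figure can be checked with a short calculation. The upload rate and photo count are the ones quoted on the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FPS = 30                         # assumption: typical video frame rate

# Hours of video per day, converted to seconds, then to frames.
video_hours_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24
frames_per_day = video_hours_per_day * 3600 * FPS
days_to_match = FACEBOOK_PHOTOS / frames_per_day
```

With these numbers `frames_per_day` is about 3.1 billion, so matching 15 billion photos takes a little under five days.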
Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid - Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Diagram components: Video; Interest Points; Features (Motion, Shape); Vector Quantization; Histogram; Classifier; Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space-Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
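A motion history image is updated per frame: a pixel is set to τ where the motion mask D fires and otherwise decays by one, floored at zero. The sketch below applies that rule to nested lists. The binary motion mask is taken as given, since how D is computed (for example by frame differencing) is not specified here.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step of the motion history image update.

    prev_mhi:    2D list of H_tau(x, y, t-1)
    motion_mask: 2D list of D(x, y, t), truthy where motion was detected
    tau:         value assigned to freshly moving pixels
    """
    return [[tau if d else max(0, h - 1) for h, d in zip(h_row, d_row)]
            for h_row, d_row in zip(prev_mhi, motion_mask)]

# Toy 2x2 example with tau = 5: only the top-left pixel moves in this frame.
mhi = update_mhi([[0, 3], [1, 5]], [[1, 0], [0, 0]], tau=5)
```

Recent motion therefore appears bright and older motion fades, so a single image encodes both where and how recently something moved.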
[Chart: milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin) and CUDA (feature + distance + argmin); bar values as extracted: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient/optical flow based on the direction and magnitude
• Normalize over neighboring regions
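The three steps can be sketched for image gradients as follows. The cell size, the 9 unsigned orientation bins, central-difference gradients, and a single L2 normalization over the whole window (rather than over overlapping blocks of neighboring cells) are simplifying assumptions made for this illustration.

```python
import math

def gradient(img, y, x):
    """Central-difference gradient with clamped borders."""
    h, w = len(img), len(img[0])
    gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
    gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
    return gx, gy

def cell_histogram(img, y0, x0, cell=4, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientation in one cell."""
    hist = [0.0] * bins
    for y in range(y0, y0 + cell):
        for x in range(x0, x0 + cell):
            gx, gy = gradient(img, y, x)
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / (180.0 / bins)) % bins] += mag
    return hist

def window_descriptor(img, cell=4, bins=9):
    """Partition the window into cells, histogram each, then L2-normalize."""
    feats = []
    for y0 in range(0, len(img) - cell + 1, cell):
        for x0 in range(0, len(img[0]) - cell + 1, cell):
            feats.extend(cell_histogram(img, y0, x0, cell, bins))
    norm = math.sqrt(sum(v * v for v in feats)) + 1e-12
    return [v / norm for v in feats]

# Toy 8x8 window with a vertical edge: dark left half, bright right half.
edge = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(8)]
desc = window_descriptor(edge)
```

For optical flow the same code applies with the per-pixel flow vector taking the place of `(gx, gy)`.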
K-Means Clustering
Vector quantization
Turns high dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update the center to be the mean of the associated data points
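Lloyd's two alternating steps can be sketched in a few lines. Initial centers are supplied by the caller and the iteration count is fixed; both are assumptions of this sketch, as the slides do not describe initialization or stopping criteria.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, centers, iterations=10):
    """Alternate assignment and mean update for a fixed number of iterations."""
    centers = [list(c) for c in centers]
    assign = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        assign = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # leave an empty cluster's center where it is
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

# Toy data: two well-separated groups.
pts = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centers, labels = lloyd(pts, [[0.0, 0.0], [10.0, 10.0]])
```

The returned labels are the vector-quantized form of the data: each high dimensional feature is replaced by the index of its center.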
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
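With 10^6 features and 10^3 centers the full distance table has 10^9 entries, roughly 4 GB in single precision. One way to avoid storing it is to process the features in chunks and reduce each chunk's distance block to an argmin before moving on. The sketch below shows the idea on the CPU; the chunk size and function name are illustrative, and a GPU version would fuse the same reduction into the distance kernel.

```python
def nearest_center_chunked(features, centers, chunk_size=2):
    """Return the index of the nearest center for every feature.

    Only a chunk_size x len(centers) block of distances exists at any time.
    """
    labels = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance block for this chunk only (squared Euclidean).
        block = [[sum((x - c) ** 2 for x, c in zip(f, ctr)) for ctr in centers]
                 for f in chunk]
        # Reduction: collapse each row to the position of its minimum.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return labels

# Toy data: five 2-D features, two centers.
feats = [[0, 0], [9, 9], [1, 1], [8, 8], [0, 1]]
nearest = nearest_center_chunked(feats, [[0, 0], [10, 10]], chunk_size=2)
```

The output shrinks from one value per (feature, center) pair to one integer per feature, which is all that the k-means assignment step needs.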
Clustering Helps
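Pairwise Distance Implementation
[Chart: CUDA vs. C, axis 0 to 6000]
Given
\[ A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n] \]
Compute
\[
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
\]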
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Figure sequence: matrices A, B and C, with shared memory]
Timings on TRECVid 2008 System
[Chart: milliseconds (log scale, 1 to 10000) per stage (Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance), GPU + CPU vs. CPU; values as extracted: 53, 79; 23; 240, 150; 53, 79; 3030, 4947]
Benchmark
Dictionary Building Strategies
Rate of detection (Low, Medium, High) by dictionary size and histogramming method
Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
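The "Raw" and "Norm" histogramming methods can be read as counting quantized features per dictionary word, with or without normalization. The sketch below assumes L1 normalization, since the slides do not say which norm was used, and it leaves out the "Mt Inf" variant because its definition is not given.

```python
def word_histogram(word_ids, dictionary_size, normalize=False):
    """Histogram of dictionary word indices for one video segment.

    word_ids: index of the nearest dictionary center for each feature
    normalize: if True, scale the counts to sum to 1 (L1, an assumption)
    """
    hist = [0.0] * dictionary_size
    for w in word_ids:
        hist[w] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [h / total for h in hist]
    return hist

# Toy example: four features quantized against a 4-word dictionary.
raw = word_histogram([0, 2, 2, 3], 4)
norm = word_histogram([0, 2, 2, 3], 4, normalize=True)
```

Normalizing removes the dependence on how many interest points fired in a segment, which may explain why the two methods behave differently at low detection rates.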
Results (2009)
2008 results in parentheses
Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering / Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
More samples
Evaluation Video Data of Round2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames + 8486 real key frames
Evaluation Video Data of Round3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288
Computation Powers
Work Stations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
MATLAB code to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
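When load time dominates, each frame can be read once and handed to every feature extractor while it is still in memory. The sketch below shows that loop; the `load` callable and the extractor functions are hypothetical stand-ins, not the team's code.

```python
def extract_all(paths, load, extractors):
    """Load each frame once and run every extractor on the in-memory copy.

    paths:      iterable of frame identifiers
    load:       callable that reads one frame (the expensive step)
    extractors: dict mapping feature name -> function(frame) -> feature
    """
    results = {name: [] for name in extractors}
    for path in paths:
        frame = load(path)  # a single load per frame, shared by all extractors
        for name, fn in extractors.items():
            results[name].append(fn(frame))
    return results

# Toy usage: a fake loader that counts its calls, and two trivial "features".
load_calls = []

def fake_load(path):
    load_calls.append(path)
    return [len(path), 1, 2]

features = extract_all(["a.jpg", "bb.jpg"], fake_load,
                       {"mean": lambda f: sum(f) / len(f), "max": max})
```

Running the extractors separately would call the loader once per feature per frame; here the number of loads equals the number of frames.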
Features for Round2-VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, Texture, etc.
Semantic Feature
Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Evaluation Video Data of Round2 Evaluation Video Data of Round2
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
31 Mpeg Videos ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
32508 pseudo key frames 8486 real key frames
Evaluation Video Data of Round3 Evaluation Video Data of Round3
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Video Files 27 Mpeg1 files (13 hours of videoaudio in total)
Key frames for VT1 10580 jpg files
Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg
files (pseudo key frames)
Video 352288
Computation Powers Computation Powers
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Time Cost for Video Tasks Time Cost for Video Tasks
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Possible Accelerations for Video Possible Accelerations for Video
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Features for Round2- VT1 Features for Round2- VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Features for Round2-VT2 Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE Grand Unified Feature
Extractor
GUFE Grand Unified Feature
Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Observations Observations 1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1 in 7)
Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using Computer, both visible: 0.7
112 Closeup of hand, e.g. using mouse, writing, etc.: 0.527
116 Face closeup occupying about 3/4 of screen, frontal or side: 1
Task 3 (AT1 + VT2)
Retrieval Target (Video, R=20): VT2 only, AT1 + VT2
202 face with introductory caption: 1, 0.03
209 women monolog: 0.35, 0.1
201 People entering door: NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
Surveillance Event Detection
List of Events
1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA C/C++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube replicates the entire Facebook image DB every ~5 days
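The "~5 days" figure can be checked with a short calculation. The frame rate is not given on the slide; 30 frames per second is assumed here, and the function name is hypothetical.

```python
def days_to_match(photo_count, upload_hours_per_minute, fps=30.0):
    """Days of uploads needed for the uploaded frame count to reach photo_count."""
    # Frames uploaded during one minute of wall-clock time (fps is an assumption).
    frames_per_minute = upload_hours_per_minute * 3600 * fps
    frames_per_day = frames_per_minute * 60 * 24
    return photo_count / frames_per_day

# 15 billion photos against 20 hours of video uploaded per minute.
print(round(days_to_match(15e9, 20), 1))  # prints 4.8
```

At 30 frames per second the uploads amount to about 3.1 billion frames per day, which is consistent with the slide's estimate.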
Image / Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Diagram blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label]
Event Label
• Running
• Pointing
• Object Put
• Cell To Ear
Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
[Figure panels: Laptev, Dollar, RSMB]
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
$$H_\tau(x,y,t) = \begin{cases} \tau & \text{if } D(x,y,t) = 1 \\ \max\bigl(0,\ H_\tau(x,y,t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
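The motion history recurrence sets a pixel to τ wherever the motion indicator D is 1 and otherwise decays the previous value by 1, floored at 0. A minimal NumPy sketch of one update step follows; the function name and the boolean-mask representation of D are assumptions.

```python
import numpy as np

def update_mhi(prev_mhi, motion_mask, tau):
    """One motion-history step: moving pixels become tau, the rest decay by 1."""
    # max(0, H(x, y, t-1) - 1) for pixels without motion.
    decayed = np.maximum(0, prev_mhi - 1)
    # D(x, y, t) == 1 is represented by a True entry in motion_mask.
    return np.where(motion_mask, tau, decayed)
```

Every pixel is updated independently of its neighbours, which is the kind of local operation the slides argue maps well to GPUs.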
[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin); extracted bar labels: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
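The three steps can be sketched with NumPy for the image-gradient case. This is an illustrative simplification, not the ViVid implementation: the cell size, the 9 unsigned orientation bins, the hard bin assignment, and the 2x2 block normalization are assumed choices, and the function names are hypothetical.

```python
import numpy as np

def cell_orientation_histograms(image, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bin_idx = np.minimum((angle / np.pi * bins).astype(int), bins - 1)
    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            # Partition: one local region (cell) at a time.
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=magnitude[sl].ravel(), minlength=bins)
    return hist

def normalize_blocks(hist, eps=1e-6):
    """L2-normalize each 2x2 block of neighboring cells; one row per block."""
    rows, cols, _ = hist.shape
    blocks = np.array([hist[r:r + 2, c:c + 2].ravel()
                       for r in range(rows - 1) for c in range(cols - 1)])
    return blocks / np.sqrt((blocks ** 2).sum(axis=1, keepdims=True) + eps ** 2)
```

For the optical-flow variant, the gradient components would be replaced by the two flow components; the histogramming and normalization stay the same.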
K-Means Clustering
Vector quantization
Turns high dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
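The two alternating steps of Lloyd's algorithm can be written directly in NumPy. This is a minimal CPU sketch; the random initialization from the data, the fixed iteration count, and the function name are assumptions.

```python
import numpy as np

def lloyd_kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assignment step: closest center for each data point.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its points
        # (an empty cluster keeps its previous center).
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)
```

The assignment step is a pairwise distance computation followed by an argmin, which is why the later slides concentrate on that kernel.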
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
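The idea of reducing while the computation proceeds can be illustrated on the CPU: process the features in chunks and keep only the argmin of each row, so the full features-by-centers distance matrix is never stored. This NumPy sketch is a stand-in for the GPU kernel, not a reproduction of it; the chunk size and the function name are assumptions.

```python
import numpy as np

def chunked_nearest_center(features, centers, chunk=4096):
    """Nearest-center index per feature without storing the full distance matrix."""
    center_sq = (centers ** 2).sum(axis=1)
    labels = np.empty(len(features), dtype=np.int64)
    for start in range(0, len(features), chunk):
        block = features[start:start + chunk]
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b; the ||a||^2 term is constant
        # within a row, so it can be dropped without changing the argmin.
        scores = center_sq[None, :] - 2.0 * block @ centers.T
        # Reduction: keep one index per feature, discard the chunk's distances.
        labels[start:start + chunk] = scores.argmin(axis=1)
    return labels
```

With 1 million features and 1000 centers, the full matrix would hold 10^9 entries, while the reduced output holds only 10^6 indices.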
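Clustering Helps
Pairwise Distance Implementation
[Figure: bar chart comparing the CUDA and C implementations; axis 0 to 6000]
Given
$$A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]$$
Compute
$$\begin{bmatrix} d(a_1,b_1) & d(a_1,b_2) & \cdots & d(a_1,b_n) \\ d(a_2,b_1) & \ddots & & \vdots \\ \vdots & & & \\ d(a_m,b_1) & \cdots & & d(a_m,b_n) \end{bmatrix}$$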
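For reference, the full m-by-n matrix of d(a_i, b_j) can be computed with one matrix product. The slides do not fix the distance d; squared Euclidean distance is assumed here, and the function name is hypothetical.

```python
import numpy as np

def pairwise_sq_distances(A, B):
    """Matrix of squared Euclidean distances between rows of A (m, D) and B (n, D)."""
    a_sq = (A ** 2).sum(axis=1)[:, None]  # (m, 1)
    b_sq = (B ** 2).sum(axis=1)[None, :]  # (1, n)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.maximum(a_sq + b_sq - 2.0 * A @ B.T, 0.0)
```

Expressing the computation as a matrix product is what allows it to reuse the tiling pattern familiar from matrix multiplication.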
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram: matrices A, B, and C, with shared memory]
Timings on TRECVid 2008 System
[Figure: bar chart, log scale from 1 to 10000 milliseconds, comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; extracted bar labels: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
                  Norm                   0.708   0.799    0.840
                  Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
                  Norm                   0.701   0.791    0.825
                  Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
                  Norm                   0.701   0.779    0.818
                  Mt Inf                 0.614   0.720    0.756
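The "Raw" and "Norm" rows presumably differ in whether the codeword histogram is normalized. The sketch below shows that reading; it is an interpretation rather than the documented method, the L1 normalization is an assumption, and the "Mt Inf" variant is not covered because the slides do not define it.

```python
import numpy as np

def codeword_histogram(assignments, dictionary_size, normalize=False):
    """Histogram of vector-quantized features over a dictionary of codewords."""
    # Raw counts: how many features were assigned to each codeword.
    hist = np.bincount(assignments, minlength=dictionary_size).astype(float)
    if normalize and hist.sum() > 0:
        # Assumed normalization: divide by the total count (L1).
        hist /= hist.sum()
    return hist
```

Normalizing removes the dependence on how many interest points a clip happens to produce.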
Results (2009)
(2008 results in parentheses)
Event          True Positives   False Alarm   Miss   Min DCR
Pointing       13 (57)          225 (2505)    1050   1.006
Cell To Ear    0 (8)            58 (4005)     194    1.060
Person Runs    1 (0)            38 (314)      106    0.997
Object Put     1 (21)           190 (2703)    620    1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering / Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
Evaluation Video Data of Round3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288
Computation Power
Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
Port MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
Features for Round2 - VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, Texture, etc.
Semantic Feature
Features for Round2 - VT2
Character Detector
Harris corner
Morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101, 103 (26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
Computation Powers Computation Powers
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Work Stations in IFP
10 Servers 2~4 CPU each 36CPU in total
IFP-32 Cluster 32 dual-core 28G 64bit CPU
CSL Cluster
Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node
TeraGrid
Time Cost for Video Tasks Time Cost for Video Tasks
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Data Decompression 15 minutes
Video Format Conversion 2 hours
Video Segmentation (for VT2) 40 minutes
Sound Track Extraction 30 minutes
Feature Extraction
Global Feature 2 2 hours (c)
Global Feature 1 2 hours (c)
Patch-based Feature1 2 hours (c)
Patch-based Feature2 5 hours (matlab)
Semantic Feature 1 24 hours (matlab)
Semantic Feature 2 3 hours (c)
Semantic Feature 3 4 hours (c)
Motion Feature 1 24 hours (matlab)
Motion Feature 2 3 hours on t-Illiac
Classifier Training
Classifier 1 1 hour (on IFP cluster25 CPU matlab)
Classifier 2 20 minutes
Classifier 3 less than 10 minutes
Possible Accelerations for Video Possible Accelerations for Video
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Matlab codes to C
Parallel computing
GPU Acceleration
Patch based features
Load time is the major issue
Extracting all the features after one load
Features for Round2- VT1 Features for Round2- VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Image Features
SIFT
HOG
GIST
APC
LBP
Color Texture and etc
Semantic Feature
Features for Round2-VT2 Features for Round2-VT2
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
Character Detector
Harris corner
morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE Grand Unified Feature
Extractor
GUFE Grand Unified Feature
Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Observations Observations 1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1in7) Performance of Round3 (1in7)
Task 2 (VT1)
Target Estimated MAP (R=20)
101 Crowd (gt10 people) 064
102 Building with sky as backdrop clearly visible 1
107 Person using Computer both visible 07
112 Closeup of hand eg using mouse writing etc 0527
116 Face closeup occupying about 34 of screen frontal or side 1
Task3 (AT1 + VT2)
Retrieval Target VT2 only AT1 + VT2
Video R=20
202 face with introductory caption 1 003
209 women monolog 035 01
201 People entering door NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube replicates the entire Facebook image DB every ~5 days
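The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption (the slide does not state one), and the function names are only illustrative.

```python
def frames_per_day(upload_hours_per_minute, fps):
    """Video frames uploaded per day at a given upload rate and frame rate."""
    seconds_of_video_per_minute = upload_hours_per_minute * 3600
    minutes_per_day = 24 * 60
    return seconds_of_video_per_minute * fps * minutes_per_day


def days_to_match(photo_count, upload_hours_per_minute, fps):
    """Days of uploads needed for the frame count to equal photo_count."""
    return photo_count / frames_per_day(upload_hours_per_minute, fps)


# Slide figures: 20 hours of video per minute, 15 billion photos.
# fps=30 is an assumed frame rate, not a number from the slide.
days = days_to_match(15e9, 20, fps=30)
```

Under that assumption the result is about 4.8 days. At 25 frames per second it would be about 5.8 days, so the slide's estimate holds to within a day for common frame rates.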
Image / Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid - Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Pipeline diagram blocks: Video; Interest Points; Features (Motion, Shape); Vector Quantization; Histogram; Classifier; Event Label]
Event labels:
Running
Pointing
Object Put
Cell To Ear
Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
[Figure panels: Laptev, Dollar, RSMB]
More is Good
[Figure: Interest Point Detection Rates]
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\left(0,\; H_\tau(x, y, t-1) - 1\right) & \text{otherwise} \end{cases}$$
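The motion history update on this slide sets a pixel to τ wherever motion is detected (D = 1) and otherwise decays the previous value by 1, floored at 0. A minimal pure-Python sketch of one time step follows. The function name update_mhi and the nested-list image layout are illustrative assumptions; a real implementation would run per pixel on the GPU or on arrays.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """Apply one motion-history step to every pixel.

    prev_mhi    -- 2D list holding H_tau(x, y, t-1)
    motion_mask -- 2D list holding D(x, y, t), 1 where motion was detected
    tau         -- history length in frames
    """
    return [
        [tau if moving == 1 else max(0, h - 1)
         for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]


# Two steps on a 2x2 image: motion at the top-left pixel, then at the top-right.
mhi = [[0, 0], [0, 0]]
mhi = update_mhi(mhi, [[1, 0], [0, 0]], tau=3)
mhi = update_mhi(mhi, [[0, 1], [0, 0]], tau=3)
```

After the second step the older motion has decayed to 2 while the newest pixel holds τ = 3, so brighter values mark more recent motion.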
[Bar chart, milliseconds per frame (axis 0 to 300): C; CUDA (distance); CUDA (distance + argmin); CUDA (feature + distance + argmin)]
Histograms of Oriented Gradients / Optical Flow
Partition the image window into local regions
Histogram the image gradient / optical flow based on the direction and magnitude
Normalize over neighboring regions
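The three steps above (partition into regions, histogram by direction and magnitude, normalize over neighbors) can be sketched for image gradients as follows. The specific choices here are assumptions and not taken from the slides: square cells, 9 unsigned orientation bins over 0 to 180 degrees, central-difference gradients, and L2 normalization over 2x2 blocks of cells. The function names are illustrative.

```python
import math


def cell_histograms(image, cell=4, n_bins=9):
    """Magnitude-weighted orientation histogram for each cell x cell region."""
    rows, cols = len(image), len(image[0])
    n_cy, n_cx = rows // cell, cols // cell
    hists = [[[0.0] * n_bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = image[y][x + 1] - image[y][x - 1]    # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned direction
            b = min(int(angle / (180.0 / n_bins)), n_bins - 1)
            cy, cx = y // cell, x // cell
            if cy < n_cy and cx < n_cx:
                hists[cy][cx][b] += magnitude         # vote weighted by magnitude
    return hists


def normalize_blocks(hists, eps=1e-6):
    """Concatenate each 2x2 block of neighboring cells and L2-normalize it."""
    blocks = []
    for cy in range(len(hists) - 1):
        for cx in range(len(hists[0]) - 1):
            v = (hists[cy][cx] + hists[cy][cx + 1]
                 + hists[cy + 1][cx] + hists[cy + 1][cx + 1])
            norm = math.sqrt(sum(a * a for a in v) + eps)
            blocks.append([a / norm for a in v])
    return blocks
```

For the optical-flow variant the same histogramming would be applied to the flow vector at each pixel instead of the intensity gradient.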
K-Means Clustering
Vector quantization
Turns high dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
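The two alternating steps of Lloyd's algorithm can be sketched in a few lines of plain Python. This is a CPU illustration only, not the CUDA implementation discussed in the slides. The random initialization from the data, the fixed iteration count, and the rule of keeping an empty cluster's old center are assumptions made for the sketch.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, n_iters=20, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # initialize from data
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: index of the closest center for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:   # an empty cluster keeps its previous center
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, labels
```

The assignment step is where the pairwise distance computation dominates: it evaluates one distance for every (point, center) pair in every iteration.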
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
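To see why the output does not fit, note that 1 million features against 1000 centers gives 10^9 distances, which is about 4 GB if each is stored as a 4-byte float (the 4-byte size is an assumption). The sketch below shows the idea of reducing as the computation proceeds: distances are computed one chunk of points at a time and each row is immediately reduced to the index of its nearest center, so the full matrix never exists. It is a plain-Python illustration with hypothetical function names, not the CUDA kernel from the slides.

```python
def distance_matrix_bytes(n_points, n_centers, bytes_per_value=4):
    """Memory needed to hold the full point-by-center distance matrix."""
    return n_points * n_centers * bytes_per_value


def nearest_centers_chunked(points, centers, chunk_size):
    """Nearest-center index per point, holding only one chunk of distances."""
    labels = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        # Distance block of at most chunk_size x len(centers) values.
        block = [[sum((x - c) ** 2 for x, c in zip(p, center))
                  for center in centers]
                 for p in chunk]
        # Reduction: collapse each row of the block to its argmin.
        labels.extend(min(range(len(row)), key=row.__getitem__)
                      for row in block)
    return labels
```

Only the labels (one integer per point) are kept, which for 1 million points is a few megabytes rather than gigabytes.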
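Clustering Helps
Pairwise Distance Implementation
[Bar chart: C vs. CUDA implementation, axis 0 to 6000]
Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute
$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ d(a_m, b_1) & \cdots & \cdots & d(a_m, b_n) \end{bmatrix}$$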
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram labels: Shared Memory; matrices A, B, C]
Timings on TRECVid 2008 System
[Bar chart, log-scale axis in milliseconds (1 to 10000). Stages: Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Series: GPU + CPU, CPU. Bar labels as extracted: 53, 79, 23, 240, 150; 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
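The histogramming step that follows vector quantization can be sketched as below. The reading of the table's method names is an assumption: "Raw" is taken to mean plain counts of dictionary words, and "Norm" the same counts scaled to sum to 1. The slides do not define either, and the "Mt Inf" variant is not attempted here. Function names are illustrative.

```python
def raw_histogram(word_ids, dictionary_size):
    """Count how often each dictionary word (cluster index) occurs."""
    hist = [0] * dictionary_size
    for w in word_ids:
        hist[w] += 1
    return hist


def normalized_histogram(word_ids, dictionary_size):
    """Raw counts divided by their total, so the bins sum to 1."""
    hist = raw_histogram(word_ids, dictionary_size)
    total = sum(hist)
    if total == 0:
        return [0.0] * dictionary_size   # no features found in this clip
    return [h / total for h in hist]
```

Normalizing removes the dependence on how many interest points a clip happens to produce, which may be why it helps most in the low detection-rate column.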
Results (2009), with 2008 results in parentheses
Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations: Experience
Working with realistic data
Engineering, programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
Time Cost for Video Tasks
Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on t-Illiac
Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPU, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Possible Accelerations for Video
MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
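The last point, extracting all features after one load, amounts to looping over the frames once and running every extractor on each frame while it is in memory. A minimal sketch follows. The function name extract_all and the idea of passing extractors as a name-to-function dictionary are illustrative assumptions, not the team's actual code.

```python
def extract_all(frames, extractors):
    """Run every extractor on each frame during a single pass over the data.

    frames     -- any iterable that yields decoded frames (each loaded once)
    extractors -- dict mapping a feature name to a function of one frame
    """
    results = {name: [] for name in extractors}
    for frame in frames:                      # one load per frame
        for name, extract in extractors.items():
            results[name].append(extract(frame))
    return results
```

With N extractors this reads the video once instead of N times, which matters when decoding or disk access dominates the run time.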
Features for Round2-VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round2-VT2
Character Detector
Harris corner
Morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly
e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
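The query-expansion steps above can be sketched as follows. Several details are assumptions made for illustration: each evaluation item is scored by its distance to the nearest member of the expanded query (a nearest-neighbor rule), features are plain tuples, and the distance function is supplied by the caller. The slides' preprocessing step, which precomputes the evaluation-to-development matching, is folded into the distance calls here.

```python
def expanded_query_search(query_category, dev_items, eval_items, distance, top_k):
    """Rank evaluation items against all development items of one category.

    dev_items  -- list of (feature, category) pairs from the development set
    eval_items -- list of (item_id, feature) pairs from the evaluation set
    distance   -- function of two features returning a non-negative number
    """
    # Step 1: expand the query to every development item of its category.
    expanded = [feature for feature, category in dev_items
                if category == query_category]
    if not expanded:
        return []
    # Step 2: score each evaluation item by its nearest expanded-query member.
    scored = [(min(distance(feature, q) for q in expanded), item_id)
              for item_id, feature in eval_items]
    scored.sort()
    return [item_id for _, item_id in scored[:top_k]]
```

Setting top_k to 50 or 20 would correspond to the output sizes mentioned on the slide.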
Algorithms
Motivation: use a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pairwise image distance based on the patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
VT1 Performance (2 in 8)
ID   Category                                                         MAP
101  Crowd (>10 people)                                               0.8419
102  Building with sky as backdrop, clearly visible                   0.977
103  Mobile devices including handphone/PDA                           0.028
107  Person using Computer, both visible                              0.2281
109  Company Trademark including billboard, logo                      0.96
112  Closeup of hand, e.g. using mouse, writing, etc.                 0.4584
113  Business meeting (>2 people), mostly seated down, table visible  0.0644
115  Food on dishes, plates                                           0.2285
116  Face closeup occupying about 3/4 of screen, frontal or side      0.9783
117  Traffic Scene, many cars, trucks, road visible                   0.2901
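The MAP column is a mean average precision. The exact definition used by the evaluation is not given in the slides, so the sketch below is one common variant and should be read as an assumption: precision is accumulated at each rank where a relevant item appears within the cutoff, then divided by the number of relevant items (or by the number of hits found when that total is not supplied).

```python
def average_precision(relevance, cutoff=20, n_relevant=None):
    """Average precision of one ranked result list.

    relevance  -- list of 0/1 flags, best-ranked result first
    cutoff     -- only the first `cutoff` results are scored
    n_relevant -- total relevant items, if known; defaults to hits found
    """
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(relevance[:cutoff], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank      # precision at this rank
    denominator = n_relevant if n_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(ranked_lists, cutoff=20):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r, cutoff) for r in ranked_lists) / len(ranked_lists)
```

A score of 1 therefore means every returned result within the cutoff was relevant, which matches the categories reported with MAP close to 1.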
VT2 Performance (1 in 8)
ID   Category                                                           MAP
202  Talking face with introductory caption                             0.8432
206  Static or minute camera movement, people(s) walking, legs visible  0.0581
207  Large camera movement, panning left/right, top/down of a scene     0.7789
208  Movie ending credit                                                0.2782
209  Woman monologue Zhen                                               0.9756
Performance of Round3 (1 in 7)
Task 2 (VT1)
Target                                                            Estimated MAP (R=20)
101  Crowd (>10 people)                                           0.64
102  Building with sky as backdrop, clearly visible               1
107  Person using Computer, both visible                          0.7
112  Closeup of hand, e.g. using mouse, writing, etc.             0.527
116  Face closeup occupying about 3/4 of screen, frontal or side  1
Task 3 (AT1 + VT2)
Retrieval target (video, R=20)        VT2 only   AT1 + VT2
202  Face with introductory caption   1          0.03
209  Woman monologue                  0.35       0.1
201  People entering door             N/A
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid Working with ViVid
Why parallel Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days
Image Video Processing
Video Decoder
2D3D Convolution
2D3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients Optical Flow
Analysis Vector Quantization
SVM Classifier Evaluation
ViVid ndash Video Computer Vision on Graphics Processors
Download
httpgithubcommertdikmenViVid
TRECVid 2008 System
TRECVid 2009 System TRECVid 2009 System
Features Video
Motion
Shape
Classifier
Event Label
bull Running
bull Pointing
bull Object Put
bull Cell To Ear
Vector
Quantization
Histogram
Interest Points
Video Interest Point Detectors Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
Dollar
RSMB
More is Good More is Good
Interest Point Detection Rates
Video Features Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)
Motion History Images
(Bobbick amp Davis 2001)
Motion History Images
(Bobbick amp Davis 2001)
otherwise
1t)yD(xif
1)1)ty(xHmax(0
τt)y(xH
τ
τ
133
438
51
251
0 50 100 150 200 250 300
CUDA (feature + distance + argmin)
CUDA (distance + argmin)
CUDA (distance)
C
milliseconds per frame
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
K-Means Clustering K-Means Clustering
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
K-Means K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Clustering Helps Clustering Helps
Pairwise Distance Implementation Pairwise Distance Implementation
0 2000 4000 6000
CUDA
C
)bd(a)bd(a
)bd(a
)bd(a)bd(a)bd(a
nm1m
12
n11111
n1
m1
bbB
aaA
Compute Given
CPU vs GPU CPU vs GPU
Algorithmic properties that map well to GPUs
1 Independent and highly data local
computations
2Compute bound
3Little branch divergence
Pairwise Distance Computation on the GPU Pairwise Distance Computation on the GPU
Shared
Memory
A B
C
Pairwise Distance Computation Pairwise Distance Computation
A B
C
Pairwise Distance Computation Pairwise Distance Computation
A B
C
Timings on TRECVid 2008 System
[Figure: log-scale bar chart of time in milliseconds (1 to 10000) for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; series: GPU + CPU and CPU. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size | Histogramming Method | Rate of Detection: Low | Medium | High
1000 | Raw | 0.681 | 0.804 | 0.844
1000 | Norm | 0.708 | 0.799 | 0.840
1000 | Mt Inf | 0.594 | 0.804 | 0.848
500 | Raw | 0.675 | 0.792 | 0.833
500 | Norm | 0.701 | 0.791 | 0.825
500 | Mt Inf | 0.626 | 0.783 | 0.819
200 | Raw | 0.671 | 0.772 | 0.811
200 | Norm | 0.701 | 0.779 | 0.818
200 | Mt Inf | 0.614 | 0.720 | 0.756
Results (2009)
(2008 results in parentheses)
Event | True Positives | False Alarm | Miss | Min DCR
Pointing | 13 (57) | 225 (2505) | 1050 | 1.006
Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060
Person Runs | 1 (0) | 38 (314) | 106 | 0.997
Object Put | 1 (21) | 190 (2703) | 620 | 1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations: Experience
Working with realistic data
Engineering / Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10,000 classes
Features for Round2-VT1
Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.
Semantic Feature
Features for Round2-VT2
Character Detector
Harris corner
Morphological operations
Optical Flow
Lucas-Kanade on spatial intensity gradient
Gender recognition
SODA-boost based
Motion History Image
Spatial interest points
GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
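A minimal sketch of the query-expansion steps, under several assumptions that are not stated on the slide: images are represented by feature vectors, an evaluation item is scored by its squared Euclidean distance to the nearest member of the expanded query, and the preprocessing step 0 is omitted. All names are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def expanded_query_search(query, category, development, evaluation, top_k=20):
    """Rank evaluation items against a query expanded with development data.

    development: list of (feature, category) pairs
    evaluation:  list of (item_id, feature) pairs
    """
    # Step 1: expand the query with all development images of the same category.
    expanded = [query] + [f for f, c in development if c == category]

    # Step 2: score each evaluation item by its nearest expanded-query member.
    def score(feature):
        return min(squared_distance(feature, q) for q in expanded)

    ranked = sorted(evaluation, key=lambda item: score(item[1]))
    return [item_id for item_id, _ in ranked[:top_k]]
```

With `top_k=50` or `top_k=20` the function returns result lists of the sizes named in the output step.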
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
VT1 Performance (2 in 8)
Category | MAP
• 101 Crowd (>10 people) | 0.8419
• 102 Building with sky as backdrop, clearly visible | 0.977
• 103 Mobile devices including handphone/PDA | 0.028
• 107 Person using Computer, both visible | 0.2281
• 109 Company Trademark including billboard, logo | 0.96
• 112 Closeup of hand, e.g. using mouse, writing, etc. | 0.4584
• 113 Business meeting (> 2 people), mostly seated down, table visible | 0.0644
• 115 Food on dishes, plates | 0.2285
• 116 Face closeup, occupying about 3/4 of screen, frontal or side | 0.9783
• 117 Traffic Scene, many cars, trucks, road visible | 0.2901
VT2 Performance (1 in 8)
Category | MAP
• 202 Talking face with introductory caption | 0.8432
• 206 Static or minute camera movement, people(s) walking, legs visible | 0.0581
• 207 Large camera movement, panning left/right, top/down of a scene | 0.7789
• 208 Movie ending credit | 0.2782
• 209 Woman monologue Zhen | 0.9756
Performance of Round3 (1 in 7)
Task 2 (VT1)
Target | Estimated MAP (R=20)
101 Crowd (>10 people) | 0.64
102 Building with sky as backdrop, clearly visible | 1
107 Person using Computer, both visible | 0.7
112 Closeup of hand, e.g. using mouse, writing, etc. | 0.527
116 Face closeup, occupying about 3/4 of screen, frontal or side | 1
Task 3 (AT1 + VT2)
Retrieval Target | VT2 only | AT1 + VT2
Video R=20
202 face with introductory caption | 1 | 0.03
209 women monolog | 0.35 | 0.1
201 People entering door | N/A
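The tables above report MAP at a cutoff R. Assuming MAP here means mean average precision (distinct from the MAP estimation used in the GMM approach), one common way to compute it is sketched below. The exact normalization used by the competition is not given in the slides; this variant divides by the number of relevant items found within the top R.

```python
def average_precision(ranked_relevance, R=20):
    """Average precision over the top R results of one query.

    ranked_relevance: list of booleans, True where the result at that
    rank is relevant, in ranked order.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance[:R], start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(queries, R=20):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q, R) for q in queries) / len(queries)
```

Under this definition a score of 1 means every relevant item found in the top R was ranked ahead of every irrelevant one.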
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Shot Boundary Detection
Copy Detection
Video Search
Summarization
High Level Feature Extraction
Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
Surveillance Event Detection
List of Events
1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
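The data-flow ideas listed above (lazy pull, per-frame referencing, caches) can be illustrated with a small sketch. This is not ViVid's actual API; the class and method names are hypothetical, and Python's standard `functools.lru_cache` stands in for whatever caching the library uses.

```python
from functools import lru_cache

class FrameSource:
    """Stands in for a video decoder; counts how often a frame is decoded."""
    def __init__(self, decode):
        self._decode = decode
        self.decode_calls = 0

    def frame(self, index):
        self.decode_calls += 1
        return self._decode(index)

class CachedStage:
    """Computes a per-frame result only when it is asked for, then remembers it."""
    def __init__(self, upstream, function, cache_size=64):
        self._upstream = upstream
        self._function = function
        self._cached = lru_cache(maxsize=cache_size)(self._compute)

    def _compute(self, index):
        # Lazy pull: the upstream frame is requested only at this point.
        return self._function(self._upstream.frame(index))

    def frame(self, index):
        return self._cached(index)
```

Stages can be chained, since a `CachedStage` exposes the same `frame(index)` method as its upstream; a repeated request for the same frame index is served from the cache instead of being recomputed.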
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
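The ~5 day figure can be checked with a few lines of arithmetic. The frame rate is not given on the slide; 30 frames per second is assumed here.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption, not stated on the slide

# Seconds of video uploaded per day: 20 h/min * 3600 s/h * 60 min/h * 24 h/day.
video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
frames_per_day = video_seconds_per_day * FRAMES_PER_SECOND
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:.3g} frames per day, {days_to_match:.1f} days")
```

At 30 frames per second this gives about 3.1 billion frames per day, so roughly 4.8 days to reach 15 billion; at 25 frames per second it would be closer to 5.8 days.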
Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid – Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Diagram labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]
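The vector quantization and histogram stages of such a system can be sketched as follows: each local descriptor is mapped to its nearest dictionary center and the counts form the histogram handed to the classifier. The function names and the optional sum-to-one normalization are assumptions for illustration, not the system's actual code.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def bag_of_words_histogram(descriptors, centers, normalize=True):
    """Histogram of nearest-center indices for a set of local descriptors."""
    counts = [0.0] * len(centers)
    for descriptor in descriptors:
        # Vector quantization: replace the descriptor by its closest center.
        nearest = min(range(len(centers)),
                      key=lambda k: squared_distance(descriptor, centers[k]))
        counts[nearest] += 1.0
    total = sum(counts)
    if normalize and total > 0:
        counts = [c / total for c in counts]  # histogram sums to 1
    return counts
```

The histogram length equals the dictionary size, so every video segment yields a fixed-length vector regardless of how many interest points it contains.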
Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
[Figure panels: Laptev, Dollar, RSMB]
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al., 2003)
Motion History Images
(Bobick & Davis, 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\ H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
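The motion history update is a per-pixel rule: a pixel where motion is detected is set to τ, and every other pixel decays by one per frame until it reaches zero. A minimal sketch on nested lists, with hypothetical names:

```python
def update_mhi(previous, motion_mask, tau):
    """One time step of the motion history image.

    previous:    2-D list, the history image at time t-1
    motion_mask: 2-D list of 0/1 values, D(x, y, t)
    tau:         value assigned to pixels where motion is detected
    """
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(previous, motion_mask)
    ]
```

Each output pixel depends only on the same pixel in the previous history image and the current mask, so the update is independent per pixel.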
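[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251; decimal points may have been lost in extraction]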
GUFE Grand Unified Feature
Extractor
GUFE Grand Unified Feature
Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature NormalizationCombination
Result Visualization
Observations Observations 1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
1 Samples under the same category are more semantic
similar to each other
2 The shot boundaries are not well defined
3 some of the key frames are not labeled correctly
eg VT1 101 103(26-141)
Algorithms Algorithms
Input a query image and its category number
0 Preprocessing compute the matching between the evaluation and the
development data
Query Expansion
1 Expand the query image by retrieving all the images from the development
data set with the same category
2 Search the evaluation set with the expanded query
Output return the top 5020 results
Algorithms Algorithms
Motivation using a GMM to model the distribution of
patches
1 Train a UBM (Universal Background Model) based on
patches from all training images
2 MAP Estimation of the distribution of the patches
belonging to one image given UBM
3 Compute pair-wise image distance based on patch
kernel and within-class covariance normalization
3 Retrieving images based the normalized distance
VT1 Performance (2 in 8) VT1 Performance (2 in 8)
Category MAP
bull101 Crowd (gt10 people) 08419
bull102 Building with sky as backdrop clearly visible 0977
bull103 Mobile devices including handphonePDA 0028
bull107 Person using Computer both visible 02281
bull109 Company Trademark including billboard logo 096
bull112 Closeup of hand eg using mouse writing etc 04584
bull113 Business meeting (gt 2 people) mostly seated down table visible 00644
bull115 Food on dishes plates 02285
bull116 Face closeup occupying about 34 of screen frontal or side 09783
bull117 Traffic Scene many cars trucks road visible 02901
VT2 Performance(1in8) VT2 Performance(1in8)
Category MAP
bull202 Talking face with introductory caption 08432
bull206 Static or minute camera movement people(s)
walking legs visible 00581
bull207 Large camera movement panning leftright
topdown of a scene 07789
bull208 Movie ending credit 02782
bull209 Woman monologue Zhen 09756
Performance of Round3 (1in7) Performance of Round3 (1in7)
Task 2 (VT1)
Target Estimated MAP (R=20)
101 Crowd (gt10 people) 064
102 Building with sky as backdrop clearly visible 1
107 Person using Computer both visible 07
112 Closeup of hand eg using mouse writing etc 0527
116 Face closeup occupying about 34 of screen frontal or side 1
Task3 (AT1 + VT2)
Retrieval Target VT2 only AT1 + VT2
Video R=20
202 face with introductory caption 1 003
209 women monolog 035 01
201 People entering door NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
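The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20  # from the slide
FACEBOOK_PHOTOS = 15e9          # from the slide
FRAMES_PER_SECOND = 30          # assumption, not stated on the slide

# Video seconds uploaded per wall-clock minute, converted to frames.
frames_per_minute = HOURS_UPLOADED_PER_MINUTE * 3600 * FRAMES_PER_SECOND
frames_per_day = frames_per_minute * 60 * 24
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:.2e} frames/day, {days_to_match:.1f} days")
# prints "3.11e+09 frames/day, 4.8 days"
```

At 25 frames per second the same calculation gives a little under 6 days, so the estimate is only sensitive to the assumed frame rate at that level.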
Image / Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid: Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Figure: system diagram with blocks Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, and Event Label (Running, Pointing, Object Put, Cell To Ear)]
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
[Figure: example detections for Laptev, Dollar, and RSMB]
More is Good
[Figure: Interest Point Detection Rates]
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
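The motion history update sets a pixel to τ wherever motion is detected and otherwise lets the previous value decay by one per frame, floored at zero. A minimal pure-Python sketch of one update step follows; the function name `update_mhi` and the list-of-rows image layout are illustrative choices, not taken from the slides.

```python
def update_mhi(prev, motion, tau):
    """One motion-history update step.

    prev   : H_tau(., ., t-1) as a list of rows
    motion : binary mask D(., ., t), 1 where motion was detected
    tau    : number of frames for which motion is remembered
    """
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev, motion)
    ]


# A single bright pixel moving left to right across a 1x3 image.
history = [[0, 0, 0]]
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    history = update_mhi(history, mask, tau=3)

print(history)  # [[1, 2, 3]]: newer motion has larger values
```

The resulting intensity ramp encodes the direction of motion in a single image, which is what makes the history usable as a descriptor.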
[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin); extracted bar labels: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
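The three steps above can be sketched in pure Python for the image-gradient case. The cell size, bin count, central-difference gradient, unsigned orientation, and 2x2-cell L2 block normalization are all assumptions made for illustration; the slides do not specify these choices. For the optical-flow variant, the flow components (u, v) would take the place of (gx, gy).

```python
import math


def cell_histograms(image, cell=4, bins=8):
    """Orientation histograms weighted by gradient magnitude, one per cell."""
    rows, cols = len(image), len(image[0])
    n_r, n_c = rows // cell, cols // cell
    hists = [[[0.0] * bins for _ in range(n_c)] for _ in range(n_r)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation
            b = min(int(angle / math.pi * bins), bins - 1)
            r, c = y // cell, x // cell
            if r < n_r and c < n_c:
                hists[r][c][b] += mag  # vote weighted by magnitude
    return hists


def normalize_blocks(hists, eps=1e-6):
    """L2-normalize each 2x2 block of neighboring cells into one vector."""
    blocks = []
    for r in range(len(hists) - 1):
        for c in range(len(hists[0]) - 1):
            v = (hists[r][c] + hists[r][c + 1]
                 + hists[r + 1][c] + hists[r + 1][c + 1])
            norm = math.sqrt(sum(x * x for x in v) + eps)
            blocks.append([x / norm for x in v])
    return blocks


# 8x8 test image with a vertical edge between columns 3 and 4.
image = [[0.0 if x < 4 else 1.0 for x in range(8)] for _ in range(8)]
hists = cell_histograms(image)
descriptor = normalize_blocks(hists)
print(len(descriptor), len(descriptor[0]))  # 1 block of 4 cells x 8 bins
```

For the vertical edge, every vote lands in the first orientation bin, since the gradient is purely horizontal.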
K-Means Clustering
Vector quantization
Turns high dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
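The two alternating steps of Lloyd's algorithm can be sketched in pure Python as follows. The random initialization from the data, the fixed iteration count, and the handling of empty clusters are assumptions for illustration; a production system would use a GPU or library implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the closest center for every point (the quantization step)."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]


def lloyd(points, k, iters=20, seed=0):
    """Alternate assignment and mean update for a fixed number of rounds."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = lloyd(points, k=2)
print(labels)  # the first three points share one label, the last three the other
```

Once the centers are trained, `assign` alone is the vector quantizer: it maps each high dimensional feature to the index of its nearest center.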
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
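With 1 million features and 1000 centers, the full distance matrix has 10^9 entries, about 4 GB in single precision, which is why the slide asks for a reduction as the computation proceeds. The following pure-Python sketch shows the idea on the CPU: distances are produced one block of centers at a time and immediately reduced to a running argmin. The function name and the chunk size are illustrative assumptions, and this is not the CUDA kernel described in the slides.

```python
def nearest_center_chunked(features, centers, chunk=2):
    """Nearest-center index per feature, keeping only a running minimum.

    Distances are produced one block of centers at a time and reduced
    immediately, so the full len(features) x len(centers) matrix is
    never stored.
    """
    best_dist = [float("inf")] * len(features)
    best_idx = [-1] * len(features)
    for start in range(0, len(centers), chunk):
        block = centers[start:start + chunk]
        for i, f in enumerate(features):
            for j, c in enumerate(block):
                d = sum((x - y) ** 2 for x, y in zip(f, c))
                if d < best_dist[i]:  # argmin reduction
                    best_dist[i], best_idx[i] = d, start + j
    return best_idx


features = [(0.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
centers = [(0.0, 1.0), (4.0, 4.0), (10.0, 10.0)]
print(nearest_center_chunked(features, centers))  # [0, 1, 2]
```

The working memory is two values per feature regardless of how many centers there are, at the cost of fusing the distance and argmin steps.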
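Clustering Helps
Pairwise Distance Implementation
[Figure: bar chart comparing the CUDA and C implementations; axis 0 to 6000]
Given
\[
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
\]
Compute
\[
\begin{pmatrix}
d(a_1, b_1) & \cdots & d(a_1, b_n) \\
\vdots & \ddots & \vdots \\
d(a_m, b_1) & \cdots & d(a_m, b_n)
\end{pmatrix}
\]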
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Figure: diagram of the GPU pairwise distance computation with labels Shared Memory, A, B, C]
Timings on TRECVid 2008 System
[Figure: bar chart, log scale from 1 to 10000 milliseconds, comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; extracted bar labels: 53 79, 23, 240 150, 53 79, 3030 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low      Medium   High
1000              Raw                    0.681    0.804    0.844
                  Norm                   0.708    0.799    0.840
                  Mt Inf                 0.594    0.804    0.848
500               Raw                    0.675    0.792    0.833
                  Norm                   0.701    0.791    0.825
                  Mt Inf                 0.626    0.783    0.819
200               Raw                    0.671    0.772    0.811
                  Norm                   0.701    0.779    0.818
                  Mt Inf                 0.614    0.720    0.756
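The slides do not define the histogramming methods. As a rough illustration only, the sketch below assumes that "Raw" means plain codeword counts and "Norm" means the same counts scaled to sum to one; the "Mt Inf" variant is not covered. The function name and arguments are hypothetical.

```python
from collections import Counter


def codeword_histogram(assignments, dictionary_size, normalize=False):
    """Histogram of codeword indices for one video segment.

    assignments     : nearest-center index of each descriptor
    dictionary_size : number of centers in the dictionary
    normalize       : if True, scale the counts to sum to 1 (assumed "Norm")
    """
    counts = Counter(assignments)
    hist = [float(counts.get(k, 0)) for k in range(dictionary_size)]
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [h / total for h in hist]
    return hist


assignments = [0, 2, 2, 3]
print(codeword_histogram(assignments, 4))                  # [1.0, 0.0, 2.0, 1.0]
print(codeword_histogram(assignments, 4, normalize=True))  # [0.25, 0.0, 0.5, 0.25]
```

Normalizing removes the dependence on how many interest points were detected, which may matter when detection rates vary as in the Low, Medium, and High columns.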
Results (2009)
(2008 results in parentheses)
Event          True Positives   False Alarms   Misses   Min DCR
Pointing       13 (57)          225 (2505)     1050     1.006
Cell To Ear    0 (8)            58 (4005)      194      1.060
Person Runs    1 (0)            38 (314)       106      0.997
Object Put     1 (21)           190 (2703)     620      1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering / Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet Large Scale Visual Recognition
10000 Classes
Observations
1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103 (26-141)
Algorithms
Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
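The query expansion steps can be sketched as follows. The slides do not say how the expanded query is scored against the evaluation set, so ranking each item by its distance to the closest member of the expanded query is an assumption, as are the function name, the data layout, and the caller-supplied `distance` function. The preprocessing step (precomputed matching) is not modeled.

```python
def expanded_search(query, category, development, evaluation, distance, top_k=20):
    """Rank evaluation items against a query expanded with same-category data.

    development : list of (feature, category) pairs with known labels
    evaluation  : list of (item_id, feature) pairs to be searched
    distance    : function taking two features and returning a number
    """
    # Step 1: expand the query with all development items of its category.
    expanded = [query] + [f for f, c in development if c == category]
    # Step 2: score each evaluation item by its closest expanded-query member.
    scored = []
    for item_id, feature in evaluation:
        score = min(distance(q, feature) for q in expanded)
        scored.append((score, item_id))
    scored.sort()
    return [item_id for _, item_id in scored[:top_k]]


# Toy example with one-dimensional features.
development = [(5.0, 101), (9.0, 102)]
evaluation = [("a", 0.2), ("b", 5.1), ("c", 9.0), ("d", 3.0)]
result = expanded_search(0.0, 101, development, evaluation,
                         distance=lambda p, q: abs(p - q), top_k=2)
print(result)  # ['b', 'a']
```

Item "b" is far from the original query but close to a development image of the same category, which is exactly the kind of match the expansion is meant to recover.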
Algorithms
Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
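For step 2, one common form of MAP adaptation, familiar from GMM-UBM speaker verification, adapts only the component means. The slides do not say which parameters are adapted or give a relevance factor, so the mean-only choice and the symbol r below are assumptions. With UBM weights w_k, means μ_k, and covariances Σ_k, and patches x_1, ..., x_T from one image:

```latex
\gamma_k(x_t) = \frac{w_k \,\mathcal{N}(x_t;\, \mu_k, \Sigma_k)}
                     {\sum_j w_j \,\mathcal{N}(x_t;\, \mu_j, \Sigma_j)},
\qquad
n_k = \sum_{t=1}^{T} \gamma_k(x_t),
\qquad
E_k = \frac{1}{n_k} \sum_{t=1}^{T} \gamma_k(x_t)\, x_t

\hat{\mu}_k = \alpha_k E_k + (1 - \alpha_k)\, \mu_k,
\qquad
\alpha_k = \frac{n_k}{n_k + r}
```

Components that see many patches from the image move toward the image's own statistics, while rarely used components stay close to the UBM.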
VT1 Performance (2 in 8)
Category                                                                    MAP
101 Crowd (>10 people)                                                      0.8419
102 Building with sky as backdrop, clearly visible                          0.977
103 Mobile devices including handphone/PDA                                  0.028
107 Person using Computer, both visible                                     0.2281
109 Company Trademark including billboard, logo                             0.96
112 Closeup of hand, e.g. using mouse, writing, etc.                        0.4584
113 Business meeting (>2 people), mostly seated down, table visible         0.0644
115 Food on dishes, plates                                                  0.2285
116 Face closeup occupying about 3/4 of screen, frontal or side             0.9783
117 Traffic Scene, many cars, trucks, road visible                          0.2901
VT2 Performance (1 in 8)
Category                                                                    MAP
202 Talking face with introductory caption                                  0.8432
206 Static or minute camera movement, people(s) walking, legs visible       0.0581
207 Large camera movement, panning left/right, top/down of a scene          0.7789
208 Movie ending credit                                                     0.2782
209 Woman monologue Zhen                                                    0.9756
Performance of Round 3 (1 in 7)
Task 2 (VT1)
Target                                                                      Estimated MAP (R=20)
101 Crowd (>10 people)                                                      0.64
102 Building with sky as backdrop, clearly visible                          1
107 Person using Computer, both visible                                     0.7
112 Closeup of hand, e.g. using mouse, writing, etc.                        0.527
116 Face closeup occupying about 3/4 of screen, frontal or side             1
Task 3 (AT1 + VT2)
Retrieval Target (Video, R=20)          VT2 only    AT1 + VT2
202 Face with introductory caption      1           0.03
209 Woman monologue                     0.35        0.1
201 People entering door                NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
List of Events
1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
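The data-flow bullets (lazy pull, per-frame referencing, caches) can be illustrated with a small sketch. This is not ViVid's API; the class names `Source` and `CachedStage` are hypothetical and frames are stand-in numbers. Each stage computes frame n only when a downstream stage asks for it, and remembers the result.

```python
class Source:
    """Hypothetical frame source: frame n is just the number n."""

    def __init__(self):
        self.reads = 0  # counts how many frames were actually decoded

    def get_frame(self, n):
        self.reads += 1
        return float(n)


class CachedStage:
    """Pipeline stage that pulls frames from its parent only on demand."""

    def __init__(self, parent, func, span=1):
        self.parent = parent
        self.func = func
        self.span = span   # number of consecutive parent frames func needs
        self.cache = {}    # per-frame results, keyed by frame index

    def get_frame(self, n):
        if n not in self.cache:
            inputs = [self.parent.get_frame(n + k) for k in range(self.span)]
            self.cache[n] = self.func(inputs)
        return self.cache[n]


source = Source()
scaled = CachedStage(source, lambda frames: frames[0] * 0.5)
diff = CachedStage(scaled, lambda frames: frames[1] - frames[0], span=2)
value = diff.get_frame(10)  # pulls only frames 10 and 11 from the source
```

Asking for frame 11 next reuses the cached scaled frame 11 and reads only frame 12 from the source. A real system would also bound the cache size, which this sketch omits.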
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
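The "~5 days" figure can be checked with a few lines of arithmetic. The slides do not state a frame rate, so the 30 frames per second used here is an assumption; the function name is hypothetical.

```python
def days_to_match(photo_count, upload_hours_per_minute, fps):
    """Days of video uploads needed for the frame count to reach photo_count."""
    video_hours_per_day = upload_hours_per_minute * 60 * 24
    frames_per_day = video_hours_per_day * 3600 * fps
    return photo_count / frames_per_day


# 15 billion photos, 20 hours of video per minute, assumed 30 fps
days_at_30_fps = days_to_match(15e9, 20, 30)
# Same figures at an assumed 25 fps
days_at_25_fps = days_to_match(15e9, 20, 25)
```

At 30 fps this gives about 4.8 days, and at 25 fps about 5.8 days, so the slide's estimate holds for either common frame rate.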
Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid – Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Diagram: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label]
Event labels:
• Running
• Pointing
• Object Put
• Cell To Ear
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
[Figure panels: Laptev, Dollar, RSMB]
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al., 2003)
Motion History Images (Bobick & Davis, 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
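The motion history update sets a pixel to τ wherever the motion mask D is 1 and otherwise lets the previous value decay by one, never going below zero. A minimal pure-Python sketch of a single update step follows; the function name and the list-of-lists image layout are assumptions for illustration.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One motion history step: tau where motion is detected, else decay by 1 (floor 0)."""
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]


# A one-row image in which motion sweeps from left to right over three frames
mhi = [[0, 0, 0]]
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
for mask in masks:
    mhi = update_mhi(mhi, mask, tau=3)
# mhi is now [[1, 2, 3]]: the most recent motion has the largest value
```

Every pixel is updated independently of its neighbors, which is the kind of local operation the slides describe as mapping well to GPUs.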
[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin); bar labels as extracted: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
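The three steps above can be sketched for a single local region. This is a simplified illustration, not the implementation used in the system: it assumes central-difference gradients, 9 unsigned orientation bins, magnitude-weighted votes, and L2 normalization over the concatenated histograms of neighboring regions. All names are hypothetical.

```python
import math


def gradient_orientation_histogram(cell, bins=9):
    """Histogram of unsigned gradient directions in one region, weighted by magnitude."""
    hist = [0.0] * bins
    rows, cols = len(cell), len(cell[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi  # unsigned direction in [0, pi)
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += magnitude
    return hist


def l2_normalize(block, eps=1e-6):
    """Normalize the concatenated histograms of neighboring regions."""
    norm = math.sqrt(sum(v * v for v in block) + eps)
    return [v / norm for v in block]
```

For the optical-flow variant, the per-pixel pair (gx, gy) would be replaced by the flow vector at that pixel; the binning and normalization stay the same.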
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative “centers”
Lloyd’s algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
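The two steps of Lloyd's algorithm translate directly into code. The following is a minimal CPU sketch in pure Python, assuming squared Euclidean distance, caller-supplied initial centers, and a fixed iteration count; the function names are hypothetical.

```python
def squared_distance(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def lloyd_kmeans(points, centers, iterations=10):
    """Alternate the assignment and update steps for a fixed number of iterations."""
    centers = [list(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for each data point
        assignment = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment


points = [[0, 0], [0, 2], [10, 0], [10, 2]]
centers, assignment = lloyd_kmeans(points, [[0, 0], [10, 0]])
```

The assignment step is a pairwise distance computation followed by an argmin, which is why the later slides concentrate on making that operation fast.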
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
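To see why the output does not fit, note that 1 million features against 1000 centers yields 10^9 distances, or about 4 GB if each is stored as a 4-byte float (the storage width is an assumption). Reducing as the computation proceeds means keeping only the argmin of each row. A CPU sketch of that idea, with hypothetical names:

```python
def output_bytes(num_features, num_centers, bytes_per_value=4):
    """Size of the full distance matrix, assuming 4-byte values by default."""
    return num_features * num_centers * bytes_per_value


def nearest_centers_chunked(features, centers, chunk_size):
    """Find each feature's nearest center while holding one chunk of distances at a time."""
    nearest = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance block for this chunk only: len(chunk) x len(centers)
        block = [
            [sum((f - c) ** 2 for f, c in zip(feat, cen)) for cen in centers]
            for feat in chunk
        ]
        # Reduction: keep just the index of the minimum in each row
        nearest.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return nearest
```

Only one block of distances is alive at any time, so memory use is set by `chunk_size` rather than by the total number of features.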
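Clustering Helps
Pairwise Distance Implementation
[Figure: bar chart comparing the C and CUDA implementations; axis ticks 0, 2000, 4000, 6000]
Given
$$A = [a_1, \ldots, a_m], \qquad B = [b_1, \ldots, b_n]$$
Compute
$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & \ddots & & \vdots \\ \vdots & & & \\ d(a_m, b_1) & \cdots & & d(a_m, b_n) \end{bmatrix}$$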
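As a reference point, the pairwise distance problem (given vectors a_1..a_m and b_1..b_n, compute every d(a_i, b_j)) can be written in a few lines. The slides do not say which distance d was used; Euclidean distance is assumed here, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distance(A, B, d=euclidean):
    """Return the m x n matrix D with D[i][j] = d(A[i], B[j])."""
    return [[d(a, b) for b in B] for a in A]


A = [[0, 0], [3, 4]]
B = [[0, 0], [3, 0], [0, 4]]
D = pairwise_distance(A, B)
# D == [[0.0, 3.0, 4.0], [5.0, 4.0, 3.0]]
```

Each entry depends only on one row of A and one row of B, so all m x n entries can be computed independently.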
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram: input matrices A and B, output matrix C, and shared memory]
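The diagrams show only A, B, C, and shared memory, so the exact kernel is not recoverable from the slides. The sketch below assumes the scheme follows the usual tiled matrix-multiply pattern: each output tile of C is accumulated from small tiles of A and B that a GPU thread block would first copy into shared memory. It is a CPU emulation in Python for illustration, using squared Euclidean distance; the function name and tile size are hypothetical.

```python
def pairwise_sq_distance_tiled(A, B, tile=2):
    """Squared Euclidean distances, accumulated tile by tile along the feature axis."""
    m, n, dims = len(A), len(B), len(A[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):              # one "thread block" per output tile
        for j0 in range(0, n, tile):
            for k0 in range(0, dims, tile):   # walk along the feature dimension
                # These two small tiles stand in for shared-memory buffers
                a_tile = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[k0:k0 + tile] for row in B[j0:j0 + tile]]
                for i, a in enumerate(a_tile):
                    for j, b in enumerate(b_tile):
                        C[i0 + i][j0 + j] += sum((x - y) ** 2 for x, y in zip(a, b))
    return C


C = pairwise_sq_distance_tiled([[0, 0, 0], [1, 2, 3], [4, 4, 4]], [[0, 0, 0], [1, 1, 1]])
```

The point of the tiling is reuse: each loaded tile of A is used against every row of the B tile, and vice versa, which reduces traffic to slower global memory.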
Timings on TRECVid 2008 System
[Figure: bar chart on a logarithmic axis (1 to 10000 milliseconds) comparing GPU + CPU with CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; bar labels as extracted: 53 79, 23, 240 150, 53 79, 3030 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low      Medium   High
1000              Raw                    0.681    0.804    0.844
                  Norm                   0.708    0.799    0.840
                  Mt Inf                 0.594    0.804    0.848
500               Raw                    0.675    0.792    0.833
                  Norm                   0.701    0.791    0.825
                  Mt Inf                 0.626    0.783    0.819
200               Raw                    0.671    0.772    0.811
                  Norm                   0.701    0.779    0.818
                  Mt Inf                 0.614    0.720    0.756
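The Raw and Norm rows refer to how the histogram of quantized features is built. The slides do not define the normalization, so the sketch below assumes "Raw" means plain counts of dictionary entries and "Norm" means dividing those counts by their sum; the "Mt Inf" variant is not sketched because the slides give no detail on it. The function name is hypothetical.

```python
def codeword_histogram(codeword_ids, dictionary_size, normalize=False):
    """Count how often each dictionary entry occurs; optionally scale counts to sum to 1."""
    hist = [0.0] * dictionary_size
    for idx in codeword_ids:
        hist[idx] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [count / total for count in hist]
    return hist


raw = codeword_histogram([0, 2, 2, 3], dictionary_size=4)
norm = codeword_histogram([0, 2, 2, 3], dictionary_size=4, normalize=True)
# raw  == [1.0, 0.0, 2.0, 1.0]
# norm == [0.25, 0.0, 0.5, 0.25]
```

Normalizing makes histograms comparable between clips that produce different numbers of interest points.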
Results (2009), with 2008 results in parentheses
Event          True Positives   False Alarm    Miss    Min DCR
Pointing       13 (57)          225 (2505)     1050    1.006
Cell To Ear    0 (8)            58 (4005)      194     1.060
Person Runs    1 (0)            38 (314)       106     0.997
Object Put     1 (21)           190 (2703)     620     1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering, Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
Performance of Round3 (1in7) Performance of Round3 (1in7)
Task 2 (VT1)
Target Estimated MAP (R=20)
101 Crowd (gt10 people) 064
102 Building with sky as backdrop clearly visible 1
107 Person using Computer both visible 07
112 Closeup of hand eg using mouse writing etc 0527
116 Face closeup occupying about 34 of screen frontal or side 1
Task3 (AT1 + VT2)
Retrieval Target VT2 only AT1 + VT2
Video R=20
202 face with introductory caption 1 003
209 women monolog 035 01
201 People entering door NA
We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall
TRECVid
TREC Text REtrieval Conference TRECVid Video Retrieval Workshop
Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million
Surveillance Event Detection
List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator
Regional Averaging
Door OpenClose Information
Event Detections
Thresholding Rule
Detection of Opposing Flow Event
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Vision Video Library (ViVid) Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Research tool
Rapid development
Fast execution
Python glue layer
CUDA CC++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
Motivations Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid Working with ViVid
Why parallel Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days
Image Video Processing
Video Decoder
2D3D Convolution
2D3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients Optical Flow
Analysis Vector Quantization
SVM Classifier Evaluation
ViVid ndash Video Computer Vision on Graphics Processors
Download
httpgithubcommertdikmenViVid
TRECVid 2008 System
TRECVid 2009 System TRECVid 2009 System
Features Video
Motion
Shape
Classifier
Event Label
bull Running
bull Pointing
bull Object Put
bull Cell To Ear
Vector
Quantization
Histogram
Interest Points
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (corners)
Dollar: Space-Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)
[Figure: example detections for Laptev, Dollar, and RSMB]
More is Good
Interest Point Detection Rates
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
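The motion history update can be sketched in a few lines. This is a minimal illustration, not the ViVid implementation: it assumes D is a binary motion mask (1 where motion is detected), uses plain Python lists to stay dependency-free, and the name `update_mhi` is hypothetical.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One motion-history step: tau where motion is seen, else decay by 1 (floored at 0)."""
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]

# Toy 1x3 "image": motion sweeps left to right over three frames.
tau = 3
mhi = [[0, 0, 0]]
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
for mask in masks:
    mhi = update_mhi(mhi, mask, tau)
print(mhi)
```

The most recent motion holds the value tau and older motion fades toward zero, so a single image encodes both where and how recently motion occurred.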
[Bar chart: milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251.]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
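The three steps above can be sketched for image gradients as follows. This is a simplified illustration under stated assumptions, not the descriptor used in the TRECVid system: the cell size, bin count, unsigned orientations, central-difference gradients, and the 3x3 cell neighborhood used for normalization are all choices made here, and both function names are hypothetical. The same binning would apply to optical flow vectors in place of gradients.

```python
import math

def hog_cells(img, cell=4, bins=8):
    """Magnitude-weighted orientation histograms, one per cell x cell region."""
    h, w = len(img), len(img[0])
    rows, cols = h // cell, w // cell
    hist = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # Central differences, clamped at the image border.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi      # unsigned direction in [0, pi)
            b = min(int(angle / math.pi * bins), bins - 1)
            hist[y // cell][x // cell][b] += mag
    return hist

def normalize_over_neighbors(hist, eps=1e-6):
    """Divide each cell histogram by the L2 norm of its 3x3 cell neighborhood."""
    rows, cols = len(hist), len(hist[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            energy = sum(
                v * v
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                for v in hist[rr][cc]
            )
            norm = math.sqrt(energy) + eps
            out_row.append([v / norm for v in hist[r][c]])
        out.append(out_row)
    return out

# Toy input: a vertical edge, so all gradient energy is horizontal (bin 0).
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
cells = normalize_over_neighbors(hog_cells(image))
```

Every pixel votes independently of the others, which is the locality property that makes this kind of descriptor a good fit for a GPU.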
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
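The two alternating steps of Lloyd's algorithm can be sketched as below. This is a small dependency-free illustration on made-up 2D points; the random initialization from the data, the fixed iteration count, and the empty-cluster handling are assumptions, and `lloyd_kmeans` is a hypothetical name.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd_kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # initialize from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: closest center for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center becomes the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Made-up data: two well-separated groups of three points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = lloyd_kmeans(data, k=2)
```

The assignment step is where nearly all the time goes, since it needs the distance from every point to every center.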
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
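One way to read "reduce as computation proceeds" is to never store the full point-by-center distance table and keep only a running minimum per point. The sketch below illustrates that idea in plain Python; it is not the CUDA kernel from the talk, the chunk size and function name are hypothetical, and the 4 GB figure assumes 4-byte floats.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_center_chunked(points, centers, chunk=2):
    """Assign each point to its closest center without storing all distances.

    Centers are visited `chunk` at a time; only the running best distance and
    index per point are kept, so memory is O(len(points)), not O(points * centers).
    """
    best_dist = [float("inf")] * len(points)
    best_idx = [-1] * len(points)
    for start in range(0, len(centers), chunk):
        block = centers[start:start + chunk]
        for i, p in enumerate(points):
            for offset, c in enumerate(block):
                d = sq_dist(p, c)
                if d < best_dist[i]:          # min-reduction as we go
                    best_dist[i], best_idx[i] = d, start + offset
    return best_idx

# Size of the full distance table from the slide, ASSUMING 4-byte floats.
full_table_gb = 1_000_000 * 1000 * 4 / 1e9

points = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
centers = [(0.0, 1.0), (5.0, 4.0), (10.0, 0.0), (4.0, 4.0), (0.0, 0.1)]
assignment = nearest_center_chunked(points, centers)
```

With 1 million features and 1000 centers the full table would hold a billion entries, about 4 GB under the float assumption, whereas the running argmin needs only two values per point.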
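Clustering Helps
Pairwise Distance Implementation
[Bar chart: C vs CUDA, axis 0 to 6000]
Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute
$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & & & \vdots \\ \vdots & & \ddots & \\ d(a_m, b_1) & \cdots & & d(a_m, b_n) \end{bmatrix}$$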
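A minimal sketch of the m x n distance table follows. The slides do not say which distance d is, so squared Euclidean distance is assumed here, and `pairwise_sq_dist` is a hypothetical name. The expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b is used because it turns most of the work into dot products, which is the part that maps onto a matrix multiply.

```python
def pairwise_sq_dist(A, B):
    """m x n table of squared Euclidean distances between rows of A and rows of B."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(x * x for x in b) for b in B]
    return [
        [an + bn - 2.0 * sum(x * y for x, y in zip(a, b)) for b, bn in zip(B, b_norms)]
        for a, an in zip(A, a_norms)
    ]

A = [(0.0, 0.0), (1.0, 2.0)]
B = [(0.0, 0.0), (3.0, 4.0), (1.0, 2.0)]
table = pairwise_sq_dist(A, B)
print(table)
```

Each entry depends only on one row of A and one row of B, so all m x n entries can be computed independently.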
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram sequence (3 slides): matrices A, B, and C; shared memory]
Timings on TRECVid 2008 System
[Bar chart, log-scale axis from 1 to 10000 milliseconds, series "GPU + CPU" and "CPU". Stages: Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Bar labels as extracted: 53 79; 23; 240 150; 53 79; 3030 4947.]
Benchmark
Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium  High
1000              Raw                    0.681   0.804   0.844
1000              Norm                   0.708   0.799   0.840
1000              Mt Inf                 0.594   0.804   0.848
500               Raw                    0.675   0.792   0.833
500               Norm                   0.701   0.791   0.825
500               Mt Inf                 0.626   0.783   0.819
200               Raw                    0.671   0.772   0.811
200               Norm                   0.701   0.779   0.818
200               Mt Inf                 0.614   0.720   0.756
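For orientation, the difference between a raw and a normalized codeword histogram can be sketched as below. This is an illustration only: the slides do not say which normalization "Norm" refers to, so dividing by the total count (L1) is an assumption, the codeword indices are made up, and the mutual-information variant is not shown.

```python
from collections import Counter

def codeword_histogram(assignments, dictionary_size, normalize=False):
    """Count how often each dictionary entry was the nearest center for a clip."""
    counts = Counter(assignments)
    hist = [float(counts.get(k, 0)) for k in range(dictionary_size)]
    if normalize:
        total = sum(hist)
        if total > 0:                      # ASSUMPTION: L1 normalization
            hist = [v / total for v in hist]
    return hist

clip_assignments = [0, 2, 2, 3, 2, 0, 3, 2]   # made-up codeword indices
raw = codeword_histogram(clip_assignments, dictionary_size=4)
norm = codeword_histogram(clip_assignments, dictionary_size=4, normalize=True)
```

The raw histogram grows with the number of interest points in a clip, while the normalized one describes only their relative distribution over the dictionary.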
Results (2009)
(2008 results in parentheses)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering Programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
TRECVid
TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our Task
Surveillance Event Detection
The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million
List of Events
1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator
Detection of Opposing Flow Event
[Diagram labels: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]
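The labels on this slide suggest a pipeline of regional averaging followed by a thresholding rule, gated by door state. The sketch below is one plausible reading of that, not the system described in the talk: the input layout (per-frame grids of horizontal optical flow), the sign convention (negative means against the expected direction), the region, the threshold, and the function name are all assumptions.

```python
def opposing_flow_frames(flow_x_frames, region, door_open, threshold):
    """Flag frames whose mean horizontal flow inside `region` opposes the expected direction.

    flow_x_frames: list of 2D grids of horizontal flow (hypothetical layout)
    region: ((row_start, row_end), (col_start, col_end)) around the doorway
    door_open: one boolean per frame; detections are only allowed while the door is open
    """
    (r0, r1), (c0, c1) = region
    flagged = []
    for t, (flow_x, is_open) in enumerate(zip(flow_x_frames, door_open)):
        values = [v for row in flow_x[r0:r1] for v in row[c0:c1]]
        mean_flow = sum(values) / len(values)          # regional averaging
        if is_open and mean_flow < -threshold:         # thresholding rule
            flagged.append(t)
    return flagged

# Toy data: frame 0 moves with the expected flow, frames 1 and 2 move against it,
# but the door is closed in frame 2.
frames = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
detections = opposing_flow_frames(
    frames, region=((0, 2), (0, 2)), door_open=[True, True, False], threshold=0.5
)
```

Averaging over a region suppresses per-pixel flow noise, so a single threshold on the regional mean is enough for this toy case.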
Vision Video Library (ViVid): Utilizing GPUs in Computer Vision
Research tool
Rapid development
Fast execution
Python glue layer
CUDA C/C++
Integrates Libraries
Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
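The "lazy pull" data flow with per-frame referencing and caching can be illustrated with a small sketch. This is not ViVid's actual API: the class names `Source` and `CachedStage` and the `get(index)` method are hypothetical, and the frames are plain integers standing in for images.

```python
class Source:
    """Hypothetical frame source; stands in for a video decoder."""
    def __init__(self, frames):
        self.frames = frames
        self.reads = 0

    def get(self, index):
        self.reads += 1
        return self.frames[index]

class CachedStage:
    """Computes `fn(upstream frame)` only when asked, and remembers the result."""
    def __init__(self, upstream, fn):
        self.upstream, self.fn, self.cache = upstream, fn, {}

    def get(self, index):
        if index not in self.cache:                       # lazy pull on a miss
            self.cache[index] = self.fn(self.upstream.get(index))
        return self.cache[index]

source = Source(frames=[1, 2, 3, 4])
doubled = CachedStage(source, lambda frame: frame * 2)
shifted = CachedStage(doubled, lambda frame: frame + 1)

first = shifted.get(2)    # pulls frame 2 through both stages
second = shifted.get(2)   # served from cache, no new decode
```

Nothing is computed until the last stage is asked for a frame, and a repeated request for the same frame index never reaches the decoder again.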
Motivations
Most operations are highly local
Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing
Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision
Working with ViVid
Why parallel?
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
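The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count are the ones quoted above; the frame rate of 30 frames per second is an assumption that does not appear on the slide.

```python
# Figures quoted on the slide.
hours_uploaded_per_minute = 20
facebook_photos = 15e9

# Assumption (not on the slide): 30 frames per second.
frames_per_second = 30

frames_per_minute = hours_uploaded_per_minute * 3600 * frames_per_second
frames_per_day = frames_per_minute * 60 * 24
days_to_match_facebook = facebook_photos / frames_per_day

print(f"{frames_per_day:.3g} frames per day")
print(f"{days_to_match_facebook:.1f} days to reach 15 billion frames")
```

Under that assumption the result is a little under five days, consistent with the estimate above; a lower frame rate would lengthen it proportionally.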
Image / Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow
Analysis
Vector Quantization
SVM Classifier Evaluation
ViVid - Video Computer Vision on Graphics Processors
Download
http://github.com/mertdikmen/ViVid
TRECVid 2008 System
TRECVid 2009 System
[Diagram blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label]
Event labels
• Running
• Pointing
• Object Put
• Cell To Ear
Video Interest Point Detectors
Laptev: 3D Harris Corner Detector (Corners)
Dollar: Space Time Gabor (Corners, Periodic Motion)
RSMB: Random Sampling of the Motion Boundary (Motion)
[Figure panels: Laptev, Dollar, RSMB]
More is Good
[Figure: Interest Point Detection Rates]
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images
(Bobick & Davis 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\left(0,\ H_\tau(x, y, t-1) - 1\right) & \text{otherwise} \end{cases}$$
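The motion history image update (set a pixel to τ wherever the binary motion mask D is 1, otherwise decay the previous value by 1 with a floor of 0) can be sketched in a few lines of Python. The function name `update_mhi` and the list-of-lists image layout are illustrative choices, not taken from the slides.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One motion-history step: tau where motion is present, else decay by 1 down to 0."""
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]


tau = 3
mhi = [[0, 0, 0]]  # a single-row image for illustration
masks = [
    [[1, 0, 0]],   # motion at the left pixel
    [[0, 1, 0]],   # then the middle pixel
    [[0, 0, 1]],   # then the right pixel
]
for mask in masks:
    mhi = update_mhi(mhi, mask, tau)
```

After the three steps the most recently moving pixel holds τ and older motion has decayed, so the image encodes both where and how recently motion occurred.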
[Bar chart: milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin) and CUDA (feature + distance + argmin); bar labels as extracted: 133, 438, 51, 251]
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the Image Gradient/Optical Flow based on the direction and magnitude
• Normalize over neighboring regions
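The three steps above (partition into regions, histogram by direction and magnitude, normalize over neighbors) can be sketched as follows for the image-gradient case. This is a simplified illustration, not the descriptor used in the lecture's system: the cell size, the 9 unsigned orientation bins, hard bin assignment, and 2x2-cell L2 block normalization are all assumed parameters.

```python
import math


def cell_histograms(image, cell=4, bins=9):
    """Magnitude-weighted orientation histograms, one per cell x cell region."""
    rows, cols = len(image), len(image[0])
    n_cy, n_cx = rows // cell, cols // cell
    hists = [[[0.0] * bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            b = min(int(angle / (180.0 / bins)), bins - 1)
            cy, cx = y // cell, x // cell
            if cy < n_cy and cx < n_cx:
                hists[cy][cx][b] += magnitude
    return hists


def normalize_blocks(hists, eps=1e-6):
    """Concatenate each 2x2 block of cell histograms and L2-normalize it."""
    blocks = []
    for cy in range(len(hists) - 1):
        for cx in range(len(hists[0]) - 1):
            v = (hists[cy][cx] + hists[cy][cx + 1]
                 + hists[cy + 1][cx] + hists[cy + 1][cx + 1])
            norm = math.sqrt(sum(u * u for u in v) + eps)
            blocks.append([u / norm for u in v])
    return blocks


# Toy 8x8 image with a vertical edge: every nonzero gradient points horizontally.
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
hists = cell_histograms(image)
descriptor = normalize_blocks(hists)
```

For optical flow the same binning would apply to the flow vector at each pixel instead of the intensity gradient.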
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update the center to be the mean of the associated data points
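The two alternating steps of Lloyd's algorithm can be written compactly in plain Python. This is a minimal sketch under stated assumptions: squared Euclidean distance, initialization by sampling k data points, a fixed iteration count, and the names `lloyd_kmeans` and `squared_distance` are illustrative rather than anything from the lecture's code.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations; returns (centers, assignment of each point)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # initialize from the data
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = lloyd_kmeans(points, k=2)
```

The assignment step is where almost all of the time goes, since it needs the distance from every point to every center.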
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
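To see why the output does not fit, note that 1 million features against 1000 centers gives 10^9 distances, which is about 4 GB if each is stored as a 4-byte float (the float size is an assumption). The sketch below shows the idea of reducing as the computation proceeds: distances are produced one batch at a time and each row is immediately collapsed to the index of its nearest center. It is a CPU illustration in Python with a hypothetical function name, not the CUDA kernel from the lecture.

```python
def nearest_center_indices(features, centers, batch_size=2):
    """Assign each feature to its nearest center, one batch at a time.

    Only one batch of distances exists at any moment; each row is reduced
    to a single index (argmin) before the next batch is processed.
    """
    labels = []
    for start in range(0, len(features), batch_size):
        batch = features[start:start + batch_size]
        distances = [
            [sum((f - c) ** 2 for f, c in zip(feat, cen)) for cen in centers]
            for feat in batch
        ]  # batch_size x len(centers), discarded after the reduction
        labels.extend(row.index(min(row)) for row in distances)
    return labels


# Size of the full distance matrix for the slide's numbers, assuming 4-byte floats.
full_matrix_bytes = 1_000_000 * 1000 * 4

features = [(0.0, 0.0), (0.9, 1.1), (5.0, 5.0), (4.0, 6.0), (0.2, 0.1)]
centers = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
labels = nearest_center_indices(features, centers)
```

With the reduction fused in, the stored output shrinks from one value per (feature, center) pair to one index per feature.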
Clustering Helps
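Pairwise Distance Implementation
[Bar chart comparing the CUDA and C implementations; axis 0 to 6000, units not recoverable]
Given
$$A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]$$
Compute
$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & & & \vdots \\ \vdots & & \ddots & \\ d(a_m, b_1) & \cdots & & d(a_m, b_n) \end{bmatrix}$$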
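As a reference point for the computation described here, given m vectors in A and n vectors in B the result is an m x n matrix with entry (i, j) equal to d(a_i, b_j). The sketch below uses Euclidean distance via `math.dist` (Python 3.8+) as the default; the slides do not say which distance d was used, so that choice is an assumption.

```python
import math


def pairwise_distances(A, B, d=math.dist):
    """Return the m x n matrix whose (i, j) entry is d(A[i], B[j])."""
    return [[d(a, b) for b in B] for a in A]


A = [(0.0, 0.0), (3.0, 4.0)]              # m = 2 vectors
B = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]  # n = 3 vectors
D = pairwise_distances(A, B)
```

Every entry depends only on one row of A and one row of B, so all m x n entries can be computed independently of each other.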
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Diagram: input matrices A and B, output matrix C, with a Shared Memory block]
Timings on TRECVid 2008 System
[Bar chart, log scale from 1 to 10000 milliseconds, comparing GPU + CPU against CPU for: Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance; bar labels as extracted: 53 79, 23, 240 150, 53 79, 3030 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium  High
1000              Raw                    0.681   0.804   0.844
                  Norm                   0.708   0.799   0.840
                  Mt Inf                 0.594   0.804   0.848
500               Raw                    0.675   0.792   0.833
                  Norm                   0.701   0.791   0.825
                  Mt Inf                 0.626   0.783   0.819
200               Raw                    0.671   0.772   0.811
                  Norm                   0.701   0.779   0.818
                  Mt Inf                 0.614   0.720   0.756
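For readers unfamiliar with dictionary-based histogramming, the sketch below shows one plausible reading of the "Raw" and "Norm" rows: each feature votes for its nearest dictionary entry, and the resulting counts are used either as-is or after normalization. That interpretation, the L1 normalization, and the function name are assumptions; the slides do not define the methods, and "Mt Inf" is not covered here.

```python
def codeword_histogram(features, dictionary, normalize=False):
    """Count how many features fall nearest to each dictionary entry."""
    counts = [0.0] * len(dictionary)
    for feat in features:
        dists = [sum((f - c) ** 2 for f, c in zip(feat, word)) for word in dictionary]
        counts[dists.index(min(dists))] += 1.0
    if normalize:
        total = sum(counts)
        if total > 0:
            counts = [c / total for c in counts]  # L1 normalization (assumed)
    return counts


dictionary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.9, 5.2)]
raw = codeword_histogram(features, dictionary)
norm = codeword_histogram(features, dictionary, normalize=True)
```

A normalized histogram does not change with the number of interest points found in a clip, which is one reason the two variants can behave differently across detection rates.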
Results (2009), with 2008 results in parentheses
Event         True Positives   False Alarm    Miss   Min DCR
Pointing      13 (57)          225 (2505)     1050   1.006
Cell To Ear   0 (8)            58 (4005)      194    1.060
Person Runs   1 (0)            38 (314)       106    0.997
Object Put    1 (21)           190 (2703)     620    1.020
Why parallel Why parallel
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook
Most operations are local and independent in the (xyt) space
Already available (GPUs)
If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days
Image Video Processing
Video Decoder
2D3D Convolution
2D3D Fourier Transform
Optical Flow
Feature Extraction
Motion Descriptor (Efros et al)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients Optical Flow
Analysis Vector Quantization
SVM Classifier Evaluation
ViVid ndash Video Computer Vision on Graphics Processors
Download
httpgithubcommertdikmenViVid
TRECVid 2008 System
TRECVid 2009 System TRECVid 2009 System
Features Video
Motion
Shape
Classifier
Event Label
bull Running
bull Pointing
bull Object Put
bull Cell To Ear
Vector
Quantization
Histogram
Interest Points
Video Interest Point Detectors Video Interest Point Detectors
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
3D Harris Corner Detector
Corners
Dollar
Space Time Gabor
Corners
Periodic Motion
RSMB
Random Sampling of the Motion Boundary
Motion
Laptev
Dollar
RSMB
More is Good More is Good
Interest Point Detection Rates
Video Features Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al. 2003)
Motion History Images (Bobick & Davis 2001)
\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\big(0,\ H_\tau(x, y, t-1) - 1\big) & \text{otherwise}
\end{cases}
\]
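The motion history update sets a pixel to τ wherever the binary motion mask D is 1 and otherwise decays the previous value by 1, floored at 0. A minimal sketch of one update step follows; the function name `update_mhi` and the list-of-lists image layout are illustrative assumptions, not taken from the slides or from ViVid.

```python
def update_mhi(prev_h, motion_mask, tau):
    """One motion-history step over a 2D grid.

    prev_h      -- previous history image H(x, y, t-1), list of rows
    motion_mask -- binary motion image D(x, y, t), same shape
    tau         -- value assigned to pixels that are moving now
    """
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_h, motion_mask)
    ]


if __name__ == "__main__":
    history = [[0, 0, 0]]
    # Hypothetical masks: motion sweeps left to right across three frames.
    for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
        history = update_mhi(history, mask, tau=3)
    print(history)
```

Every pixel is updated independently of its neighbors, which is the kind of local, branch-light operation that maps naturally onto one GPU thread per pixel.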
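[Figure: bar chart, milliseconds per frame (axis 0 to 300), for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251; decimal points may have been lost and the assignment of values to bars is not recoverable]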
Histograms of Oriented Gradients / Optical Flow
• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
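The three steps (partition into regions, histogram by direction and magnitude, normalize over neighbors) can be sketched in plain Python. This is an illustration under assumptions, not the ViVid implementation: the cell size, the number of bins, the central-difference gradient, and the 3x3 L2 normalization are all choices made here for the example. For the optical-flow variant, the gradient pair (gx, gy) would be replaced by the flow vector at each pixel.

```python
import math


def cell_histograms(image, cell=4, bins=8):
    """Magnitude-weighted orientation histogram for each cell x cell region."""
    rows, cols = len(image), len(image[0])
    n_cy, n_cx = rows // cell, cols // cell
    hists = [[[0.0] * bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(n_cy * cell):
        for x in range(n_cx * cell):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, cols - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, rows - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2 * math.pi)  # direction in [0, 2*pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            hists[y // cell][x // cell][b] += magnitude
    return hists


def normalize_over_neighbors(hists, eps=1e-6):
    """Divide each cell histogram by the L2 norm of its 3x3 cell neighborhood."""
    n_cy, n_cx = len(hists), len(hists[0])
    out = []
    for cy in range(n_cy):
        row = []
        for cx in range(n_cx):
            energy = 0.0
            for ny in range(max(cy - 1, 0), min(cy + 2, n_cy)):
                for nx in range(max(cx - 1, 0), min(cx + 2, n_cx)):
                    energy += sum(v * v for v in hists[ny][nx])
            norm = math.sqrt(energy) + eps
            row.append([v / norm for v in hists[cy][cx]])
        out.append(row)
    return out


if __name__ == "__main__":
    ramp = [[x for x in range(8)] for _ in range(8)]  # brightness grows left to right
    print(normalize_over_neighbors(cell_histograms(ramp))[0][0])
```

On the horizontal ramp all gradient energy falls into the bin for direction 0, so each cell histogram has a single non-zero entry.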
K-Means Clustering
Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of its associated data points
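The two alternating steps of Lloyd's algorithm can be sketched as follows. The initialization (k random data points), the fixed iteration count, and the empty-cluster handling are assumptions of this sketch; the slides do not specify them.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_center(point, centers):
    """Index of the center nearest to point."""
    return min(range(len(centers)),
               key=lambda k: squared_distance(point, centers[k]))


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Return (centers, labels) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # assumed initialization
    for _ in range(iterations):
        # Assignment step: each point goes to its closest center.
        labels = [closest_center(p, centers) for p in points]
        # Update step: each center moves to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    labels = [closest_center(p, centers) for p in points]
    return centers, labels


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    centers, labels = lloyd_kmeans(data, k=2)
    print(centers, labels)
```

The returned labels are the vector-quantized form of the data: each high-dimensional feature is replaced by the index of its nearest center.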
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
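To see why the output does not fit, note that 1 million features against 1000 centers gives 10^9 distances, about 4 GB if each is stored as a 4-byte float (the storage width is an assumption). Only the index of the closest center is needed per feature, so each block of distances can be reduced to an argmin as soon as it is computed. The sketch below illustrates that idea on the CPU; the chunk size and function names are hypothetical.

```python
def distance_matrix_bytes(num_points, num_centers, bytes_per_value=4):
    """Memory needed to hold the full points x centers distance matrix."""
    return num_points * num_centers * bytes_per_value


def nearest_centers_chunked(points, centers, chunk_size=2):
    """Label each point with its nearest center without storing all distances."""
    labels = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        # Distance block for this chunk only: len(chunk) x len(centers).
        block = [[sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
                 for p in chunk]
        # Reduce each row to the index of its minimum; the block is then dropped.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return labels


if __name__ == "__main__":
    print(distance_matrix_bytes(1_000_000, 1000) / 1e9, "GB")
    pts = [[0, 0], [9, 9], [1, 1], [8, 8], [0, 1]]
    print(nearest_centers_chunked(pts, [[0, 0], [10, 10]]))
```

Peak memory is then set by the chunk size rather than by the number of features.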
Clustering Helps
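Pairwise Distance Implementation
[Figure: bar chart comparing the CUDA and C implementations; axis ticks 0, 2000, 4000, 6000; units not recoverable]
Given
\[
A = \{a_1, \dots, a_m\}, \qquad B = \{b_1, \dots, b_n\}
\]
Compute
\[
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
\]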
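As a reference point, the pairwise distance problem (given sets A with m vectors and B with n vectors, compute every d(a_i, b_j)) can be written directly as a nested loop. The sketch assumes Euclidean distance for d; the slides leave d generic, so the distance function is a parameter.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(A, B, d=euclidean):
    """Return the m x n matrix whose (i, j) entry is d(A[i], B[j])."""
    return [[d(a, b) for b in B] for a in A]


if __name__ == "__main__":
    A = [(0, 0), (3, 4)]
    B = [(0, 0), (3, 0), (0, 4)]
    for row in pairwise_distances(A, B):
        print(row)
```

Each of the m x n entries depends only on one vector from A and one from B, so all entries can be computed independently.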
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Figure: diagram with blocks labeled Shared Memory, A, B, and C, shown over three slides]
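The diagram itself did not survive extraction, so the following is only an assumption about what it depicts: a common way to use shared memory for this computation is tiling, where a small block of A and a small block of B are staged once and reused to fill one tile of the output C. The sketch emulates that access pattern in plain Python; the tile size and function name are hypothetical, and the squared Euclidean distance is chosen for illustration.

```python
def pairwise_sq_dist_tiled(A, B, tile=2):
    """Squared Euclidean distances between rows of A and rows of B, tile by tile."""
    m, n = len(A), len(B)
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):        # on a GPU: one thread block per output tile
        for j0 in range(0, n, tile):
            a_tile = A[i0:i0 + tile]    # staged once, reused across the whole tile
            b_tile = B[j0:j0 + tile]
            for di, a in enumerate(a_tile):      # on a GPU: one thread per entry
                for dj, b in enumerate(b_tile):
                    C[i0 + di][j0 + dj] = sum((x - y) ** 2 for x, y in zip(a, b))
    return C


if __name__ == "__main__":
    A = [[0, 0], [1, 1], [2, 2]]
    B = [[0, 0], [3, 4]]
    print(pairwise_sq_dist_tiled(A, B))
```

With a tile of size T, each staged vector is reused T times, which cuts reads from slow memory by roughly that factor compared with fetching both vectors for every output entry.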
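Timings on TRECVid 2008 System
[Figure: log-scale bar chart (1 to 10000 milliseconds) of time per stage (Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance) for GPU + CPU vs. CPU. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947; decimal points may have been lost and the assignment of values to bars is not recoverable]
Benchmark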
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low      Medium   High
1000              Raw                    0.681    0.804    0.844
                  Norm                   0.708    0.799    0.840
                  Mt Inf                 0.594    0.804    0.848
500               Raw                    0.675    0.792    0.833
                  Norm                   0.701    0.791    0.825
                  Mt Inf                 0.626    0.783    0.819
200               Raw                    0.671    0.772    0.811
                  Norm                   0.701    0.779    0.818
                  Mt Inf                 0.614    0.720    0.756
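The slides do not define the histogramming methods. As a sketch of the general idea, once every feature has been assigned to a dictionary entry, a clip can be summarized by how often each entry occurs. The code below assumes that "Raw" means plain counts and "Norm" means counts divided by their sum; both readings are assumptions, and the "Mt Inf" method is not covered because its definition is not given.

```python
def word_histogram(labels, dictionary_size, normalize=False):
    """Histogram of dictionary-entry indices for one clip.

    labels          -- nearest-center index for each feature in the clip
    dictionary_size -- number of centers in the dictionary
    normalize       -- if True, divide counts by their sum (assumed "Norm")
    """
    counts = [0] * dictionary_size
    for lab in labels:
        counts[lab] += 1
    if not normalize:
        return counts
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * dictionary_size


if __name__ == "__main__":
    clip_labels = [0, 2, 2, 3]  # hypothetical assignments for four features
    print(word_histogram(clip_labels, 4))
    print(word_histogram(clip_labels, 4, normalize=True))
```

Normalizing makes clips with different numbers of detected interest points comparable, which may matter most when the detection rate is low.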
Results (2009), with 2008 results in parentheses
Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations: Experience
Working with realistic data
Engineering, programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10000 Classes
More is Good More is Good
Interest Point Detection Rates
Video Features Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)
Motion History Images
(Bobbick amp Davis 2001)
Motion History Images
(Bobbick amp Davis 2001)
otherwise
1t)yD(xif
1)1)ty(xHmax(0
τt)y(xH
τ
τ
133
438
51
251
0 50 100 150 200 250 300
CUDA (feature + distance + argmin)
CUDA (distance + argmin)
CUDA (distance)
C
milliseconds per frame
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
Histograms of
Oriented Gradients
Optical Flow
Histograms of
Oriented Gradients
Optical Flow
bull Partition the image window into local regions
bull Histogram the Image GradientOptical Flow based
on the direction and magnitude
bull Normalize over neighboring regions
K-Means Clustering K-Means Clustering
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
K-Means K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce computation proceeds
Need efficient reduction operator
Clustering Helps Clustering Helps
Pairwise Distance Implementation Pairwise Distance Implementation
0 2000 4000 6000
CUDA
C
)bd(a)bd(a
)bd(a
)bd(a)bd(a)bd(a
nm1m
12
n11111
n1
m1
bbB
aaA
Compute Given
CPU vs GPU CPU vs GPU
Algorithmic properties that map well to GPUs
1 Independent and highly data local
computations
2Compute bound
3Little branch divergence
Pairwise Distance Computation on the GPU
[Figure: blocks labeled A, B and C, with "Shared Memory"]
Pairwise Distance Computation
[Figure: blocks labeled A, B and C]
Timings on TRECVid 2008 System
[Bar chart, logarithmic axis in milliseconds (1 to 10000). Series: GPU + CPU, CPU. Stages: Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Data labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium  High
1000              Raw                    0.681   0.804   0.844
                  Norm                   0.708   0.799   0.840
                  Mt Inf                 0.594   0.804   0.848
500               Raw                    0.675   0.792   0.833
                  Norm                   0.701   0.791   0.825
                  Mt Inf                 0.626   0.783   0.819
200               Raw                    0.671   0.772   0.811
                  Norm                   0.701   0.779   0.818
                  Mt Inf                 0.614   0.720   0.756
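The slide does not define the histogramming methods. As a rough illustration only, the sketch below assumes that "Raw" means counting how many features were assigned to each dictionary entry and that "Norm" means scaling those counts to sum to 1. The "Mt Inf" variant is not sketched because its definition is not recoverable from the table. The function name `word_histogram` and its parameters are hypothetical.

```python
def word_histogram(word_ids, dictionary_size, normalize=False):
    """Histogram of dictionary-entry indices for one clip or window.

    word_ids: nearest-center index of each feature (e.g. from k-means).
    normalize=False gives raw counts (assumed meaning of "Raw").
    normalize=True scales the counts to sum to 1 (assumed meaning of "Norm").
    """
    hist = [0.0] * dictionary_size
    for w in word_ids:
        hist[w] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:  # leave an all-zero histogram unchanged
            hist = [v / total for v in hist]
    return hist
```

Under these assumptions the normalized form removes the dependence on how many features a clip contains.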
Results (2009)
(2008 results in parentheses)
Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
Contests/Evaluations Experience
Working with realistic data
Engineering, programming
Tight schedule, streamlined development
Examples of Evaluations
TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
10,000 Classes
Video Features
Descriptors of information relevant to the task
Motion
Shape
Appearance
Computationally intensive
Development
Application
Averaging Flow (Efros et al., 2003)
Motion History Images
(Bobick & Davis, 2001)
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\ H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
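The motion history update can be applied frame by frame as in the sketch below: a pixel is set to τ where the binary motion image D is 1, and otherwise decays by 1 toward 0. The function name `update_mhi` and the list-of-rows image format are assumptions for illustration, and how D is obtained (for example by frame differencing) is left outside the sketch.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step of the motion history image.

    prev_mhi:    H_tau(x, y, t-1) as a list of rows.
    motion_mask: D(x, y, t) as a list of rows of 0/1 values.
    Returns H_tau(x, y, t).
    """
    return [[tau if d == 1 else max(0, h - 1)
             for h, d in zip(h_row, d_row)]
            for h_row, d_row in zip(prev_mhi, motion_mask)]
```

Pixels that moved most recently therefore hold the largest values, and motion older than τ frames has faded to 0.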
[Bar chart: milliseconds per frame (axis 0 to 300) for CUDA (feature + distance + argmin), CUDA (distance + argmin), CUDA (distance), and C. Data labels as extracted: 133, 438, 51, 251]
K-Means Clustering K-Means Clustering
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
Vector quantization
Turns high dimensional features into discrete number of points
Given data find representative ldquocentersrdquo
Lloydrsquos algorithm
For each data point find the closest center
Update the center to be the mean of the associated data points
K-Means
Relies heavily on pairwise distance
Large data sets
1 million features with 100-200 dimensions
1000 centers
Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
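A rough size check for the numbers above: 1 million features against 1000 centers gives 10^9 distances, or about 4 GB if each is stored as a 4-byte float (the 4-byte size is an assumption). The sketch below shows one way to reduce as the computation proceeds: distances are computed one chunk of points at a time and immediately reduced to the index of the nearest center, so the full matrix is never stored. The function names and the chunk size are hypothetical, and plain Python stands in for the GPU kernel.

```python
def distance_matrix_bytes(num_points, num_centers, bytes_per_value=4):
    """Bytes needed to hold the full points-by-centers distance matrix."""
    return num_points * num_centers * bytes_per_value


def nearest_centers_chunked(points, centers, chunk_size=2):
    """Return the index of the closest center for every point.

    Only a chunk_size x len(centers) block of distances exists at a time.
    """
    nearest = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        # Squared Euclidean distances for this chunk only.
        block = [
            [sum((x - y) ** 2 for x, y in zip(p, c)) for c in centers]
            for p in chunk
        ]
        # Reduction: keep only the argmin of each row, then drop the block.
        nearest.extend(
            min(range(len(centers)), key=row.__getitem__) for row in block
        )
    return nearest


full_size = distance_matrix_bytes(1_000_000, 1000)
labels = nearest_centers_chunked(
    [(0, 0), (9, 9), (1, 0), (10, 8), (0, 2)],
    [(0, 0), (10, 10)],
)
```

With this arrangement the output shrinks from one value per (point, center) pair to one index per point.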
Clustering Helps
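Pairwise Distance Implementation
[Figure: bar chart comparing the CUDA and C implementations; axis ticks 0, 2000, 4000, 6000 (units not preserved)]
Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute
$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
$$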
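The computation on this slide, an m-by-n matrix holding d(a_i, b_j) for every pair, can be written directly as a nested loop. The sketch below is a plain Python reference version, not the CUDA or C implementation compared in the figure; the function names are hypothetical, and Euclidean distance is assumed for d because the slide leaves it unspecified.

```python
import math


def euclidean(a, b):
    """Assumed choice for d: Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(A, B, d):
    """Return the len(A) x len(B) matrix whose (i, j) entry is d(A[i], B[j])."""
    return [[d(a, b) for b in B] for a in A]


A = [(0.0, 0.0), (3.0, 4.0)]              # m = 2 vectors
B = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]  # n = 3 vectors
D = pairwise_distances(A, B, euclidean)   # 2 x 3 result
```

Every entry depends only on one row of A and one row of B, so all m times n entries can be computed independently.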
CPU vs GPU
Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence
Pairwise Distance Computation on the GPU
[Figure, shown over three slides: diagram labeled A, B, C, and Shared Memory]
Timings on TRECVid 2008 System
[Figure: bar chart of time in milliseconds on a log scale (1 to 10000) for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; two series, GPU + CPU and CPU. Bar labels as extracted: 53 79, 23, 240 150, 53 79, 3030 4947]
Benchmark
Dictionary Building Strategies
Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
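To make the first two histogramming methods concrete, here is a hedged sketch of how a set of quantized features might be turned into a histogram over the dictionary. It assumes that "Raw" means plain counts per dictionary word and that "Norm" means the same counts scaled to sum to one; the slides do not define either term, the function names are hypothetical, and the "Mt Inf" variant is not sketched because its definition is not given.

```python
def raw_histogram(word_ids, dictionary_size):
    """Count how many features were assigned to each dictionary word."""
    counts = [0] * dictionary_size
    for w in word_ids:
        counts[w] += 1
    return counts


def normalized_histogram(word_ids, dictionary_size):
    """Raw counts divided by their total (assumed L1 normalization)."""
    counts = raw_histogram(word_ids, dictionary_size)
    total = sum(counts)
    if total == 0:
        return [0.0] * dictionary_size
    return [c / total for c in counts]


# Four features quantized against a toy dictionary of size 4.
word_ids = [0, 2, 2, 3]
raw = raw_histogram(word_ids, 4)
norm = normalized_histogram(word_ids, 4)
```

Under these assumptions, normalization makes histograms comparable across inputs with different numbers of extracted features.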
Results (2009), with 2008 results in parentheses
Event         True Positives   False Alarm    Miss   Min DCR
Pointing      13 (57)          225 (2505)     1050   1.006
Cell To Ear   0 (8)            58 (4005)      194    1.060
Person Runs   1 (0)            38 (314)       106    0.997
Object Put    1 (21)           190 (2703)     620    1.020
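As a quick reading aid for the 2009 numbers, the fraction of events missed can be computed from the True Positives and Miss columns. This sketch assumes that true positives plus misses equals the number of reference events for each event type; that interpretation, and the names used, are assumptions rather than something stated on the slide.

```python
# (true positives, misses) per event, copied from the 2009 results.
results_2009 = {
    "Pointing": (13, 1050),
    "Cell To Ear": (0, 194),
    "Person Runs": (1, 106),
    "Object Put": (1, 620),
}


def miss_rate(true_positives, misses):
    """Fraction of reference events that were not detected."""
    return misses / (true_positives + misses)


miss_rates = {
    event: miss_rate(tp, miss) for event, (tp, miss) in results_2009.items()
}
```

Under that assumption every event type has a miss rate above 0.98, which illustrates how hard the task is.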
Conclusions Conclusions
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
ContestsEvaluations Experience
Working with realistic data
Engineering Programming
Tight schedule streamlined development
Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs
ContestsEvaluations Experience
Working with realistic data
Engineering Programming
Tight schedule streamlined development
Examples of Evaluations Examples of Evaluations
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes Pascal Visual Object Classes
Classificationdetection
Segmentation
Person Layout
Action Classification
Classificationdetection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
ImageNet
Large Scale Visual Recognition
10000 Classes
Examples of Evaluations Examples of Evaluations
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Trecvid 2012 Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Pascal Visual Object Classes Pascal Visual Object Classes
Classificationdetection
Segmentation
Person Layout
Action Classification
Classificationdetection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
ImageNet
Large Scale Visual Recognition
10000 Classes
Pascal Visual Object Classes Pascal Visual Object Classes
Classificationdetection
Segmentation
Person Layout
Action Classification
Classificationdetection
Segmentation
Person Layout
Action Classification
ImageNet
Large Scale Visual Recognition
ImageNet
Large Scale Visual Recognition
10000 Classes
ImageNet
Large Scale Visual Recognition
ImageNet
Large Scale Visual Recognition
10000 Classes