
Bridging the Semantic Gap

ECE 417 Spring 2013

Mert Dikmen

Semantic Gap

Computer Representation <-> Natural Language Representation (separated by the Semantic Gap)

Green
Corner
Roof
Ski Slope
Resort
Fun, Holiday, Beautiful…

Semantic Gap in Multimedia

Retrieval: Given a description, retrieve all "relevant" content from a database
Parsing: Given an input, formulate a natural language description
Subtasks
  Detection (find "things")
  Segmentation (find the boundaries of "things")
  Recognition (assign category)

Multimedia Analysis Competitions and Evaluations

Moderate size dataset
  Training set with labels
  Evaluation set without labels
Constrained problem
  Detect well-defined actions
  Detect words or concepts
Well-defined metric
Challenges
  Algorithm design
  Computation

Star Challenge

PART I: visual data processing

What is Star Challenge?

Competition to Develop World's Next-Generation Multimedia Search Technology
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
A real-world computer vision task which requires large amounts of computation power

But low rewards

56 teams from 17 countries -> Round 1 -> 8 teams -> Round 2 -> 7 teams -> Round 3 -> 5 teams -> Grand Final in Singapore

No rewards
Only one team can win US$100,000

But we have a team with no fears…

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

Let's go over our experience and stories…

Outlines

Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)

3 Audio Retrieval Tasks

Task | Query | Target | Metric
AT1 | IPA sequence | segments that contain the query IPA sequence, regardless of its languages | Mean Average Precision
AT2 | an utterance spoken by different speakers | all segments that contain the query word/phrase/sentence, regardless of its spoken languages |
AT3 | No queries | extract all recurrent segments which are at least 1 second in length | F-measure

Data Set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3

Xiaodan will talk about this part……

3 Video Retrieval Tasks

Task | Query | Target | Criteria | Metric | Data Set
VT1 | Single image, 20 queries | (short) video segs: all the similar segs | "visually similar" | Mean Average Precision | 20 categories, multiple labels possible
VT2 | Short video shot (<10s), 20 queries | (long) video segs: all the similar segs | perceptually similar | | 10 categories, multiple labels possible
VT3 | Videos with sound (3~10s), order of 10K | category number | learning the common visual characteristics | Classification accuracy | 10 (20) categories, including one "others" category

20 VT1 Categories

100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices, including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/Ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane

10 Categories for VT2

201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)
5 queries will be given, and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given, and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total
  32508 pseudo key frames, 8486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288

Computation Powers

Work Stations in IFP
  10 Servers, 2~4 CPU each, 36 CPU in total
  IFP-32 Cluster: 32 dual-core 2.8G 64-bit CPU
CSL Cluster
  Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
  Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
  Global Feature 2: 2 hours (C)
  Global Feature 1: 2 hours (C)
  Patch-based Feature 1: 2 hours (C)
  Patch-based Feature 2: 5 hours (Matlab)
  Semantic Feature 1: 24 hours (Matlab)
  Semantic Feature 2: 3 hours (C)
  Semantic Feature 3: 4 hours (C)
  Motion Feature 1: 24 hours (Matlab)
  Motion Feature 2: 3 hours on t-Illiac
Classifier Training
  Classifier 1: 1 hour (on IFP cluster, 25 CPU, Matlab)
  Classifier 2: 20 minutes
  Classifier 3: less than 10 minutes

Possible Accelerations for Video

Matlab codes to C
Parallel computing
GPU Acceleration
  Patch based features
Load time is the major issue
  Extracting all the features after one load
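The "extract all the features after one load" idea can be sketched as follows. This is only an illustration: the two toy extractors, the `load_frames` loader, and the frame format (rows of 0-255 intensities) are assumptions made for the example, not the team's actual features or file handling.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Frame = List[List[int]]  # assumed format: rows of 0-255 grayscale intensities
Extractor = Callable[[Frame], List[float]]


def mean_intensity(frame: Frame) -> List[float]:
    """Toy global feature: average pixel intensity."""
    pixels = [p for row in frame for p in row]
    return [sum(pixels) / len(pixels)]


def intensity_histogram(frame: Frame, bins: int = 4) -> List[float]:
    """Toy global feature: normalized intensity histogram."""
    counts = [0] * bins
    for row in frame:
        for p in row:
            counts[min(p * bins // 256, bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def extract_all(
    frames: Iterable[Tuple[str, Frame]],
    extractors: Dict[str, Extractor],
) -> Dict[str, Dict[str, List[float]]]:
    """Run every extractor on a frame while it is still in memory."""
    features: Dict[str, Dict[str, List[float]]] = {}
    for frame_id, frame in frames:  # each frame is loaded exactly once
        features[frame_id] = {name: fn(frame) for name, fn in extractors.items()}
    return features


def load_frames() -> Iterator[Tuple[str, Frame]]:
    """Hypothetical loader standing in for slow disk reads and decoding."""
    yield "frame_0001", [[0, 255], [128, 64]]


EXTRACTORS: Dict[str, Extractor] = {
    "mean": mean_intensity,
    "hist": intensity_histogram,
}
features = extract_all(load_frames(), EXTRACTORS)
```

The point of the design is that the expensive step (the loader) sits in the outer loop, so adding another extractor does not add another pass over the data.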

Features for Round2-VT1

Image Features
  SIFT
  HOG
  GIST
  APC
  LBP
  Color, Texture, etc.
Semantic Feature

Features for Round2-VT2

Character Detector
  Harris corner
  morphological operations
Optical Flow
  Lucas-Kanade on spatial intensity gradient
Gender recognition
  SODA-boost based
Motion History Image
Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103 (26-141)

Algorithms

Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
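A minimal sketch of the query-expansion steps, under stated assumptions: features are plain vectors, the distance is Euclidean, and an evaluation item is scored by its distance to the nearest member of the expanded query. The slides do not specify the distance or the scoring rule, so those choices and all names here are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def expand_query(
    query: Vector, category: int, dev_set: List[Tuple[Vector, int]]
) -> List[Vector]:
    """Step 1: add every development image that shares the query's category."""
    return [query] + [feat for feat, label in dev_set if label == category]


def search(expanded: List[Vector], eval_set: Dict[str, Vector], top_k: int) -> List[str]:
    """Step 2: rank evaluation items by distance to their nearest expanded-query item."""
    scores = {
        name: min(euclidean(feat, q) for q in expanded)
        for name, feat in eval_set.items()
    }
    return sorted(scores, key=lambda name: scores[name])[:top_k]


# Toy data: category numbers follow the VT1 numbering, vectors are made up.
dev_set = [((0.0, 0.0), 101), ((10.0, 10.0), 102), ((1.0, 0.0), 101)]
eval_set = {"a": (0.9, 0.1), "b": (9.0, 9.0), "c": (5.0, 5.0)}
results = search(expand_query((0.2, 0.0), 101, dev_set), eval_set, top_k=2)
```

In a real system the distances between evaluation and development data would be precomputed once (step 0), so each query only needs a lookup and a sort.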

Algorithms

Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
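The slides do not say which MAP variant was used in step 2. As a hedged illustration, the widely used mean-only adaptation looks like this, where the UBM has weights $w_k$, means $\mu_k$, and covariances $\Sigma_k$, the image contributes patches $x_1, \dots, x_N$, and the relevance factor $r$ is an assumed tuning parameter:

```latex
\gamma_{ik} = \frac{w_k \,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} w_j \,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)},
\qquad
n_k = \sum_{i=1}^{N} \gamma_{ik},
\qquad
\hat{\mu}_k = \frac{\sum_{i=1}^{N} \gamma_{ik}\, x_i + r\, \mu_k}{n_k + r}
```

Components that see few patches from the image ($n_k \ll r$) stay close to the UBM mean, while well-observed components move toward the image's own patch statistics.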

VT1 Performance (2 in 8)

Category | MAP
101 Crowd (>10 people) | 0.8419
102 Building with sky as backdrop, clearly visible | 0.977
103 Mobile devices, including handphone/PDA | 0.028
107 Person using computer, both visible | 0.2281
109 Company trademark, including billboard, logo | 0.96
112 Close-up of hand, e.g. using mouse, writing, etc. | 0.4584
113 Business meeting (>2 people), mostly seated down, table visible | 0.0644
115 Food on dishes, plates | 0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side | 0.9783
117 Traffic scene, many cars, trucks, road visible | 0.2901
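For readers unfamiliar with the MAP column, here is a small sketch of the textbook definition of average precision and its mean over queries. The competition's exact evaluation protocol (for example, how the result list was truncated) is not given in the slides, so this is the standard formula rather than the organizers' scoring code.

```python
from typing import List, Tuple


def average_precision(ranked_relevance: List[bool], num_relevant: int) -> float:
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / num_relevant


def mean_average_precision(runs: List[Tuple[List[bool], int]]) -> float:
    """Average the per-query AP values."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)


# Two toy queries: relevance flags of the ranked results, and the number of relevant items.
runs = [([True, False, True, False], 2), ([False, True], 1)]
map_score = mean_average_precision(runs)
```

A ranking that puts all relevant items first scores 1.0, which is why the per-category values range between 0 and 1.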

VT2 Performance (1 in 8)

Category | MAP
202 Talking face with introductory caption | 0.8432
206 Static or minute camera movement, people(s) walking, legs visible | 0.0581
207 Large camera movement, panning left/right, top/down of a scene | 0.7789
208 Movie ending credit | 0.2782
209 Woman monologue Zhen | 0.9756

Performance of Round 3 (1 in 7)

Task 2 (VT1)

Target | Estimated MAP (R=20)
101 Crowd (>10 people) | 0.64
102 Building with sky as backdrop, clearly visible | 1
107 Person using computer, both visible | 0.7
112 Close-up of hand, e.g. using mouse, writing, etc. | 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side | 1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20) | VT2 only | AT1 + VT2
202 face with introductory caption | 1 | 0.03
209 women monolog | 0.35 | 0.1
201 People entering door | NA

We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall

TRECVid

TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset
  Surveillance footage from London Gatwick Airport
  5 stationary cameras
  Training set: 100 hours
  Testing set: 44 hours
  Frame size: 700x540 px
  Dataset size: ~350 GB
  Frames: ~12 million

List of Events
  1. Cell to ear
  2. Embrace
  3. Object Put
  4. Opposing flow
  5. Pointing
  6. Taking picture
  7. Running
  8. People meeting
  9. People splitting
  10. Person not entering elevator

Regional Averaging
Door Open/Close Information
Event Detections
Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool
  Rapid development
  Fast execution
Python glue layer
  CUDA C/C++
Integrates Libraries
Data flow
  Lazy pull
  Per frame referencing
  Caches (lots of them)

Motivations

Most operations are highly local
Applications with real time (or faster) performance requirements
  Surveillance
  Soft biometrics
  Multimedia Indexing
Visual Computing is here
  Imaging and Photogrammetry
  Pattern Recognition and Statistical Learning
  Object Detection and Recognition
  Dynamic Vision
  Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data
  20 hours of video uploaded to YouTube every minute
  15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)

If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
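The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption: not stated on the slide

# Seconds of video uploaded per minute of wall-clock time.
video_seconds_per_minute = HOURS_UPLOADED_PER_MINUTE * 3600

# Frames uploaded per day: frames per wall-clock minute times 1440 minutes.
frames_per_day = video_seconds_per_minute * FRAMES_PER_SECOND * 60 * 24

# Days needed to upload as many frames as there are Facebook photos.
days_to_match = FACEBOOK_PHOTOS / frames_per_day
```

Under these numbers the result is a little under five days, consistent with the slide's estimate; a lower assumed frame rate would lengthen it proportionally.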

Image / Video Processing
  Video Decoder
  2D/3D Convolution
  2D/3D Fourier Transform
  Optical Flow
Feature Extraction
  Motion Descriptor (Efros et al.)
  Motion History Descriptor
  Random Video Interest Points
  Histograms of Oriented Gradients / Optical Flow
Analysis
  Vector Quantization
  SVM Classifier Evaluation

ViVid – Video Computer Vision on Graphics Processors

Download
http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Diagram: Video -> Interest Points -> Features (Motion, Shape) -> Vector Quantization -> Histogram -> Classifier -> Event Label]

Event labels: Running, Pointing, Object Put, Cell To Ear
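The vector quantization and histogram stages can be sketched as a bag-of-codewords computation: each local feature is mapped to its nearest dictionary center, and the clip is described by the counts. The function names, the `normalize` option, and the toy data are assumptions for illustration and are not taken from the TRECVid system itself.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(feature: Vector, centers: List[Vector]) -> int:
    """Vector quantization: index of the nearest dictionary center."""
    return min(range(len(centers)), key=lambda k: squared_distance(feature, centers[k]))


def codeword_histogram(
    features: List[Vector], centers: List[Vector], normalize: bool = False
) -> List[float]:
    """Count how many features fall on each codeword; optionally scale to sum to 1."""
    hist = [0.0] * len(centers)
    for feature in features:
        hist[quantize(feature, centers)] += 1.0
    if normalize and features:
        total = sum(hist)
        hist = [h / total for h in hist]
    return hist


# Toy dictionary with two codewords and three local features.
centers = [(0.0, 0.0), (10.0, 10.0)]
local_features = [(1.0, 1.0), (9.0, 9.0), (0.0, 1.0)]
raw_hist = codeword_histogram(local_features, centers)
norm_hist = codeword_histogram(local_features, centers, normalize=True)
```

The resulting fixed-length histogram is what a classifier would consume, regardless of how many interest points a given clip produced.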

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector
  Corners
Dollar: Space Time Gabor
  Corners
  Periodic Motion
RSMB: Random Sampling of the Motion Boundary
  Motion

[Figure panels: Laptev, Dollar, RSMB]

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task
  Motion
  Shape
  Appearance
Computationally intensive
  Development
  Application

Averaging Flow (Efros et al., 2003)

Motion History Images (Bobick & Davis, 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
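The Motion History Image rule sets a pixel to τ wherever the motion mask D is 1 and otherwise decays the previous value by 1, floored at 0. A minimal per-frame update might look like the sketch below; the list-of-lists image format and the function name are assumptions for illustration.

```python
from typing import List

Image = List[List[int]]


def update_mhi(prev_mhi: Image, motion_mask: Image, tau: int) -> Image:
    """One time step of the motion history recurrence.

    prev_mhi holds H(x, y, t-1); motion_mask holds D(x, y, t) as 0/1 values.
    """
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]


# Toy 2x2 example: motion only at the top-left pixel, then no motion at all.
history = [[0, 3], [5, 1]]
history = update_mhi(history, [[1, 0], [0, 0]], tau=5)
after_still_frame = update_mhi(history, [[0, 0], [0, 0]], tau=5)
```

Because every pixel is updated independently of its neighbors, this is the kind of operation that maps directly onto one GPU thread per pixel.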

[Figure: bar chart of milliseconds per frame (axis 0 to 300) comparing CUDA (feature + distance + argmin), CUDA (distance + argmin), CUDA (distance), and C; bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions
• Histogram the Image Gradient/Optical Flow based on the direction and magnitude
• Normalize over neighboring regions
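The three steps can be sketched for image gradients as follows. This is a simplified illustration, not the ViVid implementation: it assumes central-difference gradients with clamped borders, unsigned orientations, hard bin assignment, and L2 normalization over 2x2 blocks of cells. The same histogramming would apply to optical flow vectors in place of gradients.

```python
import math
from typing import List

Image = List[List[float]]


def cell_histograms(image: Image, cell_size: int, num_bins: int) -> List[List[List[float]]]:
    """Steps 1 and 2: split into cells, then vote by direction, weighted by magnitude."""
    h, w = len(image), len(image[0])
    cells_y, cells_x = h // cell_size, w // cell_size
    hists = [[[0.0] * num_bins for _ in range(cells_x)] for _ in range(cells_y)]
    for y in range(cells_y * cell_size):
        for x in range(cells_x * cell_size):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation in [0, pi)
            bin_index = min(int(angle / math.pi * num_bins), num_bins - 1)
            hists[y // cell_size][x // cell_size][bin_index] += magnitude
    return hists


def normalize_blocks(hists: List[List[List[float]]], eps: float = 1e-6) -> List[List[float]]:
    """Step 3: L2-normalize each 2x2 block of neighboring cells."""
    blocks = []
    for cy in range(len(hists) - 1):
        for cx in range(len(hists[0]) - 1):
            v = hists[cy][cx] + hists[cy][cx + 1] + hists[cy + 1][cx] + hists[cy + 1][cx + 1]
            norm = math.sqrt(sum(x * x for x in v) + eps)
            blocks.append([x / norm for x in v])
    return blocks


# Toy 4x4 image with a vertical edge between columns 1 and 2.
edge_image = [[0.0, 0.0, 10.0, 10.0] for _ in range(4)]
cells = cell_histograms(edge_image, cell_size=2, num_bins=4)
descriptor = normalize_blocks(cells)
```

Each cell and each block is computed from a small neighborhood only, which is what makes the descriptor a good fit for parallel hardware.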

K-Means Clustering

Vector quantization
  Turns high dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
  For each data point, find the closest center
  Update the center to be the mean of the associated data points
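The two alternating steps of Lloyd's algorithm can be written compactly. The sketch below makes two simplifying assumptions that the slides do not specify: the first k points serve as initial centers, and the loop runs for a fixed number of iterations instead of testing for convergence.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points: List[Vector], k: int, iterations: int = 10) -> List[List[float]]:
    """Alternate between assigning points to centers and re-estimating the centers."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points as initial centers
    for _ in range(iterations):
        # Assignment step: each point goes to its closest center.
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its assigned points.
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


# Two well-separated toy groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = lloyd_kmeans(points, k=2)
```

The assignment step is a pairwise distance computation followed by an argmin per point, which is where nearly all of the running time goes on large data sets.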

K-Means

Relies heavily on pairwise distance
Large data sets
  1 million features with 100-200 dimensions
  1000 centers
Cannot fit output in GPU memory
  Will need to reduce as computation proceeds
  Need efficient reduction operator
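With 1 million features and 1000 centers, the full distance matrix has 10^9 entries, roughly 4 GB if stored as 32-bit floats (the float width is an assumption). A sketch of "reducing as computation proceeds" is to process the centers in chunks and keep only a running minimum and argmin per feature. This CPU version only illustrates the bookkeeping; it is not the CUDA kernel, and the chunk size and names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centers_chunked(
    features: List[Vector], centers: List[Vector], chunk_size: int
) -> List[int]:
    """Index of the closest center per feature, reduced chunk by chunk."""
    best_dist = [float("inf")] * len(features)
    best_idx = [-1] * len(features)
    for start in range(0, len(centers), chunk_size):
        chunk = centers[start:start + chunk_size]
        for i, feature in enumerate(features):
            # Only len(chunk) distances per feature exist at any one time.
            dists = [squared_distance(feature, c) for c in chunk]
            local = min(range(len(chunk)), key=dists.__getitem__)
            if dists[local] < best_dist[i]:  # fold into the running argmin
                best_dist[i] = dists[local]
                best_idx[i] = start + local
    return best_idx


# Toy data: three features, three centers, processed two centers at a time.
toy_features = [(0.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
toy_centers = [(0.0, 1.0), (4.0, 4.0), (10.0, 10.0)]
assignments = nearest_centers_chunked(toy_features, toy_centers, chunk_size=2)
```

The output is one index per feature, so memory use is proportional to the number of features plus one chunk of distances, not to the full feature-by-center matrix.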

Clustering Helps

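Pairwise Distance Implementation

[Figure: bar chart comparing CUDA and C; axis 0 to 6000]

Given

$$
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
$$

Compute

$$
\begin{bmatrix}
d(a_1, b_1) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & \vdots \\
\vdots & & \\
d(a_m, b_1) & \cdots & d(a_m, b_n)
\end{bmatrix}
$$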
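The slides do not say which distance $d$ is used. Assuming the squared Euclidean distance, a common way to organize the computation is to expand the square so that most of the work becomes a matrix product:

```latex
d(a_i, b_j) = \lVert a_i - b_j \rVert^2
            = \lVert a_i \rVert^2 + \lVert b_j \rVert^2 - 2\, a_i^{\top} b_j
```

With the $a_i$ stacked as rows of $A \in \mathbb{R}^{m \times k}$, the $b_j$ as rows of $B \in \mathbb{R}^{n \times k}$, and $r_A$, $r_B$ the vectors of squared row norms:

```latex
D = r_A \mathbf{1}_n^{\top} + \mathbf{1}_m r_B^{\top} - 2\, A B^{\top}
```

The $A B^{\top}$ term has the same access pattern as matrix multiplication, so the tiling strategies used for matrix products apply to it as well.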

CPU vs GPU

Algorithmic properties that map well to GPUs
1. Independent and highly data local computations
2. Compute bound
3. Little branch divergence

Pairwise Distance Computation on the GPU

[Figure: diagram with blocks labeled Shared Memory, A, B, and C]

Timings on TRECVid 2008 System

[Figure: bar chart on a log axis (1 to 10000 milliseconds) comparing GPU + CPU against CPU for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size | Histogramming Method | Rate of Detection: Low | Medium | High
1000 | Raw | 0.681 | 0.804 | 0.844
1000 | Norm | 0.708 | 0.799 | 0.840
1000 | Mt Inf | 0.594 | 0.804 | 0.848
500 | Raw | 0.675 | 0.792 | 0.833
500 | Norm | 0.701 | 0.791 | 0.825
500 | Mt Inf | 0.626 | 0.783 | 0.819
200 | Raw | 0.671 | 0.772 | 0.811
200 | Norm | 0.701 | 0.779 | 0.818
200 | Mt Inf | 0.614 | 0.720 | 0.756

Results (2009)

Event | True Positives | False Alarm | Miss | Min DCR
Pointing | 13 (57) | 225 (2505) | 1050 | 1.006
Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060
Person Runs | 1 (0) | 38 (314) | 106 | 0.997
Object Put | 1 (21) | 190 (2703) | 620 | 1.020

Values in parentheses are the 2008 results.

Conclusions

Some practical problems are very hard to solve
  Fusion of many different approaches
Take advantage of all available hardware
  Cloud
  GPUs
Contests/Evaluations Experience
  Working with realistic data
  Engineering / Programming
  Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012
  Task - Semantic indexing (SIN)
  Task - Known-item search (KIS)
  Task - Interactive surveillance event detection (SED)
  Task - Instance search (INS)
  Task - Multimedia event detection (MED)
  Task - Multimedia event recounting (MER)

Pascal Visual Object Classes
  Classification/detection
  Segmentation
  Person Layout
  Action Classification

ImageNet Large Scale Visual Recognition
  10000 Classes

Page 2: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Semantic Gap Semantic Gap

Computer

Representation

Semantic Gap

Natural

Language

Representation

Semantic Gap Semantic Gap

Green

Semantic Gap Semantic Gap

Corner

Semantic Gap Semantic Gap

Roof

Semantic Gap Semantic Gap

Ski Slope

Semantic Gap Semantic Gap

Resort

Semantic Gap Semantic Gap

Fun

Holiday

Beautifulhellip

Semantic Gap in Multimedia Semantic Gap in Multimedia

Retrieval Given a description retrieve all ldquorelevantrdquo content from a

database

Parsing Given an input formulate a natural language description

Subtasks

Detection (find ldquothingsrdquo)

Segmentation (find the boundaries of ldquothingsrdquo)

Recognition (assign category)

Retrieval Given a description retrieve all ldquorelevantrdquo content from a

database

Parsing Given an input formulate a natural language description

Subtasks

Detection (find ldquothingsrdquo)

Segmentation (find the boundaries of ldquothingsrdquo)

Recognition (assign category)

Multimedia Analysis Competitions and

Evaluations

Multimedia Analysis Competitions and

Evaluations

Moderate size dataset

Training set with labels

Evaluation set without labels

Constrained problem

Detect well defined actions

Detect words or concepts

Well defined metric

Challenges

Algorithm design

Computation

Moderate size dataset

Training set with labels

Evaluation set without labels

Constrained problem

Detect well defined actions

Detect words or concepts

Well defined metric

Challenges

Algorithm design

Computation

Star Challenge Star Challenge

PART I visual data processing PART I visual data processing

What is Star Challenge What is Star Challenge

Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology

Hosted by the Agency for Science Technology and Research (ASTAR) Singapore

A real-world computer vision task which requires large amounts of computation power

Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology

Hosted by the Agency for Science Technology and Research (ASTAR) Singapore

A real-world computer vision task which requires large amounts of computation power

But low rewards But low rewards

56 teams

from 17 countries

Round 1

8 teams

Round 2

7 teams

Round 3

5 teams Grand Final

in Singapore

No rewards No rewards No rewards

No rewards

Only one team

can win

US$100000

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

But we have a team with no fearshellip But we have a team with no fearshellip

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

But we have a team with no fearshellip But we have a team with no fearshellip

Letrsquos go over our experience and

storieshellip

Letrsquos go over our experience and

storieshellip

Outlines Outlines

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks 3 Audio Retrieval Tasks Task Query Target Metric Data Set

AT1 IPA sequence segments that contain the

query IPA sequence

regardless of its languages

Mean Average Precision

25 hours

monolingual

database in

round1

13 hours

multilingual

database in

round3

AT2 an utterance spoken

by different speakers all segments that contain the

query wordphrasesentence

regardless of its spoken

languages

AT3 No queries extract all recurrent segments

which are at least 1 second in

length

F-measure

Xiaodan will talk about this parthelliphellip

3 Video Retrieval Tasks 3 Video Retrieval Tasks

Task Query Target Criteria Metric Data Set

VT1 Single

Image

20

queries

(short)

Video

Segs

All the

similar

Segs

ldquovisually

similarrdquo

Mean Average Precision

20 categories

multiple labels

possible

VT2 Short

Video

Shot

(lt10s)

20

queries

(long)

Video

Segs

All the

similar

Segs

Perceptually

Similar

10 categories

multiple labels

possible

VT3 Videos

with

sound

(3~10s)

Order

of 10K

Category

number

learning the

common

visual

characteristics

Classification accuracy 10(20)

categories

including one

ldquoothersrdquo

category

20 VT1 Categories 20 VT1 Categories

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

10 Categories for VT2

201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object (person, car, etc.)
206 Static or minute camera movement, people walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using Computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours
17,289 frames for VT1 in total
40,994 frames for VT2 in total: 32,508 pseudo key frames + 8,486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10,580 jpg files
Key frames for VT2: 64,546 files in total, including 10,580 jpg files (true key frames) + 53,966 jpg files (pseudo key frames)
Video: 352x288

Computation Power

Workstations in IFP
  10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster
  32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster
  Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
  Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node
TeraGrid

Time Cost for Video Tasks

Data decompression: 15 minutes
Video format conversion: 2 hours
Video segmentation (for VT2): 40 minutes
Sound track extraction: 30 minutes
Feature extraction
  Global Feature 2: 2 hours (C)
  Global Feature 1: 2 hours (C)
  Patch-based Feature 1: 2 hours (C)
  Patch-based Feature 2: 5 hours (MATLAB)
  Semantic Feature 1: 24 hours (MATLAB)
  Semantic Feature 2: 3 hours (C)
  Semantic Feature 3: 4 hours (C)
  Motion Feature 1: 24 hours (MATLAB)
  Motion Feature 2: 3 hours on t-Illiac
Classifier training
  Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
  Classifier 2: 20 minutes
  Classifier 3: less than 10 minutes

Possible Accelerations for Video

Port MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extract all the features after one load
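The point about load time can be illustrated with a minimal Python sketch: each frame is loaded once and every extractor runs on the in-memory copy. The names `extract_all`, `mean_intensity`, `row_means` and `fake_load` are hypothetical stand-ins, not part of the team's actual code, and the loader returns a fixed toy image instead of reading a file.

```python
def mean_intensity(frame):
    """Toy global feature: average pixel value of the frame."""
    pixels = [p for row in frame for p in row]
    return [sum(pixels) / len(pixels)]


def row_means(frame):
    """Toy layout feature: average pixel value of each row."""
    return [sum(row) / len(row) for row in frame]


def extract_all(paths, load, extractors):
    """Load each frame once, then run every extractor on the in-memory copy."""
    features = {}
    for path in paths:
        frame = load(path)  # the expensive step, paid once per frame
        features[path] = {name: fn(frame) for name, fn in extractors.items()}
    return features


# Stand-in loader that records how often it is called (no real file I/O).
loads = []


def fake_load(path):
    loads.append(path)
    return [[0.0, 2.0], [4.0, 6.0]]


result = extract_all(
    ["a.jpg", "b.jpg"],
    fake_load,
    {"mean": mean_intensity, "rows": row_means},
)
print(len(loads), result["a.jpg"])
```

With two extractors and two frames, the loader is called twice rather than four times; the saving grows with the number of feature types.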

Features for Round 2, VT1

Image features
  SIFT
  HOG
  GIST
  APC
  LBP
  Color, texture, etc.
Semantic feature

Features for Round 2, VT2

Character detector
  Harris corner
  Morphological operations
Optical flow
  Lucas-Kanade on spatial intensity gradient
Gender recognition
  SODA-boost based
Motion History Image
Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on NN
Feature normalization/combination
Result visualization

Observations

1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)

Algorithms

Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
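The query-expansion steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: features are plain lists of floats, distance is Euclidean, and an evaluation item is scored by its distance to the nearest member of the expanded query. The slides do not specify the scoring rule, and `expand_query` and `search` are hypothetical names.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def expand_query(query_feat, category, dev_feats, dev_labels):
    """Step 1: add every development image that carries the query's category."""
    expanded = [query_feat]
    expanded += [f for f, labels in zip(dev_feats, dev_labels) if category in labels]
    return expanded


def search(expanded, eval_feats, top_k):
    """Step 2: rank evaluation items by distance to the nearest expanded query."""
    scored = []
    for idx, feat in enumerate(eval_feats):
        nearest = min(euclidean(feat, q) for q in expanded)
        scored.append((nearest, idx))
    scored.sort()
    return [idx for _, idx in scored[:top_k]]


# Toy data: development images may carry multiple labels.
dev_feats = [[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]]
dev_labels = [{101}, {101, 116}, {102}]
eval_feats = [[5.1, 5.0], [9.0, 8.9], [0.3, 0.1], [3.0, 3.0]]

query = [0.1, 0.1]
expanded = expand_query(query, 101, dev_feats, dev_labels)
print(search(expanded, eval_feats, top_k=3))  # with expansion
print(search([query], eval_feats, top_k=3))   # query image alone
```

In this toy example the expanded query pulls in an evaluation item near the second development image of category 101, which the single query image alone would rank last.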

Algorithms

Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
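Step 2 can be illustrated with a small Python sketch of MAP adaptation of a GMM toward one image's patches. The assumptions here are mine, not the slides': the features are one-dimensional, only the component means are adapted, and the relevance factor of 16 is a conventional choice. `map_adapt_means` is a hypothetical name.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Shift each UBM mean toward the patches it explains, weighted by soft counts."""
    k = len(means)
    counts = [0.0] * k  # soft number of patches assigned to each component
    sums = [0.0] * k    # posterior-weighted sum of patch values
    for x in patches:
        likes = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:
            continue  # patch is numerically unexplained by every component
        for j in range(k):
            post = likes[j] / total
            counts[j] += post
            sums[j] += post * x
    adapted = []
    for j in range(k):
        alpha = counts[j] / (counts[j] + relevance)  # trust in the image's data
        data_mean = sums[j] / counts[j] if counts[j] > 0 else means[j]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[j])
    return adapted


# Two-component UBM; all patches of this image sit near the first component.
ubm_weights, ubm_means, ubm_vars = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]
adapted = map_adapt_means([1.0] * 16, ubm_weights, ubm_means, ubm_vars)
print(adapted)
```

Components that see many patches move toward the image's data, while components that see almost none stay at the UBM mean, so every image is described in the same coordinate system.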

VT1 Performance (2 in 8)

Category: MAP
• 101 Crowd (>10 people): 0.8419
• 102 Building with sky as backdrop, clearly visible: 0.977
• 103 Mobile devices, including handphone/PDA: 0.028
• 107 Person using Computer, both visible: 0.2281
• 109 Company Trademark, including billboard, logo: 0.96
• 112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
• 113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
• 115 Food on dishes, plates: 0.2285
• 116 Face close-up, occupying about 3/4 of screen, frontal or side: 0.9783
• 117 Traffic Scene, many cars, trucks, road visible: 0.2901
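For reference, mean average precision (MAP) can be computed as in the sketch below. It uses the common textbook definition of average precision over a ranked list; the competition's exact variant (for example, any cutoff on the list length) is not given in the slides, so treat this as an assumption.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average the per-query AP over all (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "x", "b", "y"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["x", "c"], {"c"}),                 # AP = 1/2
]
print(mean_average_precision(runs))
```

A relevant item that is never retrieved contributes zero precision, so incomplete result lists are penalized.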

VT2 Performance (1 in 8)

Category: MAP
• 202 Talking face with introductory caption: 0.8432
• 206 Static or minute camera movement, people walking, legs visible: 0.0581
• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789
• 208 Movie ending credit: 0.2782
• 209 Woman monologue (Zhen): 0.9756

Performance of Round 3 (1 in 7)

Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using Computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side: 1

Task 3 (AT1 + VT2)
Retrieval target (video, R=20)        VT2 only   AT1 + VT2
202 Face with introductory caption    1          0.03
209 Woman monologue                   0.35       0.1
201 People entering door              N/A

We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall

TRECVid

TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our task: Surveillance Event Detection

Surveillance Event Detection

The dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million

Surveillance Event Detection

List of events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator

Detection of Opposing Flow Event

[Diagram blocks: Regional Averaging, Door Open/Close Information, Event Detections, Thresholding Rule]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool
Rapid development
Fast execution
Python glue layer
CUDA, C/C++
Integrates libraries
Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
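The data-flow ideas (lazy pull, per-frame referencing, caches) can be sketched in Python as below. This is not ViVid's actual API; `Source` and `Node` are hypothetical classes that only illustrate how a result for frame `t` is computed when first requested and then reused.

```python
class Source:
    """Leaf of the graph: produces frame t on demand (here, a synthetic number)."""

    def __init__(self):
        self.reads = 0

    def get(self, t):
        self.reads += 1
        return float(t)


class Node:
    """Applies fn to its parents' outputs for frame t, only when someone asks."""

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents
        self.cache = {}  # per-frame results, keyed by frame index

    def get(self, t):
        if t not in self.cache:
            inputs = [p.get(t) for p in self.parents]  # pull from upstream
            self.cache[t] = self.fn(*inputs)
        return self.cache[t]


src = Source()
gray = Node(lambda f: f * 2, src)   # stand-in for a preprocessing stage
a = Node(lambda g: g + 1, gray)     # two consumers share the same upstream node
b = Node(lambda g: g - 1, gray)

print(a.get(3), b.get(3), src.reads)
```

Both consumers request frame 3, but the shared stage computes it once, and no other frame is ever touched.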

Motivations

Most operations are highly local
Applications with real-time (or faster) performance requirements
  Surveillance
  Soft biometrics
  Multimedia indexing
Visual computing is here
  Imaging and Photogrammetry
  Pattern Recognition and Statistical Learning
  Object Detection and Recognition
  Dynamic Vision
  Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data
  20 hours of video uploaded to YouTube every minute
  15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
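The "~5 days" figure can be checked with a few lines of arithmetic. The frame rate is not stated on the slide; 30 frames per second is assumed here.

```python
HOURS_UPLOADED_PER_MINUTE = 20      # from the slide
FACEBOOK_PHOTOS = 15_000_000_000    # from the slide
FPS = 30                            # assumption: not stated on the slide

# Seconds of video uploaded per day: 20 h * 3600 s/h per minute, 1440 minutes per day.
video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
frames_per_day = video_seconds_per_day * FPS
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(frames_per_day, round(days_to_match, 1))
```

At 30 fps this gives about 3.1 billion frames per day, so roughly 4.8 days to reach 15 billion; at 25 fps the figure would be closer to 5.8 days.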

Image/Video Processing
  Video decoder
  2D/3D convolution
  2D/3D Fourier transform
  Optical flow
Feature Extraction
  Motion descriptor (Efros et al.)
  Motion history descriptor
  Random video interest points
  Histograms of oriented gradients / optical flow
Analysis
  Vector quantization
  SVM classifier evaluation

ViVid: Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[System diagram; block labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev
  3D Harris corner detector
  Corners
Dollar
  Space-time Gabor
  Corners
  Periodic motion
RSMB
  Random Sampling of the Motion Boundary
  Motion

[Figure panels: Laptev, Dollar, RSMB]

More is Good

[Figure: Interest Point Detection Rates]

Video Features

Descriptors of information relevant to the task
  Motion
  Shape
  Appearance
Computationally intensive
  Development
  Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\ H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
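The motion history update is simple enough to sketch directly in Python. Here D is taken to be a binary motion mask for the current frame (for example, a thresholded frame difference), and images are nested lists; `update_mhi` is a hypothetical helper name.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step: moving pixels are set to tau, all others decay by 1 toward 0."""
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]


# A blob moving left to right across a 1x3 image, with tau = 3.
mhi = [[0, 0, 0]]
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    mhi = update_mhi(mhi, mask, tau=3)
print(mhi)
```

The most recently moving pixel holds the largest value and older motion fades, so a single image encodes both where and how recently motion occurred.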

[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin) and CUDA (feature + distance + argmin); extracted bar labels: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
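The three steps can be sketched in plain Python for the image-gradient case. The details are assumptions rather than the slides' exact recipe: 9 unsigned orientation bins, central-difference gradients, magnitude-weighted votes, and L2 normalization over the concatenated histograms of neighboring cells. For the optical-flow variant, the same histogram would be built from flow vectors instead of gradients.

```python
import math


def cell_histogram(image, top, left, size, bins=9):
    """Magnitude-weighted histogram of unsigned gradient directions in one cell."""
    height, width = len(image), len(image[0])
    hist = [0.0] * bins
    for y in range(top, top + size):
        for x in range(left, left + size):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, width - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, height - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle * bins / 180.0) % bins] += magnitude
    return hist


def normalize_block(cell_hists, eps=1e-9):
    """L2-normalize the concatenated histograms of neighboring cells."""
    block = [v for hist in cell_hists for v in hist]
    norm = math.sqrt(sum(v * v for v in block) + eps)
    return [v / norm for v in block]


# 4x4 image with a vertical edge, split into four 2x2 cells.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
cells = [cell_histogram(image, top, left, 2) for top in (0, 2) for left in (0, 2)]
descriptor = normalize_block(cells)
print(len(descriptor), cells[0])
```

Normalizing over a block of cells rather than a single cell makes the descriptor less sensitive to local contrast changes.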

K-Means Clustering

Vector quantization
  Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"
Lloyd's algorithm
  For each data point, find the closest center
  Update each center to be the mean of the associated data points
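The two steps of Lloyd's algorithm can be sketched in a few lines of Python. This is a minimal illustration with a fixed iteration count and caller-supplied initial centers; a center that attracts no points is simply left where it is, which is one of several possible conventions.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def lloyd(points, centers, iterations=10):
    """Alternate between assigning points to centers and re-estimating centers."""
    for _ in range(iterations):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its assigned points.
        centers = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else old
            for members, old in zip(clusters, centers)
        ]
    return centers


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centers = lloyd(points, [[0.0, 0.0], [10.0, 10.0]])
print(centers)
```

The assignment step is a pairwise-distance computation followed by an argmin, which is why the cost of k-means is dominated by distance evaluation.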

K-Means

Relies heavily on pairwise distance
Large data sets
  1 million features with 100-200 dimensions
  1000 centers
Cannot fit output in GPU memory
  Will need to reduce as computation proceeds
  Need efficient reduction operator
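A rough size estimate and a sketch of reducing as the computation proceeds, written in plain Python for clarity rather than CUDA. The 4-byte float size and the chunking scheme are assumptions; `nearest_centers` is a hypothetical name. Only one block of distances exists at a time, and each block is reduced to argmin indices before the next is computed.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_centers(features, centers, chunk_size=2):
    """Assign each feature to its closest center without storing all distances."""
    assignments = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance block for this chunk only: len(chunk) x len(centers).
        block = [[sq_dist(f, c) for c in centers] for f in chunk]
        # Reduction: keep just the index of the smallest distance in each row.
        assignments.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return assignments


# Full distance matrix for the sizes on the slide, assuming 4-byte floats.
full_matrix_bytes = 1_000_000 * 1000 * 4

features = [[0.0, 0.0], [9.0, 9.0], [1.0, 0.0], [8.0, 9.0], [0.0, 1.0]]
centers = [[0.0, 0.0], [9.0, 9.0]]
print(full_matrix_bytes, nearest_centers(features, centers))
```

The full matrix would take 4 GB, while the reduced output is a single index per feature.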

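Clustering Helps

Pairwise Distance Implementation

[Figure: bar chart comparing CUDA and C; axis ticks 0, 2000, 4000, 6000]

Given

$$
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
$$

Compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & \cdots & \cdots & d(a_m, b_n)
\end{bmatrix}
$$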

CPU vs GPU

Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence

Pairwise Distance Computation on the GPU

[Figure: input matrices A and B, output matrix C, and shared memory]

Timings on TRECVid 2008 System

[Figure: bar chart on a log scale from 1 to 10000 milliseconds, comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction and Pairwise Distance; extracted bar labels: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

                                         Rate of Detection
Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
                  Norm                   0.708   0.799    0.840
                  Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
                  Norm                   0.701   0.791    0.825
                  Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
                  Norm                   0.701   0.779    0.818
                  Mt Inf                 0.614   0.720    0.756
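As an illustration of the "Raw" and "Norm" histogramming rows, the sketch below builds a bag-of-words histogram from quantized feature indices. The slides do not define the normalization; dividing by the total count (L1 normalization) is assumed here, and `word_histogram` is a hypothetical name.

```python
def word_histogram(word_ids, dictionary_size, normalize=False):
    """Count how often each dictionary entry occurs among the quantized features."""
    hist = [0.0] * dictionary_size
    for w in word_ids:
        hist[w] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [h / total for h in hist]  # assumed: L1 normalization
    return hist


word_ids = [0, 2, 2, 3]  # nearest-center index of each feature in one clip
raw = word_histogram(word_ids, dictionary_size=4)
norm = word_histogram(word_ids, dictionary_size=4, normalize=True)
print(raw, norm)
```

Normalizing removes the dependence on how many interest points a clip produced, which may explain why it helps most at the low detection rate in the table.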

Results (2009)

              True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

(2008 results in parentheses)

Conclusions

Some practical problems are very hard to solve
  Fusion of many different approaches
Take advantage of all available hardware
  Cloud
  GPUs
Contests/Evaluations: experience
  Working with realistic data
  Engineering, programming
  Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012
  Task: Semantic indexing (SIN)
  Task: Known-item search (KIS)
  Task: Interactive surveillance event detection (SED)
  Task: Instance search (INS)
  Task: Multimedia event detection (MED)
  Task: Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection
Segmentation
Person layout
Action classification

ImageNet Large Scale Visual Recognition

10,000 classes

20 VT1 Categories 20 VT1 Categories

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

10 Categories for VT2 10 Categories for VT2

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3 5 Categories for VT3

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

Examples of Images Examples of Images

More samples More samples

Evaluation Video Data of Round2 Evaluation Video Data of Round2

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

Evaluation Video Data of Round3 Evaluation Video Data of Round3

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Computation Powers Computation Powers

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Time Cost for Video Tasks Time Cost for Video Tasks

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Possible Accelerations for Video Possible Accelerations for Video

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE Grand Unified Feature

Extractor

GUFE Grand Unified Feature

Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly

e.g. VT1 101 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
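The two query-expansion steps can be sketched as follows. This is only an illustration, not the team's code: the Euclidean distance, the nearest-expanded-query scoring rule, the data layout, and all function names are assumptions, and the preprocessing step 0 is not modeled.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def expand_query(query_vec, category, dev_set):
    """Step 1: the query plus every development image with the same category.

    dev_set is a list of (feature_vector, category) pairs (hypothetical layout).
    """
    return [query_vec] + [vec for vec, cat in dev_set if cat == category]

def search(expanded, eval_set, top_k):
    """Step 2: rank evaluation items by distance to the nearest expanded query."""
    scored = []
    for item_id, vec in eval_set:
        nearest = min(euclidean(vec, q) for q in expanded)
        scored.append((nearest, item_id))
    scored.sort()
    return [item_id for _, item_id in scored[:top_k]]

# Toy data: two development images share category 101 with the query.
dev_set = [([0.0, 1.0], 101), ([5.0, 5.0], 102), ([0.2, 0.9], 101)]
eval_set = [("e1", [0.1, 1.0]), ("e2", [5.1, 4.9]),
            ("e3", [0.3, 0.8]), ("e4", [9.0, 9.0])]

expanded = expand_query([0.0, 0.8], 101, dev_set)
top = search(expanded, eval_set, top_k=2)
print(top)
```

Scoring against the nearest member of the expanded set lets one labeled development image pull in evaluation images that the single query would have ranked lower.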

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance

VT1 Performance (2 in 8)

Category: MAP

• 101 Crowd (>10 people): 0.8419

• 102 Building with sky as backdrop, clearly visible: 0.977

• 103 Mobile devices including handphone/PDA: 0.028

• 107 Person using Computer, both visible: 0.2281

• 109 Company Trademark, including billboard, logo: 0.96

• 112 Closeup of hand, e.g. using mouse, writing, etc.: 0.4584

• 113 Business meeting (> 2 people), mostly seated down, table visible: 0.0644

• 115 Food on dishes, plates: 0.2285

• 116 Face closeup, occupying about 3/4 of screen, frontal or side: 0.9783

• 117 Traffic Scene, many cars, trucks, road visible: 0.2901
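For reference, the MAP column can be read with the textbook definition of average precision. The sketch below uses that standard definition; the exact variant used by the challenge organizers is not stated on the slide, so treat the normalization (dividing by the number of relevant items found unless `num_relevant` is given) as an assumption.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """AP of one ranked list; ranked_relevance[i] is True if result i is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant rank
    denom = num_relevant if num_relevant is not None else hits
    return precision_sum / denom if denom else 0.0

def mean_average_precision(runs):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Relevant results at ranks 1 and 3: AP = (1/1 + 2/3) / 2
ap = average_precision([True, False, True, False])
map_score = mean_average_precision([[True], [False, True]])
print(ap, map_score)
```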

VT2 Performance (1 in 8)

Category: MAP

• 202 Talking face with introductory caption: 0.8432

• 206 Static or minute camera movement, people(s) walking, legs visible: 0.0581

• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789

• 208 Movie ending credit: 0.2782

• 209 Woman monologue (Zhen): 0.9756

Performance of Round 3 (1 in 7)

Task 2 (VT1)

Target: Estimated MAP (R=20)

101 Crowd (>10 people): 0.64

102 Building with sky as backdrop, clearly visible: 1

107 Person using Computer, both visible: 0.7

112 Closeup of hand, e.g. using mouse, writing, etc.: 0.527

116 Face closeup, occupying about 3/4 of screen, frontal or side: 1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20): VT2 only, AT1 + VT2

202 face with introductory caption: 1, 0.03

209 women monolog: 0.35, 0.1

201 People entering door: NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Regional Averaging

Door Open/Close Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates libraries

Data flow

Lazy pull

Per-frame referencing

Caches (lots of them)

Motivations

Most operations are highly local

Applications with real-time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
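The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption, not stated on the slide

# Seconds of video uploaded per day: 20 h per minute, 60 * 24 minutes per day.
video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
frames_per_day = video_seconds_per_day * FRAMES_PER_SECOND
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(frames_per_day)            # about 3.1 billion frames per day
print(round(days_to_match, 1))   # a little under 5 days
```

At 25 frames per second the same calculation gives a little under 6 days, so the estimate is not very sensitive to the assumed frame rate.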

Image / Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid – Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Diagram: blocks labeled Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space-Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al., 2003)

Motion History Images (Bobick & Davis, 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
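A minimal sketch of the motion history update, in plain Python on nested lists. The function name is hypothetical, and the binary motion mask D is taken as given; how D is computed (for example by thresholded frame differencing) is not specified on the slide.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step of H_tau: tau where D = 1, otherwise decay by 1 (floor 0)."""
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]

# A single-row "image" where motion sweeps from left to right over 3 frames.
tau = 3
mhi = [[0, 0, 0]]
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    mhi = update_mhi(mhi, mask, tau)

print(mhi)  # the most recent motion holds the largest value
```

The resulting intensity ramp encodes both where motion occurred and how recently, which is what makes the image usable as a motion descriptor.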

[Figure: bar chart of milliseconds per frame (axis 0 to 300) for CUDA (feature + distance + argmin), CUDA (distance + argmin), CUDA (distance), and C; bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient / optical flow based on the direction and magnitude

• Normalize over neighboring regions
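The histogram and normalization steps above can be sketched as follows for a single local region. This is an illustration under assumptions: the bin count, the use of the full 0 to 2π direction range, magnitude-weighted voting without interpolation, and L2 normalization are common choices, not details given on the slide.

```python
import math

def cell_histogram(gx, gy, num_bins=8):
    """Magnitude-weighted histogram of directions for one local region.

    gx, gy hold the per-pixel x and y components of the image gradient
    (or of the optical flow) as nested lists of equal shape.
    """
    hist = [0.0] * num_bins
    for row_x, row_y in zip(gx, gy):
        for dx, dy in zip(row_x, row_y):
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)   # direction in [0, 2*pi)
            bin_index = int(angle / (2 * math.pi) * num_bins) % num_bins
            hist[bin_index] += magnitude
    return hist

def l2_normalize(block, eps=1e-6):
    """Normalize the concatenated histograms of neighboring regions."""
    norm = math.sqrt(sum(v * v for v in block) + eps * eps)
    return [v / norm for v in block]

# Two pixels: one pointing roughly right, one roughly left.
hist = cell_histogram([[1.0, -1.0]], [[0.1, 0.1]], num_bins=4)
normalized = l2_normalize(hist)
print(hist, normalized)
```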

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update each center to be the mean of its associated data points
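The two alternating steps of Lloyd's algorithm can be written out in a few lines of plain Python. This is a reference sketch only: the fixed iteration count, the caller-supplied initial centers, and the handling of empty clusters are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def lloyd_kmeans(points, centers, iterations=10):
    """Alternate assignment and update steps; returns (centers, assignments)."""
    centers = [list(c) for c in centers]
    assignments = []
    for _ in range(iterations):
        # Assignment step: closest center for each data point.
        assignments = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: each center becomes the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignments

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centers, labels = lloyd_kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(centers, labels)
```

The assignment step is the pairwise-distance computation between all points and all centers, which is why it dominates the running time on large feature sets.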

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as the computation proceeds

Need an efficient reduction operator
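A rough size estimate shows why the full output does not fit, and the sketch after it shows the idea of reducing while computing. The 4 bytes per value is an assumption (single-precision floats and 32-bit indices), and the function name is hypothetical; a GPU version would do the same per-row argmin reduction inside the kernel.

```python
def assign_with_running_argmin(points, centers):
    """Nearest-center index per point, never storing the distance matrix."""
    assignments = []
    for p in points:
        best_k, best_d = 0, float("inf")
        for k, c in enumerate(centers):
            d = sum((a - b) ** 2 for a, b in zip(p, c))
            if d < best_d:      # running argmin: the reduction step
                best_k, best_d = k, d
        assignments.append(best_k)
    return assignments

# Output sizes for 1 million features and 1000 centers (4 bytes per value assumed).
full_matrix_bytes = 1_000_000 * 1000 * 4   # every distance kept: about 4 GB
reduced_bytes = 1_000_000 * 4              # one index per feature: about 4 MB

nearest = assign_with_running_argmin([[0.0, 0.0], [9.0, 9.0]],
                                     [[1.0, 1.0], [10.0, 10.0]])
print(full_matrix_bytes, reduced_bytes, nearest)
```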

Clustering Helps

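Pairwise Distance Implementation

[Figure: bar chart comparing the CUDA and C implementations; axis ticks 0, 2000, 4000, 6000]

Given

$$
A = \{a_1, \dots, a_m\}, \qquad B = \{b_1, \dots, b_n\}
$$

Compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
$$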
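As a reference for the matrix above, here is a plain-Python sketch. The slide leaves the distance d unspecified; squared Euclidean distance is assumed here, written through the expansion ‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b. The dot-product term has the same structure as a matrix product, which is one reason this computation tends to map well to GPUs.

```python
def pairwise_squared_distances(A, B):
    """Return the m x n matrix of squared Euclidean distances d(a_i, b_j)."""
    a_norms = [sum(x * x for x in a) for a in A]   # ||a_i||^2, computed once
    b_norms = [sum(x * x for x in b) for b in B]   # ||b_j||^2, computed once
    return [
        [a_norms[i] + b_norms[j] - 2.0 * sum(x * y for x, y in zip(a, b))
         for j, b in enumerate(B)]
        for i, a in enumerate(A)
    ]

A = [[0, 0], [1, 1]]
B = [[1, 0], [0, 2], [1, 1]]
D = pairwise_squared_distances(A, B)
print(D)
```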

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Figure: diagram with blocks labeled Shared Memory, A, B, C]

Timings on TRECVid 2008 System

[Figure: bar chart on a logarithmic axis (1 to 10000 milliseconds) comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; bar labels as extracted: 53, 79, 23, 240, 150 and 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
                  Norm                   0.708   0.799    0.840
                  Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
                  Norm                   0.701   0.791    0.825
                  Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
                  Norm                   0.701   0.779    0.818
                  Mt Inf                 0.614   0.720    0.756

Results (2009)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

(2008 results in parentheses)

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering, programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10000 Classes

Star Challenge

Part I: Visual data processing

What is Star Challenge?

Competition to develop the world's next-generation multimedia search technology

Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore

A real-world computer vision task which requires large amounts of computation power

But low rewards

Round 1: 56 teams from 17 countries

Round 2: 8 teams

Round 3: 7 teams

Grand Final in Singapore: 5 teams

No rewards along the way

Only one team can win US$100,000

But we have a team with no fears...

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi Vong Xu Mert Dennis Jason Andrey Yuxiao

Let's go over our experience and stories...

Outline

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks

AT1

Query: IPA sequence

Target: segments that contain the query IPA sequence, regardless of its languages

Metric: Mean Average Precision

AT2

Query: an utterance spoken by different speakers

Target: all segments that contain the query word/phrase/sentence, regardless of its spoken languages

AT3

Query: no queries

Target: extract all recurrent segments which are at least 1 second in length

Metric: F-measure

Data Set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3

Xiaodan will talk about this part...

3 Video Retrieval Tasks

VT1

Query: single image (20 queries)

Target: (short) video segs; all the similar segs

Criteria: "visually similar"

Metric: Mean Average Precision

Data Set: 20 categories, multiple labels possible

VT2

Query: short video shot (<10s) (20 queries)

Target: (long) video segs; all the similar segs

Criteria: perceptually similar

Data Set: 10 categories, multiple labels possible

VT3

Query: videos with sound (3~10s), order of 10K

Target: category number

Criteria: learning the common visual characteristics

Metric: classification accuracy

Data Set: 10 (20) categories, including one "others" category

20 VT1 Categories

100 Not-Applicable: none of the labels

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

103 Mobile devices including handphone/PDA

104 Flag

105 Electronic chart, e.g. stock charts, airport departure chart

106 TV chart overlay, including graphs, text, PowerPoint style

107 Person using Computer, both visible

108 Track and field, sports

109 Company Trademark, including billboard, logo

110 Badminton court, sports

111 Swimming pool, sports

112 Close-up of hand, e.g. using mouse, writing, etc.

113 Business meeting (> 2 people), mostly seated down, table visible

114 Natural scene, e.g. mountain, trees, sea, no people

115 Food on dishes, plates

116 Face close-up, occupying about 3/4 of screen, frontal or side

117 Traffic Scene, many cars, trucks, road visible

118 Boat/Ship, over sea, lake

119 PC Webpages, screen of PC visible

120 Airplane

10 Categories for VT2

201 People entering/exiting door/car

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle, looking outside

205 Large camera movement, tracking an object, person, car, etc.

206 Static or minute camera movement, people(s) walking, legs visible

207 Large camera movement, panning left/right, top/down of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

107 Person using Computer, both visible

112 Closeup of hand, e.g. using mouse, writing, etc.

116 Face closeup, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames + 8486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)

Key frames for VT1: 10580 jpg files

Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)

Video: 352x288

Computation Power

Workstations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes

Video Format Conversion: 2 hours

Video Segmentation (for VT2): 40 minutes

Sound Track Extraction: 30 minutes

Feature Extraction

Global Feature 2: 2 hours (C)

Global Feature 1: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (Matlab)

Semantic Feature 1: 24 hours (Matlab)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (Matlab)

Motion Feature 2: 3 hours (on t-Illiac)

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPUs, Matlab)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video Possible Accelerations for Video

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE Grand Unified Feature

Extractor

GUFE Grand Unified Feature

Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Observations Observations 1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

Algorithms Algorithms

Input a query image and its category number

0 Preprocessing compute the matching between the evaluation and the

development data

Query Expansion

1 Expand the query image by retrieving all the images from the development

data set with the same category

2 Search the evaluation set with the expanded query

Output return the top 5020 results

Algorithms Algorithms

Motivation using a GMM to model the distribution of

patches

1 Train a UBM (Universal Background Model) based on

patches from all training images

2 MAP Estimation of the distribution of the patches

belonging to one image given UBM

3 Compute pair-wise image distance based on patch

kernel and within-class covariance normalization

3 Retrieving images based the normalized distance

VT1 Performance (2 in 8) VT1 Performance (2 in 8)

Category MAP

bull101 Crowd (gt10 people) 08419

bull102 Building with sky as backdrop clearly visible 0977

bull103 Mobile devices including handphonePDA 0028

bull107 Person using Computer both visible 02281

bull109 Company Trademark including billboard logo 096

bull112 Closeup of hand eg using mouse writing etc 04584

bull113 Business meeting (gt 2 people) mostly seated down table visible 00644

bull115 Food on dishes plates 02285

bull116 Face closeup occupying about 34 of screen frontal or side 09783

bull117 Traffic Scene many cars trucks road visible 02901

VT2 Performance(1in8) VT2 Performance(1in8)

Category MAP

bull202 Talking face with introductory caption 08432

bull206 Static or minute camera movement people(s)

walking legs visible 00581

bull207 Large camera movement panning leftright

topdown of a scene 07789

bull208 Movie ending credit 02782

bull209 Woman monologue Zhen 09756

Performance of Round3 (1in7) Performance of Round3 (1in7)

Task 2 (VT1)

Target Estimated MAP (R=20)

101 Crowd (gt10 people) 064

102 Building with sky as backdrop clearly visible 1

107 Person using Computer both visible 07

112 Closeup of hand eg using mouse writing etc 0527

116 Face closeup occupying about 34 of screen frontal or side 1

Task3 (AT1 + VT2)

Retrieval Target VT2 only AT1 + VT2

Video R=20

202 face with introductory caption 1 003

209 women monolog 035 01

201 People entering door NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC Text REtrieval Conference TRECVid Video Retrieval Workshop

Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million

Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid Working with ViVid

Why parallel Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days

Image Video Processing

Video Decoder

2D3D Convolution

2D3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients Optical Flow

Analysis Vector Quantization

SVM Classifier Evaluation

ViVid ndash Video Computer Vision on Graphics Processors

Download

httpgithubcommertdikmenViVid

TRECVid 2008 System

TRECVid 2009 System TRECVid 2009 System

Features Video

Motion

Shape

Classifier

Event Label

bull Running

bull Pointing

bull Object Put

bull Cell To Ear

Vector

Quantization

Histogram

Interest Points

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
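The three steps can be sketched as follows for the gradient case. This is a simplified illustration under stated assumptions: central-difference gradients, unsigned orientations, hard binning, and L2 block normalization are choices made here, not details given on the slide, and the function names are hypothetical. For optical flow, the flow vector (u, v) would take the place of (gx, gy).

```python
import math


def cell_histograms(image, cell=2, bins=4):
    """Magnitude-weighted orientation histograms, one per cell x cell region."""
    rows, cols = len(image), len(image[0])
    n_cy, n_cx = rows // cell, cols // cell
    hists = [[[0.0] * bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi    # unsigned orientation
            b = min(int(angle / math.pi * bins), bins - 1)
            cy, cx = y // cell, x // cell
            if cy < n_cy and cx < n_cx:
                hists[cy][cx][b] += mag
    return hists


def normalize_block(block_hists, eps=1e-6):
    """L2-normalize the concatenated histograms of neighboring cells."""
    flat = [v for h in block_hists for v in h]
    norm = math.sqrt(sum(v * v for v in flat) + eps)
    return [v / norm for v in flat]


image = [[0, 0, 1, 1]] * 4  # a vertical edge
hists = cell_histograms(image)
print(hists[0][0])  # [1.0, 0.0, 0.0, 0.0]
block = normalize_block([hists[0][0], hists[0][1], hists[1][0], hists[1][1]])
```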

K-Means Clustering

Vector quantization
Turns high dimensional features into a discrete number of points
Given data, find representative "centers"

Lloyd's algorithm
For each data point, find the closest center
Update the center to be the mean of the associated data points
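The two alternating steps of Lloyd's algorithm can be sketched as below. This is a small pure-Python illustration with hypothetical names (`lloyd`, `sq_dist`); it assumes squared Euclidean distance and caller-supplied initial centers, neither of which is specified on the slide.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, iterations=10):
    """Alternate assignment and mean-update steps from the given centers."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for each point.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = lloyd(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
print(labels)   # [0, 0, 1, 1]
```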

K-Means

Relies heavily on pairwise distance

Large data sets
1 million features with 100-200 dimensions
1000 centers

Cannot fit the full distance output in GPU memory
Will need to reduce as the computation proceeds
Need an efficient reduction operator
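To see why the output does not fit, assume 4-byte floats: 1 million features against 1000 centers gives 10^9 distances, or about 4 GB. The sketch below illustrates the reduce-as-you-go idea on the CPU: distances are computed one chunk at a time and immediately reduced to an argmin per point. The function name and chunk size are hypothetical, and a GPU version would do the same per tile.

```python
def nearest_centers(points, centers, chunk_size=2):
    """Assign each point to its closest center without keeping all distances.

    Only one chunk_size x len(centers) block of distances exists at a time;
    each block is reduced to an argmin per point before the next is computed.
    """
    labels = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        block = [[sum((x - y) ** 2 for x, y in zip(p, c)) for c in centers]
                 for p in chunk]
        labels.extend(row.index(min(row)) for row in block)  # argmin reduction
    return labels


full_matrix_bytes = 1_000_000 * 1000 * 4  # assumes 4-byte floats
print(full_matrix_bytes / 1e9)  # 4.0
print(nearest_centers([(0, 0), (9, 9), (1, 0)], [(0, 0), (10, 10)]))  # [0, 1, 0]
```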

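Clustering Helps

Pairwise Distance Implementation

[Figure: bar chart comparing the CUDA and C implementations; axis 0 to 6000]

Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute

$$
\begin{bmatrix}
d(a_1, b_1) & \cdots & d(a_1, b_n) \\
\vdots & \ddots & \vdots \\
d(a_m, b_1) & \cdots & d(a_m, b_n)
\end{bmatrix}
$$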

CPU vs GPU

Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence

Pairwise Distance Computation on the GPU

[Figure: three-step diagram showing matrices A, B, and C and a Shared Memory block]

Timings on TRECVid 2008 System

[Figure: bar chart, milliseconds on a logarithmic axis (1 to 10000), comparing GPU + CPU against CPU for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size | Histogramming Method | Rate of Detection: Low | Medium | High
1000 | Raw    | 0.681 | 0.804 | 0.844
1000 | Norm   | 0.708 | 0.799 | 0.840
1000 | Mt Inf | 0.594 | 0.804 | 0.848
500  | Raw    | 0.675 | 0.792 | 0.833
500  | Norm   | 0.701 | 0.791 | 0.825
500  | Mt Inf | 0.626 | 0.783 | 0.819
200  | Raw    | 0.671 | 0.772 | 0.811
200  | Norm   | 0.701 | 0.779 | 0.818
200  | Mt Inf | 0.614 | 0.720 | 0.756
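As a rough illustration of the "Raw" and "Norm" rows, the sketch below builds a codeword histogram from quantized features, with and without normalization. The slide does not say which normalization was used; L1 normalization is an assumption here, and the function name and sample codes are made up for the example.

```python
def codeword_histogram(codes, dictionary_size, normalize=False):
    """Count how often each dictionary entry (codeword index) occurs."""
    hist = [0.0] * dictionary_size
    for code in codes:
        hist[code] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [h / total for h in hist]  # assumed L1 normalization
    return hist


codes = [0, 2, 2, 3]  # quantized features of one clip (made-up values)
print(codeword_histogram(codes, 4))                  # [1.0, 0.0, 2.0, 1.0]
print(codeword_histogram(codes, 4, normalize=True))  # [0.25, 0.0, 0.5, 0.25]
```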

Results (2009)

Values in parentheses are the 2008 results.

Event       | True Positives | False Alarm | Miss | Min DCR
Pointing    | 13 (57)        | 225 (2505)  | 1050 | 1.006
Cell To Ear | 0 (8)          | 58 (4005)   | 194  | 1.060
Person Runs | 1 (0)          | 38 (314)    | 106  | 0.997
Object Put  | 1 (21)         | 190 (2703)  | 620  | 1.020
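As a reading aid for the 2009 counts, the sketch below converts true positives, false alarms, and misses into precision and recall. These are not the Min DCR metric reported on the slide, whose cost parameters are not given here; the helper name is hypothetical.

```python
def precision_recall(true_positives, false_alarms, misses):
    """Precision over all detections and recall over all actual events."""
    detections = true_positives + false_alarms
    actual = true_positives + misses
    precision = true_positives / detections if detections else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# (true positives, false alarms, misses) from the 2009 results
results_2009 = {
    "Pointing": (13, 225, 1050),
    "Cell To Ear": (0, 58, 194),
    "Person Runs": (1, 38, 106),
    "Object Put": (1, 190, 620),
}
for event, counts in results_2009.items():
    p, r = precision_recall(*counts)
    print(f"{event}: precision={p:.3f} recall={r:.3f}")
```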

Conclusions

Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
Cloud
GPUs

Contests/Evaluations: Experience
Working with realistic data
Engineering / Programming
Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection
Segmentation
Person Layout
Action Classification

ImageNet
Large Scale Visual Recognition
10000 Classes


Star Challenge

PART I: visual data processing

What is Star Challenge?

Competition to develop the world's next-generation multimedia search technology
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
A real-world computer vision task which requires large amounts of computation power

But low rewards

56 teams from 17 countries -> Round 1 -> 8 teams -> Round 2 -> 7 teams -> Round 3 -> 5 teams -> Grand Final in Singapore

No rewards at the intermediate stages

Only one team can win US$100,000

But we have a team with no fears…

Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar, Sean, Jui-Ting, Zhen, Huazhong, Xi, Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao

Let's go over our experience and stories…

Outline

Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)

3 Audio Retrieval Tasks

Task | Query | Target | Metric
AT1 | IPA sequence | Segments that contain the query IPA sequence, regardless of language | Mean Average Precision
AT2 | An utterance spoken by different speakers | All segments that contain the query word/phrase/sentence, regardless of spoken language |
AT3 | No queries | Extract all recurrent segments that are at least 1 second in length | F-measure

Data Set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3

Xiaodan will talk about this part…

3 Video Retrieval Tasks

Task | Query | Target | Criteria | Metric | Data Set
VT1 | Single image; 20 queries | (Short) video segments: all the similar segments | "Visually similar" | Mean Average Precision | 20 categories, multiple labels possible
VT2 | Short video shot (<10 s); 20 queries | (Long) video segments: all the similar segments | Perceptually similar | | 10 categories, multiple labels possible
VT3 | Videos with sound (3~10 s); order of 10K | Category number | Learning the common visual characteristics | Classification accuracy | 10 (20) categories, including one "others" category
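Mean Average Precision, the metric listed for the retrieval tasks, can be sketched as below. This uses the common textbook definition (precision averaged at the ranks where relevant items appear); the competition's exact variant, for example any cut-off on the result list, is not stated in the slides, and the function names are hypothetical.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Average of precision@k over the ranks k where a relevant item appears."""
    if num_relevant is None:
        num_relevant = sum(ranked_relevance)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / num_relevant


def mean_average_precision(runs):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in runs) / len(runs)


# One query whose 1st and 3rd results are relevant: (1/1 + 2/3) / 2
print(average_precision([1, 0, 1, 0]))
print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))
```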

20 VT1 Categories

100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices, including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane

10 Categories for VT2

201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)
5 queries will be given, and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given, and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total: 32508 pseudo key frames, 8486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288

Computation Power

Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes

Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (Matlab)
Semantic Feature 1: 24 hours (Matlab)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (Matlab)
Motion Feature 2: 3 hours on t-Illiac

Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, Matlab)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes

Possible Accelerations for Video

Matlab code to C
Parallel computing
GPU acceleration

Patch-based features
Load time is the major issue
Extract all the features after one load

Features for Round 2, VT1

Image Features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.

Semantic Feature

Features for Round 2, VT2

Character Detector
Harris corner
Morphological operations

Optical Flow
Lucas-Kanade on spatial intensity gradient

Gender recognition
SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on NN
Feature normalization/combination
Result visualization

Observations

1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101, 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
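The query expansion steps can be sketched as follows. This is an illustration under assumptions: features are plain tuples, matching uses squared Euclidean distance to the nearest member of the expanded query, and every name (`expanded_query_search`, `sq_dist`) is hypothetical rather than taken from the team's code.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def expanded_query_search(query_feature, query_category,
                          development, evaluation, top_n):
    """development: list of (feature, category); evaluation: list of features."""
    # Step 1: expand the query with every development image of its category.
    expanded = [query_feature] + [f for f, c in development
                                  if c == query_category]
    # Step 2: score each evaluation item by its distance to the nearest
    # member of the expanded query, then keep the top_n closest items.
    scored = sorted(
        (min(sq_dist(item, q) for q in expanded), idx)
        for idx, item in enumerate(evaluation)
    )
    return [idx for _, idx in scored[:top_n]]


development = [((0.0, 0.0), 101), ((5.0, 5.0), 102)]
evaluation = [(0.0, 1.0), (5.0, 6.0), (1.0, 1.5)]
result = expanded_query_search((1.0, 1.0), 101, development, evaluation, top_n=2)
print(result)  # [2, 0]
```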

Algorithms

Motivation: use a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pairwise image distances based on the patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance

VT1 Performance (2 in 8)

Category | MAP
101 Crowd (>10 people) | 0.8419
102 Building with sky as backdrop, clearly visible | 0.977
103 Mobile devices, including handphone/PDA | 0.028
107 Person using computer, both visible | 0.2281
109 Company trademark, including billboard, logo | 0.96
112 Close-up of hand, e.g. using mouse, writing, etc. | 0.4584
113 Business meeting (>2 people), mostly seated down, table visible | 0.0644
115 Food on dishes, plates | 0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side | 0.9783
117 Traffic scene, many cars, trucks, road visible | 0.2901

VT2 Performance (1 in 8)

Category | MAP
202 Talking face with introductory caption | 0.8432
206 Static or minute camera movement, people(s) walking, legs visible | 0.0581
207 Large camera movement, panning left/right, top/down of a scene | 0.7789
208 Movie ending credit | 0.2782
209 Woman monologue (Zhen) | 0.9756

Performance of Round 3 (1 in 7)

Task 2 (VT1)

Target | Estimated MAP (R=20)
101 Crowd (>10 people) | 0.64
102 Building with sky as backdrop, clearly visible | 1
107 Person using computer, both visible | 0.7
112 Close-up of hand, e.g. using mouse, writing, etc. | 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side | 1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20) | VT2 only | AT1 + VT2
202 Face with introductory caption | 1 | 0.03
209 Woman monologue | 0.35 | 0.1
201 People entering door | NA |

We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall

TRECVid

TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop

Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task: Surveillance Event Detection

The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million

List of Events
1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator

Detection of Opposing Flow Event

[Figure: block diagram. Block labels: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool
Rapid development
Fast execution

Python glue layer
CUDA, C/C++
Integrates libraries

Data flow
Lazy pull
Per frame referencing
Caches (lots of them)
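The lazy-pull data flow with per-frame caching can be illustrated with the sketch below. It is not the ViVid API: `FrameSource` and `FrameDifference` are hypothetical classes that only show the pattern, in which a stage requests frames from its source when a result is asked for and remembers results by frame index.

```python
from functools import lru_cache


class FrameSource:
    """Stands in for a video decoder; counts how often a frame is decoded."""

    def __init__(self, frames):
        self.frames = frames
        self.decode_count = 0

    def get(self, index):
        self.decode_count += 1
        return self.frames[index]


class FrameDifference:
    """Pulls frames from its source only when a result is requested."""

    def __init__(self, source):
        self.source = source
        self.get = lru_cache(maxsize=64)(self._compute)  # per-frame cache

    def _compute(self, index):
        current = self.source.get(index)
        previous = self.source.get(index - 1)
        return [abs(c - p) for c, p in zip(current, previous)]


source = FrameSource([[0, 0], [1, 0], [1, 3]])
diff = FrameDifference(source)
print(diff.get(2))          # [0, 3]
diff.get(2)                 # second request is served from the cache
print(source.decode_count)  # 2
```

Nothing is decoded until `diff.get` is called, and repeated requests for the same frame index do not touch the decoder again.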

Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia indexing

Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
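The "~5 days" figure can be checked with a short calculation. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
hours_uploaded_per_minute = 20  # from the slide
facebook_photos = 15e9          # from the slide
frames_per_second = 30          # assumed frame rate

# Frames uploaded per minute, scaled to a full day (60 * 24 minutes).
frames_per_day = hours_uploaded_per_minute * 3600 * frames_per_second * 60 * 24
days_to_match = facebook_photos / frames_per_day
print(frames_per_day)           # 3110400000
print(round(days_to_match, 1))  # 4.8
```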

Image / Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow

Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients / Optical Flow

Analysis
Vector Quantization
SVM Classifier Evaluation

Page 6: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Semantic Gap Semantic Gap

Ski Slope

Semantic Gap Semantic Gap

Resort

Semantic Gap Semantic Gap

Fun

Holiday

Beautifulhellip

Semantic Gap in Multimedia Semantic Gap in Multimedia

Retrieval Given a description retrieve all ldquorelevantrdquo content from a

database

Parsing Given an input formulate a natural language description

Subtasks

Detection (find ldquothingsrdquo)

Segmentation (find the boundaries of ldquothingsrdquo)

Recognition (assign category)

Retrieval Given a description retrieve all ldquorelevantrdquo content from a

database

Parsing Given an input formulate a natural language description

Subtasks

Detection (find ldquothingsrdquo)

Segmentation (find the boundaries of ldquothingsrdquo)

Recognition (assign category)

Multimedia Analysis Competitions and

Evaluations

Multimedia Analysis Competitions and

Evaluations

Moderate size dataset

Training set with labels

Evaluation set without labels

Constrained problem

Detect well defined actions

Detect words or concepts

Well defined metric

Challenges

Algorithm design

Computation

Moderate size dataset

Training set with labels

Evaluation set without labels

Constrained problem

Detect well defined actions

Detect words or concepts

Well defined metric

Challenges

Algorithm design

Computation

Star Challenge Star Challenge

PART I visual data processing PART I visual data processing

What is Star Challenge What is Star Challenge

Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology

Hosted by the Agency for Science Technology and Research (ASTAR) Singapore

A real-world computer vision task which requires large amounts of computation power

Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology

Hosted by the Agency for Science Technology and Research (ASTAR) Singapore

A real-world computer vision task which requires large amounts of computation power

But low rewards But low rewards

56 teams

from 17 countries

Round 1

8 teams

Round 2

7 teams

Round 3

5 teams Grand Final

in Singapore

No rewards No rewards No rewards

No rewards

Only one team

can win

US$100000

But we have a team with no fears…

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi Vong Xu Mert Dennis Jason Andrey Yuxiao

Let's go over our experience and stories…

Outline

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks

Task | Query | Target | Metric
AT1 | IPA sequence | Segments that contain the query IPA sequence, regardless of language | Mean Average Precision
AT2 | An utterance spoken by different speakers | All segments that contain the query word/phrase/sentence, regardless of spoken language | Mean Average Precision
AT3 | No queries | Extract all recurrent segments that are at least 1 second in length | F-measure

Data set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3

Xiaodan will talk about this part…

3 Video Retrieval Tasks

Task | Query | Target | Criteria | Metric | Data Set
VT1 | Single image (20 queries) | (Short) video segments: all the similar segments | "Visually similar" | Mean Average Precision | 20 categories, multiple labels possible
VT2 | Short video shot (<10 s) (20 queries) | (Long) video segments: all the similar segments | Perceptually similar | Mean Average Precision | 10 categories, multiple labels possible
VT3 | Videos with sound (3~10 s), order of 10K | Category number | Learning the common visual characteristics | Classification accuracy | 10 (20) categories, including one "others" category

20 VT1 Categories

100 Not-Applicable: none of the labels

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

103 Mobile devices, including handphone/PDA

104 Flag

105 Electronic chart, e.g. stock charts, airport departure chart

106 TV chart overlay, including graphs, text, PowerPoint style

107 Person using Computer, both visible

108 Track and field, sports

109 Company Trademark, including billboard, logo

110 Badminton court, sports

111 Swimming pool, sports

112 Close-up of hand, e.g. using mouse, writing, etc.

113 Business meeting (> 2 people), mostly seated down, table visible

114 Natural scene, e.g. mountain, trees, sea, no people

115 Food on dishes, plates

116 Face close-up, occupying about 3/4 of screen, frontal or side

117 Traffic Scene, many cars, trucks, road visible

118 Boat/Ship, over sea, lake

119 PC Webpages, screen of PC visible

120 Airplane

10 Categories for VT2

201 People entering/exiting door/car

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle, looking outside

205 Large camera movement, tracking an object, person, car, etc.

206 Static or minute camera movement, people walking, legs visible

207 Large camera movement, panning left/right, top/down of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

107 Person using Computer, both visible

112 Close-up of hand, e.g. using mouse, writing, etc.

116 Face close-up, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total: 32508 pseudo key frames + 8486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)

Key frames for VT1: 10580 jpg files

Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)

Video: 352×288

Computation Power

Work Stations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8G 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes

Video Format Conversion: 2 hours

Video Segmentation (for VT2): 40 minutes

Sound Track Extraction: 30 minutes

Feature Extraction

Global Feature 2: 2 hours (C)

Global Feature 1: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (MATLAB)

Semantic Feature 1: 24 hours (MATLAB)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (MATLAB)

Motion Feature 2: 3 hours on Trusted-ILLIAC

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video

MATLAB code to C

Parallel computing

GPU acceleration

Patch-based features

Load time is the major issue

Extract all the features after one load

Features for Round 2 - VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color, texture, etc.

Semantic Feature

Features for Round 2 - VT2

Character Detector

Harris corner

Morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature Normalization/Combination

Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly, e.g. VT1 101 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
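
The query-expansion steps above can be sketched as follows. This is only an illustration, not the team's code: it assumes features are rows of NumPy arrays, uses Euclidean distance, and scores each evaluation item by its distance to the nearest member of the expanded query. The function name `expand_and_search` and all of its parameters are hypothetical.

```python
import numpy as np

def expand_and_search(query_feat, query_cat, dev_feats, dev_labels, eval_feats, top_k=20):
    """Return indices of the top_k evaluation items for an expanded query."""
    # Step 1: expand the query with all development items of the same category.
    same_cat = dev_feats[dev_labels == query_cat]
    expanded = np.vstack([query_feat[None, :], same_cat])
    # Step 2: distance from every evaluation item to every expanded-query item.
    diff = eval_feats[:, None, :] - expanded[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    # Assumption: an item's score is its distance to the nearest expanded-query member.
    scores = dists.min(axis=1)
    return np.argsort(scores, kind="stable")[:top_k]

# Toy data: two development images of category 101, one of category 102.
dev_feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
dev_labels = np.array([101, 101, 102])
eval_feats = np.array([[5.1, 5.0], [0.05, 0.0], [9.0, 9.0]])
ranked = expand_and_search(np.array([0.0, 0.1]), 101, dev_feats, dev_labels, eval_feats, top_k=2)
```

In the toy data the second evaluation item sits next to the category-101 development images, so it is ranked first.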

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
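
Step 2 can be illustrated with the mean-only MAP adaptation that is common in GMM-UBM systems. The slides do not say which parameters were adapted, so everything below is an assumption made for the sketch: diagonal covariances, adaptation of the means only, and a relevance factor of 16. The name `map_adapt_means` is hypothetical.

```python
import numpy as np

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Adapt UBM means to one image's patches (diagonal-covariance GMM)."""
    # Log-density of every patch under every UBM component: shape (N, K).
    diff = patches[:, None, :] - means[None, :, :]
    log_prob = -0.5 * ((diff ** 2) / variances[None, :, :]
                       + np.log(2.0 * np.pi * variances[None, :, :])).sum(axis=2)
    log_post = np.log(weights)[None, :] + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)    # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)            # responsibilities
    n_k = post.sum(axis=0)                             # soft count per component
    # Posterior-weighted mean of the patches for each component.
    e_k = (post.T @ patches) / np.maximum(n_k, 1e-12)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]         # data-dependent mixing weight
    return alpha * e_k + (1.0 - alpha) * means

# Toy UBM with two components; all 16 patches lie near the first one.
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [10.0, 10.0]])
ubm_vars = np.ones((2, 2))
image_patches = np.tile([1.0, 1.0], (16, 1))
adapted = map_adapt_means(image_patches, ubm_weights, ubm_means, ubm_vars)
```

With 16 patches and a relevance factor of 16, the first mean moves halfway toward the data while the unused second component stays at its UBM value.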

VT1 Performance (2 in 8)

Category | MAP
101 Crowd (>10 people) | 0.8419
102 Building with sky as backdrop, clearly visible | 0.977
103 Mobile devices, including handphone/PDA | 0.028
107 Person using Computer, both visible | 0.2281
109 Company Trademark, including billboard, logo | 0.96
112 Close-up of hand, e.g. using mouse, writing, etc. | 0.4584
113 Business meeting (> 2 people), mostly seated down, table visible | 0.0644
115 Food on dishes, plates | 0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side | 0.9783
117 Traffic Scene, many cars, trucks, road visible | 0.2901
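
The MAP figures above are averages of per-query average precision. A minimal sketch of the usual definition follows. The competition's exact variant (for example the cutoff R or the normalizing count) is not given in the slides, so `average_precision` and its `num_relevant` parameter are illustrative assumptions.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """AP for one query; ranked_relevance holds 0/1 flags in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant item
    # Normalize by the known number of relevant items, or by the hits found.
    denom = num_relevant if num_relevant is not None else hits
    return precision_sum / denom if denom else 0.0

def mean_average_precision(runs):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)

ap_example = average_precision([1, 0, 1, 0])
map_example = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

A ranking that places relevant items at positions 1 and 3 scores (1/1 + 2/3) / 2, so early mistakes cost more than late ones.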

VT2 Performance (1 in 8)

Category | MAP
202 Talking face with introductory caption | 0.8432
206 Static or minute camera movement, people walking, legs visible | 0.0581
207 Large camera movement, panning left/right, top/down of a scene | 0.7789
208 Movie ending credit | 0.2782
209 Woman monologue (Zhen) | 0.9756

Performance of Round 3 (1 in 7)

Task 2 (VT1)

Target | Estimated MAP (R=20)
101 Crowd (>10 people) | 0.64
102 Building with sky as backdrop, clearly visible | 1
107 Person using Computer, both visible | 0.7
112 Close-up of hand, e.g. using mouse, writing, etc. | 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side | 1

Task 3 (AT1 + VT2)

Retrieval Target (Video) | VT2 only (R=20) | AT1 + VT2
202 Face with introductory caption | 1 | 0.03
209 Woman monologue | 0.35 | 0.1
201 People entering door | NA

We are 2nd in Audio search, 4th in Video search, 2nd in AV search, 1st overall

TRECVid

TREC: Text REtrieval Conference. TRECVid: Video Retrieval Workshop

Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task: Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

List of Events

1. Cell to ear

2. Embrace

3. Object put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Detection of Opposing Flow Event

[Diagram labels: Regional Averaging, Door Open/Close Information, Event Detections, Thresholding Rule]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates libraries

Data flow

Lazy pull

Per-frame referencing

Caches (lots of them)

Motivations

Most operations are highly local

Applications with real-time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
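
The "~5 days" estimate can be reproduced with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20     # from the slide
FACEBOOK_PHOTOS = 15e9             # from the slide
FRAMES_PER_SECOND = 30             # assumption, not stated in the slide

# Hours of video per day, converted to seconds, then to frames.
hours_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24
frames_per_day = hours_per_day * 3600 * FRAMES_PER_SECOND
days_to_match = FACEBOOK_PHOTOS / frames_per_day
print(f"{frames_per_day:.3g} frames per day, {days_to_match:.1f} days to match")
```

Under that assumption the upload stream carries roughly 3.1 billion frames per day, which reaches 15 billion in a little under five days.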

Image/Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid - Video Computer Vision on Graphics Processors

Download: http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Pipeline diagram: Video → Interest Points → Features (Motion, Shape) → Vector Quantization → Histogram → Classifier → Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector (corners)

Dollar: Space-Time Gabor (corners, periodic motion)

RSMB: Random Sampling of the Motion Boundary (motion)

More is Good

[Figure: Interest Point Detection Rates]

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\ H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$
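
The motion history update is a per-pixel rule, so it translates directly into array code. The sketch below follows the recurrence for H_τ with a binary motion mask D; the thresholded frame difference used to build D is only one possible choice and is an assumption, as are the function names.

```python
import numpy as np

def update_mhi(prev_mhi, motion_mask, tau):
    """One MHI step: tau where motion is detected, otherwise decay by 1, floored at 0."""
    decayed = np.maximum(0, prev_mhi - 1)
    return np.where(motion_mask, tau, decayed)

def frame_difference_mask(prev_frame, frame, threshold=25):
    """Hypothetical D(x, y, t): absolute frame difference above a threshold."""
    return np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32)) > threshold

# Motion at the first pixel, then at the second pixel one frame later.
mhi = np.zeros((1, 3), dtype=np.int32)
mhi = update_mhi(mhi, np.array([[True, False, False]]), tau=3)
mhi_next = update_mhi(mhi, np.array([[False, True, False]]), tau=3)
```

Recent motion keeps the value τ while older motion fades by one per frame, which is what encodes the direction of movement in a single image.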

[Bar chart: milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

Partition the image window into local regions

Histogram the image gradient / optical flow based on the direction and magnitude

Normalize over neighboring regions
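
The three steps above can be sketched for image gradients as follows. The cell size of 8 pixels, 9 unsigned orientation bins, and L2 normalization over 2x2 blocks are conventional choices assumed here, not values taken from the slides. For optical flow the same code would histogram the flow components in place of the gradient.

```python
import numpy as np

def cell_histograms(image, cell=8, bins=9):
    """Magnitude-weighted orientation histograms over non-overlapping cells."""
    gy, gx = np.gradient(image.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(), minlength=bins)
    return hist

def block_normalize(hist, eps=1e-6):
    """L2-normalize each 2x2 block of neighboring cells and concatenate."""
    rows, cols, _ = hist.shape
    blocks = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            v = hist[r:r + 2, c:c + 2].ravel()
            blocks.append(v / np.sqrt((v ** 2).sum() + eps ** 2))
    return np.concatenate(blocks)

# A horizontal intensity ramp: unit gradient along x everywhere.
ramp = np.tile(np.arange(16.0), (16, 1))
hist = cell_histograms(ramp)
descriptor = block_normalize(hist)
```

Every cell is independent of the others, which is why this kind of descriptor parallelizes well.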

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update the center to be the mean of the associated data points
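
The two steps of Lloyd's algorithm map onto a short loop. This is a plain NumPy sketch with a fixed iteration count and caller-supplied initial centers; both choices are assumptions, and the name `lloyd_kmeans` is hypothetical.

```python
import numpy as np

def lloyd_kmeans(data, centers, iterations=10):
    """Alternate assignment and mean-update steps for a fixed number of iterations."""
    centers = centers.astype(np.float64).copy()
    for _ in range(iterations):
        # Assignment step: index of the closest center for every data point.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        for k in range(len(centers)):
            members = data[labels == k]
            if len(members):               # leave empty clusters where they are
                centers[k] = members.mean(axis=0)
    return centers, labels

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
init = np.array([[0.0, 0.0], [10.0, 10.0]])
final_centers, final_labels = lloyd_kmeans(points, init)
```

The assignment step is the expensive part, since it needs the distance from every point to every center.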

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
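
With 1 million features and 1000 centers the full distance matrix has 10^9 entries, about 4 GB in single precision. One way to reduce as computation proceeds is to process the features in chunks and keep only the argmin of each chunk. The sketch below shows the idea on the CPU with NumPy; the chunk size and the function name are assumptions, and a GPU version would fuse the same reduction into the distance kernel.

```python
import numpy as np

def nearest_center_chunked(features, centers, chunk=1024):
    """Closest-center index per feature without storing the full distance matrix."""
    center_sq = (centers ** 2).sum(axis=1)
    labels = np.empty(len(features), dtype=np.int64)
    for start in range(0, len(features), chunk):
        block = features[start:start + chunk]
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, and ||a||^2 is constant per row,
        # so it can be dropped without changing the argmin.
        partial = center_sq[None, :] - 2.0 * (block @ centers.T)
        labels[start:start + chunk] = partial.argmin(axis=1)   # reduce immediately
    return labels

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 5))
cents = rng.normal(size=(7, 5))
chunked_labels = nearest_center_chunked(feats, cents, chunk=16)
```

Only one chunk of distances exists at any time, so memory use depends on the chunk size rather than on the number of features.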

Clustering Helps

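Pairwise Distance Implementation

[Bar chart: CUDA versus C; axis from 0 to 6000]

Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute

$$\begin{bmatrix} d(a_1, b_1) & \cdots & d(a_1, b_n) \\ \vdots & \ddots & \vdots \\ d(a_m, b_1) & \cdots & d(a_m, b_n) \end{bmatrix}$$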
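
When d is the squared Euclidean distance, the whole matrix can be computed with one matrix product, which is the form that maps well onto GPU hardware. The slides do not say which distance d denotes, so squared Euclidean is an assumption here, and `pairwise_sq_dists` is a hypothetical name.

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """D[i, j] = ||a_i - b_j||^2 for rows a_i of A (m x d) and b_j of B (n x d)."""
    a_sq = (A ** 2).sum(axis=1)[:, None]     # column of squared norms of A's rows
    b_sq = (B ** 2).sum(axis=1)[None, :]     # row of squared norms of B's rows
    D = a_sq + b_sq - 2.0 * (A @ B.T)
    return np.maximum(D, 0.0)                # clip tiny negatives caused by rounding

A_demo = np.array([[0.0, 0.0], [1.0, 0.0]])
B_demo = np.array([[0.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
D_demo = pairwise_sq_dists(A_demo, B_demo)
```

The result has one row per vector in A and one column per vector in B, matching the layout of the matrix above.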

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram labels: Shared Memory, A, B, C]

Pairwise Distance Computation

[Diagram labels: A, B, C]

Timings on TRECVid 2008 System

[Bar chart, log scale from 1 to 10000 milliseconds: time per stage (Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance) for GPU + CPU versus CPU. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size | Histogramming Method | Rate of Detection: Low | Rate of Detection: Medium | Rate of Detection: High
1000 | Raw | 0.681 | 0.804 | 0.844
1000 | Norm | 0.708 | 0.799 | 0.840
1000 | Mt Inf | 0.594 | 0.804 | 0.848
500 | Raw | 0.675 | 0.792 | 0.833
500 | Norm | 0.701 | 0.791 | 0.825
500 | Mt Inf | 0.626 | 0.783 | 0.819
200 | Raw | 0.671 | 0.772 | 0.811
200 | Norm | 0.701 | 0.779 | 0.818
200 | Mt Inf | 0.614 | 0.720 | 0.756

Results (2009)

Event | True Positives | False Alarm | Miss | Min DCR
Pointing | 13 (57) | 225 (2505) | 1050 | 1.006
Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060
Person Runs | 1 (0) | 38 (314) | 106 | 0.997
Object Put | 1 (21) | 190 (2703) | 620 | 1.020

Values in parentheses are the 2008 results

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering, programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet Large Scale Visual Recognition

10,000 classes

Page 7: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Semantic Gap Semantic Gap

Resort

Semantic Gap Semantic Gap

Fun

Holiday

Beautifulhellip

Semantic Gap in Multimedia Semantic Gap in Multimedia

Retrieval Given a description retrieve all ldquorelevantrdquo content from a

database

Parsing Given an input formulate a natural language description

Subtasks

Detection (find ldquothingsrdquo)

Segmentation (find the boundaries of ldquothingsrdquo)

Recognition (assign category)

Retrieval Given a description retrieve all ldquorelevantrdquo content from a

database

Parsing Given an input formulate a natural language description

Subtasks

Detection (find ldquothingsrdquo)

Segmentation (find the boundaries of ldquothingsrdquo)

Recognition (assign category)

Multimedia Analysis Competitions and

Evaluations

Multimedia Analysis Competitions and

Evaluations

Moderate size dataset

Training set with labels

Evaluation set without labels

Constrained problem

Detect well defined actions

Detect words or concepts

Well defined metric

Challenges

Algorithm design

Computation

Moderate size dataset

Training set with labels

Evaluation set without labels

Constrained problem

Detect well defined actions

Detect words or concepts

Well defined metric

Challenges

Algorithm design

Computation

Star Challenge Star Challenge

PART I visual data processing PART I visual data processing

What is Star Challenge What is Star Challenge

Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology

Hosted by the Agency for Science Technology and Research (ASTAR) Singapore

A real-world computer vision task which requires large amounts of computation power

Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology

Hosted by the Agency for Science Technology and Research (ASTAR) Singapore

A real-world computer vision task which requires large amounts of computation power

But low rewards But low rewards

56 teams

from 17 countries

Round 1

8 teams

Round 2

7 teams

Round 3

5 teams Grand Final

in Singapore

No rewards No rewards No rewards

No rewards

Only one team

can win

US$100000

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

But we have a team with no fearshellip But we have a team with no fearshellip

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

But we have a team with no fearshellip But we have a team with no fearshellip

Letrsquos go over our experience and

storieshellip

Letrsquos go over our experience and

storieshellip

Outlines Outlines

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks 3 Audio Retrieval Tasks Task Query Target Metric Data Set

AT1 IPA sequence segments that contain the

query IPA sequence

regardless of its languages

Mean Average Precision

25 hours

monolingual

database in

round1

13 hours

multilingual

database in

round3

AT2 an utterance spoken

by different speakers all segments that contain the

query wordphrasesentence

regardless of its spoken

languages

AT3 No queries extract all recurrent segments

which are at least 1 second in

length

F-measure

Xiaodan will talk about this parthelliphellip

3 Video Retrieval Tasks 3 Video Retrieval Tasks

Task Query Target Criteria Metric Data Set

VT1 Single

Image

20

queries

(short)

Video

Segs

All the

similar

Segs

ldquovisually

similarrdquo

Mean Average Precision

20 categories

multiple labels

possible

VT2 Short

Video

Shot

(lt10s)

20

queries

(long)

Video

Segs

All the

similar

Segs

Perceptually

Similar

10 categories

multiple labels

possible

VT3 Videos

with

sound

(3~10s)

Order

of 10K

Category

number

learning the

common

visual

characteristics

Classification accuracy 10(20)

categories

including one

ldquoothersrdquo

category

20 VT1 Categories 20 VT1 Categories

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

10 Categories for VT2 10 Categories for VT2

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3 5 Categories for VT3

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

Examples of Images Examples of Images

More samples More samples

Evaluation Video Data of Round2 Evaluation Video Data of Round2

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

Evaluation Video Data of Round3 Evaluation Video Data of Round3

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Computation Power

Workstations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes

Video Format Conversion: 2 hours

Video Segmentation (for VT2): 40 minutes

Sound Track Extraction: 30 minutes

Feature Extraction

Global Feature 1: 2 hours (C)

Global Feature 2: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (MATLAB)

Semantic Feature 1: 24 hours (MATLAB)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (MATLAB)

Motion Feature 2: 3 hours (on t-Illiac)

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video

Port MATLAB code to C

Parallel computing

GPU acceleration

Patch-based features

Load time is the major issue

Extract all the features after one load
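The "one load" idea can be illustrated with a minimal Python sketch, assuming frames arrive as NumPy arrays. The function names `extract_all`, `mean_color` and `intensity_histogram` are hypothetical stand-ins, not the team's actual extractors: each frame is read once and every extractor runs on it while it is still in memory.

```python
from typing import Callable, Dict, Iterable, Tuple

import numpy as np

Extractor = Callable[[np.ndarray], np.ndarray]


def extract_all(frames: Iterable[Tuple[str, np.ndarray]],
                extractors: Dict[str, Extractor]) -> Dict[str, Dict[str, np.ndarray]]:
    """Run every extractor on each frame while that frame is in memory."""
    results = {}
    for frame_id, frame in frames:  # each frame is loaded (yielded) exactly once
        results[frame_id] = {name: fn(frame) for name, fn in extractors.items()}
    return results


# Hypothetical stand-in extractors, for illustration only.
def mean_color(frame: np.ndarray) -> np.ndarray:
    """Average value of each color channel."""
    return frame.reshape(-1, frame.shape[-1]).mean(axis=0)


def intensity_histogram(frame: np.ndarray) -> np.ndarray:
    """8-bin histogram over all pixel values in [0, 256)."""
    hist, _ = np.histogram(frame, bins=8, range=(0, 256))
    return hist
```

Passing a generator that decodes frames lazily as `frames` keeps the expensive load step to one pass, however many extractors are registered.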

Features for Round 2 - VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color, texture, etc.

Semantic Feature

Features for Round 2 - VT2

Character Detector: Harris corner, morphological operations

Optical Flow: Lucas-Kanade on spatial intensity gradient

Gender recognition: SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature Normalization/Combination

Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
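The query expansion steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation: features are fixed-length vectors, similarity is Euclidean distance, and an evaluation item is scored by its distance to the nearest member of the expanded query. The function name `expanded_query_search` and its parameters are hypothetical.

```python
import numpy as np


def expanded_query_search(query_feature, query_category,
                          dev_features, dev_labels,
                          eval_features, top_k=20):
    """Return indices of the top_k evaluation items for an expanded query."""
    # Step 1: expand the query with all development items of the same category.
    same_category = dev_features[dev_labels == query_category]
    expanded = np.vstack([query_feature[None, :], same_category])

    # Step 2: score each evaluation item by its distance to the nearest
    # member of the expanded query (assumed scoring rule).
    diffs = eval_features[:, None, :] - expanded[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))  # shape (n_eval, n_expanded)
    scores = dists.min(axis=1)

    # Output: smallest distance first, truncated to top_k results.
    return np.argsort(scores, kind="stable")[:top_k]
```

The precomputed matching mentioned in step 0 would replace the on-the-fly distance computation here; it is left out to keep the sketch short.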

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance

VT1 Performance (2 in 8)

Category: MAP

• 101 Crowd (>10 people): 0.8419

• 102 Building with sky as backdrop, clearly visible: 0.977

• 103 Mobile devices including handphone/PDA: 0.028

• 107 Person using computer, both visible: 0.2281

• 109 Company trademark, including billboard, logo: 0.96

• 112 Closeup of hand, e.g. using mouse, writing, etc.: 0.4584

• 113 Business meeting (>2 people), mostly seated down, table visible: 0.0644

• 115 Food on dishes, plates: 0.2285

• 116 Face closeup, occupying about 3/4 of screen, frontal or side: 0.9783

• 117 Traffic scene, many cars, trucks, road visible: 0.2901
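For readers unfamiliar with the metric, the following is a minimal sketch of average precision and its mean over queries. It uses the common textbook definition; the contest's exact scoring rule (for example, its cutoff R) is not given in these slides, so treat the details as an assumption.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Average precision of one ranked result list.

    ranked_relevance: sequence of booleans/0-1 flags, best-ranked result first.
    num_relevant: total number of relevant items; defaults to the number
    of relevant items found in the list (an assumption of this sketch).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    denominator = num_relevant if num_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over several queries."""
    return sum(average_precision(run) for run in runs) / len(runs)
```

A list whose relevant items are all ranked first scores 1.0, which is how a category such as 102 can approach a MAP of 1.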

VT2 Performance (1 in 8)

Category: MAP

• 202 Talking face with introductory caption: 0.8432

• 206 Static or minute camera movement, people(s) walking, legs visible: 0.0581

• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789

• 208 Movie ending credit: 0.2782

• 209 Woman monologue: 0.9756 (Zhen)

Performance of Round 3 (1 in 7)

Task 2 (VT1)

Target: Estimated MAP (R=20)

101 Crowd (>10 people): 0.64

102 Building with sky as backdrop, clearly visible: 1

107 Person using computer, both visible: 0.7

112 Closeup of hand, e.g. using mouse, writing, etc.: 0.527

116 Face closeup, occupying about 3/4 of screen, frontal or side: 1

Task 3 (AT1 + VT2), Video, R=20

Retrieval Target                       VT2 only   AT1 + VT2
202 Face with introductory caption     1          0.03
209 Woman monologue                    0.35       0.1
201 People entering door               NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Regional Averaging

Door Open/Close Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
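The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate is not stated there, so the 30 frames per second used below is an assumption.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
ASSUMED_FPS = 30                 # assumption: not stated in the slide


def days_to_match(photos, hours_per_minute, fps):
    """Days of uploads needed for the frame count to reach `photos`."""
    video_seconds_per_day = hours_per_minute * 3600 * 60 * 24
    frames_per_day = video_seconds_per_day * fps
    return photos / frames_per_day


days = days_to_match(FACEBOOK_PHOTOS, HOURS_UPLOADED_PER_MINUTE, ASSUMED_FPS)
```

At 30 fps this gives about 4.8 days, and at 25 fps about 5.8 days, so the slide's estimate holds for common frame rates.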

Image/Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid - Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[System diagram. Blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector. Responds to: corners

Dollar: Space-Time Gabor. Responds to: corners, periodic motion

RSMB: Random Sampling of the Motion Boundary. Responds to: motion

[Figure panels: Laptev, Dollar, RSMB]

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
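The motion history update sets a pixel to τ wherever motion is detected and otherwise decays its previous value by 1, floored at 0. A minimal NumPy sketch of one update step follows; the function name `update_mhi` is hypothetical, and the binary motion mask D is taken as an input because the slides do not say how it is computed.

```python
import numpy as np


def update_mhi(prev_mhi, motion_mask, tau):
    """One motion-history step.

    prev_mhi: H(x, y, t-1) as a 2-D array.
    motion_mask: boolean array, True where D(x, y, t) = 1.
    tau: history length in frames.
    """
    decayed = np.maximum(prev_mhi - 1, 0)       # max(0, H(x, y, t-1) - 1)
    return np.where(motion_mask, tau, decayed)  # tau wherever motion occurred
```

Calling this once per frame, starting from an all-zero array, yields an image whose brightness encodes how recently each pixel moved.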

[Bar chart: milliseconds per frame for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin)]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient/optical flow based on the direction and magnitude

• Normalize over neighboring regions
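The three steps above can be sketched in NumPy for the image-gradient case. This is a simplified illustration, not the ViVid implementation: the cell size, the 9 unsigned orientation bins, the hard bin assignment and the 2x2-cell L2 block normalization are all assumptions chosen for brevity.

```python
import numpy as np


def cell_orientation_histograms(image, cell=8, bins=9):
    """Magnitude-weighted orientation histogram for each cell of the image."""
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned, in [0, pi)
    bin_idx = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    return hist


def normalize_blocks(hist, eps=1e-6):
    """L2-normalize each 2x2 group of neighboring cells."""
    rows, cols, bins = hist.shape
    blocks = np.zeros((rows - 1, cols - 1, 4 * bins))
    for r in range(rows - 1):
        for c in range(cols - 1):
            v = hist[r:r + 2, c:c + 2].ravel()
            blocks[r, c] = v / np.sqrt((v ** 2).sum() + eps ** 2)
    return blocks
```

For optical flow, the same histogramming applies with the flow components in place of `gx` and `gy`.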

K-Means Clustering

Vector quantization

Turns high dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update each center to be the mean of its associated data points
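Lloyd's algorithm alternates the two steps listed above. A minimal NumPy sketch follows; the random initialization from the data points and the fixed iteration count are assumptions of this sketch, not details from the slides.

```python
import numpy as np


def assign(data, centers):
    """Index of the closest center for each data point."""
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def lloyd_kmeans(data, k, iterations=20, seed=0):
    """Return (centers, labels) after a fixed number of Lloyd iterations."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(data), size=k, replace=False)
    centers = data[start].astype(np.float64)
    for _ in range(iterations):
        labels = assign(data, centers)           # step 1: closest center
        for j in range(k):
            members = data[labels == j]
            if len(members):                     # keep empty clusters in place
                centers[j] = members.mean(axis=0)  # step 2: move to the mean
    return centers, assign(data, centers)
```

The resulting centers form the dictionary used for vector quantization: each feature is replaced by the index of its closest center.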

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
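To see why the output does not fit: a full 1,000,000 x 1,000 distance matrix in single precision takes about 4 GB. The sketch below shows the "reduce as computation proceeds" idea on the CPU with NumPy, as a stand-in for the CUDA kernel: distances are computed one block of features at a time and immediately reduced with an argmin, so only the labels are kept. The function name `nearest_centers` and the chunk size are illustrative assumptions.

```python
import numpy as np


def nearest_centers(features, centers, chunk_size=4096):
    """Closest-center index per feature, without storing all distances."""
    center_sq = (centers ** 2).sum(axis=1)
    labels = np.empty(len(features), dtype=np.int64)
    for start in range(0, len(features), chunk_size):
        block = features[start:start + chunk_size]
        # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2; the ||a||^2 term is the
        # same for every center, so it can be dropped before the argmin.
        scores = center_sq[None, :] - 2.0 * (block @ centers.T)
        labels[start:start + len(block)] = scores.argmin(axis=1)  # reduce now
    return labels
```

Only one `chunk_size x k` block of scores exists at any time, which is the property a fused distance-plus-argmin GPU kernel exploits.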

Clustering Helps

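Pairwise Distance Implementation

[Bar chart: CUDA vs. C, axis from 0 to 6000]

Given

\[
A = \{a_1, \dots, a_m\}, \qquad B = \{b_1, \dots, b_n\}
\]

Compute

\[
\begin{pmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & \cdots & \cdots & d(a_m, b_n)
\end{pmatrix}
\]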

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram sequence: input matrices A and B, output matrix C, and shared memory]

Timings on TRECVid 2008 System

[Bar chart, logarithmic axis from 1 to 10000, in milliseconds]

Stage                GPU + CPU   CPU
Fetch Frame          53          53
Optical Flow         79          79
Transfer to GPU      23          -
Feature Extraction   240         3030
Pairwise Distance    150         4947

Benchmark

Dictionary Building Strategies

Rate of Detection by dictionary size and histogramming method

Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756

Results (2009), with 2008 results in parentheses

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations: Experience

Working with realistic data

Engineering, programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10000 Classes


Star Challenge

PART I: visual data processing

What is Star Challenge

Competition to Develop World's Next-Generation Multimedia Search Technology

Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore

A real-world computer vision task which requires large amounts of computation power

But low rewards

56 teams from 17 countries

Round 1: 8 teams remain, no rewards

Round 2: 7 teams remain, no rewards

Round 3: 5 teams remain, no rewards

Grand Final in Singapore: only one team can win US$100000

But we have a team with no fears…

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

Let's go over our experience and stories…

Outline

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks

AT1

Query: IPA sequence

Target: segments that contain the query IPA sequence, regardless of language

Metric: Mean Average Precision

AT2

Query: an utterance spoken by different speakers

Target: all segments that contain the query word/phrase/sentence, regardless of spoken language

AT3

Query: no queries

Target: extract all recurrent segments which are at least 1 second in length

Metric: F-measure

Data Set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3

Xiaodan will talk about this part…

3 Video Retrieval Tasks

VT1

Query: single image, 20 queries

Target: (short) video segments, all the similar segments

Criteria: "visually similar"

Metric: Mean Average Precision

Data Set: 20 categories, multiple labels possible

VT2

Query: short video shot (<10 s), 20 queries

Target: (long) video segments, all the similar segments

Criteria: perceptually similar

Data Set: 10 categories, multiple labels possible

VT3

Query: videos with sound (3~10 s), order of 10K

Target: category number

Criteria: learning the common visual characteristics

Metric: classification accuracy

Data Set: 10 (20) categories, including one "others" category

20 VT1 Categories

100 Not-Applicable, none of the labels

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

103 Mobile devices including handphone/PDA

104 Flag

105 Electronic chart, e.g. stock charts, airport departure chart

106 TV chart overlay, including graphs, text, PowerPoint style

107 Person using computer, both visible

108 Track and field, sports

109 Company trademark, including billboard, logo

110 Badminton court, sports

111 Swimming pool, sports

112 Close-up of hand, e.g. using mouse, writing, etc.

113 Business meeting (>2 people), mostly seated down, table visible

114 Natural scene, e.g. mountain, trees, sea, no people

115 Food on dishes, plates

116 Face close-up, occupying about 3/4 of screen, frontal or side

117 Traffic scene, many cars, trucks, road visible

118 Boat/ship, over sea, lake

119 PC webpages, screen of PC visible

120 Airplane

10 Categories for VT2

201 People entering/exiting door/car

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle, looking outside

205 Large camera movement, tracking an object, person, car, etc.

206 Static or minute camera movement, people(s) walking, legs visible

207 Large camera movement, panning left/right, top/down of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

107 Person using computer, both visible

112 Closeup of hand, e.g. using mouse, writing, etc.

116 Face closeup, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

Examples of Images Examples of Images

More samples More samples

Evaluation Video Data of Round2 Evaluation Video Data of Round2

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

Evaluation Video Data of Round3 Evaluation Video Data of Round3

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Computation Powers Computation Powers

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Time Cost for Video Tasks Time Cost for Video Tasks

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Possible Accelerations for Video Possible Accelerations for Video

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE Grand Unified Feature

Extractor

GUFE Grand Unified Feature

Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Observations Observations 1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

Algorithms Algorithms

Input a query image and its category number

0 Preprocessing compute the matching between the evaluation and the

development data

Query Expansion

1 Expand the query image by retrieving all the images from the development

data set with the same category

2 Search the evaluation set with the expanded query

Output return the top 5020 results

Algorithms Algorithms

Motivation using a GMM to model the distribution of

patches

1 Train a UBM (Universal Background Model) based on

patches from all training images

2 MAP Estimation of the distribution of the patches

belonging to one image given UBM

3 Compute pair-wise image distance based on patch

kernel and within-class covariance normalization

3 Retrieving images based the normalized distance

VT1 Performance (2 in 8) VT1 Performance (2 in 8)

Category MAP

bull101 Crowd (gt10 people) 08419

bull102 Building with sky as backdrop clearly visible 0977

bull103 Mobile devices including handphonePDA 0028

bull107 Person using Computer both visible 02281

bull109 Company Trademark including billboard logo 096

bull112 Closeup of hand eg using mouse writing etc 04584

bull113 Business meeting (gt 2 people) mostly seated down table visible 00644

bull115 Food on dishes plates 02285

bull116 Face closeup occupying about 34 of screen frontal or side 09783

bull117 Traffic Scene many cars trucks road visible 02901

VT2 Performance(1in8) VT2 Performance(1in8)

Category MAP

bull202 Talking face with introductory caption 08432

bull206 Static or minute camera movement people(s)

walking legs visible 00581

bull207 Large camera movement panning leftright

topdown of a scene 07789

bull208 Movie ending credit 02782

bull209 Woman monologue Zhen 09756

Performance of Round3 (1in7) Performance of Round3 (1in7)

Task 2 (VT1)

Target Estimated MAP (R=20)

101 Crowd (gt10 people) 064

102 Building with sky as backdrop clearly visible 1

107 Person using Computer both visible 07

112 Closeup of hand eg using mouse writing etc 0527

116 Face closeup occupying about 34 of screen frontal or side 1

Task3 (AT1 + VT2)

Retrieval Target VT2 only AT1 + VT2

Video R=20

202 face with introductory caption 1 003

209 women monolog 035 01

201 People entering door NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC Text REtrieval Conference TRECVid Video Retrieval Workshop

Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million

Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations

• Most operations are highly local
• Applications with real time (or faster) performance requirements
  • Surveillance
  • Soft biometrics
  • Multimedia Indexing
• Visual Computing is here
  • Imaging and Photogrammetry
  • Pattern Recognition and Statistical Learning
  • Object Detection and Recognition
  • Dynamic Vision
  • Interactive and Internet Vision

Working with ViVid

Why parallel?
• Massive amounts of data
  • 20 hours of video uploaded to YouTube every minute
  • 15 billion photos on Facebook
• Most operations are local and independent in the (x, y, t) space
• Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days.
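The "~5 days" figure can be checked with a few lines of arithmetic. The frame rate is not given on the slide; 30 frames per second is assumed here.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption, not stated on the slide

# Hours of video uploaded per day, then frames per day.
video_hours_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24
frames_per_day = video_hours_per_day * 3600 * FRAMES_PER_SECOND

days_to_match = FACEBOOK_PHOTOS / frames_per_day
print(f"{frames_per_day:.3g} frames/day, {days_to_match:.1f} days")
```

At 25 frames per second the same calculation gives a little under 6 days, so the estimate is not very sensitive to the assumed frame rate.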

Image / Video Processing
• Video Decoder
• 2D/3D Convolution
• 2D/3D Fourier Transform
• Optical Flow

Feature Extraction
• Motion Descriptor (Efros et al.)
• Motion History Descriptor
• Random Video Interest Points
• Histograms of Oriented Gradients / Optical Flow

Analysis
• Vector Quantization
• SVM Classifier Evaluation

ViVid: Video Computer Vision on Graphics Processors
Download: http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System
[Figure: system block diagram. Blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

• Laptev: 3D Harris Corner Detector
  • Corners
• Dollar: Space Time Gabor
  • Corners
  • Periodic Motion
• RSMB: Random Sampling of the Motion Boundary
  • Motion

[Figure panels: Laptev, Dollar, RSMB]

More is Good
[Figure: Interest Point Detection Rates]

Video Features

• Descriptors of information relevant to the task
  • Motion
  • Shape
  • Appearance
• Computationally intensive
  • Development
  • Application

Averaging Flow (Efros et al., 2003)

Motion History Images (Bobick & Davis, 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
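The motion history update on this slide can be written as a per-pixel rule. The sketch below assumes D is a binary motion mask (for example a thresholded frame difference) and stores images as nested lists; the function name is hypothetical.

```python
def update_motion_history(previous_history, motion_mask, tau):
    """One time step of the motion history image H_tau.

    Pixels where motion is detected are set to tau; all other pixels
    decay by one per frame and are clamped at zero.
    """
    return [
        [tau if moving else max(0, history - 1)
         for history, moving in zip(history_row, mask_row)]
        for history_row, mask_row in zip(previous_history, motion_mask)
    ]


# Usage: a 2x2 history image updated with motion in the top-left pixel only.
history = update_motion_history([[0, 3], [5, 1]], [[1, 0], [0, 0]], tau=5)
```

Each output pixel depends only on the same pixel in the previous history and the current mask, which is the kind of local, independent operation the slides argue maps well to GPUs.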

[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin) and CUDA (feature + distance + argmin). Bar labels as extracted, decimal points lost: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
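The three steps above can be sketched for the image-gradient case as follows. This is an illustrative version, not the ViVid implementation: central differences, 8 unsigned-angle-free orientation bins over the full circle, magnitude-weighted voting and L2 block normalization are all assumptions, as are the function names.

```python
import math


def orientation_histogram(cell, n_bins=8):
    """Magnitude-weighted histogram of gradient directions in one cell."""
    hist = [0.0] * n_bins
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]   # central differences
            gy = cell[r + 1][c] - cell[r - 1][c]
            angle = math.atan2(gy, gx) % (2 * math.pi)
            bin_index = int(angle / (2 * math.pi) * n_bins) % n_bins
            hist[bin_index] += math.hypot(gx, gy)
    return hist


def normalize_block(cell_histograms, eps=1e-6):
    """Concatenate neighboring cell histograms and L2-normalize them."""
    block = [v for hist in cell_histograms for v in hist]
    norm = math.sqrt(sum(v * v for v in block) + eps)
    return [v / norm for v in block]
```

For the optical-flow variant the same binning would be applied to the flow vector at each pixel instead of the intensity gradient.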

K-Means Clustering

• Vector quantization
  • Turns high dimensional features into a discrete number of points
  • Given data, find representative "centers"
• Lloyd's algorithm
  • For each data point, find the closest center
  • Update each center to be the mean of its associated data points
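The two alternating steps of Lloyd's algorithm can be sketched in plain Python. Random initialization from the data, a fixed iteration count and the function names are choices made for this illustration only.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_centers(points, centers):
    """Step 1: index of the closest center for every data point."""
    return [min(range(len(centers)),
                key=lambda j: squared_distance(p, centers[j]))
            for p in points]


def lloyd_kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign_to_centers(points, centers)
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # Step 2: move the center to the mean of its members
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign_to_centers(points, centers)
```

The assignment step is where the pairwise distance computation dominates the cost, since every point is compared against every center in each iteration.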

K-Means

• Relies heavily on pairwise distance
• Large data sets
  • 1 million features with 100-200 dimensions
  • 1000 centers
• Cannot fit output in GPU memory
  • Will need to reduce as computation proceeds
  • Need efficient reduction operator
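To see why the output does not fit: with 1 million features and 1000 centers the full distance matrix has 10^9 entries, which is 4 GB if each entry is a 4-byte float (the data type is an assumption; the slides do not give it). Reducing each row to its argmin as soon as it is computed avoids storing the matrix. A minimal sketch of that idea, with a hypothetical function name and chunk size:

```python
def nearest_centers_chunked(features, centers, chunk_size=1024):
    """Index of the nearest center per feature, computed chunk by chunk.

    Only a chunk_size x len(centers) block of distances exists at any
    time; each block is reduced to argmin indices before the next one.
    """
    nearest = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        block = [[sum((x - y) ** 2 for x, y in zip(f, c)) for c in centers]
                 for f in chunk]
        nearest.extend(min(range(len(row)), key=row.__getitem__)
                       for row in block)
    return nearest
```

The output shrinks from one value per (feature, center) pair to one index per feature, which is the reduction the slide asks for.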

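Clustering Helps

Pairwise Distance Implementation

[Figure: bar chart comparing CUDA and C; axis 0 to 6000]

Given

$$A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]$$

compute

$$
\begin{bmatrix}
d(a_1, b_1) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \cdots & d(a_2, b_n) \\
\vdots & \ddots & \vdots \\
d(a_m, b_1) & \cdots & d(a_m, b_n)
\end{bmatrix}
$$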
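Written out directly, the computation is one distance evaluation per pair. The sketch below assumes Euclidean distance for d, which the slide leaves unspecified; the function names are illustrative.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(A, B, d=euclidean):
    """Return the m x n matrix whose (i, j) entry is d(A[i], B[j])."""
    return [[d(a, b) for b in B] for a in A]


# Usage: two points in A against one point in B gives a 2 x 1 matrix.
D = pairwise_distances([[0, 0], [3, 4]], [[0, 0]])
```

Every entry depends only on one row of A and one row of B, so all m x n entries can be computed independently.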

CPU vs GPU

Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence

Pairwise Distance Computation on the GPU
[Figure: matrices A and B, output C, and Shared Memory]

Timings on TRECVid 2008 System
[Figure: bar chart of milliseconds on a log axis (1 to 10000), GPU + CPU vs CPU, for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction and Pairwise Distance. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size | Histogramming Method | Rate of Detection: Low | Medium | High
1000 | Raw | 0.681 | 0.804 | 0.844
1000 | Norm | 0.708 | 0.799 | 0.840
1000 | Mt Inf | 0.594 | 0.804 | 0.848
500 | Raw | 0.675 | 0.792 | 0.833
500 | Norm | 0.701 | 0.791 | 0.825
500 | Mt Inf | 0.626 | 0.783 | 0.819
200 | Raw | 0.671 | 0.772 | 0.811
200 | Norm | 0.701 | 0.779 | 0.818
200 | Mt Inf | 0.614 | 0.720 | 0.756
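For orientation, the "Raw" and "Norm" rows presumably refer to raw-count and normalized bag-of-words histograms over the dictionary; that reading is an assumption, and the "Mt Inf" variant is not covered here. A minimal sketch with a hypothetical function name:

```python
def bag_of_words_histogram(word_ids, dictionary_size, normalize=False):
    """Count how often each dictionary entry occurs in one video segment.

    word_ids are the indices of the nearest dictionary center for each
    local feature. With normalize=True the counts are divided by their sum.
    """
    histogram = [0.0] * dictionary_size
    for word in word_ids:
        histogram[word] += 1.0
    if normalize:
        total = sum(histogram)
        if total > 0:
            histogram = [count / total for count in histogram]
    return histogram
```

Normalizing removes the dependence on how many interest points were detected, which may be why the two variants behave differently across the Low, Medium and High detection rates.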

Results (2009)
(2008 results in parentheses)

Event | True Positives | False Alarm | Miss | Min DCR
Pointing | 13 (57) | 225 (2505) | 1050 | 1.006
Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060
Person Runs | 1 (0) | 38 (314) | 106 | 0.997
Object Put | 1 (21) | 190 (2703) | 620 | 1.020

Conclusions

• Some practical problems are very hard to solve
• Fusion of many different approaches
• Take advantage of all available hardware
  • Cloud
  • GPUs
• Contests/Evaluations Experience
  • Working with realistic data
  • Engineering / Programming
  • Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012
• Task: Semantic indexing (SIN)
• Task: Known-item search (KIS)
• Task: Interactive surveillance event detection (SED)
• Task: Instance search (INS)
• Task: Multimedia event detection (MED)
• Task: Multimedia event recounting (MER)

Pascal Visual Object Classes
• Classification/detection
• Segmentation
• Person Layout
• Action Classification

ImageNet Large Scale Visual Recognition
• 10000 Classes


Semantic Gap in Multimedia

• Retrieval: given a description, retrieve all "relevant" content from a database
• Parsing: given an input, formulate a natural language description
• Subtasks
  • Detection (find "things")
  • Segmentation (find the boundaries of "things")
  • Recognition (assign category)

Multimedia Analysis Competitions and Evaluations

• Moderate size dataset
  • Training set with labels
  • Evaluation set without labels
• Constrained problem
  • Detect well defined actions
  • Detect words or concepts
• Well defined metric
• Challenges
  • Algorithm design
  • Computation

Star Challenge

PART I: visual data processing

What is Star Challenge?
• Competition to Develop World's Next-Generation Multimedia Search Technology
• Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
• A real-world computer vision task which requires large amounts of computation power

But low rewards
• 56 teams from 17 countries → Round 1 → 8 teams → Round 2 → 7 teams → Round 3 → 5 teams → Grand Final in Singapore
• No rewards at the intermediate stages
• Only one team can win US$100,000

But we have a team with no fears…

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi Vong Xu Mert Dennis Jason Andrey Yuxiao

Let's go over our experience and stories…

Outline

• Problems of Visual Retrieval
• Data
• Features
• Algorithms
• Results (first 3 rounds)

3 Audio Retrieval Tasks

Task | Query | Target | Metric
AT1 | IPA sequence | Segments that contain the query IPA sequence, regardless of language | Mean Average Precision
AT2 | An utterance spoken by different speakers | All segments that contain the query word/phrase/sentence, regardless of spoken language | Mean Average Precision
AT3 | No queries | Extract all recurrent segments which are at least 1 second in length | F-measure

Data set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3

Xiaodan will talk about this part…

3 Video Retrieval Tasks

Task | Query | Target | Criteria | Metric | Data Set
VT1 | Single image (20 queries) | (Short) video segments: all the similar segments | "Visually similar" | Mean Average Precision | 20 categories, multiple labels possible
VT2 | Short video shot (<10 s) (20 queries) | (Long) video segments: all the similar segments | Perceptually similar | Mean Average Precision | 10 categories, multiple labels possible
VT3 | Videos with sound (3~10 s), order of 10K | Category number | Learning the common visual characteristics | Classification accuracy | 10 (20) categories, including one "others" category

20 VT1 Categories

100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices, including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane

10 Categories for VT2

201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)
5 queries will be given, and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given, and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2
• 31 MPEG videos, ~20 hours
• 17289 frames for VT1 in total
• 40994 frames for VT2 in total
  • 32508 pseudo key frames, 8486 real key frames

Evaluation Video Data of Round 3
• Video files: 27 MPEG-1 files (13 hours of video/audio in total)
• Key frames for VT1: 10580 jpg files
• Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
• Video: 352x288

Computation Power

• Workstations in IFP
  • 10 servers, 2~4 CPUs each, 36 CPUs in total
• IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
• CSL Cluster
  • Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
  • Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node
• TeraGrid

Time Cost for Video Tasks

• Data decompression: 15 minutes
• Video format conversion: 2 hours
• Video segmentation (for VT2): 40 minutes
• Sound track extraction: 30 minutes
• Feature extraction
  • Global Feature 2: 2 hours (C)
  • Global Feature 1: 2 hours (C)
  • Patch-based Feature 1: 2 hours (C)
  • Patch-based Feature 2: 5 hours (MATLAB)
  • Semantic Feature 1: 24 hours (MATLAB)
  • Semantic Feature 2: 3 hours (C)
  • Semantic Feature 3: 4 hours (C)
  • Motion Feature 1: 24 hours (MATLAB)
  • Motion Feature 2: 3 hours on t-Illiac
• Classifier training
  • Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
  • Classifier 2: 20 minutes
  • Classifier 3: less than 10 minutes

Possible Accelerations for Video

• MATLAB code to C
• Parallel computing
• GPU acceleration
  • Patch based features
• Load time is the major issue
  • Extracting all the features after one load

Features for Round 2: VT1

• Image features
  • SIFT
  • HOG
  • GIST
  • APC
  • LBP
  • Color, texture, etc.
• Semantic feature

Features for Round 2: VT2

• Character detector
  • Harris corner
  • Morphological operations
• Optical flow
  • Lucas-Kanade on spatial intensity gradient
• Gender recognition
  • SODA-boost based
• Motion History Image
• Spatial interest points

GUFE: Grand Unified Feature Extractor

• Designed by Dennis
• Collects features generated by team members into one standard format
• Retrieval by query expansion based on NN
• Feature normalization/combination
• Result visualization

Observations

1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)

Algorithms

Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results

Algorithms

Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
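Step 2 can be illustrated with the commonly used mean-only MAP adaptation of a GMM. The slides do not say which parameters were adapted, so mean-only adaptation, the relevance factor, the one-dimensional patches and the function names are all assumptions of this sketch.

```python
import math


def gaussian_pdf(x, mean, variance):
    return (math.exp(-0.5 * (x - mean) ** 2 / variance)
            / math.sqrt(2 * math.pi * variance))


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Adapt UBM means toward the patches of one image (1-D features)."""
    k = len(means)
    counts = [0.0] * k   # soft count of patches per component
    sums = [0.0] * k     # responsibility-weighted sum of patches
    for x in patches:
        likelihoods = [w * gaussian_pdf(x, m, v)
                       for w, m, v in zip(weights, means, variances)]
        total = sum(likelihoods)
        for j in range(k):
            responsibility = likelihoods[j] / total
            counts[j] += responsibility
            sums[j] += responsibility * x
    adapted = []
    for j in range(k):
        alpha = counts[j] / (counts[j] + relevance)
        data_mean = sums[j] / counts[j] if counts[j] > 0 else means[j]
        adapted.append(alpha * data_mean + (1 - alpha) * means[j])
    return adapted
```

Components that see few patches stay close to the UBM, while well-supported components move toward the image's own statistics, so every image is described by a model of the same size.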

VT1 Performance (2 in 8)

Category | MAP
• 101 Crowd (>10 people) | 0.8419
• 102 Building with sky as backdrop, clearly visible | 0.977
• 103 Mobile devices, including handphone/PDA | 0.028
• 107 Person using computer, both visible | 0.2281
• 109 Company trademark, including billboard, logo | 0.96
• 112 Close-up of hand, e.g. using mouse, writing, etc. | 0.4584
• 113 Business meeting (>2 people), mostly seated down, table visible | 0.0644
• 115 Food on dishes, plates | 0.2285
• 116 Face close-up, occupying about 3/4 of screen, frontal or side | 0.9783
• 117 Traffic scene, many cars, trucks, road visible | 0.2901
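For reference, average precision over a ranked result list, and its mean over queries, can be computed as below. Several variants of the metric exist; this sketch normalizes by the number of relevant items found within the cutoff, which is an assumption and may differ from the organizers' exact definition.

```python
def average_precision(relevance_flags, cutoff=None):
    """Average of precision values at the ranks of relevant results.

    relevance_flags: list of booleans in ranked order (True = relevant).
    cutoff: optional number of top results to evaluate (e.g. R=20).
    """
    flags = relevance_flags[:cutoff] if cutoff else relevance_flags
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists, cutoff=None):
    """Mean of average precision over several queries."""
    scores = [average_precision(flags, cutoff) for flags in ranked_lists]
    return sum(scores) / len(scores)
```

A score of 1 for a category then means that every result returned within the cutoff was relevant.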

VT2 Performance (1 in 8)

Category | MAP
• 202 Talking face with introductory caption | 0.8432
• 206 Static or minute camera movement, people(s) walking, legs visible | 0.0581
• 207 Large camera movement, panning left/right, top/down of a scene | 0.7789
• 208 Movie ending credit | 0.2782
• 209 Woman monologue (Zhen) | 0.9756


Page 10: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Multimedia Analysis Competitions and Evaluations

Moderate size dataset

Training set with labels

Evaluation set without labels

Constrained problem

Detect well-defined actions

Detect words or concepts

Well-defined metric

Challenges

Algorithm design

Computation

Star Challenge

PART I: visual data processing

What is Star Challenge?

Competition to Develop World's Next-Generation Multimedia Search Technology

Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore

A real-world computer vision task that requires large amounts of computation power

But low rewards

56 teams from 17 countries in Round 1, 8 teams in Round 2, 7 teams in Round 3, 5 teams in the Grand Final in Singapore

No rewards along the way

Only one team can win US$100,000

But we have a team with no fears…

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

Let's go over our experience and stories…

Outline

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks

| Task | Query | Target | Metric | Data Set |
|---|---|---|---|---|
| AT1 | IPA sequence | Segments that contain the query IPA sequence, regardless of language | Mean Average Precision | 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3 |
| AT2 | An utterance spoken by different speakers | All segments that contain the query word/phrase/sentence, regardless of spoken language | | |
| AT3 | No queries | Extract all recurrent segments that are at least 1 second in length | F-measure | |

Xiaodan will talk about this part…

3 Video Retrieval Tasks

| Task | Query | Target | Criteria | Metric | Data Set |
|---|---|---|---|---|---|
| VT1 | Single image (20 queries) | (Short) video segs: all the similar segs | "Visually similar" | Mean Average Precision | 20 categories, multiple labels possible |
| VT2 | Short video shot (<10 s) (20 queries) | (Long) video segs: all the similar segs | Perceptually similar | | 10 categories, multiple labels possible |
| VT3 | Videos with sound (3~10 s), order of 10K | Category number | Learning the common visual characteristics | Classification accuracy | 10 (20) categories, including one "others" category |
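Mean Average Precision, the metric named for the retrieval tasks, can be sketched as follows. This is the common textbook definition and is only an assumption about how the organizers scored submissions; the function names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A system that ranks every relevant segment ahead of every irrelevant one scores 1 for that query, which matches the perfect per-category scores reported in some result tables.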

20 VT1 Categories

100 Not-Applicable: none of the labels

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

103 Mobile devices, including handphone/PDA

104 Flag

105 Electronic chart, e.g. stock charts, airport departure chart

106 TV chart: overlay including graphs, text, PowerPoint style

107 Person using computer, both visible

108 Track and field, sports

109 Company trademark, including billboard, logo

110 Badminton court, sports

111 Swimming pool, sports

112 Close-up of hand, e.g. using mouse, writing, etc.

113 Business meeting (>2 people), mostly seated down, table visible

114 Natural scene, e.g. mountain, trees, sea, no people

115 Food on dishes, plates

116 Face close-up, occupying about 3/4 of screen, frontal or side

117 Traffic scene, many cars, trucks, road visible

118 Boat/Ship, over sea, lake

119 PC webpages, screen of PC visible

120 Airplane

10 Categories for VT2

201 People entering/exiting door/car

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle, looking outside

205 Large camera movement, tracking an object, person, car, etc.

206 Static or minute camera movement, people(s) walking, legs visible

207 Large camera movement, panning left/right, top/down of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

107 Person using computer, both visible

112 Close-up of hand, e.g. using mouse, writing, etc.

116 Face close-up, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)

5 queries will be given, and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given, and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total: 32508 pseudo key frames + 8486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)

Key frames for VT1: 10580 jpg files

Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)

Video: 352x288

Computation Power

Workstations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes

Video Format Conversion: 2 hours

Video Segmentation (for VT2): 40 minutes

Sound Track Extraction: 30 minutes

Feature Extraction

Global Feature 2: 2 hours (C)

Global Feature 1: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (MATLAB)

Semantic Feature 1: 24 hours (MATLAB)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (MATLAB)

Motion Feature 2: 3 hours on t-Illiac

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video

MATLAB code to C

Parallel computing

GPU acceleration

Patch-based features

Load time is the major issue

Extracting all the features after one load

Features for Round 2, VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color, texture, etc.

Semantic Feature

Features for Round 2, VT2

Character Detector

Harris corner

Morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature Normalization/Combination

Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
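The query-expansion steps can be sketched roughly as below. The nearest-neighbor scoring rule, the Euclidean distance, and all names (`expand_query`, `search`, the data layouts) are assumptions made for illustration; the slides only state that retrieval was "based on NN".

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def expand_query(query_vec, category, dev_set):
    """dev_set: list of (feature_vector, set_of_category_labels).
    The expanded query is the original image plus every development
    image labeled with the same category."""
    return [query_vec] + [f for f, labels in dev_set if category in labels]


def search(expanded, eval_set, top_k):
    """eval_set: dict item_id -> feature_vector.
    Rank evaluation items by distance to their nearest expanded-query member."""
    scored = []
    for item_id, feat in eval_set.items():
        scored.append((min(euclidean(feat, q) for q in expanded), item_id))
    scored.sort()
    return [item_id for _, item_id in scored[:top_k]]
```

In a real system the distances between evaluation and development data would be computed once up front, which is what step 0 describes.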

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pairwise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance

VT1 Performance (2 in 8)

| Category | MAP |
|---|---|
| 101 Crowd (>10 people) | 0.8419 |
| 102 Building with sky as backdrop, clearly visible | 0.977 |
| 103 Mobile devices, including handphone/PDA | 0.028 |
| 107 Person using computer, both visible | 0.2281 |
| 109 Company trademark, including billboard, logo | 0.96 |
| 112 Close-up of hand, e.g. using mouse, writing, etc. | 0.4584 |
| 113 Business meeting (>2 people), mostly seated down, table visible | 0.0644 |
| 115 Food on dishes, plates | 0.2285 |
| 116 Face close-up, occupying about 3/4 of screen, frontal or side | 0.9783 |
| 117 Traffic scene, many cars, trucks, road visible | 0.2901 |

VT2 Performance (1 in 8)

| Category | MAP |
|---|---|
| 202 Talking face with introductory caption | 0.8432 |
| 206 Static or minute camera movement, people(s) walking, legs visible | 0.0581 |
| 207 Large camera movement, panning left/right, top/down of a scene | 0.7789 |
| 208 Movie ending credit | 0.2782 |
| 209 Woman monologue (Zhen) | 0.9756 |

Performance of Round 3 (1 in 7)

Task 2 (VT1)

| Target | Estimated MAP (R=20) |
|---|---|
| 101 Crowd (>10 people) | 0.64 |
| 102 Building with sky as backdrop, clearly visible | 1 |
| 107 Person using computer, both visible | 0.7 |
| 112 Close-up of hand, e.g. using mouse, writing, etc. | 0.527 |
| 116 Face close-up, occupying about 3/4 of screen, frontal or side | 1 |

Task 3 (AT1 + VT2), video retrieval target, R=20

202 Face with introductory caption: VT2 only 1, AT1 + VT2 0.03

209 Woman monologue: VT2 only 0.35, AT1 + VT2 0.1

201 People entering door: NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task: Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

List of Events

1. Cell to ear

2. Embrace

3. Object put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Detection of Opposing Flow Event

Regional Averaging

Door Open/Close Information

Event Detections

Thresholding Rule

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates libraries

Data flow

Lazy pull

Per-frame referencing

Caches (lots of them)

Motivations

Most operations are highly local

Applications with real-time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were counted as images, YouTube replicates the entire Facebook image DB every ~5 days
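The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption, not stated on the slide

# Seconds of video uploaded per wall-clock minute, converted to frames,
# then scaled to a full day (60 minutes x 24 hours).
frames_per_minute = HOURS_UPLOADED_PER_MINUTE * 3600 * FRAMES_PER_SECOND
frames_per_day = frames_per_minute * 60 * 24
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:.3g} frames/day, {days_to_match:.1f} days")
```

Under that assumption the uploads amount to roughly 3.1 billion frames per day, so matching 15 billion photos takes a little under five days. At 25 frames per second the same calculation gives closer to six days.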

Image/Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid: Video Computer Vision on Graphics Processors

Download: http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Figure: system diagram with blocks Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, and Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector (corners)

Dollar: Space Time Gabor (corners, periodic motion)

RSMB: Random Sampling of the Motion Boundary (motion)

[Figure: example detections for Laptev, Dollar, RSMB]

More is Good

[Figure: Interest Point Detection Rates]

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\ H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
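A minimal sketch of the motion history update: a pixel is set to τ wherever motion is detected in the current frame and otherwise decays by one per frame down to zero. The function name and the nested-list image representation are illustrative assumptions, not ViVid code.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step of the motion history image.

    prev_mhi:    2D list holding H(x, y, t-1)
    motion_mask: 2D list of 0/1 values holding D(x, y, t)
    tau:         value assigned to pixels where motion was just detected
    """
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]
```

Each pixel depends only on its own previous value and its own mask value, which is the kind of local, independent computation that maps well to one GPU thread per pixel.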

[Figure: bar chart of milliseconds per frame for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin); axis 0 to 300; bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient / optical flow based on the direction and magnitude

• Normalize over neighboring regions
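The three steps above can be sketched for a single cell and a single block. The bin count, the use of signed directions over the full circle, and L2 block normalization are assumptions chosen for illustration; the slides do not specify them.

```python
import math


def cell_histogram(vectors, n_bins=8):
    """Magnitude-weighted orientation histogram of the (dx, dy) vectors
    (image gradients or optical flow) that fall inside one local region."""
    hist = [0.0] * n_bins
    for dx, dy in vectors:
        angle = math.atan2(dy, dx) % (2 * math.pi)   # direction in [0, 2*pi)
        b = min(int(angle / (2 * math.pi) * n_bins), n_bins - 1)
        hist[b] += math.hypot(dx, dy)                # vote weighted by magnitude
    return hist


def normalize_block(cell_hists, eps=1e-6):
    """Concatenate the histograms of neighboring cells and L2-normalize
    them together."""
    block = [v for h in cell_hists for v in h]
    norm = math.sqrt(sum(v * v for v in block) + eps ** 2)
    return [v / norm for v in block]
```

The same code serves both descriptors: feeding image gradients gives a HOG-style cell, feeding optical flow vectors gives the flow counterpart.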

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update the center to be the mean of the associated data points
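The two alternating steps of Lloyd's algorithm can be sketched in plain Python. The random initialization from the data, the fixed iteration count, and the handling of empty clusters are assumptions made here, not details from the slides.

```python
import random


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def lloyd(points, k, n_iter=20, seed=0):
    """Return (centers, assignments) after n_iter rounds of Lloyd's algorithm."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: find the closest center for each data point.
        assign = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assign
```

The assignment step is a pairwise distance computation followed by an argmin per point, which is why the rest of the talk focuses on making that operation fast.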

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
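With 1 million features and 1000 centers the full distance matrix has 10^9 entries, about 4 GB if each is stored as a 4-byte float (the element size is an assumption). The sketch below shows the idea of reducing as computation proceeds: distances are produced one tile at a time and each tile is immediately collapsed to its per-row argmin. It is a CPU illustration in plain Python with made-up names, not the CUDA implementation.

```python
def nearest_centers(features, centers, chunk_size=1024):
    """Return the index of the closest center for each feature, never holding
    more than chunk_size x len(centers) distances at once."""
    labels = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance tile for this chunk only.
        tile = [[sum((a - b) ** 2 for a, b in zip(f, c)) for c in centers]
                for f in chunk]
        # Reduction: keep just the argmin of each row, then drop the tile.
        labels.extend(min(range(len(centers)), key=row.__getitem__)
                      for row in tile)
    return labels
```

The output shrinks from one value per (feature, center) pair to one integer per feature, which fits in memory comfortably.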

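Clustering Helps

Pairwise Distance Implementation

[Figure: bar chart comparing the CUDA and C implementations; axis 0 to 6000]

Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & d(a_m, b_2) & \cdots & d(a_m, b_n)
\end{bmatrix}
$$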
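A small sketch of the pairwise distance computation, taking d to be the squared Euclidean distance. That choice is an assumption, since the slide leaves d unspecified. Expanding the square turns the dominant cost into inner products between rows of A and rows of B, which has the same structure as a matrix multiplication.

```python
def pairwise_sq_dist(A, B):
    """Return C with C[i][j] = squared Euclidean distance between A[i] and B[j],
    computed as |a|^2 - 2 a.b + |b|^2."""
    a_norm = [sum(x * x for x in a) for a in A]
    b_norm = [sum(x * x for x in b) for b in B]
    return [[a_norm[i] - 2 * sum(x * y for x, y in zip(a, b)) + b_norm[j]
             for j, b in enumerate(B)]
            for i, a in enumerate(A)]
```

Every output entry is independent of the others, so the work divides naturally into tiles of the output matrix.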

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Figure, three frames: inputs A and B, output C, and shared memory]

Timings on TRECVid 2008 System

[Figure: log-scale bar chart of milliseconds (1 to 10000) per stage (Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance) for GPU + CPU vs. CPU; bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

| Dictionary Size | Histogramming Method | Rate of Detection: Low | Medium | High |
|---|---|---|---|---|
| 1000 | Raw | 0.681 | 0.804 | 0.844 |
| 1000 | Norm | 0.708 | 0.799 | 0.840 |
| 1000 | Mt Inf | 0.594 | 0.804 | 0.848 |
| 500 | Raw | 0.675 | 0.792 | 0.833 |
| 500 | Norm | 0.701 | 0.791 | 0.825 |
| 500 | Mt Inf | 0.626 | 0.783 | 0.819 |
| 200 | Raw | 0.671 | 0.772 | 0.811 |
| 200 | Norm | 0.701 | 0.779 | 0.818 |
| 200 | Mt Inf | 0.614 | 0.720 | 0.756 |
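For orientation, here is a sketch of how a codeword histogram is typically built once each feature has been assigned to its nearest dictionary center. Reading "Raw" as plain counts and "Norm" as counts divided by their sum is an assumption about the benchmark; the "Mt Inf" variant is not sketched because the slides do not describe it.

```python
def codeword_histogram(assignments, dictionary_size, normalize=False):
    """Count how often each dictionary entry was the nearest center.

    assignments:     iterable of center indices, one per extracted feature
    dictionary_size: number of centers in the dictionary
    normalize:       if True, divide by the total so the bins sum to 1
    """
    hist = [0.0] * dictionary_size
    for idx in assignments:
        hist[idx] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [h / total for h in hist]
    return hist
```

Normalizing removes the dependence on how many interest points a clip happens to produce, which is one plausible reason the two variants behave differently at low detection rates.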

Results (2009)

| Event | True Positives | False Alarm | Miss | Min DCR |
|---|---|---|---|---|
| Pointing | 13 (57) | 225 (2505) | 1050 | 1.006 |
| Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060 |
| Person Runs | 1 (0) | 38 (314) | 106 | 0.997 |
| Object Put | 1 (21) | 190 (2703) | 620 | 1.020 |

Values in parentheses are the 2008 results.

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations: Experience

Working with realistic data

Engineering, programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task: Semantic indexing (SIN)

Task: Known-item search (KIS)

Task: Interactive surveillance event detection (SED)

Task: Instance search (INS)

Task: Multimedia event detection (MED)

Task: Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10000 Classes

Page 11: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Star Challenge Star Challenge

PART I visual data processing PART I visual data processing

What is Star Challenge What is Star Challenge

Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology

Hosted by the Agency for Science Technology and Research (ASTAR) Singapore

A real-world computer vision task which requires large amounts of computation power

Competition to Develop Worldrsquos Next-Generation Multimedia Search Technology

Hosted by the Agency for Science Technology and Research (ASTAR) Singapore

A real-world computer vision task which requires large amounts of computation power

But low rewards But low rewards

56 teams

from 17 countries

Round 1

8 teams

Round 2

7 teams

Round 3

5 teams Grand Final

in Singapore

No rewards No rewards No rewards

No rewards

Only one team

can win

US$100000

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

But we have a team with no fearshellip But we have a team with no fearshellip

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

But we have a team with no fearshellip But we have a team with no fearshellip

Letrsquos go over our experience and

storieshellip

Letrsquos go over our experience and

storieshellip

Outlines Outlines

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks 3 Audio Retrieval Tasks Task Query Target Metric Data Set

AT1 IPA sequence segments that contain the

query IPA sequence

regardless of its languages

Mean Average Precision

25 hours

monolingual

database in

round1

13 hours

multilingual

database in

round3

AT2 an utterance spoken

by different speakers all segments that contain the

query wordphrasesentence

regardless of its spoken

languages

AT3 No queries extract all recurrent segments

which are at least 1 second in

length

F-measure

Xiaodan will talk about this parthelliphellip

3 Video Retrieval Tasks 3 Video Retrieval Tasks

Task Query Target Criteria Metric Data Set

VT1 Single

Image

20

queries

(short)

Video

Segs

All the

similar

Segs

ldquovisually

similarrdquo

Mean Average Precision

20 categories

multiple labels

possible

VT2 Short

Video

Shot

(lt10s)

20

queries

(long)

Video

Segs

All the

similar

Segs

Perceptually

Similar

10 categories

multiple labels

possible

VT3 Videos

with

sound

(3~10s)

Order

of 10K

Category

number

learning the

common

visual

characteristics

Classification accuracy 10(20)

categories

including one

ldquoothersrdquo

category

20 VT1 Categories 20 VT1 Categories

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

10 Categories for VT2 10 Categories for VT2

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3 5 Categories for VT3

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total: 32508 pseudo key frames + 8486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)

Key frames for VT1: 10580 jpg files

Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)

Video: 352x288

Computation Powers

Workstations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes

Video Format Conversion: 2 hours

Video Segmentation (for VT2): 40 minutes

Sound Track Extraction: 30 minutes

Feature Extraction

Global Feature 2: 2 hours (C)

Global Feature 1: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (MATLAB)

Semantic Feature 1: 24 hours (MATLAB)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (MATLAB)

Motion Feature 2: 3 hours on t-Illiac

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video

Port MATLAB code to C

Parallel computing

GPU acceleration

Patch-based features

Load time is the major issue

Extract all the features after one load
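
The last two points suggest decoding each frame once and running every extractor on it while it is still in memory. A minimal sketch of that idea, assuming toy frames stored as nested lists; the extractor functions and `toy_loader` are hypothetical stand-ins, not the team's actual features or loader:

```python
from typing import Callable, Dict, Iterable, List

Frame = List[List[int]]                      # toy grayscale frame: rows of pixel values
Extractor = Callable[[Frame], List[float]]


def mean_intensity(frame: Frame) -> List[float]:
    """Hypothetical global feature: average pixel value."""
    pixels = [p for row in frame for p in row]
    return [sum(pixels) / len(pixels)]


def intensity_range(frame: Frame) -> List[float]:
    """Hypothetical global feature: minimum and maximum pixel value."""
    pixels = [p for row in frame for p in row]
    return [float(min(pixels)), float(max(pixels))]


def extract_all(frames: Iterable[Frame],
                extractors: Dict[str, Extractor]) -> Dict[str, List[List[float]]]:
    """Run every extractor on each frame while that frame is in memory."""
    features: Dict[str, List[List[float]]] = {name: [] for name in extractors}
    for frame in frames:                     # each frame is loaded exactly once
        for name, extract in extractors.items():
            features[name].append(extract(frame))
    return features


def toy_loader():
    """Stand-in for an expensive decode step (hypothetical)."""
    yield [[0, 10], [20, 30]]
    yield [[5, 5], [5, 5]]


result = extract_all(toy_loader(), {"mean": mean_intensity, "range": intensity_range})
```

With this layout the load cost is paid once per frame regardless of how many feature types are added.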

Features for Round 2 - VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color, texture, etc.

Semantic Feature

Features for Round 2 - VT2

Character Detector: Harris corner, morphological operations

Optical Flow: Lucas-Kanade on spatial intensity gradient

Gender recognition: SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by query expansion based on nearest neighbors (NN)

Feature normalization/combination

Result visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly, e.g. VT1 101 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
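
The query-expansion steps can be sketched with NumPy as follows. This is an illustration under assumptions, not the team's code: features are plain vectors, the score of an evaluation item is its squared Euclidean distance to the nearest member of the expanded query, and the function name and arguments are hypothetical. The preprocessing step (step 0) is not shown.

```python
import numpy as np


def expand_and_search(query_vec, query_category, dev_feats, dev_labels,
                      eval_feats, top_k):
    """Return indices of the top_k evaluation items for an expanded query."""
    # Step 1: expand the query with all development items of the same category.
    same_category = dev_feats[dev_labels == query_category]
    expanded = np.vstack([query_vec[None, :], same_category])

    # Step 2: score each evaluation item by its distance to the nearest
    # member of the expanded query (nearest-neighbor rule, an assumption).
    diff = eval_feats[:, None, :] - expanded[None, :, :]
    sq_dists = np.sum(diff * diff, axis=2)          # shape (n_eval, n_expanded)
    scores = sq_dists.min(axis=1)

    # Output: the top_k closest evaluation items, best first.
    return np.argsort(scores, kind="stable")[:top_k]


dev_feats = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 1.0]])
dev_labels = np.array([101, 102, 101])
eval_feats = np.array([[9.0, 9.0], [0.0, 0.9], [5.0, 5.0], [0.2, 0.0]])
top = expand_and_search(np.array([0.0, 0.5]), 101, dev_feats, dev_labels,
                        eval_feats, top_k=2)
```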

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
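
Step 2 can be illustrated with the common mean-only MAP adaptation of a GMM. The sketch below assumes a diagonal-covariance UBM and a relevance factor of 16; both are assumptions for illustration, and the slides do not say which parameters were adapted or how. The kernel and covariance normalization of step 3 are not shown.

```python
import numpy as np


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to one image.

    patches: (n, d) patch descriptors of one image
    weights: (K,), means: (K, d), variances: (K, d) UBM parameters
    relevance: relevance factor (assumed value, not from the slides)
    """
    # Log-density of every patch under every component: shape (n, K).
    diff = patches[:, None, :] - means[None, :, :]
    quad = np.sum(diff ** 2 / variances[None, :, :], axis=2)
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_post = np.log(weights)[None, :] - 0.5 * (quad + log_norm[None, :])

    # Posterior responsibility of each component for each patch.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Soft counts and per-component data means.
    n_k = post.sum(axis=0)
    data_mean = (post.T @ patches) / np.maximum(n_k, 1e-12)[:, None]

    # Interpolate between the data mean and the UBM mean.
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * data_mean + (1.0 - alpha) * means


ubm_means = np.array([[0.0, 0.0], [10.0, 10.0]])
adapted = map_adapt_means(np.ones((16, 2)), np.array([0.5, 0.5]),
                          ubm_means, np.ones((2, 2)))
```

Components that see few patches stay close to the UBM, which keeps the per-image model stable even when an image contributes only a handful of patches.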

VT1 Performance (2nd of 8)

Category | MAP
101 Crowd (>10 people) | 0.8419
102 Building with sky as backdrop, clearly visible | 0.977
103 Mobile devices, including handphone/PDA | 0.028
107 Person using computer, both visible | 0.2281
109 Company trademark, including billboard, logo | 0.96
112 Close-up of hand, e.g. using mouse, writing, etc. | 0.4584
113 Business meeting (>2 people), mostly seated down, table visible | 0.0644
115 Food on dishes, plates | 0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side | 0.9783
117 Traffic scene, many cars, trucks, road visible | 0.2901
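
For reference, mean average precision (MAP) can be computed as below. This is the common textbook definition; the competition's exact scoring rules, such as how a result cutoff is applied, are not given here, so treat the details as assumptions.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Average precision of one ranked result list.

    ranked_relevance: iterable of 0/1 flags, best-ranked result first.
    num_relevant: total number of relevant items; defaults to the number
    of relevant items found in the list (an assumption of this sketch).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank     # precision at this recall point
    total = num_relevant if num_relevant is not None else hits
    return precision_sum / total if total else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over several queries."""
    return sum(average_precision(run) for run in runs) / len(runs)


ap = average_precision([1, 0, 1, 0])
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```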

VT2 Performance (1st of 8)

Category | MAP
202 Talking face with introductory caption | 0.8432
206 Static or minute camera movement, people walking, legs visible | 0.0581
207 Large camera movement, panning left/right, top/down of a scene | 0.7789
208 Movie ending credit | 0.2782
209 Woman monologue (Zhen) | 0.9756

Performance of Round 3 (1st of 7)

Task 2 (VT1)

Target | Estimated MAP (R=20)
101 Crowd (>10 people) | 0.64
102 Building with sky as backdrop, clearly visible | 1
107 Person using computer, both visible | 0.7
112 Close-up of hand, e.g. using mouse, writing, etc. | 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side | 1

Task 3 (AT1 + VT2)

Retrieval target (video) | VT2 only | AT1 + VT2 (R=20)
202 Face with introductory caption | 1 | 0.03
209 Woman monologue | 0.35 | 0.1
201 People entering door | NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Detection of Opposing Flow Event

[Diagram blocks: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
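
The ~5 day figure can be checked with a short calculation. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20      # from the slide
FACEBOOK_PHOTOS = 15e9              # from the slide
ASSUMED_FPS = 30                    # assumption: not stated on the slide

# Frames of video uploaded during one minute of wall-clock time.
frames_per_minute = HOURS_UPLOADED_PER_MINUTE * 3600 * ASSUMED_FPS
# Frames uploaded during one day.
frames_per_day = frames_per_minute * 60 * 24
# Days needed for the uploaded frames to match the photo count.
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{days_to_match:.1f} days")
```

Under that assumption the result is a little under five days, consistent with the slide's estimate.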

Image / Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid - Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[System diagram blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector (corners)

Dollar: Space-Time Gabor (corners, periodic motion)

RSMB: Random Sampling of the Motion Boundary (motion)

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al., 2003)

Motion History Images (Bobick & Davis, 2001)

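$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$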
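
The motion history update can be written in a few lines of NumPy. In this sketch the binary motion indicator D is obtained by thresholded frame differencing, which is an assumption; the function names and the threshold are hypothetical.

```python
import numpy as np


def motion_mask(frame, prev_frame, threshold):
    """Binary motion indicator D from frame differencing (assumed choice)."""
    change = np.abs(frame.astype(float) - prev_frame.astype(float))
    return change > threshold


def update_mhi(prev_mhi, mask, tau):
    """One motion history update: tau where motion occurred, else decay by 1."""
    decayed = np.maximum(prev_mhi - 1, 0)
    return np.where(mask, tau, decayed)


prev_mhi = np.array([[0, 3], [1, 5]])
mask = np.array([[True, False], [False, True]])
mhi = update_mhi(prev_mhi, mask, tau=5)
```

Pixels that moved most recently hold the largest values, so the image encodes both where and how recently motion occurred.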

[Bar chart, milliseconds per frame (axis 0 to 300): C, CUDA (distance), CUDA (distance + argmin), CUDA (feature + distance + argmin)]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient / optical flow based on the direction and magnitude

• Normalize over neighboring regions
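
The three steps can be sketched for image gradients as follows. The cell size, number of bins, unsigned orientation, hard bin assignment and 2x2 block L2 normalization are assumptions chosen for illustration, not parameters taken from the slides; the optical flow variant would histogram flow vectors instead of gradients.

```python
import numpy as np


def cell_histograms(image, cell=8, bins=9):
    """Magnitude-weighted orientation histograms over non-overlapping cells."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0       # unsigned orientation
    bin_idx = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            region = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[region].ravel(),
                                     weights=magnitude[region].ravel(),
                                     minlength=bins)
    return hist


def block_normalize(hist, eps=1e-6):
    """L2-normalize each 2x2 block of neighboring cells."""
    rows, cols, bins = hist.shape
    out = np.zeros((rows - 1, cols - 1, 4 * bins))
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = hist[r:r + 2, c:c + 2].ravel()
            out[r, c] = block / np.sqrt(np.sum(block ** 2) + eps ** 2)
    return out


ramp = np.tile(np.arange(16.0), (16, 1))    # intensity increases left to right
hist = cell_histograms(ramp)
descriptor = block_normalize(hist)
```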

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update each center to be the mean of its associated data points
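
A minimal NumPy sketch of Lloyd's algorithm, alternating the two steps for a fixed number of iterations. The initial centers are supplied by the caller and an empty cluster keeps its previous center; both are choices made for this illustration.

```python
import numpy as np


def assign(data, centers):
    """Index of the closest center for every data point."""
    sq_dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)


def lloyd_kmeans(data, init_centers, n_iter=10):
    """Run n_iter rounds of assignment and center update."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        labels = assign(data, centers)
        for k in range(len(centers)):
            members = data[labels == k]
            if len(members):                 # empty cluster: keep old center
                centers[k] = members.mean(axis=0)
    return centers, assign(data, centers)


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centers, labels = lloyd_kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
```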

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
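
With 1 million features and 1000 centers, the full distance matrix has 10^9 entries, about 4 GB in single precision. A sketch of reducing as the computation proceeds: distances are computed one chunk at a time and immediately reduced to the index of the nearest center, so only a chunk-sized slice of the matrix exists at once. This is a CPU illustration in NumPy with a hypothetical function name and chunk size, not the CUDA implementation.

```python
import numpy as np


def nearest_center_chunked(features, centers, chunk=4096):
    """Nearest-center index per feature without storing all distances."""
    center_sq = np.sum(centers ** 2, axis=1)
    labels = np.empty(len(features), dtype=np.int64)
    for start in range(0, len(features), chunk):
        block = features[start:start + chunk]
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b; the ||a||^2 term is
        # constant along each row, so it can be dropped for the argmin.
        scores = center_sq[None, :] - 2.0 * (block @ centers.T)
        labels[start:start + chunk] = scores.argmin(axis=1)   # the reduction
    return labels


rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 4))
cents = rng.normal(size=(3, 4))
chunked_labels = nearest_center_chunked(feats, cents, chunk=3)
```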

Clustering Helps

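Pairwise Distance Implementation

Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
$$

[Bar chart: C vs. CUDA implementation, axis 0 to 6000]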
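
For the common case where d is the squared Euclidean distance (an assumption; the slide leaves d general), the whole matrix reduces to one matrix product plus two vectors of squared norms. A NumPy sketch:

```python
import numpy as np


def pairwise_sq_dist(A, B):
    """Matrix of squared Euclidean distances between rows of A and rows of B."""
    a_sq = np.sum(A ** 2, axis=1)[:, None]      # shape (m, 1)
    b_sq = np.sum(B ** 2, axis=1)[None, :]      # shape (1, n)
    d2 = a_sq + b_sq - 2.0 * (A @ B.T)          # shape (m, n)
    return np.maximum(d2, 0.0)                  # clip tiny negative round-off


A = np.array([[0.0, 0.0], [3.0, 4.0]])
B = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = pairwise_sq_dist(A, B)
```

Casting the problem as a matrix product is what makes it compute bound and data local, the properties that suit GPUs.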

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram: matrices A, B and C, with shared memory used in the computation]

Timings on TRECVid 2008 System

[Bar chart, log scale from 1 to 10000 milliseconds, comparing GPU + CPU against CPU for each stage: Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance]

Benchmark

Dictionary Building Strategies

Rate of detection: Low, Medium, High

Dictionary Size | Histogramming Method | Low | Medium | High
1000 | Raw | 0.681 | 0.804 | 0.844
1000 | Norm | 0.708 | 0.799 | 0.840
1000 | Mt Inf | 0.594 | 0.804 | 0.848
500 | Raw | 0.675 | 0.792 | 0.833
500 | Norm | 0.701 | 0.791 | 0.825
500 | Mt Inf | 0.626 | 0.783 | 0.819
200 | Raw | 0.671 | 0.772 | 0.811
200 | Norm | 0.701 | 0.779 | 0.818
200 | Mt Inf | 0.614 | 0.720 | 0.756
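
As an illustration of the first two histogramming methods, the sketch below builds a bag-of-words histogram from quantized feature indices. Reading "Raw" as plain counts and "Norm" as counts divided by their sum is an assumption; the slides do not define the methods, and the third method is not covered here.

```python
import numpy as np


def bow_histogram(word_ids, dictionary_size, normalize=False):
    """Histogram of dictionary indices, optionally L1-normalized."""
    hist = np.bincount(word_ids, minlength=dictionary_size).astype(float)
    if normalize and hist.sum() > 0:
        hist /= hist.sum()
    return hist


ids = np.array([0, 2, 2, 3])
raw = bow_histogram(ids, dictionary_size=5)
norm = bow_histogram(ids, dictionary_size=5, normalize=True)
```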

Results (2009), with 2008 results in parentheses

Event | True Positives | False Alarms | Misses | Min DCR
Pointing | 13 (57) | 225 (2505) | 1050 | 1.006
Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060
Person Runs | 1 (0) | 38 (314) | 106 | 0.997
Object Put | 1 (21) | 190 (2703) | 620 | 1.020

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering, programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet: Large Scale Visual Recognition

10,000 classes

What is Star Challenge?

Competition to develop the world's next-generation multimedia search technology

Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore

A real-world computer vision task which requires large amounts of computation power

But low rewards

56 teams from 17 countries

Round 1: 8 teams advance

Round 2: 7 teams advance

Round 3: 5 teams advance to the Grand Final in Singapore

No rewards along the way

Only one team can win US$100,000

But we have a team with no fears…

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi Vong Xu Mert Dennis Jason Andrey Yuxiao

Let's go over our experience and stories…

Outline

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks

Task | Query | Target | Metric
AT1 | IPA sequence | Segments that contain the query IPA sequence, regardless of its language | Mean Average Precision
AT2 | An utterance spoken by different speakers | All segments that contain the query word/phrase/sentence, regardless of the spoken language | Mean Average Precision
AT3 | No queries | Extract all recurrent segments which are at least 1 second in length | F-measure

Data set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3

Xiaodan will talk about this part…

3 Video Retrieval Tasks

Task | Query | Target | Criteria | Metric | Data Set
VT1 | Single image; 20 queries | (Short) video segs; all the similar segs | "Visually similar" | Mean Average Precision | 20 categories, multiple labels possible
VT2 | Short video shot (<10 s); 20 queries | (Long) video segs; all the similar segs | Perceptually similar | Mean Average Precision | 10 categories, multiple labels possible
VT3 | Videos with sound (3~10 s); order of 10K | Category number | Learning the common visual characteristics | Classification accuracy | 10 (20) categories, including one "others" category

20 VT1 Categories

100 Not applicable: none of the labels

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

103 Mobile devices, including handphone/PDA

104 Flag

105 Electronic chart, e.g. stock charts, airport departure chart

106 TV chart overlay, including graphs, text, PowerPoint style

107 Person using computer, both visible

108 Track and field, sports

109 Company trademark, including billboard, logo

110 Badminton court, sports

111 Swimming pool, sports

112 Close-up of hand, e.g. using mouse, writing, etc.

113 Business meeting (>2 people), mostly seated down, table visible

114 Natural scene, e.g. mountain, trees, sea, no people

115 Food on dishes, plates

116 Face close-up, occupying about 3/4 of screen, frontal or side

117 Traffic scene, many cars, trucks, road visible

118 Boat/ship, over sea, lake

119 PC webpages, screen of PC visible

120 Airplane

10 Categories for VT2

201 People entering/exiting door/car

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle, looking outside

205 Large camera movement, tracking an object, person, car, etc.

206 Static or minute camera movement, people walking, legs visible

207 Large camera movement, panning left/right, top/down of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

K-Means Clustering K-Means Clustering

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

K-Means K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Clustering Helps Clustering Helps

Pairwise Distance Implementation Pairwise Distance Implementation

0 2000 4000 6000

CUDA

C

)bd(a)bd(a

)bd(a

)bd(a)bd(a)bd(a

nm1m

12

n11111

n1

m1

bbB

aaA

Compute Given

CPU vs GPU CPU vs GPU

Algorithmic properties that map well to GPUs

1 Independent and highly data local

computations

2Compute bound

3Little branch divergence

Pairwise Distance Computation on the GPU Pairwise Distance Computation on the GPU

Shared

Memory

A B

C

Pairwise Distance Computation Pairwise Distance Computation

A B

C

Pairwise Distance Computation Pairwise Distance Computation

A B

C

Timings on TRECVid 2008 System Timings on TRECVid 2008 System

53 79

23

240 150

53 79

3030 4947

1

10

100

1000

10000

Fetch

Frame

Optical

Flow

Transfer to

GPU

Feature

Extraction

Pairwise

Distance

millise

co

nd

s

GPU + CPU CPU

Benchmark Benchmark

Dictionary Building Strategies Dictionary Building Strategies

Dictionary Size

Histogramming Method Rate of Detection

Low Medium High

1000 Raw 0681 0804 0844

Norm 0708 0799 0840

Mt Inf 0594 0804 0848

500 Raw 0675 0792 0833

Norm 0701 0791 0825

Mt Inf 0626 0783 0819

200 Raw 0671 0772 0811

Norm 0701 0779 0818

Mt Inf 0614 0720 0756

Results (2009) Results (2009)

True Positives False

Alarm

Miss Min DCR

Pointing 13 225 1050 1006

Cell To Ear 0 58 194 1060

Person Runs 1 38 106 0997

Object Put 1 190 620 1020

Results (2009) Results (2009)

True Positives False

Alarm

Miss Min DCR

Pointing 13 (57) 225 (2505) 1050 1006

Cell To Ear 0 (8) 58 (4005) 194 1060

Person Runs 1 (0) 38 (314) 106 0997

Object Put 1 (21) 190 (2703) 620 1020

(2008 Results)

Conclusions Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

ContestsEvaluations Experience

Working with realistic data

Engineering Programming

Tight schedule streamlined development

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

ContestsEvaluations Experience

Working with realistic data

Engineering Programming

Tight schedule streamlined development

Examples of Evaluations Examples of Evaluations

Trecvid 2012 Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Trecvid 2012 Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes Pascal Visual Object Classes

Classificationdetection

Segmentation

Person Layout

Action Classification

Classificationdetection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

ImageNet

Large Scale Visual Recognition

10000 Classes

Page 13: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

But low rewards

56 teams from 17 countries
Round 1: 8 teams
Round 2: 7 teams
Round 3: 5 teams
Grand Final in Singapore

No rewards along the way: only one team can win US$100,000

But we have a team with no fears…

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi Vong Xu Mert Dennis Jason Andrey Yuxiao

Let's go over our experience and stories…

Outline

Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)

3 Audio Retrieval Tasks

AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of language
Metric: Mean Average Precision

AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of spoken language

AT3
Query: no queries
Target: extract all recurrent segments that are at least 1 second in length
Metric: F-measure

Data set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3

Xiaodan will talk about this part…
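For reference, the F-measure named as the AT3 metric combines precision and recall. The sketch below is the textbook definition with an assumed balanced weighting (beta = 1); the slides do not say which weighting or matching rule the organizers used, and the counts in the example are made up.

```python
def f_measure(n_correct, n_retrieved, n_relevant, beta=1.0):
    """F-measure from counts: correct detections, detections returned, true segments."""
    if n_correct == 0:
        return 0.0
    precision = n_correct / n_retrieved
    recall = n_correct / n_relevant
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical run: 8 segments returned, 6 of them correct, 12 true recurrent segments.
score = f_measure(n_correct=6, n_retrieved=8, n_relevant=12)
print(round(score, 3))
```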

3 Video Retrieval Tasks

VT1
Query: single image (20 queries)
Target: (short) video segments; all the similar segments
Criteria: "visually similar"
Metric: Mean Average Precision
Data set: 20 categories, multiple labels possible

VT2
Query: short video shot (<10 s) (20 queries)
Target: (long) video segments; all the similar segments
Criteria: perceptually similar
Data set: 10 categories, multiple labels possible

VT3
Query: videos with sound (3~10 s), on the order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data set: 10 (20) categories, including one "others" category

20 VT1 Categories

100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices, including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using computer, both visible
108 Track and field, sports
109 Company trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic scene, many cars, trucks, road visible
118 Boat/ship, over sea, lake
119 PC webpages, screen of PC visible
120 Airplane

10 Categories for VT2

201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object (person, car, etc.)
206 Static or minute camera movement, people walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using computer, both visible
112 Close-up of hand, e.g. using mouse, writing, etc.
116 Face close-up, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total: 32508 pseudo key frames + 8486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288

Computation Power

Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data decompression: 15 minutes
Video format conversion: 2 hours
Video segmentation (for VT2): 40 minutes
Sound track extraction: 30 minutes

Feature extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (MATLAB)
Semantic Feature 1: 24 hours (MATLAB)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (MATLAB)
Motion Feature 2: 3 hours on Trusted-ILLIAC

Classifier training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes

Possible Accelerations for Video

Port MATLAB code to C
Parallel computing
GPU acceleration

Patch-based features
Load time is the major issue
Extract all the features after one load

Features for Round 2: VT1

Image features
SIFT
HOG
GIST
APC
LBP
Color, texture, etc.

Semantic feature

Features for Round 2: VT2

Character detector
Harris corner
Morphological operations

Optical flow
Lucas-Kanade on spatial intensity gradient

Gender recognition
SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on NN
Feature normalization/combination
Result visualization

Observations

1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101, 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
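The query-expansion steps can be sketched as follows. This is only an illustration under stated assumptions: images are plain feature vectors, the matching score is Euclidean distance, and an evaluation image is scored by its distance to the nearest member of the expanded query. The function names and the toy data are hypothetical, and the matching is computed on the fly rather than precomputed as in step 0.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def expand_query(query_vec, category, dev_set):
    """Step 1: add every development image labeled with the query's category."""
    return [query_vec] + [vec for vec, label in dev_set if label == category]

def search(expanded, eval_set, top_k):
    """Step 2: rank evaluation images by distance to the nearest expanded-query image."""
    scored = [(min(euclidean(vec, q) for q in expanded), name)
              for name, vec in eval_set.items()]
    scored.sort()
    return [name for _, name in scored[:top_k]]

# Toy development set: (feature vector, category number).
dev_set = [([0.0, 1.0], 101), ([0.2, 0.9], 101), ([5.0, 5.0], 102)]
# Toy evaluation set: name -> feature vector.
eval_set = {"a": [0.1, 1.0], "b": [4.9, 5.1], "c": [0.3, 0.8]}

expanded = expand_query([0.0, 0.9], 101, dev_set)
top = search(expanded, eval_set, top_k=2)
print(top)
```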

Algorithms

Motivation: use a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pairwise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
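Step 2 can be illustrated with the common means-only form of MAP adaptation. This is a sketch under assumptions the slides do not confirm: diagonal covariances, adaptation of the component means only, and a relevance factor of 16 (a conventional choice, not a value from the talk). All names are hypothetical.

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian at x."""
    log_p = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                for xi, m, v in zip(x, mean, var))
    return math.exp(log_p)

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Means-only MAP adaptation of a UBM to the patches of one image."""
    n_comp, dim = len(means), len(means[0])
    counts = [0.0] * n_comp
    sums = [[0.0] * dim for _ in range(n_comp)]
    for x in patches:
        likes = [w * gaussian_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:
            continue  # patch is numerically outside every component
        for k, like in enumerate(likes):
            post = like / total  # responsibility of component k for this patch
            counts[k] += post
            for d in range(dim):
                sums[k][d] += post * x[d]
    adapted = []
    for k in range(n_comp):
        if counts[k] == 0.0:
            adapted.append(list(means[k]))  # no evidence: keep the UBM mean
            continue
        alpha = counts[k] / (counts[k] + relevance)  # more data, more trust in data
        adapted.append([alpha * sums[k][d] / counts[k] + (1 - alpha) * means[k][d]
                        for d in range(dim)])
    return adapted

# Toy 1-D UBM with two components; all 16 patches sit near the first component.
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0], [10.0]]
ubm_vars = [[1.0], [1.0]]
adapted = map_adapt_means([[1.0]] * 16, ubm_weights, ubm_means, ubm_vars)
print(adapted)
```

In the toy run the first mean moves halfway toward the data (16 soft counts against a relevance factor of 16), while the second mean is essentially unchanged because it receives almost no responsibility.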

VT1 Performance (2nd of 8)

Category: MAP
101 Crowd (>10 people): 0.8419
102 Building with sky as backdrop, clearly visible: 0.977
103 Mobile devices, including handphone/PDA: 0.028
107 Person using computer, both visible: 0.2281
109 Company trademark, including billboard, logo: 0.96
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.4584
113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
115 Food on dishes, plates: 0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side: 0.9783
117 Traffic scene, many cars, trucks, road visible: 0.2901
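The MAP figures reported per category are mean average precision scores. A minimal sketch of the metric follows, assuming the convention that average precision is normalized by the number of relevant items found in the ranked list (optionally truncated at a cutoff R); the challenge's exact definition is not given in the slides, and the relevance lists are made up.

```python
def average_precision(ranked_relevance, cutoff=None):
    """AP of one ranked list; ranked_relevance[i] is True if result i is relevant."""
    if cutoff is not None:
        ranked_relevance = ranked_relevance[:cutoff]
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant result
    return total / hits if hits else 0.0

def mean_average_precision(runs, cutoff=None):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r, cutoff) for r in runs) / len(runs)

# Two hypothetical queries with their ranked relevance judgments.
runs = [[True, False, True], [False, True]]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```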

VT2 Performance (1st of 8)

Category: MAP
202 Talking face with introductory caption: 0.8432
206 Static or minute camera movement, people walking, legs visible: 0.0581
207 Large camera movement, panning left/right, top/down of a scene: 0.7789
208 Movie ending credit: 0.2782
209 Woman monologue (Zhen): 0.9756

Performance of Round 3 (1st of 7)

Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using computer, both visible: 0.7
112 Close-up of hand, e.g. using mouse, writing, etc.: 0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side: 1

Task 3 (AT1 + VT2), R=20
Retrieval target (video): VT2 only / AT1 + VT2
202 Face with introductory caption: 1 / 0.03
209 Woman monologue: 0.35 / 0.1
201 People entering door: NA

We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall

TRECVid

TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop

Tasks
Shot Boundary Detection
Copy Detection
Video Search
Summarization
High Level Feature Extraction
Surveillance Event Detection

Our Task: Surveillance Event Detection

The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million

List of Events
1. Cell to ear
2. Embrace
3. Object put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator

Detection of Opposing Flow Event

[Diagram blocks: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool
Rapid development
Fast execution

Python glue layer
CUDA, C/C++
Integrates libraries

Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)
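The "lazy pull" data flow with per-frame referencing and caching can be pictured with a small sketch. This is not ViVid's API: the `LazyStage` class, its `get` method, and the stand-in decoder and filter are hypothetical names chosen only to show the idea that a stage computes a frame when a downstream consumer asks for it and then reuses the cached result.

```python
class LazyStage:
    """A pipeline stage that computes frame i only when asked, then caches it."""

    def __init__(self, compute, source=None):
        self.compute = compute  # function applied to the upstream result
        self.source = source    # upstream stage, or None for the first stage
        self.cache = {}         # frame index -> computed result
        self.calls = 0          # number of times compute actually ran

    def get(self, frame_index):
        if frame_index not in self.cache:
            # Pull from upstream only now, when a consumer needs this frame.
            upstream = self.source.get(frame_index) if self.source else frame_index
            self.cache[frame_index] = self.compute(upstream)
            self.calls += 1
        return self.cache[frame_index]

# Stand-ins: a "decoder" producing a tiny fake frame and a "filter" averaging it.
decoder = LazyStage(lambda i: [i, i + 1, i + 2])
smooth = LazyStage(lambda frame: sum(frame) / len(frame), source=decoder)

first = smooth.get(5)  # triggers decoding and filtering of frame 5
again = smooth.get(5)  # served from the cache, nothing is recomputed
print(first, decoder.calls, smooth.calls)
```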

Motivations

Most operations are highly local

Applications with real-time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia indexing

Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
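The "~5 days" figure can be checked with a quick calculation. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20  # from the slide
FACEBOOK_PHOTOS = 15e9          # from the slide
FRAMES_PER_SECOND = 30          # assumed frame rate, not stated on the slide

# Hours of video per day -> seconds of video per day -> frames per day.
hours_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24
frames_per_day = hours_per_day * 3600 * FRAMES_PER_SECOND
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:.2e} frames per day, {days_to_match:.1f} days")
```

Under that assumption the uploads amount to about 3.1 billion frames per day, so the Facebook photo count is matched in a little under five days.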

Image/Video Processing
Video decoder
2D/3D convolution
2D/3D Fourier transform
Optical flow

Feature Extraction
Motion descriptor (Efros et al.)
Motion history descriptor
Random video interest points
Histograms of oriented gradients / optical flow

Analysis
Vector quantization
SVM classifier evaluation

ViVid - Video Computer Vision on Graphics Processors

Download: http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Diagram blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label]

Event labels: Running, Pointing, Object Put, Cell To Ear

Video Interest Point Detectors

Laptev
3D Harris corner detector
Responds to: corners

Dollar
Space-time Gabor
Responds to: corners, periodic motion

RSMB
Random Sampling of the Motion Boundary
Responds to: motion

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task
Motion
Shape
Appearance

Computationally intensive
Development
Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\ H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
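The motion history update sets a pixel to τ wherever the binary motion mask D is 1 and otherwise decays the previous value by 1, never going below 0. A minimal pure-Python sketch of one update step follows; the function name and the nested-list image representation are illustrative choices, not the ViVid implementation.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step: H = tau where D == 1, else max(0, H_prev - 1)."""
    return [[tau if d == 1 else max(0, h - 1)
             for h, d in zip(h_row, d_row)]
            for h_row, d_row in zip(prev_mhi, motion_mask)]

# A single-row "image" in which motion sweeps from left to right over three frames.
mhi = [[0, 0, 0]]
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    mhi = update_mhi(mhi, mask, tau=3)
print(mhi)  # [[1, 2, 3]]
```

The most recently moving pixel carries the largest value, so the image encodes both where and how recently motion occurred.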

[Bar chart: milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Bar labels as extracted, assignment and decimal points not recoverable: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions
• Histogram the image gradient / optical flow based on direction and magnitude
• Normalize over neighboring regions
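The three steps can be sketched for a single local region. This is a simplified illustration, not the descriptor used in the system: it assumes hard binning over the full 0 to 2π range (as one would for optical flow vectors), magnitude-weighted votes, and L2 normalization of the concatenated histograms; the bin count and function names are hypothetical.

```python
import math

def cell_histogram(vx, vy, n_bins=8):
    """Magnitude-weighted histogram of vector directions for one local region."""
    hist = [0.0] * n_bins
    for row_x, row_y in zip(vx, vy):
        for dx, dy in zip(row_x, row_y):
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)  # direction in [0, 2*pi)
            bin_index = int(angle / (2 * math.pi) * n_bins) % n_bins
            hist[bin_index] += magnitude
    return hist

def l2_normalize(block, eps=1e-6):
    """Normalize the concatenated histograms of neighboring regions."""
    norm = math.sqrt(sum(v * v for v in block) + eps ** 2)
    return [v / norm for v in block]

# A 2x2 region given as x and y components (gradient or flow).
vx = [[1, 0], [-1, 0]]
vy = [[0, 2], [0, 0]]
hist = cell_histogram(vx, vy)
normalized = l2_normalize(hist)
print(hist)
```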

K-Means Clustering

Vector quantization
Turns high-dimensional features into a discrete number of points
Given data, find representative "centers"

Lloyd's algorithm
For each data point, find the closest center
Update each center to be the mean of the associated data points
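The two alternating steps of Lloyd's algorithm can be written compactly. The sketch below is a plain CPU illustration with squared Euclidean distance, a fixed iteration count, and the choice to leave an empty cluster's center unchanged; those details and the toy data are assumptions, not taken from the slides.

```python
def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))

def lloyd(points, centers, n_iter=10):
    for _ in range(n_iter):
        # Assignment step: each data point picks its closest center.
        labels = [closest(p, centers) for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        centers = new_centers
    return centers, [closest(p, centers) for p in points]

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels = lloyd(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(centers, labels)
```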

K-Means

Relies heavily on pairwise distance

Large data sets
1 million features with 100-200 dimensions
1000 centers

Cannot fit output in GPU memory
Will need to reduce as the computation proceeds
Need an efficient reduction operator
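The memory problem and the "reduce as you go" remedy can be illustrated on the CPU. The sketch processes features in chunks, builds only one block of the distance matrix at a time, and immediately reduces each row to the index of its nearest center. The chunk size, the float32 storage assumption, and the function name are illustrative; a GPU version would fuse the argmin into the distance kernel.

```python
def nearest_center_chunked(features, centers, chunk_size):
    """Assign each feature to its nearest center without storing all distances."""
    assignments = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance block for this chunk only: len(chunk) x len(centers).
        block = [[sum((f - c) ** 2 for f, c in zip(feat, center))
                  for center in centers] for feat in chunk]
        # Reduce each row to one index before moving on to the next chunk.
        assignments.extend(row.index(min(row)) for row in block)
    return assignments

# Size of the full distance matrix for the numbers on the slide,
# assuming 4-byte (float32) distances.
full_matrix_bytes = 1_000_000 * 1000 * 4

features = [[0, 0], [1, 1], [9, 9], [10, 10], [0, 1]]
centers = [[0, 0], [10, 10]]
assignments = nearest_center_chunked(features, centers, chunk_size=2)
print(assignments, full_matrix_bytes)
```

With these sizes the full matrix would take about 4 GB, while the reduced output is only one index per feature.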

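Clustering Helps

Pairwise Distance Implementation

[Bar chart: C vs CUDA implementation, axis 0 to 6000]

Given: $A = [a_1, \dots, a_m]$, $B = [b_1, \dots, b_n]$

Compute:

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & & & \vdots \\
\vdots & & \ddots & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
$$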
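A small reference version of the m x n distance matrix is sketched below. It assumes d is the squared Euclidean distance (the slides do not name the distance) and uses the expansion ||a - b||² = ||a||² + ||b||² - 2 a·b, which reduces most of the work to inner products; the function name is hypothetical.

```python
def pairwise_sq_dist(A, B):
    """m x n matrix whose (i, j) entry is the squared Euclidean d(a_i, b_j)."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(x * x for x in b) for b in B]
    return [[a_norms[i] + b_norms[j] - 2 * sum(x * y for x, y in zip(a, b))
             for j, b in enumerate(B)]
            for i, a in enumerate(A)]

A = [[0.0, 0.0], [3.0, 4.0]]              # m = 2 vectors
B = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]  # n = 3 vectors
D = pairwise_sq_dist(A, B)
print(D)
```

Every entry depends only on one row of A and one row of B, which is the independence the GPU implementation relies on.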

CPU vs GPU

Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram: input matrices A and B staged through shared memory to produce output matrix C]

Timings on TRECVid 2008 System

[Bar chart: milliseconds (log scale, 1 to 10000) for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; series: GPU + CPU vs CPU. Bar labels as extracted, assignment and decimal points not recoverable: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

                                       Rate of Detection
Dictionary Size  Histogramming Method  Low     Medium  High
1000             Raw                   0.681   0.804   0.844
1000             Norm                  0.708   0.799   0.840
1000             Mt Inf                0.594   0.804   0.848
500              Raw                   0.675   0.792   0.833
500              Norm                  0.701   0.791   0.825
500              Mt Inf                0.626   0.783   0.819
200              Raw                   0.671   0.772   0.811
200              Norm                  0.701   0.779   0.818
200              Mt Inf                0.614   0.720   0.756
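The "Raw" and "Norm" histogramming methods can be pictured with a short sketch. It assumes that "Raw" means plain counts of quantized features per dictionary word and that "Norm" means dividing those counts by their sum; the slides do not define either term, the "Mt Inf" variant is not covered here, and the function name is hypothetical.

```python
def bow_histogram(word_ids, dictionary_size, normalize=False):
    """Histogram of dictionary-word indices for one video clip."""
    hist = [0.0] * dictionary_size
    for w in word_ids:
        hist[w] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [h / total for h in hist]  # counts become relative frequencies
    return hist

# Word indices produced by vector-quantizing four interest-point descriptors.
word_ids = [0, 2, 2, 3]
raw = bow_histogram(word_ids, dictionary_size=4)
norm = bow_histogram(word_ids, dictionary_size=4, normalize=True)
print(raw, norm)
```

Normalizing removes the dependence on how many interest points a clip contains, which matters when detection rates vary between clips.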

Results (2009)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

Values in parentheses are the 2008 results.

Conclusions

Some practical problems are very hard to solve
Fusion of many different approaches

Take advantage of all available hardware
Cloud
GPUs

Contests/Evaluations: Experience
Working with realistic data
Engineering, programming
Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection
Segmentation
Person Layout
Action Classification

ImageNet

Large Scale Visual Recognition

10,000 classes

Page 14: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

But we have a team with no fearshellip But we have a team with no fearshellip

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

But we have a team with no fearshellip But we have a team with no fearshellip

Letrsquos go over our experience and

storieshellip

Letrsquos go over our experience and

storieshellip

Outlines Outlines

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks

Task: AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of language
Metric: Mean Average Precision

Task: AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of spoken language

Task: AT3
Query: no queries
Target: extract all recurrent segments which are at least 1 second in length
Metric: F-measure

Data Set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3

Xiaodan will talk about this part…
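The F-measure named for AT3 can be sketched as below. This is a minimal illustration assuming the standard definition (harmonic combination of precision and recall); how the contest matched extracted segments against reference segments is not stated on the slide, so the three counts are taken as given, and the function name and `beta` parameter are illustrative.

```python
def f_measure(num_correct, num_retrieved, num_relevant, beta=1.0):
    """F-measure from raw counts; returns 0.0 when it is undefined.

    num_correct   -- retrieved segments that are truly recurrent
    num_retrieved -- all segments the system returned
    num_relevant  -- all recurrent segments in the reference
    beta          -- weight of recall relative to precision (1.0 gives F1)
    """
    if num_correct == 0 or num_retrieved == 0 or num_relevant == 0:
        return 0.0
    precision = num_correct / num_retrieved
    recall = num_correct / num_relevant
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, 6 correct segments out of 8 returned, with 12 in the reference, gives precision 0.75, recall 0.5 and F1 = 0.6.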

3 Video Retrieval Tasks

Task: VT1
Query: single image; 20 queries (short)
Target: video segments; all the similar segments
Criteria: "visually similar"
Metric: Mean Average Precision
Data Set: 20 categories, multiple labels possible

Task: VT2
Query: short video shot (<10 s); 20 queries (long)
Target: video segments; all the similar segments
Criteria: perceptually similar
Data Set: 10 categories, multiple labels possible

Task: VT3
Query: videos with sound (3~10 s); order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data Set: 10 (20) categories, including one "others" category
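Mean Average Precision, the metric listed for the retrieval tasks, can be sketched as follows. This assumes the common definition (precision averaged at the rank of each relevant hit, then averaged over queries); any cutoff or normalization specific to the contest is not given on the slides, and the function names are illustrative.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Average precision for one query.

    ranked_relevance -- list of booleans, best-ranked result first
    num_relevant     -- total relevant items; defaults to the hits in the list
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    total = num_relevant if num_relevant is not None else hits
    return precision_sum / total if total else 0.0


def mean_average_precision(runs):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)
```

A ranking that places relevant items at ranks 1 and 3 scores (1/1 + 2/3) / 2, roughly 0.83.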

20 VT1 Categories

100 Not-Applicable: None of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices, including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart Overlay, including graphs, text, PowerPoint style
107 Person using Computer, both visible
108 Track and field, sports
109 Company Trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic Scene, many cars, trucks, road visible
118 Boat/Ship, over sea, lake
119 PC Webpages, screen of PC visible
120 Airplane

10 Categories for VT2

201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using Computer, both visible
112 Closeup of hand, e.g. using mouse, writing, etc.
116 Face closeup, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)
The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round2

31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total: 32508 pseudo key frames, 8486 real key frames

Evaluation Video Data of Round3

Video Files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288

Computation Power

Workstations in IFP
10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8G 64-bit CPUs

CSL Cluster
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes

Feature Extraction
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (Matlab)
Semantic Feature 1: 24 hours (Matlab)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (Matlab)
Motion Feature 2: 3 hours on t-Illiac

Classifier Training
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, Matlab)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes

Possible Accelerations for Video

Matlab code to C
Parallel computing
GPU acceleration
Patch-based features
Load time is the major issue
Extracting all the features after one load

Features for Round2-VT1

Image Features
SIFT
HOG
GIST
APC
LBP
Color, Texture, etc.

Semantic Feature

Features for Round2-VT2

Character Detector
Harris corner
Morphological operations

Optical Flow
Lucas-Kanade on spatial intensity gradient

Gender recognition
SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by Query Expansion based on NN
Feature Normalization/Combination
Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101, 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
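The query-expansion steps above can be sketched as a nearest-neighbour search. This is only an illustration: the Euclidean distance, the "distance to the closest expanded-query member" scoring rule, and all names and data layouts are assumptions, since the slides only say the retrieval is "based on NN".

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def expanded_query_search(query_feature, query_category, dev_set, eval_set, top_k):
    """Rank evaluation items against a category-expanded query.

    dev_set  -- list of (feature, category) pairs with known labels
    eval_set -- list of (item_id, feature) pairs to search
    """
    # Step 1: expand the query with all development items of the same category.
    expanded = [query_feature] + [f for f, c in dev_set if c == query_category]

    # Step 2: score each evaluation item by its nearest expanded-query member.
    scored = []
    for item_id, feature in eval_set:
        score = min(euclidean(feature, q) for q in expanded)
        scored.append((score, item_id))

    scored.sort()
    return [item_id for _, item_id in scored[:top_k]]
```

Calling it with `top_k=50` or `top_k=20` would correspond to the two output sizes mentioned on the slide.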

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
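Step 2 (MAP estimation given the UBM) could look like the sketch below. It makes several assumptions not stated on the slide: one-dimensional patch features for brevity, adaptation of the component means only, and a relevance factor of 16, a common convention in GMM-UBM systems. The patch kernel and covariance normalization of step 3 are not covered.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def map_adapt_means(samples, weights, means, variances, relevance=16.0):
    """Adapt UBM component means to one image's patches (means-only MAP)."""
    k = len(means)
    soft_counts = [0.0] * k   # expected number of patches per component
    weighted_sums = [0.0] * k  # posterior-weighted sum of patch values
    for x in samples:
        likes = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:
            continue  # patch is numerically unexplained by every component
        for j in range(k):
            posterior = likes[j] / total
            soft_counts[j] += posterior
            weighted_sums[j] += posterior * x

    adapted = []
    for j in range(k):
        n = soft_counts[j]
        alpha = n / (n + relevance)  # more data pulls further from the UBM
        data_mean = weighted_sums[j] / n if n > 0 else means[j]
        adapted.append(alpha * data_mean + (1 - alpha) * means[j])
    return adapted
```

Components that see few patches stay close to the UBM mean, which keeps the per-image model stable even when an image contributes only a small number of patches.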

VT1 Performance (2 in 8)

Category: MAP
• 101 Crowd (>10 people): 0.8419
• 102 Building with sky as backdrop, clearly visible: 0.977
• 103 Mobile devices, including handphone/PDA: 0.028
• 107 Person using Computer, both visible: 0.2281
• 109 Company Trademark, including billboard, logo: 0.96
• 112 Closeup of hand, e.g. using mouse, writing, etc.: 0.4584
• 113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
• 115 Food on dishes, plates: 0.2285
• 116 Face closeup, occupying about 3/4 of screen, frontal or side: 0.9783
• 117 Traffic Scene, many cars, trucks, road visible: 0.2901

VT2 Performance (1 in 8)

Category: MAP
• 202 Talking face with introductory caption: 0.8432
• 206 Static or minute camera movement, people(s) walking, legs visible: 0.0581
• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789
• 208 Movie ending credit: 0.2782
• 209 Woman monologue Zhen: 0.9756

Performance of Round3 (1 in 7)

Task 2 (VT1)
Target: Estimated MAP (R=20)
101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using Computer, both visible: 0.7
112 Closeup of hand, e.g. using mouse, writing, etc.: 0.527
116 Face closeup, occupying about 3/4 of screen, frontal or side: 1

Task 3 (AT1 + VT2)
Retrieval Target (Video, R=20): VT2 only; AT1 + VT2
202 face with introductory caption: 1; 0.03
209 women monolog: 0.35; 0.1
201 People entering door: NA

We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall

TRECVid

TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop

Shot Boundary Detection
Copy Detection
Video Search
Summarization
High Level Feature Extraction
Surveillance Event Detection

Our Task: Surveillance Event Detection

The Dataset
Surveillance footage from London Gatwick Airport
5 stationary cameras
Training set: 100 hours
Testing set: 44 hours
Frame size: 700x540 px
Dataset size: ~350 GB
Frames: ~12 million

Surveillance Event Detection

List of Events
1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator

Detection of Opposing Flow Event

[Figure: diagram with the labels Regional Averaging, Door Open/Close Information, Event Detections, Thresholding Rule]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool
Rapid development
Fast execution

Python glue layer
CUDA, C/C++
Integrates libraries

Data flow
Lazy pull
Per-frame referencing
Caches (lots of them)

Motivations

Most operations are highly local

Applications with real-time (or faster) performance requirements
Surveillance
Soft biometrics
Multimedia Indexing

Visual Computing is here
Imaging and Photogrammetry
Pattern Recognition and Statistical Learning
Object Detection and Recognition
Dynamic Vision
Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data
20 hours of video uploaded to YouTube every minute
15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
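The "~5 days" figure can be checked with a short calculation. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption (at 25 fps the answer would be closer to 6 days).

```python
HOURS_UPLOADED_PER_MINUTE = 20  # from the slide
FACEBOOK_PHOTOS = 15e9          # from the slide
FPS = 30                        # assumption, not stated on the slide


def days_to_match(photo_count, hours_per_minute, fps):
    """Days of uploads needed for the frame count to equal photo_count."""
    frames_per_minute = hours_per_minute * 3600 * fps
    frames_per_day = frames_per_minute * 60 * 24
    return photo_count / frames_per_day


print(round(days_to_match(FACEBOOK_PHOTOS, HOURS_UPLOADED_PER_MINUTE, FPS), 2))  # 4.82
```

At 30 fps this is about 3.1 billion frames per day, so the 15 billion photos are matched in a little under 5 days.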

Image/Video Processing
Video Decoder
2D/3D Convolution
2D/3D Fourier Transform
Optical Flow

Feature Extraction
Motion Descriptor (Efros et al.)
Motion History Descriptor
Random Video Interest Points
Histograms of Oriented Gradients/Optical Flow

Analysis
Vector Quantization
SVM Classifier Evaluation

ViVid - Video Computer Vision on Graphics Processors

Download: http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Figure: system diagram with the blocks Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector (corners)
Dollar: Space-Time Gabor (corners, periodic motion)
RSMB: Random Sampling of the Motion Boundary (motion)

[Figure panels: Laptev, Dollar, RSMB]

More is Good

[Figure: Interest Point Detection Rates]

Video Features

Descriptors of information relevant to the task
Motion
Shape
Appearance

Computationally intensive
Development
Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
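The motion history update is simple enough to sketch directly. Here D is taken to be a binary motion mask for the current frame (for example, a thresholded frame difference); the list-of-lists image layout and the function name are illustrative choices, not part of ViVid.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step of the motion history image update.

    prev_mhi    -- H at time t-1, as a list of rows
    motion_mask -- D at time t, truthy where motion was detected
    tau         -- history length in frames
    """
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]
```

Pixels with recent motion hold values near tau and fade by one per frame, so the resulting image encodes both where and how recently motion occurred.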

[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin) and CUDA (feature + distance + argmin); data labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients/Optical Flow

• Partition the image window into local regions
• Histogram the Image Gradient/Optical Flow based on the direction and magnitude
• Normalize over neighboring regions
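The three steps can be sketched for a single local region as follows. The choices of 9 unsigned orientation bins, magnitude-weighted voting and L2 block normalization are assumptions borrowed from common HOG practice, not details given on the slide; passing optical-flow components instead of image gradients gives the flow variant.

```python
import math


def cell_orientation_histogram(gx, gy, num_bins=9):
    """Magnitude-weighted histogram of directions for one local region.

    gx, gy -- horizontal and vertical components (gradient or flow), as rows
    """
    hist = [0.0] * num_bins
    bin_width = math.pi / num_bins
    for row_x, row_y in zip(gx, gy):
        for dx, dy in zip(row_x, row_y):
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % math.pi  # unsigned direction in [0, pi)
            index = min(int(angle / bin_width), num_bins - 1)
            hist[index] += magnitude
    return hist


def l2_normalize(block, eps=1e-6):
    """Normalize the concatenated histograms of neighboring regions."""
    norm = math.sqrt(sum(v * v for v in block) + eps * eps)
    return [v / norm for v in block]
```

A full descriptor would concatenate the histograms of adjacent regions into blocks, normalize each block, and stack the results.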

K-Means Clustering

Vector quantization
Turns high dimensional features into a discrete number of points
Given data, find representative "centers"

Lloyd's algorithm
For each data point, find the closest center
Update the center to be the mean of the associated data points
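The two steps of Lloyd's algorithm can be written out as a minimal CPU reference. The random initialization from the data, the fixed iteration count and the function names are illustrative choices; this is a sketch, not the ViVid implementation.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest center.
        assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if no point was assigned
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assignment
```

The assignment step is where nearly all the time goes, since it evaluates every point against every center.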

K-Means

Relies heavily on pairwise distance

Large data sets
1 million features with 100-200 dimensions
1000 centers

Cannot fit output in GPU memory
Will need to reduce as computation proceeds
Need efficient reduction operator
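To see why the output does not fit: 1 million features against 1000 centers is 10^9 distances, or about 4 GB in single precision. The sketch below shows the "reduce as computation proceeds" idea on the CPU, keeping only a running argmin per feature instead of the full distance matrix. The chunk size and names are illustrative, and a GPU version would perform the same reduction per tile.

```python
def nearest_centers(features, centers, chunk_size=1024):
    """Index of the closest center for each feature, computed chunk by chunk.

    Only one running (best index, best distance) pair is kept per feature,
    so memory use does not grow with len(features) * len(centers).
    """
    labels = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        for f in chunk:
            best_j, best_d = -1, float("inf")
            for j, c in enumerate(centers):
                d = sum((a - b) ** 2 for a, b in zip(f, c))
                if d < best_d:  # running argmin, the reduction operator
                    best_j, best_d = j, d
            labels.append(best_j)
    return labels
```

The output is one integer per feature, which for 1 million features is a few megabytes rather than gigabytes.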

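Clustering Helps

Pairwise Distance Implementation

[Figure: bar chart comparing CUDA and C; axis ticks 0, 2000, 4000, 6000]

Given

\[
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
\]

Compute

\[
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & & & \vdots \\
\vdots & & \ddots & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
\]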

CPU vs GPU

Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence

Pairwise Distance Computation on the GPU

[Figure, built up over three slides: diagram with the labels A, B, C and Shared Memory]

Timings on TRECVid 2008 System

[Figure: log-scale bar chart (1 to 10000 milliseconds) comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction and Pairwise Distance; data labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
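The "Raw" and "Norm" rows presumably differ in whether the codeword histogram is normalized. The sketch below shows that distinction under the assumption that "Norm" means dividing by the total count (L1 normalization); the slide does not define the methods, and "Mt Inf" is not covered here. The function name and arguments are illustrative.

```python
def codeword_histogram(codeword_ids, dictionary_size, normalize=False):
    """Histogram of vector-quantized features over the dictionary.

    codeword_ids -- index of the nearest dictionary center for each feature
    normalize    -- if True, divide by the total count (assumed "Norm")
    """
    hist = [0.0] * dictionary_size
    for idx in codeword_ids:
        hist[idx] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [h / total for h in hist]
    return hist
```

A normalized histogram describes the mix of codewords independently of how many interest points fired, while a raw histogram also carries the amount of activity.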

Results (2009)

2008 results are shown in parentheses.

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

Conclusions

Some practical problems are very hard to solve
Fusion of many different approaches

Take advantage of all available hardware
Cloud
GPUs

Contests/Evaluations Experience
Working with realistic data
Engineering, Programming
Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012
Task - Semantic indexing (SIN)
Task - Known-item search (KIS)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)

Pascal Visual Object Classes
Classification/detection
Segmentation
Person Layout
Action Classification

ImageNet
Large Scale Visual Recognition
10,000 Classes

Page 15: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Xiaodan Lyon Paritosh Mark Tom Mandar Sean Jui-Ting Zhen Huazhong Xi

Vong Xu Mert Dennis Jason Andrey Yuxiao

But we have a team with no fearshellip But we have a team with no fearshellip

Letrsquos go over our experience and

storieshellip

Letrsquos go over our experience and

storieshellip

Outlines Outlines

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks 3 Audio Retrieval Tasks Task Query Target Metric Data Set

AT1 IPA sequence segments that contain the

query IPA sequence

regardless of its languages

Mean Average Precision

25 hours

monolingual

database in

round1

13 hours

multilingual

database in

round3

AT2 an utterance spoken

by different speakers all segments that contain the

query wordphrasesentence

regardless of its spoken

languages

AT3 No queries extract all recurrent segments

which are at least 1 second in

length

F-measure

Xiaodan will talk about this parthelliphellip

3 Video Retrieval Tasks 3 Video Retrieval Tasks

Task Query Target Criteria Metric Data Set

VT1 Single

Image

20

queries

(short)

Video

Segs

All the

similar

Segs

ldquovisually

similarrdquo

Mean Average Precision

20 categories

multiple labels

possible

VT2 Short

Video

Shot

(lt10s)

20

queries

(long)

Video

Segs

All the

similar

Segs

Perceptually

Similar

10 categories

multiple labels

possible

VT3 Videos

with

sound

(3~10s)

Order

of 10K

Category

number

learning the

common

visual

characteristics

Classification accuracy 10(20)

categories

including one

ldquoothersrdquo

category

20 VT1 Categories 20 VT1 Categories

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

10 Categories for VT2 10 Categories for VT2

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3 5 Categories for VT3

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

Examples of Images Examples of Images

More samples More samples

Evaluation Video Data of Round2 Evaluation Video Data of Round2

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

Evaluation Video Data of Round3 Evaluation Video Data of Round3

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Computation Powers Computation Powers

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Time Cost for Video Tasks Time Cost for Video Tasks

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Possible Accelerations for Video Possible Accelerations for Video

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature Normalization/Combination

Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly, e.g. VT1 101, 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
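The query-expansion steps can be sketched in a few lines of Python. This is only an illustration under assumptions the slide does not state: features are plain lists of floats, distance is Euclidean, and an evaluation item is scored by its distance to the nearest member of the expanded query. The function names are hypothetical, not the team's code.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def expand_query(query_category, dev_features, dev_labels):
    """Step 1: collect every development-set feature labeled with the query's category."""
    return [f for f, labels in zip(dev_features, dev_labels) if query_category in labels]

def search(expanded_query, eval_features, top_r):
    """Step 2: rank evaluation items by distance to the nearest expanded-query member."""
    scored = []
    for idx, feat in enumerate(eval_features):
        nearest = min(euclidean(feat, q) for q in expanded_query)
        scored.append((nearest, idx))
    scored.sort()
    return [idx for _, idx in scored[:top_r]]

def retrieve(query_feature, query_category, dev_features, dev_labels,
             eval_features, top_r=20):
    """Full pipeline: expand the query, then return indices of the top_r matches."""
    expanded = [query_feature] + expand_query(query_category, dev_features, dev_labels)
    return search(expanded, eval_features, top_r)

# Toy data (made up for illustration): labels are sets because multiple labels are possible.
dev_features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
dev_labels = [{101}, {101, 102}, {103}]
eval_features = [[0.9, 1.1], [5.1, 4.9], [0.1, 0.0], [9.0, 9.0]]
top = retrieve([0.5, 0.5], 101, dev_features, dev_labels, eval_features, top_r=2)
```

Taking the minimum over the expanded set is one possible aggregation; averaging the distances or voting would be equally plausible readings of the slide.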

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP (maximum a posteriori) estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
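The slide does not give the adaptation formula. As a hedged illustration, the mean-only relevance-MAP update commonly used in GMM-UBM systems is shown below; the symbols (the relevance factor r, the posteriors \gamma_t(k)) are assumptions and may differ from what the team actually used. Given a UBM with weights w_k, means \mu_k and covariances \Sigma_k, and the patches x_1, ..., x_T of one image:

```latex
\gamma_t(k) = \frac{w_k \,\mathcal{N}(x_t \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} w_j \,\mathcal{N}(x_t \mid \mu_j, \Sigma_j)},
\qquad
n_k = \sum_{t=1}^{T} \gamma_t(k),
\qquad
E_k = \frac{1}{n_k} \sum_{t=1}^{T} \gamma_t(k)\, x_t

\hat{\mu}_k = \alpha_k E_k + (1 - \alpha_k)\, \mu_k,
\qquad
\alpha_k = \frac{n_k}{n_k + r}
```

Components that see many patches from the image move toward the image's own statistics, while rarely used components stay close to the UBM.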

VT1 Performance (2 in 8)

Category                                                                MAP
101 Crowd (>10 people)                                                  0.8419
102 Building with sky as backdrop, clearly visible                      0.977
103 Mobile devices including handphone/PDA                              0.028
107 Person using Computer, both visible                                 0.2281
109 Company Trademark, including billboard, logo                        0.96
112 Closeup of hand, e.g. using mouse, writing, etc.                    0.4584
113 Business meeting (> 2 people), mostly seated down, table visible    0.0644
115 Food on dishes, plates                                              0.2285
116 Face closeup occupying about 3/4 of screen, frontal or side         0.9783
117 Traffic Scene, many cars, trucks, road visible                      0.2901
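For readers unfamiliar with the metric in the table, here is a minimal sketch of average precision and its mean over queries. It uses the usual textbook definition; the contest's exact normalization (for example the cutoff R or the count of relevant items used as denominator) is not given on the slides, so treat those details as assumptions.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Average precision of one ranked result list.

    ranked_relevance: booleans, True where the item at that rank is relevant.
    num_relevant: total relevant items in the database; if omitted, the number
    of relevant items found in the list is used (an assumption).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denom = num_relevant if num_relevant is not None else hits
    return precision_sum / denom if denom else 0.0

def mean_average_precision(per_query_relevance):
    """Mean of the per-query average precisions."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps) if aps else 0.0
```

For a list whose first and third results are relevant, the average precision is (1/1 + 2/3) / 2, so a single early miss already costs a noticeable amount.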

VT2 Performance (1 in 8)

Category                                                                MAP
202 Talking face with introductory caption                              0.8432
206 Static or minute camera movement, people(s) walking, legs visible   0.0581
207 Large camera movement, panning left/right, top/down of a scene      0.7789
208 Movie ending credit                                                 0.2782
209 Woman monologue                                                     0.9756

Zhen

Performance of Round 3 (1 in 7)

Task 2 (VT1)

Target                                                             Estimated MAP (R=20)
101 Crowd (>10 people)                                             0.64
102 Building with sky as backdrop, clearly visible                 1
107 Person using Computer, both visible                            0.7
112 Closeup of hand, e.g. using mouse, writing, etc.               0.527
116 Face closeup occupying about 3/4 of screen, frontal or side    1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20)          VT2 only    AT1 + VT2
202 Face with introductory caption      1           0.03
209 Woman monologue                     0.35        0.1
201 People entering door                NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object Put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Detection of Opposing Flow Event

[Diagram labels: Regional Averaging, Door Open/Close Information, Event Detections, Thresholding Rule]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
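The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption: not stated on the slide

# Seconds of video uploaded per wall-clock minute, converted to frames,
# then scaled to a full day (60 minutes x 24 hours).
frames_per_minute = HOURS_UPLOADED_PER_MINUTE * 3600 * FRAMES_PER_SECOND
frames_per_day = frames_per_minute * 60 * 24
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:,} frames per day, {days_to_match:.1f} days to match")
```

Under this assumption the uploads amount to roughly 3.1 billion frames per day, so the photo count is matched in a little under 5 days; at 25 frames per second the figure would be closer to 6 days.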

Image / Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid: Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Diagram labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector (corners)

Dollar: Space Time Gabor (corners, periodic motion)

RSMB: Random Sampling of the Motion Boundary (motion)

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al., 2003)

Motion History Images (Bobick & Davis, 2001)

\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
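The motion history update sets a pixel to tau wherever the binary motion mask D is 1 and otherwise decays the previous value by 1, never going below 0. A minimal pure-Python sketch follows; building the mask D by thresholded frame differencing is an assumption, as the slide does not say how D is computed.

```python
def frame_difference_mask(prev_frame, frame, threshold):
    """Binary motion mask D: 1 where the intensity change exceeds threshold.

    Frame differencing is an assumed way of obtaining D, not taken from the slide.
    """
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(prev_frame, frame)
    ]

def update_mhi(prev_mhi, motion_mask, tau):
    """One step of the motion history image recurrence.

    H(x, y, t) = tau                       if D(x, y, t) == 1
               = max(0, H(x, y, t-1) - 1)  otherwise
    """
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]

# Two update steps on a 2x2 image with tau = 3.
mhi = [[0, 0], [0, 0]]
mhi = update_mhi(mhi, [[1, 0], [0, 0]], tau=3)   # motion at the top-left pixel
mhi = update_mhi(mhi, [[0, 1], [0, 0]], tau=3)   # motion moves to the top-right pixel
```

After the two steps the older motion has decayed by one while the newest motion holds the full value tau, which is what makes the image encode recency.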

[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin); bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

Partition the image window into local regions

Histogram the image gradient / optical flow based on the direction and magnitude

Normalize over neighboring regions
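The three steps (partition into cells, histogram by direction weighted by magnitude, normalize over neighboring cells) can be sketched as follows. The cell size, the 9 unsigned orientation bins, and the 2x2 L2 block normalization are assumed parameters chosen for illustration; the slides do not specify them. For the optical flow variant, the per-pixel flow vector would replace (gx, gy).

```python
import math

def hog_cells(image, cell=4, bins=9):
    """Per-cell orientation histograms for a 2D list of grayscale values."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hist = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi      # unsigned orientation in [0, pi)
            b = min(int(angle / math.pi * bins), bins - 1)
            hist[y // cell][x // cell][b] += mag      # vote weighted by magnitude
    return hist

def normalize_blocks(hist, eps=1e-6):
    """L2-normalize each 2x2 block of neighboring cells; returns one vector per block."""
    rows, cols = len(hist), len(hist[0])
    blocks = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = hist[r][c] + hist[r][c + 1] + hist[r + 1][c] + hist[r + 1][c + 1]
            norm = math.sqrt(sum(v * v for v in block) + eps)
            blocks.append([v / norm for v in block])
    return blocks

# An 8x8 image with a vertical edge between columns 3 and 4.
image = [[0] * 4 + [1] * 4 for _ in range(8)]
cells = hog_cells(image)
descriptor = normalize_blocks(cells)
```

Every pixel is processed independently of the others until the histogram accumulation, which is the kind of locality the earlier slides cite as a reason these features map well to GPUs.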

K-Means Clustering

Vector quantization

Turns high dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update the center to be the mean of the associated data points
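The two alternating steps of Lloyd's algorithm fit in a short pure-Python sketch. Initializing the centers by sampling data points, and leaving an empty cluster's center unchanged, are choices made here for illustration and are not taken from the slides.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Assignment step: index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]

def update(points, labels, centers):
    """Update step: move each center to the mean of its assigned points."""
    dim = len(points[0])
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(list(old))   # keep an empty cluster's center unchanged
            continue
        new_centers.append([sum(p[d] for p in members) / len(members)
                            for d in range(dim)])
    return new_centers

def kmeans(points, k, iterations=20, seed=0):
    """Run Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:          # converged
            break
        centers = new_centers
    return centers, assign(points, centers)

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = kmeans(points, k=2)
```

The assignment step is where nearly all the time goes, since it needs the distance from every point to every center.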

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
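To see why the output does not fit: a full distance matrix for 1 million features and 1000 centers has 10^9 entries, about 4 GB in single precision (4 bytes per entry is an assumption about the storage type). The sketch below shows the idea of reducing as the computation proceeds: distances are computed one chunk of points at a time and immediately reduced to the index of the nearest center, so only a chunk-sized block is ever held. It is a CPU illustration of the idea, not the CUDA kernel from the talk.

```python
# Size of the full distance matrix, assuming 4-byte floats.
full_matrix_bytes = 1_000_000 * 1000 * 4

def nearest_center_chunked(points, centers, chunk_size=1024):
    """Nearest-center index per point, holding only chunk_size x K distances at once."""
    labels = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        # Distance block for this chunk only; it is discarded after the reduction.
        block = [[sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centers]
                 for p in chunk]
        # Argmin reduction: one integer per point survives.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return labels

chunk_labels = nearest_center_chunked([[0], [1], [9], [10], [5.1]], [[0], [10]],
                                      chunk_size=2)
```

The result is one integer per point instead of K distances per point, which is what makes the output small enough to keep on the device.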

Clustering Helps

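Pairwise Distance Implementation

Given

\[
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
\]

Compute

\[
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & \cdots & \cdots & d(a_m, b_n)
\end{bmatrix}
\]

[Figure: bar chart comparing the CUDA and C implementations; horizontal axis 0 to 6000]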
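As a concrete instance of the m x n distance matrix, the sketch below takes d to be the squared Euclidean distance (an assumption; the slide leaves d generic) and uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, which turns the bulk of the work into inner products, the same structure as a matrix multiply.

```python
def pairwise_sq_dist(A, B):
    """m x n matrix of squared Euclidean distances between rows of A and rows of B."""
    a_norms = [sum(x * x for x in a) for a in A]   # ||a_i||^2, computed once per row
    b_norms = [sum(x * x for x in b) for b in B]   # ||b_j||^2, computed once per row
    return [
        [an + bn - 2.0 * sum(x * y for x, y in zip(a, b))
         for b, bn in zip(B, b_norms)]
        for a, an in zip(A, a_norms)
    ]

D = pairwise_sq_dist([[0, 0], [3, 4]], [[0, 0], [3, 0], [0, 4]])
```

Each entry depends only on one row of A and one row of B, so all m x n entries can be computed independently.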

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram labels: Shared Memory, A, B, C]

Timings on TRECVid 2008 System

[Figure: bar chart, log-scale axis from 1 to 10000 milliseconds, comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; bar labels as extracted: 53 79, 23, 240 150, 53 79, 3030 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low      Medium   High
1000              Raw                    0.681    0.804    0.844
                  Norm                   0.708    0.799    0.840
                  Mt Inf                 0.594    0.804    0.848
500               Raw                    0.675    0.792    0.833
                  Norm                   0.701    0.791    0.825
                  Mt Inf                 0.626    0.783    0.819
200               Raw                    0.671    0.772    0.811
                  Norm                   0.701    0.779    0.818
                  Mt Inf                 0.614    0.720    0.756

Results (2009)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

(2008 results in parentheses)

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering / Programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10,000 Classes

Let's go over our experience and stories...

Outline

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks

AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of language
Metric: Mean Average Precision

AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of spoken language
Metric: Mean Average Precision

AT3
Query: no queries
Target: extract all recurrent segments which are at least 1 second in length
Metric: F-measure

Data Set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3

Xiaodan will talk about this part...

3 Video Retrieval Tasks

VT1
Query: single image (20 queries)
Target: (short) video segments; all the similar segments
Criteria: "visually similar"
Metric: Mean Average Precision
Data Set: 20 categories, multiple labels possible

VT2
Query: short video shot (<10s) (20 queries)
Target: (long) video segments; all the similar segments
Criteria: perceptually similar
Metric: Mean Average Precision
Data Set: 10 categories, multiple labels possible

VT3
Query: videos with sound (3~10s), order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data Set: 10 (20) categories, including one "others" category

20 VT1 Categories

100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart Overlay, including graphs, text, PowerPoint style
107 Person using Computer, both visible
108 Track and field, sports
109 Company Trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (> 2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic Scene, many cars, trucks, road visible
118 Boat/Ship, over sea, lake
119 PC Webpages, screen of PC visible
120 Airplane

10 Categories for VT2

201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using Computer, both visible
112 Closeup of hand, e.g. using mouse, writing, etc.
116 Face closeup, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames, 8486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)

Key frames for VT1: 10580 jpg files

Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)

Video: 352x288

Computation Power

Workstations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks Time Cost for Video Tasks

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Possible Accelerations for Video Possible Accelerations for Video

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE Grand Unified Feature

Extractor

GUFE Grand Unified Feature

Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Observations Observations 1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

Algorithms Algorithms

Input a query image and its category number

0 Preprocessing compute the matching between the evaluation and the

development data

Query Expansion

1 Expand the query image by retrieving all the images from the development

data set with the same category

2 Search the evaluation set with the expanded query

Output return the top 5020 results

Algorithms Algorithms

Motivation using a GMM to model the distribution of

patches

1 Train a UBM (Universal Background Model) based on

patches from all training images

2 MAP Estimation of the distribution of the patches

belonging to one image given UBM

3 Compute pair-wise image distance based on patch

kernel and within-class covariance normalization

3 Retrieving images based the normalized distance

VT1 Performance (2 in 8) VT1 Performance (2 in 8)

Category MAP

bull101 Crowd (gt10 people) 08419

bull102 Building with sky as backdrop clearly visible 0977

bull103 Mobile devices including handphonePDA 0028

bull107 Person using Computer both visible 02281

bull109 Company Trademark including billboard logo 096

bull112 Closeup of hand eg using mouse writing etc 04584

bull113 Business meeting (gt 2 people) mostly seated down table visible 00644

bull115 Food on dishes plates 02285

bull116 Face closeup occupying about 34 of screen frontal or side 09783

bull117 Traffic Scene many cars trucks road visible 02901

VT2 Performance(1in8) VT2 Performance(1in8)

Category MAP

bull202 Talking face with introductory caption 08432

bull206 Static or minute camera movement people(s)

walking legs visible 00581

bull207 Large camera movement panning leftright

topdown of a scene 07789

bull208 Movie ending credit 02782

bull209 Woman monologue Zhen 09756

Performance of Round3 (1in7) Performance of Round3 (1in7)

Task 2 (VT1)

Target Estimated MAP (R=20)

101 Crowd (gt10 people) 064

102 Building with sky as backdrop clearly visible 1

107 Person using Computer both visible 07

112 Closeup of hand eg using mouse writing etc 0527

116 Face closeup occupying about 34 of screen frontal or side 1

Task3 (AT1 + VT2)

Retrieval Target VT2 only AT1 + VT2

Video R=20

202 face with introductory caption 1 003

209 women monolog 035 01

201 People entering door NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC Text REtrieval Conference TRECVid Video Retrieval Workshop

Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million

Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid Working with ViVid

Why parallel Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days

Image Video Processing

Video Decoder

2D3D Convolution

2D3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients Optical Flow

Analysis Vector Quantization

SVM Classifier Evaluation

ViVid ndash Video Computer Vision on Graphics Processors

Download

httpgithubcommertdikmenViVid

TRECVid 2008 System

TRECVid 2009 System TRECVid 2009 System

Features Video

Motion

Shape

Classifier

Event Label

bull Running

bull Pointing

bull Object Put

bull Cell To Ear

Vector

Quantization

Histogram

Interest Points

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

K-Means Clustering K-Means Clustering

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

K-Means K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Clustering Helps Clustering Helps

Pairwise Distance Implementation Pairwise Distance Implementation

0 2000 4000 6000

CUDA

C

)bd(a)bd(a

)bd(a

)bd(a)bd(a)bd(a

nm1m

12

n11111

n1

m1

bbB

aaA

Compute Given

CPU vs GPU CPU vs GPU

Algorithmic properties that map well to GPUs

1 Independent and highly data local

computations

2Compute bound

3Little branch divergence

Pairwise Distance Computation on the GPU Pairwise Distance Computation on the GPU

Shared

Memory

A B

C

Pairwise Distance Computation Pairwise Distance Computation

A B

C

Pairwise Distance Computation Pairwise Distance Computation

A B

C

Timings on TRECVid 2008 System

[Figure: log-scale bar chart (1 to 10000 milliseconds) of time per stage (Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance) for GPU + CPU versus CPU. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

                                         Rate of Detection
Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
1000              Norm                   0.708   0.799    0.840
1000              Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
500               Norm                   0.701   0.791    0.825
500               Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
200               Norm                   0.701   0.779    0.818
200               Mt Inf                 0.614   0.720    0.756
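As a rough illustration of the "Raw" and "Norm" rows, the sketch below quantizes each descriptor to its nearest dictionary center and histograms the resulting indices. It assumes "Norm" means dividing the counts by their sum, which the slides do not state; the "Mt Inf" variant is not covered. The function name is hypothetical.

```python
def bow_histogram(descriptors, centers, normalize=False):
    """Histogram of nearest-center indices (a bag-of-words vector).

    normalize=False gives raw counts; normalize=True divides by the total
    count (an assumed reading of the "Norm" method).
    """
    hist = [0.0] * len(centers)
    for d in descriptors:
        # Vector quantization: index of the closest dictionary center.
        j = min(
            range(len(centers)),
            key=lambda k: sum((x - y) ** 2 for x, y in zip(d, centers[k])),
        )
        hist[j] += 1.0
    if normalize and descriptors:
        total = sum(hist)
        hist = [h / total for h in hist]
    return hist
```

The dictionary size in the table corresponds to `len(centers)`, so a larger dictionary yields a longer, sparser histogram per video segment.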

Results (2009)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

(2008 results in parentheses)

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering / Programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet Large Scale Visual Recognition

10000 Classes


Outline

Problems of Visual Retrieval

Data

Features

Algorithms

Results (first 3 rounds)

3 Audio Retrieval Tasks

AT1
Query: IPA sequence
Target: segments that contain the query IPA sequence, regardless of language
Metric: Mean Average Precision

AT2
Query: an utterance spoken by different speakers
Target: all segments that contain the query word/phrase/sentence, regardless of spoken language

AT3
Query: no queries
Target: extract all recurrent segments that are at least 1 second in length
Metric: F-measure

Data set: 25 hours monolingual database in round 1; 13 hours multilingual database in round 3

Xiaodan will talk about this part...

3 Video Retrieval Tasks

VT1
Query: single image; 20 queries (short)
Target: video segments; all the similar segments
Criteria: "visually similar"
Metric: Mean Average Precision
Data set: 20 categories, multiple labels possible

VT2
Query: short video shot (<10 s); 20 queries (long)
Target: video segments; all the similar segments
Criteria: perceptually similar
Metric: Mean Average Precision
Data set: 10 categories, multiple labels possible

VT3
Query: videos with sound (3~10 s); on the order of 10K
Target: category number
Criteria: learning the common visual characteristics
Metric: classification accuracy
Data set: 10 (20) categories, including one "others" category

20 VT1 Categories

100 Not Applicable: none of the labels

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

103 Mobile devices, including handphone/PDA

104 Flag

105 Electronic chart, e.g. stock charts, airport departure chart

106 TV chart overlay, including graphs, text, PowerPoint style

107 Person using computer, both visible

108 Track and field, sports

109 Company trademark, including billboard, logo

110 Badminton court, sports

111 Swimming pool, sports

112 Close-up of hand, e.g. using mouse, writing, etc.

113 Business meeting (>2 people), mostly seated down, table visible

114 Natural scene, e.g. mountain, trees, sea, no people

115 Food on dishes, plates

116 Face close-up, occupying about 3/4 of screen, frontal or side

117 Traffic scene, many cars, trucks, road visible

118 Boat/ship, over sea, lake

119 PC webpages, screen of PC visible

120 Airplane

10 Categories for VT2

201 People entering/exiting door/car

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle, looking outside

205 Large camera movement, tracking an object, person, car, etc.

206 Static or minute camera movement, people(s) walking, legs visible

207 Large camera movement, panning left/right, top/down of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

107 Person using computer, both visible

112 Close-up of hand, e.g. using mouse, writing, etc.

116 Face close-up, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given, either as an IPA sequence or as a waveform, and the participants are required to solve 4.

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data that contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames, 8486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)

Key frames for VT1: 10580 jpg files

Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)

Video: 352x288

Computation Power

Workstations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes

Video Format Conversion: 2 hours

Video Segmentation (for VT2): 40 minutes

Sound Track Extraction: 30 minutes

Feature Extraction

Global Feature 2: 2 hours (C)

Global Feature 1: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (MATLAB)

Semantic Feature 1: 24 hours (MATLAB)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (MATLAB)

Motion Feature 2: 3 hours on t-Illiac

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video

MATLAB code to C

Parallel computing

GPU acceleration

Patch-based features

Load time is the major issue

Extracting all the features after one load

Features for Round 2 - VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color, texture, etc.

Semantic Feature

Features for Round 2 - VT2

Character Detector

Harris corner

Morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature Normalization/Combination

Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly, e.g. VT1 101, 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
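The MAP estimation step can be illustrated with the common mean-only adaptation of a diagonal-covariance GMM. This is a sketch under assumptions the slides do not confirm: only the means are adapted, the relevance factor of 16 is a conventional default, and the function name is hypothetical. The patch kernel and covariance normalization steps are not shown.

```python
import math

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Adapt UBM means toward one image's patches (mean-only MAP adaptation).

    weights[k], means[k][d], variances[k][d] describe a diagonal-covariance UBM.
    Returns one adapted mean vector per mixture component.
    """
    K, D = len(weights), len(means[0])
    n = [0.0] * K                            # soft counts per component
    first = [[0.0] * D for _ in range(K)]    # posterior-weighted sums of patches
    for x in patches:
        # Log of (weight * Gaussian density) for each component.
        logp = []
        for k in range(K):
            ll = math.log(weights[k])
            for d in range(D):
                var = variances[k][d]
                ll -= 0.5 * (math.log(2 * math.pi * var) + (x[d] - means[k][d]) ** 2 / var)
            logp.append(ll)
        m = max(logp)                        # subtract the max for numerical stability
        post = [math.exp(l - m) for l in logp]
        s = sum(post)
        for k in range(K):
            g = post[k] / s                  # responsibility of component k for x
            n[k] += g
            for d in range(D):
                first[k][d] += g * x[d]
    # Interpolate between the data mean and the UBM mean, weighted by soft count.
    return [
        [(first[k][d] + relevance * means[k][d]) / (n[k] + relevance) for d in range(D)]
        for k in range(K)
    ]
```

Components that see few patches stay close to the UBM, so every image is described by a vector of the same size even when it has few patches.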

VT1 Performance (2 in 8)

Category                                                              MAP
101 Crowd (>10 people)                                                0.8419
102 Building with sky as backdrop, clearly visible                    0.977
103 Mobile devices, including handphone/PDA                           0.028
107 Person using computer, both visible                               0.2281
109 Company trademark, including billboard, logo                      0.96
112 Close-up of hand, e.g. using mouse, writing, etc.                 0.4584
113 Business meeting (>2 people), mostly seated down, table visible   0.0644
115 Food on dishes, plates                                            0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side     0.9783
117 Traffic scene, many cars, trucks, road visible                    0.2901
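For reference, a sketch of how average precision and its mean over queries are commonly computed. The exact variant used in the evaluation (for example, truncation at a fixed depth R) is not given in the slides, so the normalization by the number of relevant items is an assumption, as are the function names.

```python
def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant item retrieved."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_items) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A ranking that places all relevant items first scores 1.0, and each relevant item pushed down the list lowers the score.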

VT2 Performance (1 in 8)

Category                                                                  MAP
202 Talking face with introductory caption                               0.8432
206 Static or minute camera movement, people(s) walking, legs visible    0.0581
207 Large camera movement, panning left/right, top/down of a scene       0.7789
208 Movie ending credit                                                  0.2782
209 Woman monologue Zhen                                                 0.9756

Performance of Round 3 (1 in 7)

Task 2 (VT1)

Target                                                              Estimated MAP (R=20)
101 Crowd (>10 people)                                              0.64
102 Building with sky as backdrop, clearly visible                  1
107 Person using computer, both visible                             0.7
112 Close-up of hand, e.g. using mouse, writing, etc.               0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side   1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20)       VT2 only   AT1 + VT2
202 face with introductory caption   1          0.03
209 women monolog                    0.35       0.1
201 People entering door             NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Detection of Opposing Flow Event

[Figure: processing diagram with blocks Regional Averaging, Door Open/Close Information, Thresholding Rule, and Event Detections]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates libraries

Data flow

Lazy pull

Per-frame referencing

Caches (lots of them)

Motivations

Most operations are highly local

Applications with real-time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
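The ~5 day figure can be checked with a few lines of arithmetic. The frame rate is not stated on the slide; 30 frames per second is assumed here, and the function name is hypothetical.

```python
def days_to_match(photo_count=15e9, hours_uploaded_per_minute=20, fps=30):
    """Days of uploads needed for video frames to equal the photo count.

    fps=30 is an assumption; the other defaults are the figures on the slide.
    """
    frames_per_minute = hours_uploaded_per_minute * 3600 * fps
    frames_per_day = frames_per_minute * 60 * 24
    return photo_count / frames_per_day
```

With these defaults the result is about 4.8 days, consistent with the slide's estimate; at 25 frames per second it would be closer to 5.8 days.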

Image / Video Processing:

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction:

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis:

Vector Quantization

SVM Classifier Evaluation

ViVid: Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Figure: system diagram with blocks Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, and Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector (corners)

Dollar: Space Time Gabor (corners, periodic motion)

RSMB: Random Sampling of the Motion Boundary (motion)

[Figure: panels labeled Laptev, Dollar, RSMB]

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\ H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
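The Motion History Image update is a per-pixel rule: a pixel where motion is detected (D = 1) is set to τ, and every other pixel decays by one per frame, down to zero. A minimal sketch with frames as nested lists; the function name is hypothetical.

```python
def update_mhi(prev, motion, tau):
    """One Motion History Image step.

    prev:   previous history values H(x, y, t-1), as rows of numbers.
    motion: binary motion mask D(x, y, t), same shape as prev.
    tau:    value assigned to pixels where motion is detected.
    """
    return [
        [tau if m else max(0, h - 1) for h, m in zip(hrow, mrow)]
        for hrow, mrow in zip(prev, motion)
    ]
```

For example, `update_mhi([[0, 3], [1, 0]], [[1, 0], [0, 0]], tau=5)` gives `[[5, 2], [0, 0]]`. Each pixel depends only on its own previous value, so the update is independent across pixels.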

[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251]


Page 18: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

3 Audio Retrieval Tasks 3 Audio Retrieval Tasks Task Query Target Metric Data Set

AT1 IPA sequence segments that contain the

query IPA sequence

regardless of its languages

Mean Average Precision

25 hours

monolingual

database in

round1

13 hours

multilingual

database in

round3

AT2 an utterance spoken

by different speakers all segments that contain the

query wordphrasesentence

regardless of its spoken

languages

AT3 No queries extract all recurrent segments

which are at least 1 second in

length

F-measure

Xiaodan will talk about this parthelliphellip

3 Video Retrieval Tasks 3 Video Retrieval Tasks

Task Query Target Criteria Metric Data Set

VT1 Single

Image

20

queries

(short)

Video

Segs

All the

similar

Segs

ldquovisually

similarrdquo

Mean Average Precision

20 categories

multiple labels

possible

VT2 Short

Video

Shot

(lt10s)

20

queries

(long)

Video

Segs

All the

similar

Segs

Perceptually

Similar

10 categories

multiple labels

possible

VT3 Videos

with

sound

(3~10s)

Order

of 10K

Category

number

learning the

common

visual

characteristics

Classification accuracy 10(20)

categories

including one

ldquoothersrdquo

category

20 VT1 Categories 20 VT1 Categories

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

100 Not-Applicable None of the labels

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

103 Mobile devices including handphonePDA

104 Flag

105 Electronic chart eg stock charts airport departure chart

106 TV chart Overlay including graphs text PowerPoint style

107 Person using Computer both visible

108 Track and field sports

109 Company Trademark including billboard logo

110 Badminton court sports

111 Swimming pool sports

112 Close-up of hand eg using mouse writing etc

113 Business meeting (gt 2 people) mostly seated down table visible

114 Natural scene eg mountain trees sea no people

115 Food on dishes plates

116 Face close-up occupying about 34 of screen frontal or side

117 Traffic Scene many cars trucks road visible

118 BoatShip over sea lake

119 PC Webpages screen of PC visible

120 Airplane

10 Categories for VT2 10 Categories for VT2

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3 5 Categories for VT3

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

Examples of Images Examples of Images

More samples More samples

Evaluation Video Data of Round2 Evaluation Video Data of Round2

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

Evaluation Video Data of Round3 Evaluation Video Data of Round3

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Computation Powers Computation Powers

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data decompression: 15 minutes

Video format conversion: 2 hours

Video segmentation (for VT2): 40 minutes

Sound track extraction: 30 minutes

Feature Extraction

Global Feature 2: 2 hours (C)

Global Feature 1: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (MATLAB)

Semantic Feature 1: 24 hours (MATLAB)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (MATLAB)

Motion Feature 2: 3 hours on t-Illiac

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video

Port MATLAB code to C

Parallel computing

GPU acceleration

Patch-based features

Load time is the major issue

Extract all the features after a single load

Features for Round2-VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color, texture, etc.

Semantic Feature

Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by query expansion based on nearest neighbors (NN)

Feature normalization/combination

Result visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly, e.g. VT1 101, 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
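The query expansion steps above can be sketched roughly as follows. This is only an illustration under assumptions: the slides do not state the distance measure or how the expanded set is scored, so Euclidean distance and "distance to the nearest member of the expanded query" are assumed here, and all function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def expand_query(query_feat, query_category, dev_feats, dev_labels):
    """Return the query plus every development feature sharing its category."""
    expanded = [query_feat]
    for feat, label in zip(dev_feats, dev_labels):
        if label == query_category:
            expanded.append(feat)
    return expanded

def search(expanded, eval_feats, top_k):
    """Rank evaluation items by distance to the nearest expanded-query member."""
    scored = []
    for idx, feat in enumerate(eval_feats):
        best = min(euclidean(feat, q) for q in expanded)
        scored.append((best, idx))
    scored.sort()
    return [idx for _, idx in scored[:top_k]]

# Toy data: two development images of category 101, one of category 102.
dev_feats = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1]]
dev_labels = [101, 102, 101]
eval_feats = [[5.1, 4.9], [0.1, 0.1], [9.0, 9.0], [0.3, 0.0]]

expanded = expand_query([0.0, 0.1], 101, dev_feats, dev_labels)
top = search(expanded, eval_feats, top_k=2)
```

In a real system the features would be the combined descriptors collected by GUFE, and `top_k` would be the 50 or 20 results mentioned on the slide.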

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
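Step 2 can be illustrated with a small sketch of mean-only MAP adaptation. The slides do not say which GMM parameters are adapted or what relevance factor is used, so this assumes the common mean-only form with a relevance factor of 16, and it uses 1-D "patches" to stay short. The function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a 1-D GMM (the UBM) to one image's patches."""
    k = len(means)
    soft_count = [0.0] * k      # expected number of patches per component
    first_moment = [0.0] * k    # sum of responsibility * x per component
    for x in patches:
        likes = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:
            continue  # patch is numerically unexplained by every component
        for j in range(k):
            resp = likes[j] / total
            soft_count[j] += resp
            first_moment[j] += resp * x
    adapted = []
    for j in range(k):
        # alpha interpolates between the image's data mean and the UBM mean.
        alpha = soft_count[j] / (soft_count[j] + relevance)
        data_mean = first_moment[j] / soft_count[j] if soft_count[j] > 0 else means[j]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[j])
    return adapted

# Two-component UBM; one image contributes 16 patches, all near component 0.
adapted = map_adapt_means([1.0] * 16, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

With 16 patches and a relevance factor of 16, the first mean moves halfway from the UBM mean toward the image's data mean, while the component that saw no data stays at its UBM value.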

VT1 Performance (2 in 8)

Category | MAP
101 Crowd (>10 people) | 0.8419
102 Building with sky as backdrop, clearly visible | 0.977
103 Mobile devices including handphone/PDA | 0.028
107 Person using Computer, both visible | 0.2281
109 Company Trademark, including billboard, logo | 0.96
112 Closeup of hand, e.g. using mouse, writing, etc. | 0.4584
113 Business meeting (>2 people), mostly seated down, table visible | 0.0644
115 Food on dishes, plates | 0.2285
116 Face closeup, occupying about 3/4 of screen, frontal or side | 0.9783
117 Traffic Scene, many cars, trucks, road visible | 0.2901
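For reference, a minimal sketch of how a mean average precision figure like those in the table can be computed. The evaluation's exact definition (for example, any cut-off on the ranked list) is not given on the slides, so this is the textbook form, not necessarily the organizers' scoring script.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """AP of one ranked list; ranked_relevance[i] is True if result i is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # precision at each relevant hit
    denom = num_relevant if num_relevant is not None else hits
    return precision_sum / denom if denom else 0.0

def mean_average_precision(runs):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Two toy queries: relevant results at ranks 1 and 3, and at rank 2.
ap_first = average_precision([True, False, True, False])
ap_second = average_precision([False, True])
map_score = mean_average_precision([[True, False, True, False], [False, True]])
```

Pass `num_relevant` when the ranked list is truncated and does not contain every relevant item.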

VT2 Performance (1 in 8)

Category | MAP
202 Talking face with introductory caption | 0.8432
206 Static or minute camera movement, people(s) walking, legs visible | 0.0581
207 Large camera movement, panning left/right, top/down of a scene | 0.7789
208 Movie ending credit | 0.2782
209 Woman monologue Zhen | 0.9756

Performance of Round 3 (1 in 7)

Task 2 (VT1)

Target | Estimated MAP (R=20)
101 Crowd (>10 people) | 0.64
102 Building with sky as backdrop, clearly visible | 1
107 Person using Computer, both visible | 0.7
112 Closeup of hand, e.g. using mouse, writing, etc. | 0.527
116 Face closeup, occupying about 3/4 of screen, frontal or side | 1

Task 3 (AT1 + VT2)

Video, R=20

Retrieval Target | VT2 only | AT1 + VT2
202 face with introductory caption | 1 | 0.03
209 women monolog | 0.35 | 0.1
201 People entering door | NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object Put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Regional Averaging

Door Open/Close Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)
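The data flow bullets (lazy pull, per-frame referencing, caches) can be illustrated with a toy pipeline. This is a sketch of the general idea only: the classes `FrameSource` and `CachedStage` are hypothetical and are not ViVid's actual API.

```python
class FrameSource:
    """Stand-in decoder: produces frame i on demand and counts how often it is asked."""
    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.decode_calls = 0

    def get_frame(self, index):
        self.decode_calls += 1
        return [float(index)] * 4  # a fake 4-pixel frame

class CachedStage:
    """Computes func over upstream frames only when a frame is pulled; caches results."""
    def __init__(self, upstream, func, needs=(0,)):
        self.upstream = upstream
        self.func = func
        self.needs = needs      # relative frame offsets the function reads
        self.cache = {}

    def get_frame(self, index):
        if index not in self.cache:
            inputs = [self.upstream.get_frame(index + off) for off in self.needs]
            self.cache[index] = self.func(*inputs)
        return self.cache[index]

source = FrameSource(100)
half = CachedStage(source, lambda f: [p * 0.5 for p in f])
diff = CachedStage(half, lambda cur, prev: [a - b for a, b in zip(cur, prev)],
                   needs=(0, -1))

d10 = diff.get_frame(10)   # pulls frames 10 and 9 through the chain
d11 = diff.get_frame(11)   # frame 10 comes from the cache; only frame 11 is decoded
```

Nothing is computed until the last stage is asked for a frame, and each stage addresses its inputs by frame index, so overlapping temporal windows reuse cached work.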

Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
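A quick back-of-the-envelope check of the "~5 days" figure, using the two numbers on the slide. The frame rate is not stated, so 30 frames per second is an assumption; at 25 frames per second the answer comes out closer to 6 days.

```python
HOURS_UPLOADED_PER_MINUTE = 20      # from the slide
FACEBOOK_PHOTOS = 15e9              # from the slide
FRAMES_PER_SECOND = 30              # assumption: not stated on the slide

# Seconds of video uploaded per day: (hours/minute) * 3600 s * 60 min * 24 h.
video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
frames_per_day = video_seconds_per_day * FRAMES_PER_SECOND
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(round(days_to_match, 1))
```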

Image/Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid - Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Block diagram with stages labeled: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

[Figure panels: Laptev, Dollar, RSMB]

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al., 2003)

Motion History Images (Bobick & Davis, 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\ H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
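The motion history update can be written directly from the definition. This is a minimal sketch on nested lists: `update_mhi` follows the equation, while `frame_difference_mask` is one assumed way of producing the binary motion indicator D, since the slide does not say how D is computed.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step of the motion history image update.

    prev_mhi:    2-D list holding H_tau(x, y, t-1)
    motion_mask: 2-D list of 0/1 values D(x, y, t)
    tau:         temporal extent of the history, in frames
    """
    return [
        [tau if moving else max(0, old - 1) for old, moving in zip(mhi_row, mask_row)]
        for mhi_row, mask_row in zip(prev_mhi, motion_mask)
    ]

def frame_difference_mask(frame, prev_frame, threshold):
    """Binary motion mask from absolute frame differencing (an assumed choice of D)."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row, prev_row)]
        for row, prev_row in zip(frame, prev_frame)
    ]

# A single "pixel" of motion sweeping left to right over three frames.
mhi = [[0, 0, 0]]
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    mhi = update_mhi(mhi, mask, tau=3)
```

After the three updates the most recent motion holds the value tau and older motion has decayed by one per frame, which is what makes the image encode direction as well as location.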

[Bar chart, milliseconds per frame (axis 0 to 300), comparing C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin); bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

Partition the image window into local regions

Histogram the image gradient / optical flow based on the direction and magnitude

Normalize over neighboring regions
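The three steps above can be sketched as follows for image gradients. The cell size, number of orientation bins, unsigned orientations, and 2x2 block L2 normalization are assumptions chosen for brevity, not parameters taken from the slides. For the optical flow variant, the per-pixel (gx, gy) would be replaced by the flow vector.

```python
import math

def cell_histograms(image, cell=2, bins=4):
    """Magnitude-weighted orientation histogram for each cell x cell region."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi          # unsigned orientation
            b = min(int(angle / math.pi * bins), bins - 1)
            hists[y // cell][x // cell][b] += magnitude
    return hists

def normalize_blocks(hists, eps=1e-6):
    """L2-normalize each 2x2 block of neighboring cells into one descriptor."""
    out = []
    for r in range(len(hists) - 1):
        for c in range(len(hists[0]) - 1):
            block = hists[r][c] + hists[r][c + 1] + hists[r + 1][c] + hists[r + 1][c + 1]
            norm = math.sqrt(sum(v * v for v in block) + eps)
            out.append([v / norm for v in block])
    return out

# 4x4 horizontal intensity ramp: every gradient points along x.
image = [[0, 1, 2, 3] for _ in range(4)]
hists = cell_histograms(image)
descriptors = normalize_blocks(hists)
```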

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update each center to be the mean of the associated data points
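The two alternating steps of Lloyd's algorithm can be sketched in a few lines. Squared Euclidean distance, a fixed iteration count, and caller-supplied initial centers are assumptions made to keep the example short.

```python
def closest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance to point."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))

def lloyd(points, centers, iterations=10):
    """Alternate the assignment and update steps for a fixed number of iterations."""
    centers = [list(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its closest center.
        assignment = [closest_center(p, centers) for p in points]
        # Update step: each center moves to the mean of its points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if no point was assigned to it
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centers, assignment = lloyd(points, [[0.0, 0.0], [10.0, 10.0]])
```

The returned assignment is the vector quantization itself: each feature is replaced by the index of its center.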

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as the computation proceeds

Need efficient reduction operator
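To make the memory argument concrete: with the sizes on the slide, the full distance matrix alone would take about 4 GB if stored as 32-bit floats (the storage type is an assumption). The sketch below shows the "reduce as you go" idea on the CPU: distances are computed one chunk at a time and immediately reduced to an argmin, so the full matrix never exists. On a GPU the same reduction would happen inside or right after the distance kernel.

```python
def distance_matrix_bytes(num_features, num_centers, bytes_per_value=4):
    """Size of the full pairwise distance matrix, assuming 32-bit floats."""
    return num_features * num_centers * bytes_per_value

def assign_in_chunks(features, centers, chunk_size):
    """Nearest-center index per feature, holding only one chunk of distances at a time."""
    assignment = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance block for this chunk only: len(chunk) x len(centers).
        block = [
            [sum((f - c) ** 2 for f, c in zip(feat, center)) for center in centers]
            for feat in chunk
        ]
        # Reduce each row to its argmin right away, then discard the block.
        assignment.extend(row.index(min(row)) for row in block)
    return assignment

full_size = distance_matrix_bytes(1_000_000, 1000)
labels = assign_in_chunks([[0.0], [9.0], [1.0], [8.0], [0.5]], [[0.0], [10.0]], chunk_size=2)
```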

Clustering Helps

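Pairwise Distance Implementation

Given

$$A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]$$

Compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
$$

[Bar chart: CUDA vs. C, axis 0 to 6000]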
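A small reference implementation of the pairwise distance matrix. The slide leaves the distance d unspecified, so squared Euclidean distance is assumed here. It is written with the expansion ||a||^2 + ||b||^2 - 2 a.b, because the cross term is then a matrix product, which is the kind of regular, data-local computation that tends to map well to a GPU.

```python
def pairwise_sq_dists(A, B):
    """m x n matrix of squared Euclidean distances between rows of A and rows of B."""
    a_sq = [sum(x * x for x in a) for a in A]   # ||a_i||^2, computed once per row
    b_sq = [sum(x * x for x in b) for b in B]   # ||b_j||^2, computed once per row
    return [
        [a_sq[i] + b_sq[j] - 2.0 * sum(x * y for x, y in zip(A[i], B[j]))
         for j in range(len(B))]
        for i in range(len(A))
    ]

A = [[0.0, 0.0], [3.0, 4.0]]
B = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
D = pairwise_sq_dists(A, B)
```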

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram sequence with labels: Shared Memory, A, B, C]

Timings on TRECVid 2008 System

[Bar chart, milliseconds on a log scale from 1 to 10000, GPU + CPU vs. CPU, for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; bar labels as extracted: 53 79, 23, 240 150, 53 79, 3030 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size | Histogramming Method | Rate of Detection: Low | Medium | High
1000 | Raw | 0.681 | 0.804 | 0.844
1000 | Norm | 0.708 | 0.799 | 0.840
1000 | Mt Inf | 0.594 | 0.804 | 0.848
500 | Raw | 0.675 | 0.792 | 0.833
500 | Norm | 0.701 | 0.791 | 0.825
500 | Mt Inf | 0.626 | 0.783 | 0.819
200 | Raw | 0.671 | 0.772 | 0.811
200 | Norm | 0.701 | 0.779 | 0.818
200 | Mt Inf | 0.614 | 0.720 | 0.756

Results (2009)

Event | True Positives | False Alarm | Miss | Min DCR
Pointing | 13 (57) | 225 (2505) | 1050 | 1.006
Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060
Person Runs | 1 (0) | 38 (314) | 106 | 0.997
Object Put | 1 (21) | 190 (2703) | 620 | 1.020

Values in parentheses are the 2008 results.

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering, programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet Large Scale Visual Recognition

10,000 classes

3 Video Retrieval Tasks

Task | Query | Target | Criteria | Metric | Data Set
VT1 | Single image; 20 queries (short) | Video segs; all the similar segs | "Visually similar" | Mean Average Precision | 20 categories, multiple labels possible
VT2 | Short video shot (<10 s); 20 queries (long) | Video segs; all the similar segs | Perceptually similar | Mean Average Precision | 10 categories, multiple labels possible
VT3 | Videos with sound (3~10 s); order of 10K | Category number | Learning the common visual characteristics | Classification accuracy | 10 (20) categories, including one "others" category

20 VT1 Categories

100 Not-Applicable: none of the labels
101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
103 Mobile devices including handphone/PDA
104 Flag
105 Electronic chart, e.g. stock charts, airport departure chart
106 TV chart overlay, including graphs, text, PowerPoint style
107 Person using Computer, both visible
108 Track and field, sports
109 Company Trademark, including billboard, logo
110 Badminton court, sports
111 Swimming pool, sports
112 Close-up of hand, e.g. using mouse, writing, etc.
113 Business meeting (>2 people), mostly seated down, table visible
114 Natural scene, e.g. mountain, trees, sea, no people
115 Food on dishes, plates
116 Face close-up, occupying about 3/4 of screen, frontal or side
117 Traffic Scene, many cars, trucks, road visible
118 Boat/Ship, over sea, lake
119 PC Webpages, screen of PC visible
120 Airplane

10 Categories for VT2

201 People entering/exiting door/car
202 Talking face with introductory caption
203 Fingers typing on a keyboard
204 Inside a moving vehicle, looking outside
205 Large camera movement, tracking an object, person, car, etc.
206 Static or minute camera movement, people(s) walking, legs visible
207 Large camera movement, panning left/right, top/down of a scene
208 Movie ending credit
209 Woman monologue
210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)
102 Building with sky as backdrop, clearly visible
107 Person using Computer, both visible
112 Closeup of hand, e.g. using mouse, writing, etc.
116 Face closeup, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)

5 queries will be given, and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given, and the participants are required to solve 2.

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

Examples of Images Examples of Images

More samples More samples

Evaluation Video Data of Round2 Evaluation Video Data of Round2

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

Evaluation Video Data of Round3 Evaluation Video Data of Round3

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Computation Powers Computation Powers

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Time Cost for Video Tasks Time Cost for Video Tasks

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Possible Accelerations for Video Possible Accelerations for Video

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE Grand Unified Feature

Extractor

GUFE Grand Unified Feature

Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Observations Observations 1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

Algorithms Algorithms

Input a query image and its category number

0 Preprocessing compute the matching between the evaluation and the

development data

Query Expansion

1 Expand the query image by retrieving all the images from the development

data set with the same category

2 Search the evaluation set with the expanded query

Output return the top 5020 results

Algorithms Algorithms

Motivation using a GMM to model the distribution of

patches

1 Train a UBM (Universal Background Model) based on

patches from all training images

2 MAP Estimation of the distribution of the patches

belonging to one image given UBM

3 Compute pair-wise image distance based on patch

kernel and within-class covariance normalization

3 Retrieving images based the normalized distance

VT1 Performance (2 in 8) VT1 Performance (2 in 8)

Category MAP

bull101 Crowd (gt10 people) 08419

bull102 Building with sky as backdrop clearly visible 0977

bull103 Mobile devices including handphonePDA 0028

bull107 Person using Computer both visible 02281

bull109 Company Trademark including billboard logo 096

bull112 Closeup of hand eg using mouse writing etc 04584

bull113 Business meeting (gt 2 people) mostly seated down table visible 00644

bull115 Food on dishes plates 02285

bull116 Face closeup occupying about 34 of screen frontal or side 09783

bull117 Traffic Scene many cars trucks road visible 02901

VT2 Performance(1in8) VT2 Performance(1in8)

Category MAP

bull202 Talking face with introductory caption 08432

bull206 Static or minute camera movement people(s)

walking legs visible 00581

bull207 Large camera movement panning leftright

topdown of a scene 07789

bull208 Movie ending credit 02782

bull209 Woman monologue Zhen 09756

Performance of Round3 (1in7) Performance of Round3 (1in7)

Task 2 (VT1)

Target Estimated MAP (R=20)

101 Crowd (gt10 people) 064

102 Building with sky as backdrop clearly visible 1

107 Person using Computer both visible 07

112 Closeup of hand eg using mouse writing etc 0527

116 Face closeup occupying about 34 of screen frontal or side 1

Task3 (AT1 + VT2)

Retrieval Target VT2 only AT1 + VT2

Video R=20

202 face with introductory caption 1 003

209 women monolog 035 01

201 People entering door NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC Text REtrieval Conference TRECVid Video Retrieval Workshop

Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million

Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid Working with ViVid

Why parallel Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days

Image Video Processing

Video Decoder

2D3D Convolution

2D3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients Optical Flow

Analysis Vector Quantization

SVM Classifier Evaluation

ViVid ndash Video Computer Vision on Graphics Processors

Download

httpgithubcommertdikmenViVid

TRECVid 2008 System

TRECVid 2009 System TRECVid 2009 System

Features Video

Motion

Shape

Classifier

Event Label

bull Running

bull Pointing

bull Object Put

bull Cell To Ear

Vector

Quantization

Histogram

Interest Points

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient / optical flow based on the direction and magnitude

• Normalize over neighboring regions
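The three steps above can be sketched for a single region as follows. This is an illustrative sketch, not the ViVid implementation: the bin count, the magnitude weighting, and the L2 normalization over a block of neighboring regions are assumptions, and the function names are hypothetical.

```python
import math

def orientation_histogram(vectors, num_bins=8):
    """Magnitude-weighted histogram of directions for one local region.

    vectors -- iterable of (dx, dy) pairs: image gradients or optical flow
    """
    hist = [0.0] * num_bins
    bin_width = 2 * math.pi / num_bins
    for dx, dy in vectors:
        magnitude = math.hypot(dx, dy)
        angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
        hist[int(angle / bin_width) % num_bins] += magnitude
    return hist

def normalize_block(histograms, eps=1e-9):
    """Concatenate the histograms of neighboring regions and L2-normalize."""
    block = [v for hist in histograms for v in hist]
    norm = math.sqrt(sum(v * v for v in block)) + eps
    return [v / norm for v in block]
```

The same routine serves both descriptors: feed it image gradients for a shape histogram or flow vectors for a motion histogram.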

K-Means Clustering

Vector quantization

Turns high dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update the center to be the mean of the associated data points
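The two alternating steps of Lloyd's algorithm can be sketched as below. This is a small CPU illustration under assumptions the slide does not state: random initialization from the data, a fixed iteration count, and squared Euclidean distance.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Return (centers, assignment) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its closest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment
```

The assignment step is where almost all of the time goes, which is why the pairwise distance computation is the part worth moving to the GPU.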

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
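A rough size estimate shows why the output does not fit: 1 million features against 1000 centers gives 10^9 distances, about 4 GB if each is stored as a 32-bit float (the element size is an assumption). The sketch below shows the idea of reducing as computation proceeds: keep only a running minimum per point instead of storing the full distance table. It is a CPU illustration with hypothetical names, not the CUDA kernel.

```python
def distance_matrix_bytes(num_points, num_centers, bytes_per_value=4):
    """Memory needed to store every point-to-center distance."""
    return num_points * num_centers * bytes_per_value

def nearest_centers(points, centers):
    """Index of the closest center for each point, using a running argmin."""
    labels = []
    for p in points:
        best_index, best_dist = 0, float("inf")
        for j, c in enumerate(centers):
            d = sum((x - y) ** 2 for x, y in zip(p, c))
            if d < best_dist:  # reduce immediately; the distance is then discarded
                best_index, best_dist = j, d
        labels.append(best_index)
    return labels
```

With the reduction fused in, the output shrinks from one value per (point, center) pair to one index per point.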

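Clustering Helps

Pairwise Distance Implementation

[Figure: bar chart comparing the CUDA and C implementations; axis 0 to 6000, units not recoverable]

Given

\[
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
\]

Compute

\[
\begin{bmatrix}
d(a_1, b_1) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & & \vdots \\
\vdots & \ddots & \\
d(a_m, b_1) & \cdots & d(a_m, b_n)
\end{bmatrix}
\]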
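As a reference point for the definition above, the full table can be written directly. This sketch assumes d is the Euclidean distance, which the slide does not specify; any other distance function can be passed in.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_distances(A, B, d=euclidean):
    """Return the m x n table whose (i, j) entry is d(A[i], B[j])."""
    return [[d(a, b) for b in B] for a in A]
```

Every entry depends only on one row of A and one row of B, so all m x n entries can be computed independently.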

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Figures: three diagrams of the inputs A and B and the output C, the first also labeled Shared Memory]

Timings on TRECVid 2008 System

[Figure: log-scale bar chart (1 to 10000 milliseconds) comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance. Bar labels as extracted, decimal points possibly lost: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
                  Norm                   0.708   0.799    0.840
                  Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
                  Norm                   0.701   0.791    0.825
                  Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
                  Norm                   0.701   0.779    0.818
                  Mt Inf                 0.614   0.720    0.756

Results (2009)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

Values in parentheses are the 2008 results.

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware: cloud, GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering / Programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet Large Scale Visual Recognition

10000 Classes


20 VT1 Categories

100 Not-Applicable: None of the labels

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

103 Mobile devices including handphone/PDA

104 Flag

105 Electronic chart, e.g. stock charts, airport departure chart

106 TV chart Overlay, including graphs, text, PowerPoint style

107 Person using Computer, both visible

108 Track and field, sports

109 Company Trademark, including billboard, logo

110 Badminton court, sports

111 Swimming pool, sports

112 Close-up of hand, e.g. using mouse, writing, etc.

113 Business meeting (> 2 people), mostly seated down, table visible

114 Natural scene, e.g. mountain, trees, sea, no people

115 Food on dishes, plates

116 Face close-up, occupying about 3/4 of screen, frontal or side

117 Traffic Scene, many cars, trucks, road visible

118 Boat/Ship, over sea, lake

119 PC Webpages, screen of PC visible

120 Airplane

10 Categories for VT2

201 People entering/exiting door/car

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle, looking outside

205 Large camera movement, tracking an object, person, car, etc.

206 Static or minute camera movement, people(s) walking, legs visible

207 Large camera movement, panning left/right, top/down of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

107 Person using Computer, both visible

112 Close-up of hand, e.g. using mouse, writing, etc.

116 Face close-up, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round 2

31 MPEG videos, ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total: 32508 pseudo key frames, 8486 real key frames

Evaluation Video Data of Round 3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)

Key frames for VT1: 10580 jpg files

Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)

Video: 352x288

Computation Power

Workstations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes

Video Format Conversion: 2 hours

Video Segmentation (for VT2): 40 minutes

Sound Track Extraction: 30 minutes

Feature Extraction

Global Feature 2: 2 hours (C)

Global Feature 1: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (MATLAB)

Semantic Feature 1: 24 hours (MATLAB)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (MATLAB)

Motion Feature 2: 3 hours on t-Illiac

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video

Port MATLAB code to C

Parallel computing

GPU acceleration

Patch-based features

Load time is the major issue

Extract all the features after one load

Features for Round 2 - VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color, Texture, etc.

Semantic Feature

Features for Round 2 - VT2

Character Detector: Harris corner, morphological operations

Optical Flow: Lucas-Kanade on spatial intensity gradient

Gender recognition: SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature Normalization/Combination

Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
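The query expansion steps can be sketched as a nearest-neighbor search. This is an assumption-laden illustration: the data layout, the squared Euclidean distance, and scoring each evaluation item by its distance to the closest member of the expanded query are choices made here, not details given on the slide.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def expand_and_search(query_feature, query_category, dev_set, eval_set, top_k=20):
    """Rank evaluation items against a category-expanded query.

    dev_set  -- list of (feature_vector, category) pairs
    eval_set -- list of (item_id, feature_vector) pairs
    """
    # Step 1: the query plus every development image of the same category.
    expanded = [query_feature] + [f for f, cat in dev_set if cat == query_category]

    # Step 2: score each evaluation item by its nearest expanded-query member.
    def score(feature):
        return min(squared_distance(feature, q) for q in expanded)

    ranked = sorted(eval_set, key=lambda item: score(item[1]))
    return [item_id for item_id, _ in ranked[:top_k]]
```

Setting `top_k` to 50 or 20 matches the two output sizes named on the slide.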

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
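For step 2, one common form of MAP adaptation updates only the component means of the UBM. The slide does not say which variant was used, so the equations below are a standard textbook form given as an assumption. The symbols are introduced here for illustration: x_t are the patches of one image, gamma_t(k) is the posterior of UBM component k for patch x_t, mu_k is the UBM mean, and r is a relevance factor.

```latex
\[
n_k = \sum_t \gamma_t(k), \qquad
E_k = \frac{1}{n_k} \sum_t \gamma_t(k)\, x_t
\]
\[
\hat{\mu}_k = \alpha_k E_k + (1 - \alpha_k)\, \mu_k, \qquad
\alpha_k = \frac{n_k}{n_k + r}
\]
```

Components that see many patches from the image move toward the image statistics, while components with little support stay close to the UBM.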

VT1 Performance (2 in 8)

Category                                                                MAP
101 Crowd (>10 people)                                                  0.8419
102 Building with sky as backdrop, clearly visible                      0.977
103 Mobile devices including handphone/PDA                              0.028
107 Person using Computer, both visible                                 0.2281
109 Company Trademark, including billboard, logo                        0.96
112 Close-up of hand, e.g. using mouse, writing, etc.                   0.4584
113 Business meeting (> 2 people), mostly seated down, table visible    0.0644
115 Food on dishes, plates                                              0.2285
116 Face close-up, occupying about 3/4 of screen, frontal or side       0.9783
117 Traffic Scene, many cars, trucks, road visible                      0.2901

VT2 Performance (1 in 8)

Category                                                               MAP
202 Talking face with introductory caption                             0.8432
206 Static or minute camera movement, people(s) walking, legs visible  0.0581
207 Large camera movement, panning left/right, top/down of a scene     0.7789
208 Movie ending credit                                                0.2782
209 Woman monologue Zhen                                               0.9756

Performance of Round 3 (1 in 7)

Task 2 (VT1)

Target                                                            Estimated MAP (R=20)
101 Crowd (>10 people)                                            0.64
102 Building with sky as backdrop, clearly visible                1
107 Person using Computer, both visible                           0.7
112 Close-up of hand, e.g. using mouse, writing, etc.             0.527
116 Face close-up, occupying about 3/4 of screen, frontal or side 1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20)          VT2 only   AT1 + VT2
202 face with introductory caption      1          0.03
209 Woman monologue                     0.35       0.1
201 People entering door                NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference. TRECVid: Video Retrieval Workshop.

Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object Put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Detection of Opposing Flow Event

[Figure: diagram with labels Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates libraries

Data flow

Lazy pull

Per-frame referencing

Caches (lots of them)
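The data flow ideas listed above (lazy pull, per-frame referencing, caching) can be illustrated with a toy pipeline. The class and method names below are hypothetical and are not the ViVid API; the sketch only shows the pattern of computing a frame's result when a downstream consumer asks for it and caching it by frame index.

```python
class FrameSource:
    """Stands in for a video decoder; frames are looked up by index."""
    def __init__(self, frames):
        self.frames = frames

    def get_frame(self, index):
        return self.frames[index]

class CachedStage:
    """A processing stage that pulls from its upstream stage only on demand."""
    def __init__(self, upstream, function):
        self.upstream = upstream
        self.function = function
        self.cache = {}        # results keyed by frame index
        self.computed = 0      # number of frames actually processed

    def get_frame(self, index):
        if index not in self.cache:
            self.cache[index] = self.function(self.upstream.get_frame(index))
            self.computed += 1
        return self.cache[index]
```

Stages chain naturally: asking the last stage for frame 7 pulls frame 7 through every stage before it, and a second request is served from the cache.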

Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
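The ~5 day figure can be checked with a short calculation. The upload rate and photo count come from the slide; the frame rate is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FRAMES_PER_SECOND = 30           # assumption, not stated on the slide

def days_to_match_facebook(fps=FRAMES_PER_SECOND):
    """Days of YouTube uploads whose frame count equals the Facebook photo count."""
    video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
    frames_per_day = video_seconds_per_day * fps
    return FACEBOOK_PHOTOS / frames_per_day
```

At 30 frames per second this comes to a little under 5 days, and at 25 frames per second to a little under 6, so the slide's estimate holds for common frame rates.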

Image / Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid - Video Computer Vision on Graphics Processors

Download: http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Figure: system pipeline diagram. Labels: Video, Features (Motion, Shape), Interest Points, Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]


Page 21: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

10 Categories for VT2 10 Categories for VT2

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

201 People enteringexiting doorcar

202 Talking face with introductory caption

203 Fingers typing on a keyboard

204 Inside a moving vehicle looking outside

205 Large camera movement tracking an object person car etc

206 Static or minute camera movement people(s) walking legs visible

207 Large camera movement panning leftright topdown of a scene

208 Movie ending credit

209 Woman monologue

210 Sports celebratory hug

5 Categories for VT3 5 Categories for VT3

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

101 Crowd (gt10 people)

102 Building with sky as backdrop clearly visible

107 Person using Computer both visible

112 Closeup of hand eg using mouse writing etc

116 Face closeup occupying about 34 of screen frontal or side

Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

Examples of Images Examples of Images

More samples More samples

Evaluation Video Data of Round2 Evaluation Video Data of Round2

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

Evaluation Video Data of Round3 Evaluation Video Data of Round3

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Computation Powers Computation Powers

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Time Cost for Video Tasks Time Cost for Video Tasks

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Possible Accelerations for Video Possible Accelerations for Video

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature Normalization/Combination

Result Visualization
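Feature normalization and combination can be sketched as below. The slides do not say which normalization GUFE used; per-dimension z-scoring followed by concatenation is an assumption, and `zscore_columns` and `combine` are hypothetical names.

```python
import math

def zscore_columns(rows):
    """Normalize each feature dimension to zero mean and unit variance (assumed scheme)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((r[d] - means[d]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # constant columns use 1.0 to avoid dividing by zero
    return [[(r[d] - means[d]) / stds[d] for d in range(dims)] for r in rows]

def combine(feature_sets):
    """Normalize every feature set, then concatenate them sample by sample."""
    normalized = [zscore_columns(fs) for fs in feature_sets]
    num_samples = len(feature_sets[0])
    return [sum((fs[i] for fs in normalized), []) for i in range(num_samples)]

color = [[0.0, 10.0], [2.0, 10.0]]   # two samples, two dimensions
shape = [[100.0], [300.0]]           # two samples, one dimension
combined = combine([color, shape])
```

Normalizing before concatenation keeps a feature with a large numeric range (here `shape`) from dominating distances computed on the combined vector.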

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly

e.g. VT1 101 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
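The query expansion steps can be sketched as follows. This is an illustration under assumptions: features are plain vectors, matching uses squared Euclidean distance, and an evaluation item is scored by its distance to the nearest member of the expanded query. The precomputed matching of step 0 is replaced here by on-the-fly distances, and `expand_and_search` is a hypothetical name.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def expand_and_search(query, category, dev_set, eval_set, top_k):
    """dev_set: list of (feature, category); eval_set: list of (item_id, feature)."""
    # Step 1: expand the query with all development images of the same category.
    expanded = [query] + [f for f, c in dev_set if c == category]
    # Step 2: score each evaluation item by its distance to the nearest expanded query.
    scored = [(min(sq_dist(f, q) for q in expanded), item_id) for item_id, f in eval_set]
    scored.sort()
    return [item_id for _, item_id in scored[:top_k]]

dev = [([0.0, 0.0], 101), ([5.0, 5.0], 101), ([9.0, 0.0], 102)]
evals = [("a", [5.1, 5.0]), ("b", [9.0, 0.5]), ("c", [0.2, 0.1]), ("d", [3.0, 3.0])]
top = expand_and_search([0.1, 0.0], 101, dev, evals, top_k=2)
```

Item "a" is far from the original query but close to a development image of category 101, so expansion is what brings it to the top of the list.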

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
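For reference, one widely used form of the MAP step adapts only the component means of the UBM. Whether the system used exactly this variant is not stated in the slides; the relevance factor r and the symbols below are assumptions introduced for illustration. With UBM weights w_k, means \mu_k, covariances \Sigma_k, and patches x_1, ..., x_N from one image:

```latex
\gamma_k(x_i) = \frac{w_k \,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                     {\sum_{j} w_j \,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)},
\qquad
n_k = \sum_{i=1}^{N} \gamma_k(x_i),
\qquad
E_k = \frac{1}{n_k} \sum_{i=1}^{N} \gamma_k(x_i)\, x_i

\hat{\mu}_k = \alpha_k E_k + (1 - \alpha_k)\,\mu_k,
\qquad
\alpha_k = \frac{n_k}{n_k + r}
```

Components that see many patches from the image (large n_k) move toward the image statistics E_k, while rarely used components stay close to the UBM mean.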

VT1 Performance (2 in 8)

Category                                                              MAP
101 Crowd (>10 people)                                                0.8419
102 Building with sky as backdrop, clearly visible                    0.977
103 Mobile devices including handphone/PDA                            0.028
107 Person using Computer, both visible                               0.2281
109 Company Trademark, including billboard, logo                      0.96
112 Closeup of hand, e.g. using mouse, writing, etc.                  0.4584
113 Business meeting (>2 people), mostly seated down, table visible   0.0644
115 Food on dishes, plates                                            0.2285
116 Face closeup, occupying about 3/4 of screen, frontal or side      0.9783
117 Traffic Scene, many cars, trucks, road visible                    0.2901
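Assuming MAP in these tables stands for mean average precision, the metric can be sketched as below. The normalization (dividing by the number of relevant items retrieved rather than by all relevant items in the collection) is an assumption; evaluations differ on this point, and the function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: list of booleans, best-ranked result first."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant result
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs):
    """Average the per-query average precision over several queries."""
    return sum(average_precision(r) for r in runs) / len(runs)

ap = average_precision([True, False, True, False])
map_score = mean_average_precision([[True], [False, True]])
```

For the ranked list relevant, irrelevant, relevant, irrelevant, the precisions at the two hits are 1/1 and 2/3, so the average precision is 5/6.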

VT2 Performance (1 in 8)

Category                                                              MAP
202 Talking face with introductory caption                            0.8432
206 Static or minute camera movement, people(s) walking, legs visible 0.0581
207 Large camera movement, panning left/right, top/down of a scene    0.7789
208 Movie ending credit                                               0.2782
209 Woman monologue Zhen                                              0.9756

Performance of Round3 (1 in 7)

Task 2 (VT1)

Target                                                             Estimated MAP (R=20)
101 Crowd (>10 people)                                             0.64
102 Building with sky as backdrop, clearly visible                 1
107 Person using Computer, both visible                            0.7
112 Closeup of hand, e.g. using mouse, writing, etc.               0.527
116 Face closeup, occupying about 3/4 of screen, frontal or side   1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20)        VT2 only   AT1 + VT2
202 Face with introductory caption    1          0.03
209 Women monolog                     0.35       0.1
201 People entering door              NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object Put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Regional Averaging

Door Open/Close Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations

Most operations are highly local

Applications with real-time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
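The "~5 days" figure can be checked with a few lines of arithmetic. The frame rate is not given on the slide; 30 frames per second is an assumption here.

```python
hours_uploaded_per_minute = 20   # from the slide
facebook_photos = 15e9           # from the slide
fps = 30                         # assumed frame rate; not stated on the slide

# Hours of video per day, converted to seconds, then to frames.
frames_per_day = hours_uploaded_per_minute * 60 * 24 * 3600 * fps
days_to_match = facebook_photos / frames_per_day
```

Under this assumption YouTube receives about 3.1 billion frames per day, so it takes a little under five days to match 15 billion photos.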

Image / Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid: Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

Video

Features

Motion

Shape

Classifier

Event Label

• Running

• Pointing

• Object Put

• Cell To Ear

Vector Quantization

Histogram

Interest Points

Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al., 2003)

Motion History Images (Bobick & Davis, 2001)

\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
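The motion history update, H_τ(x, y, t) = τ where the motion mask D(x, y, t) is 1 and max(0, H_τ(x, y, t-1) - 1) elsewhere, can be sketched in a few lines. Frames are plain nested lists here for illustration, and `update_mhi` is a hypothetical name.

```python
def update_mhi(prev_h, motion_mask, tau):
    """One time step of the motion history image update."""
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_h, motion_mask)
    ]

# A single moving pixel sweeping left to right across a 1x3 image.
h = [[0, 0, 0]]
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    h = update_mhi(h, mask, tau=3)
```

After the sweep the image holds 1, 2, 3 from left to right: more recent motion is brighter, so the intensity gradient encodes the direction of movement.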

[Figure: bar chart of milliseconds per frame (axis 0 to 300) comparing C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin); bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient / optical flow based on the direction and magnitude

• Normalize over neighboring regions
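The histogramming step can be sketched for a single local region as below. The choices here are assumptions for illustration: central-difference gradients, 9 unsigned orientation bins, and L2 normalization. In a full descriptor the normalization would run over the concatenated histograms of neighboring regions rather than one region, and `cell_histogram` and `l2_normalize` are hypothetical names.

```python
import math

def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one region."""
    hist = [0.0] * bins
    rows, cols = len(cell), len(cell[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # central differences
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned orientation
            hist[min(int(angle / (180.0 / bins)), bins - 1)] += magnitude
    return hist

def l2_normalize(vector, eps=1e-6):
    norm = math.sqrt(sum(v * v for v in vector) + eps * eps)
    return [v / norm for v in vector]

# Intensity increases left to right, so every gradient points along x (bin 0).
cell = [[0, 1, 2, 3]] * 4
hist = cell_histogram(cell)
descriptor = l2_normalize(hist)
```

The same routine applies to optical flow by replacing (gx, gy) with the flow vector at each pixel.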

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update the center to be the mean of the associated data points
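The two alternating steps of Lloyd's algorithm can be sketched as follows. Initial centers are supplied by the caller and the iteration count is fixed; both are simplifications for illustration, and `lloyd` is a hypothetical name.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, initial_centers, iterations=10):
    centers = [list(c) for c in initial_centers]   # copy so the caller's list is untouched
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for each point.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:   # an empty cluster keeps its previous center
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
centers, labels = lloyd(points, [[0, 0], [10, 0]])
```

The assignment step is exactly a pairwise distance computation followed by an argmin, which is why that kernel dominates the running time.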

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
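A rough size estimate shows why the full distance output is a problem, and a running argmin shows what reducing during the computation means. Single-precision (4-byte) distances are an assumption, and `nearest_center` is a hypothetical name; this is plain Python, not the GPU kernel.

```python
num_features, num_centers, bytes_per_float = 1_000_000, 1000, 4   # 4 bytes assumed
full_matrix_bytes = num_features * num_centers * bytes_per_float  # 4e9 bytes, about 4 GB

def nearest_center(point, centers):
    """Running argmin: keeps one best distance instead of a full row of the matrix."""
    best_index, best_dist = -1, float("inf")
    for k, center in enumerate(centers):
        d = sum((x - y) ** 2 for x, y in zip(point, center))
        if d < best_dist:
            best_index, best_dist = k, d
    return best_index

labels = [nearest_center(p, [[0.0, 0.0], [4.0, 4.0]]) for p in [[1.0, 0.0], [3.0, 5.0]]]
```

Only one label per feature is stored, so the output shrinks from a million-by-thousand matrix to a million integers.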

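Clustering Helps

Pairwise Distance Implementation

[Figure: bar chart comparing CUDA and C (axis 0 to 6000)]

Given

\[
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
\]

Compute

\[
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
\vdots      &             & \ddots & \vdots      \\
d(a_m, b_1) & \cdots      &        & d(a_m, b_n)
\end{bmatrix}
\]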
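The matrix of distances d(a_i, b_j) between every a_i in A and every b_j in B can be sketched as below. The slide leaves d unspecified; squared Euclidean distance is an assumption, expanded as ||a||^2 + ||b||^2 - 2 a·b so that the inner loop is a dot product, the same access pattern as a matrix multiply. `pairwise_sq_dist` is a hypothetical name.

```python
def pairwise_sq_dist(A, B):
    """Return the m x n matrix D with D[i][j] = ||A[i] - B[j]||^2."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(x * x for x in b) for b in B]
    return [
        [a_norms[i] + b_norms[j] - 2 * sum(x * y for x, y in zip(A[i], B[j]))
         for j in range(len(B))]
        for i in range(len(A))
    ]

D = pairwise_sq_dist([[0, 0], [1, 1]], [[1, 0], [0, 2], [3, 4]])
```

Every entry of D is independent of the others, which is what makes the computation a good fit for a GPU.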

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Figure sequence: input matrices A and B, output matrix C, and shared memory]

Timings on TRECVid 2008 System

[Figure: bar chart of milliseconds on a log axis (1 to 10000), GPU + CPU versus CPU, for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low      Medium   High
1000              Raw                    0.681    0.804    0.844
                  Norm                   0.708    0.799    0.840
                  Mt Inf                 0.594    0.804    0.848
500               Raw                    0.675    0.792    0.833
                  Norm                   0.701    0.791    0.825
                  Mt Inf                 0.626    0.783    0.819
200               Raw                    0.671    0.772    0.811
                  Norm                   0.701    0.779    0.818
                  Mt Inf                 0.614    0.720    0.756
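The "Raw" and "Norm" rows can be illustrated with a bag-of-words histogram over dictionary indices. Reading "Norm" as L1 normalization (counts divided by their sum) is an assumption, the "Mt Inf" variant is not covered, and `bow_histogram` is a hypothetical name.

```python
def bow_histogram(word_ids, dictionary_size, normalize=False):
    """Count how often each dictionary entry occurs; optionally scale counts to sum to 1."""
    hist = [0.0] * dictionary_size
    for w in word_ids:
        hist[w] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [h / total for h in hist]
    return hist

# Four quantized interest points from one video clip, dictionary of size 4.
raw = bow_histogram([0, 2, 2, 3], dictionary_size=4)
norm = bow_histogram([0, 2, 2, 3], dictionary_size=4, normalize=True)
```

The normalized form does not depend on how many interest points a clip produced, which may explain why it behaves differently from the raw counts at low detection rates.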

Results (2009)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

Values in parentheses are the 2008 results.

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering / Programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10,000 classes


5 Categories for VT3

101 Crowd (>10 people)

102 Building with sky as backdrop, clearly visible

107 Person using Computer, both visible

112 Closeup of hand, e.g. using mouse, writing, etc.

116 Face closeup, occupying about 3/4 of screen, frontal or side

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4.

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Examples of Images

More samples

Evaluation Video Data of Round2

31 MPEG videos, ~20 hours

17,289 frames for VT1 in total

40,994 frames for VT2 in total

32,508 pseudo key frames, 8,486 real key frames

Evaluation Video Data of Round3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)

Key frames for VT1: 10,580 jpg files

Key frames for VT2: 64,546 files in total, including 10,580 jpg files (true key frames) + 53,966 jpg files (pseudo key frames)

Video: 352x288

Computation Power

Work Stations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8G 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression               15 minutes
Video Format Conversion          2 hours
Video Segmentation (for VT2)     40 minutes
Sound Track Extraction           30 minutes

Feature Extraction
Global Feature 2                 2 hours (C)
Global Feature 1                 2 hours (C)
Patch-based Feature 1            2 hours (C)
Patch-based Feature 2            5 hours (MATLAB)
Semantic Feature 1               24 hours (MATLAB)
Semantic Feature 2               3 hours (C)
Semantic Feature 3               4 hours (C)
Motion Feature 1                 24 hours (MATLAB)
Motion Feature 2                 3 hours (on t-Illiac)

Classifier Training
Classifier 1                     1 hour (on IFP cluster, 25 CPUs, MATLAB)
Classifier 2                     20 minutes
Classifier 3                     less than 10 minutes


Page 23: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Video+Audio Tasks in Round 3 Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

1) Audio search (AT1 or AT2)

5 queries will be given either in the form of IPA sequence or waveform and the participants are required to solve 4

2) Video search (VT1)

5 queries will be given and the participants are required to solve 4

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequencewaveform and video category The participants are required to retrieve segments of data which contains sound and video corresponding to the given IPA sequencewaveform and video category respectively 3 queries will be given and the participants are required to solve 2

Examples of Images Examples of Images

More samples More samples

Evaluation Video Data of Round2

31 MPEG videos, ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames, 8486 real key frames

Evaluation Video Data of Round3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)

Key frames for VT1: 10580 jpg files

Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)

Video: 352x288

Computation Powers

Work Stations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes

Video Format Conversion: 2 hours

Video Segmentation (for VT2): 40 minutes

Sound Track Extraction: 30 minutes

Feature Extraction

Global Feature 2: 2 hours (C)

Global Feature 1: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (MATLAB)

Semantic Feature 1: 24 hours (MATLAB)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (MATLAB)

Motion Feature 2: 3 hours on t-Illiac

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video

MATLAB code to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2-VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color, texture, etc.

Semantic Feature

Features for Round2-VT2

Character Detector

Harris corner

Morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature Normalization/Combination

Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly

e.g., VT1 101, 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
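A minimal sketch of the query-expansion steps above, assuming each image is already represented by a fixed-length feature vector and that "search" means nearest-neighbor ranking under squared Euclidean distance. The function and argument names are hypothetical, and the preprocessing step (matching evaluation and development data) is not modeled here.

```python
import numpy as np

def expand_and_search(query_feat, query_cat, dev_feats, dev_cats, eval_feats, top_k=20):
    """Rank evaluation items by distance to the nearest member of the expanded query."""
    # Step 1: the expanded query is the query itself plus every development
    # sample that carries the same category label.
    same_cat = dev_feats[dev_cats == query_cat]
    expanded = np.vstack([query_feat[None, :], same_cat])

    # Step 2: squared Euclidean distance from every evaluation item to every
    # expanded-query member; keep the nearest-neighbor distance per item.
    diff = eval_feats[:, None, :] - expanded[None, :, :]
    nn_dist = (diff ** 2).sum(axis=2).min(axis=1)

    # Output: indices of the top_k closest evaluation items.
    return np.argsort(nn_dist, kind="stable")[:top_k]
```

Taking the minimum over the expanded set lets any same-category development image vouch for an evaluation item; averaging the distances instead would be an equally plausible reading of the slide.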

Algorithms

Motivation: use a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pairwise image distances based on a patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
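The MAP step can be illustrated with a short sketch. It assumes a diagonal-covariance UBM and adapts only the component means using a relevance factor, a common choice in GMM-UBM systems; the slide does not say which parameters were adapted, so the function name, the `relevance` parameter, and the mean-only restriction are all assumptions.

```python
import numpy as np

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (the UBM) to one image."""
    # Log-density of each patch under each diagonal Gaussian component.
    diff = patches[:, None, :] - means[None, :, :]              # (T, K, D)
    log_gauss = -0.5 * (
        (diff ** 2 / variances[None, :, :]).sum(axis=2)
        + np.log(2.0 * np.pi * variances).sum(axis=1)[None, :]
    )                                                            # (T, K)
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)              # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                      # responsibilities

    # Sufficient statistics: soft counts and first-order sums per component.
    n_k = post.sum(axis=0)                                       # (K,)
    first_order = post.T @ patches                               # (K, D)

    # Interpolate between the per-image data mean and the UBM mean.
    alpha = n_k / (n_k + relevance)
    data_mean = first_order / np.maximum(n_k, 1e-12)[:, None]
    return alpha[:, None] * data_mean + (1.0 - alpha)[:, None] * means
```

Components that see few patches from the image stay close to the UBM, so every image ends up with a model of the same size that can be compared component by component.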

VT1 Performance (2 in 8)

Category: MAP

• 101 Crowd (>10 people): 0.8419

• 102 Building with sky as backdrop, clearly visible: 0.977

• 103 Mobile devices including handphone/PDA: 0.028

• 107 Person using Computer, both visible: 0.2281

• 109 Company Trademark, including billboard, logo: 0.96

• 112 Closeup of hand, e.g., using mouse, writing, etc.: 0.4584

• 113 Business meeting (> 2 people), mostly seated down, table visible: 0.0644

• 115 Food on dishes, plates: 0.2285

• 116 Face closeup, occupying about 3/4 of screen, frontal or side: 0.9783

• 117 Traffic Scene, many cars, trucks, road visible: 0.2901
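For reference, one common way to compute the per-category score in such a table is non-interpolated average precision over a ranked result list. The sketch below uses that definition as an assumption, since the evaluation's exact formula is not given in the slides; the function names are hypothetical.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Non-interpolated average precision of a ranked list of True/False judgments."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    # Normalize by the number of relevant items (those retrieved, if not given).
    denom = total_relevant if total_relevant is not None else hits
    return precision_sum / denom if denom else 0.0

def mean_average_precision(per_query_relevance):
    """Mean of the per-query average precisions."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps) if aps else 0.0
```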

VT2 Performance (1 in 8)

Category: MAP

• 202 Talking face with introductory caption: 0.8432

• 206 Static or minute camera movement, people(s) walking, legs visible: 0.0581

• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789

• 208 Movie ending credit: 0.2782

• 209 Woman monologue Zhen: 0.9756

Performance of Round3 (1 in 7)

Task 2 (VT1)

Target: Estimated MAP (R=20)

101 Crowd (>10 people): 0.64

102 Building with sky as backdrop, clearly visible: 1

107 Person using Computer, both visible: 0.7

112 Closeup of hand, e.g., using mouse, writing, etc.: 0.527

116 Face closeup, occupying about 3/4 of screen, frontal or side: 1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20) | VT2 only | AT1 + VT2

202 Face with introductory caption | 1 | 0.03

209 Women monolog | 0.35 | 0.1

201 People entering door | NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object Put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Regional Averaging

Door Open/Close Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
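The ~5 day figure can be checked with a quick calculation. The frame rate is not stated on the slide, so the 30 frames per second used here is an assumption, as is the helper name.

```python
def days_to_match(photo_count, upload_hours_per_minute, fps=30):
    """Days of video uploads needed for the frame count to equal photo_count."""
    seconds_of_video_per_minute = upload_hours_per_minute * 3600
    frames_per_day = seconds_of_video_per_minute * fps * 60 * 24
    return photo_count / frames_per_day

# 20 hours of video per minute against 15 billion photos.
print(days_to_match(15e9, 20))
```

At 25 frames per second the same calculation gives a little under 6 days, so the slide's estimate is not very sensitive to the assumed frame rate.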

Image/Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid: Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Figure: system block diagram with the blocks Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, and Event Label (Running, Pointing, Object Put, Cell To Ear)]
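The vector quantization and histogram blocks of this system can be sketched as a bag-of-words step: each local descriptor is mapped to its nearest codebook entry, and the counts form the clip-level feature passed to the classifier. The function name and the squared Euclidean distance are assumptions for illustration.

```python
import numpy as np

def bag_of_words_histogram(descriptors, codebook, normalize=True):
    """Quantize descriptors against a codebook and count codeword occurrences."""
    # Squared distance from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                      # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    if normalize and hist.sum() > 0:
        hist /= hist.sum()                         # unit-sum histogram
    return hist
```

With `normalize=False` the function returns raw counts, which keeps clips with many interest points distinguishable from sparse ones.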

Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

[Figure: panels labeled Laptev, Dollar, RSMB]

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al., 2003)

Motion History Images (Bobick & Davis, 2001)

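$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\, H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$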
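A minimal sketch of the motion history update: pixels where the motion indicator D fires are set to τ, and all other pixels decay by one per frame down to zero. The function name is hypothetical, and the binary motion mask is assumed to be computed elsewhere (for example by thresholding a frame difference).

```python
import numpy as np

def update_mhi(prev_mhi, motion_mask, tau):
    """One motion-history update: set moving pixels to tau, decay the rest."""
    decayed = np.maximum(prev_mhi - 1, 0)          # max(0, H(t-1) - 1)
    return np.where(motion_mask, tau, decayed)     # tau wherever D(x, y, t) = 1
```

Every pixel is updated independently of its neighbors, which is the kind of local operation that the slides argue maps well to a GPU.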

[Figure: bar chart of milliseconds per frame (axis 0 to 300) for CUDA (feature + distance + argmin), CUDA (distance + argmin), CUDA (distance), and C. Bar labels as extracted: 133, 438, 51, 251.]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient/optical flow based on the direction and magnitude

• Normalize over neighboring regions
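The three steps above can be sketched as follows. The inputs `dx` and `dy` are the horizontal and vertical components of either the image gradient or the optical flow; the cell size, the number of orientation bins, signed orientations, and the 3x3 normalization neighborhood are illustrative assumptions rather than the settings used in the lecture.

```python
import numpy as np

def cell_orientation_histograms(dx, dy, cell=8, bins=8):
    """Magnitude-weighted orientation histogram for each cell of a window."""
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx) % (2 * np.pi)                     # direction in [0, 2*pi)
    bin_idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    rows, cols = dx.shape[0] // cell, dx.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(
                bin_idx[sl].ravel(), weights=mag[sl].ravel(), minlength=bins
            )
    return hist

def normalize_blocks(hist, eps=1e-6):
    """L2-normalize each cell by the energy of its 3x3 neighborhood of cells."""
    rows, cols, _ = hist.shape
    out = np.empty_like(hist)
    for r in range(rows):
        for c in range(cols):
            block = hist[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            out[r, c] = hist[r, c] / np.sqrt((block ** 2).sum() + eps)
    return out
```

The same two functions serve both descriptors: pass image gradients for the gradient histogram and flow components for the optical flow histogram.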

K-Means Clustering

Vector quantization

Turns high dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update the center to be the mean of the associated data points
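The two alternating steps of Lloyd's algorithm can be sketched directly. Initial centers are passed in by the caller, a fixed iteration count stands in for a convergence test, and empty clusters simply keep their previous center; all three are simplifying assumptions.

```python
import numpy as np

def assign_to_centers(data, centers):
    """Index of the closest center (squared Euclidean distance) for each point."""
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def lloyd_kmeans(data, init_centers, iters=10):
    """Alternate the assignment and mean-update steps for a fixed number of rounds."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(iters):
        labels = assign_to_centers(data, centers)
        for k in range(len(centers)):
            members = data[labels == k]
            if len(members) > 0:                   # leave empty clusters untouched
                centers[k] = members.mean(axis=0)
    return centers, assign_to_centers(data, centers)
```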

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
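To see why the output does not fit: 1 million features against 1000 centers is 10^9 distances, or about 4 GB at 4 bytes per entry (assuming single precision). A sketch of reducing as the computation proceeds is shown below in NumPy rather than CUDA: distances are computed one chunk of features at a time and immediately reduced with an argmin, so only a chunk-sized slice of the matrix exists at any moment. The function name and chunk size are hypothetical.

```python
import numpy as np

def nearest_center_chunked(features, centers, chunk=4096):
    """Nearest-center index per feature without storing the full distance matrix."""
    labels = np.empty(len(features), dtype=np.int64)
    c_norm = (centers ** 2).sum(axis=1)                        # ||c_j||^2, reused
    for start in range(0, len(features), chunk):
        block = features[start:start + chunk]
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2 for one chunk of features.
        d2 = (block ** 2).sum(axis=1)[:, None] - 2.0 * block @ centers.T + c_norm[None, :]
        labels[start:start + chunk] = d2.argmin(axis=1)        # reduce immediately
    return labels
```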

Clustering Helps

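Pairwise Distance Implementation

[Figure: bar chart comparing the CUDA and C implementations; axis 0 to 6000]

Given

$$
A = [a_1, \ldots, a_m], \qquad B = [b_1, \ldots, b_n]
$$

Compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & d(a_m, b_2) & \cdots & d(a_m, b_n)
\end{bmatrix}
$$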
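As an illustration, when d is the squared Euclidean distance (an assumption; the slide leaves d unspecified) the whole m-by-n matrix reduces to one matrix product plus two norm vectors. Here the vectors a_i and b_j are stored as the rows of `A` and `B`, and the function name is hypothetical.

```python
import numpy as np

def pairwise_sq_dist(A, B):
    """Matrix of squared Euclidean distances between rows of A and rows of B."""
    a2 = (A ** 2).sum(axis=1)[:, None]             # ||a_i||^2 as a column
    b2 = (B ** 2).sum(axis=1)[None, :]             # ||b_j||^2 as a row
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.maximum(a2 + b2 - 2.0 * (A @ B.T), 0.0)
```

Casting the problem as a matrix product makes it compute bound, which matches the GPU-friendly properties listed in the slides.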

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Figure: diagram of matrices A, B, and C with a Shared Memory block, shown over three slides]

Timings on TRECVid 2008 System

[Figure: bar chart on a logarithmic axis in milliseconds (1 to 10000) comparing GPU + CPU against CPU for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947.]

Benchmark

Dictionary Building Strategies

Dictionary Size | Histogramming Method | Rate of Detection: Low | Medium | High

1000 | Raw | 0.681 | 0.804 | 0.844

1000 | Norm | 0.708 | 0.799 | 0.840

1000 | Mt Inf | 0.594 | 0.804 | 0.848

500 | Raw | 0.675 | 0.792 | 0.833

500 | Norm | 0.701 | 0.791 | 0.825

500 | Mt Inf | 0.626 | 0.783 | 0.819

200 | Raw | 0.671 | 0.772 | 0.811

200 | Norm | 0.701 | 0.779 | 0.818

200 | Mt Inf | 0.614 | 0.720 | 0.756

Results (2009)

2008 results are shown in parentheses.

Event | True Positives | False Alarm | Miss | Min DCR

Pointing | 13 (57) | 225 (2505) | 1050 | 1.006

Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060

Person Runs | 1 (0) | 38 (314) | 106 | 0.997

Object Put | 1 (21) | 190 (2703) | 620 | 1.020

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering, Programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Page 24: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Examples of Images Examples of Images

More samples More samples

Evaluation Video Data of Round2 Evaluation Video Data of Round2

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

Evaluation Video Data of Round3 Evaluation Video Data of Round3

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Computation Powers Computation Powers

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Time Cost for Video Tasks Time Cost for Video Tasks

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Possible Accelerations for Video Possible Accelerations for Video

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE Grand Unified Feature

Extractor

GUFE Grand Unified Feature

Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Observations Observations 1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

Algorithms Algorithms

Input a query image and its category number

0 Preprocessing compute the matching between the evaluation and the

development data

Query Expansion

1 Expand the query image by retrieving all the images from the development

data set with the same category

2 Search the evaluation set with the expanded query

Output return the top 5020 results

Algorithms Algorithms

Motivation using a GMM to model the distribution of

patches

1 Train a UBM (Universal Background Model) based on

patches from all training images

2 MAP Estimation of the distribution of the patches

belonging to one image given UBM

3 Compute pair-wise image distance based on patch

kernel and within-class covariance normalization

3 Retrieving images based the normalized distance

VT1 Performance (2 in 8) VT1 Performance (2 in 8)

Category MAP

bull101 Crowd (gt10 people) 08419

bull102 Building with sky as backdrop clearly visible 0977

bull103 Mobile devices including handphonePDA 0028

bull107 Person using Computer both visible 02281

bull109 Company Trademark including billboard logo 096

bull112 Closeup of hand eg using mouse writing etc 04584

bull113 Business meeting (gt 2 people) mostly seated down table visible 00644

bull115 Food on dishes plates 02285

bull116 Face closeup occupying about 34 of screen frontal or side 09783

bull117 Traffic Scene many cars trucks road visible 02901

VT2 Performance(1in8) VT2 Performance(1in8)

Category MAP

bull202 Talking face with introductory caption 08432

bull206 Static or minute camera movement people(s)

walking legs visible 00581

bull207 Large camera movement panning leftright

topdown of a scene 07789

bull208 Movie ending credit 02782

bull209 Woman monologue Zhen 09756

Performance of Round3 (1in7) Performance of Round3 (1in7)

Task 2 (VT1)

Target Estimated MAP (R=20)

101 Crowd (gt10 people) 064

102 Building with sky as backdrop clearly visible 1

107 Person using Computer both visible 07

112 Closeup of hand eg using mouse writing etc 0527

116 Face closeup occupying about 34 of screen frontal or side 1

Task3 (AT1 + VT2)

Retrieval Target VT2 only AT1 + VT2

Video R=20

202 face with introductory caption 1 003

209 women monolog 035 01

201 People entering door NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC Text REtrieval Conference TRECVid Video Retrieval Workshop

Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million

Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid Working with ViVid

Why parallel Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days

Image Video Processing

Video Decoder

2D3D Convolution

2D3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients Optical Flow

Analysis Vector Quantization

SVM Classifier Evaluation

ViVid ndash Video Computer Vision on Graphics Processors

Download

httpgithubcommertdikmenViVid

TRECVid 2008 System

TRECVid 2009 System TRECVid 2009 System

Features Video

Motion

Shape

Classifier

Event Label

bull Running

bull Pointing

bull Object Put

bull Cell To Ear

Vector

Quantization

Histogram

Interest Points

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

K-Means Clustering K-Means Clustering

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

K-Means K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Clustering Helps Clustering Helps

Pairwise Distance Implementation Pairwise Distance Implementation

0 2000 4000 6000

CUDA

C

)bd(a)bd(a

)bd(a

)bd(a)bd(a)bd(a

nm1m

12

n11111

n1

m1

bbB

aaA

Compute Given

CPU vs GPU CPU vs GPU

Algorithmic properties that map well to GPUs

1 Independent and highly data local

computations

2Compute bound

3Little branch divergence

Pairwise Distance Computation on the GPU Pairwise Distance Computation on the GPU

Shared


More samples

Evaluation Video Data of Round2

31 MPEG videos, ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames, 8486 real key frames

Evaluation Video Data of Round3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)

Key frames for VT1: 10580 jpg files

Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)

Video: 352x288

Computation Powers

Workstations in IFP

10 servers, 2~4 CPUs each, 36 CPUs in total

IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs

CSL Cluster

Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes

Video Format Conversion: 2 hours

Video Segmentation (for VT2): 40 minutes

Sound Track Extraction: 30 minutes

Feature Extraction

Global Feature 2: 2 hours (C)

Global Feature 1: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (Matlab)

Semantic Feature 1: 24 hours (Matlab)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (Matlab)

Motion Feature 2: 3 hours on t-Illiac

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPUs, Matlab)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video

Matlab code to C

Parallel computing

GPU acceleration

Patch-based features

Load time is the major issue

Extracting all the features after one load

Features for Round2-VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color, texture, etc.

Semantic Feature

Features for Round2-VT2

Character Detector

Harris corner

Morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature Normalization/Combination

Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly

e.g. VT1 101 103(26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
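The query-expansion steps above can be sketched as follows. This is only an illustration under stated assumptions, not the team's implementation: items are plain feature vectors, the distance is squared Euclidean, and an evaluation item is scored by its distance to the nearest member of the expanded query. The names `expand_query` and `search` are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def expand_query(query_vec, category, dev_set):
    """Step 1: add every development item labeled with the query's category.

    dev_set is a list of (feature_vector, category) pairs (assumed layout).
    """
    return [query_vec] + [vec for vec, cat in dev_set if cat == category]


def search(expanded, eval_set, top_k):
    """Step 2: rank evaluation items by distance to the nearest query member."""
    scored = []
    for idx, vec in enumerate(eval_set):
        scored.append((min(sq_dist(vec, q) for q in expanded), idx))
    scored.sort()
    return [idx for _, idx in scored[:top_k]]


# Toy data: two development items share category 101 with the query.
dev = [([0.0, 0.0], 101), ([5.0, 5.0], 102), ([0.2, 0.1], 101)]
evaluation = [[4.9, 5.1], [0.1, 0.0], [9.0, 9.0], [0.3, 0.2]]
expanded = expand_query([0.0, 0.1], 101, dev)
top = search(expanded, evaluation, top_k=2)
```

With the toy data, the two evaluation items near the origin are returned first, because the expanded query covers more of the category's neighborhood than the single query image would.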

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
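Step 2 (MAP estimation given the UBM) can be illustrated with a small one-dimensional sketch. The slides do not say which parameters are adapted or how, so this assumes the common GMM-UBM recipe in which only the component means are adapted and a relevance factor controls how far they move; the function name and the default `relevance=16.0` are assumptions.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Adapt UBM component means toward one image's patches (means only)."""
    k = len(means)
    n = [0.0] * k        # soft count of patches per component
    first = [0.0] * k    # posterior-weighted sum of patches
    for x in patches:
        lik = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        for j in range(k):
            post = lik[j] / total   # responsibility of component j for x
            n[j] += post
            first[j] += post * x
    adapted = []
    for j in range(k):
        alpha = n[j] / (n[j] + relevance)   # more data -> trust the data more
        data_mean = first[j] / n[j] if n[j] > 0 else means[j]
        adapted.append(alpha * data_mean + (1 - alpha) * means[j])
    return adapted


# Two-component UBM; 16 patches that all sit near the first component.
adapted = map_adapt_means([1.0] * 16, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

In this toy case the first mean moves halfway toward the data (16 patches against a relevance factor of 16), while the second component sees almost no data and stays at its UBM value.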

VT1 Performance (2 in 8)

Category                                                                     MAP
101 Crowd (>10 people)                                                       0.8419
102 Building with sky as backdrop, clearly visible                           0.977
103 Mobile devices including handphone/PDA                                   0.028
107 Person using computer, both visible                                      0.2281
109 Company trademark, including billboard, logo                             0.96
112 Closeup of hand, e.g. using mouse, writing, etc.                         0.4584
113 Business meeting (>2 people), mostly seated down, table visible          0.0644
115 Food on dishes, plates                                                   0.2285
116 Face closeup, occupying about 3/4 of screen, frontal or side             0.9783
117 Traffic scene, many cars, trucks, road visible                           0.2901
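The MAP column reports mean average precision. The slides do not give the exact definition used by the evaluation, so the sketch below uses one common convention: precision is accumulated at each rank where a relevant item appears, and the sum is normalized by the number of relevant items found within the cutoff. The function names and the `cutoff` parameter are illustrative.

```python
def average_precision(ranked_relevance, cutoff=None):
    """Average precision of one ranked list (1 = relevant, 0 = not)."""
    if cutoff is not None:
        ranked_relevance = ranked_relevance[:cutoff]
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs, cutoff=None):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r, cutoff) for r in runs) / len(runs)


ap = average_precision([1, 0, 1, 0])          # (1/1 + 2/3) / 2
m = mean_average_precision([[1, 1, 1], [0, 0, 0]])
```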

VT2 Performance (1 in 8)

Category                                                                     MAP
202 Talking face with introductory caption                                   0.8432
206 Static or minute camera movement, people(s) walking, legs visible        0.0581
207 Large camera movement, panning left/right, top/down of a scene           0.7789
208 Movie ending credit                                                      0.2782
209 Woman monologue Zhen                                                     0.9756

Performance of Round3 (1 in 7)

Task 2 (VT1)

Target                                                                Estimated MAP (R=20)
101 Crowd (>10 people)                                                0.64
102 Building with sky as backdrop, clearly visible                    1
107 Person using computer, both visible                               0.7
112 Closeup of hand, e.g. using mouse, writing, etc.                  0.527
116 Face closeup, occupying about 3/4 of screen, frontal or side      1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20)            VT2 only    AT1 + VT2
202 Face with introductory caption        1           0.03
209 Women monolog                         0.35        0.1
201 People entering door                  NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Shot Boundary Detection

Copy Detection

Video Search

Summarization

High Level Feature Extraction

Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Detection of Opposing Flow Event

[Diagram labels: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA C/C++

Integrates libraries

Data flow

Lazy pull

Per-frame referencing

Caches (lots of them)
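The "lazy pull", "per-frame referencing" and "caches" items describe a data-flow style that can be sketched in a few lines of Python. This is not ViVid's API: the classes `FrameSource` and `LazyPipeline` are hypothetical stand-ins, and the frame difference is just a placeholder for a real per-frame operator. The point is that nothing is decoded until a result for a given frame index is requested, and intermediate results are cached so neighboring requests reuse them.

```python
from functools import lru_cache


class FrameSource:
    """Stand-in for a video decoder; counts how many frames were decoded."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.decoded = 0

    def read(self, index):
        self.decoded += 1
        return [index] * 4  # fake 4-pixel frame


class LazyPipeline:
    """Each stage is requested by frame index and memoized."""

    def __init__(self, source, cache_size=64):
        self.source = source
        self.frame = lru_cache(maxsize=cache_size)(self._frame)
        self.diff = lru_cache(maxsize=cache_size)(self._diff)

    def _frame(self, index):
        return tuple(self.source.read(index))

    def _diff(self, index):
        """Frame difference; pulls frames index and index - 1 on demand."""
        cur, prev = self.frame(index), self.frame(index - 1)
        return tuple(c - p for c, p in zip(cur, prev))


src = FrameSource(100)
pipe = LazyPipeline(src)
d5 = pipe.diff(5)   # decodes frames 5 and 4
d6 = pipe.diff(6)   # decodes only frame 6; frame 5 comes from the cache
```

After the two requests only three frames have been decoded, which is the saving the caches are meant to provide when operators look at overlapping temporal windows.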

Motivations

Most operations are highly local

Applications with real-time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia indexing

Visual computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
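The "~5 days" figure can be checked with a short calculation. The upload rate and the photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20      # from the slide
FACEBOOK_PHOTOS = 15e9              # from the slide
FRAMES_PER_SECOND = 30              # assumption; not stated on the slide

# Seconds of video uploaded in one day (20 h per minute, 1440 minutes per day).
video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
frames_per_day = video_seconds_per_day * FRAMES_PER_SECOND
days_to_match = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:.3g} frames/day, {days_to_match:.1f} days")
```

Under that assumption the upload stream amounts to about 3.1 billion frames per day, so it takes a little under five days to reach 15 billion. At 25 frames per second the same calculation gives a little under six days.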

Image/Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid – Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Pipeline diagram labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\, H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
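The motion history update on this slide sets a pixel to τ wherever motion is detected and lets it decay by one per frame elsewhere. A minimal sketch of that update follows. The slide does not say how the binary motion indicator D is obtained; the thresholded frame difference in `frame_difference_mask` is an assumed, commonly used choice, and both function names are illustrative.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step of the motion history image update.

    prev_mhi: 2-D list holding H_tau(x, y, t-1)
    motion_mask: 2-D list of 0/1 values D(x, y, t)
    """
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]


def frame_difference_mask(prev_frame, frame, threshold):
    """Assumed choice of D: thresholded absolute frame difference."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame, prev_frame)
    ]


mhi = [[0, 0], [0, 0]]
mhi = update_mhi(mhi, [[1, 0], [0, 0]], tau=3)   # motion at the top-left pixel
mhi = update_mhi(mhi, [[0, 1], [0, 0]], tau=3)   # motion moves one pixel right
mask = frame_difference_mask([[10, 10]], [[10, 50]], threshold=20)
```

After the second step the older motion has decayed to 2 while the newer motion sits at 3, so the image encodes both where and how recently motion occurred.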

[Bar chart: milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Extracted bar labels, in extraction order: 133, 438, 51, 251; decimal points may have been lost.]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient / optical flow based on the direction and magnitude

• Normalize over neighboring regions
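The three steps above can be sketched for the image-gradient case as follows. This is a simplified illustration, not the ViVid implementation: the cell size, the number of orientation bins, hard assignment to a single bin, and L2 normalization over 2x2 blocks of cells are all assumptions.

```python
import math


def cell_histograms(image, cell=4, bins=8):
    """Magnitude-weighted orientation histogram for each cell x cell region."""
    rows, cols = len(image), len(image[0])
    n_cy, n_cx = rows // cell, cols // cell
    hists = [[[0.0] * bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi      # unsigned orientation
            b = min(int(angle / math.pi * bins), bins - 1)
            cy, cx = y // cell, x // cell
            if cy < n_cy and cx < n_cx:
                hists[cy][cx][b] += mag
    return hists


def normalize_blocks(hists, eps=1e-6):
    """L2-normalize each 2x2 block of neighboring cells."""
    blocks = []
    for cy in range(len(hists) - 1):
        for cx in range(len(hists[0]) - 1):
            vec = (hists[cy][cx] + hists[cy][cx + 1]
                   + hists[cy + 1][cx] + hists[cy + 1][cx + 1])
            norm = math.sqrt(sum(v * v for v in vec) + eps)
            blocks.append([v / norm for v in vec])
    return blocks


# Horizontal intensity ramp: every gradient points along x (orientation bin 0).
ramp = [[x for x in range(8)] for _ in range(8)]
hists = cell_histograms(ramp)
blocks = normalize_blocks(hists)
```

For optical flow the same histogramming would be applied to the flow vectors (u, v) in place of the gradient (gx, gy).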

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update each center to be the mean of the associated data points
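The two alternating steps of Lloyd's algorithm can be written compactly. The sketch below is a plain reference version for small inputs; the fixed iteration count, the caller-supplied initial centers, and the choice to keep a center unchanged when no point is assigned to it are assumptions made for brevity.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm; `centers` holds the initial guesses."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if nothing was assigned
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


points = [[0, 0], [0, 2], [10, 10], [10, 12]]
centers, labels = kmeans(points, centers=[[0, 0], [10, 10]])
```

The assignment step is where nearly all the time goes, since it needs the distance from every point to every center.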

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
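The memory problem is easy to quantify: one million features against 1000 centers gives 10^9 distances, which is about 4 GB if each is stored as a 4-byte float (the element size is an assumption). Only the index of the closest center is needed per feature, so the minimum can be tracked while distances are produced. The sketch below shows that idea on the CPU; `chunk_size` stands in for a GPU tile and the function name is hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_centers(points, centers, chunk_size=1024):
    """Closest-center index per point, without storing the distance matrix."""
    labels = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        for p in chunk:
            best_j, best_d = 0, sq_dist(p, centers[0])
            for j in range(1, len(centers)):
                d = sq_dist(p, centers[j])
                if d < best_d:          # running argmin reduction
                    best_j, best_d = j, d
            labels.append(best_j)
    return labels


full_matrix_bytes = 1_000_000 * 1000 * 4   # all distances as 4-byte floats
labels_bytes = 1_000_000 * 4               # one 4-byte index per feature

labels = nearest_centers([[0, 0], [9, 9], [1, 1]], [[0, 0], [10, 10]], chunk_size=2)
```

Fusing the reduction with the distance computation shrinks the output from gigabytes to a few megabytes.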

Clustering Helps

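Pairwise Distance Implementation

Given

$$A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]$$

Compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
$$

[Bar chart: C vs. CUDA, axis 0 to 6000]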
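As a concrete instance of the matrix above, the sketch below takes d to be the squared Euclidean distance, which is an assumption since the slide leaves d generic. It uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, so the bulk of the work is the set of dot products between rows of A and rows of B, which has the same structure as a matrix multiplication.

```python
def pairwise_sq_dist(A, B):
    """Return the m x n matrix D with D[i][j] = ||a_i - b_j||^2."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(x * x for x in b) for b in B]
    return [
        [a_norms[i] + b_norms[j] - 2 * sum(x * y for x, y in zip(a, b))
         for j, b in enumerate(B)]
        for i, a in enumerate(A)
    ]


A = [[0, 0], [1, 1]]
B = [[1, 0], [0, 2], [1, 1]]
D = pairwise_sq_dist(A, B)
```

Each entry depends only on one row of A and one row of B, so all m x n entries can be computed independently.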

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram, three animation steps: input matrices A and B, shared memory, output matrix C]

Timings on TRECVid 2008 System

[Bar chart: milliseconds (log scale, 1 to 10000) per stage (Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance), comparing GPU + CPU against CPU. Extracted bar labels, in extraction order: 53, 79, 23, 240, 150, 53, 79, 3030, 4947; decimal points may have been lost.]

Benchmark

Dictionary Building Strategies

                                          Rate of Detection
Dictionary Size   Histogramming Method    Low      Medium   High
1000              Raw                     0.681    0.804    0.844
                  Norm                    0.708    0.799    0.840
                  Mt Inf                  0.594    0.804    0.848
500               Raw                     0.675    0.792    0.833
                  Norm                    0.701    0.791    0.825
                  Mt Inf                  0.626    0.783    0.819
200               Raw                     0.671    0.772    0.811
                  Norm                    0.701    0.779    0.818
                  Mt Inf                  0.614    0.720    0.756
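The slide does not define the histogramming methods. A plausible reading, stated here as an assumption, is that "Raw" counts how often each dictionary entry occurs in a clip and "Norm" divides those counts by their total so that clips with different numbers of interest points become comparable. The sketch below shows these two variants only; "Mt Inf" is not reconstructed because its definition is not given.

```python
def word_histogram(word_ids, dictionary_size, normalize=False):
    """Count how often each dictionary entry occurs in a clip."""
    hist = [0.0] * dictionary_size
    for w in word_ids:
        hist[w] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [h / total for h in hist]   # L1 normalization
    return hist


# word_ids would come from vector quantization (closest center per feature).
raw = word_histogram([0, 2, 2, 3], dictionary_size=4)
norm = word_histogram([0, 2, 2, 3], dictionary_size=4, normalize=True)
```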

Results (2009)

(2008 results in parentheses)

Event          True Positives   False Alarm    Miss    Min DCR
Pointing       13 (57)          225 (2505)     1050    1.006
Cell To Ear    0 (8)            58 (4005)      194     1.060
Person Runs    1 (0)            38 (314)       106     0.997
Object Put     1 (21)           190 (2703)     620     1.020

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering, programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10000 Classes

Page 26: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Evaluation Video Data of Round2 Evaluation Video Data of Round2

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

31 Mpeg Videos ~20 hours

17289 frames for VT1 in total

40994 frames for VT2 in total

32508 pseudo key frames 8486 real key frames

Evaluation Video Data of Round3 Evaluation Video Data of Round3

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Video Files 27 Mpeg1 files (13 hours of videoaudio in total)

Key frames for VT1 10580 jpg files

Key frames for VT2 64546 files in total including 10580 jpg files (true key frames) + 53966 jpg

files (pseudo key frames)

Video 352288

Computation Powers Computation Powers

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Time Cost for Video Tasks Time Cost for Video Tasks

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Possible Accelerations for Video Possible Accelerations for Video

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE Grand Unified Feature

Extractor

GUFE Grand Unified Feature

Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Observations Observations 1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

Algorithms Algorithms

Input a query image and its category number

0 Preprocessing compute the matching between the evaluation and the

development data

Query Expansion

1 Expand the query image by retrieving all the images from the development

data set with the same category

2 Search the evaluation set with the expanded query

Output return the top 5020 results

Algorithms Algorithms

Motivation using a GMM to model the distribution of

patches

1 Train a UBM (Universal Background Model) based on

patches from all training images

2 MAP Estimation of the distribution of the patches

belonging to one image given UBM

3 Compute pair-wise image distance based on patch

kernel and within-class covariance normalization

3 Retrieving images based the normalized distance

VT1 Performance (2 in 8) VT1 Performance (2 in 8)

Category MAP

bull101 Crowd (gt10 people) 08419

bull102 Building with sky as backdrop clearly visible 0977

bull103 Mobile devices including handphonePDA 0028

bull107 Person using Computer both visible 02281

bull109 Company Trademark including billboard logo 096

bull112 Closeup of hand eg using mouse writing etc 04584

bull113 Business meeting (gt 2 people) mostly seated down table visible 00644

bull115 Food on dishes plates 02285

bull116 Face closeup occupying about 34 of screen frontal or side 09783

bull117 Traffic Scene many cars trucks road visible 02901

VT2 Performance(1in8) VT2 Performance(1in8)

Category MAP

bull202 Talking face with introductory caption 08432

bull206 Static or minute camera movement people(s)

walking legs visible 00581

bull207 Large camera movement panning leftright

topdown of a scene 07789

bull208 Movie ending credit 02782

bull209 Woman monologue Zhen 09756

Performance of Round3 (1in7) Performance of Round3 (1in7)

Task 2 (VT1)

Target Estimated MAP (R=20)

101 Crowd (gt10 people) 064

102 Building with sky as backdrop clearly visible 1

107 Person using Computer both visible 07

112 Closeup of hand eg using mouse writing etc 0527

116 Face closeup occupying about 34 of screen frontal or side 1

Task3 (AT1 + VT2)

Retrieval Target VT2 only AT1 + VT2

Video R=20

202 face with introductory caption 1 003

209 women monolog 035 01

201 People entering door NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC Text REtrieval Conference TRECVid Video Retrieval Workshop

Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million

Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid Working with ViVid

Why parallel Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days

Image Video Processing

Video Decoder

2D3D Convolution

2D3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients Optical Flow

Analysis Vector Quantization

SVM Classifier Evaluation

ViVid ndash Video Computer Vision on Graphics Processors

Download

httpgithubcommertdikmenViVid

TRECVid 2008 System

TRECVid 2009 System TRECVid 2009 System

Features Video

Motion

Shape

Classifier

Event Label

bull Running

bull Pointing

bull Object Put

bull Cell To Ear

Vector

Quantization

Histogram

Interest Points

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

K-Means Clustering K-Means Clustering

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

K-Means K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

Evaluation Video Data of Round3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 jpg files
Key frames for VT2: 64546 files in total, including 10580 jpg files (true key frames) + 53966 jpg files (pseudo key frames)
Video: 352x288

Computation Powers

Workstations in IFP
  10 servers, 2~4 CPUs each, 36 CPUs in total
  IFP-32 cluster: 32 dual-core 2.8G 64-bit CPUs
CSL Cluster
  Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
  Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, with 1.5 GB of RAM per node
TeraGrid

Time Cost for Video Tasks

Data Decompression: 15 minutes
Video Format Conversion: 2 hours
Video Segmentation (for VT2): 40 minutes
Sound Track Extraction: 30 minutes
Feature Extraction
  Global Feature 2: 2 hours (C)
  Global Feature 1: 2 hours (C)
  Patch-based Feature 1: 2 hours (C)
  Patch-based Feature 2: 5 hours (MATLAB)
  Semantic Feature 1: 24 hours (MATLAB)
  Semantic Feature 2: 3 hours (C)
  Semantic Feature 3: 4 hours (C)
  Motion Feature 1: 24 hours (MATLAB)
  Motion Feature 2: 3 hours on t-Illiac
Classifier Training
  Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
  Classifier 2: 20 minutes
  Classifier 3: less than 10 minutes

Possible Accelerations for Video

Port MATLAB code to C
Parallel computing
GPU acceleration
Patch-based features
  Load time is the major issue
  Extract all the features after a single load
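The "one load, all features" point can be sketched as below. This is only an illustration in Python, not the team's code: `load_frame`, the in-memory `fake_disk` and the two toy extractors are hypothetical stand-ins for the real decoder and feature code.

```python
from typing import Callable, Dict, List

Frame = List[List[float]]  # a grayscale frame as rows of pixel values


def mean_intensity(frame: Frame) -> float:
    """Toy global feature: average pixel value."""
    pixels = [p for row in frame for p in row]
    return sum(pixels) / len(pixels)


def horizontal_energy(frame: Frame) -> float:
    """Toy local feature: sum of absolute horizontal pixel differences."""
    return sum(abs(row[i + 1] - row[i]) for row in frame for i in range(len(row) - 1))


def extract_all(paths: List[str],
                load_frame: Callable[[str], Frame],
                extractors: Dict[str, Callable[[Frame], float]]) -> Dict[str, Dict[str, float]]:
    """Load each frame once and run every extractor on the in-memory copy."""
    results = {}
    for path in paths:
        frame = load_frame(path)  # the expensive step happens once per frame
        results[path] = {name: fn(frame) for name, fn in extractors.items()}
    return results


# Hypothetical in-memory "disk" so the sketch runs without real video files.
fake_disk = {"f0.jpg": [[0.0, 1.0], [2.0, 3.0]], "f1.jpg": [[1.0, 1.0], [1.0, 1.0]]}
load_count = {"n": 0}


def load_frame(path: str) -> Frame:
    load_count["n"] += 1  # count loads to show each frame is read only once
    return fake_disk[path]


features = extract_all(list(fake_disk), load_frame,
                       {"mean": mean_intensity, "h_energy": horizontal_energy})
```

With two extractors and two frames, `load_frame` runs twice rather than four times; the saving grows with the number of feature types.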

Features for Round2-VT1

Image Features
  SIFT
  HOG
  GIST
  APC
  LBP
  Color, texture, etc.
Semantic Feature

Features for Round2-VT2

Character Detector
  Harris corner
  Morphological operations
Optical Flow
  Lucas-Kanade on spatial intensity gradient
Gender Recognition
  SODA-boost based
Motion History Image
Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on NN
Feature normalization/combination
Result visualization

Observations

1. Samples under the same category are more semantically similar to each other
2. The shot boundaries are not well defined
3. Some of the key frames are not labeled correctly, e.g. VT1 101, 103 (26-141)

Algorithms

Input: a query image and its category number
0. Preprocessing: compute the matching between the evaluation and the development data
Query Expansion
1. Expand the query image by retrieving all the images from the development data set with the same category
2. Search the evaluation set with the expanded query
Output: return the top 50/20 results
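The query-expansion steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: features are plain vectors, the ranking uses the Euclidean distance to the nearest member of the expanded query, and the names `expand_query` and `search` are hypothetical. The slides do not specify the actual matching function.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def expand_query(query: Vector, category: int,
                 dev_set: List[Tuple[Vector, int]]) -> List[Vector]:
    """Step 1: the query plus every development image with the same category."""
    return [query] + [feat for feat, label in dev_set if label == category]


def search(expanded: List[Vector], eval_set: List[Vector], top_k: int) -> List[int]:
    """Step 2: rank evaluation images by distance to their nearest expanded-query member."""
    scores = [min(euclidean(e, q) for q in expanded) for e in eval_set]
    ranked = sorted(range(len(eval_set)), key=lambda i: scores[i])
    return ranked[:top_k]


# Toy data: (feature vector, category number) pairs for the development set.
dev = [([0.0, 0.0], 101), ([5.0, 5.0], 102), ([0.0, 1.0], 101)]
evaluation = [[5.0, 4.0], [0.0, 0.9], [9.0, 9.0], [0.2, 0.0]]
top = search(expand_query([0.1, 0.1], 101, dev), evaluation, top_k=2)
```

Here evaluation image 1 ranks first because it is close to a development image of category 101, even though it is farther from the original query than image 3.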

Algorithms

Motivation: using a GMM to model the distribution of patches
1. Train a UBM (Universal Background Model) based on patches from all training images
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM
3. Compute pairwise image distance based on patch kernel and within-class covariance normalization
4. Retrieve images based on the normalized distance
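Step 2 can be illustrated with relevance-factor MAP adaptation of the component means, a common choice in GMM-UBM systems. The slides do not say which parameters were adapted or how, so the one-dimensional features, the mean-only update and the `relevance` value below are all assumptions made for this sketch.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Shift each UBM mean toward the patches of one image (mean-only MAP adaptation)."""
    k = len(means)
    counts = [0.0] * k   # soft number of patches assigned to each component
    sums = [0.0] * k     # posterior-weighted sum of patch values
    for x in patches:
        likes = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:  # patch too far from every component to be represented
            continue
        for j in range(k):
            post = likes[j] / total
            counts[j] += post
            sums[j] += post * x
    adapted = []
    for j in range(k):
        alpha = counts[j] / (counts[j] + relevance)  # more data -> trust the data more
        data_mean = sums[j] / counts[j] if counts[j] > 0 else means[j]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[j])
    return adapted


# Toy UBM with two components; every patch of this "image" equals 1.0.
adapted = map_adapt_means([1.0] * 16, weights=[0.5, 0.5],
                          means=[0.0, 10.0], variances=[1.0, 1.0])
```

The component near the data moves halfway toward it (16 patches against a relevance factor of 16), while the distant component stays essentially at its UBM value.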

VT1 Performance (2 in 8)

Category | MAP
101 Crowd (>10 people) | 0.8419
102 Building with sky as backdrop, clearly visible | 0.977
103 Mobile devices including handphone/PDA | 0.028
107 Person using computer, both visible | 0.2281
109 Company trademark, including billboard, logo | 0.96
112 Closeup of hand, e.g. using mouse, writing, etc. | 0.4584
113 Business meeting (>2 people), mostly seated down, table visible | 0.0644
115 Food on dishes, plates | 0.2285
116 Face closeup occupying about 3/4 of screen, frontal or side | 0.9783
117 Traffic scene, many cars, trucks, road visible | 0.2901

VT2 Performance (1 in 8)

Category | MAP
202 Talking face with introductory caption | 0.8432
206 Static or minute camera movement, people(s) walking, legs visible | 0.0581
207 Large camera movement, panning left/right, top/down of a scene | 0.7789
208 Movie ending credit | 0.2782
209 Woman monologue Zhen | 0.9756

Performance of Round3 (1 in 7)

Task 2 (VT1)
Target | Estimated MAP (R=20)
101 Crowd (>10 people) | 0.64
102 Building with sky as backdrop, clearly visible | 1
107 Person using computer, both visible | 0.7
112 Closeup of hand, e.g. using mouse, writing, etc. | 0.527
116 Face closeup occupying about 3/4 of screen, frontal or side | 1

Task 3 (AT1 + VT2), video R=20
Retrieval Target | VT2 only | AT1 + VT2
202 Face with introductory caption | 1 | 0.03
209 Women monolog | 0.35 | 0.1
201 People entering door | NA

We are
  2nd in Audio search
  4th in Video search
  2nd in AV search
  1st overall

TRECVid

TREC: Text REtrieval Conference
TRECVid: Video Retrieval Workshop
Tasks: Shot Boundary Detection, Copy Detection, Video Search, Summarization, High Level Feature Extraction, Surveillance Event Detection
Our task: Surveillance Event Detection

Surveillance Event Detection

The Dataset
  Surveillance footage from London Gatwick Airport
  5 stationary cameras
  Training set: 100 hours
  Testing set: 44 hours
  Frame size: 700x540 px
  Dataset size: ~350 GB
  Frames: ~12 million

Surveillance Event Detection

List of Events
  1. Cell to ear
  2. Embrace
  3. Object put
  4. Opposing flow
  5. Pointing
  6. Taking picture
  7. Running
  8. People meeting
  9. People splitting
  10. Person not entering elevator

Detection of Opposing Flow Event

[Diagram labels: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool
  Rapid development
  Fast execution
Python glue layer
  CUDA, C/C++
  Integrates libraries
Data flow
  Lazy pull
  Per-frame referencing
  Caches (lots of them)
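The data-flow idea (lazy pull, per-frame referencing, caches) can be sketched as follows. The `Source` and `LazyNode` classes are hypothetical and written only for this illustration; they are not ViVid's actual API.

```python
class Source:
    """Hypothetical frame source; stands in for a video decoder."""

    def __init__(self, frames):
        self.frames = frames
        self.reads = 0

    def get(self, index):
        self.reads += 1
        return self.frames[index]


class LazyNode:
    """Computes its output for a frame only when asked, then caches it per frame index."""

    def __init__(self, func, *parents):
        self.func = func
        self.parents = parents
        self.cache = {}

    def get(self, index):
        if index not in self.cache:
            inputs = [p.get(index) for p in self.parents]  # pull from upstream on demand
            self.cache[index] = self.func(*inputs)
        return self.cache[index]


video = Source([1, 2, 3, 4])
doubled = LazyNode(lambda f: 2 * f, video)
squared = LazyNode(lambda f: f * f, video)
combined = LazyNode(lambda a, b: a + b, doubled, squared)

first = combined.get(2)   # pulls frame 2 through both branches
second = combined.get(2)  # served from the cache, nothing is recomputed
```

Nothing is computed until `combined.get(2)` is called, and the second request touches neither the branches nor the source. Frames that are never requested are never decoded.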

Motivations

Most operations are highly local
Applications with real-time (or faster) performance requirements
  Surveillance
  Soft biometrics
  Multimedia indexing
Visual Computing is here
  Imaging and Photogrammetry
  Pattern Recognition and Statistical Learning
  Object Detection and Recognition
  Dynamic Vision
  Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data
  20 hours of video uploaded to YouTube every minute
  15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)
If individual frames were counted as images, YouTube replicates the entire Facebook image DB every ~5 days
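The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20  # from the slide
FACEBOOK_PHOTOS = 15e9          # from the slide
FRAMES_PER_SECOND = 30          # assumption, not stated on the slide

# Seconds of video uploaded per day: 20 h of video per minute, 1440 minutes per day.
video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
frames_per_day = video_seconds_per_day * FRAMES_PER_SECOND
days_to_match_facebook = FACEBOOK_PHOTOS / frames_per_day

print(f"{frames_per_day:.3g} frames per day, {days_to_match_facebook:.1f} days")
```

Under that assumption the uploads amount to roughly 3.1 billion frames per day, so the 15 billion photos are matched in a little under five days.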

ViVid - Video Computer Vision on Graphics Processors

Image / Video Processing
  Video Decoder
  2D/3D Convolution
  2D/3D Fourier Transform
  Optical Flow
Feature Extraction
  Motion Descriptor (Efros et al.)
  Motion History Descriptor
  Random Video Interest Points
  Histograms of Oriented Gradients / Optical Flow
Analysis
  Vector Quantization
  SVM Classifier Evaluation
Download: http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Pipeline diagram labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev
  3D Harris Corner Detector
  Corners
Dollar
  Space-Time Gabor
  Corners
  Periodic Motion
RSMB
  Random Sampling of the Motion Boundary
  Motion

[Figure panels: Laptev, Dollar, RSMB]

More is Good

[Figure: Interest Point Detection Rates]

Video Features

Descriptors of information relevant to the task
  Motion
  Shape
  Appearance
Computationally intensive
  Development
  Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
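The motion history update sets a pixel to τ wherever the motion mask D is 1 and otherwise decays the previous value by 1, never going below 0. A minimal Python sketch of one time step follows; nested lists stand in for image arrays, and the function name `update_mhi` is hypothetical.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One time step: tau where motion is detected, otherwise decay by 1 (floored at 0)."""
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]


tau = 3
mhi = [[0, 0, 0]]                                # a 1 x 3 "image" with no motion history
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]  # a blob moving left to right
for mask in masks:
    mhi = update_mhi(mhi, mask, tau)
```

After the three steps the history reads `[[1, 2, 3]]`: the most recent motion is brightest, so the intensity gradient encodes the direction of movement.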

[Bar chart, milliseconds per frame, axis from 0 to 300. Series: CUDA (feature + distance + argmin), CUDA (distance + argmin), CUDA (distance), C. Bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
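The three steps can be sketched for a single local region (cell) as below. The sketch assumes signed directions over the full circle, 8 bins, magnitude-weighted votes and L2 normalization over a block of neighboring cells. These are common choices, not details taken from the slides.

```python
import math


def cell_histogram(dx, dy, n_bins=8):
    """Histogram of vector directions in one cell, each vote weighted by magnitude."""
    hist = [0.0] * n_bins
    for gx_row, gy_row in zip(dx, dy):
        for gx, gy in zip(gx_row, gy_row):
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2.0 * math.pi)  # direction in [0, 2*pi)
            bin_index = int(angle / (2.0 * math.pi) * n_bins) % n_bins
            hist[bin_index] += magnitude
    return hist


def normalize_block(cell_hists, eps=1e-6):
    """Concatenate neighboring cell histograms and L2-normalize them together."""
    block = [v for hist in cell_hists for v in hist]
    norm = math.sqrt(sum(v * v for v in block) + eps)
    return [v / norm for v in block]


# One cell with two vectors: (3, 0) points along +x, (0, 4) points along +y.
hist = cell_histogram(dx=[[3.0, 0.0]], dy=[[0.0, 4.0]])
descriptor = normalize_block([hist])
```

The same code serves both descriptors: pass image gradients as `dx`, `dy` for oriented gradients, or the two optical-flow components for oriented flow.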

K-Means Clustering

Vector quantization
  Turns high-dimensional features into a discrete number of points
  Given data, find representative "centers"
Lloyd's algorithm
  For each data point, find the closest center
  Update each center to be the mean of its associated data points
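The two alternating steps of Lloyd's algorithm can be written directly. This is a minimal pure-Python sketch with squared Euclidean distance and a fixed iteration count; the initial centers are supplied by the caller, and keeping an empty cluster's center unchanged is a choice made here for simplicity.

```python
def closest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance to point."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))


def lloyd(points, centers, iterations=10):
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: each point goes to its closest center.
        labels = [closest_center(p, centers) for p in points]
        # Update step: each center becomes the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if no point was assigned to it
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    labels = [closest_center(p, centers) for p in points]
    return centers, labels


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers, labels = lloyd(points, centers=[[0.0, 0.0], [1.0, 0.0]])
```

The returned `labels` are the vector-quantized form of the data: each high-dimensional point is replaced by the index of its center.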

K-Means

Relies heavily on pairwise distance
Large data sets
  1 million features with 100-200 dimensions
  1000 centers
  Cannot fit output in GPU memory
  Will need to reduce as computation proceeds
Need efficient reduction operator
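To see why the output is a problem: 1 million features against 1000 centers gives 10^9 distances, about 4 GB at 4 bytes each (single precision is an assumption). Reducing each block of distances to an argmin as soon as it is computed avoids ever holding the full matrix. A minimal sketch in plain Python, standing in for the GPU kernels:

```python
def assign_in_chunks(points, centers, chunk_size):
    """Nearest-center index per point, holding only one chunk of distances at a time."""
    labels = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        # Distance block of at most chunk_size x len(centers): the only large temporary.
        block = [[sum((p - c) ** 2 for p, c in zip(x, ctr)) for ctr in centers]
                 for x in chunk]
        # Argmin reduction: each row of distances collapses to a single index.
        labels.extend(row.index(min(row)) for row in block)
    return labels


n_features, n_centers, bytes_per_float = 1_000_000, 1000, 4
full_matrix_gb = n_features * n_centers * bytes_per_float / 1e9

points = [[0.0, 0.0], [9.0, 9.0], [1.0, 0.0], [8.0, 9.0], [0.0, 1.0]]
labels = assign_in_chunks(points, centers=[[0.0, 0.0], [9.0, 9.0]], chunk_size=2)
```

The result is identical to computing the full distance matrix first; only the peak memory changes, from one value per (feature, center) pair to one value per feature plus a single block.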

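Clustering Helps

Pairwise Distance Implementation

Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute

\[
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & \cdots & \cdots & d(a_m, b_n)
\end{bmatrix}
\]

[Bar chart comparing the CUDA and C implementations; axis from 0 to 6000]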
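As a reference point for the definition above, the m x n matrix of distances can be computed directly. The sketch uses squared Euclidean distance for d, which is an assumption; the slide leaves d generic.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_distances(A, B, d=squared_euclidean):
    """Row i, column j holds d(A[i], B[j]); the result has len(A) rows and len(B) columns."""
    return [[d(a, b) for b in B] for a in A]


A = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # m = 3
B = [[0.0, 0.0], [0.0, 2.0]]              # n = 2
C = pairwise_distances(A, B)
```

Each entry depends on exactly one vector from A and one from B, so all m x n entries can be computed independently of each other.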

CPU vs GPU

Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram sequence over three slides: matrices A, B and C, with Shared Memory]

Timings on TRECVid 2008 System

[Bar chart, log-scale axis from 1 to 10000 milliseconds, GPU + CPU versus CPU, for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction and Pairwise Distance. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size | Histogramming Method | Rate of Detection: Low | Medium | High
1000 | Raw | 0.681 | 0.804 | 0.844
1000 | Norm | 0.708 | 0.799 | 0.840
1000 | Mt Inf | 0.594 | 0.804 | 0.848
500 | Raw | 0.675 | 0.792 | 0.833
500 | Norm | 0.701 | 0.791 | 0.825
500 | Mt Inf | 0.626 | 0.783 | 0.819
200 | Raw | 0.671 | 0.772 | 0.811
200 | Norm | 0.701 | 0.779 | 0.818
200 | Mt Inf | 0.614 | 0.720 | 0.756

Results (2009)

Event | True Positives | False Alarm | Miss | Min DCR
Pointing | 13 (57) | 225 (2505) | 1050 | 1.006
Cell To Ear | 0 (8) | 58 (4005) | 194 | 1.060
Person Runs | 1 (0) | 38 (314) | 106 | 0.997
Object Put | 1 (21) | 190 (2703) | 620 | 1.020
Values in parentheses are the 2008 results.

Conclusions

Some practical problems are very hard to solve
  Fusion of many different approaches
  Take advantage of all available hardware
    Cloud
    GPUs
Contests/Evaluations experience
  Working with realistic data
  Engineering, programming
  Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012
  Task - Semantic indexing (SIN)
  Task - Known-item search (KIS)
  Task - Interactive surveillance event detection (SED)
  Task - Instance search (INS)
  Task - Multimedia event detection (MED)
  Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection
Segmentation
Person Layout
Action Classification

ImageNet

Large Scale Visual Recognition
10000 Classes

Page 28: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Computation Powers Computation Powers

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Work Stations in IFP

10 Servers 2~4 CPU each 36CPU in total

IFP-32 Cluster 32 dual-core 28G 64bit CPU

CSL Cluster

Trusted-ILLIAC 256 nodes with dual 22 GHz Opterons 2 GB of RAM and 73 GB SCSI Ultra320 disks

Monolith 128 node cluster with dual Pentium III CPUs at 1 Ghz with 15 GB of RAM per node

TeraGrid

Time Cost for Video Tasks Time Cost for Video Tasks

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Data Decompression 15 minutes

Video Format Conversion 2 hours

Video Segmentation (for VT2) 40 minutes

Sound Track Extraction 30 minutes

Feature Extraction

Global Feature 2 2 hours (c)

Global Feature 1 2 hours (c)

Patch-based Feature1 2 hours (c)

Patch-based Feature2 5 hours (matlab)

Semantic Feature 1 24 hours (matlab)

Semantic Feature 2 3 hours (c)

Semantic Feature 3 4 hours (c)

Motion Feature 1 24 hours (matlab)

Motion Feature 2 3 hours on t-Illiac

Classifier Training

Classifier 1 1 hour (on IFP cluster25 CPU matlab)

Classifier 2 20 minutes

Classifier 3 less than 10 minutes

Possible Accelerations for Video Possible Accelerations for Video

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Matlab codes to C

Parallel computing

GPU Acceleration

Patch based features

Load time is the major issue

Extracting all the features after one load

Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE Grand Unified Feature

Extractor

GUFE Grand Unified Feature

Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Observations Observations 1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

Algorithms Algorithms

Input a query image and its category number

0 Preprocessing compute the matching between the evaluation and the

development data

Query Expansion

1 Expand the query image by retrieving all the images from the development

data set with the same category

2 Search the evaluation set with the expanded query

Output return the top 5020 results

Algorithms Algorithms

Motivation using a GMM to model the distribution of

patches

1 Train a UBM (Universal Background Model) based on

patches from all training images

2 MAP Estimation of the distribution of the patches

belonging to one image given UBM

3 Compute pair-wise image distance based on patch

kernel and within-class covariance normalization

3 Retrieving images based the normalized distance

VT1 Performance (2 in 8) VT1 Performance (2 in 8)

Category MAP

bull101 Crowd (gt10 people) 08419

bull102 Building with sky as backdrop clearly visible 0977

bull103 Mobile devices including handphonePDA 0028

bull107 Person using Computer both visible 02281

bull109 Company Trademark including billboard logo 096

bull112 Closeup of hand eg using mouse writing etc 04584

bull113 Business meeting (gt 2 people) mostly seated down table visible 00644

bull115 Food on dishes plates 02285

bull116 Face closeup occupying about 34 of screen frontal or side 09783

bull117 Traffic Scene many cars trucks road visible 02901

VT2 Performance(1in8) VT2 Performance(1in8)

Category MAP

bull202 Talking face with introductory caption 08432

bull206 Static or minute camera movement people(s)

walking legs visible 00581

bull207 Large camera movement panning leftright

topdown of a scene 07789

bull208 Movie ending credit 02782

bull209 Woman monologue Zhen 09756

Performance of Round3 (1in7) Performance of Round3 (1in7)

Task 2 (VT1)

Target Estimated MAP (R=20)

101 Crowd (gt10 people) 064

102 Building with sky as backdrop clearly visible 1

107 Person using Computer both visible 07

112 Closeup of hand eg using mouse writing etc 0527

116 Face closeup occupying about 34 of screen frontal or side 1

Task3 (AT1 + VT2)

Retrieval Target VT2 only AT1 + VT2

Video R=20

202 face with introductory caption 1 003

209 women monolog 035 01

201 People entering door NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC Text REtrieval Conference TRECVid Video Retrieval Workshop

Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million

Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid Working with ViVid

Why parallel Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days

Image Video Processing

Video Decoder

2D3D Convolution

2D3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients Optical Flow

Analysis Vector Quantization

SVM Classifier Evaluation

ViVid ndash Video Computer Vision on Graphics Processors

Download

httpgithubcommertdikmenViVid

TRECVid 2008 System

TRECVid 2009 System TRECVid 2009 System

Features Video

Motion

Shape

Classifier

Event Label

bull Running

bull Pointing

bull Object Put

bull Cell To Ear

Vector

Quantization

Histogram

Interest Points

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

K-Means Clustering K-Means Clustering

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

K-Means K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Clustering Helps Clustering Helps

Pairwise Distance Implementation Pairwise Distance Implementation

0 2000 4000 6000

CUDA

C

)bd(a)bd(a

)bd(a

)bd(a)bd(a)bd(a

nm1m

12

n11111

n1

m1

bbB

aaA

Compute Given

CPU vs GPU CPU vs GPU

Algorithmic properties that map well to GPUs

1 Independent and highly data local

computations

2Compute bound

3Little branch divergence

Pairwise Distance Computation on the GPU Pairwise Distance Computation on the GPU

Shared

Memory

A B

C

Pairwise Distance Computation Pairwise Distance Computation

A B

C

Pairwise Distance Computation Pairwise Distance Computation

A B

C

Timings on TRECVid 2008 System Timings on TRECVid 2008 System

53 79

23

240 150

53 79

3030 4947

1

10

100

1000

10000

Fetch

Frame

Optical

Flow

Transfer to

GPU

Feature

Extraction

Pairwise

Distance

millise

co

nd

s

GPU + CPU CPU

Benchmark Benchmark

Dictionary Building Strategies Dictionary Building Strategies

Dictionary Size

Histogramming Method Rate of Detection

Low Medium High

1000 Raw 0681 0804 0844

Norm 0708 0799 0840

Mt Inf 0594 0804 0848

500 Raw 0675 0792 0833

Norm 0701 0791 0825

Mt Inf 0626 0783 0819

200 Raw 0671 0772 0811

Norm 0701 0779 0818

Mt Inf 0614 0720 0756

Results (2009) Results (2009)

True Positives False

Alarm

Miss Min DCR

Pointing 13 225 1050 1006

Cell To Ear 0 58 194 1060

Person Runs 1 38 106 0997

Object Put 1 190 620 1020

Results (2009) Results (2009)

True Positives False

Alarm

Miss Min DCR

Pointing 13 (57) 225 (2505) 1050 1006

Cell To Ear 0 (8) 58 (4005) 194 1060

Person Runs 1 (0) 38 (314) 106 0997

Object Put 1 (21) 190 (2703) 620 1020

(2008 Results)

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering, Programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10000 Classes


Time Cost for Video Tasks

Data Decompression: 15 minutes

Video Format Conversion: 2 hours

Video Segmentation (for VT2): 40 minutes

Sound Track Extraction: 30 minutes

Feature Extraction

Global Feature 2: 2 hours (C)

Global Feature 1: 2 hours (C)

Patch-based Feature 1: 2 hours (C)

Patch-based Feature 2: 5 hours (MATLAB)

Semantic Feature 1: 24 hours (MATLAB)

Semantic Feature 2: 3 hours (C)

Semantic Feature 3: 4 hours (C)

Motion Feature 1: 24 hours (MATLAB)

Motion Feature 2: 3 hours on t-Illiac

Classifier Training

Classifier 1: 1 hour (on IFP cluster, 25 CPU, MATLAB)

Classifier 2: 20 minutes

Classifier 3: less than 10 minutes

Possible Accelerations for Video

MATLAB code to C

Parallel computing

GPU Acceleration

Patch-based features

Load time is the major issue

Extracting all the features after one load

Features for Round2-VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color, Texture, etc.

Semantic Feature

Features for Round2-VT2

Character Detector

Harris corner

Morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE: Grand Unified Feature Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature Normalization/Combination

Result Visualization

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly, e.g. VT1 101 103(26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
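The query expansion steps can be sketched roughly as follows. This is a minimal illustration, not the team's system: features are plain lists of floats, the matching score is assumed to be squared Euclidean distance to the nearest member of the expanded query, and the names `expand_query` and `search` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (assumed matching score)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def expand_query(query, category, dev_features, dev_labels):
    """Query plus every development feature labelled with the same category."""
    same = [f for f, c in zip(dev_features, dev_labels) if c == category]
    return [query] + same


def search(expanded, eval_features, top_k):
    """Rank evaluation items by distance to their nearest expanded-query member."""
    scores = [min(sq_dist(e, q) for q in expanded) for e in eval_features]
    ranked = sorted(range(len(eval_features)), key=lambda i: scores[i])
    return ranked[:top_k]


# Toy development set with category labels, and an unlabeled evaluation set.
dev_features = [[0.0, 0.0], [5.0, 5.0], [0.0, 1.0]]
dev_labels = [101, 102, 101]
eval_features = [[4.9, 5.1], [0.1, 0.9], [0.3, 0.1], [9.0, 9.0]]

expanded = expand_query([0.1, 0.0], 101, dev_features, dev_labels)
top = search(expanded, eval_features, top_k=2)  # indices into eval_features
```

In this toy run the evaluation item near (0, 1) is retrieved only because a development image of the same category sits next to it, which is the point of expanding the query.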

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
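The MAP estimation step can be illustrated with a small sketch. The slides do not say which parameters are adapted or how; the code below assumes the common mean-only adaptation with a relevance factor, works on one-dimensional patch features for brevity, and uses the hypothetical names `gaussian` and `map_adapt_means`.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Shift UBM means toward one image's patches (assumed mean-only MAP adaptation)."""
    K = len(means)
    n = [0.0] * K  # soft count of patches per component
    s = [0.0] * K  # responsibility-weighted sum of patches per component
    for x in patches:
        lik = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(K)]
        total = sum(lik)
        for k in range(K):
            r = lik[k] / total  # responsibility of component k for this patch
            n[k] += r
            s[k] += r * x
    adapted = []
    for k in range(K):
        alpha = n[k] / (n[k] + relevance)  # more data gives more trust in the image
        data_mean = s[k] / n[k] if n[k] > 0 else means[k]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[k])
    return adapted


# Two-component UBM; all patches of this image lie near the second component.
adapted = map_adapt_means([1.5] * 32, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Components that explain few of the image's patches stay close to the UBM, so each image is described by a set of means that can be compared across images.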

VT1 Performance (2 in 8)

Category                                                               MAP
101 Crowd (>10 people)                                                 0.8419
102 Building with sky as backdrop, clearly visible                     0.977
103 Mobile devices including handphone/PDA                             0.028
107 Person using Computer, both visible                                0.2281
109 Company Trademark, including billboard, logo                       0.96
112 Closeup of hand, e.g. using mouse, writing, etc.                   0.4584
113 Business meeting (> 2 people), mostly seated down, table visible   0.0644
115 Food on dishes, plates                                             0.2285
116 Face closeup, occupying about 3/4 of screen, frontal or side       0.9783
117 Traffic Scene, many cars, trucks, road visible                     0.2901

VT2 Performance (1 in 8)

Category                                                                MAP
202 Talking face with introductory caption                              0.8432
206 Static or minute camera movement, people(s) walking, legs visible   0.0581
207 Large camera movement, panning left/right, top/down of a scene      0.7789
208 Movie ending credit                                                 0.2782
209 Woman monologue Zhen                                                0.9756

Performance of Round3 (1 in 7)

Task 2 (VT1)

Target                                                             Estimated MAP (R=20)
101 Crowd (>10 people)                                             0.64
102 Building with sky as backdrop, clearly visible                 1
107 Person using Computer, both visible                            0.7
112 Closeup of hand, e.g. using mouse, writing, etc.               0.527
116 Face closeup, occupying about 3/4 of screen, frontal or side   1

Task 3 (AT1 + VT2)

Video, R=20

Retrieval Target                     VT2 only   AT1 + VT2
202 Face with introductory caption   1          0.03
209 Women monolog                    0.35       0.1
201 People entering door             NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Shot Boundary Detection

Copy Detection

Video Search

Summarization

High Level Feature Extraction

Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object Put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Detection of Opposing Flow Event

[Diagram labels: Regional Averaging, Door Open/Close Information, Event Detections, Thresholding Rule]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
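The "~5 days" figure can be checked with a short calculation. The upload rate and photo count are the ones quoted on the slide; the frame rate of 30 frames per second is an assumption, since the source does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FPS = 30                         # assumed frame rate, not stated in the source

# Seconds of video uploaded per wall-clock minute, converted to frames.
frames_per_minute = HOURS_UPLOADED_PER_MINUTE * 3600 * FPS
frames_per_day = frames_per_minute * 60 * 24
days_to_match = FACEBOOK_PHOTOS / frames_per_day
```

Under that assumption roughly 3.1 billion frames arrive per day, so matching 15 billion photos takes a little under five days.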

Image/Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid - Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Diagram labels: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector (corners)

Dollar: Space Time Gabor (corners, periodic motion)

RSMB: Random Sampling of the Motion Boundary (motion)

[Figure panels: Laptev, Dollar, RSMB]

More is Good

[Figure: Interest Point Detection Rates]

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max(0, H_\tau(x, y, t-1) - 1) & \text{otherwise}
\end{cases}
$$
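The motion history update (set a pixel to τ where motion is detected, otherwise decay the previous value by one, floored at zero) can be sketched in a few lines. Images are nested lists here, D is a binary motion mask, and the function name `update_mhi` is hypothetical.

```python
def update_mhi(H_prev, D, tau):
    """One motion-history step: tau where D is 1, else the previous value minus 1, floored at 0."""
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(H_prev, D)
    ]


tau = 3
H = [[0, 0, 0]]  # a 1 x 3 history image, initially empty
# A blob moving left to right over three frames, then no motion.
for D in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]], [[0, 0, 0]]):
    H = update_mhi(H, D, tau)
```

After the loop the most recently visited pixel holds the largest value, so the intensity ramp in H encodes the direction of motion.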

[Figure: bar chart of milliseconds per frame, axis from 0 to 300, for CUDA (feature + distance + argmin), CUDA (distance + argmin), CUDA (distance) and C. Bar labels in extraction order: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient / optical flow based on the direction and magnitude

• Normalize over neighboring regions
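The three steps (partition into regions, histogram by direction and magnitude, normalize over neighbors) can be sketched for image gradients as below. This is a simplified illustration under stated assumptions: central-difference gradients, 4 x 4 pixel cells, 9 unsigned orientation bins, hard bin assignment and L2 normalization over 2 x 2 cell blocks. None of these parameters come from the slides, and all function names are hypothetical.

```python
import math


def gradients(img):
    """Central-difference gradients (gx, gy); border pixels are left at zero."""
    h, w = len(img), len(img[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx[y][x] = img[y][x + 1] - img[y][x - 1]
            gy[y][x] = img[y + 1][x] - img[y - 1][x]
    return gx, gy


def cell_histograms(img, cell=4, bins=9):
    """Magnitude-weighted histogram of unsigned gradient direction for each cell."""
    gx, gy = gradients(img)
    rows, cols = len(img) // cell, len(img[0]) // cell
    hist = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            mag = math.hypot(gx[y][x], gy[y][x])
            angle = math.atan2(gy[y][x], gx[y][x]) % math.pi  # fold into [0, pi)
            b = min(int(angle / math.pi * bins), bins - 1)
            hist[y // cell][x // cell][b] += mag
    return hist


def normalize_blocks(hist, eps=1e-6):
    """Concatenate each 2x2 block of neighboring cells and L2-normalize it."""
    out = []
    for r in range(len(hist) - 1):
        for c in range(len(hist[0]) - 1):
            block = hist[r][c] + hist[r][c + 1] + hist[r + 1][c] + hist[r + 1][c + 1]
            norm = math.sqrt(sum(v * v for v in block) + eps)
            out.append([v / norm for v in block])
    return out


# 8 x 8 image with a vertical edge: left half dark, right half bright.
img = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
blocks = normalize_blocks(cell_histograms(img))
```

For optical flow the same histogramming would apply to the flow vectors instead of the image gradients. Each pixel is processed independently of the others, which is the locality the GPU discussion relies on.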

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update the center to be the mean of the associated data points
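The two alternating steps of Lloyd's algorithm can be sketched as follows. This is a minimal version under stated assumptions: initial centers are supplied by the caller, the distance is squared Euclidean, the loop runs for a fixed number of iterations rather than testing convergence, and the name `lloyd` is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, iterations=10):
    """Alternate the assignment and mean-update steps for a fixed number of iterations."""
    centers = [list(c) for c in centers]
    assign = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        assign = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centers, assign = lloyd(points, centers=[[0.0, 0.0], [10.0, 10.0]])
```

The assignment step is a pairwise distance computation followed by an argmin over centers, and the resulting indices are the vector-quantized codes of the features.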

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
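With 1 million features and 1000 centers, a full distance matrix in single precision would hold 10^9 values, about 4 GB at 4 bytes each. One way to avoid storing it, sketched below on the CPU as an illustration only, is to compute distances one chunk of features at a time and reduce each row to its argmin immediately. The chunk size and the name `nearest_centers` are hypothetical, and this is not the CUDA implementation from the slides.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centers(features, centers, chunk_size=2):
    """Closest-center index per feature, holding one chunk of distances at a time."""
    labels = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance block for this chunk only: len(chunk) x len(centers).
        block = [[sq_dist(f, c) for c in centers] for f in chunk]
        # Reduce each row to its argmin; the block can then be discarded.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return labels


features = [[0.0], [0.9], [2.1], [3.0], [5.2]]
centers = [[0.0], [2.0], [5.0]]
labels = nearest_centers(features, centers)
```

Only one label per feature survives each chunk, so peak memory scales with the chunk size rather than with the number of features.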

Clustering Helps



Features for Round2- VT1 Features for Round2- VT1

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Image Features

SIFT

HOG

GIST

APC

LBP

Color Texture and etc

Semantic Feature

Features for Round2-VT2 Features for Round2-VT2

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

Character Detector

Harris corner

morphological operations

Optical Flow

Lucas-Kanade on spatial intensity gradient

Gender recognition

SODA-boost based

Motion History Image

Spatial interest points

GUFE Grand Unified Feature

Extractor

GUFE Grand Unified Feature

Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Observations Observations 1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

Algorithms Algorithms

Input a query image and its category number

0 Preprocessing compute the matching between the evaluation and the

development data

Query Expansion

1 Expand the query image by retrieving all the images from the development

data set with the same category

2 Search the evaluation set with the expanded query

Output return the top 5020 results

Algorithms Algorithms

Motivation using a GMM to model the distribution of

patches

1 Train a UBM (Universal Background Model) based on

patches from all training images

2 MAP Estimation of the distribution of the patches

belonging to one image given UBM

3 Compute pair-wise image distance based on patch

kernel and within-class covariance normalization

3 Retrieving images based the normalized distance

VT1 Performance (2 in 8)

Category: MAP

• 101 Crowd (>10 people): 0.8419
• 102 Building with sky as backdrop, clearly visible: 0.977
• 103 Mobile devices including handphone/PDA: 0.028
• 107 Person using Computer, both visible: 0.2281
• 109 Company Trademark, including billboard, logo: 0.96
• 112 Closeup of hand, e.g. using mouse, writing, etc.: 0.4584
• 113 Business meeting (>2 people), mostly seated down, table visible: 0.0644
• 115 Food on dishes, plates: 0.2285
• 116 Face closeup, occupying about 3/4 of screen, frontal or side: 0.9783
• 117 Traffic Scene, many cars, trucks, road visible: 0.2901

VT2 Performance (1 in 8)

Category: MAP

• 202 Talking face with introductory caption: 0.8432
• 206 Static or minute camera movement, people(s) walking, legs visible: 0.0581
• 207 Large camera movement, panning left/right, top/down of a scene: 0.7789
• 208 Movie ending credit: 0.2782
• 209 Woman monologue Zhen: 0.9756

Performance of Round3 (1 in 7)

Task 2 (VT1)

Target: Estimated MAP (R=20)

101 Crowd (>10 people): 0.64
102 Building with sky as backdrop, clearly visible: 1
107 Person using Computer, both visible: 0.7
112 Closeup of hand, e.g. using mouse, writing, etc.: 0.527
116 Face closeup, occupying about 3/4 of screen, frontal or side: 1

Task 3 (AT1 + VT2)

Video, R=20

Retrieval Target                       VT2 only   AT1 + VT2
202 face with introductory caption     1          0.03
209 women monolog                      0.35       0.1
201 People entering door               NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

• Shot Boundary Detection
• Copy Detection
• Video Search
• Summarization
• High Level Feature Extraction
• Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

• Surveillance footage from London Gatwick Airport
• 5 stationary cameras
• Training set: 100 hours
• Testing set: 44 hours
• Frame size: 700x540 px
• Dataset size: ~350 GB
• Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator

Regional Averaging

Door Open/Close Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA, C/C++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)
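
One way to picture the "lazy pull" data flow with per-frame referencing and caches is the toy pipeline below. It is a sketch only: the class names `FrameSource` and `CachedNode` are hypothetical and do not reflect ViVid's actual API. A node computes its output for a frame index only when a consumer asks for it, and remembers the result so that a second consumer does not trigger the work again.

```python
class FrameSource:
    """Hypothetical video source: hands out frame i on request, counting decodes."""

    def __init__(self, frames):
        self.frames = frames
        self.decodes = 0

    def get(self, index):
        self.decodes += 1
        return self.frames[index]


class CachedNode:
    """Processing stage that pulls its inputs lazily and caches per frame index."""

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents
        self.cache = {}

    def get(self, index):
        if index not in self.cache:
            # Nothing upstream runs until some consumer asks for this frame.
            inputs = [parent.get(index) for parent in self.parents]
            self.cache[index] = self.fn(*inputs)
        return self.cache[index]


# Two consumers share one intermediate stage.
source = FrameSource([[1, 2], [3, 4]])
doubled = CachedNode(lambda frame: [2 * v for v in frame], source)
total = CachedNode(sum, doubled)
peak = CachedNode(max, doubled)

frame_total = total.get(1)  # pulls doubled.get(1), which decodes frame 1
frame_peak = peak.get(1)    # reuses the cached doubled frame, no second decode
```

In this arrangement frame 0 is never decoded at all, because no consumer asked for it.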

Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
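
The "~5 days" figure can be checked with a short calculation. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
HOURS_UPLOADED_PER_MINUTE = 20      # from the slide
FACEBOOK_PHOTOS = 15e9              # from the slide
FRAMES_PER_SECOND = 30              # assumption, not stated on the slide

# Hours of video uploaded per wall-clock day.
video_hours_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24

# Frames contained in one day's uploads.
frames_per_day = video_hours_per_day * 3600 * FRAMES_PER_SECOND

# Days of uploads needed to match the Facebook photo count.
days_to_match = FACEBOOK_PHOTOS / frames_per_day
```

Under these assumptions one day of uploads holds about 3.1 billion frames, so the photo count is matched in a little under five days. At 25 frames per second the same calculation gives just under six days.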

Image/Video Processing

• Video Decoder
• 2D/3D Convolution
• 2D/3D Fourier Transform
• Optical Flow

Feature Extraction

• Motion Descriptor (Efros et al.)
• Motion History Descriptor
• Random Video Interest Points
• Histograms of Oriented Gradients / Optical Flow

Analysis

• Vector Quantization
• SVM Classifier Evaluation

ViVid - Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Pipeline diagram. Blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label (Running, Pointing, Object Put, Cell To Ear)]

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector

• Corners

Dollar: Space Time Gabor

• Corners
• Periodic Motion

RSMB: Random Sampling of the Motion Boundary

• Motion

[Figure panels: Laptev, Dollar, RSMB]

More is Good

Interest Point Detection Rates

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al., 2003)

Motion History Images (Bobick & Davis, 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
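
The motion history update on this slide, as it is usually stated, sets a pixel to τ wherever the motion mask D is 1 and otherwise lets the previous value decay by one, never going below zero. A minimal per-frame sketch of that rule follows, assuming frames are stored as nested lists and the binary motion mask has already been computed (for example by frame differencing, which is not shown); the function name `update_mhi` is hypothetical.

```python
def update_mhi(previous_mhi, motion_mask, tau):
    """One time step of the motion history image.

    previous_mhi: H at time t-1, as a list of rows
    motion_mask:  D at time t, same shape, entries 0 or 1
    tau:          value assigned to pixels that moved in the current frame
    """
    return [
        [tau if moved == 1 else max(0, history - 1)
         for history, moved in zip(history_row, mask_row)]
        for history_row, mask_row in zip(previous_mhi, motion_mask)
    ]


# Pixels that moved jump to tau; the others fade by one step per frame.
mhi = update_mhi([[0, 3], [1, 5]], [[1, 0], [0, 1]], tau=5)
```

Each pixel is updated independently of its neighbours, which is the kind of local, independent operation that maps well to one GPU thread per pixel.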

[Bar chart: milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient / optical flow based on the direction and magnitude

• Normalize over neighboring regions
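
The three steps above can be sketched for the image-gradient case as follows. The cell size, the nine unsigned orientation bins, the 2x2-cell blocks, and the L2 normalization are assumptions chosen for the example; the slide does not give the parameters used in the actual system. For the optical-flow variant, the per-pixel flow vector would take the place of the gradient `(gx, gy)`.

```python
import math


def cell_histograms(image, cell=4, bins=9):
    """Steps 1 and 2: per-cell histograms of gradient direction, weighted by magnitude."""
    height, width = len(image), len(image[0])
    rows, cols = height // cell, width // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned direction
            b = min(int(angle / (180.0 / bins)), bins - 1)
            r, c = y // cell, x // cell
            if r < rows and c < cols:
                hists[r][c][b] += magnitude
    return hists


def block_normalize(hists, eps=1e-6):
    """Step 3: L2-normalize each 2x2 group of neighboring cells."""
    rows, cols = len(hists), len(hists[0])
    blocks = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            v = hists[r][c] + hists[r][c + 1] + hists[r + 1][c] + hists[r + 1][c + 1]
            norm = math.sqrt(sum(value * value for value in v) + eps)
            blocks.append([value / norm for value in v])
    return blocks


# 8x8 image with a vertical edge between columns 3 and 4.
edge_image = [[0 if x < 4 else 1 for x in range(8)] for _ in range(8)]
descriptor = block_normalize(cell_histograms(edge_image))
```

For this edge image all gradient energy falls into the first orientation bin of each cell, since the intensity changes only along x.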

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update each center to be the mean of its associated data points
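
The two alternating steps of Lloyd's algorithm can be written down directly. The sketch below is a plain-Python illustration with squared Euclidean distance, a fixed iteration count, and caller-supplied initial centers; those choices are assumptions, since the slide does not specify initialization or a stopping rule.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """For each data point, find the index of the closest center."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(list(old_center))  # leave an empty cluster in place
            continue
        dims = len(old_center)
        new_centers.append(
            [sum(p[d] for p in members) / len(members) for d in range(dims)]
        )
    return new_centers


def lloyd(points, centers, iterations=10):
    """Alternate assignment and update for a fixed number of iterations."""
    for _ in range(iterations):
        labels = assign(points, centers)
        centers = update(points, labels, centers)
    return centers, assign(points, centers)


data = [[0, 0], [0, 2], [10, 10], [10, 12]]
final_centers, final_labels = lloyd(data, [[0, 0], [10, 10]])
```

Once the centers are learned, the assignment step alone is the vector quantizer: it maps each high-dimensional feature to the index of its nearest center.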

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
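
To see why the output does not fit, note that 1 million features against 1000 centers gives 10^9 distances, which is 4 GB if each is stored in 4 bytes (single precision is an assumption here). The sketch below illustrates the "reduce as computation proceeds" idea on the CPU: distances are computed one chunk of features at a time and each chunk is immediately reduced to the index of the nearest center, so only a small block of the distance matrix exists at any moment. The chunk size and function name are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center_chunked(features, centers, chunk_size):
    """Assign each feature to its nearest center without storing all distances."""
    labels = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        # Distance block of shape (len(chunk), len(centers)), discarded after use.
        block = [[squared_distance(f, c) for c in centers] for f in chunk]
        # Reduction: keep only the argmin of every row.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return labels


# Size of the full distance matrix from the slide, at an assumed 4 bytes per entry.
FULL_MATRIX_BYTES = 1_000_000 * 1000 * 4

toy_labels = nearest_center_chunked([[0], [1], [9], [10], [5.1]], [[0], [10]], chunk_size=2)
```

The reduced output is one integer per feature, which is a factor of 1000 smaller than the full matrix for a 1000-center dictionary.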

Clustering Helps

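Pairwise Distance Implementation

[Bar chart comparing the CUDA and C implementations; axis from 0 to 6000]

Given

$$
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
$$

Compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
$$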

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram, shown in three steps: input matrices A and B, output matrix C, and a shared memory stage]
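
The blocking idea suggested by the diagram can be imitated on the CPU. In the sketch below, a tile of A and a tile of B are copied into small local lists, standing in for shared memory, and one tile of the output C is filled from them before moving on. This is an illustration in Python rather than the CUDA kernel itself; the tile size and the choice of squared Euclidean distance for d are assumptions.

```python
def pairwise_sq_dist_tiled(A, B, tile=2):
    """Fill C[i][j] = d(A[i], B[j]) one (tile x tile) output block at a time."""
    m, n = len(A), len(B)
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        a_tile = A[i0:i0 + tile]  # stands in for rows of A staged in shared memory
        for j0 in range(0, n, tile):
            b_tile = B[j0:j0 + tile]  # likewise for rows of B
            # Every entry of this output block reuses the same staged rows.
            for di, a in enumerate(a_tile):
                for dj, b in enumerate(b_tile):
                    C[i0 + di][j0 + dj] = sum((x - y) ** 2 for x, y in zip(a, b))
    return C


distances = pairwise_sq_dist_tiled([[0, 0], [1, 0], [0, 2]], [[0, 0], [3, 4]])
```

The point of the tiling is reuse: each staged row of A is read once and then used against every row in the B tile, rather than being fetched again for every output entry.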

Timings on TRECVid 2008 System

[Bar chart, logarithmic axis from 1 to 10000 milliseconds, comparing GPU + CPU against CPU for each stage: Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, Pairwise Distance. Bar labels as extracted: 53, 79; 23; 240, 150; 53, 79; 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
                  Norm                   0.708   0.799    0.840
                  Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
                  Norm                   0.701   0.791    0.825
                  Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
                  Norm                   0.701   0.779    0.818
                  Mt Inf                 0.614   0.720    0.756
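
For orientation, the sketch below shows how a dictionary of centers turns a set of features into a histogram, in a raw-count form and a normalized form. Reading the "Raw" and "Norm" rows of the table as these two variants is an assumption, the "Mt Inf" variant is not covered, and the function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(feature, dictionary):
    """Index of the dictionary entry (center) closest to the feature."""
    return min(range(len(dictionary)),
               key=lambda k: squared_distance(feature, dictionary[k]))


def word_histogram(features, dictionary, normalize=False):
    """Count how often each dictionary entry is the nearest one.

    With normalize=True the counts are divided by their sum, so clips with
    different numbers of interest points become comparable.
    """
    hist = [0.0] * len(dictionary)
    for feature in features:
        hist[quantize(feature, dictionary)] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [count / total for count in hist]
    return hist


toy_dictionary = [[0], [10]]
raw = word_histogram([[1], [2], [9]], toy_dictionary)
norm = word_histogram([[1], [2], [9]], toy_dictionary, normalize=True)
```

The histogram length equals the dictionary size, so the 1000, 500, and 200 rows of the table correspond to classifier inputs of those dimensions.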

Results (2009)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

Values in parentheses are the 2008 results.

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering / Programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10,000 Classes

Page 33: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

GUFE Grand Unified Feature

Extractor

GUFE Grand Unified Feature

Extractor

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Designed by Dennis

Collects features generated by team members into one standard format

Retrieval by Query Expansion based on NN

Feature NormalizationCombination

Result Visualization

Observations Observations 1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

1 Samples under the same category are more semantic

similar to each other

2 The shot boundaries are not well defined

3 some of the key frames are not labeled correctly

eg VT1 101 103(26-141)

Algorithms Algorithms

Input a query image and its category number

0 Preprocessing compute the matching between the evaluation and the

development data

Query Expansion

1 Expand the query image by retrieving all the images from the development

data set with the same category

2 Search the evaluation set with the expanded query

Output return the top 5020 results

Algorithms Algorithms

Motivation using a GMM to model the distribution of

patches

1 Train a UBM (Universal Background Model) based on

patches from all training images

2 MAP Estimation of the distribution of the patches

belonging to one image given UBM

3 Compute pair-wise image distance based on patch

kernel and within-class covariance normalization

3 Retrieving images based the normalized distance

VT1 Performance (2 in 8) VT1 Performance (2 in 8)

Category MAP

bull101 Crowd (gt10 people) 08419

bull102 Building with sky as backdrop clearly visible 0977

bull103 Mobile devices including handphonePDA 0028

bull107 Person using Computer both visible 02281

bull109 Company Trademark including billboard logo 096

bull112 Closeup of hand eg using mouse writing etc 04584

bull113 Business meeting (gt 2 people) mostly seated down table visible 00644

bull115 Food on dishes plates 02285

bull116 Face closeup occupying about 34 of screen frontal or side 09783

bull117 Traffic Scene many cars trucks road visible 02901

VT2 Performance(1in8) VT2 Performance(1in8)

Category MAP

bull202 Talking face with introductory caption 08432

bull206 Static or minute camera movement people(s)

walking legs visible 00581

bull207 Large camera movement panning leftright

topdown of a scene 07789

bull208 Movie ending credit 02782

bull209 Woman monologue Zhen 09756

Performance of Round3 (1in7) Performance of Round3 (1in7)

Task 2 (VT1)

Target Estimated MAP (R=20)

101 Crowd (gt10 people) 064

102 Building with sky as backdrop clearly visible 1

107 Person using Computer both visible 07

112 Closeup of hand eg using mouse writing etc 0527

116 Face closeup occupying about 34 of screen frontal or side 1

Task3 (AT1 + VT2)

Retrieval Target VT2 only AT1 + VT2

Video R=20

202 face with introductory caption 1 003

209 women monolog 035 01

201 People entering door NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC Text REtrieval Conference TRECVid Video Retrieval Workshop

Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million

Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid Working with ViVid

Why parallel Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days

Image Video Processing

Video Decoder

2D3D Convolution

2D3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients Optical Flow

Analysis Vector Quantization

SVM Classifier Evaluation

ViVid ndash Video Computer Vision on Graphics Processors

Download

httpgithubcommertdikmenViVid

TRECVid 2008 System

TRECVid 2009 System TRECVid 2009 System

Features Video

Motion

Shape

Classifier

Event Label

bull Running

bull Pointing

bull Object Put

bull Cell To Ear

Vector

Quantization

Histogram

Interest Points

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
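The three steps above can be sketched for a single local region (cell) as follows. This is a minimal illustration under stated assumptions: central-difference gradients, 9 unsigned orientation bins, and L2 block normalization are common choices but are not specified on the slide, and the function names are hypothetical. For the optical flow variant, the flow components would take the place of `gx` and `gy`.

```python
import math


def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    cell : 2-D list of grayscale intensities for one local region.
    """
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]  # central difference in x
            gy = cell[r + 1][c] - cell[r - 1][c]  # central difference in y
            magnitude = math.hypot(gx, gy)
            # Fold the direction into [0, 180) so opposite gradients share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


def normalize_block(histograms, eps=1e-6):
    """Concatenate neighboring cell histograms and L2-normalize the result."""
    block = [v for h in histograms for v in h]
    norm = math.sqrt(sum(v * v for v in block) + eps * eps)
    return [v / norm for v in block]


# Toy usage: a horizontal intensity ramp puts all of its energy in bin 0.
ramp = [[float(c) for c in range(4)] for _ in range(4)]
print(cell_histogram(ramp)[0])  # 8.0
```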

K-Means Clustering

Vector quantization
• Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm
• For each data point, find the closest center
• Update each center to be the mean of its associated data points
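The two alternating steps of Lloyd's algorithm can be written in a few lines. The sketch below is a plain-Python illustration, not the GPU implementation discussed in the lecture; the random initialization from the data, the fixed iteration count, and the function names are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, n_iters=20, seed=0):
    """Return (centers, assignment) after n_iters rounds of Lloyd's algorithm."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # initialize from the data
    assignment = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest center.
        assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment


# Toy usage: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, assignment = lloyd_kmeans(points, k=2)
print(sorted(centers))  # [[0.0, 0.5], [10.0, 10.5]]
```

The index stored in `assignment` is the quantized code of each feature, which is what a histogram over the dictionary would count.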

K-Means

Relies heavily on pairwise distance

Large data sets
• 1 million features with 100-200 dimensions
• 1000 centers

Cannot fit output in GPU memory
• Will need to reduce as computation proceeds
• Need efficient reduction operator
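To see why the output is a problem, and what reducing as the computation proceeds can look like, the sketch below computes distances one chunk of features at a time and immediately reduces each chunk to the index of the closest center. It is an illustration in plain Python under stated assumptions: the chunk size, the function name, and the use of 4-byte floats for the size estimate are not taken from the slides.

```python
def nearest_centers_chunked(features, centers, chunk_size=2):
    """Return the index of the closest center for every feature.

    Only a chunk_size x len(centers) block of distances exists at any time;
    each block is reduced to its row-wise argmin before the next is computed.
    """
    labels = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        block = [
            [sum((x - c) ** 2 for x, c in zip(f, center)) for center in centers]
            for f in chunk
        ]
        # Reduction: keep one integer per feature instead of a full row.
        labels.extend(min(range(len(centers)), key=row.__getitem__) for row in block)
    return labels


# Size of the full distance matrix for the slide's numbers, assuming float32.
full_matrix_bytes = 1_000_000 * 1000 * 4
print(full_matrix_bytes / 1e9, "GB")  # 4.0 GB

print(nearest_centers_chunked([(0, 0), (9, 9), (1, 1)], [(0, 0), (10, 10)]))  # [0, 1, 0]
```

With the reduction fused in, the stored output shrinks from one value per (feature, center) pair to one index per feature.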

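Clustering Helps

Pairwise Distance Implementation

[Bar chart: C versus CUDA, axis 0 to 6000]

Given $A = \{a_1, \dots, a_m\}$ and $B = \{b_1, \dots, b_n\}$, compute

$$\begin{bmatrix} d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\ d(a_2, b_1) & & \ddots & \vdots \\ d(a_m, b_1) & \cdots & & d(a_m, b_n) \end{bmatrix}$$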
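As a concrete instance of the m-by-n distance matrix, the sketch below takes d to be the squared Euclidean distance, which is an assumption since the slide leaves d generic. It uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, a common formulation because the cross term is a matrix product; whether the lecture's CUDA code uses this expansion is not stated.

```python
def pairwise_sq_euclidean(A, B):
    """Return the m x n matrix whose (i, j) entry is ||a_i - b_j||^2."""
    a_norms = [sum(x * x for x in a) for a in A]  # ||a_i||^2, computed once per row
    b_norms = [sum(x * x for x in b) for b in B]  # ||b_j||^2, computed once per column
    return [
        [an + bn - 2.0 * sum(x * y for x, y in zip(a, b)) for b, bn in zip(B, b_norms)]
        for a, an in zip(A, a_norms)
    ]


# Toy usage with m = 2 and n = 3.
A = [(0, 0), (3, 4)]
B = [(0, 0), (3, 0), (0, 4)]
print(pairwise_sq_euclidean(A, B))  # [[0.0, 9.0, 16.0], [25.0, 16.0, 9.0]]
```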

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram sequence: matrices A, B, and C, with shared memory]

Timings on TRECVid 2008 System

[Bar chart, log scale from 1 to 10000 milliseconds: GPU + CPU versus CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance. Bar labels as extracted: 53, 79; 23; 240, 150; 53, 79; 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
                  Norm                   0.708   0.799    0.840
                  Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
                  Norm                   0.701   0.791    0.825
                  Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
                  Norm                   0.701   0.779    0.818
                  Mt Inf                 0.614   0.720    0.756

Results (2009)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

Values in parentheses are the 2008 results.

Conclusions

Some practical problems are very hard to solve
• Fusion of many different approaches
• Take advantage of all available hardware
• Cloud
• GPUs

Contests/Evaluations Experience
• Working with realistic data
• Engineering, programming
• Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012
• Task - Semantic indexing (SIN)
• Task - Known-item search (KIS)
• Task - Interactive surveillance event detection (SED)
• Task - Instance search (INS)
• Task - Multimedia event detection (MED)
• Task - Multimedia event recounting (MER)

Pascal Visual Object Classes
• Classification/detection
• Segmentation
• Person Layout
• Action Classification

ImageNet Large Scale Visual Recognition
• 10000 Classes

Observations

1. Samples under the same category are more semantically similar to each other

2. The shot boundaries are not well defined

3. Some of the key frames are not labeled correctly, e.g. VT1 101 103 (26-141)

Algorithms

Input: a query image and its category number

0. Preprocessing: compute the matching between the evaluation and the development data

Query Expansion

1. Expand the query image by retrieving all the images from the development data set with the same category

2. Search the evaluation set with the expanded query

Output: return the top 50/20 results
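The query expansion procedure can be sketched as below. Everything concrete here is an assumption made for illustration: the dictionary layouts, the function name `expanded_search`, and the choice to score an evaluation item by its best similarity to any image in the expanded query. The sketch also uses only the category number; a fuller version would add the query image's own matches.

```python
def expanded_search(query_category, dev_set, eval_matches, top_k=20):
    """Rank evaluation items against all development images of one category.

    dev_set      : dict dev_image_id -> category number
    eval_matches : dict dev_image_id -> {eval_image_id: similarity}, the
                   precomputed matching from step 0 (higher = more similar)
    """
    # Step 1: every development image with the query's category joins the query.
    expanded = [img for img, cat in dev_set.items() if cat == query_category]

    # Step 2: score each evaluation item by its best match to the expanded set.
    scores = {}
    for dev_img in expanded:
        for eval_img, sim in eval_matches.get(dev_img, {}).items():
            scores[eval_img] = max(sim, scores.get(eval_img, float("-inf")))

    # Output: the top_k evaluation items, best first (ties broken by id).
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return ranked[:top_k]


# Toy usage with hypothetical ids.
dev_set = {"d1": 101, "d2": 101, "d3": 102}
eval_matches = {
    "d1": {"e1": 0.9, "e2": 0.2},
    "d2": {"e2": 0.8, "e3": 0.1},
    "d3": {"e4": 1.0},
}
print(expanded_search(101, dev_set, eval_matches, top_k=2))  # ['e1', 'e2']
```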

Algorithms

Motivation: using a GMM to model the distribution of patches

1. Train a UBM (Universal Background Model) based on patches from all training images

2. MAP estimation of the distribution of the patches belonging to one image, given the UBM

3. Compute pair-wise image distance based on patch kernel and within-class covariance normalization

4. Retrieve images based on the normalized distance
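The MAP estimation step can be illustrated with the widely used relevance-factor update, in which each adapted mean is a blend of the UBM mean and the mean of the image's patches softly assigned to that component. The sketch is a simplification under stated assumptions: features are one-dimensional, only the means are adapted, and the relevance factor of 16 is a conventional default; the slide does not say which parameters were adapted or how.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Adapt the UBM means to one image's patches (1-D features for brevity)."""
    n = [0.0] * len(means)      # soft counts n_k
    first = [0.0] * len(means)  # responsibility-weighted sums of patches
    for x in patches:
        likes = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:
            continue  # patch is numerically unexplained by every component
        for k, like in enumerate(likes):
            gamma = like / total  # responsibility of component k for x
            n[k] += gamma
            first[k] += gamma * x
    adapted = []
    for k, mu in enumerate(means):
        alpha = n[k] / (n[k] + relevance)         # data-dependent mixing weight
        ex = first[k] / n[k] if n[k] > 0 else mu  # mean of patches assigned to k
        adapted.append(alpha * ex + (1.0 - alpha) * mu)
    return adapted


# Toy usage: 16 patches at 1.0 pull the first mean halfway from 0.0 to 1.0.
adapted = map_adapt_means([1.0] * 16, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
print([round(m, 3) for m in adapted])  # [0.5, 10.0]
```

Components that see little data stay close to the UBM, which keeps the per-image models comparable to one another.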

VT1 Performance (2 in 8)

Category                                                                 MAP
101 Crowd (>10 people)                                                   0.8419
102 Building with sky as backdrop, clearly visible                       0.977
103 Mobile devices including handphone/PDA                               0.028
107 Person using Computer, both visible                                  0.2281
109 Company Trademark, including billboard, logo                         0.96
112 Closeup of hand, e.g. using mouse, writing, etc.                     0.4584
113 Business meeting (> 2 people), mostly seated down, table visible     0.0644
115 Food on dishes, plates                                               0.2285
116 Face closeup, occupying about 3/4 of screen, frontal or side         0.9783
117 Traffic Scene, many cars, trucks, road visible                       0.2901

VT2 Performance (1 in 8)

Category                                                                   MAP
202 Talking face with introductory caption                                 0.8432
206 Static or minute camera movement, people(s) walking, legs visible      0.0581
207 Large camera movement, panning left/right, top/down of a scene         0.7789
208 Movie ending credit                                                    0.2782
209 Woman monologue Zhen                                                   0.9756

Performance of Round3 (1 in 7)

Task 2 (VT1)

Target                                                               Estimated MAP (R=20)
101 Crowd (>10 people)                                               0.64
102 Building with sky as backdrop, clearly visible                   1
107 Person using Computer, both visible                              0.7
112 Closeup of hand, e.g. using mouse, writing, etc.                 0.527
116 Face closeup, occupying about 3/4 of screen, frontal or side     1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20)         VT2 only   AT1 + VT2
202 face with introductory caption     1          0.03
209 women monolog                      0.35       0.1
201 People entering door               NA
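The tables above report MAP computed over the top R = 20 results. The exact scoring rule used in the evaluation is not given on the slides, so the sketch below shows one common convention as an assumption: average precision is the mean of the precision values at the ranks of the relevant results within the top R, and MAP is the mean of that quantity over queries. Other normalizations exist.

```python
def average_precision_at(ranked_relevance, r=20):
    """Average precision over the top r results.

    ranked_relevance : list of booleans, best-ranked result first.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance[:r], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query_relevance, r=20):
    aps = [average_precision_at(q, r) for q in per_query_relevance]
    return sum(aps) / len(aps)


# Toy usage: a perfect top-20 list scores 1, matching the entries equal to 1 above.
print(average_precision_at([True] * 20))                   # 1.0
print(mean_average_precision([[True], [False, True]]))     # 0.75
```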

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop
• Shot Boundary Detection
• Copy Detection
• Video Search
• Summarization
• High Level Feature Extraction
• Surveillance Event Detection (Our Task)

Surveillance Event Detection

The Dataset
• Surveillance footage from London Gatwick Airport
• 5 stationary cameras
• Training set: 100 hours
• Testing set: 44 hours
• Frame size: 700x540 px
• Dataset size: ~350 GB
• Frames: ~12 million

Surveillance Event Detection

List of Events
1. Cell to ear
2. Embrace
3. Object Put
4. Opposing flow
5. Pointing
6. Taking picture
7. Running
8. People meeting
9. People splitting
10. Person not entering elevator

Detection of Opposing Flow Event

[Diagram labels: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool
• Rapid development
• Fast execution

Python glue layer
• CUDA, C/C++
• Integrates libraries

Data flow
• Lazy pull
• Per-frame referencing
• Caches (lots of them)
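The data flow bullets (lazy pull, per-frame referencing, caches) describe a general pattern that can be sketched as follows. This is not ViVid's API: the class names `Source` and `LazyNode`, the `get(i)` method, and the oldest-first eviction policy are all hypothetical choices made for the illustration.

```python
class Source:
    """Leaf node: stands in for a video decoder (here frame i is just i)."""

    def __init__(self):
        self.decoded = 0  # counts how many times a frame was actually fetched

    def get(self, i):
        self.decoded += 1
        return i


class LazyNode:
    """Computes frame i only when asked, pulling its input from the parent."""

    def __init__(self, parent, func, cache_size=8):
        self.parent, self.func, self.cache_size = parent, func, cache_size
        self.cache = {}  # frame index -> result, in insertion order

    def get(self, i):
        if i not in self.cache:
            if len(self.cache) >= self.cache_size:
                self.cache.pop(next(iter(self.cache)))  # evict the oldest entry
            self.cache[i] = self.func(self.parent.get(i))
        return self.cache[i]


# Toy usage: the second request for frame 3 is served from the cache.
src = Source()
doubled = LazyNode(src, lambda frame: 2 * frame)
print(doubled.get(3), doubled.get(3), src.decoded)  # 6 6 1
```

Because every node is addressed by frame index, several downstream consumers can request the same frame without triggering repeated work upstream.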

Motivations

Most operations are highly local

Applications with real-time (or faster) performance requirements
• Surveillance
• Soft biometrics
• Multimedia Indexing

Visual Computing is here
• Imaging and Photogrammetry
• Pattern Recognition and Statistical Learning
• Object Detection and Recognition
• Dynamic Vision
• Interactive and Internet Vision

Working with ViVid

Why parallel

Massive amounts of data
• 20 hours of video uploaded to YouTube every minute
• 15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
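The "~5 days" figure can be checked with a few lines of arithmetic. The upload rate and photo count come from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
hours_per_minute = 20   # hours of video uploaded per minute (from the slide)
facebook_photos = 15e9  # photos on Facebook (from the slide)
fps = 30                # assumed frame rate, not stated on the slide

# Hours uploaded per day, converted to seconds, then to frames.
frames_per_day = hours_per_minute * 60 * 24 * 3600 * fps
days_to_match = facebook_photos / frames_per_day

print(frames_per_day)           # 3110400000
print(round(days_to_match, 1))  # 4.8
```

At 25 frames per second the same calculation gives a little under 6 days, so the estimate is not very sensitive to the assumed rate.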

Image / Video Processing
• Video Decoder
• 2D/3D Convolution
• 2D/3D Fourier Transform
• Optical Flow

Feature Extraction
• Motion Descriptor (Efros et al.)
• Motion History Descriptor
• Random Video Interest Points
• Histograms of Oriented Gradients / Optical Flow

Analysis
• Vector Quantization
• SVM Classifier Evaluation

ViVid – Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

Algorithms Algorithms

Motivation using a GMM to model the distribution of

patches

1 Train a UBM (Universal Background Model) based on

patches from all training images

2 MAP Estimation of the distribution of the patches

belonging to one image given UBM

3 Compute pair-wise image distance based on patch

kernel and within-class covariance normalization

3 Retrieving images based the normalized distance

VT1 Performance (2 in 8) VT1 Performance (2 in 8)

Category MAP

bull101 Crowd (gt10 people) 08419

bull102 Building with sky as backdrop clearly visible 0977

bull103 Mobile devices including handphonePDA 0028

bull107 Person using Computer both visible 02281

bull109 Company Trademark including billboard logo 096

bull112 Closeup of hand eg using mouse writing etc 04584

bull113 Business meeting (gt 2 people) mostly seated down table visible 00644

bull115 Food on dishes plates 02285

bull116 Face closeup occupying about 34 of screen frontal or side 09783

bull117 Traffic Scene many cars trucks road visible 02901

VT2 Performance(1in8) VT2 Performance(1in8)

Category MAP

bull202 Talking face with introductory caption 08432

bull206 Static or minute camera movement people(s)

walking legs visible 00581

bull207 Large camera movement panning leftright

topdown of a scene 07789

bull208 Movie ending credit 02782

bull209 Woman monologue Zhen 09756

Performance of Round3 (1in7) Performance of Round3 (1in7)

Task 2 (VT1)

Target Estimated MAP (R=20)

101 Crowd (gt10 people) 064

102 Building with sky as backdrop clearly visible 1

107 Person using Computer both visible 07

112 Closeup of hand eg using mouse writing etc 0527

116 Face closeup occupying about 34 of screen frontal or side 1

Task3 (AT1 + VT2)

Retrieval Target VT2 only AT1 + VT2

Video R=20

202 face with introductory caption 1 003

209 women monolog 035 01

201 People entering door NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC Text REtrieval Conference TRECVid Video Retrieval Workshop

Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million

Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision


Working with ViVid

Why parallel?

Massive amounts of data
  20 hours of video uploaded to YouTube every minute
  15 billion photos on Facebook
Most operations are local and independent in the (x, y, t) space
Already available (GPUs)


If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days
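The ~5 day figure can be checked with a few lines of arithmetic. The sketch below uses the upload rate and photo count from the slide; the frame rate of 30 frames per second is an assumption, since the slide does not state one.

```python
# Back-of-the-envelope check of the "~5 days" claim.
HOURS_UPLOADED_PER_MINUTE = 20   # from the slide
FACEBOOK_PHOTOS = 15e9           # from the slide
FPS = 30                         # assumption: frame rate is not given on the slide


def days_to_match_facebook(fps=FPS):
    """Days of YouTube uploads needed to produce as many frames as Facebook has photos."""
    video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
    frames_per_day = video_seconds_per_day * fps
    return FACEBOOK_PHOTOS / frames_per_day


if __name__ == "__main__":
    print(f"{days_to_match_facebook():.1f} days at {FPS} fps")
```

At 25 frames per second the same calculation gives a little under 6 days, so the estimate is not very sensitive to the assumed frame rate.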

Image / Video Processing
  Video Decoder
  2D/3D Convolution
  2D/3D Fourier Transform
  Optical Flow
Feature Extraction
  Motion Descriptor (Efros et al.)
  Motion History Descriptor
  Random Video Interest Points
  Histograms of Oriented Gradients / Optical Flow
Analysis
  Vector Quantization
  SVM Classifier Evaluation

ViVid: Video Computer Vision on Graphics Processors

Download
http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[System diagram labels: Video; Interest Points; Features (Motion, Shape); Vector Quantization; Histogram; Classifier; Event Label]

Event labels
• Running
• Pointing
• Object Put
• Cell To Ear

Video Interest Point Detectors

Laptev
  3D Harris Corner Detector
  Corners
Dollar
  Space Time Gabor
  Corners
  Periodic Motion
RSMB
  Random Sampling of the Motion Boundary
  Motion


[Figure panels: Laptev, Dollar, RSMB]

More is Good

[Figure: Interest Point Detection Rates]

Video Features

Descriptors of information relevant to the task
  Motion
  Shape
  Appearance
Computationally intensive
  Development
  Application


Averaging Flow (Efros et al., 2003)

Motion History Images (Bobick & Davis, 2001)

\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
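The motion history update is a per-pixel rule, which makes it easy to sketch. The code below is a minimal pure-Python illustration, not the ViVid implementation: `update_mhi` is a hypothetical name, and D is assumed to be a binary motion mask (1 where motion was detected in the current frame).

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One MHI step: H = tau where D == 1, otherwise max(0, H_prev - 1)."""
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]


if __name__ == "__main__":
    tau = 3
    mhi = [[0, 0, 0]]
    # A one-pixel blob moving right across a 1x3 image, one frame per mask.
    masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
    for mask in masks:
        mhi = update_mhi(mhi, mask, tau)
    print(mhi)  # [[1, 2, 3]]: brighter values mark more recent motion
```

Because every pixel is updated independently of its neighbors, the rule is an example of the local, independent operations in (x, y, t) that the slides point to as a good fit for GPUs.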

[Bar chart, milliseconds per frame (axis 0 to 300): C; CUDA (distance); CUDA (distance + argmin); CUDA (feature + distance + argmin). Bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions
• Histogram the image gradient / optical flow based on the direction and magnitude
• Normalize over neighboring regions
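The three steps above can be sketched for the image-gradient case as follows. This is a minimal illustration, not the ViVid code: the function names, the 4-pixel cell size, the 9 orientation bins, and the 2x2 block with L2 normalization are all assumptions chosen for the example. The same binning could be applied to optical flow vectors in place of gradients.

```python
import math


def cell_histograms(image, cell=4, bins=9):
    """Magnitude-weighted orientation histograms, one per cell x cell region."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi    # unsigned orientation
            b = min(int(angle / math.pi * bins), bins - 1)
            cy, cx = y // cell, x // cell
            if cy < rows and cx < cols:
                hists[cy][cx][b] += mag
    return hists


def normalize_block(hists, cy, cx, eps=1e-6):
    """Concatenate a 2x2 block of neighboring cells and L2-normalize it."""
    block = []
    for dy in (0, 1):
        for dx in (0, 1):
            block.extend(hists[cy + dy][cx + dx])
    norm = math.sqrt(sum(v * v for v in block) + eps)
    return [v / norm for v in block]


if __name__ == "__main__":
    # 8x8 image with a vertical edge between columns 3 and 4.
    img = [[0 if x < 4 else 1 for x in range(8)] for _ in range(8)]
    descriptor = normalize_block(cell_histograms(img), 0, 0)
    print(len(descriptor))  # 36 values: 4 cells x 9 bins
```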


K-Means Clustering

Vector quantization
  Turns high-dimensional features into a discrete number of points
  Given data, find representative "centers"
Lloyd's algorithm
  For each data point, find the closest center
  Update each center to be the mean of its associated data points
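The two alternating steps of Lloyd's algorithm can be sketched in a few lines. This is a small pure-Python illustration under stated assumptions: squared Euclidean distance, initialization from randomly sampled data points, and a fixed iteration count are choices made for the example and are not specified on the slide.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Return (centers, assignment) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # init: k distinct data points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Step 1: assign every point to its closest center.
        assignment = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Step 2: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assignment


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, assignment = lloyd_kmeans(pts, k=2)
    print(centers, assignment)
```

Once the centers are fixed, quantizing a new feature is just step 1: replace the feature by the index of its closest center.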


K-Means

Relies heavily on pairwise distance
Large data sets
  1 million features with 100-200 dimensions
  1000 centers
Cannot fit output in GPU memory
  Will need to reduce as computation proceeds
  Need efficient reduction operator
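To see why the output does not fit, note that 1 million features against 1000 centers gives 10^9 distances, about 4 GB if each is a 4-byte float (the float size is an assumption). The sketch below illustrates the "reduce as computation proceeds" idea in plain Python: centers are processed one tile at a time and only a running minimum per feature is kept. The function names and tile size are hypothetical, and the loops stand in for what would be a GPU kernel.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centers(features, centers, tile=256):
    """Index of the closest center per feature, without storing all distances."""
    best_idx = [-1] * len(features)
    best_dist = [float("inf")] * len(features)
    for start in range(0, len(centers), tile):
        stop = min(start + tile, len(centers))
        # Only one tile of centers is considered at a time.
        for i, f in enumerate(features):
            for j in range(start, stop):
                d = sq_dist(f, centers[j])
                if d < best_dist[i]:  # running argmin is the reduction
                    best_dist[i], best_idx[i] = d, j
    return best_idx


def full_matrix_bytes(n_features, n_centers, bytes_per_value=4):
    """Memory needed to hold the complete distance matrix."""
    return n_features * n_centers * bytes_per_value


if __name__ == "__main__":
    print(full_matrix_bytes(10**6, 1000) / 1e9, "GB for the full matrix")
    print(nearest_centers([(0, 0), (6, 6), (9, 9)], [(0, 0), (10, 10)], tile=1))
```

The memory held at any time is proportional to the number of features plus one tile of centers, rather than to their product.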


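Clustering Helps

Pairwise Distance Implementation

Given
\[
A = \{a_1, \dots, a_m\}, \qquad B = \{b_1, \dots, b_n\}
\]
Compute
\[
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & \cdots & \cdots & d(a_m, b_n)
\end{bmatrix}
\]

[Bar chart comparing the C and CUDA implementations; axis ticks 0, 2000, 4000, 6000]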
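As a concrete instance of the m x n matrix above, the sketch below computes all pairwise distances in pure Python. The slide leaves d unspecified; squared Euclidean distance is assumed here, written in the expanded form ||a||^2 + ||b||^2 - 2 a.b, so the bulk of the work is dot products. `pairwise_sq_dist` is a hypothetical name.

```python
def pairwise_sq_dist(A, B):
    """m x n matrix D with D[i][j] = ||A[i] - B[j]||^2, via the dot-product expansion."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(x * x for x in b) for b in B]
    return [
        [
            a_norms[i] + b_norms[j] - 2 * sum(x * y for x, y in zip(a, b))
            for j, b in enumerate(B)
        ]
        for i, a in enumerate(A)
    ]


if __name__ == "__main__":
    A = [(0, 0), (1, 2)]
    B = [(0, 0), (3, 4), (1, 2)]
    for row in pairwise_sq_dist(A, B):
        print(row)
```

Every entry depends only on one row of A and one row of B, so all m x n entries can be computed independently.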

CPU vs GPU

Algorithmic properties that map well to GPUs
1. Independent and highly data-local computations
2. Compute bound
3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram, three animation steps: input matrices A and B, output matrix C, shared memory]

Timings on TRECVid 2008 System

[Bar chart, milliseconds on a log scale from 1 to 10000, GPU + CPU vs CPU, for the stages Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance. Bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low      Medium   High
1000              Raw                    0.681    0.804    0.844
                  Norm                   0.708    0.799    0.840
                  Mt Inf                 0.594    0.804    0.848
500               Raw                    0.675    0.792    0.833
                  Norm                   0.701    0.791    0.825
                  Mt Inf                 0.626    0.783    0.819
200               Raw                    0.671    0.772    0.811
                  Norm                   0.701    0.779    0.818
                  Mt Inf                 0.614    0.720    0.756
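For orientation, the histogramming step that follows vector quantization can be sketched as below. This is an illustration only: it assumes that "Raw" means plain counts of dictionary words and "Norm" means the same counts scaled to sum to 1, which the slide does not spell out, and it does not attempt the "Mt Inf" variant. `word_histogram` is a hypothetical name.

```python
def word_histogram(word_ids, dictionary_size, normalize=False):
    """Count how often each dictionary word occurs; optionally scale to sum to 1."""
    hist = [0.0] * dictionary_size
    for w in word_ids:
        hist[w] += 1.0
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [v / total for v in hist]
    return hist


if __name__ == "__main__":
    # Quantized features of one video clip, as indices into a 4-word dictionary.
    words = [0, 2, 2, 3]
    print(word_histogram(words, 4))
    print(word_histogram(words, 4, normalize=True))
```

Normalizing makes clips with different numbers of interest points comparable, at the cost of discarding the total count.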

Results (2009), with 2008 results in parentheses

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

Conclusions

Some practical problems are very hard to solve
Fusion of many different approaches
Take advantage of all available hardware
  Cloud
  GPUs
Contests / Evaluations Experience
  Working with realistic data
  Engineering / Programming
  Tight schedule, streamlined development


Examples of Evaluations

TRECVid 2012
  Task - Semantic indexing (SIN)
  Task - Known-item search (KIS)
  Task - Interactive surveillance event detection (SED)
  Task - Instance search (INS)
  Task - Multimedia event detection (MED)
  Task - Multimedia event recounting (MER)


Pascal Visual Object Classes

  Classification / detection
  Segmentation
  Person Layout
  Action Classification


ImageNet Large Scale Visual Recognition

  10000 classes


VT1 Performance (2 in 8)

Category                                                                    MAP
101 Crowd (>10 people)                                                      0.8419
102 Building with sky as backdrop, clearly visible                          0.977
103 Mobile devices including handphone/PDA                                  0.028
107 Person using computer, both visible                                     0.2281
109 Company trademark, including billboard, logo                            0.96
112 Closeup of hand, e.g. using mouse, writing, etc.                        0.4584
113 Business meeting (>2 people), mostly seated down, table visible         0.0644
115 Food on dishes, plates                                                  0.2285
116 Face closeup occupying about 3/4 of screen, frontal or side             0.9783
117 Traffic scene, many cars, trucks, road visible                          0.2901

VT2 Performance (1 in 8)

Category                                                                    MAP
202 Talking face with introductory caption                                  0.8432
206 Static or minute camera movement, people(s) walking, legs visible       0.0581
207 Large camera movement, panning left/right, top/down of a scene          0.7789
208 Movie ending credit                                                     0.2782
209 Woman monologue Zhen                                                    0.9756

Performance of Round 3 (1 in 7)

Task 2 (VT1)

Target                                                                      Estimated MAP (R=20)
101 Crowd (>10 people)                                                      0.64
102 Building with sky as backdrop, clearly visible                          1
107 Person using computer, both visible                                     0.7
112 Closeup of hand, e.g. using mouse, writing, etc.                        0.527
116 Face closeup occupying about 3/4 of screen, frontal or side             1

Task 3 (AT1 + VT2)

Retrieval Target (Video, R=20)            VT2 only   AT1 + VT2
202 Face with introductory caption        1          0.03
209 Woman monologue                       0.35       0.1
201 People entering door                  NA

We are
2nd in Audio search
4th in Video search
2nd in AV search
1st overall



Page 39: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Performance of Round3 (1in7) Performance of Round3 (1in7)

Task 2 (VT1)

Target Estimated MAP (R=20)

101 Crowd (gt10 people) 064

102 Building with sky as backdrop clearly visible 1

107 Person using Computer both visible 07

112 Closeup of hand eg using mouse writing etc 0527

116 Face closeup occupying about 34 of screen frontal or side 1

Task3 (AT1 + VT2)

Retrieval Target VT2 only AT1 + VT2

Video R=20

202 face with introductory caption 1 003

209 women monolog 035 01

201 People entering door NA

We are

2nd in Audio search

4th in Video search

2nd in AV search

1st overall

TRECVid

TREC Text REtrieval Conference TRECVid Video Retrieval Workshop

Shot Boundary Detection Copy Detection Video Search Summarization High Level Feature Extraction Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset Surveillance footage from London Gatwick Airport 5 stationary cameras Training set 100 Hours Testing set 44 Hours Frame size 700x540 px Dataset Size ~ 350 GB Frames ~12 million

Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid Working with ViVid

Why parallel Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days

Image / Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients / Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid: Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Pipeline diagram; blocks: Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, Event Label]

Event Label

• Running

• Pointing

• Object Put

• Cell To Ear

Video Interest Point Detectors

Laptev: 3D Harris Corner Detector (corners)

Dollár: Space Time Gabor (corners, periodic motion)

RSMB: Random Sampling of the Motion Boundary (motion)

[Figure panels: Laptev, Dollár, RSMB]

More is Good

[Figure: Interest Point Detection Rates]

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al., 2003)

Motion History Images

(Bobick & Davis, 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
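A minimal sketch of the motion history update: a pixel is set to τ wherever the motion mask D is 1, and otherwise its previous value decays by 1, clamped at 0. The function name `update_mhi` and the nested-list image representation are illustrative choices, not taken from the slides.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One MHI step: tau where motion is present, else decay by 1 toward 0."""
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(prev_mhi, motion_mask)
    ]


# Two frames on a toy 2x3 image with tau = 3 (hypothetical values).
mhi = [[0, 0, 0], [0, 0, 0]]
mhi = update_mhi(mhi, [[1, 0, 0], [0, 1, 0]], tau=3)
mhi = update_mhi(mhi, [[0, 1, 0], [0, 0, 0]], tau=3)
print(mhi)  # [[2, 3, 0], [0, 2, 0]]
```

Recently moving pixels hold the largest values, so a single image encodes both where and how recently motion occurred.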

[Figure: bar chart of milliseconds per frame (axis 0 to 300) for C, CUDA (distance), CUDA (distance + argmin), and CUDA (feature + distance + argmin); bar labels as extracted: 133, 438, 51, 251]

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient / optical flow based on the direction and magnitude

• Normalize over neighboring regions
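The three steps above can be sketched for the image-gradient case as follows. The specific choices here (4x4-pixel cells, 9 unsigned orientation bins, central differences, L2 normalization over 2x2 blocks of cells) are common defaults assumed for illustration; the slides do not specify them.

```python
import math


def cell_histograms(img, cell=4, bins=9):
    """Magnitude-weighted orientation histogram for each cell x cell region."""
    h, w = len(img), len(img[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]      # central differences
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned orientation
            b = min(int(ang / (180.0 / bins)), bins - 1)
            r, c = y // cell, x // cell
            if r < rows and c < cols:
                hists[r][c][b] += mag               # vote weighted by magnitude
    return hists


def normalize_blocks(hists, eps=1e-6):
    """L2-normalize each 2x2 block of neighboring cells into one descriptor."""
    out = []
    for r in range(len(hists) - 1):
        for c in range(len(hists[0]) - 1):
            block = (hists[r][c] + hists[r][c + 1]
                     + hists[r + 1][c] + hists[r + 1][c + 1])
            norm = math.sqrt(sum(v * v for v in block) + eps)
            out.append([v / norm for v in block])
    return out


# Toy 8x8 image with a vertical edge: intensity jumps from 0 to 1 at x = 4.
img = [[0.0 if x < 4 else 1.0 for x in range(8)] for _ in range(8)]
hists = cell_histograms(img)
blocks = normalize_blocks(hists)
```

For optical flow, the same binning would apply with the flow vector (u, v) in place of the gradient (gx, gy).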

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update each center to be the mean of its associated data points
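A minimal sketch of Lloyd's algorithm as described above, alternating the assignment and update steps for a fixed number of iterations. Initializing the centers by sampling data points, the fixed iteration count, and keeping an old center when its cluster is empty are assumptions made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # assumed initialization: random data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its closest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                      # keep the old center if the cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Two well-separated toy clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = lloyd(points, k=2)
print(sorted(centers))
```

The assignment step is a pairwise distance computation followed by an argmin per point, which is why the distance kernel dominates the cost.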

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as the computation proceeds

Need an efficient reduction operator
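To make the memory constraint concrete: assuming 4-byte floats (an assumption; the slides do not give a data type), the full distance matrix for 1 million features and 1000 centers takes about 4 GB. The sketch below shows the idea of reducing as the computation proceeds: distances are computed one block of points at a time and immediately reduced to an argmin, so only the labels are kept. The function name and chunk size are illustrative.

```python
def nearest_centers(points, centers, chunk=2):
    """Index of the closest center for each point, computed one chunk at a time."""
    labels = []
    for start in range(0, len(points), chunk):
        block = points[start:start + chunk]
        # Distance block of shape (len(block), len(centers)); dropped after the reduction.
        dists = [
            [sum((p - c) ** 2 for p, c in zip(pt, ctr)) for ctr in centers]
            for pt in block
        ]
        # Argmin reduction: keep one integer per point instead of a row of distances.
        labels.extend(min(range(len(centers)), key=row.__getitem__) for row in dists)
    return labels


# Size of the full output matrix for the numbers on the slide.
n_points, n_centers, bytes_per_float = 1_000_000, 1000, 4   # float32 is an assumption
full_matrix_gb = n_points * n_centers * bytes_per_float / 1e9

toy_points = [(0, 0), (5, 5), (9, 9), (1, 0), (6, 4)]
toy_centers = [(0, 0), (5, 5), (10, 10)]
toy_labels = nearest_centers(toy_points, toy_centers)
print(full_matrix_gb, toy_labels)
```

With the reduction fused in, the output shrinks from one float per (point, center) pair to one index per point.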

Clustering Helps

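Pairwise Distance Implementation

[Figure: bar chart comparing the C and CUDA implementations; axis ticks 0, 2000, 4000, 6000]

Given

$$
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
$$

Compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & & \\
d(a_m, b_1) & \cdots & & d(a_m, b_n)
\end{bmatrix}
$$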
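As one concrete instance of the m x n distance matrix, the sketch below takes d to be the squared Euclidean distance (an assumption; the slide leaves d generic) and uses the expansion |a - b|^2 = |a|^2 + |b|^2 - 2 a·b. This is a widely used formulation, not necessarily the one in ViVid; it is shown because the cross term is a matrix product of A and B, which is the kind of regular, data-local work that suits a GPU.

```python
def pairwise_sq_dists(A, B):
    """m x n matrix of squared Euclidean distances via |a|^2 + |b|^2 - 2 a.b."""
    a_norms = [sum(x * x for x in a) for a in A]
    b_norms = [sum(x * x for x in b) for b in B]
    return [
        [an + bn - 2 * sum(x * y for x, y in zip(a, b)) for b, bn in zip(B, b_norms)]
        for a, an in zip(A, a_norms)
    ]


A = [(0.0, 0.0), (3.0, 4.0)]                 # m = 2 toy vectors
B = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]     # n = 3 toy vectors
D = pairwise_sq_dists(A, B)
print(D)  # [[0.0, 1.0, 9.0], [25.0, 20.0, 16.0]]
```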

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram, shown in three steps: input matrices A and B, output matrix C, and shared memory]

Timings on TRECVid 2008 System

[Figure: bar chart on a logarithmic axis (1 to 10000 milliseconds) comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

                                         Rate of Detection
Dictionary Size   Histogramming Method   Low     Medium   High
1000              Raw                    0.681   0.804    0.844
                  Norm                   0.708   0.799    0.840
                  Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
                  Norm                   0.701   0.791    0.825
                  Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
                  Norm                   0.701   0.779    0.818
                  Mt Inf                 0.614   0.720    0.756
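A minimal sketch of how a video segment could be turned into a histogram over dictionary entries once each feature has been quantized to a word index. It assumes that "Raw" means plain counts and "Norm" means counts divided by their sum; the slides do not define these terms, and the "Mt Inf" variant is not covered here. The function name `bow_histogram` is hypothetical.

```python
from collections import Counter


def bow_histogram(word_ids, dict_size, normalize=False):
    """Histogram of quantized feature indices; optionally scaled to sum to 1."""
    counts = Counter(word_ids)
    hist = [float(counts.get(i, 0)) for i in range(dict_size)]
    if normalize:
        total = sum(hist)
        if total > 0:
            hist = [v / total for v in hist]
    return hist


# Five features from one segment, quantized against a toy 4-word dictionary.
word_ids = [0, 2, 2, 3, 2]
raw = bow_histogram(word_ids, dict_size=4)
norm = bow_histogram(word_ids, dict_size=4, normalize=True)
print(raw, norm)
```

Normalizing removes the dependence on how many interest points a segment happens to contain.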

Results (2009)

Values in parentheses are the 2008 results.

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering / Programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10000 Classes

TRECVid

TREC: Text REtrieval Conference

TRECVid: Video Retrieval Workshop

Shot Boundary Detection

Copy Detection

Video Search

Summarization

High Level Feature Extraction

Surveillance Event Detection

Our Task

Surveillance Event Detection

The Dataset

Surveillance footage from London Gatwick Airport

5 stationary cameras

Training set: 100 hours

Testing set: 44 hours

Frame size: 700x540 px

Dataset size: ~350 GB

Frames: ~12 million

Surveillance Event Detection

List of Events

1. Cell to ear

2. Embrace

3. Object Put

4. Opposing flow

5. Pointing

6. Taking picture

7. Running

8. People meeting

9. People splitting

10. Person not entering elevator

Detection of Opposing Flow Event

[Diagram; blocks: Regional Averaging, Door Open/Close Information, Thresholding Rule, Event Detections]



Surveillance Event Detection

List of Events 1 Cell to ear 2 Embrace 3 Object Put 4 Opposing flow 5 Pointing 6 Taking picture 7 Running 8 People meeting 9 People splitting 10Person not entering elevator

Regional Averaging

Door OpenClose Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Vision Video Library (ViVid) Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Research tool

Rapid development

Fast execution

Python glue layer

CUDA CC++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)

Motivations Motivations

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Most operations are highly local

Applications with real time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid Working with ViVid

Why parallel Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days

Image Video Processing

Video Decoder

2D3D Convolution

2D3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients Optical Flow

Analysis Vector Quantization

SVM Classifier Evaluation

ViVid ndash Video Computer Vision on Graphics Processors

Download

httpgithubcommertdikmenViVid

TRECVid 2008 System

TRECVid 2009 System TRECVid 2009 System

Features Video

Motion

Shape

Classifier

Event Label

bull Running

bull Pointing

bull Object Put

bull Cell To Ear

Vector

Quantization

Histogram

Interest Points

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

K-Means Clustering K-Means Clustering

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

K-Means K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Clustering Helps Clustering Helps

Pairwise Distance Implementation Pairwise Distance Implementation

0 2000 4000 6000

CUDA

C

)bd(a)bd(a

)bd(a

)bd(a)bd(a)bd(a

nm1m

12

n11111

n1

m1

bbB

aaA

Compute Given

CPU vs GPU CPU vs GPU

Algorithmic properties that map well to GPUs

1 Independent and highly data local

computations

2Compute bound

3Little branch divergence

Pairwise Distance Computation on the GPU Pairwise Distance Computation on the GPU

Shared

Memory

A B

C

Pairwise Distance Computation Pairwise Distance Computation

A B

C

Pairwise Distance Computation Pairwise Distance Computation

A B

C

Timings on TRECVid 2008 System Timings on TRECVid 2008 System

53 79

23

240 150

53 79

3030 4947

1

10

100

1000

10000

Fetch

Frame

Optical

Flow

Transfer to

GPU

Feature

Extraction

Pairwise

Distance

millise

co

nd

s

GPU + CPU CPU

Benchmark Benchmark

Dictionary Building Strategies Dictionary Building Strategies

Dictionary Size

Histogramming Method Rate of Detection

Low Medium High

1000 Raw 0681 0804 0844

Norm 0708 0799 0840

Mt Inf 0594 0804 0848

500 Raw 0675 0792 0833

Norm 0701 0791 0825

Mt Inf 0626 0783 0819

200 Raw 0671 0772 0811

Norm 0701 0779 0818

Mt Inf 0614 0720 0756

Results (2009)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

Values in parentheses are the 2008 results.
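As a quick reading aid for the 2009 counts, the fraction of missed events can be computed per event type. The sketch assumes that True Positives and Miss together account for all reference events of that type; the helper name is illustrative.

```python
# (true positives, misses) for the 2009 system, copied from the results table.
results_2009 = {
    "Pointing": (13, 1050),
    "Cell To Ear": (0, 194),
    "Person Runs": (1, 106),
    "Object Put": (1, 620),
}

def miss_rate(true_positives, misses):
    """Fraction of reference events that the system failed to detect."""
    return misses / (true_positives + misses)

rates = {event: miss_rate(tp, miss) for event, (tp, miss) in results_2009.items()}
```

Under that assumption every event type has a miss rate above 0.98, which is consistent with the first conclusion of the talk that some practical problems are very hard to solve.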

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering, Programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10000 Classes


Regional Averaging

Door Open/Close Information

Event Detections

Thresholding Rule

Detection of Opposing Flow Event

Vision Video Library (ViVid): Utilizing GPUs in Computer Vision

Research tool

Rapid development

Fast execution

Python glue layer

CUDA C/C++

Integrates Libraries

Data flow

Lazy pull

Per frame referencing

Caches (lots of them)
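The data-flow bullets (lazy pull, per frame referencing, caches) describe a pattern that can be illustrated in a few lines. The sketch below is a generic illustration of that pattern only; the class and the toy stages are hypothetical and are not ViVid's API.

```python
class LazyStage:
    """A pipeline stage that computes a frame's result only when it is requested."""

    def __init__(self, compute, source=None):
        self.compute = compute   # function applied to the upstream result
        self.source = source     # upstream LazyStage, or None for the first stage
        self.cache = {}          # frame index -> cached result
        self.evaluations = 0     # how many times compute actually ran

    def get(self, frame_index):
        # Per frame referencing: results are addressed by frame index.
        if frame_index not in self.cache:
            # Lazy pull: ask the upstream stage only when this stage needs it.
            upstream = self.source.get(frame_index) if self.source else frame_index
            self.cache[frame_index] = self.compute(upstream)
            self.evaluations += 1
        return self.cache[frame_index]

# Toy stages standing in for decoding and feature extraction.
decoder = LazyStage(lambda i: [i, i + 1, i + 2])
feature = LazyStage(lambda frame: sum(frame), source=decoder)

first = feature.get(7)   # pulls frame 7 through the decoder, then the feature stage
again = feature.get(7)   # served from the cache, nothing is recomputed
```

Requesting the same frame twice runs each stage once, so expensive stages such as optical flow would not be repeated when several downstream consumers reference the same frame.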

Motivations

Most operations are highly local

Applications with real-time (or faster) performance requirements

Surveillance

Soft biometrics

Multimedia Indexing

Visual Computing is here

Imaging and Photogrammetry

Pattern Recognition and Statistical Learning

Object Detection and Recognition

Dynamic Vision

Interactive and Internet Vision

Working with ViVid

Why parallel?

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (x, y, t) space

Already available (GPUs)

If individual frames were to be counted as images, YouTube replicates the entire Facebook image DB every ~5 days
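The "~5 days" figure can be checked from the two numbers on the slide once a frame rate is fixed. The calculation below assumes 30 frames per second, which the slides do not state; the constant names are illustrative.

```python
HOURS_UPLOADED_PER_MINUTE = 20      # from the slide
FACEBOOK_PHOTOS = 15_000_000_000    # from the slide
FRAMES_PER_SECOND = 30              # assumed frame rate, not stated in the slides

# Seconds of video uploaded per day: 20 h of video per minute, 1440 minutes per day.
video_seconds_per_day = HOURS_UPLOADED_PER_MINUTE * 3600 * 60 * 24
frames_per_day = video_seconds_per_day * FRAMES_PER_SECOND
days_to_match_facebook = FACEBOOK_PHOTOS / frames_per_day
```

Under that assumption the uploads amount to about 3.1 billion frames per day, so matching 15 billion photos takes roughly 4.8 days.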

Image/Video Processing

Video Decoder

2D/3D Convolution

2D/3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al.)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients/Optical Flow

Analysis

Vector Quantization

SVM Classifier Evaluation

ViVid - Video Computer Vision on Graphics Processors

Download

http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Figure: system diagram with blocks Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, and Event Label]

Event Label

• Running

• Pointing

• Object Put

• Cell To Ear

Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

[Figure: panels labeled Laptev, Dollar, and RSMB]

More is Good

[Figure: Interest Point Detection Rates]

Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
\]
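The motion history update sets a pixel to τ wherever the motion indicator D is 1 and otherwise decays the previous value by 1, never going below 0. A minimal sketch of one time step follows, with images stored as lists of rows; the function name and the toy sequence are illustrative assumptions.

```python
def update_mhi(previous, motion_mask, tau):
    """One time step of the motion history image.

    previous    -- H_tau(., ., t-1) as a list of rows
    motion_mask -- D(., ., t) as a list of rows of 0/1 values
    tau         -- number of steps for which motion stays visible
    """
    return [
        [tau if d == 1 else max(0, h - 1) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(previous, motion_mask)
    ]

tau = 3
history = [[0, 0, 0]]
# A single moving pixel that steps one position to the right per frame.
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    history = update_mhi(history, mask, tau)
```

After the three steps the row holds a ramp from the oldest to the most recent motion, which is how the image encodes both where and how recently motion occurred.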

[Figure: bar chart of milliseconds per frame (axis 0 to 300) for CUDA (feature + distance + argmin), CUDA (distance + argmin), CUDA (distance), and C. Extracted bar labels: 133, 438, 51, 251]

Histograms of Oriented Gradients/Optical Flow

• Partition the image window into local regions

• Histogram the Image Gradient/Optical Flow based on the direction and magnitude

• Normalize over neighboring regions
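The three steps (partition into local regions, histogram by direction and magnitude, normalize over neighboring regions) can be sketched for two toy regions. The bin count, the magnitude-weighted voting, and the L2 normalization are common choices assumed here rather than details given in the slides; the function names are illustrative.

```python
import math

def orientation_histogram(gx, gy, bins=8):
    """Magnitude-weighted histogram of directions for one local region.

    gx, gy -- equal-length lists with the x and y components of the image
              gradient (or optical flow) at each pixel of the region
    """
    hist = [0.0] * bins
    for dx, dy in zip(gx, gy):
        magnitude = math.hypot(dx, dy)
        angle = math.atan2(dy, dx) % (2 * math.pi)   # direction in [0, 2*pi)
        index = int(angle / (2 * math.pi) * bins) % bins
        hist[index] += magnitude                      # vote weighted by magnitude
    return hist

def normalize_block(histograms, eps=1e-9):
    """L2-normalize the concatenated histograms of neighboring regions."""
    flat = [v for h in histograms for v in h]
    norm = math.sqrt(sum(v * v for v in flat)) + eps
    return [v / norm for v in flat]

# Two neighboring regions with toy gradient values.
cell_a = orientation_histogram([3.0, -1.0], [1.0, 2.0], bins=4)
cell_b = orientation_histogram([-2.0], [-1.0], bins=4)
descriptor = normalize_block([cell_a, cell_b])
```

Feeding optical flow components instead of image gradients into the same functions gives the flow variant of the descriptor.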




Working with ViVid Working with ViVid

Why parallel Why parallel

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

Massive amounts of data

20 hours of video uploaded to YouTube every minute

15 billion photos on Facebook

Most operations are local and independent in the (xyt) space

Already available (GPUs)

If individual frames were to be counted as images YouTube replicates the entire Facebook image DB every ~5 days

Image Video Processing

Video Decoder

2D3D Convolution

2D3D Fourier Transform

Optical Flow

Feature Extraction

Motion Descriptor (Efros et al)

Motion History Descriptor

Random Video Interest Points

Histograms of Oriented Gradients Optical Flow

Analysis Vector Quantization

SVM Classifier Evaluation

ViVid ndash Video Computer Vision on Graphics Processors

Download

httpgithubcommertdikmenViVid

TRECVid 2008 System

TRECVid 2009 System TRECVid 2009 System

Features Video

Motion

Shape

Classifier

Event Label

bull Running

bull Pointing

bull Object Put

bull Cell To Ear

Vector

Quantization

Histogram

Interest Points

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

K-Means Clustering K-Means Clustering

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

Vector quantization

Turns high dimensional features into discrete number of points

Given data find representative ldquocentersrdquo

Lloydrsquos algorithm

For each data point find the closest center

Update the center to be the mean of the associated data points

K-Means K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce computation proceeds

Need efficient reduction operator

Clustering Helps Clustering Helps

Pairwise Distance Implementation Pairwise Distance Implementation

0 2000 4000 6000

CUDA

C

)bd(a)bd(a

)bd(a

)bd(a)bd(a)bd(a

nm1m

12

n11111

n1

m1

bbB

aaA

Compute Given

CPU vs GPU CPU vs GPU

Algorithmic properties that map well to GPUs

1 Independent and highly data local

computations

2Compute bound

3Little branch divergence

Pairwise Distance Computation on the GPU

[Figure, three slides: diagram with blocks labeled A, B, and C and a Shared Memory stage]
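The diagrams did not survive extraction, so the following is only a guess at the scheme they depict: a tiled computation in the style of blocked matrix multiply, where small tiles of A and B are staged in fast memory and partial results are accumulated into the output C. The sketch emulates that on the CPU; the tile size, the use of squared Euclidean distance, and all names are assumptions.

```python
def pairwise_sq_dist_tiled(A, B, tile=16):
    """Squared Euclidean distances between rows of A and rows of B, computed tile by tile."""
    m, n, d = len(A), len(B), len(A[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):              # tile of rows from A
        for j0 in range(0, n, tile):          # tile of rows from B
            for k0 in range(0, d, tile):      # slice of the feature dimension
                # Stand-in for loading both tiles into shared memory.
                a_tile = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[k0:k0 + tile] for row in B[j0:j0 + tile]]
                # Each (i, j) pair accumulates its partial distance for this slice.
                for i, a in enumerate(a_tile):
                    for j, b in enumerate(b_tile):
                        C[i0 + i][j0 + j] += sum((x - y) ** 2 for x, y in zip(a, b))
    return C
```

Every output entry is independent of the others and each loaded tile is reused by many entries, which matches the "independent and highly data-local" property listed on the CPU vs GPU slide.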

Timings on TRECVid 2008 System

[Figure: bar chart of per-stage time in milliseconds (log scale, 1 to 10000) for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance; series: GPU + CPU and CPU. Bar labels as extracted: 53, 79, 23, 240, 150; 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low      Medium   High
1000              Raw                    0.681    0.804    0.844
                  Norm                   0.708    0.799    0.840
                  Mt Inf                 0.594    0.804    0.848
500               Raw                    0.675    0.792    0.833
                  Norm                   0.701    0.791    0.825
                  Mt Inf                 0.626    0.783    0.819
200               Raw                    0.671    0.772    0.811
                  Norm                   0.701    0.779    0.818
                  Mt Inf                 0.614    0.720    0.756
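For orientation, here is a sketch of how a dictionary turns a set of local features into a fixed-length histogram. It assumes that "Raw" means raw codeword counts and "Norm" means the counts scaled to sum to one; the slides do not define either term, and the "Mt Inf" variant is not attempted here. All names are hypothetical.

```python
def quantize(feature, centers):
    """Index of the dictionary entry (center) closest to the feature."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(feature, centers[j])))

def bow_histogram(features, centers, normalize=False):
    """Count how often each dictionary entry is the nearest one; optionally normalize."""
    hist = [0.0] * len(centers)
    for f in features:
        hist[quantize(f, centers)] += 1.0
    if normalize and features:
        total = sum(hist)
        hist = [h / total for h in hist]  # assumed meaning of "Norm"
    return hist
```

The dictionary size in the table corresponds to `len(centers)`, so larger dictionaries give longer, finer-grained histograms.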

Results (2009)

Event          True Positives   False Alarms   Misses   Min DCR
Pointing       13 (57)          225 (2505)     1050     1.006
Cell To Ear    0 (8)            58 (4005)      194      1.060
Person Runs    1 (0)            38 (314)       106      0.997
Object Put     1 (21)           190 (2703)     620      1.020

(2008 results in parentheses)
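A sketch of how a detection cost rate of this kind is typically computed, assuming the form DCR = P_miss + beta * R_FA with beta = cost_fa / (cost_miss * target_rate). The default constants below are assumptions to be checked against the evaluation plan for the year in question, and `hours` (the duration of the evaluated video) is a hypothetical parameter that the table does not report.

```python
def detection_cost_rate(misses, targets, false_alarms, hours,
                        cost_miss=10.0, cost_fa=1.0, target_rate=20.0):
    """Miss probability plus a weighted false-alarm rate (false alarms per hour)."""
    p_miss = misses / targets
    r_fa = false_alarms / hours
    beta = cost_fa / (cost_miss * target_rate)  # 0.005 with the assumed defaults
    return p_miss + beta * r_fa
```

Under this form a system that reports nothing scores exactly 1.0, which is why values close to 1 indicate that misses dominate the cost. "Min DCR" would then be the smallest value obtained over all decision thresholds.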

Conclusions

• Some practical problems are very hard to solve
  • Fusion of many different approaches
  • Take advantage of all available hardware
    • Cloud
    • GPUs
• Contests/Evaluations Experience
  • Working with realistic data
  • Engineering/Programming
  • Tight schedule, streamlined development

Examples of Evaluations

TRECVid 2012
• Task - Semantic indexing (SIN)
• Task - Known-item search (KIS)
• Task - Interactive surveillance event detection (SED)
• Task - Instance search (INS)
• Task - Multimedia event detection (MED)
• Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

• Classification/detection
• Segmentation
• Person Layout
• Action Classification

ImageNet Large Scale Visual Recognition

• 10000 Classes


Why parallel?

• Massive amounts of data
  • 20 hours of video uploaded to YouTube every minute
  • 15 billion photos on Facebook
• Most operations are local and independent in the (x, y, t) space
• Already available (GPUs)

If individual frames were counted as images, YouTube would replicate the entire Facebook image DB every ~5 days.
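The ~5 day figure can be checked with a short calculation. The frame rate is not given on the slide; 30 frames per second is assumed here, and the function name is hypothetical.

```python
def days_to_match(photo_count, video_hours_per_minute, fps=30):
    """Days of uploads needed for the number of video frames to equal photo_count."""
    frames_per_minute = video_hours_per_minute * 3600 * fps  # frames uploaded each minute
    frames_per_day = frames_per_minute * 60 * 24
    return photo_count / frames_per_day

days = days_to_match(15e9, 20)  # 15 billion photos, 20 hours of video per minute
```

With these inputs about 3.1 billion frames arrive per day, so the result is a little under 5 days.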

ViVid - Video Computer Vision on Graphics Processors

Image/Video Processing
• Video Decoder
• 2D/3D Convolution
• 2D/3D Fourier Transform
• Optical Flow

Feature Extraction
• Motion Descriptor (Efros et al.)
• Motion History Descriptor
• Random Video Interest Points
• Histograms of Oriented Gradients / Optical Flow

Analysis
• Vector Quantization
• SVM Classifier Evaluation

Download: http://github.com/mertdikmen/ViVid

TRECVid 2008 System

TRECVid 2009 System

[Figure: system diagram with blocks Video, Interest Points, Features (Motion, Shape), Vector Quantization, Histogram, Classifier, and Event Label]

Event labels
• Running
• Pointing
• Object Put
• Cell To Ear

Video Interest Point Detectors

• Laptev
  • 3D Harris Corner Detector
  • Corners
• Dollar
  • Space Time Gabor
  • Corners
  • Periodic Motion
• RSMB
  • Random Sampling of the Motion Boundary
  • Motion

[Figure panels: Laptev, Dollar, RSMB]

More is Good

[Figure: Interest Point Detection Rates]

Video Features

• Descriptors of information relevant to the task
  • Motion
  • Shape
  • Appearance
• Computationally intensive
  • Development
  • Application

Averaging Flow (Efros et al. 2003)

Motion History Images (Bobick & Davis 2001)

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_\tau(x, y, t - 1) - 1\bigr) & \text{otherwise}
\end{cases}
$$
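The update rule is simple to state in code. This is a minimal sketch assuming the motion indicator D arrives as a binary 2-D list per frame; the function name and list-of-lists layout are illustrative.

```python
def update_mhi(prev_h, motion_mask, tau):
    """One time step of the motion history image.

    Pixels with motion are set to tau; all others decay by 1, but never below 0.
    """
    return [[tau if d else max(0, h - 1) for h, d in zip(h_row, d_row)]
            for h_row, d_row in zip(prev_h, motion_mask)]

# A single pixel of motion sweeping left to right leaves a fading trail.
h = [[0, 0, 0]]
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    h = update_mhi(h, mask, tau=3)
```

After the three frames the image holds larger values where motion was more recent, so the intensity gradient encodes the direction of movement.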

[Figure: bar chart, milliseconds per frame (axis 0 to 300), for CUDA (feature + distance + argmin), CUDA (distance + argmin), CUDA (distance), and C. Bar labels as extracted: 133, 438, 51, 251]

Page 51: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Video Interest Point Detectors Video Interest Point Detectors

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

3D Harris Corner Detector

Corners

Dollar

Space Time Gabor

Corners

Periodic Motion

RSMB

Random Sampling of the Motion Boundary

Motion

Laptev

Dollar

RSMB

More is Good More is Good

Interest Point Detection Rates

Video Features Video Features

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Descriptors of information relevant to the task

Motion

Shape

Appearance

Computationally intensive

Development

Application

Averaging Flow (Efros et al 2003) Averaging Flow (Efros et al 2003)

Motion History Images

(Bobbick amp Davis 2001)

Motion History Images

(Bobbick amp Davis 2001)

otherwise

1t)yD(xif

1)1)ty(xHmax(0

τt)y(xH

τ

τ

133

438

51

251

0 50 100 150 200 250 300

CUDA (feature + distance + argmin)

CUDA (distance + argmin)

CUDA (distance)

C

milliseconds per frame

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

Histograms of

Oriented Gradients

Optical Flow

Histograms of

Oriented Gradients

Optical Flow

bull Partition the image window into local regions

bull Histogram the Image GradientOptical Flow based

on the direction and magnitude

bull Normalize over neighboring regions

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update each center to be the mean of its associated data points
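Lloyd's two alternating steps can be written down directly. The sketch below is a plain-Python illustration; the random initialization from the data, the fixed iteration count, and the choice to keep an empty cluster's old center are assumptions of the example rather than details from the slides.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Assignment step: index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def lloyd_kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # start from k data points
    for _ in range(iters):
        labels = assign(points, centers)
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers, labels = lloyd_kmeans(points, k=2)
print(sorted(centers))  # [[0.0, 0.5], [10.0, 10.5]]
```

Once the centers are fixed, `assign` alone is the vector quantizer: it maps each high-dimensional feature to one of k discrete codes.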

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
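To see why the output does not fit: 1 million features against 1000 centers gives 10^9 distances, roughly 4 GB if each is stored as a 4-byte float (the storage type is an assumption). One way to reduce as the computation proceeds is to compute distances for a small chunk of points and collapse each row to its argmin before moving on. The sketch below shows the idea in plain Python; the chunk size and function name are illustrative.

```python
def nearest_centers(points, centers, chunk=2):
    """Index of the closest center for every point, computed chunk by chunk.

    Only a chunk x len(centers) block of distances exists at any time;
    each block is reduced to one argmin per row and then discarded.
    """
    labels = []
    for start in range(0, len(points), chunk):
        block = [
            [sum((x - y) ** 2 for x, y in zip(p, c)) for c in centers]
            for p in points[start:start + chunk]
        ]
        # Reduction: keep only the column index of the row minimum.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return labels


points = [[0, 0], [9, 9], [1, 1], [10, 10], [5, 4]]
centers = [[0, 0], [10, 10]]
print(nearest_centers(points, centers))  # [0, 1, 0, 1, 0]
```

The stored result shrinks from one value per (point, center) pair to one index per point.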

Clustering Helps

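Pairwise Distance Implementation

[Figure: bar chart comparing the CUDA and C implementations; axis ticks 0, 2000, 4000, 6000]

Given

\[
A = [a_1, \dots, a_m], \qquad B = [b_1, \dots, b_n]
\]

Compute

\[
\begin{bmatrix}
d(a_1, b_1) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \cdots & d(a_2, b_n) \\
\vdots & \ddots & \vdots \\
d(a_m, b_1) & \cdots & d(a_m, b_n)
\end{bmatrix}
\]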
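As a concrete instance, the sketch below takes d to be the squared Euclidean distance, which is an assumption since the slide leaves d unspecified. It uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, so the bulk of the work is the inner products, which have the same structure as a matrix multiplication.

```python
def pairwise_sq_dist(A, B):
    """Return the m x n matrix D with D[i][j] = ||a_i - b_j||^2.

    A is a list of m vectors and B a list of n vectors of equal dimension.
    """
    a_sq = [sum(x * x for x in a) for a in A]  # squared norms of the rows of A
    b_sq = [sum(y * y for y in b) for b in B]  # squared norms of the rows of B
    return [
        [a_sq[i] + b_sq[j] - 2 * sum(x * y for x, y in zip(a, b))
         for j, b in enumerate(B)]
        for i, a in enumerate(A)
    ]


A = [[0, 0], [3, 4]]
B = [[0, 0], [3, 0], [0, 4]]
print(pairwise_sq_dist(A, B))  # [[0, 9, 16], [25, 16, 9]]
```

With floating-point inputs the expansion can produce tiny negative values for near-identical vectors, so clamping at zero before taking a square root is a common precaution.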

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Figure: diagram with inputs A and B, output C, and shared memory]
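The diagram itself is not recoverable, but its labels suggest the usual tiling scheme, in which a block of A and a block of B are staged in fast shared memory and reused for every entry of the matching block of C. The sketch below emulates that access pattern in plain Python. It is an assumption about what the figure showed, not the lecture's CUDA kernel; the tile size, the squared Euclidean distance, and the function name are illustrative.

```python
def tiled_pairwise_sq_dist(A, B, tile=2):
    """Fill C[i][j] = ||a_i - b_j||^2 one tile x tile output block at a time."""
    m, n = len(A), len(B)
    C = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        a_tile = A[i0:i0 + tile]        # stands in for a shared-memory tile of A
        for j0 in range(0, n, tile):
            b_tile = B[j0:j0 + tile]    # stands in for a shared-memory tile of B
            # Each loaded vector is reused for every output in this block.
            for di, a in enumerate(a_tile):
                for dj, b in enumerate(b_tile):
                    C[i0 + di][j0 + dj] = sum((x - y) ** 2 for x, y in zip(a, b))
    return C


A = [[0, 0], [3, 4], [1, 1]]
B = [[0, 0], [3, 0], [0, 4]]
print(tiled_pairwise_sq_dist(A, B))  # [[0, 9, 16], [25, 16, 9], [2, 5, 10]]
```

Every output block is independent of the others and contains no data-dependent branches, which matches the properties listed for algorithms that map well to GPUs.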

Timings on TRECVid 2008 System

[Figure: log-scale bar chart (1 to 10000 milliseconds) of per-stage timings for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction, and Pairwise Distance, comparing GPU + CPU against CPU; bar labels as extracted: 53, 79, 23, 240, 150, 53, 79, 3030, 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low      Medium   High
1000              Raw                    0.681    0.804    0.844
                  Norm                   0.708    0.799    0.840
                  Mt Inf                 0.594    0.804    0.848
500               Raw                    0.675    0.792    0.833
                  Norm                   0.701    0.791    0.825
                  Mt Inf                 0.626    0.783    0.819
200               Raw                    0.671    0.772    0.811
                  Norm                   0.701    0.779    0.818
                  Mt Inf                 0.614    0.720    0.756

Results (2009)

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020

(2008 results in parentheses)

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering / programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVID 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10,000 Classes

Page 57: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Histograms of Oriented Gradients / Optical Flow

• Partition the image window into local regions

• Histogram the image gradient / optical flow based on direction and magnitude

• Normalize over neighboring regions
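The three steps can be sketched in plain Python. This is a minimal illustration, not the system described in the lecture: the cell size, the 9 unsigned orientation bins, the 2x2 block layout, the L2 normalization and all function names are assumptions chosen for the example. For optical flow, the same histogramming would be applied to the flow vector (u, v) at each pixel instead of the gradient (gx, gy).

```python
import math

def cell_histogram(cell, n_bins=9):
    """Histogram gradient orientations in one cell, weighted by gradient magnitude.

    Simplification: gradients are taken only at the cell's interior pixels.
    """
    hist = [0.0] * n_bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi      # unsigned direction in [0, pi)
            b = min(int(angle / math.pi * n_bins), n_bins - 1)
            hist[b] += magnitude
    return hist

def l2_normalize(vec, eps=1e-6):
    norm = math.sqrt(sum(v * v for v in vec) + eps)
    return [v / norm for v in vec]

def hog_descriptor(image, cell_size=8, n_bins=9):
    """image: 2-D list of grey levels. Returns the concatenated block descriptors."""
    rows = len(image) // cell_size
    cols = len(image[0]) // cell_size
    # Step 1: partition the window into local regions (cells).
    # Step 2: histogram the gradient in each cell.
    hists = [
        [
            cell_histogram(
                [row[c * cell_size:(c + 1) * cell_size]
                 for row in image[r * cell_size:(r + 1) * cell_size]],
                n_bins,
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]
    # Step 3: normalize over neighboring regions (overlapping 2x2 blocks of cells).
    descriptor = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = hists[r][c] + hists[r][c + 1] + hists[r + 1][c] + hists[r + 1][c + 1]
            descriptor.extend(l2_normalize(block))
    return descriptor
```

On a 16x16 horizontal ramp image, every gradient points the same way, so all of the energy lands in the first orientation bin of each cell.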

K-Means Clustering

Vector quantization

Turns high-dimensional features into a discrete number of points

Given data, find representative "centers"

Lloyd's algorithm

For each data point, find the closest center

Update each center to be the mean of its associated data points
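The two alternating steps of Lloyd's algorithm can be sketched as follows. This is a minimal version under stated assumptions: squared Euclidean distance, caller-supplied initial centers, an empty cluster keeps its old center, and iteration stops when the assignments no longer change. The function names are illustrative only.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Step 1: for each data point, find the index of the closest center."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in points
    ]

def update(points, labels, centers):
    """Step 2: move each center to the mean of the points assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(old)          # assumption: empty clusters stay put
            continue
        new_centers.append(
            [sum(m[d] for m in members) / len(members) for d in range(len(old))]
        )
    return new_centers

def lloyd(points, centers, max_iter=100):
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centers)
        if new_labels == labels:             # converged: assignments are stable
            break
        labels = new_labels
        centers = update(points, labels, centers)
    return centers, labels
```

Once the centers are fixed, `assign` is the vector quantizer: it maps each high-dimensional feature to one of a discrete set of center indices.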

K-Means

Relies heavily on pairwise distance

Large data sets

1 million features with 100-200 dimensions

1000 centers

Cannot fit output in GPU memory

Will need to reduce as computation proceeds

Need efficient reduction operator
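To see why the output does not fit, note that 1 million features against 1000 centers gives 10^9 distances, which is about 4 GB if each is stored as a 4-byte float (the storage width is an assumption). The sketch below shows one way to reduce as the computation proceeds: distances are computed one chunk of points at a time and each row is immediately collapsed to the index of its nearest center, so the full matrix never exists. The chunk size and function names are hypothetical, and this CPU version only illustrates the idea, not the GPU kernel used in the lecture.

```python
def full_matrix_bytes(n_points, n_centers, bytes_per_value=4):
    """Memory needed to hold every point-to-center distance at once."""
    return n_points * n_centers * bytes_per_value

def nearest_center_chunked(points, centers, chunk_size=256):
    """Assign each point to its nearest center.

    At most chunk_size * len(centers) distances are held at any time.
    """
    labels = []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        # Distance block for this chunk only.
        block = [
            [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            for p in chunk
        ]
        # Reduction operator: argmin over each row, then discard the block.
        labels.extend(min(range(len(row)), key=row.__getitem__) for row in block)
    return labels
```

The reduction here is an argmin per row; any associative reduction (a running minimum, a per-center sum for the mean update) fits the same pattern.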

Clustering Helps

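Pairwise Distance Implementation

Given $A = [a_1, \dots, a_m]$ and $B = [b_1, \dots, b_n]$, compute

$$
\begin{bmatrix}
d(a_1, b_1) & d(a_1, b_2) & \cdots & d(a_1, b_n) \\
d(a_2, b_1) & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
d(a_m, b_1) & \cdots & \cdots & d(a_m, b_n)
\end{bmatrix}
$$

[Figure: bar chart comparing the CUDA and C implementations; axis ticks 0, 2000, 4000, 6000]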
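As a reference point, the pairwise distance problem for two sets A (m vectors) and B (n vectors) can be written directly as a nested loop that fills an m x n matrix with d(a_i, b_j). The sketch assumes Euclidean distance as the default d; the slide does not say which distance was used, and the function names are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_distances(A, B, d=euclidean):
    """Return the m x n matrix whose (i, j) entry is d(A[i], B[j])."""
    return [[d(a, b) for b in B] for a in A]
```

Every entry depends only on one row of A and one row of B, so all m * n entries can be computed independently, which is what makes the problem a natural fit for a GPU.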

CPU vs GPU

Algorithmic properties that map well to GPUs

1. Independent and highly data-local computations

2. Compute bound

3. Little branch divergence

Pairwise Distance Computation on the GPU

[Diagram: matrices A, B and C, with a block labeled "Shared Memory"; two follow-up slides repeat the A, B, C layout]

Timings on TRECVid 2008 System

[Figure: bar chart on a logarithmic axis (1 to 10000 milliseconds) comparing GPU + CPU against CPU for Fetch Frame, Optical Flow, Transfer to GPU, Feature Extraction and Pairwise Distance. Data labels as extracted: 53 79; 23; 240 150; 53 79; 3030 4947]

Benchmark

Dictionary Building Strategies

Dictionary Size   Histogramming Method   Rate of Detection
                                         Low     Medium   High
1000              Raw                    0.681   0.804    0.844
                  Norm                   0.708   0.799    0.840
                  Mt Inf                 0.594   0.804    0.848
500               Raw                    0.675   0.792    0.833
                  Norm                   0.701   0.791    0.825
                  Mt Inf                 0.626   0.783    0.819
200               Raw                    0.671   0.772    0.811
                  Norm                   0.701   0.779    0.818
                  Mt Inf                 0.614   0.720    0.756
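The difference between the Raw and Norm histogramming rows can be illustrated with a small sketch. It assumes that each local feature has already been quantized to a dictionary index (for example by nearest-center assignment), that "Raw" means plain counts, and that "Norm" means counts divided by their total; the slide does not define the normalization, so the L1 choice is an assumption. "Mt Inf" is not sketched because its definition is not given.

```python
def raw_histogram(labels, dictionary_size):
    """Count how many features fall on each dictionary entry."""
    hist = [0] * dictionary_size
    for k in labels:
        hist[k] += 1
    return hist

def normalized_histogram(labels, dictionary_size):
    """Same counts, scaled to sum to 1 (assumed L1 normalization)."""
    hist = raw_histogram(labels, dictionary_size)
    total = sum(hist)
    if total == 0:
        return [0.0] * dictionary_size
    return [h / total for h in hist]
```

Normalizing removes the dependence on how many features a window happens to contain, at the cost of discarding that count as a cue.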

Results (2009), with 2008 results in parentheses

Event         True Positives   False Alarm   Miss   Min DCR
Pointing      13 (57)          225 (2505)    1050   1.006
Cell To Ear   0 (8)            58 (4005)     194    1.060
Person Runs   1 (0)            38 (314)      106    0.997
Object Put    1 (21)           190 (2703)    620    1.020
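The slides do not say how Min DCR is computed. The sketch below assumes it is the detection cost rate used in the TRECVID surveillance event detection evaluations, DCR = P_miss + beta * R_FA with beta = Cost_FA / (Cost_miss * R_target), where R_FA is false alarms per hour of video. The default cost constants are recalled values that should be checked against the evaluation plan, and the `hours` argument is hypothetical because the amount of test video is not given here.

```python
def miss_probability(true_positives, misses):
    """Fraction of reference events that the system failed to detect."""
    total = true_positives + misses
    return misses / total if total else 0.0

def detection_cost_rate(p_miss, false_alarms, hours,
                        cost_miss=10.0, cost_fa=1.0, r_target=20.0):
    """Assumed form: DCR = P_miss + beta * (false alarms per hour)."""
    beta = cost_fa / (cost_miss * r_target)
    return p_miss + beta * (false_alarms / hours)
```

For the Pointing row, 13 true positives against 1050 misses give a miss probability of about 0.988, so under this definition a cost rate close to 1 is dominated by the misses rather than by the false alarms.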

Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

Contests/Evaluations Experience

Working with realistic data

Engineering and programming

Tight schedule, streamlined development

Examples of Evaluations

TRECVID 2012

Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet Large Scale Visual Recognition

10000 Classes

Page 65: Bridging the Semantic Gap - University of Illinois at ...ece417/LectureNotes/ECE417 Spring 2013... · Bridging the Semantic Gap ECE 417 Spring 2013 ... Company Trademark, including

Pairwise Distance Computation Pairwise Distance Computation

A B

C

Pairwise Distance Computation Pairwise Distance Computation

A B

C

Timings on TRECVid 2008 System Timings on TRECVid 2008 System

53 79

23

240 150

53 79

3030 4947

1

10

100

1000

10000

Fetch

Frame

Optical

Flow

Transfer to

GPU

Feature

Extraction

Pairwise

Distance

millise

co

nd

s

GPU + CPU CPU

Benchmark Benchmark

Dictionary Building Strategies Dictionary Building Strategies

Dictionary Size

Histogramming Method Rate of Detection

Low Medium High

1000 Raw 0681 0804 0844

Norm 0708 0799 0840

Mt Inf 0594 0804 0848

500 Raw 0675 0792 0833

Norm 0701 0791 0825

Mt Inf 0626 0783 0819

200 Raw 0671 0772 0811

Norm 0701 0779 0818

Mt Inf 0614 0720 0756

Results (2009) Results (2009)

True Positives False

Alarm

Miss Min DCR

Pointing 13 225 1050 1006

Cell To Ear 0 58 194 1060

Person Runs 1 38 106 0997

Object Put 1 190 620 1020

Results (2009) Results (2009)

True Positives False

Alarm

Miss Min DCR

Pointing 13 (57) 225 (2505) 1050 1006

Cell To Ear 0 (8) 58 (4005) 194 1060

Person Runs 1 (0) 38 (314) 106 0997

Object Put 1 (21) 190 (2703) 620 1020

(2008 Results)

Conclusions Conclusions

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

ContestsEvaluations Experience

Working with realistic data

Engineering Programming

Tight schedule streamlined development

Some practical problems are very hard to solve

Fusion of many different approaches

Take advantage of all available hardware

Cloud

GPUs

ContestsEvaluations Experience

Working with realistic data

Engineering Programming

Tight schedule streamlined development

Examples of Evaluations Examples of Evaluations

Trecvid 2012 Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Trecvid 2012 Task - Semantic indexing (SIN)

Task - Known-item search (KIS)

Task - Interactive surveillance event detection (SED)

Task - Instance search (INS)

Task - Multimedia event detection (MED)

Task - Multimedia event recounting (MER)

Pascal Visual Object Classes

Classification/detection

Segmentation

Person Layout

Action Classification

ImageNet

Large Scale Visual Recognition

10,000 Classes
