
  • Visual Tracking course @ UCU

    Day 1: Intro to tracking

    Dmytro Mishkin

    Center for Machine Perception

    Czech Technical University in Prague

    Lviv, Ukraine

    2017

  • Heavily based on tutorial:

    Visual Tracking

    by Jiri Matas

    “…Although tracking itself is by and large a solved problem…”

    -- Jianbo Shi & Carlo Tomasi, CVPR 1994

  • Course plan

    Day 1: Basics

    Visual tracking: not one, but many problems.

    The KLT tracker and Optical Flow

    HW: KLT and a simple tracking-by-detection tracker

    Day 2: CNN (mostly not about tracking)

    General CNN finetuning

    CNN design choices

    SotA: CNN-based trackers

    HW: Finetuning VGGNet for MDNet tracker

    Day 3: State of the art (continued)

    Discriminative Correlation Filters

    How to evaluate a tracker

    HW: KCF implementation

  • Outline of Day 1

    1. Visual tracking: not one, but many problems.

    2. The KLT tracker. Optical Flow. SIFT Flow. Flock of trackers.

    3. The Mean-Shift tracker

    4. Tracking by detection (basic)

  • Application domains of Visual Tracking

    • monitoring, assistance, surveillance, control, defense

    • robotics, autonomous car driving, rescue

    • measurements: medicine, sport, biology, meteorology

    • human computer interaction

    • augmented reality

    • film production and postproduction: motion capture, editing

    • management of video content: indexing, search

    • action and activity recognition

    • image stabilization

    • mobile applications

    • camera “tracking”

    Slide credit: Jiri Matas

  • Applications, applications, applications, …

    Slide credit: Jiri Matas

  • Tracking Applications…

    – Team sports: game analysis, player statistics, video annotation, …

    Slide credit: Jiri Matas

  • Sport examples

    http://cvlab.epfl.ch/~lepetit/
    http://www.dartfish.com/en/media-gallery/videos/index.htm

    Slide credit: Patrick Perez

  • Model-based Tracking: People and Faces

    http://cvlab.epfl.ch/research/completed/realtime_tracking/
    http://www.cs.brown.edu/~black/3Dtracking.html

    Slide credit: Patrick Perez

  • Tracking is commonly used in practice…

    • Tracking has been a popular research topic for decades:
      see CVPR, ICCV, ECCV, …

    • But there is no online course devoted to tracking…

    • nor is there much coverage in computer vision courses...

    • nor is it well covered in textbooks

  • Is it clear what tracking is?

    Video credit: Helmut Grabner

    Slide credit: Jiri Matas

  • Tracking: Formulation - Literature

    Surprisingly little is said about tracking in standard textbooks.

    Limited to optic flow, plus some basic trackers, e.g. Lucas-Kanade.

    Definition (0):

    [Forsyth and Ponce, Computer Vision: A modern approach, 2003]

    “Tracking is the problem of generating an inference about the motion of an object given a sequence of images. Good solutions of this problem have a variety of applications…”

    Slide credit: Jiri Matas

  • Formulation (1): Tracking

    Establishing point-to-point correspondences in consecutive frames of an image sequence.

    Notes:

    • The concept of an “object” in the F&P definition disappeared.

    • If an algorithm correctly established such correspondences, would that be a perfect tracker?

    • tracking = motion estimation?

    Slide credit: Jiri Matas

  • Tracking is Motion Estimation / Optic Flow?

    Slide credit: Jiri Matas

  • Tracking is Motion Estimation / Optic Flow?

  • Tracking is Motion Estimation / Optic Flow?

    Motion “pattern” / Camera tracking

    Dense motion field

    http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation/

    http://www.youtube.com/watch?v=ckVQrwYIjAs

    Sparse motion field estimate

    Slide credit: Jiri Matas

  • Tracking is Motion Estimation / Optic Flow?

    Motion field

    • The motion field is the projection of the 3D scene motion into the image.

    Slide credit: James Hays

  • Tracking is Motion Estimation / Optic Flow?

    • Definition: optical flow is the apparent motion of brightness patterns in the image

    • Ideally, optical flow would be the same as the motion field

    • Have to be careful: apparent motion can be caused by lighting changes without any actual motion

    • Think of a uniform rotating sphere under fixed lighting vs. a stationary sphere under moving illumination

    Slide credit: James Hays

  • Tracking is Motion Estimation / Optic Flow?

    Slide credit: Jason Corso

  • Optic Flow

    Standard formulation:

    • At every pixel, a 2D displacement is estimated between consecutive frames

    Missing:

    • occlusion – disocclusion handling: pixels visible in one image only
      – in the standard formulation, “don’t know” is not an answer

    • considering the 3D nature of the world

    • large displacement handling – only recently addressed (EpicFlow 2015)

    Practical issues hindering progress in optic flow:

    • is the ground truth ever known?
      – learning and performance evaluation problematic (synthetic sequences ...)

    • requires generic regularization (smoothing)

    • failure (assumption validity) not easy to detect

    In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis

  • Formulation (1): Tracking

    Establishing point-to-point correspondences in consecutive frames of an image sequence.

    Consider the Bolt sequence:

  • Formulation (1*): Tracking

    Establishing point-to-point correspondences between all pairs of frames in an image sequence

    • which leads to the concept of long-term tracking, to be discussed later

  • Definition (2): Tracking

    Given an initial estimate of its position, locate X in a sequence of images,

    where X may mean:

    • A (rectangular) region

    • An “interest point” and its neighbourhood

    • An “object”

    This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011.

    Smeulders T-PAMI13:
    Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame.

  • Formulation (3): Tracking as Segmentation

    J. Fan et al. Closed-Loop Adaptation for Robust Tracking, ECCV 2010


  • Tracking as model-based segmentation


  • Tracking as segmentation

    http://vision.ucsd.edu/~kbranson/research/cvpr2005.html

    http://www2.imm.dtu.dk/~aam/tracking/

    • heart


  • A “standard” CV tracking method output

    Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.

  • Rotated B-Boxes – Interpretation?

  • Formulation (4): Tracking

    Given an initial estimate of the pose and state of X:

    In all images in a sequence (in a causal manner),

    1. estimate the pose and state of X

    2. (optionally) update the model of X

    • Pose: any geometric parameter (position, scale, …)

    • State: appearance, shape/segmentation, visibility, articulations

    • Model update: essentially a semi-supervised learning problem
      – a priori information (appearance, shape, dynamics, …)
      – labeled data (“track this”) + unlabeled data = the sequences

    • Causal: for estimation at time T, use only information from times t ≤ T

  • Tracking for Black Mirror Blocking


  • Tracking in 6D.


  • Tracking-Learning-Detection (TLD)


  • A “miracle”: Tracking a Transparent Object

    Video credit: Helmut Grabner

    H. Grabner, H. Bischof, On-line boosting and vision, CVPR, 2006.

  • Tracking the “Invisible”

    H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010.

  • Video: Octopus hunting and changing color

  • Formulation (5): Tracking

    Given an estimate of the pose (and state) of X in “key” images (and a priori information about X),

    In all images in a sequence (in a causal manner):

    1. estimate the pose and state of X

    2. (optionally) estimate the state of the scene [e.g. “supporters”]

    3. (optionally) update the model of X

    Out: a sequence of poses (and states), (and/or the learned model of X)

    Notes:

    • Often, not all parameters of pose/state are of interest, and the state is estimated as a side-effect.

    • If model acquisition is the desired output, the pose/state estimation is a side-effect.

    • The model may include: relational constraints and dynamics, appearance change as a function of pose and state.

  • Short-term v. Long-term Tracking v. OF

    Short-term Trackers:

    • primary objective: “where is X?” = precise estimation of pose

    • secondary: be fast; don’t lose track

    • evaluation methodology: frame number where failure occurred

    • examples: Lucas-Kanade tracker, mean-shift tracker

    Long-term Tracker-Detectors:

    • primary objective: unsupervised learning of a detector, since every (short-term) tracker fails, sooner or later (disappearance from the field of view, full occlusion)

    • avoid the “first failure means lost forever” problem

    • close to an online-learned detector, but assumes and exploits the fact that a sequence with temporal pose/state dependence is available

    • evaluation methodology: precision/recall, false positive/negative rates (i.e. like detectors)

    • note: the detector part may help even for short-term tracking problems; it provides robustness to fast, unpredictable motions.

    Optic Flow, Motion estimation: establish all correspondences in a sequence

  • Other Tracking Problems:

    … multiple object tracking …

  • Multi-object Tracking

    • ant tracking 1

    • result 1

    Tracking as detection and identification

  • Other Tracking Problems:

    Cell division.

    http://www.youtube.com/watch?v=rgLJrvoX_qo

    Three rounds of cell division in Drosophila melanogaster.

    http://www.youtube.com/watch?v=YFKA647w4Jg

    … splitting and merging events …

  • The World of Fast Moving Objects

    ● FMO – an object that moves over a distance exceeding its size within the exposure time

    ● Standard datasets (VOT, OTB, ALOV) do not include FMOs

  • FMO Examples

    ● Ping pong, tennis, frisbee, volleyball, badminton, squash, darts, arrows, softball

    ● Some FMOs are nearly homogeneous, while others have coloured texture

    ● SOTA trackers fail...

  • Applications: deblurring, temporal super-resolution

  • Estimation of Appearance

    ● Reconstruction of FMOs blurred by motion and rotation
    ● Axis of rotation, angle of rotation, full 3D appearance, …

  • Motion Estimation from a Single Image


  • Tracking problem variations:

    • multiple cameras

    • RGBD sensors

    • combination of sensors (accelerometer + visual, IMU + visual, IMU + GPS + visual)

  • Tracking problems

    • motion estimation (establishing point-to-point correspondences) v. segmentation (region-to-region correspondences)

    • long-term v. short-term

    • one object v. multiple objects

    • causal v. non-causal (= offline video analysis)

    • single v. multi-camera

    • static v. moving camera

  • The KLT tracker


  • The KLT Tracker

    Slide credit: Kris Kitani

  • KLT Tracker

    Slide credit: Tomas Svoboda

    Importance in Computer Vision

    • First published in 1981 as an image registration method [3].

    • Improved many times, most importantly by Carlo Tomasi [5, 4].

    • Free implementation(s) available¹.

    • After more than two decades, a project² at CMU was dedicated to this single algorithm, with results published in a premium journal [1].

    • Part of a plethora of computer vision algorithms.

    ¹ http://www.ces.clemson.edu/~stb/klt/
    ² http://www.ri.cmu.edu/projects/project_515.html

  • Image alignment

    Slide credit: Tomas Svoboda

    The goal is to align a template image T(x) to an input image I(x); x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
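    The alignment equations on the following slides did not survive extraction; as a reference, the standard Lucas-Kanade objective and Gauss-Newton update (in the notation of Baker & Matthews) for a warp W(x; p) are

    $$\min_{\Delta p} \sum_{x} \big[ I(W(x;\, p + \Delta p)) - T(x) \big]^2 .$$

    Linearizing around the current parameters p gives the closed-form update

    $$\Delta p = H^{-1} \sum_{x} \Big[ \nabla I \,\frac{\partial W}{\partial p} \Big]^{\top} \big[ T(x) - I(W(x; p)) \big], \qquad
    H = \sum_{x} \Big[ \nabla I \,\frac{\partial W}{\partial p} \Big]^{\top} \Big[ \nabla I \,\frac{\partial W}{\partial p} \Big],$$

    which is iterated until the norm of Δp falls below a threshold.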

  • Original Lucas-Kanade algorithm I

    Slide credit: Tomas Svoboda

  • Original Lucas-Kanade algorithm II

    Slide credit: Tomas Svoboda

  • Original Lucas-Kanade algorithm III

    Slide credit: Tomas Svoboda

  • Original Lucas-Kanade algorithm IV

    Slide credit: Tomas Svoboda

  • Original Lucas-Kanade summary

    Slide credit: Tomas Svoboda

  • KLT Tracker

    For good conditioning, the patch must be textured/structured enough:

    • Uniform patch: no information

    • Contour element: aperture problem (one-dimensional information)

    • Corners, blobs and texture: best estimate

    [Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

    Slide credit: Patrick Perez

  • Aperture Problem: Example

    Image from: Gary Bradski slides

    If we look through a small hole…

    Video by Olha Mishkina

  • Multi-resolution Lucas-Kanade

    – Arbitrary displacement

    • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

    Slide credit: James Hays
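    To make the multi-resolution scheme concrete, here is a minimal sketch using OpenCV's pyramidal Lucas-Kanade (cv2.calcOpticalFlowPyrLK); the frame paths and parameter values are illustrative, not from the slides:

```python
# Minimal pyramidal KLT sketch with OpenCV; frame paths are placeholders.
import cv2

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Well-conditioned (textured) points, cf. Shi & Tomasi "Good Features to Track"
pts = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01,
                              minDistance=7)

# Coarse-to-fine Lucas-Kanade over maxLevel+1 pyramid levels;
# at each level, Gauss-Newton-like iterations run until eps or max count.
nxt, status, err = cv2.calcOpticalFlowPyrLK(
    prev, curr, pts, None,
    winSize=(21, 21), maxLevel=3,
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

good_new = nxt[status.ravel() == 1]   # successfully tracked points
print("tracked %d of %d points" % (len(good_new), len(pts)))
```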

  • Monitoring quality

    – Translation is usually sufficient for small fragments, but:

    • Perspective transforms and occlusions cause drift and loss

    – Two complementary options:

    • Kill tracklets when the minimum SSD is too large

    • Compare as well with the initial patch under an affine transform (warp) assumption

    Slide credit: Patrick Perez

  • Characteristics of KLT

    • cost function: sum of squared intensity differences between template and window

    • optimization technique: gradient descent

    • model learning: no update / last frame / convex combination

    • attractive properties:

    – fast

    – easily extended to image-to-image transformations with multiple parameters

  • From KLT to whole image optical flow

    KLT:

    • cost function: sum of squared intensity differences between template and window

    • optimization technique: gradient descent

    Optical Flow (Horn & Schunck, 1981):

    • cost function: sum of squared intensity differences between two images + smoothness constraint

    • optimization technique: Gauss–Seidel
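    For reference (the formula on the next slide was lost in extraction), the classical Horn–Schunck energy over the flow field (u, v) is

    $$E(u, v) = \iint \big( I_x u + I_y v + I_t \big)^2 + \alpha^2 \big( \|\nabla u\|^2 + \|\nabla v\|^2 \big)\, dx\, dy,$$

    whose Euler–Lagrange equations are solved iteratively (Gauss–Seidel / Jacobi):

    $$u^{k+1} = \bar{u}^{k} - \frac{I_x \big( I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t \big)}{\alpha^2 + I_x^2 + I_y^2}, \qquad
    v^{k+1} = \bar{v}^{k} - \frac{I_y \big( I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t \big)}{\alpha^2 + I_x^2 + I_y^2},$$

    where $\bar{u}, \bar{v}$ denote local neighbourhood averages of the flow.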

  • Optical flow

    Slide credit: Cordelia Schmid

  • SIFT Flow: replace pixels with dense SIFT

    • cost function: sum of absolute SIFT differences between template and window + small displacement term + smoothness term

    • optimization technique: dual-layer loopy belief propagation

    w(p) = (u(p), v(p)) is the displacement vector at pixel location p = (x, y)

    Image from Liu et al. 2011, TPAMI

    https://cseweb.ucsd.edu/classes/sp06/cse151/lectures/belief-propagation.pdf
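    The energy being minimized (reproduced here following Liu et al. 2011, since the slide's formula was lost) combines the three terms above, with truncated robust penalties:

    $$E(\mathbf{w}) = \sum_{p} \min\!\big( \|s_1(p) - s_2(p + \mathbf{w}(p))\|_1,\, t \big)
    + \sum_{p} \eta \big( |u(p)| + |v(p)| \big)
    + \sum_{(p,q) \in \varepsilon} \min\!\big( \alpha |u(p) - u(q)|,\, d \big) + \min\!\big( \alpha |v(p) - v(q)|,\, d \big),$$

    where $s_1, s_2$ are the dense SIFT images and $\varepsilon$ is the 4-neighbourhood edge set.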

  • SIFT Flow: replace pixels with dense SIFT

    Image from Liu et al. 2011, TPAMI

    SIFT is not critical – it could be any “descriptor flow”

  • EpicFlow

    • Avoids the coarse-to-fine scheme

    • Other matching-based approaches [Chen’13, Lu’13, Leordeanu’13]:

    – less robust matching algorithms (e.g. PatchMatch)

    – complex schemes to handle occlusions & motion discontinuities

    – no geodesic distance

    (SED) (DeepMatching)

    EpicFlow, Revaud et al., 2015. Slide credit: Jerome Revaud

  • FlowNet

    Fischer et al. 2015, FlowNet: Learning Optical Flow with Convolutional Networks. Slide credit: Thomas Brox

  • FlowNet 2.0: Refine and refine

    Ilg et al. 2016, FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks.

  • Sintel benchmark

    Butler et al. 2012, A Naturalistic Open Source Movie for Optical Flow Evaluation.

  • SotA Optical Flow: qualitative results

    Ilg et al. 2016, FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks.

  • The Flock of Trackers

    (with error prediction)

    Tomas Vojir


  • The Flock of Trackers

    • An n×m grid (say 10×10) of Lucas-Kanade / ZSP trackers

    • Trackers initialised on a regular grid

    • Robust estimation of global motion: either “median” direction/scale or RANSAC (up to a homography)

    • Each tracker has a failure predictor
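    A minimal sketch of this scheme, assuming OpenCV: a regular grid of local LK trackers inside the bounding box, with the global motion taken as the median of the surviving per-cell displacements (the grid size, window size and the helper name fot_step are illustrative):

```python
# Flock-of-Trackers step sketch: grid of LK trackers + median global motion.
import cv2
import numpy as np

def fot_step(prev, curr, box, grid=10):
    """box = (x, y, w, h); returns the box shifted by the median displacement."""
    x, y, w, h = box
    gx, gy = np.meshgrid(np.linspace(x, x + w, grid),
                         np.linspace(y, y + h, grid))
    pts = np.float32(np.dstack([gx, gy]).reshape(-1, 1, 2))  # regular grid
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None,
                                              winSize=(15, 15), maxLevel=2)
    d = (nxt - pts).reshape(-1, 2)[status.ravel() == 1]
    if len(d) == 0:
        return box                      # every local tracker failed
    dx, dy = np.median(d, axis=0)       # robust "median" global motion
    return (x + dx, y + dy, w, h)
```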

    Two classical Failure Predictors

    Normalized Cross-correlation:

    • Compute the normalized cross-correlation between each local tracker's patch at time t and at time t+1

    • Sort local trackers according to the NCC response

    • Filter out the bottom 50% (median rule)

    Forward-Backward [1]:

    • Compute correspondences of local trackers from time t to t+k and from t+k back to t, and measure the k-step error

    • Sort local trackers according to the k-step error

    • Filter out the bottom 50% (median rule)

    [1] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-Backward Error: Automatic Detection of Tracking Failures. ICPR, 2010.
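    A minimal sketch of the forward-backward predictor for k = 1, assuming OpenCV; the median filtering rule follows the slide, while names and parameters are illustrative:

```python
# Forward-backward failure prediction: track t -> t+1 -> t and keep the
# half of the local trackers with the smaller FB error.
import cv2
import numpy as np

def fb_filter(prev, curr, pts):
    fwd, st1, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None)
    bwd, st2, _ = cv2.calcOpticalFlowPyrLK(curr, prev, fwd, None)
    # FB error: distance between each point and its back-tracked position
    fb_err = np.linalg.norm((pts - bwd).reshape(-1, 2), axis=1)
    ok = (st1.ravel() == 1) & (st2.ravel() == 1)
    fb_err[~ok] = np.inf
    keep = fb_err <= np.median(fb_err[ok])   # filter out the bottom 50%
    return pts[keep], fwd[keep]
```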

  • Failure Predictor: Neighborhood Consistency

    • For each local tracker i, a neighbourhood consistency score is computed (reconstructing the slide's formula from its legend) as

    $$S_i^{Nh} = \sum_{j \in N_i} \big[\, \|\Delta_i - \Delta_j\| < \varepsilon \,\big],$$

    where $N_i$ is the four-neighbourhood of local tracker i, $\Delta$ is the displacement, and $\varepsilon$ is the displacement error threshold.

    • Local trackers with $S_i^{Nh} < \theta_{Nh}$ are filtered out

    • Setting: $\varepsilon$ = 0.5 px, $\theta_{Nh}$ = 1
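    A minimal numpy sketch of this predictor, assuming the local trackers form a regular grid with a displacement field of shape (rows, cols, 2); border handling (np.roll wraps around) is simplified for brevity:

```python
# Neighbourhood-consistency score: count 4-neighbours whose displacement
# agrees within eps; trackers with score below theta are filtered out.
import numpy as np

def neighbourhood_consistency(disp, eps=0.5, theta=1):
    rows, cols, _ = disp.shape
    score = np.zeros((rows, cols), dtype=int)
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:   # 4-neighbourhood
        shifted = np.roll(disp, (di, dj), axis=(0, 1))  # wraps at borders
        score += (np.linalg.norm(disp - shifted, axis=2) < eps)
    return score >= theta   # True = consistent, kept; False = filtered out
```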

  • Failure Predictors: Temporal consistency

    • The Markov Model predictor (MMp) models each local tracker as a two-state (inlier, outlier) probabilistic automaton with transition probabilities pᵢ(sₜ₊₁ | sₜ)

    • MMp estimates the probability of being an inlier for all local trackers ⇒ filter by:
      1) a static threshold
      2) a dynamic threshold

    • Learning is done incrementally (what is learned are the transition probabilities between states)

    • Can be extended by “forgetting”, which allows faster response to object appearance change

  • The combined outlier filter

    Combining three indicators of failure:

    – Local appearance (NCC)

    – Neighbourhood consistency (Nh) (similar to the smoothness assumption used in optic flow estimation)

    – Temporal consistency using a Markov Model predictor (MMp)

    • Together they form a stronger predictor than the popular forward-backward

    • Negligible computational cost (less than 10%)

    T. Vojir and J. Matas. Robustifying the flock of trackers. CVWW ’11.

  • FoT Error Prediction: Bike, tight box (ext. viewer)

  • FoT Error Prediction: Bike, loose box (ext. viewer)

  • FoT Error Prediction (ext. viewer)

  • The Mean-shift Tracker

    (colour-based tracking)


  • Iterative tracker for histogram-based descriptors

    The Mean-shift Tracker (colour-based tracking)

    Look for a local maximum (mean shift, KLT, particle filtering, etc.)

    Slide credit: Robert Collins

  • Histogram Appearance Models

    • Motivation – to track non-rigid objects (like a walking person), it is hard to specify an explicit 2D parametric motion model.

    • Appearances of non-rigid objects can sometimes be modeled with color distributions.

    Slide credit: Robert Collins

  • Histogram Appearance Models

    Slide credit: Robert Collins

  • Histogram Appearance Models

    Slide credit: Robert Collins

  • Mean Shift Algorithm

    The mean-shift algorithm is an efficient approach to tracking objects whose appearance is defined by color.

    (Not limited to color, however – it could also use edge orientations, texture, motion: any histogram.)

    Slide credit: Robert Collins

  • Mean Shift Algorithm

    (a sequence of animation slides illustrating the mode-seeking iterations) …repeat

    Slide credit: Kris Kitani

  • Mean Shift Tracker

    Now all together


  • Color-based tracking

    – Global description of the tracked region: a color histogram

    – Reference histogram with B bins, set at track initialization

    – Candidate histogram at the current instant, gathered in a region of the current image

    – At each instant:

    • search around the previous position estimate

    • iterative search, initialized there: mean-shift-like iteration

    Slide credit: Patrick Perez


  • Color distributions and similarity

    – Color histogram weighted by a kernel

    • The kernel's elliptic support sits on the object

    • Central pixels contribute more

    • Makes differentiation possible

    • H: “bandwidth” symmetric positive-definite matrix, related to the bounding box dimensions

    • k: “profile” of the kernel (Gaussian or Epanechnikov)

    – Histogram dissimilarity measure

    • Bhattacharyya measure

    • Symmetric, bounded, null only for equality

    • 1 − dot product on the positive quadrant of the unit hypersphere

    Slide credit: Patrick Perez
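    In symbols (the slide's formulas were lost; these follow Comaniciu et al. 2003): the kernel-weighted histogram of a region centred at c, and the Bhattacharyya coefficient and distance between a candidate p and the reference q, are

    $$q_b = C \sum_{i} k\!\left( \| H^{-1}(x_i - c) \|^2 \right) \delta\big[ b(x_i) - b \big], \qquad
    \rho(p, q) = \sum_{b=1}^{B} \sqrt{p_b\, q_b}, \qquad
    d(p, q) = \sqrt{1 - \rho(p, q)},$$

    where $b(x_i)$ is the histogram bin of the pixel at $x_i$, C normalizes the histogram to sum to one, and k is the kernel profile.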

  • Color distributions and similarity

    Slide credit: Kris Kitani

  • Iterative ascent

    – Non-quadratic minimization: iterative ascent with linearizations

    – Setting the move to (g = −h′) yields a simple algorithm…

  • Mean-shift tracker

    • In frame t+1:

    – Start the search at the previous position

    – Until stop:

    • Compute the candidate histogram

    • Weight pixels inside the kernel support

    • Move the kernel

    • Check overshooting

    • If converged, stop; else iterate

    Slide credit: Patrick Perez
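    For reference, a minimal colour-based mean-shift loop with OpenCV's built-in cv2.meanShift; the video path, initial box and bin count are placeholders:

```python
# Mean-shift tracking sketch: hue histogram + back-projection + cv2.meanShift.
import cv2

cap = cv2.VideoCapture("sequence.avi")
ok, frame = cap.read()
x, y, w, h = 200, 150, 60, 80                      # initial box (by hand)

# Reference histogram with B = 16 bins, set at track initialization
roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([roi], [0], None, [16], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
box = (x, y, w, h)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Back-projection: per-pixel likelihood under the reference histogram
    bp = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # Mean-shift iterations climb to the nearest mode of the back-projection
    _, box = cv2.meanShift(bp, box, crit)
```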

  • Mean Shift tracking example

    Feature space: 16×16×16 quantized RGB. Target: manually selected in the 1st frame.

    Average mean-shift iterations: 4

  • Mean Shift tracking example

    D. Comaniciu, V. Ramesh, P. Meer: Kernel-Based Object Tracking, TPAMI, 2003.

    http://comaniciu.net/Papers/KernelTracking.pdf

  • Some examples of MS tracking

    http://www.youtube.com/watch?v=RG5uV_h50b0


  • Pros and cons

    – Low computational cost (easily real-time)

    – Surprisingly robust:

    • Invariant to pose and viewpoint

    • Often no need to update the reference color model

    – Invariance comes at a price:

    • Position estimate prone to fluctuation

    • Scale and orientation not well captured

    • Sensitive to color clutter (e.g., teammates in team sports)

    – Local search by gradient descent

    – Problems:

    • abrupt moves

    • occlusions

    Slide credit: Patrick Perez

  • Variants

    – Remove background corruption in the reference:

    • Simple segmentation based on surrounding color at initialization

    • Re-estimation of the foreground model

    • Amounts to zeroing out bins for colors more frequent in the surroundings than in the selection

    – Scale/orientation estimation:

    • Originally: greedy search around the current scale/orientation

    • Afterwards: incorporate loose spatial layout (via multiple spatial kernels or spatial partitioning with sub-models)

    – Robustness to camera movement:

    • Robust estimation of the dominant apparent motion

    • Start the search at the previous position displaced according to the dominant motion

    Slide credit: Patrick Perez

  • Variants (2)

    – Improved discrimination for improved robustness:

    • Selection of the best color space or color combinations to distinguish foreground from background

    • Alternative or complementary features (intensity, textures, co-occurrences)

    – Improved accuracy:

    • Coupling with more precise though more fragile tracking (fragment-based in particular)

    – Smoother similarity measure:

    • Kullback-Leibler

    – Alternative search algorithms:

    • Trust region

    Slide credit: Patrick Perez

  • Further variants

    – On-the-fly adaptation:

    • Radical pose or illumination changes require adaptation

    • Usually linear mixing with exponential forgetting

    • Still an open dilemma: adapt when required, not during occlusions…

    – Probabilistic version:

    • Coupled with a Kalman or particle filter for better handling of occlusions

    – Automatic multi-object tracking:

    • Detection of a category of objects

    • Sequential tracking or batch “tracking”

    • Handling of multiple trackers (exclusion principle, multi-object occlusion)

    Slide credit: Patrick Perez

  • Tracking by detection


    YOLO: Real-Time Object Detection, https://www.youtube.com/watch?v=VOC3huqHrss

  • Tracking by detection and correspondence

    • What if we combine an object (or local feature) detector with matching logic? Recall the wide baseline stereo problem.

    • The wide baseline stereo problem – the automatic estimation of a geometric transformation (flow?) and the selection of consistent correspondences between view pairs separated by a wide (temporal?) baseline.

    MODS: Fast and Robust Method for Two-View Matching. Mishkin et al., 2015.

  • Tracking by detection

    • If the object does not move and only the camera moves, then we can track the whole scene:
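    A minimal sketch of this idea with OpenCV: detect ORB features in every frame, match them against a template view, and select consistent correspondences with RANSAC, as in two-view matching (names and thresholds are illustrative; MODS itself is a considerably more robust pipeline):

```python
# Tracking-by-detection via per-frame wide-baseline-style matching.
import cv2
import numpy as np

orb = cv2.ORB_create(1000)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)
kp0, des0 = orb.detectAndCompute(template, None)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def locate(frame_gray):
    """Estimate the homography mapping the template into this frame."""
    kp, des = orb.detectAndCompute(frame_gray, None)
    if des is None:
        return None
    matches = matcher.match(des0, des)
    if len(matches) < 8:
        return None
    src = np.float32([kp0[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # RANSAC selects the geometrically consistent correspondences
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H
```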

  • THANK YOU.

    Questions, please?
