a compact and discriminative face track descriptorvgg/publications/2014/... · very simple yet...
Post on 25-Jul-2020
2 Views
Preview:
TRANSCRIPT
Omkar M Parkhi, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
A Compact and Discriminative!Face Track Descriptor
Recognising and verifying faces in videos 2
Recognition Verification
same
different
VF2: a new compact face track descriptor
▶Discriminative!▶Useful for different tasks (Recognition, Verification)!▶Extremely compact
3
Face track: sequence of face detections in consecutive frames.
face track descriptor
Large scale face retrieval
▶ Example of a typical target dataset!▶ 5 years of evening news programs!▶ 10,000 hrs of broadcast !▶ 20 Million frames, !▶ 2.1 Million face tracks!▶ Real time performance
4
http://www.robots.ox.ac.uk/~vgg/research/on-the-fly/
▶ 30 frames per track on average!▶ Typical 4000D descriptor → 1 TB!▶ Our descriptor → 270 MB!
d2W (x, y)
[011001010]
Outline
1. Dense feature computation!
2. Fisher Vector encoding!
3. Video and jittered pooling!
4. Compression by metric learning!
5. Binarisation!
6. Results
5
1. Dense feature computation
▶ Input: a face track!▶ Aligned or unaligned!▶ No facial landmarks required (eyes, nose, etc.)!
▶ Output: a set of local features !▶ Extracted from all frames!▶ Dense RootSIFT at multiple scales!▶ 64-D PCA
6
d2W (x, y)
[011001010]
Outline
1. Dense feature computation!
2. Fisher Vector encoding!
3. Video and jittered pooling!
4. Compression by metric learning!
5. Binarisation!
6. Results
7
GMM
[Perronnin et al. ECCV 2012]
2. Fisher Vector encoding 8
first and second order statistics
FV encoding+ sqrt-L2
normalisation
� =
2
6666666664
v1u1v2u2...vKuK
3
7777777775 uk =1
Mp2⇡k
MX
i=1
�k(xi )
✓xi � µk
�i� 1
◆2
vk =1
Mp⇡k
MX
i=1
�k(xi )xi � µk
�i
xi
Gaussians!(μk,Σk)
µk
xi�k(xi )
AssignmentHard
Dense SIFT
Gaussian components as part detectors
2. Fisher Vector Encoding 9
2
66664 xW � 1
2yH � 1
2
3
77775
Spatial (x,y) Augmentation
d2W (x, y)
[011001010]
Outline
1. Dense feature computation!
2. Fisher Vector encoding!
3. Video and jittered pooling!
4. Compression by metric learning!
5. Binarisation!
6. Results
10
3. Video and jittered pooling
▶ Typically each frame is pooled independently!▶ Complex inference procedures combining multiple descriptors!▶ Large memory footprint
11
[Sivic et al. CVPR 09, Everingham et. al IVC 09,, Wolf et al. CVPR 2011]
3. Video and jittered pooling
▶ Single descriptor per track!▶ Smaller memory footprint!▶ Easy to use!▶ Improved performance
12
[Application to Action Recognition: Oneata, Verbeek, Schmid ICCV 2013]
3. Video and jittered pooling
▶ Data augmentation!▶ Data augmentation without training set increase!▶ Improvement in the performance
13
[Paulin et al. CVPR 2014]
d2W (x, y)
[011001010]
Outline
1. Dense feature computation!
2. Fisher Vector encoding!
3. Video and jittered pooling!
4. Compression by metric learning!
5. Binarisation!
6. Results
14
Learn to discriminate faces
4. Metric Learning 15
d2W (x, y) = kW x�W yk2 < b
same person
d2W (u, v) = kWu�W vk2 > b
x y uv
different people
d2W (x, y) = kW x�W yk2
[Simonyan, Parkhi, Vedaldi, Zisserman BMVC 2013]
W
x
=
learnt projectionFisherVector
z
d2W (x, y)
[011001010]
Outline
1. Dense feature computation!
2. Fisher Vector encoding!
3. Video and jittered pooling!
4. Compression by metric learning!
5. Binarisation!
6. Results
16
Parseval Tight Frame
5. Binarisation
▶ Low-dimensional real-valued descriptor → high dimensional binary !▶ 4x decrease in memory footprint (128D real → 1024D binary)!▶ Fast distance computation!▶ Alternative binarisation methods could be used
17
q ⨉ m!Columns from a
random rotation matrix
sign= q
q bits only
0101010
[Jégou et al. ICASSP 2012, Simonyan et al. PAMI 2014]
real-valued descriptor
m⨉
U z sign(Uz)zU
q
d2W (x, y)
[011001010]
Outline
1. Dense feature computation!
2. Fisher Vector encoding!
3. Video and jittered pooling!
4. Compression by metric learning!
5. Binarisation!
6. Results
18
Face Verification
YouTube Faces Dataset 19
▶ Face verification in videos!▶ 3,425 videos of 1,595 celebrities!▶ Videos collected from internet!▶ Wide pose, expression and illumination variation!▶ 10 splits of 600 pairs of videos!
▶ Restricted setting: Use provided pairs!▶ Unrestricted setting: Free to form own pairs.
differentsame
[Wolf, Hassner, Moaz CVPR 2011]
Image Pool (Soft assignment FV)
Video Pool (Soft assignment FV)
Video Pool hard asignment fv
Video Pool + Jittered Pool
Video Pool. + Binar. 1024 bit + jitt.
Video Pool. + Joint sim. + jitt.
Error0 4.5 9 13.5 18
12.3
13.4
14.2
16.2
15
17.3
Face Verification
YouTube Faces Dataset 20
Face Verification
YouTube Faces Dataset 21
MGBS & SVM-APEM FUSION
STFRD & PMMLVSOF & OSS (Adaboost)
DDML (Combined)VF 1024D (binary)
VF 256DDeep Face (facebook.com)
Error0 5.5 11 16.5 22
8.612.3
13.418.5
2019.9
21.421.2
Requires additional training data.
2
2
Weakly supervised face classification
Oxford Buffy Dataset 22
▶ “Buffy The Vampire Slayer”!▶ Face tracks from 7 episodes of season 5.!▶ Both frontal and profile detections!▶ Weak supervision from transcript and subtitles!▶ Multi Class classification for every episode
[Everingham et al. IVC 2009, Sivic et al. CVPR 2009]
Weakly supervised classification
Oxford Buffy Dataset 23
Sivic et al. (HOG RBF MKL)
VF ( GMMs trained on Buffy )
VF ( GMMs trained on YTF )
VF ( GMMs trained on YTF ) + Jitt. Pool 1024D
VF ( GMMs trained on YTF 2048b)
Avg. AP0.79 0.808 0.825 0.843 0.86
0.82
0.86
0.8
0.81
0.812
2
2
2
Very simple yet powerful face track descriptor!
Recap 24
▶Track descriptor in 128 bytes!▶Face landmarks and alignment not required!▶One descriptor per track!!
▶State of the art/comparable results on multiple tasks!▶YouTube Faces Dataset!▶Oxford Buffy Dataset!
!▶ Can be trained with very small amount of data!▶ Extremely easy to compute!
!▶ Code online soon.
Questions?
top related