Nicholas J. Bryan, Paris Smaragdis*, and Gautham Mysore*
Stanford University | CCRMA *Advanced Technology Labs | Adobe Systems
Clustering and Synchronizing MulG-‐Camera Video
via Audio FingerprinGng
CCRMA DSP Seminar, November 13th 2012
Outline I IntroducGon
II Proposed Method
-‐ Non-‐Linear Transform
-‐ Time-‐Difference-‐Of-‐Arrival EsGmaGon
-‐ Clustering
-‐ SynchronizaGon Refinement
-‐ Efficient ComputaGon III EvaluaGon IV Conclusions
2
IntroducGon
3
• IdenGfy and synchronize mulGple videos of the same event
Video & Audio
A
B
C
D
E B
D
A
C
E
MoGvaGon
• ProliferaGon of mobile devices
• MulGple videos of a single event common – Moments in history – Weddings, concerts, speeches, film sets
• Desired to easily edit video together – Grouping/Clustering (Manual) – SynchronizaGon (Manual, Hardware)
4
• Dual System Workflow • 1 Videographer • 1 Sound Engineer
• MulG-‐Camera Workflow • 2+ Videographer • 1+ Sound Engineer
TradiGonal Video Capture
5
Crowd-‐Sourced MulG-‐Camera Video
6
• 1 Wedding ≈ 300 guest ≈ 100 smartphones/cameras ≈ 10+ videos of “I do”
• 1 concert ≈ 15,000 people ≈ 5,000 smartphones ≈ 100+ of video clips/song ≈ 1000+ video clips/concert
• 1 presidenGal speech ≈ 200,000 people ≈ 70,000 smartphones ≈ 10,000+ videos
Demo Video
• Taylor Swih’s “Fearless”
7
Outline I IntroducGon
II Proposed Method
-‐ Non-‐Linear Transform
-‐ Time-‐Difference-‐Of-‐Arrival EsGmaGon
-‐ Clustering
-‐ SynchronizaGon Refinement
-‐ Efficient ComputaGon III EvaluaGon IV Conclusions
8
General Approach
• Use audio – Typically more “global” – Allows visually disjoint video
• Time-‐difference-‐of-‐arrival esGmaGon – For each pair of clips in collecGon, compute Gme offset which best synchronizes the given pair using standard correlaGon
– Use correlaGon signals to decide if the two files should match or not
9
Problems
• ComputaGonally expensive
• No accurate (straighnorward) clustering method • Not robust
Audio FingerprinGng
• Short-‐duraGon signatures via feature extracGon • Finds idenGcal (or similar) matches of unknown clip with DB
• Hash fingerprints for fast search and retrieval • Shazam, SoundHound, Philips, Gracenote, etc. • See [Wang 2003] & [Haitsma and Kalker 2003]
11
Audio FingerprinGng for MulG-‐Camera
• Slightly different problem – Group all clips in DB (mulGple matching) – Time synchronize all clips within each group
• Audio-‐fingerprinGng for mulG-‐camera – Principal of most methods yield sync offset – Robust and fast! – IniGal work over the last few years [Shrestha et al. 2007] & [Kennedy and Naaman 2009]
12
Proposed Method
1. Non-‐Linear Transform (FingerprinGng Step) 2. Time-‐Difference-‐Of-‐Arrival EsGmaGon 3. Clustering 4. SynchronizaGon Refinement 5. Efficient ComputaGon
13
Outline I IntroducGon
II Proposed Method
-‐ Non-‐Linear Transform
-‐ Time-‐Difference-‐Of-‐Arrival EsGmaGon
-‐ Clustering
-‐ SynchronizaGon Refinement
-‐ Efficient ComputaGon III EvaluaGon IV Conclusions
14
Non-‐Linear (Landmark) Transform
• Convert Gme-‐domain audio signal into a high-‐dimensional, sparse, binary landmark signal
15
...
. . .
x(t) 2 R
...
. . .
Landmark Transform
tL(t, h)
x(t)
L(t, ·) 2 {0, 1}N
Landmarks
• Spectral peak pairs as landmarks [Wang 2003] – Short-‐Gme Fourier transform – Landmark = [f1, f2, Δt] + absolute Gme offset – Place each landmark in appropriate locaGon in
16
L(t1, h) = 1
...
. . . t1
h
L(t, h)
(t1, h = [f1t1 , f
2t2 , t2 � t1])
• With a large number of peaks, peak pairs are created in a limited Gme-‐frequency range
17
Landmarks as ConstellaGons Fr
eque
ncy
(kH
z)
Time (secs)
Spectrogram
1 2 3 4 5 60
0.5
1
1.5
2
2.5
3
3.5
4
Student Version of MATLAB
0 1 2 3 4 5 6 70
0.5
1
1.5
2
2.5
3
3.5
4Spectral Peak Onsets
Freq
uenc
y (k
Hz)
Time (secs)
Student Version of MATLAB
• Short-‐Gme Fourier transform • Leaky integrator peak detector for each FFT bin
18
Simple Frequency Peak Detector
if |X(f)| > �f
else
�f = �f � (1� e�1/(⌧ffs))�f
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Leaky Peak Detector λf
Ampl
itude
Time (Sec)
|X(f)|λfPeaks
Student Version of MATLAB
�f = |X(f)|
Leaky Peak Detector
//Peak Onset
Outline I IntroducGon
II Proposed Method
-‐ Non-‐Linear Transform
-‐ Time-‐Difference-‐Of-‐Arrival EsGmaGon
-‐ Clustering
-‐ SynchronizaGon Refinement
-‐ Efficient ComputaGon III EvaluaGon IV Conclusions
19
Time-‐Difference-‐Of-‐Arrival EsGmaGon
• Pairwise cross-‐correlaGon method – Correlate each track with each other – Find argmax for offset – i.e. Matched filter
Rij(t) =P1
⌧=�1 xi(⌧)xj(t+ ⌧)
• Landmark cross-‐correlaGon
• Time-‐Difference-‐Of-‐Arrival EsGmaGon
Landmark Cross-‐CorrelaGon
21
ˆtij = argmaxt RLi,Lj (t)
RLi,Lj (t) =P1
⌧=�1 Li(⌧)TLj(t+ ⌧)
Time-‐Difference-‐Of-‐Arrival EsGmaGon
22
(a) Normalized absolute Gme-‐domain cross-‐correlaGon.
(b) Normalized landmark cross-‐correlaGon.
Rxi,xj (t)
RLi,Lj (t)
Outline I IntroducGon
II Proposed Method
-‐ Non-‐Linear Transform
-‐ Time-‐Difference-‐Of-‐Arrival EsGmaGon
-‐ Clustering
-‐ SynchronizaGon Refinement
-‐ Efficient ComputaGon III EvaluaGon IV Conclusions
23
Clustering
• AgglomeraGve Clustering – IniGalize each clip as a separate cluster and merged into successively larger clusters
– Merge most confidence matches first • Confidence as funcGon of stats from best potenGal sync • Reject unconfident merges based on decision rules
24
A
B
C
D
E
B
D
A
C
A
C E
• Maximum of correlaGon • Mean and variance of cross-‐correlaGon • Percentage of total matching landmarks in the overlap region • Overall Gme range defined by the set of matching landmarks • Overlap region length • Ignore overly common landmarks (i.e. 60Hz)
Merge Decision Rules
25
o
ro
• Groups w/pairwise sync offset and confidence scores
Clustering Output
26
t B DB 0 -11.5D 11.5 0
Offsets (seconds) S B DB - 23D 23 -
Confidence Score
t A C EA 0 -5 10C 5 0 -E -10 - 0
Offsets (seconds) Confidence Score
S A C EA - 30 20C 30 - -E 20 - -
B
D
A
C E
Outline I IntroducGon
II Proposed Method
-‐ Non-‐Linear Transform
-‐ Time-‐Difference-‐Of-‐Arrival EsGmaGon
-‐ Clustering
-‐ SynchronizaGon Refinement
-‐ Efficient ComputaGon III EvaluaGon IV Conclusions
27
SynchronizaGon Refinement
• Refinement is required for clusters of three or more if: 1. Inconsistent pairwise TDOA esGmates do not saGsfy all
triangle equaliGes within a cluster 2. One or more TDOA esGmates within any cluster is
unknown caused by non-‐overlapping clips
28
tAC 6= tAB + tBC
Implied by other esGmates
(a) Case 1 (a) Case 2
Slightly off
Greedy Match-‐and-‐Merge
1. Find the most confident TDOA esGmate within the cluster in terms of or similar confidence score.
2. Merge the landmark signals and . First Gme shih by and then mulGply or add the two signals together (depending on the desired effect).
3. Update the remaining TDOA esGmates and confidence scores to respect the file merge.
4. Repeat unGl all files within the cluster are merged. 29
tijRLi,Lj
Li LjLj tij
Greedy Match-‐and-‐Merge Graphically
30
A D
B
C
A D
BC
D
ABC ABCD
(a) IniGal Clusters (b) IteraGon 1
(c) IteraGon 2 (d) IteraGon 3
Outline I IntroducGon
II Proposed Method
-‐ Non-‐Linear Transform
-‐ Time-‐Difference-‐Of-‐Arrival EsGmaGon
-‐ Clustering
-‐ SynchronizaGon Refinement
-‐ Efficient ComputaGon III EvaluaGon IV Conclusions
31
Efficient ComputaGon
• Leverage knowledge of landmark signal and perform “sparse” cross-‐correlaGon in a special way (fingerprinGng)
• Use some form of associaGve array, map, or dicGonary to store landmarks and compute all pairwise correlaGons – Direct arrays – Binary tree – Hash table
32
Map Structure I
• Create map structure of all landmarks – Key = (f1, f2, Δt) – Value = (FileID, AbsoluteTimeOffset)
• Matching files will have idenGcal landmark • Difference between AbsoluteTimeOffset of gives sync
33
A
B
C
D
E
t1 t2 t3 t4 t5
…
…
…
A,t1 E,t5
C,t3 A,t1
B,t2 D,t4
Map Structure II
• Convert map structure to pairwise correlaGons • For each landmark, compute all pairwise Gme differences
and store in the appropriate pairwise correlaGon
34
Δt = 4
A vs. E
C vs. A
Δt = 2
B vs. D
Δt = 2
…
…
…
A,t1 E,t5
C,t3 A,t1
B,t2 D,t4
General ComputaGonal Benefit
• Naïve pairwise correlaGons 1. pairwise correlaGons, number of files 2. Each correlaGon , samples in file
• DrasGcally reduces the computaGonal cost 1. Eliminates pairwise correlaGons for clips that don’t match 2. Makes each pairwise correlaGon faster
• Computes correlaGon computaGon for only the salient parts (landmarks) of audio
35
P !2(P�2)!
O(N log(N))
P =
N =
Ideal Case
1. Pairwise Comparisons – All landmarks are unique its group – Only performs pairwise correlaGons within each group – For large # groups/small # clips, this is savings huge
2. Single pairwise correlaGon – Only correlate points with matching landmarks, no computaGon for 0s
– Ideal case with no false posiGve matches results in a cost, with = number of matching landmarks
36
O(M) M
Outline I IntroducGon
II Proposed Method
-‐ Non-‐Linear Transform
-‐ Time-‐Difference-‐Of-‐Arrival EsGmaGon
-‐ Clustering
-‐ SynchronizaGon Refinement
-‐ Efficient ComputaGon III EvaluaGon IV Conclusions
37
EvaluaGon Metrics
• Performance measures – Precision, Recall, F1 score – Computed on pairwise matches of final clusters
• ComputaGonal cost – Compute Gme (seconds) – Throughput (seconds processed/seconds of compute Gme)
• Benchmark – Comparison to commercial mulG-‐camera sohware Plural Eyes
38
Precision, Recall, and F1
• Precision – fracGon of esGmated pairwise merges retrieved that are correct
• Recall – fracGon of correct pairwise merges retrieved
• F1 score – harmonic mean of precision and recall
39
B D
A
C E
B D
A
C
E
-‐ A-‐E A-‐B A-‐D E-‐B E-‐D B-‐D
A-‐E A-‐C C-‐E
B-‐D
P = 2/6 R = 2/5 F1 = 8/22
EsGmated Clusters Ground Truth Clusters
2PR/(P +R)
Datasets
• Speech (180 clips from film set) – Average length 20-‐40 seconds – 54 clusters of one file – 54 clusters of two files – 6 clusters of three files
• Music (23 clips from live music concerts) – Average length 3-‐5 minutes – 1 cluster of 7 files – 2 clusters of 8 files
40 All audio files are downsampled to common sample rate of 8kHz for efficiency
Precision, Recall, and F1 Results
• As expected from using the feature extracGon of [Wang 2003]
41
ComputaGonal Cost
42 Rough Timing on MacBook Pro, OSX 10.6.8, 2.66 GHz Intel Core i7, unopGmized C++.
Speech Music Speech + Music
Proposed 47.0 41.1 90.1Traditional 1550 197 3600
(a) Computation time (s).
Speech Music Speech + Music
Proposed 164.6 146.5 152.7Traditional 5.0 30.5 3.9
(b) Throughput (s/s).
≈ linear not linear
Benchmark (Speech Dataset)
• Accuracy Measures – Proposed method F1 ≈ 99% – Plural Eyes 2.1.0 F1 ≈ 95%
• ComputaGonal Cost – Proposed method ≈ 3 minutes – Plural Eyes 1.2.0 ≈ 6 hours – Plural Eyes 2.1.0 ≈ 2 hours – Plural Eyes 2.1.0 (hard) ≈ 10 hours
43
Outline I IntroducGon
II Proposed Method
-‐ Non-‐Linear Transform
-‐ Time-‐Difference-‐Of-‐Arrival EsGmaGon
-‐ Clustering
-‐ SynchronizaGon Refinement
-‐ Efficient ComputaGon III EvaluaGon IV Conclusions
44
Future Work & Research DirecGons
• Video analog to photo “sGtching” – Crowd-‐sourced mulG-‐camera video – Easily change both video and audio viewpoint
• Denoising/improving audio quality from groups
• SpaGal audio processing – Use for Gme delay esGmaGon – Large-‐scale beamforming, direcGonal listening, etc.
45
Conclusions
• Method of clustering and sync of mulG-‐camera videos using audio – Non-‐Linear Transform – Time-‐Difference-‐Of-‐Arrival EsGmaGon – Clustering – SynchronizaGon Refinement – Efficient ComputaGon
• Fast and accuracy
46
References
• Jaap Haitsma and Ton Kalker, “A Highly Robust Audio FingerprinGng System With an Efficient Search Strategy,” Journal of New Music Research , vol. 32, no. 2, 2003.
• A.L. Wang, “An Industrial-‐Strength Audio Search Algorithm,” in Proc. 4th Int. Symposium on Music InformaGon Retrieval (ISMIR) , October 2003.
• P. Shrestha, M. Barbieri, and H. Weda, “SynchronizaGon of mulG-‐camera video recordings based on audio,” in Proc. 15th Intl. Conf. on MulGmedia , 2007.
• L. Kennedy and M. Naaman, “Less talk, more rock: automated organizaGon of community-‐contributed collecGons of concert videos,” in Proc. 18th Int. Conf. on World Wide Web , 2009.
• D. Ellis (2009). “Robust Landmark-‐Based Audio FingerprinGng”, h|p://labrosa.ee.columbia.edu/matlab/fingerprint
47
Demo Video
• Dave Ma|hews Band’s “Everyday”
48
Nicholas J. Bryan, Paris Smaragdis*, and Gautham Mysore*
Stanford University | CCRMA *Advanced Technology Labs | Adobe Systems
Clustering and Synchronizing MulG-‐Camera Video
via Audio FingerprinGng
CCRMA DSP Seminar, November 13th 2012