ARF @ MediaEval 2012: An Uninformed Approach to Violence Detection in Hollywood Movies
An Uninformed Approach to Violence Detection in Hollywood Movies
Bogdan IONESCU*2,4
Ionuț MIRONICĂ2
Jan SCHLÜTER+1
Markus SCHEDL3
ARF (Austria-Romania-France) team
4 Austrian Research Institute for Artificial Intelligence
1 University POLITEHNICA of Bucharest
* This work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.
+ This work was supported by the Austrian Science Fund (FWF) under project no. Z159.
Presentation outline
MediaEval - Pisa, Italy, 4-5 October 2012 1/13
• The approach
• Video content description & classification
• Experimental results
• Conclusions and future work
The approach
> challenge: find a way to tag violence in movies;
> what approach?
[Figure: correlation matrix (on ground truth) between violence and concepts, color scale from low to high, e.g. movie: Harry Potter; further examples: Armageddon, Kill Bill, The Wicker Man]
training a classifier on ground truth to directly predict the violent frames is questionable:
different correlations between violence and concepts;
high variability in appearance of violent scenes from movie to movie;
The approach: machine learning
> approach:
movies & ground truth (annotations)
-> low-level features: frame-level descriptors
-> mid-level prediction: training one predictor per concept (blood, …, fire, screams, …), each giving a pred. (real values)
-> predicting violence: training & optimizing, output violence yes/no (+ score)
The approach: machine learning
> approach: testing
unseen movie
-> low-level features: frame-level descriptors
-> mid-level prediction: one pred. per concept (blood, …, fire, screams, …)
-> predicting violence: violence yes/no (+ score)
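The training and testing flow above can be read as a two-stage (stacked) predictor. The sketch below only illustrates the data flow. The function names, the toy lambda "models", and the 0.5 threshold are assumptions made for illustration and do not come from the slides, where both stages are trained classifiers.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A predictor maps an input vector to a real-valued score.
Predictor = Callable[[Sequence[float]], float]


def predict_concepts(frame: Sequence[float],
                     concept_models: Dict[str, Predictor]) -> List[float]:
    """Mid-level stage: one real-valued prediction per concept (sorted by name)."""
    return [concept_models[name](frame) for name in sorted(concept_models)]


def predict_violence(frame: Sequence[float],
                     concept_models: Dict[str, Predictor],
                     violence_model: Predictor,
                     threshold: float = 0.5) -> Tuple[bool, float]:
    """Final stage: descriptors + concept predictions -> violence yes/no (+ score)."""
    stacked = list(frame) + predict_concepts(frame, concept_models)
    score = violence_model(stacked)
    return score >= threshold, score


# Toy stand-ins for trained models (hypothetical, for illustration only).
concept_models = {
    "blood": lambda x: x[0],
    "fire": lambda x: x[1],
    "screams": lambda x: x[2],
}


def violence_model(v: Sequence[float]) -> float:
    # Averages the three concept scores appended at the end of the input.
    return sum(v[-3:]) / 3.0


decision, score = predict_violence([0.9, 0.8, 0.7], concept_models, violence_model)
print(decision, round(score, 2))
```

The point of the design is that the violence classifier sees the concept predictions as real numbers next to the raw descriptors, rather than hard yes/no concept tags.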
Video content description - audio
standard audio features (frame-level) [B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• Zero-Crossing Rate,
• spectral centroid, flux, rolloff, and kurtosis,
+ variance of each feature over a certain window.
global feature = mean & variance of the frame-level features f1, f2, …, fn over time
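The mean-and-variance aggregation over a window can be sketched as follows. This is a minimal illustration, assuming population variance and a window given as a list of per-frame feature vectors; the window length and the function name `global_feature` are not specified in the slides.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def global_feature(frames: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-dimension means and variances over one window of frames."""
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) for d in dims]  # population variance (assumption)
    return means + variances


# Three frames with two features each (toy values).
window = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
feature = global_feature(window)
print(feature)
```

For n frame-level features the resulting vector has 2n entries: n means followed by n variances.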
Video content description - visual
feature descriptors (frame-level)
• Histogram of Oriented Gradients (HoG) ~ counts occurrences of gradient orientation in localized portions of an image (20° per bin);
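To make the 20° binning concrete, here is a simplified sketch of an orientation histogram for a single image cell. It is not the full HoG descriptor (no block normalization), and it assumes unsigned orientations in [0°, 180°), which gives 9 bins, with votes weighted by gradient magnitude; the slides specify neither choice.

```python
import math
from typing import List, Sequence


def orientation_histogram(cell: Sequence[Sequence[float]],
                          bin_width: float = 20.0) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations in one cell."""
    n_bins = int(180 / bin_width)  # 9 bins for 20 degrees per bin
    hist = [0.0] * n_bins
    rows, cols = len(cell), len(cell[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # central differences
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue  # flat region: no orientation to vote for
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # fold to [0, 180)
            # min() guards against a rounding result of exactly 180.0
            hist[min(int(angle // bin_width), n_bins - 1)] += magnitude
    return hist


# Horizontal intensity ramp: every gradient points along x, so all votes fall in bin 0.
ramp = [[float(x) for x in range(5)] for _ in range(5)]
hist = orientation_histogram(ramp)
print(hist)
```

A full descriptor would compute one such histogram per cell and concatenate them after normalization.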
color descriptors (frame-level)
• Color naming histogram ~ projects colors onto 11 universal color names (black, blue, brown, grey, green, orange, pink, purple, red, white, and yellow);
[J. van de Weijer et al. IEEE TIP’09]
[B. Ionescu et al. IEEE ICASSP’06]
visual activity (frame-level)
• high values will account for important visual changes ~ action
Classifier: multi-layer perceptron
- training using back-propagation,
- use 'dropout' to reduce overfitting: a fraction of units is randomly omitted for each training case so a unit cannot rely on all other units being present. [G. Hinton et al. arXiv.org’12]
[Figure: network diagram; input: desc. dim., hidden layer: 512 units, output: 1-5 (~concept tags)]
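The dropout idea described above can be sketched on a plain list of hidden activations. This is a minimal illustration, assuming a drop probability of 0.5 and the test-time scaling by the keep probability described by Hinton et al.; the function name and signature are hypothetical, and the slides do not state the drop rate used.

```python
import random
from typing import List, Optional, Sequence


def dropout(activations: Sequence[float], p_drop: float = 0.5,
            training: bool = True,
            rng: Optional[random.Random] = None) -> List[float]:
    """Randomly omit units while training; scale by the keep probability at test time."""
    if not training:
        # All units are present, so scale them to match the expected training-time sum.
        return [a * (1.0 - p_drop) for a in activations]
    rng = rng or random.Random()
    # Each unit is dropped independently for this training case.
    return [0.0 if rng.random() < p_drop else a for a in activations]


hidden = [0.2, 0.7, 0.1, 0.9]
train_out = dropout(hidden, training=True, rng=random.Random(0))
test_out = dropout(hidden, training=False)
print(train_out, test_out)
```

Because a different random subset is omitted for every training case, no unit can rely on any particular other unit being present.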
Experimental results: concept prediction
> validation of the concept predictor (on the 15 training movies);
> use concept ground truth;
> leave-one-movie-out cross-validation (results reported for an optimum threshold);
best results for fire and explosions (prominent yellow tones), gunshots and screams;
the purely visual concepts obtain a high F-score mainly because they are rare;
blood detector not that accurate (e.g. missed most blood in “Kill Bill”).
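Leave-one-movie-out cross-validation holds out each movie once and trains on all the others, so no frames from the validation movie leak into training. A minimal sketch of the fold generation follows; the three titles are movies mentioned in the slides and serve only as placeholders, and the function name is hypothetical.

```python
from typing import Iterator, List, Tuple


def leave_one_movie_out(movies: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training movies, held-out movie) so that each movie is held out once."""
    for i, held_out in enumerate(movies):
        yield movies[:i] + movies[i + 1:], held_out


movies = ["Armageddon", "Kill Bill", "The Wicker Man"]  # placeholder subset
folds = list(leave_one_movie_out(movies))
for train, held_out in folds:
    print(f"train on {train}, validate on {held_out}")
```

With the 15 training movies this yields 15 folds, each training on 14 movies.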
Experimental results: violence prediction
> validation of the violence predictor (on the 15 training movies);
> input: descriptors + mid-level predictions (real numbers);
> use violence ground truth;
leave-one-movie-out cross-validation, results at the optimal threshold:
                                      prec.  rec.  F-sc.
predictions                           0.23   0.41  0.30
+ median filtering for predictions    0.27   0.46  0.34
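Median filtering smooths the frame-wise prediction scores over time and suppresses isolated spikes. The sketch below assumes a sliding window that is truncated at the sequence borders; the window width is a hypothetical parameter, since the slides do not state it. The F-score helper is the usual harmonic mean of precision and recall; for precision 0.27 and recall 0.46 it gives about 0.34.

```python
from statistics import median
from typing import List, Sequence


def median_filter(scores: Sequence[float], width: int = 5) -> List[float]:
    """Sliding-window median; windows are truncated at the sequence borders."""
    half = width // 2
    return [median(scores[max(0, i - half): i + half + 1])
            for i in range(len(scores))]


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# An isolated spike (0.9 at index 2) is removed; the sustained run at the end is kept.
noisy = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.8]
smoothed = median_filter(noisy, width=3)
print(smoothed)
print(round(f_score(0.27, 0.46), 2))
```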
Experimental results: official runs
> segment/shot violence decision: assign each segment/shot its highest frame-wise prediction score, then apply thresholding;
> segment-level results: precision 0.28, recall 0.49, F-score 0.36, MAP@100 0.55;
> shot-level results: results vary significantly with the movie.
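The segment/shot decision rule above (highest frame-wise score, then thresholding) can be sketched as follows. The shot boundaries, scores, and threshold value are made-up illustrations, and the half-open [start, end) frame ranges are an assumption about how shots are represented.

```python
from typing import List, Sequence, Tuple


def shot_decisions(frame_scores: Sequence[float],
                   shots: Sequence[Tuple[int, int]],
                   threshold: float) -> List[Tuple[float, bool]]:
    """For each shot [start, end), take the maximum frame score and threshold it."""
    results = []
    for start, end in shots:
        score = max(frame_scores[start:end])
        results.append((score, score >= threshold))
    return results


frame_scores = [0.1, 0.2, 0.1, 0.7, 0.9, 0.3, 0.2]
shots = [(0, 3), (3, 5), (5, 7)]  # hypothetical shot boundaries (frame indices)
decisions = shot_decisions(frame_scores, shots, threshold=0.5)
print(decisions)
```

Taking the maximum means a single high-scoring frame is enough to flag the whole shot, which favors recall over precision.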
Experimental results: official runs
> shot-level comparative results:
[Figure: MAP@100 (axis 0 to 0.7) and MAP (axis 0 to 0.35) for all submitted runs; runs along the axis: DYNI-5, DYNI-1, DYNI-4, DYNI-3, TUB-5, DYNI-2, TEC-1, TUB-2, NII-5, TUB-4, TUB-1, TUB-3, NII-4, NII-1, NII-2, NII-3, LIG-2, LIG-4, LIG-3, LIG-1, TUM-5, TUM-3, TUM-2, TUM-4, TEC-2, TEC-4, TUM-1, ShanghaiHongkong-3, ShanghaiHongkong-4, ShanghaiHongkong-5, ShanghaiHongkong-2, TEC-5, TEC-3, ShanghaiHongkong-1, ARF-1]
Conclusions and future work
> fair performance for a naïve attempt at violence detection;
> a high baseline to be challenged by more sophisticated approaches;
> future work:
investigate whether the concept predictions actually helped,
investigate contribution of modalities,
investigate dropout vs. classic learning.
thank you!
any questions?