ARF @ MediaEval 2012: An Uninformed Approach to Violence Detection in Hollywood Movies

An Uninformed Approach to Violence Detection in Hollywood Movies

*this work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.

Bogdan IONESCU*2,4

Ionuț MIRONICĂ2

Jan SCHLÜTER+1

Markus SCHEDL3

ARF (Austria-Romania-France) team

1 Austrian Research Institute for Artificial Intelligence

2 University POLITEHNICA of Bucharest

3, 4 [affiliation logos]

+this work was supported by the Austrian Science Fund (FWF) under project no. Z159.

2

Presentation outline

MediaEval - Pisa, Italy, 4-5 October 2012 1/13

• The approach

• Video content description & classification

• Experimental results

• Conclusions and future work

3

The approach


> challenge: find a way to tag violence in movies;

> what approach?

[Figure: correlation matrix (on ground truth), color scale from low to high; e.g. movies: Harry Potter, Armageddon, Kill Bill, The Wicker Man]

different correlations between violence and concepts;

high variability in the appearance of violent scenes from movie to movie;

training a classifier on the ground truth to directly predict the violent frames is questionable.

4

The approach: machine learning


> approach: training

low-level features: movies & ground truth (annotations) give frame-level descriptors;

mid-level prediction: training of one predictor per concept (blood, …, fire, screams, …), each giving pred. (real values);

predicting violence: training & optimizing, output violence yes/no (+ score).

5

The approach: machine learning


> approach: testing

low-level features: an unseen movie gives frame-level descriptors;

mid-level prediction: one pred. per concept (blood, …, fire, screams, …);

predicting violence: output violence yes/no (+ score).
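The testing chain above (descriptors, then concept predictions, then a violence decision) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the helper names, the toy weights, the 0.5 threshold, and the choice to feed only the concept scores into the second stage. The predictors are simple sigmoid stand-ins, not the trained networks from the slides.

```python
import math

def make_linear_predictor(weights, bias):
    """Hypothetical stand-in for a trained predictor: sigmoid of a weighted sum."""
    def predict(features):
        total = sum(w * f for w, f in zip(weights, features)) + bias
        return 1.0 / (1.0 + math.exp(-total))
    return predict

def predict_violence(descriptor, concept_predictors, violence_predictor, threshold=0.5):
    # Stage 1: mid-level prediction, one real-valued score per concept.
    concept_scores = [predict(descriptor) for predict in concept_predictors.values()]
    # Stage 2: violence prediction from the concept scores (assumed input).
    score = violence_predictor(concept_scores)
    # Final output: yes/no decision plus the score it is based on.
    return score >= threshold, score

# Toy frame-level descriptor and toy predictors (all values are made up).
concept_predictors = {
    "blood": make_linear_predictor([0.8, -0.2, 0.1], -0.3),
    "fire": make_linear_predictor([0.1, 0.9, -0.4], -0.2),
    "screams": make_linear_predictor([-0.3, 0.2, 1.1], -0.1),
}
violence_predictor = make_linear_predictor([1.5, 1.0, 2.0], -2.0)
is_violent, score = predict_violence([0.9, 0.1, 0.7], concept_predictors, violence_predictor)
```

Keeping the two stages as separate callables mirrors the diagram: the concept predictors can be trained and validated on their own before the violence predictor is fitted on their outputs.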

6

Video content description - audio


standard audio features (frame-level) [B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]

• Linear Predictive Coefficients,

• Line Spectral Pairs,

• Mel-Frequency Cepstral Coefficients,

• Zero-Crossing Rate,

• spectral centroid, flux, rolloff, and kurtosis,

+ variance of each feature over a certain window.

global feature = mean & variance over time of the frame-level features f1, f2, …, fn.
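The aggregation of frame-level audio features into a global feature (mean & variance over time) can be sketched as follows. The function name, the toy values, and the use of the population variance are assumptions; the slides do not state which variance estimator or window length was used.

```python
from statistics import fmean, pvariance

def global_feature(frames):
    """frames: list of per-frame feature vectors [f1, ..., fn].

    Returns the per-feature means followed by the per-feature variances.
    """
    columns = list(zip(*frames))  # one tuple of values over time per feature
    means = [fmean(column) for column in columns]
    variances = [pvariance(column) for column in columns]
    return means + variances

# Three frames of a window, two features per frame (made-up numbers).
frames = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
descriptor = global_feature(frames)
```

For n frame-level features the resulting descriptor has 2n entries, independent of how many frames fall into the window.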

7

Video content description - visual


feature descriptors (frame-level)

• Histogram of Oriented Gradients (HoG) ~ counts occurrences of gradient orientations in localized portions of an image (20° per bin);

color descriptors (frame-level)

• Color naming histogram ~ projects colors onto 11 universal color names (black, blue, brown, grey, green, orange, pink, purple, red, white, and yellow);

[J. van de Weijer et al. IEEE TIP’09]

[B. Ionescu et al. IEEE ICASSP’06]

visual activity (frame-level)

• measured over time; high values account for important visual changes ~ action.
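The orientation histogram behind the HoG descriptor on this slide can be sketched for a single image region as below. This is a simplified illustration, not the descriptor used in the runs: it assumes unsigned orientations over 0° to 180° (so 20° per bin gives 9 bins), central-difference gradients, and votes weighted by gradient magnitude. The function name is hypothetical.

```python
import math

def orientation_histogram(image, bin_width_deg=20):
    """image: 2D list of grey levels. Returns a magnitude-weighted orientation histogram."""
    n_bins = 180 // bin_width_deg
    hist = [0.0] * n_bins
    rows, cols = len(image), len(image[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Central differences approximate the horizontal and vertical gradients.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Fold the orientation into [0, 180) so opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width_deg) % n_bins] += magnitude
    return hist

# Two 5x5 test patches: a horizontal and a vertical intensity ramp.
horizontal_ramp = [[x for x in range(5)] for _ in range(5)]
vertical_ramp = [[y for _ in range(5)] for y in range(5)]
hist_h = orientation_histogram(horizontal_ramp)
hist_v = orientation_histogram(vertical_ramp)
```

A full HoG descriptor would compute such histograms per cell and normalize them over blocks; that part is omitted here.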

8

Classifier: multi-layer perceptron


- training using back-propagation,

- use 'dropout' to reduce overfitting: a fraction of units is randomly omitted for each training case so a unit cannot rely on all other units being present. [G. Hinton et al. arXiv.org’12]

[Figure: network diagram; input: desc. dim., hidden: 512 units, output: 1-5 (~concept tags)]
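The dropout idea described above can be sketched for one hidden layer. The sketch assumes sigmoid units and a drop probability of 0.5 (the slide only says "a fraction"); at test time all units are kept and their outputs are scaled by the keep probability, which is one common way to match the expected training-time activity. Names and values are illustrative only.

```python
import math
import random

def hidden_layer(inputs, weights, biases, drop_prob=0.5, training=False, rng=random):
    """Sigmoid hidden layer with dropout applied to its outputs."""
    outputs = []
    for weight_row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(weight_row, inputs)) + bias
        activation = 1.0 / (1.0 + math.exp(-total))
        if training:
            # Randomly omit this unit for the current training case.
            if rng.random() < drop_prob:
                activation = 0.0
        else:
            # At test time every unit is present; scale by the keep probability.
            activation *= 1.0 - drop_prob
        outputs.append(activation)
    return outputs

# 512 hidden units with all-zero weights, so every sigmoid outputs 0.5 (toy setup).
weights = [[0.0, 0.0] for _ in range(512)]
biases = [0.0] * 512
train_out = hidden_layer([1.0, 2.0], weights, biases, training=True, rng=random.Random(0))
test_out = hidden_layer([1.0, 2.0], weights, biases, training=False)
dropped = sum(1 for value in train_out if value == 0.0)
```

Because a different random subset of units is omitted for each training case, no unit can rely on a specific other unit being present.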

9

Experimental results: concept prediction


> validation of the concept predictor (on the 15 train movies);

> use concept ground truth;

> leave-one-movie-out cross-validation*;

*results reported for an optimum threshold

best results for fire and explosions (prominent yellow tones), gunshots and screams,

the purely visual concepts obtain high F-score mainly because they are rare,

blood detector not that accurate (e.g. missed most blood in “Kill Bill”).
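Leave-one-movie-out cross-validation, as used on this slide, holds out all frames of one movie per fold and trains on the rest. A minimal sketch of the split follows; the function name is hypothetical, and only four of the 15 training movies are named in this deck, so the list is a toy subset.

```python
def leave_one_movie_out(movies):
    """Yield (training movies, held-out movie) pairs, one fold per movie."""
    for held_out in movies:
        train = [movie for movie in movies if movie != held_out]
        yield train, held_out

# Toy subset of titles mentioned in the slides; the real training set has 15 movies.
movies = ["Armageddon", "Harry Potter", "Kill Bill", "The Wicker Man"]
splits = list(leave_one_movie_out(movies))
```

Splitting by movie rather than by frame keeps near-identical neighbouring frames of one movie from appearing in both the training and the validation data.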

10

Experimental results: violence prediction


> validation of the violence predictor (on the 15 train movies);

> input: descriptors + mid-level predictions (real numbers);

> use violence ground truth;

leave-one-movie-out cross-validation, optimal threshold:

prec.   rec.   F-sc.
0.23    0.41   0.3

+ median filtering for predictions, optimal threshold:

prec.   rec.   F-sc.
0.27    0.46   0.34
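The effect of median filtering on frame-wise predictions can be illustrated with a small sketch. The window length, the threshold, the edge handling (truncated windows), and the toy scores are all assumptions; the slides do not give these settings.

```python
from statistics import median

def median_filter(scores, window=3):
    """Replace each score by the median of its neighbourhood (truncated at the edges)."""
    half = window // 2
    return [median(scores[max(0, i - half): i + half + 1]) for i in range(len(scores))]

def precision_recall_f(predicted, truth):
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score

# Made-up frame scores with one isolated false alarm at index 2.
scores = [0.1, 0.2, 0.9, 0.1, 0.2, 0.8, 0.9, 0.7, 0.9, 0.8]
truth = [False] * 5 + [True] * 5
threshold = 0.5

raw = precision_recall_f([s >= threshold for s in scores], truth)
smoothed = precision_recall_f([s >= threshold for s in median_filter(scores)], truth)
```

In this toy case the filter removes the isolated spike, which raises precision without lowering recall; on real data the gain depends on the window length and on how long violent scenes last.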

11

Experimental results: official runs


> segment/shot violence decision: assign each segment/shot its highest frame-wise prediction score, then threshold;

> segment-level results:

precision 0.28, recall 0.49, F-score 0.36, MAP@100 0.55;

> shot-level results:

results vary significantly with the movie
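The segment/shot decision rule on this slide (highest frame-wise score, then thresholding) can be sketched as follows. The function name, the half-open frame ranges, the threshold value, and the scores are assumptions for illustration.

```python
def segment_decisions(frame_scores, segments, threshold):
    """segments: list of (start, end) frame index pairs, end exclusive.

    Returns one (score, is_violent) pair per segment.
    """
    results = []
    for start, end in segments:
        # A segment inherits the highest prediction score among its frames.
        score = max(frame_scores[start:end])
        results.append((score, score >= threshold))
    return results

# Eight frames grouped into three shots (made-up values).
frame_scores = [0.1, 0.2, 0.7, 0.3, 0.1, 0.2, 0.4, 0.9]
shots = [(0, 3), (3, 6), (6, 8)]
decisions = segment_decisions(frame_scores, shots, threshold=0.5)
```

Taking the maximum makes a shot count as violent as soon as any of its frames scores above the threshold, which favours recall over precision.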

12

Experimental results: official runs


> shot-level comparative results:

[Figure: shot-level MAP@100 (axis 0 to 0.7) and MAP (axis 0 to 0.35) per submitted run, in axis order: DYNI-5, DYNI-1, DYNI-4, DYNI-3, TUB-5, DYNI-2, TEC-1, TUB-2, NII-5, TUB-4, TUB-1, TUB-3, NII-4, NII-1, NII-2, NII-3, LIG-2, LIG-4, LIG-3, LIG-1, TUM-5, TUM-3, TUM-2, TUM-4, TEC-2, TEC-4, TUM-1, ShanghaiHongkong-3, ShanghaiHongkong-4, ShanghaiHongkong-5, ShanghaiHongkong-2, TEC-5, TEC-3, ShanghaiHongkong-1, ARF-1]
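For readers unfamiliar with the measure, MAP@100 averages, over movies, the average precision computed on the 100 top-ranked shots. The sketch below shows one possible definition; normalizing by the number of relevant shots found in the top k is an assumption, and the official evaluation tool may normalize differently. Function names and the toy rankings are hypothetical.

```python
def average_precision_at(ranked_labels, k=100):
    """ranked_labels: relevance flags of shots, sorted by decreasing violence score."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_labels[:k], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumed normalization: relevant shots retrieved within the top k.
    return precision_sum / hits if hits else 0.0

def mean_average_precision_at(rankings, k=100):
    """Average the per-movie values; rankings holds one ranked list per movie."""
    return sum(average_precision_at(r, k) for r in rankings) / len(rankings)

# Two toy movies with five and two ranked shots.
movie_a = [True, False, True, True, False]
movie_b = [False, True]
map_at_100 = mean_average_precision_at([movie_a, movie_b])
```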

13

Conclusions and future work


> fair performance for a naïve attempt at violence detection;

> future work:

investigate whether the concept predictions actually helped,

investigate the contribution of the modalities,

investigate dropout vs. classic learning.

> a high baseline to be challenged by more sophisticated approaches;

14

Thank you!

Any questions?
