Mohammad S. Al Awad 985426 26-May-2008
![Page 1: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/1.jpg)
Toward Semantic Indexing and Retrieval Using Hierarchical Audio Models
Wei-Ta Chu, Wen-Huang Cheng, Jane Yung-Jen Hsu and Ja-Ling Wu
Multimedia Systems, 2005
Mohammad S. Al Awad
985426
26-May-2008
![Page 2: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/2.jpg)
Outline
Introduction
Background
Audio Event?
Semantic Context?
Problem statement
Hierarchical Framework
Modeling
Performance
Indexing and Retrieval
![Page 3: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/3.jpg)
Introduction
Semantic indexing and content retrieval in:
◦ Audio: speech, music, noise and silence
◦ Audiovisual: shots, dialogue and action scenes
Representation of high-level query semantics
◦ E.g., scenes associated with semantic meaning vs. color layouts and object positions
![Page 4: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/4.jpg)
Background
Previous work concentrated on identifying isolated sounds such as applause, gunshots, cheers or silence.
Tools used: Bayesian networks and support vector machines (SVM) to fuse information from different sounds
Critique: isolated sounds carry less solid semantic meaning
![Page 5: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/5.jpg)
Audio Event
A short audio clip that represents the sound of an object or event
Audio events can be characterized by statistical patterns and temporal evolution
![Page 6: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/6.jpg)
Semantic Context
The context of a semantic concept is an analysis unit that represents a more reasonable granularity for multimedia content usage
Semantic concept: gunplay scene
Semantic context: gunshots and explosions in an action movie
![Page 7: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/7.jpg)
Enhancing the problem statement?
Index multimedia documents by detecting high-level semantic contexts. To characterize a semantic context, audio events highly relevant to specific semantic concepts are collected and modeled.
Occurrence patterns of gunshot and explosion events are used to characterize “gunplay” scenes, and the patterns of engine and car-braking events are used to characterize “car chasing” scenes.
![Page 8: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/8.jpg)
Hierarchical Framework
Low-level events, such as gunshot, explosion, engine and car-braking sounds, are modeled
Based on the statistical information collected from the various audio event detection results, two methods are investigated to fuse this information: the Gaussian mixture model (GMM) and the hidden Markov model (HMM)
![Page 9: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/9.jpg)
Hierarchical Framework (cont.)
![Page 10: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/10.jpg)
Modeling
Feature extraction
Audio event modeling
Confidence evaluation
Semantic context modeling
◦ Gaussian mixture model
◦ Hidden Markov model
![Page 11: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/11.jpg)
Feature Extraction
Extract suitable time-domain and frequency-domain features to build the feature vector
Audio streams: 16 kHz, 16-bit mono, frames of 400 samples (25 ms) with 50% overlap
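The framing step above can be sketched as follows. This is a minimal illustration, assuming that "400 samples" is the frame length and that incomplete trailing frames are dropped; the function name `frame_signal` is hypothetical and not taken from the paper.

```python
def frame_signal(samples, frame_len=400, overlap=0.5):
    """Split a sequence of samples into overlapping fixed-length frames.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption; the slides do not say how the tail is handled).
    """
    hop = int(frame_len * (1 - overlap))  # 200 samples for 50% overlap
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


SAMPLE_RATE = 16000                       # 16 kHz, as stated on the slide
one_second = [0.0] * SAMPLE_RATE          # placeholder signal (silence)
frames = frame_signal(one_second)
frame_ms = 1000 * 400 / SAMPLE_RATE       # duration of one frame in milliseconds
```

With a hop of 200 samples, each new frame shares half of its samples with the previous one.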
![Page 12: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/12.jpg)
Feature Extraction (tools)
Perceptual features
◦ STE (short-time energy): the loudness or volume of the frame
◦ BER (band-energy ratio): the spectrum is divided into four sub-bands, and the energy of each sub-band is divided by the total energy
◦ ZCR (zero-crossing rate): the average number of signal sign changes in an audio frame
Mel-frequency cepstral coefficients (MFCC)
Frequency centroid (FC)
Bandwidth (BW)
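The three perceptual features can be illustrated with a short sketch. The exact normalizations used in the paper are not given on the slide, so the definitions below (mean squared amplitude for STE, sign changes per sample pair for ZCR, four equal-width sub-bands for BER) are assumptions, and all function names are hypothetical. A naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def short_time_energy(frame):
    """Mean squared amplitude of the frame (assumed definition of STE)."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def band_energy_ratios(frame, n_bands=4):
    """Energy of each of n_bands equal-width sub-bands divided by total energy."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):  # naive DFT over the non-negative frequency bins
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    total = sum(power) or 1.0  # avoid division by zero on silent frames
    width = half // n_bands
    ratios = []
    for b in range(n_bands):
        lo = b * width
        hi = half if b == n_bands - 1 else (b + 1) * width
        ratios.append(sum(power[lo:hi]) / total)
    return ratios


# A 1 kHz tone sampled at 16 kHz, one 400-sample frame.
frame = [math.sin(math.pi * t / 8 + 0.1) for t in range(400)]
ste = short_time_energy(frame)
zcr = zero_crossing_rate(frame)
bers = band_energy_ratios(frame)
```

For this tone, almost all of the energy falls into the lowest sub-band, which is what the band-energy ratio is meant to expose.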
![Page 13: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/13.jpg)
Feature Extraction (feature vector)
16-dimension feature vector
◦ 1 (STE) + 4 (BER) + 1 (ZCR) + 1 (FC) + 1 (BW) + 8 (MFCC)
16-dimension feature vector of frame differences
◦ Difference between adjacent audio frames, A_i - A_{i+1}
The result is a 32-dimension feature vector
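How the 16 per-frame features and the 16 frame differences combine into a 32-dimension vector can be sketched as below. This assumes the difference is taken element-wise as A_i - A_{i+1} and that the last frame, which has no successor, is dropped; the function name is hypothetical.

```python
def with_frame_differences(features):
    """Turn a list of 16-dim per-frame vectors into 32-dim vectors.

    Each output vector is the frame's own features followed by the
    element-wise difference to the next frame. The final frame has no
    successor and is dropped (an assumption, not stated on the slide).
    """
    out = []
    for i in range(len(features) - 1):
        cur, nxt = features[i], features[i + 1]
        diff = [a - b for a, b in zip(cur, nxt)]
        out.append(cur + diff)
    return out


# Three dummy frames whose 16 features are all 0.0, 1.0 and 2.0 respectively.
features = [[float(i)] * 16 for i in range(3)]
vectors = with_frame_differences(features)
```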
![Page 14: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/14.jpg)
Audio Event Modeling
A hidden Markov model (HMM) is used to model audio samples
Each HMM module takes the extracted features as input
◦ The Forward algorithm is used to compute the log-likelihood of an audio segment with respect to each audio event
◦ The Baum-Welch algorithm is used to estimate the transition probabilities between states (physical meaning)
◦ A clustering algorithm is used to determine the model size and states
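The Forward algorithm mentioned above can be sketched in the log domain as follows. For brevity this example uses a toy HMM with discrete observation symbols, whereas the event models in the paper take continuous feature vectors as input; the parameter values and function names are made up for illustration.

```python
import math


def to_log(p):
    """Log of a probability, mapping 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(obs, log_pi, log_A, log_B):
    """Log-likelihood of a non-empty observation sequence under an HMM.

    log_pi[i]   : log initial probability of state i
    log_A[i][j] : log transition probability from state i to state j
    log_B[i][o] : log probability of emitting symbol o in state i
    """
    n_states = len(log_pi)
    alpha = [log_pi[i] + log_B[i][obs[0]] for i in range(n_states)]
    for o in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + log_A[i][j] for i in range(n_states)]) + log_B[j][o]
            for j in range(n_states)
        ]
    return logsumexp(alpha)


# Toy two-state model with two observation symbols (illustrative values only).
log_pi = [to_log(p) for p in (0.6, 0.4)]
log_A = [[to_log(p) for p in row] for row in ((0.7, 0.3), (0.4, 0.6))]
log_B = [[to_log(p) for p in row] for row in ((0.9, 0.1), (0.2, 0.8))]
ll = forward_log_likelihood([0, 1, 0], log_pi, log_A, log_B)
```

In the framework described here, one such log-likelihood would be computed per audio event model, and the values compared across models.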
![Page 15: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/15.jpg)
Audio Event Modeling (training)
HMM models: gunshot, explosion, engine, car braking
Training data: 100 audio events, 3-10 sec each, for each HMM model
![Page 16: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/16.jpg)
Confidence Evaluation
To determine how close a segment is to an audio event, a confidence metric is calculated
◦ Compare the audio segment with the audio event model in 1-second steps (analysis window)
◦ Use the log-likelihood from the Forward algorithm
◦ The audio segment might not belong to the audio event model
◦ Likelihood ratio test: based on the distribution of log-likelihood values
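One way to turn a log-likelihood into a confidence score with a likelihood ratio is sketched below. The slide does not give the formula, so everything here is an assumption: the log-likelihood values of segments that do and do not belong to the event are each modeled as a Gaussian, and the confidence is the normalized ratio f1 / (f1 + f0). The distribution parameters are invented for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def confidence(log_likelihood, in_class, out_class):
    """Confidence in [0, 1] that a segment belongs to the audio event.

    in_class and out_class are (mean, std) pairs describing how the
    log-likelihood is distributed for segments that do and do not
    belong to the event (hypothetical parameters).
    """
    f1 = gaussian_pdf(log_likelihood, *in_class)
    f0 = gaussian_pdf(log_likelihood, *out_class)
    if f1 + f0 == 0.0:
        return 0.5  # no evidence either way
    return f1 / (f1 + f0)


IN_CLASS = (-50.0, 10.0)     # illustrative values, not from the paper
OUT_CLASS = (-120.0, 20.0)   # illustrative values, not from the paper
high = confidence(-55.0, IN_CLASS, OUT_CLASS)
low = confidence(-130.0, IN_CLASS, OUT_CLASS)
```

A score near 1 means the segment's log-likelihood looks typical of the event; a score near 0 means it looks typical of other sounds.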
![Page 17: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/17.jpg)
Confidence Evaluation (depicted)
These confidence scores are the input of high-level modeling and provide important clues to bridge audio event and semantic context.
![Page 18: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/18.jpg)
Semantic Context Modeling (GMM)
Goal: detect high-level semantic context based on the confidence scores of audio events that are highly relevant to the semantic concept
Training data: 30 gunplay and car-chasing scenes, each 3-5 min long, are selected from 10 Hollywood action movies
Five-fold cross-validation (random split: 24 for training, 6 for testing)
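The 24/6 split follows directly from dividing 30 scenes into five folds. A minimal sketch, assuming scenes are identified by indices 0 to 29 and shuffled once with a fixed seed (the function name and seed are hypothetical):

```python
import random


def five_fold_splits(n_items=30, n_folds=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # random but reproducible ordering
    fold_size = n_items // n_folds        # 6 scenes per fold
    splits = []
    for k in range(n_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        splits.append((train, test))
    return splits


splits = five_fold_splits()
```

Each scene is used for testing exactly once and for training in the other four folds.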
![Page 19: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/19.jpg)
GMM how does it work?
A semantic context lasts for a period of time, and not all relevant audio events exist at every moment
A texture window of 5 sec is defined, with 2.5 sec overlap
Go through the confidence values (analysis window with 1-sec step)
Construct pseudo-semantic features
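The windowing above can be sketched as follows. The slide does not say which statistics make up the pseudo-semantic features, so the mean and standard deviation of the confidence scores inside each texture window are used here purely as an assumption; the function name is hypothetical.

```python
def texture_windows(confidences, window=5.0, hop=2.5, step=1.0):
    """Summarize 1-second confidence scores over overlapping texture windows.

    confidences[k] is the confidence of the analysis window starting at
    time k * step. Returns one (mean, std) pair per 5-second texture
    window, with consecutive windows overlapping by 2.5 seconds.
    """
    duration = len(confidences) * step
    features = []
    start = 0.0
    while start + window <= duration:
        inside = [c for k, c in enumerate(confidences)
                  if start <= k * step < start + window]
        mean = sum(inside) / len(inside)
        var = sum((c - mean) ** 2 for c in inside) / len(inside)
        features.append((mean, var ** 0.5))
        start += hop
    return features


# Ten seconds of scores for one audio event: absent, then present.
scores = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
feats = texture_windows(scores)
```

In a full system, one such pair would be computed per relevant audio event and the pairs concatenated into the feature vector fed to the GMM.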
![Page 20: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/20.jpg)
GMM how does it work?
Semantic context detection
In the case of gunplay scenes, if all the feature elements of gunshot and explosion events are located in the detection regions, the segment is said to convey the semantics of gunplay.
![Page 21: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/21.jpg)
Semantic Context Modeling (HMM)
Critique of the GMM model:
◦ Does not model the time duration density
◦ Segments may have low or high confidence scores due to environmental sounds or sound merging
The HMM captures the spectral variation of acoustic features in time by considering state transitions and giving different likelihood values
◦ Ergodic HMM, i.e. fully connected HMM
![Page 22: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/22.jpg)
HMM how does it work?
Calculate the probability of the partial observation sequence and state i at time t, given some model λ
Using the Forward algorithm, calculate the log-likelihood value that represents how likely a semantic context is to occur
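The quantity described above is the forward variable of the standard Forward algorithm. It is written out below in the usual textbook notation; the symbols π, a, b, N and T do not appear on the slides and are assumed here: N states, an observation sequence O = o_1 … o_T, initial probabilities π_i, transition probabilities a_ij and observation probabilities b_j(o).

```latex
\begin{align*}
\alpha_t(i) &= P(o_1 o_2 \cdots o_t,\; q_t = i \mid \lambda)
  && \text{(forward variable)} \\
\alpha_1(i) &= \pi_i \, b_i(o_1), \qquad 1 \le i \le N
  && \text{(initialization)} \\
\alpha_{t+1}(j) &= \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big] \, b_j(o_{t+1}),
  \qquad 1 \le t \le T-1
  && \text{(induction)} \\
\log P(O \mid \lambda) &= \log \sum_{i=1}^{N} \alpha_T(i)
  && \text{(log-likelihood)}
\end{align*}
```

The final line is the log-likelihood value used to judge how likely a semantic context is to occur.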
![Page 23: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/23.jpg)
Performance
Uncertainty is avoided: aural information tends to remain the same whether the visual scene is day or night, downtown or forest
It is rare to have a car-chasing concept without engine sound
High precision indicates high confidence in the detection results
Short events, e.g. car braking, yield lower precision
False alarms, i.e. incorrect detections
![Page 24: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/24.jpg)
Performance
![Page 25: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/25.jpg)
Indexing and Retrieval
Concept match between aural and visual information
If visual information is taken into account, characteristic consistency between different video clips with the same concept
Generalized framework: audio events can be replaced by visual object models, so that both audio and audiovisual content can be detected
![Page 26: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/26.jpg)
Future Work
Careful design of pseudo-semantic feature vectors to construct a meta-classifier (feature selection pool)
Blind source separation (media-aesthetic rules)
![Page 27: Mohammad S. Al Awad 985426 26-May-2008](https://reader035.vdocuments.us/reader035/viewer/2022070500/568168ac550346895ddf57de/html5/thumbnails/27.jpg)
Thank you