

Structured TV Shows — “You Have Been Chopped”

1. Introduction

With an increasing demand for rich, immersive, and interactive experiences, viewers want to engage with their favorite TV shows to learn about and discuss the characters, backstories, and potential arcs in the storyline. Many shows have long story arcs spanning multiple episodes and even seasons, leading to complex narratives such as "Game of Thrones". On the other end of the spectrum, we have highly structured shows with a tightly scripted format that does not vary much across episodes or even multiple seasons; examples include the TV shows "Shark Tank", "Chopped", "The Voice", and "American Ninja Warrior", involving entrepreneurs, chefs, singers, and athletes respectively. The structure arises from the shows using the same venues, the same musical themes for specific transitions, and specific words for contextual changes. Chaptering is critical for extracting this contextual information.

Most existing literature focuses on video segmentation or chaptering using audio-visual analysis (Sargent et al., 2014; Aoki, 2006). These methods are not robust enough to produce semantic chapters. Previous work (Yamamoto et al., 2014) presents a semantic segmentation of TV programs based on the detection of corner-subtitles, but the corresponding semantic information is limited to Japanese TV programs.

To build a context-sensitive video question answering system, we leverage the structure of the show to extract metadata through Closed Caption (CC) analysis and visual analysis. In particular, we incorporate semantics through domain knowledge, in the form of transition phrases, in order to chapter structured TV shows. The metadata are then used for indexing and retrieval of entities of interest with respect to chapters, and the corresponding chopped segments containing the entities are presented to end users. We build a video question answering system for the TV show "Chopped", and the system is illustrated in Figure 1.

2. Semantic Analysis of Videos

Data Preparation. We processed 60 videos of "Chopped" consisting of various episodes from six seasons (27, 23, 22, 21, 15, and 14). CC segments were extracted from the raw videos using the ffmpeg package. We use information about the show, such as cooking ingredients, episode title, names of the judges, participating contestants, and the winners of the various rounds, extracted from Wikipedia. We extract food-related phrases from the CC using Stanford CoreNLP (Manning et al., 2014) and Word2Vec (Mikolov et al., 2013).

Chopped segments. We segment the videos into chapters with very little domain knowledge, instead leveraging the shared structure across "Chopped" episodes. The first piece of domain knowledge is that each episode consists of three chapters, namely the cooking of appetizers, entrees, and desserts. We also obtain seed examples of what the host says during a chapter transition; e.g., when the host says "Please open your baskets", the ingredient basket for the upcoming chapter is revealed, which we use as a signal for the start of a chapter. Similarly, "Chef X, you have been chopped" is a phrase uttered at the end of a chapter. With only this domain knowledge, we create a small set of regular expressions to catch the CC text signaling a chapter transition.
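This matching step can be sketched as follows; the patterns below are hypothetical stand-ins illustrating the idea, not our actual production regular expressions:

```python
import re

# Illustrative (assumed) patterns for the seed transition phrases described above.
START_RE = re.compile(r"please\s+open\s+your\s+baskets", re.IGNORECASE)
END_RE = re.compile(r"chef\s+\w+.{0,20}you\s+have\s+been\s+chopped", re.IGNORECASE)

def find_markers(captions):
    """Given (timestamp, text) caption pairs, return candidate start and
    end marker timestamps by matching the transition regexes."""
    starts = [t for t, text in captions if START_RE.search(text)]
    ends = [t for t, text in captions if END_RE.search(text)]
    return starts, ends
```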

For the i-th episode, we find all captions that match the regular expressions, which produces a set of candidate start and end markers, S_i and E_i, respectively. While the regular expressions are expected to accurately find the start and end markers in most episodes, typos and linguistic variations may create noise (e.g., the second start marker may fall between the first start and end markers). Because of this, recovering the actual interval for each chapter is not always trivial. To select the optimal chapter intervals, we rely on reference episodes, for which we have the expected sequence of start and end markers for each chapter (i.e., for "Chopped", this means three start and three end markers, for the appetizer, entree, and dessert chapters). We find the earliest start time and latest end time for each chapter, across all episodes, denoted by (S_app, E_app), (S_entr, E_entr), and (S_dsrt, E_dsrt). Given these reference intervals, as well as the candidate start and end markers of the i-th episode, we compute the optimal start time of each chapter by finding the earliest start marker that falls into the reference interval for that chapter. Similarly, we compute the optimal end time by finding the latest such end marker:

S^i_app = min{ s | s ∈ S_i ∧ s ∈ (S_app, E_app) }

E^i_app = max{ e | e ∈ E_i ∧ e ∈ (S_app, E_app) }
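This selection rule can be sketched directly; the fallback behavior when no candidate marker falls inside the reference interval is an assumption of this sketch, not specified in the text:

```python
def chapter_interval(cand_starts, cand_ends, ref_interval):
    """Pick one episode's chapter boundaries: the earliest candidate start
    and the latest candidate end that fall inside the reference interval
    learned across episodes."""
    lo, hi = ref_interval
    starts = [s for s in cand_starts if lo <= s <= hi]
    ends = [e for e in cand_ends if lo <= e <= hi]
    if not starts or not ends:
        return None  # assumed fallback: defer to the reference interval or manual review
    return min(starts), max(ends)
```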

OCR on Selected Scenes. Since we know when the contents of the basket are revealed, and since the ingredients are not included in the closed captions, we recognize the characters on screen at those moments with an optical character recognition (OCR) algorithm, namely Tesseract OCR (Smith, 2007).

Figure 1. Our system screenshot for input query "bok choy" and retrieved video compilation.

Specifically, we run the OCR only on possible text-region candidates for computational efficiency. The region candidates are found by combining morphological filtering with binarization of the images using the Otsu threshold (Otsu, 1979). Tesseract OCR fine-tunes the word regions within the candidate regions by line finding, baseline fitting, and proportional word finding. It then encodes each character into a segment feature via polygonal approximation and efficiently classifies it by quantizing the feature vector and mapping it to a hash code. Once we have the OCR outputs for multiple frames of the food basket ingredients, we clean them up and find the food phrases by (a) removing strings that appear in fewer than ten frames, and (b) keeping only phrases that are in the vocabulary of the Google word embeddings.
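The two-step cleanup of the per-frame OCR output can be sketched as follows, with `vocabulary` as a plain set standing in for the Google word-embedding vocabulary (an assumption of this sketch):

```python
from collections import Counter

def clean_ocr_phrases(frame_phrases, vocabulary, min_frames=10):
    """Filter noisy per-frame OCR strings: (a) drop strings seen in fewer
    than `min_frames` frames, then (b) keep only phrases whose tokens are
    all in the embedding vocabulary."""
    counts = Counter()
    for phrases in frame_phrases:      # one collection of strings per frame
        counts.update(set(phrases))    # count each string at most once per frame
    kept = [p for p, c in counts.items() if c >= min_frames]
    return [p for p in kept if all(w in vocabulary for w in p.lower().split())]
```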

Indexing Food Mentions. Noun phrases that appear in closed captions across all episodes are found using Stanford CoreNLP. We create an inverted index, so that given a noun phrase, we can access all time intervals of each episode where it was mentioned. We also create a filter specifically for "Chopped", where we want to distinguish food phrases from other noun phrases. To automate this, we take advantage of pre-trained word embeddings (Mikolov et al., 2013). The vector of a noun phrase is the average of the vectors of its words (ignoring stop words). We then compare this to the vector of "food", and keep phrases with a cosine similarity beyond a threshold (0.20 in our case).
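A minimal sketch of this filter, assuming `embeddings` is a dict mapping words to vectors and `food_vec` is the pre-computed vector for "food"; the 0.20 threshold follows the text:

```python
import numpy as np

def is_food_phrase(phrase, embeddings, food_vec, stop_words, threshold=0.20):
    """Average the word vectors of `phrase` (ignoring stop words and
    out-of-vocabulary words) and keep the phrase if its cosine similarity
    to the "food" vector exceeds `threshold`."""
    vecs = [embeddings[w] for w in phrase.lower().split()
            if w not in stop_words and w in embeddings]
    if not vecs:
        return False
    v = np.mean(vecs, axis=0)
    sim = float(np.dot(v, food_vec) / (np.linalg.norm(v) * np.linalg.norm(food_vec)))
    return sim > threshold
```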

Search Interface. As a use case for the metadata that we extracted from "Chopped", we built an application where the user can query an ingredient or dish, from which we generate a custom "Chopped" compilation video. Our approach uses the inverted index of food phrases to retrieve all time intervals in which the query was mentioned, across all episodes. Since each time interval is very short (a few seconds on average), we merge nearby mentions as much as possible, so that the compiled video is not chopped into too many pieces. For this, for each episode, we sort all retrieved intervals and merge consecutive ones if the difference is less than 5 seconds; we keep doing this until no two intervals are closer than 5 seconds apart. To add more context, we then expand each interval by 3 seconds to the left and right. While this heuristic works fine in many cases, selecting the optimal expansion point is a non-trivial problem. We can use the closed captions to find sentence boundaries, as well as visual features to detect shot boundaries.
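The merge-and-expand heuristic above can be sketched as follows (clamping the left edge at zero is our addition, not stated in the text):

```python
def merge_intervals(intervals, gap=5.0, pad=3.0):
    """Merge mention intervals whose difference is less than `gap` seconds,
    then expand each merged interval by `pad` seconds on both sides."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < gap:
            merged[-1][1] = max(merged[-1][1], end)  # absorb into previous interval
        else:
            merged.append([start, end])
    # Expand for context, clamping at the start of the video.
    return [(max(0.0, s - pad), e + pad) for s, e in merged]
```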

3. Conclusions and Future Work

Domain knowledge in the form of a small set of key phrases can provide semantic content about a TV series, and when used in conjunction with a data-driven approach to align the structure across episodes, it can help with chaptering each episode. We demonstrated this successfully for the TV show "Chopped", where we built a query engine to search for food phrases across episodes. We would like to generalize the approach to other structured shows such as "Shark Tank" and "American Ninja Warrior", among others. In addition, using video information such as shot boundaries and audio information such as speaker diarization can help with finer segmentation of relevant scenes (Knyazeva et al., 2015).

References

Aoki, Hisashi. High-speed topic organizer of TV shows using video dialog detection. Systems and Computers in Japan, 37(6), 2006.


Knyazeva, Elena, Wisniewski, Guillaume, Bredin, Hervé, and Yvon, François. Structured prediction for speaker identification in TV series. In Interspeech 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 2015.

Manning, Christopher D., Surdeanu, Mihai, Bauer, John, Finkel, Jenny, Bethard, Steven J., and McClosky, David. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-5010.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013. URL http://arxiv.org/abs/1301.3781.

Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Sys., Man., Cyber., 1979.

Sargent, Gabriel, Hanna, Pierre, and Nicolas, Henri. Segmentation of music video streams in music pieces through audio-visual analysis. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 724–728, 2014.

Smith, R. An overview of the Tesseract OCR engine. In Proc. Ninth Int. Conference on Document Analysis and Recognition (ICDAR), pp. 629–633, 2007.

Yamamoto, Koji, Takayama, Shunsuke, and Aoki, Hisashi. Semantic segmentation of TV programs using corner-subtitles. In 2009 IEEE 13th International Symposium on Consumer Electronics, pp. 205–208, 2014.