

Structured TV Shows — “You Have Been Chopped”

1. Introduction

With an increasing demand for rich, immersive, and interactive experiences, viewers want to engage with their favorite TV shows to learn about and discuss the characters, backstories, and potential arcs in the storyline. Many shows have long story arcs spanning multiple episodes and even seasons, leading to complex narratives such as "Game of Thrones". On the other end of the spectrum, we have highly structured shows with a tightly scripted format that does not vary much across episodes or even multiple seasons; examples include the TV shows "Shark Tank", "Chopped", "The Voice", and "American Ninja Warrior", involving entrepreneurs, chefs, singers, and athletes respectively. The structure arises from the shows using the same venues, the same musical themes for specific transitions, and specific words for contextual changes. Chaptering is critical for extracting this contextual information.

Most existing literature focuses on video segmentation or chaptering using audio-visual analysis (Sargent et al., 2014; Aoki, 2006). These methods are not robust enough to produce semantic chapters. Previous work (Yamamoto et al., 2014) presents a semantic segmentation of TV programs based on the detection of corner-subtitles, but the corresponding semantic information is limited to Japanese TV programs.

To build a context-sensitive video question answering system, we leverage the structure of the show to extract metadata through Closed Caption (CC) analysis and visual analysis. In particular, we incorporate semantics through domain knowledge, in the form of transition phrases, in order to chapter structured TV shows. The metadata are then used for indexing and retrieval of entities of interest with respect to chapters, and the corresponding chopped segments containing the entities are presented to end users. We build a video question answering system for the TV show "Chopped", and the system is illustrated in Figure 1.

2. Semantic Analysis of Videos

Data Preparation. We processed 60 videos of "Chopped" consisting of various episodes from six seasons (27, 23, 22, 21, 15, and 14). CC segments were extracted from the raw videos using the ffmpeg package. We use information about the show, such as cooking ingredients, episode title, names of the judges, participating contestants, and the winners of the various rounds, extracted from Wikipedia. We extract food-related phrases from the CC using Stanford CoreNLP (Manning et al., 2014) and Word2Vec (Mikolov et al., 2013).

Chopped segments. We segment the videos into chapters with very little domain knowledge, instead leveraging the shared structure across "Chopped" episodes. The first piece of domain knowledge is that each episode consists of three chapters, namely the cooking of appetizers, entrees, and desserts. We also obtain seed examples of what the host says during a chapter transition; e.g., when the host says "Please open your baskets", the ingredient basket for the upcoming chapter is revealed, which we use as a signal for the start of a chapter. Similarly, "Chef X, you have been chopped" is a phrase uttered at the end of a chapter. With only this domain knowledge, we create a small set of regular expressions to catch the CC text signaling a chapter transition.
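This matching step can be sketched as follows; the patterns below are hypothetical stand-ins illustrating the idea, not our actual production regular expressions:

```python
import re

# Illustrative (assumed) patterns for the seed transition phrases described above.
START_RE = re.compile(r"please\s+open\s+your\s+baskets", re.IGNORECASE)
END_RE = re.compile(r"chef\s+\w+.{0,20}you\s+have\s+been\s+chopped", re.IGNORECASE)

def find_markers(captions):
    """Given (timestamp, text) caption pairs, return candidate start and
    end marker timestamps by matching the transition regexes."""
    starts = [t for t, text in captions if START_RE.search(text)]
    ends = [t for t, text in captions if END_RE.search(text)]
    return starts, ends
```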

For the i-th episode, we find all captions that match the regular expressions, which produces a set of candidate start and end markers, S_i and E_i, respectively. While the regular expressions are expected to accurately find the start and end markers in most episodes, typos and linguistic variations may create noise (e.g., the second start marker may fall between the first start and end markers). Because of this, recovering the actual interval for each chapter is not always trivial. To select the optimal chapter intervals, we rely on reference episodes, for which we have the expected sequence of start and end markers for each chapter (i.e., for "Chopped", this means three start and three end markers, for the appetizer, entree, and dessert chapters). We find the earliest start time and latest end time for each chapter, across all episodes, denoted by (S_app, E_app), (S_entr, E_entr), and (S_dsrt, E_dsrt). Given these reference intervals, as well as the candidate start and end markers of the i-th episode, we compute the optimal start time of each chapter by finding the earliest start marker that falls into the reference interval for that chapter. Similarly, we compute the optimal end time by finding the latest such end marker:

S^i_app = min{ s | s ∈ S_i ∧ s ∈ (S_app, E_app) }

E^i_app = max{ e | e ∈ E_i ∧ e ∈ (S_app, E_app) }
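This selection rule can be sketched directly; the fallback behavior when no candidate marker falls inside the reference interval is an assumption of this sketch, not specified in the text:

```python
def chapter_interval(cand_starts, cand_ends, ref_interval):
    """Pick one episode's chapter boundaries: the earliest candidate start
    and the latest candidate end that fall inside the reference interval
    learned across episodes."""
    lo, hi = ref_interval
    starts = [s for s in cand_starts if lo <= s <= hi]
    ends = [e for e in cand_ends if lo <= e <= hi]
    if not starts or not ends:
        return None  # assumed fallback: defer to the reference interval or manual review
    return min(starts), max(ends)
```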

OCR on Selected Scenes. Since we know when the contents of the basket are revealed, and since the ingredients are not included in the closed captions, we recognize the characters on screen at those moments with an optical character recognition (OCR) algorithm, namely Tesseract OCR (Smith, 2007).

Figure 1. Our system screenshot for input query "bok choy" and retrieved video compilation.

Specifically, we run the OCR only on possible text-region candidates for computational efficiency. The region candidates are found by combining morphological filtering with binarization of the images using the Otsu threshold (Otsu, 1979). Tesseract OCR fine-tunes the word regions within the candidate regions by line finding, baseline fitting, and proportional word finding. It then encodes each character into a segment feature via polygonal approximation and efficiently classifies it by quantizing the feature vector and mapping it to a hash code. Once we have the OCR outputs for multiple frames of the food basket ingredients, we clean them up and find the food phrases by (a) removing strings that appear in fewer than ten frames, and (b) keeping only phrases that are in the vocabulary of the Google word embeddings.
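The two-step cleanup of the per-frame OCR output can be sketched as follows, with `vocabulary` as a plain set standing in for the Google word-embedding vocabulary (an assumption of this sketch):

```python
from collections import Counter

def clean_ocr_phrases(frame_phrases, vocabulary, min_frames=10):
    """Filter noisy per-frame OCR strings: (a) drop strings seen in fewer
    than `min_frames` frames, then (b) keep only phrases whose tokens are
    all in the embedding vocabulary."""
    counts = Counter()
    for phrases in frame_phrases:      # one collection of strings per frame
        counts.update(set(phrases))    # count each string at most once per frame
    kept = [p for p, c in counts.items() if c >= min_frames]
    return [p for p in kept if all(w in vocabulary for w in p.lower().split())]
```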

Indexing Food Mentions. Noun phrases that appear in closed captions across all episodes are found using Stanford CoreNLP. We create an inverted index, so that given a noun phrase, we can access all time intervals of each episode where it was mentioned. We also create a filter specifically for "Chopped", where we want to distinguish food phrases from other noun phrases. To automate this, we take advantage of pre-trained word embeddings (Mikolov et al., 2013). The vector of a noun phrase is the average of the vectors of its words (ignoring stop words). We then compare this to the vector of "food", and keep phrases with a cosine similarity beyond a threshold (0.20 in our case).
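A minimal sketch of this filter, assuming `embeddings` is a dict mapping words to vectors and `food_vec` is the pre-computed vector for "food"; the 0.20 threshold follows the text:

```python
import numpy as np

def is_food_phrase(phrase, embeddings, food_vec, stop_words, threshold=0.20):
    """Average the word vectors of `phrase` (ignoring stop words and
    out-of-vocabulary words) and keep the phrase if its cosine similarity
    to the "food" vector exceeds `threshold`."""
    vecs = [embeddings[w] for w in phrase.lower().split()
            if w not in stop_words and w in embeddings]
    if not vecs:
        return False
    v = np.mean(vecs, axis=0)
    sim = float(np.dot(v, food_vec) / (np.linalg.norm(v) * np.linalg.norm(food_vec)))
    return sim > threshold
```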

Search Interface. As a use case for the metadata that we extracted from "Chopped", we built an application where the user can query an ingredient or dish, from which we generate a custom "Chopped" compilation video. Our approach uses the inverted index of food phrases to retrieve all time intervals in which the query was mentioned, across all episodes. Since each time interval is very short (a few seconds on average), we merge nearby mentions as much as possible, so that the compiled video is not chopped into too many pieces. For this, for each episode, we sort all retrieved intervals and merge consecutive ones if the difference is less than 5 seconds; we keep doing this until no two intervals are closer than 5 seconds apart. To add more context, we then expand each interval by 3 seconds to the left and right. While this heuristic works fine in many cases, selecting the optimal expansion point is a non-trivial problem. We can use the closed captions to find sentence boundaries, as well as visual features to detect shot boundaries.
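The merge-and-expand heuristic above can be sketched as follows (clamping the left edge at zero is our addition, not stated in the text):

```python
def merge_intervals(intervals, gap=5.0, pad=3.0):
    """Merge mention intervals whose difference is less than `gap` seconds,
    then expand each merged interval by `pad` seconds on both sides."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < gap:
            merged[-1][1] = max(merged[-1][1], end)  # absorb into previous interval
        else:
            merged.append([start, end])
    # Expand for context, clamping at the start of the video.
    return [(max(0.0, s - pad), e + pad) for s, e in merged]
```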

3. Conclusions and Future Work

Domain knowledge in the form of a small set of key phrases can provide semantic content about a TV series, and when used in conjunction with a data-driven approach to align the structure across episodes, it can help with chaptering each episode. We demonstrated this successfully for the TV show "Chopped", where we built a query engine to search for food phrases across episodes. We would like to generalize the approach to other structured shows such as "Shark Tank" and "American Ninja Warrior", among others. In addition, using video information such as shot boundaries and audio information such as speaker diarization can help with finer segmentation of relevant scenes (Knyazeva et al., 2015).

References

Aoki, Hisashi. High-speed topic organizer of TV shows using video dialog detection. Systems and Computers in Japan, 37(6), 2006.


Knyazeva, Elena, Wisniewski, Guillaume, Bredin, Hervé, and Yvon, François. Structured prediction for speaker identification in TV series. In Interspeech 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 2015.

Manning, Christopher D., Surdeanu, Mihai, Bauer, John, Finkel, Jenny, Bethard, Steven J., and McClosky, David. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-5010.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013. URL http://arxiv.org/abs/1301.3781.

Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Sys., Man., Cyber., 1979.

Sargent, Gabriel, Hanna, Pierre, and Nicolas, Henri. Segmentation of music video streams in music pieces through audio-visual analysis. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 724–728, 2014.

Smith, R. An overview of the Tesseract OCR engine. In Proc. Ninth Int. Conference on Document Analysis and Recognition (ICDAR), pp. 629–633, 2007.

Yamamoto, Koji, Takayama, Shunsuke, and Aoki, Hisashi. Semantic segmentation of TV programs using corner-subtitles. In 2009 IEEE 13th International Symposium on Consumer Electronics, pp. 205–208, 2014.