
Instructional Videos for Unsupervised Harvesting and Learning of Action Examples

Shoou-I Yu
Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213, USA
[email protected]

Lu Jiang
Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213, USA
[email protected]

Alexander Hauptmann
Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213, USA
[email protected]

ABSTRACT
Online instructional videos have become a popular way for people to learn new skills encompassing art, cooking and sports. As watching instructional videos is a natural way for humans to learn, machines can analogously gain knowledge from these videos. We propose to utilize the large number of instructional videos available online to harvest examples of various actions in an unsupervised fashion. The key observation is that in instructional videos, the instructor’s actions are highly correlated with the instructor’s narration. By leveraging this correlation, we can exploit the timing of action-related terms in the speech transcript to temporally localize actions in the video and harvest action examples. The proposed method is scalable as it requires no human intervention. Experiments show that the harvested examples are of reasonably good quality, and action detectors trained on data collected by our unsupervised method yield performance comparable to detectors trained on manually collected data for the TRECVID Multimedia Event Detection task.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.4 [Information Systems Applications]: Miscellaneous

General Terms
Algorithms, Experimentation, Performance

Keywords
Semantic Action Detection; Unsupervised Data Collection; Multimedia Event Detection

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM’14, November 3–7, 2014, Orlando, Florida, USA.
Copyright 2014 ACM 978-1-4503-3063-3/14/11 ...$15.00.
http://dx.doi.org/10.1145/2647868.2654997

Figure 1: Positive “drill hole” examples harvested from instructional videos in an unsupervised fashion. Blue words correspond to action signifiers, and the red words correspond to verb-noun pairs found by a dependency parser. Drilling holes in diverse contexts such as in walls and on ice are collected.

1. INTRODUCTION
Online user-generated instructional videos, also known as How-To videos, have become a rich resource for people to learn everyday skills encompassing art, cooking and sports. Instructional videos teach viewers how to perform a specific task through clear explanations and demonstrations by the instructor. With the recent substantial growth of media platforms such as YouTube, a considerable number of user-generated instructional videos covering a wide variety of tasks are available online for humans to view and learn from.

Since instructional videos are designed to facilitate learning for humans, machines should analogously also be able to learn from this rich resource. In this paper, we focus on exploiting instructional videos to harvest examples of different actions in an unsupervised fashion. Semantic action and concept detectors trained on video data play an important role in various video content analysis tasks such as multimedia event detection [15, 10, 7, 17, 3, 11]. However, training a semantic detector requires a pool of positive examples, which is tedious to collect manually. Manually labeling video data is more challenging than labeling image data, as the former requires the extra step of temporally localizing the action. For example, the collection of the TRECVID Semantic Indexing data set [12], a very large annotated video data set, required many person-years of effort from more than 20 companies and universities. The 346 manually annotated concepts in this data set include 35 action-related concepts with a total of 17,000 positive examples.


The positive examples of other existing video data sets such as UCF50 [14] and HMDB51 [8] were also collected manually.

In light of the aforementioned problems, we propose to harvest examples of various actions in an unsupervised fashion by utilizing the large amount of instructional videos online. We exploit the visual-speech correlation in instructional videos for unsupervised action example collection. To teach effectively, instructors in instructional videos not only visually demonstrate what they are doing at each step, but also verbally explain the details of each step nearly simultaneously. For example, the instructor will say “I am going to take my drill and drill a hole” while drilling a hole into some wood, as shown in Figure 1. This semantic audio-visual correlation residing in instructional videos is the key to our unsupervised action example collection method. When the instructor says phrases such as “going to” or “let us”, which signify that some action will occur immediately, we identify the action that will be visible in the video using the speech transcript and Natural Language Processing techniques. The action examples localized in this unsupervised fashion can be used to train semantic action detectors. In sum, our contributions are as follows.

1. We propose to utilize the ever-increasing number of instructional videos available on the Internet to harvest positive examples for video action detectors in an unsupervised fashion. As the examples are obtained without any human intervention, the proposed method is scalable.

2. Experiments show that action examples harvested with our method are of reasonable quality, and detectors trained with the harvested data perform comparably to detectors trained on manually collected data in the TRECVID Multimedia Event Detection task [12].

2. RELATED WORK
The challenge of efficiently collecting image or video annotations has been tackled in different ways, but most existing methods still require a human in the loop. The annotations of ImageNet [5] and TRECVID Semantic Indexing [12] were all collected manually. ImageNet utilized the Amazon Mechanical Turk (AMT) platform to acquire annotations. The annotation of the Semantic Indexing data set required many person-years of effort from more than 20 participants, including companies and universities around the world. This collaborative annotation process was tedious and time-consuming. To make the annotation task less tedious and more fun, “Gamification” [16, 6] transformed the annotation task into a game, so that as humans play the game, more and more annotations are collected. Using this method, the labels are obtained as a by-product of a competitive game, but designing an interesting game that will attract many players is very challenging. One method for collecting image training data without a human in the loop is to utilize search engine output [5]. However, existing work mainly focused on still images, whereas our method focuses on temporal localization of actions in videos, which is more challenging.

3. UNSUPERVISED DATA COLLECTION
We observe a temporal correlation between speech and actions in instructional videos and utilize this correlation to collect examples for actions in an unsupervised fashion. Our pipeline is shown in Figure 2. Given an instructional video, the speech transcript is obtained from either closed captions or automatic speech recognition output. The next step is to scan the transcript and look for sentences containing action signifiers such as “going to” and “let us”, which signify that some action will occur in the near future. Having found an action signifier in a sentence, a dependency parser [4] parses the sentence to determine which verb and noun are described by the action signifier. Specifically, we look for direct object (dobj) relations [4]. The dobj relation shows which noun is the accusative object of a verb. For example, in the sentence “You are going to bend your left leg”, the verb “bend” and the noun “leg” form a dobj relation. We assume that as the instructor says the sentence with the action signifier, the instructor is also performing the verb of the dobj relation on the noun of the dobj relation. Therefore, under this assumption, the speech timing information of the dobj relation is utilized to localize the action. This assumption is substantiated through manual verification in Section 4.1. This process is performed over the whole data set to obtain a diverse set of dobj relations, each with many positive examples. Non-visual dobj relations such as “ask me” and “take care” can be filtered out with a predefined ontology of visible words such as ImageNet’s ontology [5, 19]. The final output is a list of dobj relations, each with many positive video examples extracted in an unsupervised fashion from instructional videos.

Figure 2: Pipeline for unsupervised harvesting of action examples from instructional videos.
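As an illustrative sketch of this step (not the original implementation), the following Python snippet shows how a caption sentence containing an action signifier could be reduced to a (verb, noun, time) triple via its dobj relation. It uses spaCy as a stand-in for the Stanford Dependency Parser [4]; the signifier list matches Section 4, while the function name is an assumption of this example.

# Minimal sketch of the signifier + dobj extraction step. spaCy is used here
# only as an illustrative substitute for the Stanford Dependency Parser.
import spacy

nlp = spacy.load("en_core_web_sm")

ACTION_SIGNIFIERS = ("going to", "gonna", "let us", "let's", "we'll", "we will")

def extract_dobj_relations(sentence, spoken_at):
    """Return (verb, noun, time) triples for dobj relations found in a
    caption sentence that contains an action signifier."""
    if not any(sig in sentence.lower() for sig in ACTION_SIGNIFIERS):
        return []
    relations = []
    for token in nlp(sentence):
        # token is the direct object noun, token.head is its governing verb.
        if token.dep_ == "dobj" and token.head.pos_ == "VERB":
            relations.append((token.head.lemma_, token.lemma_, spoken_at))
    return relations

# "You are going to bend your left leg" spoken 42.0 s into the video
# -> [("bend", "leg", 42.0)]
print(extract_dobj_relations("You are going to bend your left leg", 42.0))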

There are three main advantages to our method. First, the collection of action examples is completely unsupervised. Therefore, even though the action signifiers may not locate all instances of a given action in a video, we can still collect large numbers of positive examples at no manual labor cost as more instructional videos are uploaded. This is barely possible for data sets collected with manual labor [14, 8, 12]. Second, action examples are associated with context. For example, instead of just learning the action “cut”, our method automatically learns objects associated with this action, such as “cut onion”, “cut paper” and “cut hair”, as shown in Figure 3. Finally, as more data is collected from instructional videos on various tasks, more diverse actions tend to be collected. For example, in Figure 1, we collected diverse positive examples for “drill hole”: holes could be drilled in wood, in a wall or even in ice.

4. EXPERIMENTS
To evaluate the performance of our unsupervised action example collection, we downloaded around 73,000 videos from online archives such as Creative Commons, Youku, Tudou and YouTube.


Category          Verb and Noun Both Correct   Verb Correct   Noun Correct
Ball Sports       0.256                        0.280          0.727
Cooking           0.623                        0.692          0.742
Exercise / Yoga   0.866                        0.889          0.948
Sewing / Tools    0.727                        0.735          0.860
Average           0.618                        0.649          0.819

Table 1: Average accuracy of collected action examples for 38 classes from different categories.

For each video, the raw closed captions are available. We discuss how to utilize videos without closed captions in Section 5. The closed captions were processed using the method detailed in Section 3. The action signifiers used were “going to”, “gonna”, “let us”, “let’s”, “we’ll”, and “we will”. The Stanford Dependency Parser [4] was used for parsing. The verbs and nouns of all dobj relations found were stemmed with the Porter Stemmer [13] to remove morphological differences. All dobj relations that occurred fewer than 20 times were removed. As the coverage of the visible verb and noun ontology from [19] was not perfect, we manually filtered out non-visual relations. In the end, 25,835 positive examples were collected for 541 dobj relations. To extract a video segment for a given dobj relation, we cut out a 10-second clip which starts 2.5 seconds before the dobj relation is spoken and ends 7.5 seconds after. This is based on the observation that the action usually occurs slightly after the instructor mentions the dobj relation. The quality of the collected labels is evaluated in two ways. First, the accuracy of the collected labels was manually assessed. Second, we compared the effectiveness of action detectors trained on our collected examples to detectors trained on manually annotated data when used as an intermediate representation for a separate task, Multimedia Event Detection (MED).
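A minimal sketch of this corpus-level post-processing, assuming NLTK’s Porter stemmer; the helper names and default arguments are assumptions of this example, not part of the original pipeline.

# Sketch of dobj normalization, frequency filtering and clip segmentation.
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize(verb, noun):
    # Porter-stem both words so that e.g. "cutting onions" and "cut onion" merge.
    return stemmer.stem(verb), stemmer.stem(noun)

def keep_frequent(relations, min_count=20):
    # Drop dobj relations observed fewer than 20 times across the corpus.
    counts = Counter(relations)
    return [r for r in relations if counts[r] >= min_count]

def clip_boundaries(spoken_at, video_duration, before=2.5, after=7.5):
    """Return (start, end) of the 10-second clip in seconds, clamped to the
    video length: 2.5 s before the phrase is spoken and 7.5 s after it."""
    return max(0.0, spoken_at - before), min(video_duration, spoken_at + after)

# A dobj relation spoken 63.0 s into a 300 s video -> clip (60.5, 70.5).
print(clip_boundaries(63.0, 300.0))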

4.1 Accuracy of Collected Labels
As our unsupervised method is based on the assumption that the actions and the speech in instructional videos are highly correlated, we verified this assumption by manually inspecting 38 representative classes and computing the accuracy of the collected labels for each class. Given a collected positive example and its corresponding dobj relation, there are different levels of correctness: either both the verb and the noun are correct, or only the verb or only the noun is correct. A verb or noun is correct if the action or object, respectively, can be seen in the video. The results are shown in Table 1. Due to space constraints, the 38 classes were grouped into 4 rough categories. The “Ball Sports” category contains relations such as “catch ball” and “hit ball”. The “Cooking” category contains relations such as “cut tomatoes” and “pour milk”. The “Exercise / Yoga” category contains relations such as “bend elbows” and “lift heel”. The “Sewing / Tools” category contains relations such as “cut paper”, “use screwdriver” and “wrap yarn”. In general, for categories other than ball sports, the dobj relation is completely correct around 70% of the time. For “Exercise / Yoga”, the accuracy goes up to 86.6% because the speech and this type of action are highly correlated. For “Ball Sports”, the speech and action are only weakly correlated, as most instructors do all the explaining before performing the action, which lowers the accuracy. However, the accuracy of the nouns alone is still 72.7% for “Ball Sports”. These results substantiate that our assumption of audio-visual correlation holds for most actions.

Figure 3: Positive examples for the verb “cut” collected from instructional videos. Our method collected the verb “cut” in many different contexts, from cutting with scissors and knives to paper cutters.

4.2 Intermediate Representation for MED
We compared the effectiveness of detectors trained on our harvested examples to detectors trained on manually annotated data from UCF50 [14] and HMDB51 [8]. An ideal way to evaluate action detectors from different data sets would be to compute cross-data-set performance on the actions the data sets have in common. However, as there are very few shared actions across data sets, we could not perform this experiment. Therefore, the effectiveness and generalizability of the detectors trained from each data set were evaluated by utilizing the output of the detectors as an intermediate representation for MED [9]. To create an intermediate representation for each MED video, each detector was applied to the video’s feature vector, yielding an n-dimensional intermediate representation, where n is the number of action detectors. This representation was used to perform event detection with a χ2 SVM [1]. The TRECVID MED 2011 development set [12], consisting of 9,746 videos, was used. For the evaluation metric, Mean Average Precision (MAP) was computed over 18 events.
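A minimal sketch of this evaluation step for a single event, assuming scikit-learn; the variable names, the train/test split and the rescaling of detector scores to non-negative values are assumptions of this example (the reported MAP averages over 18 events).

# Chi-squared-kernel SVM event detection on the intermediate representation.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.metrics import average_precision_score

def event_detection_ap(intermediate, event_labels, train_idx, test_idx):
    """Train a chi-squared-kernel SVM on the n-dimensional intermediate
    representation (one row per video, one column per detector score) and
    return average precision for one event on the held-out videos."""
    # chi2_kernel expects non-negative inputs, so detector scores are assumed
    # to have been rescaled to [0, 1] beforehand.
    svm = SVC(kernel=chi2_kernel)
    svm.fit(intermediate[train_idx], event_labels[train_idx])
    scores = svm.decision_function(intermediate[test_idx])
    return average_precision_score(event_labels[test_idx], scores)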

Action detectors were trained using state-of-the-art Improved Trajectories [18] with a Fisher Vector [2] representation. One-against-all linear SVMs were used. Detectors were trained using all the data from HMDB51 and UCF50 to obtain 51 and 50 detectors, respectively. From our collected data, 541 action detectors were trained. As a baseline, labeled “Clustering” in Table 2 and Figure 4, we ignored the labels of the 25,835 harvested examples and clustered the data into 541 clusters. The cluster assignment of each data point was treated as a pseudo-label and used to train 541 non-semantic detectors. This compares how good our collected labels are relative to pseudo-labels found by clustering. For a fair comparison between data sets with different numbers of detectors, we randomly selected d ∈ {10, 25, 50, 100, 250, 541} detectors from each data set.


Data Set                    Number of Intermediate Action Detectors
                            10           25           50           100          250          541
HMDB51 (Manual)             7.9 ± 0.4    16.3 ± 0.4   23.6 ± 0.1   Not applicable (only has 51 classes)
UCF50 (Manual)              6.8 ± 0.3    12.9 ± 0.3   18.7 ± 0.0   Not applicable (only has 50 classes)
Clustering (Unsupervised)   5.5 ± 0.3    10.0 ± 0.3   13.6 ± 0.4   17.7 ± 0.4   22.6 ± 0.4   26.7 ± 0.0
Ours (Unsupervised)         5.7 ± 0.3    11.4 ± 0.4   19.2 ± 0.4   25.8 ± 0.4   33.9 ± 0.3   38.3 ± 0.0

Table 2: MAP (×100) on MED w.r.t. the number of intermediate action detectors.

Figure 4: MAP on MED w.r.t. the number of intermediate action detectors (x-axis: number of intermediate action detectors; y-axis: Mean Average Precision). Curves: HMDB51 (Manual), UCF50 (Manual), Clustering (Unsupervised), Ours (Unsupervised). More details are in Table 2.

The prediction scores of the d selected detectors on a video were used as its intermediate representation and evaluated on MED. For each d, the random sampling was repeated 30 times for each data set.
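This protocol can be sketched as follows, assuming scikit-learn; fisher_vectors and action_labels are placeholders for the Improved Trajectory Fisher vectors and the harvested dobj labels, and the helper names are ours.

# One-against-all linear detector training and random detector-subset sampling.
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all_detectors(fisher_vectors, action_labels, C=1.0):
    """Train one linear SVM per action class (that class vs. all others).
    Returns a dict mapping action name -> fitted detector."""
    detectors = {}
    for action in sorted(set(action_labels)):
        y = np.array([1 if lbl == action else -1 for lbl in action_labels])
        detectors[action] = LinearSVC(C=C).fit(fisher_vectors, y)
    return detectors

def sample_detector_subsets(num_detectors, d, n_repeats=30, seed=0):
    """Yield n_repeats random index sets of d detectors, used to compare
    data sets that provide different numbers of detectors."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        yield rng.choice(num_detectors, size=d, replace=False)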

The performance is shown in Table 2 and Figure 4; the 95% confidence intervals are given in Table 2. From Figure 4, we see that the detectors trained on labels collected in an unsupervised fashion perform as well as, or even better than, the detectors trained on manually labeled data. For example, at 50 dimensions, our method outperforms UCF50 with a statistically significant difference (p-value of 0.0058 according to a paired t-test). Also, by comparing our result with the “Clustering” result, we see that detectors trained with labels collected by utilizing the correlation between speech and actions significantly outperform detectors trained with pseudo-labels found by clustering. These results suggest that the action examples collected by our proposed method contain more semantic information than those obtained simply from clustering. In sum, we hypothesize that the collected labels capture semantic information, enabling our detectors to perform as well as detectors trained on manually collected data.

5. CONCLUSIONS AND FUTURE WORK
Given the positive results of our proposed unsupervised action example harvesting method, our method may represent a new direction for collecting large amounts of examples of a wide variety of actions for semantic action detection in videos. In this paper, as a pilot study, only videos with closed captions were used. To further improve scalability by utilizing instructional videos without closed captions, the transcript can be obtained with state-of-the-art Automatic Speech Recognition methods. The recognized transcripts are expected to have reasonable accuracy because, in instructional videos, the speaker is often close to the camera and speaks very clearly. We plan to study this problem in future work.

6. ACKNOWLEDGMENTS
This paper was partially supported by the US Department of Defense, U.S. Army Research Office (W911NF-13-1-0277) and partially supported by the National Science Foundation under Grant Number IIS-12511827. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of ARO and NSF.

7. REFERENCES
[1] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.
[2] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
[3] J. Chen, Y. Cui, G. Ye, D. Liu, and S.-F. Chang. Event-driven semantic concept discovery by exploiting weakly tagged internet images. In ICMR, 2014.
[4] M.-C. De Marneffe, B. MacCartney, C. D. Manning, et al. Generating typed dependency parses from phrase structure parses. In LREC, 2006.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[6] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In CVPR, 2013.
[7] L. Jiang, T. Mitamura, S.-I. Yu, and A. G. Hauptmann. Zero-example event search using multimodal pseudo relevance feedback. In ICMR, 2014.
[8] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
[9] Z.-Z. Lan, L. Jiang, S.-I. Yu, S. Rawat, Y. Cai, C. Gao, S. Xu, H. Shen, X. Li, Y. Wang, et al. CMU-Informedia @ TRECVID 2013 multimedia event detection. In TREC Video Retrieval Evaluation, 2013.
[10] M. Mazloom, A. Habibian, and C. G. Snoek. Querying for video events by semantic signatures from few examples. In ACM MM, 2013.
[11] P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, and R. Prasad. Multimodal feature fusion for robust event detection in web videos. In CVPR, 2012.
[12] P. Over, G. Awad, J. Fiscus, B. Antonishek, M. Michel, A. F. Smeaton, W. Kraaij, G. Quenot, et al. TRECVID 2011 - an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TREC Video Retrieval Evaluation, 2011.
[13] M. F. Porter. An algorithm for suffix stripping. Electronic Library and Information Systems, 1980.
[14] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 2013.
[15] W. Tong, Y. Yang, L. Jiang, S.-I. Yu, Z. Lan, Z. Ma, W. Sze, E. Younessian, and A. Hauptmann. E-LAMP: integration of innovative ideas for multimedia event detection. Machine Vision and Applications, 2014.
[16] L. Von Ahn and L. Dabbish. Labeling images with a computer game. In CHI, 2004.
[17] E. F. Can and R. Manmatha. Modeling concept dependencies for event detection. In ICMR, 2014.
[18] H. Wang, C. Schmid, et al. Action recognition with improved trajectories. In ICCV, 2013.
[19] S. Zinger, C. Millet, B. Mathieu, G. Grefenstette, P. Hede, and P.-A. Moellic. Extracting an ontology of portrayable objects from WordNet. In MUSCLE / ImageCLEF Workshop on Image and Video Retrieval Evaluation, 2005.