
A Real-World Web Cross-Media Dataset Containing Images, Texts and Videos

Yanbin Liu
School of Computer Science and Technology, Tianjin University, China
[email protected]

Yahong Han
School of Computer Science and Technology, Tianjin University, China
Tianjin Key Laboratory of Cognitive Computing and Application, China
[email protected]

ABSTRACT
In recent years, the amount of multimedia data on social websites has been growing exponentially. Multimedia data corresponding to the same semantic concept usually appear in different media types and come from heterogeneous data sources. In order to synchronize and leverage these diverse forms of media data for multimedia applications, we present a real-world web dataset collected from Google, Flickr and YouTube for cross-media research. The dataset includes 41,387 text files, 65,371 images and 30,818 videos (about 1,091 hours) which are correlated semantically with each other by 335 representative visual concepts. Widely used features are extracted for each media type, and all of them are publicly available. To evaluate the performance of our dataset, experiments on baseline recognition, feature evaluation and domain adaptation are performed. The experimental results indicate that multiple cross-media tasks can be performed on the proposed dataset.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models; H.5.1 [Multimedia Information Systems]: Evaluation/methodology

General Terms
Measurement, Standardization

Keywords
Cross-Media, Real-World, Dataset, Texts, Images, Videos

1. INTRODUCTION
Nowadays, there are large amounts of heterogeneous and homogeneous media data from multiple sources, such as


news media websites, microblogs, mobile phones, social networking websites, and photo/video sharing websites. Multimedia data related to the same topic therefore usually come from different data types and sources. Cross-media is a research area within the general multimedia field that focuses on utilizing data of different modalities from multiple sources to understand media data. Compared with general multimedia analysis, cross-media places more emphasis on the correlations among media data that are represented by heterogeneous features and obtained from different sources. A well-devised cross-media dataset is important for estimating the stability of systems and checking the validity of novel algorithms.

In this paper, we present a novel dataset for the evaluation of cross-media analysis. The dataset includes three media types (texts, images, videos) collected from Google, Flickr and YouTube, respectively. The three media types are bridged semantically with each other by 335 visual concepts, which cover a wide range of categories (objects, actions, scenes, humans, animals, etc.). Various features are extracted for each media type, as described in detail in Section 2.

Compared with existing multimedia and cross-media datasets, our proposed dataset makes the following contributions:

• Variety. The dataset contains three media types, while most existing datasets, such as NUS-WIDE [1], CC WEB VIDEO [10], MFlickr25000 [4] and UQ IMH [9], have only one or at most two media types. Our dataset is built from multiple sources (Google, Flickr and YouTube). The semantic concepts come from different sources (Columbia374, WordNet and MediaMill) and span various categories such as objects, actions, scenes and humans. To facilitate experiments, many kinds of features are extracted for each media type.

• Volume. Our dataset is the largest cross-media dataset, comprising 41,387 text files, 65,371 images and 30,818 videos associated with 335 concepts. It is much larger than existing cross-media datasets. As the three different types of media data are semantically inter-correlated by 335 concepts, it is a good test bed for the evaluation of cross-media analysis.

• Real-World Data. All the data are collected by submitting each of the 335 concepts as a query to Google, Flickr and YouTube, respectively. These data are created and shared on the web by ordinary users. Thus, the presented dataset contains real-world cross-media


Figure 1: Example concepts in our dataset. Different colors of rectangles and words indicate the multiple sources of concepts. The concepts are in alphabetical order.

data on the web, which makes it suitable for the evaluation of real-world cross-media applications.

The rest of the paper is organized as follows. Section 2 introduces the visual concepts, the three media types and the extracted features. In Section 3, we describe the experiments and results on baseline classification, feature evaluation and domain adaptation to show the performance of our dataset. Finally, Section 4 concludes the paper and discusses future work.

2. CONCEPTS AND DATASET

2.1 Visual Concepts
In this dataset, 335 visual concepts are obtained from Columbia374 [11], WordNet and the MediaMill Concept Vocabulary [3]. Columbia374 is a subset of the LSCOM ontology [7], which was jointly defined by researchers, information analysts, and ontology specialists according to usefulness, feasibility, and observability. These concepts are related to events, objects, locations, and people. We downloaded all the synonyms of the Columbia374 concepts from WordNet to extend its size. The MediaMill Concept Vocabulary is a subset of the HAVIC corpus and covers a wider range of concepts such as objects, actions, scenes, humans and animals. Finally, we chose 335 representative and meaningful concepts for our dataset. The concept categories and some example concepts are shown in Figure 1. From the sources and categories of concepts, we can see that the dataset contains a wide range of visual concepts that can benefit potential cross-media research.
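To give a concrete picture of the WordNet-based synonym expansion, the following sketch shows how such an expansion could be scripted with NLTK. The exact tooling the authors used is not specified, and the example concepts here are purely illustrative.

```python
# Sketch of WordNet-based synonym expansion for concept names (illustrative;
# the paper does not specify its tooling). Requires: pip install nltk,
# then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def expand_with_synonyms(concepts):
    """Return each concept together with its WordNet synonyms (lemma names)."""
    expanded = {}
    for concept in concepts:
        synonyms = set()
        # Look up every synset whose lemmas match the concept term.
        for synset in wn.synsets(concept.replace(" ", "_")):
            for lemma in synset.lemmas():
                synonyms.add(lemma.name().replace("_", " "))
        expanded[concept] = sorted(synonyms)
    return expanded

# Example with a few concepts from the dataset's categories (assumed names).
print(expand_with_synonyms(["airplane", "beach", "dog"]))
```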

2.2 The Dataset

2.2.1 Texts
Google is now the most influential search engine in the world.

Figure 2: Sample texts in our dataset (excerpts of web pages retrieved for the concepts milk can, parking meter, Komodo dragon, Colin Powell, election campaign, and American black bear).

We submit each of the 335 concepts to the Google search engine and gather the returned webpages, recording the URLs of the top 200 results. With these URLs, we download the corresponding web pages and extract their text contents. In this process, pages with garbled encoding or little useful content are filtered out. In total, 41,387 text files are obtained, with about 100 to 160 text files per concept. The text contents of some sample concepts are shown in Figure 2.
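The paper does not describe the crawler itself; a minimal sketch of the page download and text extraction step, assuming the requests and BeautifulSoup libraries and a hypothetical length threshold for filtering pages with little useful content, could look as follows.

```python
# Minimal sketch of the web-page text extraction step (assumed tooling;
# the paper does not name its crawler). Requires: requests, beautifulsoup4.
import requests
from bs4 import BeautifulSoup

MIN_CHARS = 500  # assumed threshold for filtering pages with little useful text

def extract_page_text(url):
    """Download a web page and return its visible text, or None if unusable."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    # Drop scripts/styles so only human-readable content remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())
    return text if len(text) >= MIN_CHARS else None

# urls would be the top-200 result URLs recorded for one concept query.
# texts = [t for t in (extract_page_text(u) for u in urls) if t is not None]
```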

The original texts are post-processed and features are extracted in order to facilitate experiments on this dataset. We first apply the Porter stemming algorithm [6] to reduce inflected (or sometimes derived) words to their stem form. A dictionary of 47,272 distinct words is then constructed, and TF-IDF (term frequency-inverse document frequency) values are computed for each word of each text file. A bag-of-words (BOW) model quantizes each text file into a 47,272-dimensional vector. To reduce this high dimensionality, LSI (Latent Semantic Indexing) is adopted to map the BOW features to a topic space. Finally, each text is represented by a 700-dimensional feature vector.
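A minimal sketch of this text feature pipeline, assuming scikit-learn and NLTK as stand-ins for whatever tooling the authors actually used:

```python
# Sketch of the text feature pipeline described above: stemming, TF-IDF
# weighted bag-of-words, then LSI (truncated SVD) to a 700-D topic space.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

stemmer = PorterStemmer()

def stem_document(doc):
    """Reduce every word of a document to its stem form."""
    return " ".join(stemmer.stem(w) for w in doc.split())

def build_text_features(documents, n_topics=700):
    stemmed = [stem_document(d) for d in documents]
    # TF-IDF weighted bag-of-words over the stemmed vocabulary
    # (about 47,272 distinct words in the full dataset).
    vectorizer = TfidfVectorizer()
    bow = vectorizer.fit_transform(stemmed)
    # LSI: truncated SVD maps the sparse BOW vectors to a dense topic space.
    lsi = TruncatedSVD(n_components=n_topics, random_state=0)
    topics = lsi.fit_transform(bow)          # shape: (n_docs, n_topics)
    return topics, vectorizer, lsi
```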



Figure 3: Sample images in our dataset.

Figure 4: Sample video frames in our dataset.

2.2.2 Images
As reported by The Verge (http://www.theverge.com/) in March 2013, Flickr had a total of 87 million registered members and more than 3.5 million new images uploaded daily. In this dataset, the public API provided by Flickr is used to download images, taking each of the 335 visual concepts as a query. About 200 images are downloaded per query. Images that are too small or have inappropriate aspect ratios are removed, and each remaining image is at least 500 pixels in height or width. In total, 65,371 images are downloaded. Figure 3 shows some example images from our dataset.
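A small sketch of the image filtering rule, assuming Pillow; the 500-pixel minimum comes from the text, while the aspect-ratio bound is a placeholder since the exact threshold is not given.

```python
# Sketch of the image filtering rule described above, using Pillow. The
# 500-pixel minimum comes from the text; the aspect-ratio bound is an
# assumption, since the paper does not give the exact threshold.
from PIL import Image

MAX_ASPECT_RATIO = 3.0  # assumed upper bound on (long side / short side)

def keep_image(path, min_side=500, max_ratio=MAX_ASPECT_RATIO):
    """Return True if the image is large enough and not overly elongated."""
    with Image.open(path) as img:
        w, h = img.size
    if max(w, h) < min_side:
        return False
    ratio = max(w, h) / min(w, h)
    return ratio <= max_ratio
```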

Four types of features are extracted for the images: (1) Color features: a 256-D color histogram. (2) Texture features: 120-D Gabor wavelets (5 levels, 8 orientations, with 3 moments) and a 59-D LBP histogram. (3) GIST features: we use the implementation of [5] to obtain a 512-D GIST descriptor. (4) SIFT BOW: a 500-D SIFT-based bag-of-words.
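As a rough illustration, two of these features could be computed as follows, assuming OpenCV and scikit-image as substitutes for the authors' implementations; a simple 256-bin intensity histogram stands in for the paper's 256-D color histogram, and GIST and the SIFT bag-of-words are omitted for brevity.

```python
# Sketch of two image features: a 256-bin histogram (stand-in for the 256-D
# color histogram) and a 59-D uniform LBP histogram.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def intensity_histogram(img_bgr, bins=256):
    """256-D intensity histogram, L1-normalized (proxy for the color histogram)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
    return hist / (hist.sum() + 1e-8)

def lbp_histogram(img_bgr, points=8, radius=1):
    """59-D histogram of non-rotation-invariant uniform LBP codes."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    codes = local_binary_pattern(gray, points, radius, method="nri_uniform")
    hist, _ = np.histogram(codes, bins=59, range=(0, 59))
    return hist / (hist.sum() + 1e-8)

img = cv2.imread("example.jpg")  # hypothetical file name
feature = np.concatenate([intensity_histogram(img), lbp_histogram(img)])
```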

2.2.3 Videos
Videos from real-world video websites such as YouTube, Google Video and Yahoo! Video have previously been collected into datasets, e.g., CC WEB VIDEO [10]. In this dataset, videos are collected from YouTube using each of the 335 visual concepts as a query. The youtube-dl tool from GitHub (https://github.com/) is used to download the videos. To balance information richness against video size, we only select videos between 3MB and 15MB for download. Because videos are scarcer and harder to obtain, only about 100 videos are collected for each concept. In total, there are 30,818 videos spanning 1,091 hours and occupying 278GB of disk space. Sample video frames are shown in Figure 4.
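A sketch of how the size-bounded download could be scripted around the youtube-dl command-line tool; the search syntax and flags shown are assumptions about how the collection might have been automated, not the authors' actual script.

```python
# Sketch of the video download step with youtube-dl, invoked as a CLI tool.
# The size bounds mirror the 3 MB-15 MB rule above; the "ytsearch" syntax and
# output template are assumptions about how the collection was scripted.
import subprocess

def download_videos(concept, n_videos=100, out_dir="videos"):
    """Search YouTube for a concept and download size-bounded matches."""
    cmd = [
        "youtube-dl",
        f"ytsearch{n_videos}:{concept}",   # take the top search results
        "--min-filesize", "3m",
        "--max-filesize", "15m",
        "-o", f"{out_dir}/{concept}/%(id)s.%(ext)s",
    ]
    subprocess.run(cmd, check=False)

# download_videos("komodo dragon")
```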


Table 1: Baseline recognition results (accuracy)

  Media Type    hist. intersection kernel    χ2 kernel
  Texts         0.5519 (linear SVM)
  Images        0.1813                       0.1917
  Videos        0.0907                       0.0947

Table 2: Recognition results (accuracy) of different features

  Images    Global 0.1655    Local  0.1301    All 0.1917
  Videos    Stip   0.0801    Mosift 0.0767    All 0.0947

Stip and Mosift are widely used spatio-temporal features for video recognition. We first use the original implementations to extract Stip and Mosift feature points for each video clip. Then we cluster a randomly selected set of 1,000,000 feature points using k-means to build a codebook for Stip and for Mosift, respectively. Our codebook size is 1024, which has empirically been shown to give good results. Finally, each video clip is quantized against the codebook to form a 1024-dimensional Stip feature and a 1024-dimensional Mosift feature.
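A minimal sketch of the codebook construction and bag-of-words quantization, assuming scikit-learn's MiniBatchKMeans in place of the authors' k-means implementation:

```python
# Sketch of codebook building and quantization: k-means over sampled local
# descriptors, then a 1024-bin bag-of-words histogram per video clip.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(sampled_points, k=1024):
    """Cluster sampled Stip/Mosift descriptors into k codewords."""
    kmeans = MiniBatchKMeans(n_clusters=k, batch_size=10000, random_state=0)
    kmeans.fit(sampled_points)               # sampled_points: (1,000,000, d)
    return kmeans

def quantize_clip(clip_points, kmeans):
    """Assign each descriptor of one clip to its nearest codeword and histogram them."""
    k = kmeans.n_clusters
    if len(clip_points) == 0:
        return np.zeros(k)
    words = kmeans.predict(clip_points)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()
```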

3. EXPERIMENTS AND RESULTS

3.1 Baseline Recognition
To evaluate the overall recognition performance of our dataset, we designed baseline recognition experiments for each media type. For images and texts, we randomly selected 30% of the samples for training and used the rest for testing. For videos, because there are fewer samples and video recognition is a harder task, the training ratio is set to 50%. The baseline text classifier is a linear multiclass SVM. For images and videos, we employ a nonlinear multiclass SVM with the histogram intersection kernel and the χ2 kernel, both of which are widely used in the vision field and have been shown to perform well. Detailed results are given in Table 1. From Table 1, we can see that the accuracy on texts and images is higher than on videos, which is consistent with the common knowledge that videos are more complicated and harder to recognize. In addition, the χ2 kernel performs better than the histogram intersection kernel on our dataset.
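For reference, a sketch of the nonlinear SVM baseline using precomputed χ2 and histogram intersection kernels with scikit-learn; the hyperparameters are illustrative rather than the values used in the paper.

```python
# Sketch of a multiclass SVM with precomputed chi-squared or histogram
# intersection kernels; C and gamma are illustrative placeholders.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def histogram_intersection_kernel(A, B):
    """K[i, j] = sum_k min(A[i, k], B[j, k]) for non-negative histogram features."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def chi2(A, B):
    return chi2_kernel(A, B, gamma=1.0)

def train_precomputed_svm(X_train, y_train, kernel_fn):
    K_train = kernel_fn(X_train, X_train)
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(K_train, y_train)
    return clf

def predict_precomputed_svm(clf, X_test, X_train, kernel_fn):
    return clf.predict(kernel_fn(X_test, X_train))

# clf = train_precomputed_svm(X_tr, y_tr, chi2)
# acc = (predict_precomputed_svm(clf, X_te, X_tr, chi2) == y_te).mean()
```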

3.2 Feature Evaluation
As multiple features are extracted for each media type, different features can represent the media data in different aspects and lead to different performances and applications. To evaluate feature performance, several experiments are designed.

To determine the appropriate LSI-reduced dimensionality for our texts, we use a multiclass linear SVM on a document classification task. The LSI-reduced dimension is set to 100, 200, 335, 500, 700, 800 and 1000, respectively. We randomly select 10%, 20%, and 30% of the texts of each concept as training data and use the rest for testing. The classification results are shown in Figure 5.
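A compact sketch of this dimensionality sweep, assuming scikit-learn; the stratified split approximates the per-concept random sampling described above.

```python
# Sketch of the LSI dimensionality sweep: fit truncated SVD at each candidate
# dimension and score a linear SVM on a held-out split.
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def sweep_lsi_dimensions(bow_tfidf, labels,
                         dims=(100, 200, 335, 500, 700, 800, 1000),
                         train_ratio=0.3):
    results = {}
    for d in dims:
        topics = TruncatedSVD(n_components=d, random_state=0).fit_transform(bow_tfidf)
        X_tr, X_te, y_tr, y_te = train_test_split(
            topics, labels, train_size=train_ratio, stratify=labels, random_state=0)
        clf = LinearSVC().fit(X_tr, y_tr)
        results[d] = clf.score(X_te, y_te)   # classification accuracy
    return results
```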


Figure 5: Text classification accuracy versus LSI-reduced dimension (100 to 1000) for 10%, 20%, and 30% training ratios.

Table 3: Feature selection results

  Dimension    400       600       800       1000      1200
  Accuracy     0.2132    0.2876    0.2425    0.2410    0.2410

As can be seen, the best result is obtained when the LSI dimension is 700, which is why we choose 700 as the dimensionality for each text.

For images and videos, there is more than one feature type. Image features can be divided into a local feature (SIFT) and global features (the others); video features comprise Stip and Mosift. We use the same train/test setting as the baseline recognition with the χ2 kernel and report the results for each feature type in Table 2. Accuracy can be improved by combining different types of features.

The four image feature types in our dataset are high-dimensional, and some dimensions may be redundant. We therefore applied the feature selection algorithm of [8] to evaluate feature performance. 30% of the images are selected for training and the reduced feature dimension is set to 400, 600, 800, 1000, and 1200. Experimental results are shown in Table 3. It can be seen that the best dimension is 600.
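For illustration only, the following is a simplified iteratively-reweighted sketch of l2,1-norm based feature ranking in the spirit of [8]; it is not the authors' exact algorithm, and the γ value is a placeholder.

```python
# Simplified IRLS-style sketch of l2,1-norm feature ranking (in the spirit of
# [8], not the exact published algorithm). Features are ranked by the row
# norms of the learned projection W; the top-d dimensions are kept.
import numpy as np

def l21_feature_ranking(X, Y, gamma=1.0, n_iter=30, eps=1e-8):
    """X: (n, d) features; Y: (n, c) one-hot labels. Returns feature indices, best first."""
    n, d = X.shape
    # Ridge-regression initialization of W.
    W = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ Y)
    for _ in range(n_iter):
        # Reweighting diagonals from the current residual rows and weight rows.
        R = X @ W - Y
        e = 1.0 / (2.0 * np.linalg.norm(R, axis=1) + eps)     # (n,)
        dw = 1.0 / (2.0 * np.linalg.norm(W, axis=1) + eps)    # (d,)
        XtE = X.T * e                                          # X^T diag(e)
        W = np.linalg.solve(XtE @ X + gamma * np.diag(dw), XtE @ Y)
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(-scores)

# ranked = l21_feature_ranking(X_train, Y_onehot, gamma=1.0)
# X_selected = X_train[:, ranked[:600]]   # keep the best 600 dimensions
```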

3.3 Domain Adaptation
As shown in Table 1, video recognition accuracy is low compared with texts and images, so texts and images can offer valuable conceptual information for video recognition. Here, texts and images are regarded as two individual source domains and videos as the target domain. We applied the HFA (Heterogeneous Feature Augmentation) algorithm [2] for domain adaptation. As a showcase for action-related video recognition, we chose the action and scene categories of our dataset. We randomly select 20 training images (texts) per class from the source domain, and 20, 30 or 40 videos per class from the target domain. An SVM with χ2 kernel is compared with HFA. Experimental results are shown in Figure 6. The performance is improved by the domain information transferred from texts and images, which indicates a direction for future research on cross-domain adaptation.

4. CONCLUSION
In this paper, we present a real-world web dataset collected from Google, Flickr and YouTube for cross-media research. To the best of our knowledge, our proposed dataset is the first cross-media dataset with more than two modalities of media types, i.e., semantically inter-correlated texts, images, and videos.

Figure 6: Domain adaptation accuracy versus number of target training videos (20, 30, 40), comparing the SVM baseline with HFA using text and image sources.

Multiple features of each modality are extracted and all of them are publicly available. Experiments on baseline classification, feature evaluation and domain adaptation indicate that the dataset is a good test bed for the evaluation of cross-media analysis. We hope it serves as an evaluation benchmark for researchers worldwide to discuss their work and recent advances in the analysis and computing of cross-media data.

5. ACKNOWLEDGMENTS
This paper was partially supported by the NSFC (under Grant 61202166) and the Doctoral Fund of the Ministry of Education of China (under Grant 20120032120042).

6. REFERENCES
[1] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In CIVR, page 48. ACM, 2009.
[2] L. Duan, D. Xu, and I. Tsang. Learning with augmented features for heterogeneous domain adaptation. In ICML, 2012.
[3] A. Habibian, K. E. van de Sande, and C. G. Snoek. Recommendations for video event recognition using concept vocabularies. In ICMR, pages 89–96. ACM, 2013.
[4] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In MIR, pages 39–43. ACM, 2008.
[5] J. Liu, Y. Yang, I. Saleemi, and M. Shah. Learning semantic features for action recognition via diffusion maps. CVIU, 116(3):361–377, 2012.
[6] J. B. Lovins. Development of a stemming algorithm. MIT Information Processing Group, Electronic Systems Laboratory, 1968.
[7] M. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis. Large-scale concept ontology for multimedia. IEEE MultiMedia, 13(3):86–91, 2006.
[8] F. Nie, H. Huang, X. Cai, and C. Ding. Efficient and robust feature selection via joint l2,1-norms minimization. In Advances in Neural Information Processing Systems, pages 1813–1821, 2010.
[9] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In SIGMOD, pages 785–796. ACM, 2013.
[10] X. Wu, A. G. Hauptmann, and C.-W. Ngo. Practical elimination of near-duplicates from web video search. In ACM MULTIMEDIA, pages 218–227. ACM, 2007.
[11] A. Yanagawa, S.-F. Chang, L. Kennedy, and W. Hsu. Columbia University's baseline detectors for 374 LSCOM semantic visual concepts. Columbia University ADVENT Technical Report, 2007.