Title: Visual Search and Question Answering

Instructors : - Lu Jiang

Google Cloud AI

Email: [email protected]

Homepage: http://www.cs.cmu.edu/~lujiang/

- Yannis Kalantidis

Facebook Research

Email: [email protected]

Homepage: https://research.fb.com/people/kalantidis-yannis/

- Liangliang Cao

University of Massachusetts

Email: [email protected]

Homepage: http://llcao.net/

Area and Keywords : Visual search, visual question answering, MemexQA

Abstract: Personal photo and video data are being accumulated at an unprecedented speed. For

example, 14 petabytes of personal photos and videos were uploaded to Google Photos by 200

million users in 2015, while a tremendous amount of personal photos and videos are also being

uploaded to Flickr every day. How to efficiently search and organize such data presents a huge

challenge to both academic research and industrial applications.

To attack this challenge, this tutorial will review research efforts in related subjects and

showcase successful industrial systems. We will discuss traditional visual search methods

and the improvements in visual representations brought by deep neural networks. The instructors

will also share their experience of building large-scale fashion search and Flickr similarity search

systems and bring insights on the challenges of extending the academic research to industrial

applications.

This tutorial will discuss the queries and logs of search engines, and analyze how to address

the characteristics of personal media search. By applying search techniques to visual

question answering, this tutorial will introduce a new task named MemexQA: given a collection

of photos or videos from the user, can we automatically answer questions that help users

recover their memory about events captured in the collection? New datasets and algorithms for

MemexQA will be reviewed. We hope MemexQA will shed light on the next-generation computer

interface for the exploding amount of personal photos and videos.

Full Description of the Tutorial (no more than 4 pages, including the learning objectives,

tutorial outline, detailed modules, target audience, prerequisite knowledge required, importance

and relevance of the tutorial, etc.)

Length of the tutorial:

3 hours

Tutorial outline :

1. Overview of Visual Search and Understanding (Liangliang, about 1 hour)

1.1. Classic Visual Search

1.2. Visual Question Answering for Single Images

1.3. Delving Deep into Personal Search

2. Visual Representations and Indexing (Yannis, about 1 hour)

2.1. Visual Indexing and Flickr Similarity Search

2.2. State-of-the-art Representation for Images and Videos

2.3. Double Attention Network

3. MemexQA (Lu, about 1 hour)

3.1. Memex Dataset

3.2. Focal Visual-Text Attention for Visual Question Answering

3.3. Comparison with Video QA and Video Search

Related Publications from the Instructors:

● L. Jiang, Y. Kalantidis, L. Cao, S. Farfade, J. Tang. Delving Deep into Personal Photo and Video Search. WSDM, 2017.

● J. Liang, L. Jiang, L. Cao, L.-J. Li, A. Hauptmann. Focal Visual-Text Attention for Visual Question Answering. CVPR, 2018.

● Y. Chen, Y. Kalantidis, J. Li, Y. Shuicheng, J. Feng. A^2-Nets: Double Attention Networks. NIPS, 2018.

● Y. Chen, Y. Kalantidis, J. Li, Y. Shuicheng, J. Feng. Multi-Fiber Networks for Video Recognition. ECCV, 2018.

● S.-I Yu, L. Jiang, Z. Xu, Y. Yang, A. Hauptmann. Content-Based Video Search over 1 Million Videos with 1 Core in 1 Second. ICMR, 2015.

Targeted audience and required prerequisites:

The target audience for this tutorial is graduate students who have a background in computer

vision, machine learning and multimedia, and/or are interested in state-of-the-art industrial

systems. The tutorial assumes basic knowledge of deep learning.

Learning objectives:

The objectives of this tutorial include:

● Providing a review of recent developments in visual search and question answering

● Bridging the gap between academic research and industrial development

● Introducing MemexQA, a novel task leveraging both visual search and question

answering

Importance and relevance of the tutorial:

This tutorial is the first effort to bring visual search and question answering together in

one course. Most of the existing materials focus on a single topic, and hence overlook the

problem of answering questions based on the whole collection of personal photos, instead of a

single image. We demonstrate that by incorporating a visual search engine inside a

state-of-the-art, end-to-end trainable visual question answering architecture, one can reason

over an arbitrarily large media database efficiently and derive the desired answer.

The instructors of this tutorial have rich experience in both academia and industry. We

believe our experience will be interesting to the ICME community and helpful to students in

their future careers. We hope our tutorial will help to enhance the impact of the ICME community.

Lu Jiang (Ph.D.)

Last updated July 2018

Contact Information

Cloud AI, Google AI

1155 Borregas Avenue

Sunnyvale, CA, 94089

+1 (412) 897-5924

[email protected]

http://www.cs.cmu.edu/~lujiang

Research Interests

My research goal is to solve real problems on big data. My research area is the interdisciplinary

field of Multimedia, Machine Learning, Computer Vision, Information Retrieval and Big Data,

which specifically includes video understanding and search, weakly supervised learning on noisy

data, vision+language, cloud, etc.

Education

Carnegie Mellon University 2011 - 2017

Ph.D. in Artificial Intelligence. (GPA: 4.12/4.33)

Advisors: Prof. Alexander Hauptmann and Prof. Teruko Mitamura

Thesis: Web-scale Multimedia Search for Internet Video Content

Free University of Brussels 2010 - 2011

M.Sc. in Computer Science. (Erasmus Mundus Exchange Program)

Xi’an Jiaotong University 2008 - 2011

M.Sc. in Computer Science. (GPA: 3.62/4.00, Rank: 1/142), Advisor: Prof. Jun Liu

B.Eng. in Software Engineering, 2004-2008. (Major GPA: 3.88/4.00)

Honors & Awards

Yahoo! Fellowship. 2016

NIPS, AAAI, SIGMM, SIGIR, SIGKDD, SIGWEB, NSF student travel grants.

Best Poster at IEEE Spoken Language Technology 2014

Best Paper candidate at International Conference on Multimedia Retrieval 2014, 2015

–Only 2% of all submitted papers were nominated for best paper candidates. Received it twice.

Best performer on multimedia event detection at NIST TRECVID 2013, 2014

–Key contributor to our winning system. Participants included 30 companies and research institutes.

Best performer on surveillance event detection at NIST TRECVID 2011

Erasmus Mundus Tandem scholarship 2010

–Fellowship from the European Union for studying in Europe. Only 41 master's students in China were awarded.

Fuji Xerox Fellowship 2010

Samsung Fellowship 2009

IBM excellent student in China 2007

Research Experience

Research Scientist at Cloud AI, Google AI May 2017 - Now

• Democratize cloud machine learning and video intelligence.

Research Assistant at Carnegie Mellon University Sep. 2011 - April 2017

• The key contributor to a five-year IARPA project. The project aims to develop an automatic method to detect events in Internet videos without any user-generated metadata. Proposed the first-of-its-kind zero-shot system, which not only achieved the top performance in the NIST TRECVID evaluation for 3 years in a row (2013-2015), but also scales the search up to 100 million videos. The developed techniques led to patents and inventions.

Intern Scientist at Yahoo Research May 2016 - August 2016

• The project is on large-scale personal photo and video search on Flickr. Proposed deep query understanding models that boost the state-of-the-art (word2vec) search accuracy by a relative 45%. In addition, analyzed big personal media search logs on Flickr, and discovered distinguishing characteristics of this novel problem.

Intern at Google Research Feb. 2016 - May 2016

• The project is on training concept detectors on big weakly-labeled data from YouTube using TensorFlow. Proposed a novel webly-labeled learning method which improves the state-of-the-art accuracy by Y%. In addition, slashed training time from a few days to a few hours.

Research Intern at Microsoft Research Asia May 2010 - August 2010

• Designed and implemented a novel pattern/logic engine for a retrieval and data mining system. The code was used in the launched product.

Research Assistant at Xi’an Jiaotong University Sep. 2008 - July 2010

• The project is part of the National High-Tech R&D Program and aims at discovering and managing educational resources on the Internet. Designed and implemented domain term recognition and title extraction for Word, PPT, PDF, and HTML. Proposed algorithms to recognize the associations in knowledge networks; designed and implemented the knowledge element association detector.

Selected Publications (h-index=21, Google Scholar)

[1] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, Li Fei-Fei. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In International Conference on Machine Learning (ICML), 2018.

ICML 18

[2] Junwei Liang, Lu Jiang, Liangliang Cao, Li-Jia Li, Alexander Hauptmann. Focal Visual-Text Attention for Visual Question Answering. In Computer Vision and Pattern Recognition (CVPR), 2018.

CVPR 18 (spotlight)

[3] Zelun Luo, Jun-Ting Hsieh, Lu Jiang, Juan Carlos Niebles, Li Fei-Fei. Graph Distillation for Action Detection with Privileged Information in RGB-D Videos. In European Conference on Computer Vision (ECCV), 2018.

ECCV 18

[4] Yu Wu, Linchao Zhu, Lu Jiang, Yi Yang. Decoupled Novel Object Captioner. In ACM Multimedia (MM), 2018.

MM 18

[5] Lu Jiang, Yannis Kalantidis, Liangliang Cao, Sachin Farfade, Jiliang Tang, Alex Hauptmann. Delving Deep into Personal Photo and Video Search. In Web Search and Data Mining (WSDM), 2017.

WSDM 17

[6] Junwei Liang, Lu Jiang, Deyu Meng, Alexander Hauptmann. Leveraging Multi-modal Prior Knowledge for Large-scale Concept Learning in Noisy Web Data. In ACM International Conference on Multimedia Retrieval (ICMR), 2017.

ICMR 17 (oral)

[7] Junwei Liang, Lu Jiang, Alexander Hauptmann. Temporal Localization of Audio Events for Conflict Monitoring in Social Media. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

ICASSP 17

[8] Dingwen Zhang, Junwei Han, Lu Jiang, Senmao Ye, Xiaojun Chang. Revealing event saliency in unconstrained video collection. IEEE Transactions on Image Processing 26, no. 4 (2017): 1746-1758.

TIP 17

[9] Lu Jiang. Web-scale Multimedia Search for Internet Video Content. In International Conference on World Wide Web (WWW), 2016.

WWW 16

[10] Junwei Liang, Lu Jiang, Deyu Meng, Alexander Hauptmann. Learning to Detect Concepts from Webly-Labeled Video Data. In Joint Conference on Artificial Intelligence (IJCAI), 2016.

IJCAI 16

[11] Lu Jiang, Shoou-I Yu, Deyu Meng, Yi Yang, Teruko Mitamura, Alexander Hauptmann. Fast and Accurate Content-based Semantic Search in 100M Internet Videos. In ACM Multimedia (MM), 2015.

MM 15

[12] Lu Jiang, Shoou-I Yu, Deyu Meng, Teruko Mitamura, Alexander Hauptmann. Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos. In ACM International Conference on Multimedia Retrieval (ICMR), 2015.

ICMR 15 (best paper candidate)

[13] Qian Zhao, Deyu Meng, Lu Jiang, Qi Xie, Zongben Xu, Alexander Hauptmann. Self-paced Learning for Matrix Factorization. In Conference on Artificial Intelligence (AAAI), 2015.

AAAI 15 (oral)

[14] Dingwen Zhang, Deyu Meng, Li Chao, Lu Jiang, Zhao Qian, Junwei Han. A self-paced multiple-instance learning framework for co-saliency detection. In IEEE International Conference on Computer Vision (ICCV), 2015.

ICCV 15

[15] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhen-Zhong Lan, Shiguang Shan, Alexander Hauptmann. Self-paced Learning with Diversity. In Neural Information Processing Systems (NIPS), 2014.

NIPS 14

[16] Lu Jiang, Deyu Meng, Teruko Mitamura, Alexander Hauptmann. Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search. In ACM Multimedia (MM), 2014.

MM 14

[17] Yajie Miao, Lu Jiang, Hao Zhang, Florian Metze. Improvements to Speaker Adaptive Training of Deep Neural Networks. In IEEE Spoken Language Technology (SLT), 2014.

SLT 14 (best poster)

[18] Lu Jiang, Wei Tong, Deyu Meng, Alexander Hauptmann. Towards Efficient Learning of Optimal Spatial Bag-of-Words Representations. In ACM International Conference on Multimedia Retrieval (ICMR), 2014.

ICMR 14 (best paper candidate)

[19] Lu Jiang, Yajie Miao, Yi Yang, Zhen-Zhong Lan, Alexander Hauptmann. Viral Video Style: A Closer Look at Viral Videos on YouTube. In ACM International Conference on Multimedia Retrieval (ICMR), 2014.

ICMR 14

[20] Shoou-I Yu, Lu Jiang, et al. CMU-Informedia@TRECVID 2014. In NIST TRECVID Video Retrieval Evaluation Workshop (TRECVID), 2014.

TRECVID 14 (best system)

[21] Lu Jiang, Alexander Hauptmann, Guang Xiang. Leveraging High-level and Low-level Features for Multimedia Event Detection. In ACM Multimedia (MM), 2012.

MM 12

[22] Jun Liu, Lu Jiang, Zhaohui Wu, Qinghua Zheng, Yanan Qian. Mining Learning-Dependency between Knowledge Units from Text. The International Journal on Very Large Data Bases (VLDBJ), 20(3): 335-345, 2011.

VLDBJ 11

Patents and Inventions

[1] Event Labeling through Analytic Media Processing, Disclosure of Invention, Key Inventor. No. 0247601-15-0048, 2016.

[2] Large-scale video content retrieval through text query, Invention, Second Inventor, U.S. Serial No. 62/285,256, 2015.

[3] A Features Dictionary Generating Method for Text Classification based on LZW Compression Algorithm. Authorization No. ZL 2008 1 0232557.2, 2008 (in Chinese).

[4] A Deep Web Adaptive Crawling Method based on Minimum Executable Pattern. Application No. 200810232555.3, 2008 (in Chinese).

Professional Service

Technical Program Committee Member:

ACM Multimedia 2013-2018 (Author Advocator Co-chair 2017)

CVPR 2018

AAAI 2017-2018

IJCAI 2017

IA-Summit 2016

Pacific Rim Conference on Multimedia 2014

Co-chair of CMU LTI Student Research Symposium 2015

Journal and Book Review:

Journal of Supercomputing

Journal on Signal Processing

Journal on Information Fusion

IEEE Transactions on Multimedia (TMM)

Journal of Machine Learning Research (JMLR)

Computer Vision and Image Understanding (CVIU)

Springer Computing (Book Chapter review)

Journal of Intelligent Information Systems

Electronic Commerce Research and Applications

International Journal of Automation and Computing

IEEE Transactions on Neural Networks and Learning Systems

IEEE Transactions on Circuits and System for Video Technology

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

ACM Transactions on Multimedia Computing, Communications and Applications

Former Interns or Students

Junwei Liang (CMU Master; next: CMU Ph.D.)

Zelun Luo (Stanford Master; next: Stanford Ph.D.)

Nam Vo (Georgia Tech Ph.D.)

Yunbo Wang (Tsinghua University Ph.D.; next: MIT Visiting Student)

Alejandro Newell (University of Michigan Ph.D.)

Serena Yeung (Stanford Ph.D.; next: Stanford Assistant Professor)

Justin Johnson (Stanford Ph.D.)

Talks

Web-scale Multimedia Search for Internet Video Content.

Yahoo Research, San Francisco, CA, June 2016.

University of Southern California ISI, Los Angeles, CA, Jan. 2016.

Columbia University, New York City, NY, Dec. 2015.

Vision And Learning SEminar (VALSE), Beijing, China, Feb. 2014.

Learning to Detect Concepts from Webly-Labeled Video Data.

Google Research, Mountain View, CA, March 2016.

CMU@TRECVID: Large-scale Semantic Indexing.

University of Central Florida, Orlando, FL, Nov. 2014

CMU@TRECVID: Multimedia Event Detection.

National Institute of Standards and Technology, Gaithersburg, MD, Nov. 2013.

CMU@TRECVID: Surveillance Event Detection.

National Institute of Standards and Technology, Gaithersburg, MD, Nov. 2011.

Knowledge Graph Analysis.

Free University of Brussels, Brussels, Belgium, Oct. 2010.

Yannis Kalantidis

4100 Manila Ave.

Oakland, CA, 94609

US Permanent Resident

Phone: +1-415-254-1235

Email: [email protected]

Web: www.skamalas.com

Education

2009–2014: Ph.D. in Computer Science, School of Electrical and Computer Engineering, National Technical University of Athens, Greece. Ph.D. Thesis: Clustering and nearest neighbor methods for visual search

2002–2009: Diploma/M.Eng. in Electrical and Computer Engineering, School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece. Diploma Thesis: Content- and geometry-based image retrieval

Working experience

Feb 2017 – Present: Research Scientist, Facebook Research, Menlo Park, CA. Research and development on video representations, temporal segmentation, web-scale learning, visual relationship detection, multi-modal learning.

Jan 2015 – Dec 2016 (2 years): Postdoctoral Researcher & Research Scientist (2016), Yahoo Research, San Francisco, CA. Research and development on web-scale visual search, distributed backend for approximate nearest neighbor search, image & video representation, multimodal classification, native ad selection, personal media search. Collaborated with Stanford on the Visual Genome project, https://visualgenome.org/, [IJCV, 2017].

2009–2014 (5 years): Image Video and Multimedia Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens. Research and Development in FP7 European Projects, notably:

2009: Imagination (under ICCS/NTUA)

2009-2011: WEKNOWIT (under CERTH/ITI)

Scientific Publication, Patent & Citation Records

Publications: Papers published in CVPR, NIPS, ICCV, ECCV, IJCV, CVIU, ACM MM, ICMR, WSDM, CHI and others.

Citations: Google Scholar citations: 1229 citations, h-index: 13, i10-index: 15 (Sept. 2018)

Patents: 2 US Patents, 4 US Patent Applications & 2 Defensive Publications.

Teaching experience

Fall 2018 (in planning): Mini-Course on Computer Vision, UCSF Department of Radiation Oncology, University of California, San Francisco. Invited by Prof. Gilmer Valdes. In planning stage; details and dates to be determined.

Fall 2016: Guest lecturer at CS131 - Computer Vision: Foundations and Applications, Computer Science Department, Stanford University. Course taught by Prof. Fei-Fei Li and Juan Carlos Niebles. I was invited to give three lectures on basic computer vision principles (motion, gestalt and segmentation).

2009–2014: Image and Video Analysis, Lab assistant/lecturer, School of Electrical and Computer Engineering, National Technical University of Athens, Greece.

2010–2014: Programming Techniques, Teaching Assistant, School of Electrical and Computer Engineering, National Technical University of Athens, Greece.

2009–2010: Introduction to Programming, Teaching Assistant, School of Electrical and Computer Engineering, National Technical University of Athens, Greece.

Internships

May–July 2010: Research Intern at Yahoo Research Barcelona, supervisor: Roelof Van Zwol, collaborators: Lluis Garcia Pueyo, Michele Trevisiol, Barcelona, Spain.

May–August 2012: Research Intern at Yahoo Research Santa Clara, supervisor: Lyndon Kennedy, collaborators: Li-Jia Li, David A. Shamma, San Francisco & Santa Clara, USA.

May–August 2014: Research Intern at Yahoo Labs San Francisco, supervisor: David A. Shamma, collaborators: Lyndon Kennedy, San Francisco, USA.

Awards & Scholarships

2017: Outstanding Reviewer Award, International Conference on Computer Vision (ICCV).

2016: Best Paper Award, NIPS Large-Scale Computer Vision Systems Workshop.

2015: Outstanding Reviewer Award, International Conference on Computer Vision (ICCV).

2013: Thomaidion Award for Scientific Publication, Getting the Look: Clothing Recognition and Segmentation for Automatic Clothing Suggestions in Everyday Photos, National Technical University of Athens.

2012: Thomaidion Award for Scientific Publication, VIRaL: Visual Image Retrieval and Localization, National Technical University of Athens.

2011: Thomaidion Award for Scientific Publication, Retrieving landmark and non-landmark images from community photo collections, National Technical University of Athens.

2005: Erasmus Scholarship, Exchange Studies at Lund University, Lund, Sweden.

Development projects

VIRaL: Principal developer of the Visual Image Retrieval and Localization tool (On-line demo: http://viral.image.ntua.gr).

LOPQ: Principal developer of the open source code for Locally Optimized Product Quantization (Available on Yahoo's github: https://github.com/yahoo/lopq).

Crow: Principal developer of the open source code for Cross-dimensional feature Weighting for convolutional features (Available on Yahoo's github: https://github.com/yahoo/crow).

Invited Talks

Computer Vision Center (CVC), Learning Spatiotemporal Video Representations, Invited talk, September 18th 2018, Barcelona, Spain.

University of California Berkeley, Learning Video Representations, Guest Lecturer at CS 189, August 8th 2018, Berkeley, CA.

University of San Francisco, Pushing the Frontiers of Video Understanding, Seminar Series in Data Science, April 20th 2018, San Francisco, CA.

Universita Politecnica Catalunya, Computer Vision at Facebook research, Deep Learning Open Lecture, November 20th 2017, Barcelona, Spain.

Columbia University, Digital Video and Multimedia Laboratory, invited by Prof. Shih-Fu Chang, August 19th 2014, New York, NY.

Stanford University, Image, Video and Multimedia Systems Group, invited by Prof. Bernd Girod, August 1st 2014, Stanford, CA.

FXPAL, Nearest Neighbor Search and Clustering for Large Scale Visual Search, Fuji-Xerox Palo Alto Laboratory, July 29th 2014, Palo Alto, CA.

Stanford University, Stanford Vision Lab, invited by Prof. Fei-Fei Li, July 28th 2014, Stanford, CA.

Programming skills

Programming: Python, C/C++, Matlab, PIG. Familiar with Java, HIVE.

Other: Caffe, Tensorflow, Hadoop, Spark, OpenCV, LaTeX, Pgfplots

Reviewer in journals

2010–today: Multimedia Tools and Applications, [MTAP], Springer.

2012–today: Signal Processing: Image Communication, [SPIC], Elsevier.

2014–today: Transactions on Circuits and Systems for Video Technology, [CSVT], IEEE.

2014–today: Transactions on Multimedia, [TMM], IEEE.

2015–today: Transactions on Image Processing, [TIP], IEEE.

2015–today: Transactions on Intelligent Systems and Technology, [TIST], ACM.

2015–today: Signal Processing Letters, [SPL], IEEE.

2015–today: Transactions on Knowledge and Data Engineering, [TKDE], IEEE.

2015–today: The VLDB Journal, [VLDBJ], Springer.

2016–today: Transactions on Pattern Analysis and Machine Intelligence, [PAMI], IEEE.

2016–today: Computer Vision and Image Understanding, [CVIU], Elsevier.

Committee member in conferences & workshops

2016–today: IEEE Conference on Computer Vision and Pattern Recognition, Technical Program Committee Member, [CVPR].

2016–today: IEEE European Conference on Computer Vision, Technical Program Committee Member, [ECCV].

2015–today: IEEE International Conference on Computer Vision, Technical Program Committee Member, [ICCV], (Outstanding Reviewer Award in 2015 and 2017).

2013–today: ACM Multimedia Conference, Technical Program Committee Member, [ACM-MM].

2011: IEEE Workshop on Content-based Multimedia Indexing, Technical Program Committee Member, [CBMI].

2009: IEEE Workshop on Content-based Multimedia Indexing, Local Chair Member, [CBMI].

Languages

Greek: Native

English: Fluent, Certificate of Proficiency in English

German: Basic, Zertifikat Deutsch als Fremdsprache, Test DaF (grade 3)

List of Publications

Selected publications

Y. Chen, Y. Kalantidis, J. Li, Y. Shuicheng, J. Feng. Double Attention Networks. In Neural Information Processing (NIPS), 2018.

Y. Chen, Y. Kalantidis, J. Li, Y. Shuicheng, J. Feng. Multi-Fiber Networks. In European Conference on Computer Vision (ECCV), 2018.

J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, M. Elhoseiny. Large-Scale Visual Relationship Understanding. In arxiv:1804.10660, 2018.

L. Jiang, L. Cao, Y. Kalantidis, S. Farfade and A. Hauptmann. MemexQA: Visual Memex Question Answering. Submitted to TPAMI, 2018.

L. Jiang, L. Cao, Y. Kalantidis, S. Farfade, A. G. Hauptmann. Visual Memory QA: Your Personal Photo and Video Search Agent. In AAAI, demo presentation, 2017.

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D.A. Shamma, M. Bernstein and L. Fei-Fei. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. In International Journal of Computer Vision (IJCV), 2017.

L. Jiang, Y. Kalantidis, L. Cao, S. Farfade, J. Tang and A. Hauptmann. Delving Deep into Personal Photo and Video Search. In Web Search and Data Mining (WSDM), 2017.

P. Garrigues, S. Farfade, H. Izadinia, K. Boakye and Y. Kalantidis. Tag Prediction at Flickr: a View from the Darkroom. In Large Scale Computer Vision Systems Workshop, NIPS, 2016, Best Paper Award Winner.

Y. Kalantidis, C. Mellina and S. Osindero. Cross-dimensional Weighting for Aggregated Deep Convolutional Features. In Web-scale Vision and Social Media (VSM) Workshop, ECCV, 2016.

Y. Kalantidis, A. Farahat, L. Kennedy, R. Baeza-Yates and D.A. Shamma. Visual Congruent Ads for Image Search. In International Conference on Pattern Recognition (ICPR), 2016.

Y. Kalantidis, L. Kennedy, H. Nguyen, C. Mellina and D.A. Shamma. LOH and behold: Web-scale visual search, recommendation and clustering using Locally Optimized Hashing. In Web-scale Vision and Social Media (VSM) Workshop, ECCV, 2016.

Y. Avrithis, Y. Kalantidis, E. Anagnostopoulos and I. Z. Emiris. Web-scale image clustering revisited. In International Conference on Computer Vision (oral) (ICCV), Santiago, Chile, December 2015.

Y. Kalantidis and Y. Avrithis. Locally Optimized Product Quantization for Approximate Nearest Neighbor Search. In Computer Vision and Pattern Recognition (CVPR), Columbus, OH, June 2014.

G. Tolias, Y. Kalantidis, and Y. Avrithis. Towards large-scale geometry indexing by feature selection. Computer Vision and Image Understanding (CVIU), March 2014.

Y. Kalantidis, L. Kennedy and L.-J. Li. Getting the Look: Clothing Recognition and Segmentation for Automatic Clothing Suggestions in Everyday Photos. In International Conference on Multimedia Retrieval (Oral paper) (ICMR), Dallas, TX, April 2013.

Y. Avrithis and Y. Kalantidis. Approximate gaussian mixtures for large scale vocabularies. In European Conference on Computer Vision (ECCV), Florence, Italy, October 2012.

G. Tolias, Y. Kalantidis, and Y. Avrithis. Symcity: Feature selection by symmetry for large scale image retrieval. In ACM Multimedia (Oral paper) (MM 2012), Nara, Japan, October 2012. ACM.

Y. Avrithis, Y. Kalantidis, G. Tolias, and E. Spyrou. Retrieving landmark and non-landmark images from community photo collections. In ACM Multimedia (Oral paper) (ACM MM 2010), Firenze, Italy, October 2010.

Y. Avrithis, G. Tolias, and Y. Kalantidis. Feature map hashing: Sub-linear indexing of appearance and global geometry. In ACM Multimedia (Oral paper) (ACM MM 2010), Firenze, Italy, October 2010.

Y. Kalantidis, LG. Pueyo, M. Trevisiol, R. van Zwol, and Y. Avrithis. Scalable triangulation-based logo recognition. In International Conference on Multimedia Retrieval, Trento, Italy, April 2011.

Y. Kalantidis, G. Tolias, Y. Avrithis, M. Phinikettos, E. Spyrou, P. Mylonas, and S. Kollias. Viral: Visual image retrieval and localization. Multimedia Tools and Applications (MTAP), 2011.

LIANGLIANG CAO

[email protected]

Room 360, Computer Science Building, UMass, Amherst, MA, 01002

http://llcao.net

EMPLOYMENT

Research Associate Professor, University of Massachusetts 2018 - now

Co-founder, Switi Inc, New York City 2016 - now

WORK HISTORY

Adjunct Associate Professor, Columbia University in the City of New York 2018

Adjunct Assistant Professor, Columbia University in the City of New York 2013 - 2017

Senior Research Scientist, Yahoo Labs at New York City 2015 - 2016

Research Staff Member, IBM T. J. Watson Research Center, NY 2011 - 2015

EDUCATION

Ph.D., Dept. Electrical and Computer Engineering, University of Illinois at Urbana-Champaign 2011

M.Phil., Dept. Information Engineering, The Chinese University of Hong Kong 2005

B.E., Dept. Electronic Engineering, University of Science and Technology of China 2003

RESEARCH INTERESTS

Artificial Intelligence including Computer Vision, Multimedia and Big Data

SELECTED HONORS

- ACM SIGMM Rising Star Award, 2017

- IBM Outstanding Accomplishment for Multimedia Team, 2012

- Best Paper Award, International Workshop on Big Data Mining, 2012

- IBM Research Division Awards, 2011 - 2013

- IBM Watson Emerging Leader in Multimedia and Signal Processing, 2010

- First Place in ImageNet Large Scale Visual Recognition Challenge, 2010

PUBLICATION SUMMARY

Google Scholar: 4000+ total citations, h-index 30, 10 patents.

12 papers have received more than one hundred citations. 100 publications in total.

http://scholar.google.com/citations?user=S-hBSfIAAAAJ&hl=en

SELECTED PUBLICATIONS

Journals

9. Y. Li, L. Cao, J. Zhu, J. Luo, Mining Fashion Outfit Composition Using An End-to-End Deep Learning Approach on Set Data, IEEE Transactions on Multimedia, Vol. 9, No. 8, pp. 1946-1955, 2017.

8. L. Wang, X. Zhao, Y. Si, L. Cao, Y. Liu. Context-associative Hierarchical Memory Model for Human Activity Recognition and Prediction. IEEE Transactions on Multimedia, Vol. 19, No. 3, pp. 646-659, 2017.

7. Q. You, R. Pan, L. Cao, J. Luo, Image Based Appraisal of Real Estate Properties, IEEE Transactions on Multimedia, Vol. 19, No. 8, pp. 1946-1955, 2017.


6. Q. You, L. Cao, Y. Cong, X. Zhang, and J. Luo, A Multifaceted Approach to Social Multimedia-based Prediction of Elections, IEEE Transactions on Multimedia, Vol. 17, No. 12, 2015.

5. G.-J. Qi, S.-F. Tsai, M.-H. Tsai, L. Cao, and T. S. Huang, Web-Scale Multimedia Information Networks. Proceedings of the IEEE (PIEEE), Vol. 100, No. 9, p. 2688-2704, 2012.

4. L. Cao, X. Jin, Z. Yin, A. Del Pozo, J. Luo, J. Han and T. S. Huang, RankCompete: Simultaneous Ranking and Clustering of Information Networks, Neurocomputing, vol. 95, p. 98-104, 2011.

3. L. Cao, J. Luo, H. Kautz and T. S. Huang, Image Annotation within the Context of Personal Photo Collections, IEEE Transactions on Multimedia, Vol. 11, No. 2, p. 208-219, 2009.

2. J. Liu, L. Cao, Z. Li and X. Tang. Plane-Based Optimization for 3D Object Reconstruction from Single Line Drawings. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30 (2), p. 315-327, 2008.

1. L. Cao, J. Liu and X. Tang. What the Back of the Object Looks Like: 3D Reconstruction from Line Drawings Without Hidden Line. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30 (3), p. 507-517, 2008.

Conferences

28. J. Liang, L. Jiang, L. Cao, A. Hauptmann, Answer with Grounding Snippets: Focal Visual-Text Attention for Visual Question Answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

27. Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, J. Li, Learning from Noisy Labels with Distillation, IEEE International Conference on Computer Vision (ICCV), 2017.

26. L. Jiang, Y. Kalantidis, L. Cao, S. Farfade, J. Tang and A. Hauptmann, Delving Deep into Personal Photo and Video Search, International Conference on Web Search and Data Mining (WSDM), 2017.

25. L. Cao, J. Hsiao, P. de Juan, Y. Li and B. Thomee, Incremental Learning for Fine-Grained Image Recognition, ACM International Conference on Multimedia Retrieval (ICMR), 2016.

24. C. Wang, L. Cao, J. Fan, Building Joint Spaces for Relation Extraction, International Joint Conference on Artificial Intelligence (IJCAI), 2016.

23. Y. Li, Y. Song, L. Cao, J. Tetreault, J. Luo, TGIF: A New Dataset and Benchmark on Animated GIF Description, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. (Spotlight)

22. M. Gygli, Y. Song, L. Cao, Video2GIF: Automatic Generation of Animated GIFs from Video, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

21. R. Schifanella, P. de Juan, J. Tetreault, L. Cao, Detecting Sarcasm in Multimodal Social Platforms, ACM Conference on Multimedia (ACM MM), 2016.

20. N. Yakovenko, L. Cao, C. Raffel, and J. Fan, Poker-CNN: A Pattern Learning Strategy for Making Draws and Bets in Poker Games, AAAI, 2016.

19. C. Wang, L. Cao, B. Zhou, Medical Synonym Extraction with Concept Space Models, International Joint Conferences on Artificial Intelligence (IJCAI), 2015.

18. Q. Chen, Z. Song, R. Feris, A. Datta, L. Cao, Z. Huang, S. Yan, Efficient Maximum Appearance Search for Large-Scale Object Detection, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.


17. F. Yu, L. Cao, R. Feris, J. R. Smith, S.-F. Chang, Designing Category-Level Attributes for Discriminative Visual Recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

16. Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, J. R. Smith, Learning Locally-Adaptive Decision Functions for Person Verification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

15. L. Cao, L. Gong, J. R. Kender, N. Codella, J. R. Smith, Learning by Focusing: A New Framework for Concept Recognition and Feature Selection, IEEE International Conference on Multimedia and Expo, 2013.

14. L. Cao, Z. Li, Y. Mu, S.-F. Chang, A Unified Submodular Framework towards Video Pooling and Hashing, ACM Conference on Multimedia (ACM MM), 2012. (Oral presentation)

13. L. Cao, Y. Mu, A. Natsev, S.-F. Chang, G. Hua, J. R. Smith, Scene Aligned Pooling for Complex Video Recognition, European Conference on Computer Vision (ECCV), 2012.

12. L. Cao, J. Smith, Z. Wen, Z. Yin, X. Jin, J. Han, BlueFinder: Estimate Where a Beach Photo Was Taken, International Conference on World Wide Web (WWW), 2012.

11. L. Cao, H. D. Kim, M.-H. Tsai, B. Cho, Z. Li, I. Gupta, C. Zhai, and T. S. Huang, Delta-SimRank Computing on MapReduce, 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining (BigMine), 2012. (Best Paper Award)

10. Z. Li, H. Ning, L. Cao, T. Zhang, Y. Gong, T. S. Huang, Learning to Search Efficiently in High Dimensions, Advances in Neural Information Processing Systems (NIPS), 2011.

9. Z. Yin, L. Cao, J. Han, C. Zhai, T. S. Huang, Geographical Topic Discovery and Comparison, International Conference on World Wide Web (WWW), 2011. (Oral presentation)

8. Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, T. S. Huang, Large-scale Image Classification: Fast Feature Extraction and SVM Training, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

7. L. Cao, Z. Liu, and T. S. Huang, Cross-dataset Action Detection, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

6. Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. S. Huang, Event Detection in Complex Scenes with Spatial and Temporal Ambiguities, IEEE International Conference on Computer Vision (ICCV), 2009. (Oral presentation)

5. L. Cao, J. Luo, F. Liang, and T. S. Huang, Heterogeneous Feature Machines for Visual Recognition, IEEE International Conference on Computer Vision (ICCV), 2009.

4. L. Cao, J. Luo, H. Kautz and T. S. Huang, Annotating Collections of Photos Using Hierarchical Event and Scene Models, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

3. L. Cao and L. Fei-Fei, Spatially Coherent Latent Topic Model for Concurrent Segmentation and Classification of Objects and Scenes, IEEE International Conference on Computer Vision (ICCV), 2007.

2. L. Cao, J. Liu and X. Tang, Degen Generalized Cylinders and Their Properties. European Conference on Computer Vision (ECCV), 2006.

1. L. Cao, J. Liu and X. Tang. 3D Object Reconstruction from a Single 2D Line Drawing without Hidden Lines. IEEE International Conference on Computer Vision (ICCV), 2005.

PROFESSIONAL SERVICES


Society

· IEEE, senior member

· ACM, member

Conference/Workshop Organizer

· Industrial chair, ACM International Conference on Multimedia Retrieval (ICMR), 2017

· Area chair, ACM Conference on Multimedia (ACM MM), 2017

· Area chair, IEEE Winter Conference on Applications of Computer Vision (WACV), 2017

· Workshop chair, ACM International Conference on Multimedia Retrieval (ICMR), 2016

· Area chair, IEEE Winter Conference on Applications of Computer Vision (WACV), 2014

· Chair, Greater New York Area Multimedia and Vision Meeting, 2012

· Chair, Greater New York Area Multimedia and Vision Meeting, 2013

· Area chair, ACM Conference on Multimedia, 2012

· Founder and chair, ACM Workshop on GeoMM, 2012-2015

Journal Reviewer :

· IEEE Transactions on Pattern Analysis and Machine Intelligence

· IEEE Transactions on Knowledge and Data Engineering

· International Journal of Computer Vision

· IEEE Transactions on Multimedia

· Neurocomputing

· Pattern Recognition

· Computer Vision and Image Understanding

· ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP)

· IEEE Transactions on Image Processing

· IEEE Transactions on Circuits and Systems for Video Technology

· IEEE Transactions on Vehicular Technology

Journal Special Issue Editorship

· Guest Editor, Computer Vision and Image Understanding (CVIU), special issue on “Large Scale Multimedia Semantic Indexing”, 2013.

· Guest Editor, ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP), special issue on “Best Papers of ACM Multimedia”, 2013.

· Guest Editor, IEEE Multimedia Magazine, special issue on “Large Scale Geosocial Multimedia”, 2014.

Conference Reviewer/ Program Committee :

· ICCV, NIPS, CVPR, ACM MM, AISTATS, ICLR, ECCV, etc.

CONFERENCE TUTORIALS

Tutorial at ACM Multimedia, Oct. 2013, Barcelona, Spain

· Massive-Scale Multimedia Semantic Modeling (co-presented with John R. Smith)

Tutorial at CVPR, June 2014, Columbus, OH

· Learning Visual Semantics: Models, Massive Computation and Applications (co-presented with Shih-Fu Chang, John R. Smith and Rogerio Feris)
