
CultMedia: Deep Learning for automatic description of images and video in DH

Technological Innovation for Digital Humanities, March 3rd 2018

Lorenzo Baraldi, Rita Cucchiara

[email protected]

University of Modena and Reggio Emilia


About us

Who

• 4 staff members (Rita Cucchiara, Costantino Grana, Roberto Vezzani, Simone Calderara)

• 8 PhD students

• 7 research assistants and software developers

• 3 (ex) spinoff companies

Collaborations with

• Facebook FAIR (F), Eurecom (F)

• Panasonic (USA)

• Ferrari (I), Maserati (I)

• MIUR, EU and Italian public bodies

• Italian SuperComputing Resource Allocation – CINECA

• Many SMEs

• Computer Vision Foundation, CVPL-IAPR, AIXIA

www.aimagelab.unimore.it

AImageLab UNIMORE and Ferrari S.p.A.


Research activity on Cultural Heritage

• Layout analysis and content classification on digitized manuscripts

• Browsing and retrieval systems

• Interaction with art

• Video, Vision and Language…teaching machines to understand Art


The “Treccani” project

• 35 volumes, published from 1929

• Digitized version from the original manuscripts

• Complex layouts with regions from different categories:

• Text

• Images

• Graphic

• Scores

• Tables

• Borderless tables

• Goal: a completely digitized and browsable version of the Encyclopedia.

A. Corbelli, L. Baraldi, F. Balducci, C. Grana, R. Cucchiara "Layout analysis and content classification in digitized books" IRCDL 2017

A. Corbelli, L. Baraldi, C. Grana, R. Cucchiara "Historical Document Digitization through Layout Analysis and Deep Content Classification" ICPR 2016


The “Treccani” project

Layout analysis

OCR on text regions

Region classification

• Text

• Images

• Graphic

• Formulas

• Scores

• Tables

• Borderless tables

• JSON output

• Interactive annotation interface

• Visualization interface
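
To make the "JSON output" step more concrete, here is a minimal sketch of how classified regions could be serialized per page. The slides do not show the actual schema, so the field names (`page`, `regions`, `category`, `bbox`, `text`) and the helpers `make_region` and `page_to_json` are hypothetical, and the sample values are invented for illustration.

```python
import json

# Region categories listed on the slide (identifier spelling is an assumption).
CATEGORIES = {"text", "image", "graphic", "formula", "score", "table", "borderless_table"}

def make_region(category, bbox, text=None):
    """Build one region record; bbox is (x, y, width, height) in pixels."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    region = {"category": category, "bbox": list(bbox)}
    if text is not None:
        # OCR output is attached only when it is available (text regions).
        region["text"] = text
    return region

def page_to_json(page_number, regions):
    """Serialize all regions of one page as a JSON string."""
    return json.dumps({"page": page_number, "regions": regions}, indent=2)

doc = page_to_json(12, [
    make_region("text", (40, 60, 500, 300), text="sample OCR text"),
    make_region("table", (40, 400, 500, 200)),
])
print(doc)
```

A flat list of regions per page keeps the file easy to edit by hand, which would suit an interactive annotation interface; the real project may well use a different layout.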


[Figure panels: Ground truth, Result]


[Figure panels: Ground truth, Result]


Annotation interface

The JSON description can be visualized and edited in every part through an interactive annotation interface.


Navigation interface

Homepage


Navigation interface

Single page visualization


Navigation interface

Digitized version with in-line graphic elements


Navigation interface

Automatically retrieved graphic elements


Navigation interface

Search by word


The «Rerum Novarum» project

Document browsing and interactive retrieval

Multi-digitization of illuminated manuscripts

• Layout segmentation

• Picture segmentation and tagging

• Search with relevance feedback

D. Borghesani, C. Grana, R. Cucchiara "Rerum Novarum: Interactive Exploration of Illuminated Manuscripts" ACM MM 2010

C. Grana, D. Borghesani, R. Cucchiara "Relevance feedback strategies for artistic image collections tagging" ICMR 2011


Interacting with Art

• Novel human-machine interfaces: new kinds of self-guided tours that can integrate information from the local environment, the web and social media.

• A wearable vision device for museum environments.

• Visitors can interact with the artwork by replicating the gestures and behaviors that they would use to ask a guide something about the artwork.

Algorithms:

• Hand segmentation

• Gesture Recognition

• Artwork Recognition

L. Baraldi, F. Paci, G. Serra, R. Cucchiara "Gesture Recognition using Wearable Vision Sensors to Enhance Visitors' Museum Experiences" IEEE Sensors, 2015

L. Baraldi, F. Paci, G. Serra, L. Benini, R. Cucchiara "Gesture Recognition in Ego-Centric Videos using Dense Trajectories and Hand Segmentation" CVPRW 2014


Artwork recognition

• An image processing algorithm runs on the wearable device and detects, in real time, the artwork the user is observing.

• The result of the processing activity is then sent to the processing center.

• The location service is used to speed up the artwork identification
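
One plausible way the location service could speed up identification is by restricting the comparison to artworks in the visitor's current room. The sketch below illustrates that idea only; the descriptors, room identifiers and the function `identify_artwork` are hypothetical, and the slides do not say which matching method the system actually uses.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify_artwork(query, gallery, room=None):
    """Return the id of the artwork whose descriptor is closest to the query.

    gallery: list of (artwork_id, room_id, descriptor) tuples.
    room: if given, only artworks located in that room are compared.
    """
    candidates = [g for g in gallery if room is None or g[1] == room]
    if not candidates:
        # Unknown or wrong location: fall back to the whole collection.
        candidates = gallery
    best = min(candidates, key=lambda g: euclidean(query, g[2]))
    return best[0]

# Toy descriptors; a real system would compute them from camera frames.
gallery = [
    ("artwork_a", "room_1", [0.9, 0.1, 0.0]),
    ("artwork_b", "room_1", [0.1, 0.8, 0.1]),
    ("artwork_c", "room_2", [0.85, 0.15, 0.05]),
]
query = [0.88, 0.12, 0.02]

print(identify_artwork(query, gallery))
print(identify_artwork(query, gallery, room="room_2"))
```

Filtering by room shrinks the candidate set before any distance is computed, which matters on a wearable device with limited compute.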

Head-mounted Camera

Wearable Device

SmartBox Bluetooth

Access Point WiFi


CultMedia: teaching machines to understand art

Project from the National Technological Cluster on Technologies for the Cultural Heritage, co-funded by the Italian Ministry of Education, University and Research (2017-2018)

A focus on multimedia data: video, images, digitized documents, computer graphics

Goals: high-quality and low-cost multimedia production, for re-using existing materials and for integrating multimedia data in cross-media storytelling


CultMedia

A disruptive improvement in the processes and services related to the cultural heritage content production

Goals

handling the creation of multimedia video and new transmedia storytelling

providing large cost savings through the extended use of machine learning and artificial intelligence solutions for the reuse of existing multimedia material and its integration in new CH productions.

Research activities @ AImageLab

Video browsing, indexing, retrieval

Novel descriptors for video indexing

Bridging together vision and language

… teaching machines to understand art!


Browsing (and reusing) video

Neuralstory: an interactive Multimedia System for Video Indexing and Re-use

• Decomposition of the storytelling structure into coherent parts, to enhance browsing and retrieval (scene detection)

• Automatic annotation and retrieval of broadcast video

• Users can produce new storytelling by means of multi-modal presentations (re-use)

Online demo at: https://www.neuralstory.it


Video Decomposition and Indexing

Video Decomposition into meaningful parts

• A Deep Network learns a semantic embedding space, in which shots belonging to the same scene have lower Euclidean distances.

• This decomposition is the basis of the visualization interface, and also allows a fine-grained search inside video-clips.

Retrieval

• Leverages automatic annotation and a thumbnail selection strategy, to provide semantically and aesthetically valuable results.


Video decomposition: our approach

Perceptual features (visual, audio, quantity of speech) and Semantic features (textual concepts, visual concepts)

A Deep Network learns a semantic embedding space, in which shots belonging to the same scene have lower Euclidean distances
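
Once shots live in an embedding space where same-scene shots are close, scene boundaries can be read off from the distances. The sketch below is a deliberately simplified illustration, assuming the embeddings are already computed: it opens a new scene whenever two consecutive shots are farther apart than a threshold. The function name, the threshold value and the toy vectors are assumptions, and the slides do not describe the grouping procedure actually used.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group_shots_into_scenes(embeddings, threshold):
    """Group consecutive shots into scenes.

    A new scene starts whenever the distance between a shot and the
    previous one exceeds the threshold. Returns lists of shot indices.
    """
    if not embeddings:
        return []
    scenes = [[0]]
    for i in range(1, len(embeddings)):
        if euclidean(embeddings[i - 1], embeddings[i]) > threshold:
            scenes.append([i])
        else:
            scenes[-1].append(i)
    return scenes

# Toy 2-D embeddings: three shots near the origin, then two far away.
shots = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [2.0, 2.1], [2.1, 2.0]]
scenes = group_shots_into_scenes(shots, threshold=1.0)
print(scenes)
```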


A one-hour video is decomposed into coherent parts and can be watched in less than one minute


Retrieval: our approach

Hypothesis:

• In broadcast videos, the speaker describes what the video shows

• Retrieval driven by semantic concepts suggested in the transcript

Thumbnail selection

• Aesthetic ranking model using CNN activations, and a small training set


Aesthetic-based selection

[Baraldi, Grana, Cucchiara, ACM ICMR 2016]


Aesthetic-based retrieval

“Selecting and ranking thumbnails according to some learned perceptual features”

… The idea of beauty comes from the perception of objects, their proportions, their harmony or unity among the parts, in the evenness of the line and purity of color …

Low-level characteristics, like color, edges and sharpness

High-level features, such as the presence of a clearly visible object in the center

An excellent match with the hierarchical nature of CNNs

A ranking strategy which learns the relative importance given a dataset of user preferences

VGG-16: more than 4000 convolutional feature maps, of different sizes
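
Learning a ranking from user preferences can be sketched as a pairwise problem: for every pair where users preferred one thumbnail over another, the preferred one should score higher. The example below is a generic linear pairwise ranker trained with a hinge-style update, not the model from the cited paper; the feature vectors stand in for CNN activations and are invented, as are the function names.

```python
def score(weights, features):
    """Linear aesthetic score of one thumbnail."""
    return sum(w * f for w, f in zip(weights, features))

def train_pairwise_ranker(preferences, n_features, epochs=50, lr=0.1, margin=1.0):
    """Learn weights from (preferred_features, other_features) pairs.

    Whenever the preferred item does not beat the other one by at least
    the margin, the weights move in the direction of their difference.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in preferences:
            if score(weights, better) - score(weights, worse) < margin:
                for i in range(n_features):
                    weights[i] += lr * (better[i] - worse[i])
    return weights

# Toy preference pairs over two made-up features.
prefs = [
    ([0.9, 0.2], [0.1, 0.3]),
    ([0.8, 0.7], [0.2, 0.6]),
    ([0.7, 0.1], [0.3, 0.9]),
]
weights = train_pairwise_ranker(prefs, n_features=2)

candidates = {"thumb_a": [0.95, 0.5], "thumb_b": [0.05, 0.5]}
ranked = sorted(candidates, key=lambda k: score(weights, candidates[k]), reverse=True)
print(ranked)
```

The learned weights play the role of the "relative importance" of each feature: a larger weight means that feature matters more for the preferences in the training set.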


Aligning and searching inside videos

Temporal Match Kernels

A novel compact descriptor for video alignment and retrieval (with a Fourier transform!)

Applications

• Temporal alignment of different videos

• Similarity between videos

• Searching for a piece of video in a video collection

• Searching for an artwork in a video collection
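
For intuition about temporal alignment, the sketch below does a brute-force search over time offsets: for each offset it sums the similarities of the frame descriptors that end up matched, and keeps the best offset. This is only an illustration of the alignment objective under simplifying assumptions; it is not the compact Fourier-based descriptor mentioned on the slide, and all names and toy descriptors are hypothetical.

```python
def dot(a, b):
    """Dot-product similarity between two frame descriptors."""
    return sum(x * y for x, y in zip(a, b))

def alignment_score(query, reference, offset):
    """Total similarity when query frame t is matched to reference frame t + offset."""
    total = 0.0
    for t, q in enumerate(query):
        r = t + offset
        if 0 <= r < len(reference):
            total += dot(q, reference[r])
    return total

def best_offset(query, reference):
    """Try every offset with at least one overlapping frame; return the best one."""
    offsets = range(-len(query) + 1, len(reference))
    return max(offsets, key=lambda d: alignment_score(query, reference, d))

# Toy per-frame descriptors; the query is a clip cut from the reference.
reference = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
query = reference[2:5]

print(best_offset(query, reference))
```

The brute-force loop costs one pass over the query per candidate offset, which is the kind of cost a compact descriptor is meant to avoid on large collections.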

With Facebook AI Research, CVPR 2018

L. Baraldi, M. Douze, R. Cucchiara and H. Jégou, "LAMV: Learning to align and match videos with kernelized temporal layers", CVPR 2018


Video re-use

Images, shots and scenes can be picked during watching

Selected clips can be used to create new multimodal slides

Which can be enriched with text and images

Decomposing the storytelling structure of a collection of videos enables the creation of new personalized storytelling.


From temporal segmentation to captioning

LSTM networks as language models

At training time: condition on the image (only at the first timestep, or at every timestep) and train to predict the next word given the previous ground-truth (GT) words

[Figure: an unrolled chain of LSTM cells. GT input words: "a dog carrying a frisbee in a"; predicted words: "a dog carrying a frisbee in a field"]

Using a vocabulary of more than 10,000 words
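
Training on the previous ground-truth words means that, at each timestep, the model receives the true previous word as input and is asked to predict the true next word. The sketch below only builds those (input, target) pairs for one caption; the `<bos>` and `<eos>` marker tokens and the function name are assumptions, and the LSTM and the image conditioning are left out.

```python
def teacher_forcing_pairs(caption, bos="<bos>", eos="<eos>"):
    """Return (previous_word, next_word) training pairs for one caption.

    The inputs are the caption shifted right by one position, starting
    with a begin-of-sentence marker; the targets are the caption itself
    followed by an end-of-sentence marker.
    """
    tokens = caption.lower().split()
    inputs = [bos] + tokens
    targets = tokens + [eos]
    return list(zip(inputs, targets))

pairs = teacher_forcing_pairs("a dog carrying a frisbee in a field")
for previous_word, next_word in pairs:
    print(f"{previous_word:>10} -> {next_word}")
```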


Automatic annotation

Automatically generated captions will be useful for human search, for automatic search by query, and for future query-answering services.

L. Baraldi, C. Grana, R. Cucchiara, "Hierarchical Boundary-Aware Neural Encoder for Video Captioning" CVPR, 2017

M. Cornia, L. Baraldi, G. Serra, R. Cucchiara, "Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention" ACM TOMM, 2017

Generated caption: A woman is looking at a television screen.

Generated caption: A city with a large boat in the water.

Generated caption: A boat is in the water near a large mountain.

Generated caption: A woman in a red jacket is riding a bicycle.


Bridging vision and language in art

Vignette depicting Solomon receiving homage from the princes.

A round with a peacock in fenced area.

Joseph dropped by the brothers in the well.

A round with two monkeys, one of whom holds a cherub in his arms.

Goals

• Understanding art

• Describing art in natural language

• Retrieving images with natural language queries

Challenges:

• Open research area also in natural images

• Domain shift: visual and textual elements are different from ordinary datasets


BibleVSA dataset

From the Borso d’Este Holy Bible:

Illuminated manuscript (640 pages)

Commentary describing the visual content of each of the illustrations, the decorations of the page, and of the textual content itself.

Annotations of the alignment between parts of the commentary and the illustrations

Training of visual-semantic embeddings

Automatic alignment of visual and textual cultural data
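
In a visual-semantic embedding, text and illustrations are mapped into the same vector space, so aligning a piece of commentary with an illustration reduces to a nearest-neighbor search. The sketch below assumes the embeddings already exist and ranks illustrations by cosine similarity to a text embedding; the vectors, identifiers and function names are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm

def retrieve(query_embedding, image_embeddings, top_k=1):
    """Return the ids of the top_k images most similar to the query."""
    ranked = sorted(
        image_embeddings,
        key=lambda name: cosine(query_embedding, image_embeddings[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy joint-space embeddings for three illustrations and one sentence.
images = {
    "illustration_1": [0.9, 0.1, 0.0],
    "illustration_2": [0.0, 0.2, 0.9],
    "illustration_3": [0.1, 0.9, 0.1],
}
query = [0.1, 0.1, 0.8]

print(retrieve(query, images, top_k=3))
```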


Visualizing the domain shift

Visual data: ResNet-152, VGG-19

Textual data: FastText (Facebook AI Research), GloVe, Word2Vec


Building visual-semantic spaces in the DH domain

The unsupervised way

Relying only on the supervision given by non-DH datasets

…a metric learning loss, and the constraint that the distributions of text and data should match (MMD)

L. Baraldi, M. Cornia, C. Grana, R. Cucchiara, “Aligning text and document illustrations: towards visually explainable Digital Humanities”, submitted to ICPR 2018

[Figure panels: Without MMD, With MMD]
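
Reading MMD as Maximum Mean Discrepancy, the constraint measures how far apart two sets of embeddings are as distributions; adding it to the loss pushes the two sets to overlap. The sketch below computes a biased estimate of the squared MMD with a Gaussian kernel on toy samples. The kernel choice, the `gamma` value and the sample points are assumptions, and the slides do not give the exact formulation used in the paper.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(rbf_kernel(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2.0 * mean_kernel(xs, ys, gamma)
    )

# Toy samples: `near` overlaps with `source`, `far` does not.
source = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]]
near = [[0.05, 0.05], [0.15, 0.15], [0.1, 0.1]]
far = [[2.0, 2.0], [2.1, 2.2], [2.2, 2.1]]

print(mmd_squared(source, near))
print(mmd_squared(source, far))
```

The value is close to zero when the two samples overlap and grows as they drift apart, which is what makes it usable as a penalty term during training.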


Building visual-semantic spaces in the DH domain

Automatic alignment on a single page

L. Baraldi, M. Cornia, C. Grana, R. Cucchiara, “Aligning text and document illustrations: towards visually explainable Digital Humanities”, submitted to ICPR 2018


The CultMedia dataset

We need more data, to tackle more tasks!

Creation of a (medium-to-)large-scale dataset oriented to the Cultural Heritage domain and suitable for automatic understanding tasks, such as:

• Artwork identification and retrieval (I can detect, locate and identify an artwork)

• Automatic artwork description and retrieval with natural language queries (I can describe an artwork, and retrieve similar ones from other natural language descriptions)

• Detection of attributes and relationships inside the artwork (I can identify the people/objects represented in the artwork, and the relationships between them)

• Visual grounding of descriptions (I can use that knowledge to ground and justify the descriptions)

Strong link with the re-use spirit of the project.

«A man with hat holding a glass of wine»

Soldato con Calice (N. Tournier)


Data annotation (#1)

Temporal segmentation of the input video, to isolate the temporal extent of each artwork and unrelated temporal segments

Artwork detection, i.e. annotate the bounding box of the artwork, frame by frame (exploiting the semi-automatic annotation given by the optical flow)

[Timeline: Unrelated | Artwork #1 | Walking | Artwork #2 | …]
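
Semi-automatic annotation with optical flow can be pictured as follows: the annotator draws a box on one frame, and the box is carried to the next frame by the motion measured inside it. The sketch below shifts a box by the median flow vector of the pixels it contains. The sparse flow dictionary, the function name and the numbers are hypothetical; a real tool would obtain dense flow from a computer vision library and let the annotator correct the result.

```python
from statistics import median

def propagate_box(box, flow):
    """Shift a bounding box by the median optical-flow vector inside it.

    box: (x, y, width, height) in pixels.
    flow: dict mapping (px, py) pixel coordinates to (dx, dy) displacements.
    """
    x, y, w, h = box
    inside = [
        v for (px, py), v in flow.items()
        if x <= px < x + w and y <= py < y + h
    ]
    if not inside:
        # No motion evidence inside the box: keep it and leave it to the annotator.
        return box
    dx = median(v[0] for v in inside)
    dy = median(v[1] for v in inside)
    return (x + dx, y + dy, w, h)

# Toy flow field: three points inside the box, one outlier outside it.
flow = {
    (12, 12): (3.0, 1.0),
    (15, 18): (3.0, 1.0),
    (18, 14): (4.0, 1.0),
    (50, 50): (-9.0, 0.0),
}
next_box = propagate_box((10, 10, 10, 10), flow)
print(next_box)
```

Using the median rather than the mean keeps a few wrong flow vectors from dragging the box away.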


Data annotation (#2)

Annotation with metadata, i.e. author, name of the artwork, year, style, …

Captions:

a. describing the content of the artwork without leveraging any cultural background

b. describing the content and the context of the artwork by leveraging a specific cultural background

«A man with hat holding a glass of wine»

«A caravaggesque painting in which a soldier seems to establish a cultured dialogue with the spectator, descending, into the daily life of an inn, echoes of the classical tradition of the myth of Bacchus»


Data annotation (#3)

Annotation of the details: detection and description of the components of the artwork (objects and people) with actions and attributes

Grounding of captions, i.e. connecting people, objects, attributes and actions in a natural language sentence: “Nerone is standing in front of Agrippina, who lies on a bed.”

Nerone (person), standing

Agrippina (person), lying
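
A grounded caption can be stored as a sentence plus a list of entities, each tied both to a span of the text and to a region of the image. The record layout below is a hypothetical sketch, not the project's annotation format: the class `GroundedEntity`, the helper `ground` and the bounding-box numbers are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class GroundedEntity:
    name: str
    category: str
    action: str
    span: tuple  # (start, end) character offsets in the caption
    box: tuple   # (x, y, width, height) in pixels; values below are made up

def ground(caption, name, category, action, box):
    """Link the first mention of `name` in the caption to an image region."""
    start = caption.index(name)  # raises ValueError if the name is absent
    return GroundedEntity(name, category, action, (start, start + len(name)), box)

caption = "Nerone is standing in front of Agrippina, who lies on a bed."
entities = [
    ground(caption, "Nerone", "person", "standing", (120, 40, 80, 200)),
    ground(caption, "Agrippina", "person", "lying", (260, 150, 220, 90)),
]

for e in entities:
    print(e.name, e.category, e.action, caption[e.span[0]:e.span[1]], e.box)
```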


Annotation interface

Ad-hoc web-based annotation interfaces, also integrating existing platforms (Vatic)

Audit and control of the annotations in an on-line manner.


Preliminary results

A first round of data collection and annotation took place in February

Three groups of five annotators

Each round is scheduled as follows:

1st day: training on the interface, collection and validation of sample annotations on synthetic data

2nd day: visit to the Estense Gallery (Modena), and collection of the data- before and after the

3rd to 5th day: annotation and cross-validation

During the first round:

around 2000 natural language descriptions

140 detailed annotations of artworks and their details

annotation of 200 short user-generated videos taken inside the museum.


Research activity on Cultural Heritage

• Layout analysis and content classification on digitized manuscripts

• Browsing and retrieval systems

• Interaction with art

• Video, Vision and Language…teaching machines to understand Art


Thank you! Questions?

[email protected]@unimore.it

http://aimagelab.ing.unimore.it

Thanks to “Città Educante” Project (CTN01 00034 393801) of the National Technology Cluster on Smart Communities, co-funded by the Italian Ministry of Education, University and Research (MIUR).

Ongoing collaboration with Facebook AI Research (FAIR): Facebook has selected Imagelab as one of the 15 world-class research labs in Europe

Thanks to “CultMedia” Project of the National Technology Cluster on Smart Communities, co-funded by the Italian Ministry of Education, University and Research (MIUR).