Mediaglobe - Semantic Analysis


Upload: harald-sack

Post on 16-Jan-2015


TRANSCRIPT

Page 1: Mediaglobe - Semantic Analysis
Page 2: Mediaglobe - Semantic Analysis

Dipl.-Inf. Jörg Waitelonis
Hasso-Plattner-Institut for IT-Systems Engineering

University of Potsdam

Semantic Analysis and Search

Page 3: Mediaglobe - Semantic Analysis

Semantic Search Engine

Media Analysis
‣Structural Video Analysis
‣Intelligent Character Recognition
‣Face Detection & Clustering
‣Audio Mining
‣Visual Concept Detection

Semantic Analysis
‣Named Entity Recognition
‣Context Analysis
‣Semantic Annotation

Conceptual Workflow

Graphical User Interface
‣Faceted Search
‣Exploratory Search
‣Fine-grained User Annotation

Distribution / Production
‣Media Asset Management

Digitization | Metadata | Rights

Page 4: Mediaglobe - Semantic Analysis

Why Semantics at All?

Jaguar

Resolving ambiguities by considering the context

Natural language is incredibly expressive AND ambiguous.

Page 5: Mediaglobe - Semantic Analysis

„Armstrong betrat als erster Mensch den Mond.“ (“Armstrong was the first human to set foot on the Moon.”)

‣Context in the text

‣e.g., from ASR or OCR

‣Context in the image

‣e.g., from Visual Concept Detection

Context is what matters.

Page 6: Mediaglobe - Semantic Analysis

Named Entity Recognition

„Armstrong betrat als erster Mensch den Mond.“ (“Armstrong was the first human to set foot on the Moon.”)

Candidate entities for the terms Armstrong, Mensch, and Mond:

Armstrong:
George Armstrong Custer
Neil Armstrong
The Armstrong Twins
Armstrong, Florida
Armstrong, Ontario
Armstrong Automobile
Joe Armstrong
Armstrong County, Texas
Armstrong Gun
Craig Armstrong
Armstrong (lunar crater)
Louis Armstrong
Armstrong Tunnel
Louis Armstrong International Airport
Armstrong's Theorem
Sir Thomas Armstrong
Ian Armstrong

Mensch:
Human
Bill Mensch
Bob Mensch
David Mensch
Homer Mensch
Louise Mensch
Halber Mensch
Mensch ärgere Dich nicht
Mensch Computer
Peter van Mensch
Daniel Mensch
Mensch (album)

Mond:
Der Mond (opera)
MOND
Mond Nickel Company
Brunner Mond
Bernard Mond
Peter Mond
Julian Mond
Ludwig Mond
Violet Mond
MOND Technologies
Robert Mond
Henry Mond
Alfred Mond
Chava Mond

Page 7: Mediaglobe - Semantic Analysis

Named Entity Recognition

Wikipedia infoboxes

Page 8: Mediaglobe - Semantic Analysis

Wikipedia infoboxes

The semantic Wikipedia

Named Entity Recognition

Page 9: Mediaglobe - Semantic Analysis

http://dbpedia.org/

Named Entity Recognition

Web of Data

Page 10: Mediaglobe - Semantic Analysis

Web of Data

[Figure: Web of Data graph with the entity Neil Armstrong and the classes Astronaut, Science Occupation, Employment, and Person, connected by "is a" and "has a" edges]

Named Entity Recognition

Page 12: Mediaglobe - Semantic Analysis

Named Entity Recognition

„Armstrong betrat als erster Mensch den Mond.“ (“Armstrong was the first human to set foot on the Moon.”)

Remaining candidate entities for the terms Armstrong, Mensch, and Mond:

Armstrong:
George Armstrong Custer
Neil Armstrong
Armstrong, Florida
Armstrong, Ontario
Armstrong Gun
Craig Armstrong
Armstrong (lunar crater)
Louis Armstrong
Sir Thomas Armstrong

Mensch:
Human
Bob Mensch
David Mensch
Homer Mensch
Louise Mensch
Halber Mensch
Mensch ärgere Dich nicht
Mensch Computer
Mensch (album)

Mond:
Der Mond (opera)
Mond (Earth's satellite)
Mond Nickel Company
Brunner Mond
Bernard Mond
Peter Mond
Julian Mond
Ludwig Mond
Henry Mond
Alfred Mond
Chava Mond

Page 13: Mediaglobe - Semantic Analysis

Time-Dependent Semantic Data

[Figure: along the time axis of a video, Video Analysis / Metadata Extraction yields time-dependent metadata; Entity Recognition and Entity Mapping link this metadata to, e.g., bibliographical data, geographical data, encyclopedic data, ...]
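The pipeline on this slide (time-dependent metadata, then entity recognition, then entity mapping) could be sketched roughly as follows. This is only an illustration: the class names, the source labels "OCR" and "ASR" as field values, the lookup table, and the time values are assumptions and are not taken from the Mediaglobe system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataItem:
    """A piece of extracted metadata that is valid for a time span of the video."""
    start: float  # seconds from the beginning of the video
    end: float
    source: str   # e.g. "OCR" or "ASR" (illustrative labels)
    text: str


@dataclass(frozen=True)
class Annotation:
    """A metadata item together with the entity it was mapped to, if any."""
    item: MetadataItem
    entity: Optional[str]


# Hypothetical lookup table standing in for entity recognition and mapping.
ENTITY_TABLE = {"neil armstrong": "dbpedia:Neil_Armstrong"}


def annotate(items):
    """Map each metadata item to an entity by exact, case-insensitive match."""
    return [Annotation(i, ENTITY_TABLE.get(i.text.lower())) for i in items]


def entities_at(annotations, t):
    """Return the mapped entities whose metadata item covers time t."""
    return [a.entity for a in annotations
            if a.entity is not None and a.item.start <= t <= a.item.end]


items = [
    MetadataItem(0.0, 12.5, "OCR", "Neil Armstrong"),
    MetadataItem(12.5, 30.0, "ASR", "unrecognized words"),
]
annotations = annotate(items)
print(entities_at(annotations, 5.0))
```

Keeping the time span on every item is what allows a search hit to point to a position inside the video rather than to the video as a whole.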

Page 14: Mediaglobe - Semantic Analysis

Context Definition

RDF graph to find relations between entities co-occurring in a text, maintaining the hypothesis that disambiguation of co-occurring elements in a text can be obtained by finding connected elements in an RDF graph [7]. In order to regard the special compilation of non-textual data, static and user-generated metadata in audio-visual content, our novel approach combines the use of semantic technologies and Linked Data with linguistic methods.

III. METHOD

According to a study about structure and characteristics of folksonomy tags [8], an average of 83% of user-generated tags are single terms. Also, an average of 82% of the reviewed tags are nouns. Based on these study results, we ignore tag practices such as camel case (“barackObama”) and treat tags as subjects or categories describing a resource. As a tag could also be part of a group of nouns representing an entity or a name (“flying machine”, “albert einstein”), the tags stored as single words without any given order have to be combined in term groups of two or more terms to find all appropriate entities. Hence, every tag or group of tags within a given context may represent a distinct entity. The term combination process and subsequent mapping of terms and term groups to entities are described in sect. III-B.

To disambiguate ambiguous terms we combine two methods: a co-occurrence analysis of the terms in the context in Wikipedia articles and an analysis of the page link graph of the Wikipedia articles of entity candidates. The scores of both analysis steps are combined into a total score.
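The excerpt does not state how the two scores are combined. As a minimal sketch, assuming a plain weighted sum, the selection of the best entity candidate might look like this. The function names, the weight, and all score values are made up for illustration.

```python
def total_score(cooccurrence, link_graph, weight=0.5):
    """Combine both scores as a weighted sum (the weighting is an assumption)."""
    return weight * cooccurrence + (1.0 - weight) * link_graph


def best_candidate(scores):
    """Pick the candidate with the highest total score.

    scores maps a candidate name to a (cooccurrence, link_graph) pair.
    """
    return max(scores, key=lambda name: total_score(*scores[name]))


# Made-up scores for three candidates of the term "Armstrong".
candidates = {
    "Neil Armstrong": (0.8, 0.9),
    "Louis Armstrong": (0.3, 0.1),
    "Armstrong Gun": (0.1, 0.0),
}
print(best_candidate(candidates))
```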

A. Context Definition

Metadata exists in a certain context and has to be interpreted according to this context. For tags of audio-visual content we identified two dimensions:

• temporal dimension
• user-centered dimension

In the temporal dimension a context can be defined as the entire video, a segment or a single timestamp in the video. The user-centered dimension classifies a context by how many users created the metadata concerned: only tags by a certain user, or all tags regardless of user. Fig. 1 shows the combinations of the two dimensions of contexts for metadata in audio-visual content and their interpretation regarding the significance of a context.

Audio-visual content also provides the opportunity to supply spatial information. Thus, tags in the same region of a video frame are considered as related to each other. In the current approach we did not consider this context dimension.

To describe our approach we use a sample context of our test set (see sect. IV). This sample context is composed of tags by only one user at a certain timestamp in the video. The video containing this sample context is a presentation by Dr. Garik Israelian at the TED conference³ entitled “How spectroscopy could reveal alien life”⁴. Our sample context consists of the tags “hubble”, “spitzer”, “carbon”, “dioxide”, “methan”, “co2”, and “water”.

Figure 1. Dimensions of context definition in audio-visual content
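The two context dimensions can be read as two filters over the tags of a video. The following is a minimal sketch under that reading. The `Tag` class, the user names, the timestamps, and the tags "telescope" and "spectrum" are hypothetical and serve only to show the filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    text: str
    user: str
    time: float  # position in the video, in seconds


def select_context(tags, start=None, end=None, user=None):
    """Return the tag texts that belong to a context.

    Temporal dimension: start/end of None means the entire video,
    start == end means a single timestamp, otherwise a segment.
    User-centered dimension: user of None means tags of all users.
    """
    selected = []
    for tag in tags:
        if start is not None and tag.time < start:
            continue
        if end is not None and tag.time > end:
            continue
        if user is not None and tag.user != user:
            continue
        selected.append(tag.text)
    return selected


tags = [
    Tag("hubble", "alice", 95.0),
    Tag("spitzer", "alice", 95.0),
    Tag("water", "alice", 95.0),
    Tag("telescope", "bob", 95.0),
    Tag("spectrum", "alice", 300.0),
]
# One user at one timestamp, as in the sample context described above.
sample = select_context(tags, start=95.0, end=95.0, user="alice")
print(sample)
```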

B. Preprocessing

Term Combination: Our combination algorithm takes all tags of a specified spatio-temporal context (at a certain timestamp/in a certain segment of a video, of a single URL/image) and generates every possible combination of at most three terms of the context in every possible order. In that way we make sure to recover groups of single terms that belong together. We chose to generate combinations of three words to make sure to also hit named entities consisting of more than two words, such as “public key cryptography” or “alberto santos dumont”. About 90% of the DBpedia [9] labels consist of at most three words, but less than 5% consist of 4 words. Due to these numbers and performance issues we decided to limit the number of terms to be combined to three. Subsequently in this paper by terms we will refer to single terms as well as generated term groups. The number c of combinations is calculated by

c = \sum_{k=1}^{j} \frac{n!}{(n-k)!},

where n is the number of tags in the context and j is the maximum number of terms in a combination.

For our sample context containing 7 tags and at most 3 terms in a combination (j = 3), 259 combinations are generated.
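The term combination step corresponds to enumerating all ordered selections (permutations) of one to three tags. A minimal sketch using Python's `itertools.permutations`, with function names chosen here for illustration, reproduces the count of 259 for the sample context:

```python
from itertools import permutations
from math import factorial


def term_combinations(tags, max_terms=3):
    """Generate every ordered group of 1..max_terms distinct tags."""
    groups = []
    for k in range(1, max_terms + 1):
        for combo in permutations(tags, k):
            groups.append(" ".join(combo))
    return groups


def combination_count(n, j):
    """Evaluate c = sum over k = 1..j of n! / (n - k)!."""
    return sum(factorial(n) // factorial(n - k)
               for k in range(1, min(j, n) + 1))


tags = ["hubble", "spitzer", "carbon", "dioxide", "methan", "co2", "water"]
groups = term_combinations(tags)
print(len(groups))              # 259
print(combination_count(7, 3))  # 259
```

The generated groups include, for example, "carbon dioxide" as well as "dioxide carbon". Only groups that match an entity label in the subsequent mapping step are kept.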

Term Mapping: The terms then have to be mapped to semantic entities. For our approach we use entities of the Linked Open Data Cloud [10], in particular of DBpedia, version 3.5.1.

DBpedia provides labels for the identification of distinct entities in 92 languages. We use English and German as well as Finnish labels, as we noticed that neither the English nor the German labels contain important acronyms, but the Finnish language version does. As tagging users prefer to keep it simple and short [2], resources dealing with “Domain Name System” would rather be tagged with “DNS” than “Domain Name System”.
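The label-based mapping can be sketched as a case-insensitive lookup in an index built from labels of the selected languages. The label rows below are hand-written for illustration and were not extracted from DBpedia 3.5.1. The function names are assumptions as well.

```python
# Hypothetical (language, label, URI) rows; not actual DBpedia data.
LABELS = [
    ("en", "Domain Name System", "dbpedia:Domain_Name_System"),
    ("de", "Domain Name System", "dbpedia:Domain_Name_System"),
    ("fi", "DNS", "dbpedia:Domain_Name_System"),
    ("en", "Carbon dioxide", "dbpedia:Carbon_dioxide"),
]


def build_index(labels, languages=("en", "de", "fi")):
    """Index lower-cased labels of the chosen languages by entity URI."""
    index = {}
    for lang, label, uri in labels:
        if lang in languages:
            index.setdefault(label.lower(), set()).add(uri)
    return index


def map_terms(terms, index):
    """Map each term or term group to its URIs; unmatched terms are dropped."""
    return {t: sorted(index[t.lower()]) for t in terms if t.lower() in index}


index = build_index(LABELS)
mapped = map_terms(["dns", "carbon dioxide", "dioxide carbon"], index)
print(mapped)
```

With only the English and German rows of this toy table, the tag "dns" would find no match, which is the situation the additional Finnish labels are meant to address.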

After simple string matching of the terms of the context to DBpedia URIs, the URIs are revised for redirects and

³ http://www.ted.com
⁴ http://yovisto.com/play/14415

User-centered Dimension

Temporal Dimension

Spatial Dimension

‣different metadata sources differ in reliability

‣authoritative metadata (structured / unstructured)

‣analytical metadata (time- / location-based)

‣non-authoritative user-generated metadata (global and time- or location-based)

Page 15: Mediaglobe - Semantic Analysis

Entity-Based Annotation

‣spatial and temporal annotation with semantic entities

Page 16: Mediaglobe - Semantic Analysis

Entity-Based Search

Page 17: Mediaglobe - Semantic Analysis

Faceted Search

Page 18: Mediaglobe - Semantic Analysis

Link and Brush

Demo: http://mediaglobe.yovisto.com/mggui-dev2/

Page 19: Mediaglobe - Semantic Analysis

•A simple example:

I am looking for the book „Wem die Stunde schlägt“ (For Whom the Bell Tolls) by Ernest Hemingway in the first German edition...

Not All Searching Is the Same

Page 20: Mediaglobe - Semantic Analysis

•A simple example:

I am looking for the book „Wem die Stunde schlägt“ (For Whom the Bell Tolls) by Ernest Hemingway in the first German edition...

Wem die Stunde schlägt. - Ernest H E M I N G W A Y. (Stockholm usw., Bermann-Fischer Verlag, 1941) 560 S. 8“II 1, 2506, 34548

Not All Searching Is the Same

Page 21: Mediaglobe - Semantic Analysis

•...but what if you do not know exactly what you are looking for?

I liked the book „Wem die Stunde schlägt“ (For Whom the Bell Tolls) by Ernest Hemingway and do not know exactly what to read next....

Not All Searching Is the Same

Page 22: Mediaglobe - Semantic Analysis

• What if the user does not know which search term to use?

• What if the user is looking for more complex answers?

• What if the user does not know (well) the field of knowledge they want to learn about?

• What if the user wants to know which documents on a specific topic exist in a repository overall?

• ...'browsing' instead of 'searching'
• ...finding something 'by chance'
• ...serendipity
• ...gaining an overview

Exploratory Search

Page 23: Mediaglobe - Semantic Analysis

dbpedia:For_Whom_the_Bell_Tolls

How should the semantic network around dbpedia:For_Whom_the_Bell_Tolls be searched?

http://dbpedia.org/page/For_Whom_the_Bell_Tolls

Exploratory Search

Page 24: Mediaglobe - Semantic Analysis

[Graph: four edges labeled dbpedia-owl:author]

Exploratory Search

Page 25: Mediaglobe - Semantic Analysis

[Graph: dbpedia:For_Whom_the_Bell_Tolls and dbpedia:Ernest_Hemingway, connected by dbpedia-owl:author; dbpedia:Raymond_Carver, dbpedia:Jack_Kerouac, and dbpedia:Jerome_D._Salinger, each attached to a dbpedia-owl:influenced_by edge]

Exploratory Search

Page 26: Mediaglobe - Semantic Analysis

[Graph: dbpedia:Jack_Kerouac, dbpedia:Raymond_Carver, and dbpedia:Jerome_D._Salinger, each with an outgoing dbpedia-owl:notableWork edge]

Exploratory Search
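The path suggested by these slides (book, then author, then authors linked via influenced_by, then their notable works) can be sketched as a walk over a small set of triples. The triples below are hand-written for illustration and were not retrieved from DBpedia. The direction of the influenced_by edges and the `example:` work identifiers are assumptions.

```python
# Illustrative triples (subject, predicate, object); not actual DBpedia data.
TRIPLES = [
    ("dbpedia:For_Whom_the_Bell_Tolls", "dbpedia-owl:author", "dbpedia:Ernest_Hemingway"),
    ("dbpedia:Jack_Kerouac", "dbpedia-owl:influenced_by", "dbpedia:Ernest_Hemingway"),
    ("dbpedia:Raymond_Carver", "dbpedia-owl:influenced_by", "dbpedia:Ernest_Hemingway"),
    ("dbpedia:Jerome_D._Salinger", "dbpedia-owl:influenced_by", "dbpedia:Ernest_Hemingway"),
    ("dbpedia:Jack_Kerouac", "dbpedia-owl:notableWork", "example:Work_by_Kerouac"),
    ("dbpedia:Raymond_Carver", "dbpedia-owl:notableWork", "example:Work_by_Carver"),
    ("dbpedia:Jerome_D._Salinger", "dbpedia-owl:notableWork", "example:Work_by_Salinger"),
]


def objects(subject, predicate):
    """All objects o with a triple (subject, predicate, o)."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s with a triple (s, predicate, obj)."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def recommend(book):
    """Follow author -> influenced_by (reversed) -> notableWork from a book."""
    recommendations = []
    for author in objects(book, "dbpedia-owl:author"):
        for follower in subjects("dbpedia-owl:influenced_by", author):
            for work in objects(follower, "dbpedia-owl:notableWork"):
                if work != book and work not in recommendations:
                    recommendations.append(work)
    return recommendations


print(recommend("dbpedia:For_Whom_the_Bell_Tolls"))
```

Fixing the sequence of properties to follow is one possible answer to the question of how the network around a resource should be searched. Other property paths would yield other kinds of recommendations.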

Page 27: Mediaglobe - Semantic Analysis

CONCLUSION

‣Mediaglobe enables semantic, entity-based search

‣Mediaglobe thereby outperforms traditional keyword-based search engines in precision and recall

‣Mediaglobe's semantic annotations enable:

‣novel recommender systems

‣e.g., as an extension of the search options

‣or as a basis for other content-sensitive services

‣interoperability with other systems through standards

‣new design options for innovative user interfaces

Page 28: Mediaglobe - Semantic Analysis

Dipl.-Inf. Jörg Waitelonis
Hasso-Plattner-Institut for IT-Systems Engineering

University of Potsdam

Thank you!