Automatic Solutions to Improve Multilingual Access in Europeana (There is a thin line between speaking 30 languages and gibberish) Juliane Stiller EuropeanaTech 2015, Paris, 12.02.2015


Dimensions of multilinguality

→ Interface and portal display

→ Search

•  Translation of query

•  Translation of documents

•  Enrichments with multilingual datasets

→ Representation and refinement of search results

•  User needs to be able to determine relevance of documents

→ Browsing

Query Translation

The challenges of query translation

Determine source language

Determine target language

Translate and construct query

•  Queries are short

•  39% of queries belong to more than one language

•  60% of queries are named entities

The detailed and long version of the process is in Péter Király: Query Translation in Europeana. Code4lib, Issue 27, http://journal.code4lib.org/articles/10285
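The three steps above (determine source language, determine target languages, translate and construct the query) can be sketched as follows. This is a minimal illustration, not the Europeana implementation: the in-memory tables `KNOWN_TERMS` and `TRANSLATIONS` and both function names are hypothetical stand-ins for a real language detector and translation service.

```python
from typing import List, Optional

# Hypothetical toy resources; a real system would call a language detector
# and a translation service instead of using these in-memory tables.
KNOWN_TERMS = {
    "de": {"stummfilm"},
    "en": {"silent film"},
    "fr": {"film muet"},
}
TRANSLATIONS = {
    ("de", "en"): {"stummfilm": "silent film"},
    ("de", "fr"): {"stummfilm": "film muet"},
}


def detect_source_language(query: str) -> Optional[str]:
    """Return a language code only if exactly one language claims the query."""
    q = query.strip().lower()
    matches = [lang for lang, terms in KNOWN_TERMS.items() if q in terms]
    # Short queries are often ambiguous; give up rather than guess.
    return matches[0] if len(matches) == 1 else None


def build_multilingual_query(query: str, targets: List[str]) -> str:
    """OR the original query with every translation that was found."""
    variants = [query]
    source = detect_source_language(query)
    if source is not None:
        for target in targets:
            table = TRANSLATIONS.get((source, target), {})
            translated = table.get(query.strip().lower())
            if translated and translated not in variants:
                variants.append(translated)
    return " OR ".join(f'"{v}"' for v in variants)
```

Keeping the original query in the result means that a failed detection or a missing translation degrades to the untranslated search instead of an empty one.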

Evaluation

→  Corpus with 250 aligned queries in English, French and German

→  Send to API to translate -> translations compared to baseline

Baseline            From German to   From English to   From French to
Stummfilm (DE)      -                Stummfilm         Stummfilm
Silent film (EN)    Silent film      -                 Silent film
film muet (FR)      cinéma muet      cinéma muet       -
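The comparison against the baseline can be sketched as a small scoring routine that sorts each API result into the three outcomes reported on the next slide. The function names and the exact-match rule are assumptions made for illustration; the talk does not say how matches were judged.

```python
from collections import Counter
from typing import Dict, Optional

LABELS = ("correct", "incorrect", "no translation")


def classify(translation: Optional[str], expected: str) -> str:
    """Label one API result against the aligned baseline query."""
    if not translation:
        return "no translation"
    same = translation.strip().casefold() == expected.strip().casefold()
    return "correct" if same else "incorrect"


def evaluate(results: Dict[str, Optional[str]],
             baseline: Dict[str, str]) -> Dict[str, float]:
    """Return the share of each outcome over all baseline queries."""
    counts = Counter(classify(results.get(q), exp) for q, exp in baseline.items())
    total = len(baseline)
    return {label: counts[label] / total for label in LABELS}


# Made-up sample data (German -> English), for illustration only.
baseline = {"Stummfilm": "Silent film", "Bulgarien": "Bulgaria", "Polen": "Poland"}
results = {"Stummfilm": "Silent film",
           "Bulgarien": "Bulgaria (disambiguation)",
           "Polen": None}
shares = evaluate(results, baseline)
```

Exact matching is stricter than a human judge: in the table above, "cinéma muet" would count as incorrect against the baseline "film muet" even though both are plausible French renderings.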

Query translation – evaluation results

English -> German: correct 51%, incorrect 19%, no translation 30%

„Unarmed“ -> „unarmed – Best of 25th Anniversary“

„Bulgaria“ -> „Bulgarien (Begriffserklärung)“

For over 80% of the queries, the automatic solution is suitable

Automatic Enrichments

Europeana enrichment - example

Enrichment types and vocabularies

Enrichment type   Target vocabulary   Source metadata fields                             Number of enriched objects
Places            GeoNames            dcterms:spatial, dc:coverage                       7 M
Concepts          GEMET, DBpedia      dc:subject, dc:type                                9.2 M
Agents            DBpedia             dc:creator, dc:contributor                         144,000
Time              Semium Time         dc:date, dc:coverage, dcterms:temporal, edm:year   10.2 M
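The mapping from enrichment type to vocabularies and source fields can be written down as a small configuration. The sketch below is only an illustration of that table: the dictionary layout, the name `ENRICHMENT_RULES` and the helper `candidate_values` are assumptions, and no vocabulary lookup is performed.

```python
from typing import Dict, List

# Enrichment type -> target vocabularies and source metadata fields,
# transcribed from the table; the structure itself is hypothetical.
ENRICHMENT_RULES = {
    "Places": {"vocabularies": ["GeoNames"],
               "fields": ["dcterms:spatial", "dc:coverage"]},
    "Concepts": {"vocabularies": ["GEMET", "DBpedia"],
                 "fields": ["dc:subject", "dc:type"]},
    "Agents": {"vocabularies": ["DBpedia"],
               "fields": ["dc:creator", "dc:contributor"]},
    "Time": {"vocabularies": ["Semium Time"],
             "fields": ["dc:date", "dc:coverage", "dcterms:temporal", "edm:year"]},
}


def candidate_values(record: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Collect, per enrichment type, the values found in its source fields."""
    found: Dict[str, List[str]] = {}
    for kind, rule in ENRICHMENT_RULES.items():
        values = [v for field in rule["fields"] for v in record.get(field, [])]
        if values:
            found[kind] = values
    return found
```

Note that `dc:coverage` feeds both the Places and the Time enrichment, so a single value such as "Paris" becomes a candidate for both vocabularies.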

Semantically incorrect enrichment

Polen (Dutch: "Poland") vs. Polen (Basque: "pollen")
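One way such an error can arise is label matching that ignores language: Dutch "Polen" means Poland, while Basque "polen" means pollen. The sketch below contrasts a naive match with a language-aware one. The vocabulary, the concept identifiers and both function names are made up for illustration and do not describe the actual Europeana enrichment code.

```python
from typing import Dict, Optional, Tuple

# Toy vocabulary: (label, language) -> concept. Identifiers are made up.
VOCABULARY: Dict[Tuple[str, str], str] = {
    ("polen", "nl"): "place:Poland",
    ("polen", "eu"): "concept:pollen",
}


def enrich_naive(value: str) -> Optional[str]:
    """Match on the label only; the first hit wins, whatever its language."""
    for (label, _lang), concept in VOCABULARY.items():
        if label == value.casefold():
            return concept
    return None


def enrich_language_aware(value: str, lang: Optional[str]) -> Optional[str]:
    """Match only labels in the record's language; skip if it is unknown."""
    if lang is None:
        return None
    return VOCABULARY.get((value.casefold(), lang))
```

The language-aware variant trades coverage for precision: records without a language tag receive no enrichment at all.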

Quality of enrichments

→ Incorrect enrichments lead to

•  Devaluation of curated metadata

•  Loss of trust from providers

•  Propagation of errors to different languages

•  Irrelevant search results

•  Bad user experiences

→ A better understanding of the impact of enrichments through evaluation is needed

Wrong vs. correct enrichments

[Bar chart: share of correct vs. incorrect enrichments for the categories WHAT, WHO and WHERE. Values shown (correct / incorrect): 78% / 22%, 100% / 0%, 92% / 8%.]

Source: Stiller, J., Olensky, M., Petras, V.: A Framework for the Evaluation of Automatic Metadata Enrichments. MTSR 2014

To conclude

→  Multilingual access is a continuous undertaking that needs to be adapted constantly

→  Best practices need to be developed and fostered by Europeana

→  Community and network are indispensable for tackling multilingual issues

•  Task force on enrichment and evaluation with experts from the network just started their work

Thank you!

Juliane Stiller
