linguistic linked open data: what’s in for (deep) machine translation? christian chiarcos...

Post on 29-Dec-2015

217 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Linguistic Linked Open Data: What’s in for (Deep) Machine Translation?

Christian Chiarcoschiarcos@informatik.uni-frankfurt.de

DeepMT, Sep 4th, 2015, Prague

Linked Open Data

Basic Concepts

3

Linked Open Data (LOD)

Plenty of Resources, linked with each other ;)Aug 2014

4

Linked Open Data (LOD)

• However, LOD pertains not so much to a resource (or a bundle of resources), but to a philosophy

• Best practices for publishing data on the web– Goals

• reusability• accessibility• transparent and explicit semantics

– esp. for links

Linked (Open) Data, informally

• use URIs as names for things (1)– links to external URIs (links) allow us to retrieve more

information from these sites• if they can be resolved via HTTP (2)• and provide information via SPARQL/RDF* (3)• and they include links to other URIs (4)Þ then, this is Linked Data

http://www.w3.org/DesignIssues/LinkedData.html<Nr>

6

Linked Open Data: The 5 star plan

7

From Tables …

PHOnetics Information Base and LExicon (PHOIBLE) Moran, S. 2012. Using Linked Data to Create a Typological Knowledge Base. In Chiarcos, C., Nordhoff, S., and Hellmann, S. (eds), Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata. Springer, Heidelberg.

8

From Tables to RDF …

Subject(primary key)

9

From Tables to RDF …

Subject

Property(„Relation“)

10

From Tables to RDF …

Subject

Property(„Relation“)

Object

11

From Tables to RDF …

1. Decompose tables into triples, i.e.,– entity attribute value resp.– Subject Property Object

Subject

Property(„Relation“)

Object

12

From Tables to RDF …

1. Decompose tables into triples, i.e.,– entity attribute value resp.– Subject Property Object

Subject

Property(„Relation“)

Object

tha u:glyph

13

From Tables to RDF …

1. Decompose tables into triples, i.e.,– entity attribute value resp.– Subject Property Object

Subject

Property(„Relation“)

Object

tha u:glyph

We chose “hasSegment” for the property corresponding to column “glyph”

14

From Tables to RDF …

1. Decompose tables into triples, i.e.,– entity attribute value resp.– Subject Property Object

Subject

Property(„Relation“)

Object

tha u:hasSegment

We chose “hasSegment” for the property corresponding to column “glyph”

15

From Tables to RDF …

1. Decompose tables into triples2. Multiple triples constitute a graph

16

From Tables to RDF …

1. Decompose tables into triples2. Multiple triples constitute a graph3. A graph can aggregate triples from other sources, as well

17

From Tables to RDF …

Graphs can be represented in other ways, but RDF allows us to

1. Provide explicit semantics (RDF Schema, Ontology)

2. Check consistency and infer implicit information

3. Merge (not only syntactically, but semantically)

4. Query

5. Link (enrich with external data)

18

From Tables to RDF …

Graphs can be represented in other ways, but RDF allows us to

1. Provide explicit semantics (RDF Schema, Ontology)

2. Check consistency and infer implicit information

3. Merge (not only syntactically, but semantically)

4. Query

5. Link (enrich with external data)

RDFS, OWL

19

From Tables to RDF …

Graphs can be represented in other ways, but RDF allows us to

1. Provide explicit semantics (RDF Schema, Ontology)

2. Check consistency and infer implicit information

3. Merge (not only syntactically, but semantically)

4. Query

5. Link (enrich with external data) URIs & SPARQL

20

Uniform Resource Identifiers (URIs)

● Agree on a common vocabulary and names for entities● URIs provide globally unique identifiers

“hasSegment”

vs.

<http://mlode.nlp2rdf.org/resource/phoible/hasSegment>

vs.

@prefix phoible: <http://mlode.nlp2rdf.org/resource/phoible/>

... phoible:hasSegment ...

string, not unambiguous

URIs

21

Turtle

• Simple triple notation

Subject-URI Property-URI Object-URI .Subject-URI Property-URI “Literal value” .

e.g., phoible:khm phoible:hasSegment "u:".

22

SPARQL

Merge data and query it using the W3C standard SPARQL (SPARQL Protocol and Query Language)

“the SQL of the Semantic Web”

SELECT DISTINCT ?languageWHERE {

?language phoible:hasSegment ?segment .?segment phoible:hasFeature phoible:delayed_release .

}

23

From Tables to RDF to Linked Data

• use URIs as names for things (1)– links to external URIs (links) allow us to retrieve more information

from these sites

• if they can be resolved via HTTP (2)• and provide information via SPARQL/RDF* (3)• and they include links to other URIs (4)Þ then, this is Linked Data

@prefix phoible: <http://mlode.nlp2rdf.org/resource/phoible/>phoible:khm phoible:hasSegment "u:".phoible:khm owl:sameAs <http://lexvo.org/id/iso639-3/khm>.

Turtle notation

http://www.w3.org/DesignIssues/LinkedData.html

24

From Tables to RDF to Linked Data

@prefix phoible: <http://mlode.nlp2rdf.org/resource/phoible/>phoible:khm phoible:hasSegment "u:".phoible:khm owl:sameAs <http://lexvo.org/id/iso639-3/khm>.

Turtle notation

25

Linked Open Data: The 5 star plan

26

Linked Open Data (LOD, Aug 2014)

Linguistic Linked Open Data

Linguistic Linked Open Data (LLOD)

Linguistic MotivationsA brief History

28

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges– different formats, schemes– distributed– dispersed metadata collections

Þ require common specifications to represent, share, access and register language resourcesLinguistic Linked Open Data (LLOD) & LLOD cloud

29

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges– different formats, schemes– distributed– dispersed metadata collections

Þ require common specifications to represent, share, access and register language resourcesLinguistic Linked Open Data (LLOD) & LLOD cloud

• by the time, long been recognized as a problem

hence, several proposals to address them

• Independently, different groups considered using RDF/OWL as a (local) solution, e.g., for– terminology (GOLD, ISOcat, OLiA) (Farrar & Langendoen

2003, Ide & Wright 2004, Schmidt et al. 2006)

30

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges– different formats, schemes– distributed– dispersed metadata collections

Þ require common specifications to represent, share, access and register language resourcesLinguistic Linked Open Data (LLOD) & LLOD cloud

• by the time, long been recognized as a problem

hence, several proposals to address them

• Independently, different groups considered using RDF/OWL as a (local) solution, e.g., for– integrating typological data bases (TDS)

(Saulwick et al. 2005, Dimitriades et al. 2010)

31

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges– different formats, schemes– distributed– dispersed metadata collections

Þ require common specifications to represent, share, access and register language resourcesLinguistic Linked Open Data (LLOD) & LLOD cloud

• by the time, long been recognized as a problem

hence, several proposals to address them

• Independently, different groups considered using RDF/OWL as a (local) solution, e.g., for– modelling and querying multi-layer corpora

(Cassidy 2010, Chiarcos et al. 2008, Rehm et al. 2008)

32

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges– different formats, schemes– distributed– dispersed metadata collections

Þ require common specifications to represent, share, access and register language resourcesLinguistic Linked Open Data (LLOD) & LLOD cloud

• by the time, long been recognized as a problem

hence, several proposals to address them

• Independently, different groups considered using RDF/OWL as a (local) solution, e.g., for– NLP pipelines

(Buyko et al. 2008, Ribieira et al. 2012, Hellmann et al. 2013)

33

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges– different formats, schemes– distributed– dispersed metadata collections

Þ require common specifications to represent, share, access and register language resourcesLinguistic Linked Open Data (LLOD) & LLOD cloud

• by the time, long been recognized as a problem

hence, several proposals to address them

• Independently, different groups considered using RDF/OWL as a (local) solution, e.g., for– interfacing corpus and dictionary data

(Burchardt et al. 2008, Mazziotta et al. 2010)

34

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges– different formats, schemes– distributed– dispersed metadata collections

Þ require common specifications to represent, share, access and register language resourcesLinguistic Linked Open Data (LLOD) & LLOD cloud

• by the time, long been recognized as a problem

hence, several proposals to address them

• Independently, different groups considered using RDF/OWL as a (local) solution

• lexical resources long provided by the SW(Gangemi et al. 2003, Buitelaar et al. 2006)

35

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges– different formats, schemes– distributed– dispersed metadata collections

Þ require common specifications to represent, share, access and register language resourcesLinguistic Linked Open Data (LLOD) & LLOD cloud

• by the time, long been recognized as a problem

• But these activities were not coordinated– and in particular, RDF was used, but resources rarely

linked to other resources in the web of data

36

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges– different formats, schemes– distributed– dispersed metadata collections

Þ require common specifications to represent, share, access and register language resourcesLinguistic Linked Open Data (LLOD) & LLOD cloud

• by the time, long been recognized as a problem

• But these activities were not coordinatedÞ Interest in RDF, barely any links Þ need for establishing communication channels

37

Language Resources, 2010 AD

• used in natural language processing, scientific research, language documentation, ...

• accessibility challenges– different formats, schemes– distributed– dispersed metadata collections

Þ require common specifications to represent, share, access and register language resourcesLinguistic Linked Open Data (LLOD) & LLOD cloud

Community Building

• by the time, long been recognized as a problem

• But these activities were not coordinatedÞ Interest in RDF, barely any links Þ need for establishing communication channels

38

OKFN Open Linguistics Working Group (OWLG)

• founded in Oct 2010 in Berlin, Germany– Working group of the Open Knowledge Foundation

• open network of individuals interested in– linguistic resources and/or – their publication under open licenses

• multi-disciplinary– NLP/CL, typology/language documentation, SW, …

• infrastructure – mailing list, web site/blog, wiki– http://linguistics.okfn.org

39

OWLG activities(http://linguistics.okfn.org)

– promoting open linguistic resourcesÞraising awareness, collecting metadata

(datahub.io)– facilitating wide-range community activities

• workshops, mailing list, publications– Linked Data in Linguistics (LDL)– Multilingual Linked Open Data for Enterprises (MLODE)– Linked Data in Linguistic Typology (LDLT)

40

OWLG activities(http://linguistics.okfn.org)

– promoting open linguistic resourcesÞraising awareness, collecting metadata

(datahub.io)– facilitating wide-range community activities

• workshops, mailing list, publications• facilitating exchange between and among more

specialized community groups– W3C OntoLex, BP-MLOD, LD4LT, ...– ACL SIGs (SIGLEX, SIGANN), ...– DGfS, MPI-EVA, ...

41

LLOD cloud

• a collection of linguistic resources– published under open licenses– as linked data– decentralized developed and maintained– meta data at http://datahub.io

=> cloud diagram

– developed as a community effort in the context of the Open Linguistics Working Group of the Open Knowledge Foundation

42

2011 a sketch on a table napkin

Mar 2012Chiarcos et al. (2012), LDL book

Sep 2012MLODE hackathon to produce first diagram from original (meta) data

2012-2014more data, more rigid quality constrantsemergence of related W3C Community Groups

Aug 2014top-level category in the LOD diagram

Workshop series

Linked Data in Linguistics(LDL, anually)

Multilingual Linked Open Data for Enterprises

(MLODE, bi-annually)

Linked Data in Linguistic Typology

(LDLT, 2013)

Building the LLOD Cloud

43linguistic-lod.org

44

Recent developments

• 9th International Conference on Language Resources and Evaluation (LREC-2014)– „the new hot topic in our community“

(Nicoletta Calzolari, Pres. ELRA)

• Selected LLOD events in the last 3 months– 4th Multilingual Semantic Web (Portoroz, Slovenia, June 2015)– 1st Summer Datathon on LLOD (Cercedilla, Spain, June 2015)– EUROLAN-2015 Summer School on “Linguistic Linked Open Data”

(Sibiu, Romania, July 2015)– LLOD-LSA Workshop at the Summer Institute of the Linguistic Society

of America (Chicago, IL, July 2015)– 4th Linked Data in Linguistics (Beijing, PRC, July 2015) – Ontology session at ESSLLI-2015 (Barcelona, Spain, Aug 2015)– 2nd Workshop on NLP&LOD (Hissar, Bulgaria, Sep 2015)

Linked Open Data for Linguists

Possible applications

46

Linked Open Data for Linguists

• Linked Data– rules of best practice for publishing data on the web

• protocols and standards• links between data sets

Þ improved access to distributed resourcesÞ improved (re-)usability of language resourcesÞ improved visibility of language resources

47

Linked Open Data for Linguists

• Linked Data– rules of best practice for publishing data on the web

Þ Information integration– Structural interoperability

• comparable formats and protocols to access dataÞ use the same query language for different data sets

48

Linked Open Data for Linguists

• Linked Data– rules of best practice for publishing data on the web

Þ Information integration– Structural interoperability– Conceptual interoperability

• develop and (re-)use a shared vocabularies for equivalent concepts

Þ the same query on different data sets

49

Linked Open Data for Linguists

• Linked Data– rules of best practice for publishing data on the web

Þ Information integration– Structural interoperability– Conceptual interoperability– Federation

• data published on the web– with a query interface (SPARQL end point)

Þ use a single query to query different datasets

50

Linked Open Data for Linguists

• Linked Data– rules of best practice for publishing data on the web

Þ Information integration– Structural interoperability– Conceptual interoperability– Federation

• data published on the web– with a query interface (SPARQL end point)

Þ use a single query to query different datasets

SPARQL 1.1

51

Linked Open Data for Linguists

• Linked Data– rules of best practice for publishing data on the web

Þ Information integration– Structural interoperability– Conceptual interoperability– Federation

• data published on the web– with a query interface (SPARQL end point)

Þ use a single query to query different datasets

Achievable with any graph-based data model

52

Linked Open Data for Linguists

• Linked Data– rules of best practice for publishing data on the web

Þ Information integration– Structural interoperability– Conceptual interoperability– Federation

• data published on the web– with a query interface (SPARQL end point)

Þ use a single query to query different datasets

The “killer application”, e.g., for annotated corpora

53

Conceptual Interoperability: Monolingual

• When language ressources for a low-resource language are developed, different people have different ideas, e.g., for English (by the mid-1990s)

Susanne PennThe AT DTFulton NP1s NNPCounty NNL1cb NNPGrand JJ NNPJury NN1c NNPsaid VVDv VBDFriday NPD1 NNP

54

Conceptual Interoperability: Monolingual

Susanne PennThe AT DTFulton NP1s NNPCounty NNL1cb NNPGrand JJ NNPJury NN1c NNPsaid VVDv VBDFriday NPD1 NNP

395 tagsword classes

morphological featuressyntactic features

lexical classes

57 tagsword classes

number and degree

55

Conceptual Interoperability: Monolingual

• Integrating both resources allows us to– apply more wide-scale statistical analyses– increase training data for supervised POS tagging– increase test data for unsupervised POS tagging

395 tagsword classes

morphological featuressyntactic features

lexical classes

57 tagsword classes

number and degree

56

Conceptual Interoperability: Multilingual

• with interoperable POS tags used across different languages, …– we can apply the same unlexicalized NLP tools

(e.g., parsers, cf. McDonald et al. 2013)– we can perform comparative corpus studies– we simplify multilingual annotation projection

57

Conceptual Interoperability

• Multiple terminology repositories exist– available over the web– RDF representation

• Are linked with each other– for language IDs: Glottolog & lexvo.org– for lexical senses: WordNets (ILI)– for grammatical categories & features: GOLD,

ISOcat, OLiA

58

Linked Terminologies

English

EAGLES

MULTEXT/East

15 (mostly) Eastern European languages

MULTEXT/East

MULTEXT/East 11 European languages

STTS

TIGER GermanConnexor

TüBa-D/ZGerman

PennBrown

Susanne

etc.

OLiAReference

Model

GOLD

ISOcat(morpho-syntax)

OntoTag(morpho-syntax)

TDS ontology

Ontologies of Linguistic Annotation OLiA

External Reference Models(Terminology Repositories)

(resource-specific) Annotation Models

Language Ressources

DictionariesCorpora

NLP Tools

EAGLESEAGLES

59

Conceptual Interoperability

PennThe DTFulton NNPCounty NNPGrand NNPJury NNPsaid VBDFriday NNP

Determiner PronounOrDeterminer

SusanneThe ATFulton NP1sCounty NNL1cbGrand JJJury NN1csaid VVDvFriday NPD1

ProperNoun Noun hasNumber.Singular

ProperNoun Noun hasNumber.Singular

ProperNoun Noun hasNumber.Singular

ProperNoun Noun hasNumber.Singular

ProperNoun Noun hasNumber.Singular

(MainVerb StrictAuxiliaryVerb) hasTense.Past [sic!]

DefiniteArticle ArticleDeterminer PronounOrDeterminer

Surname ProperNoun Noun hasNumber.Singular

TopographicalNoun ProperNoun Noun hasNumber.Singular

AdjectivehasDegree.Positive

CommonNoun Noun hasNumber.Singular

TemporalNoun ProperNoun Noun hasNumber.Singular

MainVerb hasTense.Past

atomic statements mostly identical, just a few more

from Susanne

What’s in for (Deep) MT?

Entity Linking: A Special Track for Proper NamesRe-Using Lexical Resources

Addressing Lexical GapsBootstrapping Dictionaries

Improved Deep Analysis

61

Deep MT (© Jan Hajic, yesterday)

62

Entity Linking: A Special Track for Proper Names

63

Translating Proper Names

• Normally not directly translated, but maintained– Differences in inflection– Different writing systems (Cyrillic vs. Latin vs.

Arabic, etc.)– SMT: make sure your Language Model doesn’t

override the Translation Model !

64

Translating Proper Names

• Make sure your Language Model doesn’t override the Translation Model !My grandfather's grandfather came to Germany in 1905.– "Il nonno di mio nonno arrivò in Canada nel 1905."

(google translate, Jul 2010)– "Nonno di mio nonno è venuto in Germania nel

1905." (google translate, Sep 2012)

65

Translating Proper Names

• Make sure your Language Model doesn’t override the Translation Model !"Recentemente", conferma Maria Serena Balestracci, "mi ha telefonato un signore da Bologna, che aveva sentito parlare del libro alla radio– "... a gentleman from London ..." (google

translate, Oct 2010)– "... a gentleman from Bologna ..." (google

translate, Sep 2012)

66

Translating Proper Names

• Make sure your Language Model doesn’t override the Translation Model !

These errors stopped after Google bought Freebase

67

Trivial Entity Linking

• Named Entity Recognition -> treat NEs in a special way

68

Trivial Entity Linking

• Named Entity Recognition • Entity Linking -> Link with an ontology, which

may provide multilingual labels

69

Trivial Entity Linking

• Named Entity Recognition • Entity Linking -> Link with an ontology, which

may provide multilingual labels

70

Trivial Entity Linking

• Named Entity Recognition • Entity Linking -> Link with an ontology, which

may provide multilingual labels– E.g., Entity Linking via DBpedia Spotlight– Follow DBpedia-JRCNames linking– Use multilingual label from JRCNames instead of

translating yourself

71

Problems with inflecting languages

• Just knowing about (one) possible label in another language doesn’t help much if you need to inflect a name – or, any other string label from a knowledge base

72

Problems with inflecting languages

• Listing all forms helps with entity linking, but not with machine translation

Þ We need linguistic LOD (LLOD)Systematic inclusion of grammatical information=> LLOD vocabularies and conventions

73

(Re-) Using lexical resources

74

Lemon: Lexicon Model for Ontologies

• Developed by the W3C Ontology-Lexica Community Group (OntoLex)

75

Lemon: Lexicon Model for Ontologies

• Developed by the W3C Ontology-Lexica Community Group (OntoLex)

• Provides a data model for adding linguistic information to ontologies

• Widely used within the LLOD cloud– Also by colleagues not participating in OntoLex

• E.g., PanLex (Long Now Foundation)

– “Abused” for any kind of lexical resource• Even beyond the original ontology lexicalization use

case

lemon Core

<Nr>

lemon Sample (Moran and Brümmer 2013)

<Nr>

Open World Assumption

• Unless explicitly stated, information is per se incomplete– Additional information can be expressed– E.g., using linguistic categories and features from

terminology repositoriesÞ Grammatical information can be described in a

reusable wayRecommended vocabularies: lexinfo, OLiA, GOLD

<Nr>

Vocabularies for Lexical-Conceptual Resources

• lemon provides data structures, but – for content and metadata, it relies on external

vocabularies• Interoperability depends on a bundle of vocabularies

– WordNet, DBpedia, any ontology (lexical senses)– lexvo (language identifiers)– glottolog (languoid identifiers from linguistic typology)– PHOIBLE (phoneme inventories and phonological structures)– lexinfo (grammatical features for lexical resources)– OLiA (annotations)– ISOcat (resource metadata)– GOLD (grammatical concepts) <Nr>

Vocabularies for Lexical-Conceptual Resources

• lemon provides data structures, but – for content and metadata, it relies on external

vocabularies• Interoperability depends on a bundle of vocabularies

– WordNet, DBpedia, any ontology (lexical senses)– lexvo (language identifiers)– glottolog (languoid identifiers from linguistic typology)– PHOIBLE (phoneme inventories and phonological structures)– lexinfo (grammatical features for lexical resources)– OLiA (annotations)– ISOcat (resource metadata)– GOLD (grammatical concepts)

Providing (lexical) resources in

accordance with these vocabularies improves

their reusability<Nr>

Vocabularies for Lexical-Conceptual Resources

• lemon provides data structures, but – for content and metadata, it relies on external

vocabularies• Interoperability depends on a bundle of vocabularies

– WordNet, DBpedia, any ontology (lexical senses)– lexvo (language identifiers)– glottolog (languoid identifiers from linguistic typology)– PHOIBLE (phoneme inventories and phonological structures)– lexinfo (grammatical features for lexical resources)– OLiA (annotations)– ISOcat (resource metadata)– GOLD (grammatical concepts)

Providing (lexical) resources in

accordance with these vocabularies improves

their reusability

This effect can be extended to

NLP tools

<Nr>

82

Addressing lexical gaps

Addressing lexical gaps

• Subsumption inference can partially compensate the lack of lexical resources/coverageÞ If no counterpart for the target language is found,

try hypernyms

<Nr>

A gaggle of photographers …

• Idea: – translate something that doesn’t exist in the target

language• Assume you know that it refers to the English

WordNet term gaggle-n. What would it be in German?– http://wordnet-rdf.princeton.edu/ provides

multilingual labels

<Nr>

A gaggle of photographers …• http://wordnet-rdf.princeton.edu/wn31/gaggle-n

Basque, Finnish, Japanese, no German-> check hyperym

wn31:gaggle-n wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

<Nr>

A gaggle of photographers …• http://wordnet-rdf.princeton.edu/wn31/gaggle-n

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

Port. branco, Span. banda, …no German -> check indirect hypernyms

<Nr>

A gaggle of photographers …• http://wordnet-rdf.princeton.edu/wn31/gaggle-n

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym*/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

French groupe, Cat. grup, Gal. grupo, … still no German -> check an external resource, say lemonUby(which we know to contain some German)

<Nr>

A gaggle of photographers …• http://wordnet-rdf.princeton.edu/wn31/gaggle-n

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym*/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")wn31:gaggle-n …/wn:synset_member/owl:sameAs ?gaggleUbyFILTER regexp(str(?gaggleUby), “http://lemon-model.net/lexica/uby/wn/.*”).?gaggleUby …FILTER (lang(?gaggle-n-de) = “de")

Traverse different resources (in the same end point) according to their structure, until something is found …

<Nr>

A gaggle of photographers …• http://wordnet-rdf.princeton.edu/wn31/gaggle-n

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym*/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")wn31:gaggle-n …/wn:synset_member/owl:sameAs ?gaggleUbyFILTER regexp(str(?gaggleUby), “http://lemon-model.net/lexica/uby/wn/.*”).?gaggleUby …FILTER (lang(?gaggle-n-de) = “de")

Possible because the structure of these resources is lemon-conformant

<Nr>

A gaggle of photographers …• http://wordnet-rdf.princeton.edu/wn31/gaggle-n

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym*/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")wn31:gaggle-n …/wn:synset_member/owl:sameAs ?gaggleUbyFILTER regexp(str(?gaggleUby), “http://lemon-model.net/lexica/uby/wn/.*”).?gaggleUby …FILTER (lang(?gaggle-n-de) = “de")

For resources out of the current end point, another SERVICE can be addressed -> federation

<Nr>

A gaggle of photographers …• http://wordnet-rdf.princeton.edu/wn31/gaggle-n

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")

wn31:gaggle-n lemon:sense/lemon:reference/wn:hypernym*/wn:synset_member/wn:translation ?gaggle-n-de FILTER (lang(?gaggle-n-de) = “de")wn31:gaggle-n …/wn:synset_member/owl:sameAs ?gaggleUbyFILTER regexp(str(?gaggleUby), “http://lemon-model.net/lexica/uby/wn/.*”).?gaggleUby …FILTER (lang(?gaggle-n-de) = “de")

Quite slow, though, but can be used to pre-compile word lists with generalisation

<Nr>

92

Bootstrapping dictionaries

Bootstrapping dictionaries

By transitivity• Goal: Translate from Czech to Farsi• No dictionary, but

– Czech-English, English-Farsi• Quite noisy, though, hence check multiple paths

– Czech-Russian, Russian-Farsi– …-> Limit to forms with high confidence (by the number of paths, alternatives, etc.)

• Slow, again, but can be used for precompiling<Nr>

94

Improved Deep Analysis

95

Improved Deep Analysis

• Re-using externally provided tools (and data)– Structural Interoperability

• The output of NLP tools can be represented in RDF– NLP Interchange Format (NIF, nlp2rdf.org)

» If only one layer of analysis is considered» For more complicated annotations and actual corpora,

additional means are necessary, cf. POWLA (purl.org/powla)

– Conceptual Interoperability• Represent and integrate the output of NLP tools with

reference to LLOD repositories, e.g., the Ontologies of Linguistic Annotation (OLiA)

96

Improved Deep Analysis

• Re-using externally provided tools (and data)– Structural Interoperability

• The output of NLP tools can be represented in RDF– NLP Interchange Format (NIF, nlp2rdf.org)

» If only one layer of analysis is considered» For more complicated annotations and actual corpora,

additional means are necessary, cf. POWLA (purl.org/powla)

– Conceptual Interoperability• Represent and integrate the output of NLP tools with

reference to LLOD repositories, e.g., the Ontologies of Linguistic Annotation (OLiA)

98

Comparing and combining heterogeneous linguistic analyses

diese nicht neue Erkenntnisthis not new insight`this well-known insight‘

* P. Tapanainen and T. Järvinen. 1997. A nonprojective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, pages 64–71, Washington, DC, April 1997

** H. Schmid and F. Laws. 2008. Estimation of conditionalprobabilities with decision trees and an application to fine-grained pos tagging. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008) , Manchester, UK, August 2008.

Connexor*PRON Dem FEM SG NOM

RFTagger**PRO.Dem.Attr.-3.Acc.Sg.Fem

99

Comparing and combining heterogeneous linguistic analyses

ConnexorPRON Dem FEM SG NOM

RFTaggerPRO.Dem.Attr.-3.Acc.Sg.Fem

rdf:type(olia:PronounOrDeterminer)rdf:type(olia:Pronoun)rdf:type(olia:DemonstrativePronoun)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)olia:hasCase(olia:Nominative)

rdf:type(olia:PronounOrDeterminer)rdf:type(olia:Determiner)rdf:type(olia:DemonstrativeDeterminer)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)olia:hasCase(olia:Accusative)

OLiA Reference Model descriptions

100

Comparing and combining heterogeneous linguistic analyses

rdf:type(olia:PronounOrDeterminer)rdf:type(olia:Pronoun)rdf:type(olia:DemonstrativePronoun)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)olia:hasCase(olia:Nominative)

rdf:type(olia:PronounOrDeterminer)rdf:type(olia:Determiner)rdf:type(olia:DemonstrativeDeterminer)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)olia:hasCase(olia:Accusative)

OLiA Reference Model descriptions

confidence ranking(simple voting)

rdf:type(olia:PronounOrDeterminer)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)

rdf:type(olia:Pronoun)rdf:type(olia:Determiner)

rdf:type(olia:DemonstrativePronoun)rdf:type(olia:DemonstrativeDeterminer)

olia:hasCase(olia:Accusative)olia:hasCase(olia:Nominative)

predicted by both tools

predicted by one tool

101

Comparing and combining heterogeneous linguistic analyses

rdf:type(olia:PronounOrDeterminer)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)

rdf:type(olia:Pronoun)rdf:type(olia:Determiner)

rdf:type(olia:DemonstrativePronoun)rdf:type(olia:DemonstrativeDeterminer)

olia:hasCase(olia:Accusative)olia:hasCase(olia:Nominative)

disambiguation: create the maximal consistent set S of descriptions

1. S is empty2. process descriptions with decreasing confidence

a) if the current description is consistent with all descriptions in S, then add it to S

b) if not, skip itc) iterate until all descriptions are processed

confidence ranking(simple voting)

predicted by both tools

predicted by one tool

102

Comparing and combining heterogeneous linguistic analyses

rdf:type(olia:PronounOrDeterminer)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)

rdf:type(olia:Pronoun)rdf:type(olia:Determiner)

rdf:type(olia:DemonstrativePronoun)rdf:type(olia:DemonstrativeDeterminer)

olia:hasCase(olia:Accusative)olia:hasCase(olia:Nominative)

disambiguation: create the maximal consistent set S of descriptions

1. S is empty2. process descriptions with decreasing confidence

a) if the current description is consistent with all descriptions in S, then add it to S

b) if not, skip itc) iterate until all descriptions are processed

identify incompatible annotations

check consistency conditionsin the ontology

103

Comparing and combining heterogeneous linguistic analyses

olia:Determinerolia:Pronoun

olia:PronounOrDeterminer

olia_top:MorphosyntacticCategory

is-a

is-ais-a

olia:Demonstrative

Pronoun

olia:Demonstrative

Determiner

is-a is-a

rdf:type(olia:Pronoun)rdf:type(olia:Determiner)

rdf:type(olia:DemonstrativePronoun)rdf:type(olia:DemonstrativeDeterminer)

siblings are inconsistent

cousins are inconsistent

rdf:type(olia:Determiner)rdf:type(olia:DemonstrativePronoun)

aunts/nieces, etc. are inconsistent

A is consistent with B iff A B or B A

104

Comparing and combining heterogeneous linguistic analyses

rdf:type(olia:PronounOrDeterminer)olia:hasNumber(olia:Singular)olia:hasGender(olia:Feminine)

rdf:type(olia:Pronoun)rdf:type(olia:Determiner)

rdf:type(olia:DemonstrativePronoun)rdf:type(olia:DemonstrativeDeterminer)

olia:hasCase(olia:Accusative)olia:hasCase(olia:Nominative)

disambiguation: create the maximal consistent set S of descriptions

consistency

1. S is empty2. process descriptions with decreasing confidence

a) if the current description is consistent with all descriptions in S, then add it to S

b) if not, skip itc) iterate until all descriptions are processed

Þ from every equally-ranked pair of inconsistent descriptions:

first come, first serve(simple voting with random tie resolution)

105

Experiments

• we know that ensemble combination improves accuracy– if so, we should observe an increase of accuracy

for (at least some) combinations of tools– but accuracy may be the wrong criterion

• inaccurate if the target annotation is less rich than one of the source annotations

Þ measure recall, not accuracy

106

Experiments (Chiarcos 2010)

• German newspaper corpora• 10,000 tokens from each of the following

newspaper corpora– NEGRA (Skut et al. 1998)– TIGER (Brants et al. 2002)– Potsdam Commentary Corpus (PCC, Stede 2004)

• TIGER/NEGRA-style target annotation

107

RFTagger TreeTaggerStanfordTagger

StanfordParser

BerkeleyParser

MorphistoMorphology

Connexor

corpus file with reference annotation

in TIGER format

plain texttokenized

RFTaggeranotation

model

STTS annotation

model

STTSannotation

model

STTSannotation

model

STTSannotation

model

Morphistoannotation

model

Connexorannotation

model

set of OLiA reference model

descriptions

TIGERannotation

model

maximalconsistentdescription

comparisonwith reference

description

108

ExperimentsMorphosyntax: Example

Diese nicht neue Erkenntnis

Þ PronounOrDeterminer& Determiner& DemonstrativeDeterminer

109

ExperimentsMorphosyntax: Recall

* StanfordTagger was trained on NEGRA

110

ExperimentsMorphosyntax: Recall

• continuous increase of (avg.) recall• combination of 5-6 tools outperforms best-

performing single tool– except StanfordTagger on NEGRA

• trained on NEGRA

Þ table for individual combinations

111

ExperimentsMorphosyntax: Results

• best-performing combinations (NEGRA)1. Stanford Tagger (98.97% recall)2. -“- + Stanford Parser (98.71% recall)3. -“- + TreeTagger (99.00% recall)4. Stanford Tagger + Stanford Parser + Morphisto + RFTagger

(98.87% recall)5. Stanford Tagger + Stanford Parser + TreeTagger + RFTagger +

Connexor (98.29% recall)Þ marginal decrease of performance of best-performing

tools

112

ExperimentsMorphosyntax: Results

• worst-performing combinations (NEGRA)(Berkeley Parser excluded)

1. Morphisto (70.06 % recall)2. -“- + Connexor (86.05 % recall)3. -“- + TreeTagger (91.90 % recall)4. -“- + RFTagger (94.29 % recall)5. -“- + StanfordTagger (96.10 % recall)

Þ rapid increase of performance for worst-performing tools

113

Findings

• result is a consistent set of ontological descriptions– no loss of detail when trained/evaluated against

a target annotation• with different granularityÞ can be evaluated against corpora with different target

annotation

– natural handling of different granularities• hierarchical structures

– string-based representation can be generated

114

More Experiments

Comparable results for German morphology (Chiarcos 2010, 3 tools)

Similar results for German dependency/edge labels(Chiarcos, unpublished, 5 tools, labels only)

Different use case, similar methodologyPareja et al. (2010), Spanish particle se

115

Even more experiments

• Instead of combining existing tools, we can also train tools directly on ontological representations of annotations– Even if these originate from different annotations

• With ontology-based pruning, these yield ontologically consistent descriptions

– Chiarcos & Sukhareva (NLP&LOD2, next week)• Trained a neural network, encoded and decoded with

OLiA representations• Increased granularity (depth of analysis), stable accuracy

– Replicate for discourse parsing• Based on Chiarcos (2014)

116

Summary

• LLOD – provides resources– facilitates interoperability

• data, tools, annotations, lexical resources

– facilitates access/integration of heterogeneous/distributed information

• is a community effort– depends on your input

117

Want to stay/get involved ?

• Join our discussions / meetings– present your resources, interests, questions, etc.

• Open Linguistics Working Group– http://linguistics.okfn.org/

• mailing list, telcos, meetings and events

– also consider the relevant W3C CGs, e.g.,• OntoLex => lexical-conceptual resources• BP-MLOD => best practice guidelines• LD4LT => NLP applications

118

Thanks a lot !

• Join our discussions / meetings– present your resources, interests, questions, etc.

• Open Linguistics Working Group– http://linguistics.okfn.org/

• mailing list, telcos, meetings and events

– also consider the relevant W3C CGs, e.g.,• OntoLex => lexical-conceptual resources• BP-MLOD => best practice guidelines• LD4LT => NLP applications

top related