Seeing is Correcting: Linked Open Data for Portuguese

Valeria de Paiva (joint work with Fabricio Chalub, Alexandre Rademaker, Livy Real, Claudia Freitas)
ACL-LDL, Beijing, August 2015; Nuance, Dec 2015

Upload: valeria-de-paiva

Post on 22-Mar-2017

+Linguistic Linked Open Data?

- Lexical databases form an essential component of many modern Natural Language Processing (NLP) systems.
- Portuguese lacks many of the resources available for other languages, for example annotated collections of nominalizations, deverbals or deadjectivals.
- We’re developing OpenWordNet-PT, an open-source, RDF-based version of WordNet for Portuguese.

+Linguistic Resources

- Very easy to start
- Very hard to improve
- Extremely difficult to maintain

The last two tasks get far less recognition and kudos than the first.

+OpenWordNet-PT

- Freely available since Dec 2011
- RDF-based from the beginning
- SPARQL endpoint and RDF downloads
- Constantly improved since 2011, manually and semi-automatically
- Close connection with Princeton WordNet (PWN)
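As a rough illustration of what a SPARQL endpoint makes possible, the sketch below builds a SPARQL Protocol GET request URL for a simple lookup by label. The endpoint address and the use of `rdfs:label` are placeholders chosen for illustration, not details of the OpenWordNet-PT service; the project's own documentation has the real endpoint and vocabulary.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address: an assumption for illustration only.
ENDPOINT = "http://example.org/own-pt/sparql"


def build_query_url(endpoint: str, lemma: str, limit: int = 10) -> str:
    """Return a SPARQL Protocol GET URL that looks up resources by label."""
    # Escape backslashes and double quotes so the lemma is a valid string literal.
    safe = lemma.replace("\\", "\\\\").replace('"', '\\"')
    # rdfs:label is a generic stand-in; the real dataset may use other properties.
    query = (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?s WHERE { ?s rdfs:label ?label . "
        f'FILTER(STR(?label) = "{safe}") }} LIMIT {limit}'
    )
    # The SPARQL Protocol passes the query text in the "query" parameter.
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, "casa")
print(url)
```

The resulting URL can be fetched with any HTTP client; results typically come back as SPARQL JSON or XML, depending on the `Accept` header sent.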

+Lexical Enrichment Process

- Translations: glossaries and lists produced for other languages, such as English, French and Spanish, are automatically translated and manually revised.
- Corpus data contributes words and phrases in common use.
- Corpora bring additional challenges: some expressions have no synset in the English wordnet. How do we decide on these?
- Dictionaries or lexical lists like Portal can be useful for checking completeness, but frequency/popularity information is mostly missing.

+OpenWordNet-PT: Goal

Not a simple translation of the original wordnet, but a wordnet for Portuguese built on Princeton’s architecture and, as much as possible, linked to it at the level of the synsets.

+Are we there yet?

- We’re nowhere near as comprehensive as Princeton’s wordnet, or the Finnish or Thai wordnets.
- Not too small either: more than twice the size of the Russian wordnet, bigger than the Spanish, and just a little smaller than the French wordnet.
- The quality of these resources is much harder to compare.
- BUT…

+Web Interface

+More Web interface

+Affordances of Linked Data

+Linked Data

+More linked data?

+OpenWordNet-PT: challenges

- Which variants of Portuguese to include?
- Register of the lexicon: colloquialisms, coarse language?
- Princeton concepts with no direct correspondent in Portuguese, e.g. “jog”?
- How to include senses that exist in Portuguese but not in English? “aportuguesar” has no English counterpart; “africanizar” is not in PWN, but is in Merriam-Webster.

+OpenWordNet-PT: More challenges

- PWN is very US-centred. Localize it?
- Synset 13390244-n has a specific word (quarter) for “a United States or Canadian coin worth one fourth of a dollar”. Should we have synsets like this?
- Or like synset 08139795-n, the “United States Department of the Treasury”?
- A-Boxes should be in the ontology, not in the lexicon? Demonyms (nouns and adjectives) need to be in the lexicon.

+First answers

- Keep all the Princeton synsets.
- Translate to the closest hyponym in Portuguese.
- Keep the synsets, the glosses and the examples: “moeda de 25 centavos” (25-cent coin) and “Ministério da Fazenda” (Department of the Treasury).
- How to create new senses and be sure that pre-existing wordnets have not already created the same senses?
- How to be sure that future wordnets will have access to senses that are present in Portuguese but not in English?
- How can linked data help with that?

+Current Status OWN-PT

- 43,925 synsets, of which
- adjectives: 6012, adverbs: 1054, nouns: 33245, verbs: 5902
- Improvements from July 2015:
- Total: 50459
- New adjectives: 537, adverbs: 75, nouns: 550, verbs: 1227

+RDF Representation

- Based on “RDF/OWL Representation of WordNet”, W3C Working Draft, 19 June 2006.
- Mapping extended with classes and relations for WordNet 3.0 encoding and OWN-PT demands.
- lemon-rdf encoded files at http://compling.hss.ntu.edu.sg/omw/.
- Common Lisp code available at https://github.com/own-pt/wordnet2rdf; requires the AllegroGraph triplestore and Allegro Common Lisp from Franz.

+RDF Representation

- RDF to formally specify the properties and classes that we use to model the data.
- Suggested properties and classes to represent the extensions to the original WordNet that embed a lexicon of nominalizations, NomLex-PT.
- Keep track of the evolution of OpenWN-PT using the PROV provenance data model, and make it available in RDF together with the OpenWN-PT RDF itself.
- Back in 2012 nobody was using RDF for wordnet distribution; it was mainly XML-LMF.
- The use of URIs allows easy reuse of entities produced by different researchers and groups.
- http://w3id.org/own-pt/ is the common prefix for our datasets and definitions.
- Stable URL service from W3C.
- URLs like http://w3id.org/own-pt/wn30-pt/instances/synset-00001740-n are dereferenceable. They currently link to our triple store.
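To make the URI scheme concrete, here is a minimal sketch that builds and parses instance URIs of the shape shown in the example above. The prefix is copied from that example; the assumption that every synset follows the same `synset-<8-digit offset>-<pos>` pattern, and the set of part-of-speech letters, are inferred from the slides rather than taken from the project's documentation.

```python
import re
from typing import Tuple

# Prefix copied from the example URI; the general pattern is an inference.
INSTANCE_PREFIX = "http://w3id.org/own-pt/wn30-pt/instances/"
# Assumed identifier shape: 8-digit offset, hyphen, part-of-speech letter.
SYNSET_ID = re.compile(r"^(\d{8})-([nvar])$")


def synset_uri(synset_id: str) -> str:
    """Build an instance URI from an identifier such as '00001740-n'."""
    if not SYNSET_ID.match(synset_id):
        raise ValueError(f"unexpected synset id: {synset_id!r}")
    return f"{INSTANCE_PREFIX}synset-{synset_id}"


def parse_synset_uri(uri: str) -> Tuple[str, str]:
    """Split an instance URI back into (offset, part of speech)."""
    head = INSTANCE_PREFIX + "synset-"
    if not uri.startswith(head):
        raise ValueError(f"not an instance URI: {uri!r}")
    match = SYNSET_ID.match(uri[len(head):])
    if match is None:
        raise ValueError(f"unexpected synset id in: {uri!r}")
    return match.group(1), match.group(2)


uri = synset_uri("00001740-n")
print(uri)
print(parse_synset_uri(uri))
```

Because the identifiers are plain URIs, other groups can point at the same synsets without coordinating on anything beyond the prefix.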

+Web Interface Motivation

- Correcting and improving linguistic data is a hard task.
- No established criteria for the accuracy of new wordnets.
- Having many eyes on the resource is a major advantage.
- Volunteer-curated content needs adaptation to work for lexical resources.
- Before expanding, we are trying to fix the mistakes.

+Web Interface Code

- https://github.com/own-pt/cl-wnbrowser. All projects are now under the own-pt organization.
- Implemented with Common Lisp, Solr/Cloudant and NodeJS, running on the IBM BlueMix platform.
- RESTful API and user interface.
- Links to other resources.
- Faceted search for activities and synsets.

+Web Interface: voting

- Voting mechanism, vaguely inspired by Reddit.
- Contributors can submit suggestions and vote on already submitted suggestions.
- Suggestions can be to remove or add information: words, glosses or examples.
- Three votes make a suggestion eligible to be committed. MAKE IT TWO!
- Supports reaching consensus in the manual revisions. Hard problem!
- Never delete suggestions, even the rejected ones. Keep track of the provenance of all changes in the data.
- Comments may also contain hashtags and at-mentions of other users.
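The voting rules can be sketched in a few lines of Python. This is an illustrative model only, not the Common Lisp implementation in cl-wnbrowser: the class, its field names, the one-vote-per-user rule and the timestamped log are all assumptions, and the suggested word is a made-up example. Only the three-vote threshold, the proposal to lower it to two, and the never-delete policy come from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Set

# Threshold from the slide; the slide also proposes lowering it to two.
VOTES_TO_COMMIT = 3


@dataclass
class Suggestion:
    """One proposed change to a synset (all field names are hypothetical)."""
    synset_id: str
    action: str   # "add" or "remove"
    kind: str     # "word", "gloss" or "example"
    value: str
    author: str
    status: str = "open"  # "open", "committed" or "rejected"
    voters: Set[str] = field(default_factory=set)
    history: List[str] = field(default_factory=list)

    def _log(self, event: str) -> None:
        # Append-only log: a simple stand-in for real provenance tracking.
        stamp = datetime.now(timezone.utc).isoformat()
        self.history.append(f"{stamp} {event}")

    def vote(self, user: str) -> None:
        if self.status != "open":
            raise ValueError("voting is closed for this suggestion")
        if user not in self.voters:  # assumption: one vote per user
            self.voters.add(user)
            self._log(f"vote by {user}")

    def eligible(self, threshold: int = VOTES_TO_COMMIT) -> bool:
        return self.status == "open" and len(self.voters) >= threshold

    def commit(self, threshold: int = VOTES_TO_COMMIT) -> None:
        if not self.eligible(threshold):
            raise ValueError("not enough votes to commit")
        self.status = "committed"
        self._log("committed")

    def reject(self) -> None:
        # Rejected suggestions are kept; only their status changes.
        self.status = "rejected"
        self._log("rejected")


# Illustrative data: the word and user names are invented for the example.
s = Suggestion("00001740-n", "add", "word", "entidade", author="ana")
for user in ("bia", "caio"):
    s.vote(user)
print(s.eligible())             # two votes against a threshold of three
print(s.eligible(threshold=2))  # the proposed lower threshold
```

Keeping the threshold as a parameter makes the "three or two" question a configuration choice rather than a code change, and the append-only history means a rejected suggestion still records who proposed and voted on it.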

+Conclusions

- Introduced OpenWordNet-PT, an open WordNet for (Brazilian) Portuguese.
- Recent improvements include a social interface for suggestions and voting.
- The resource has been used in developing a high-throughput commercial system as well as in a cultural heritage project, and is used by FreeLing, BabelNet and Google Translate.
- We anticipate further applications.
- The data is freely available from GitHub, and there is also a SPARQL endpoint.
- Browsing via http://wnpt.brlcloud.com/wn/ or via Open Multilingual Wordnet is fun.

+Future Work

- Corpus-based expansion and revision
- Portuguese glosses
- Usual criticisms of WordNet include sparseness of links and fine-grained senses. Hoping to use data from the Morphosemantic Database to improve sparseness of connections (e.g. submission). Considering previous ML methods for clustering?
- Google tags and Universal Dependencies
- Connection to knowledge graphs

+

Thanks!

+OpenWN-PT: next steps?

- (Still!) Finish translating the “core” synsets of the Princeton WordNet into Portuguese.
- Increase the number of relations in OpenWN-PT as a way of improving adequacy and coherence.
- Add Portuguese terms that satisfy different relations?
- OpenVerbNet-PT?
- Go back to ontology building and relating...

+Other stuff to add in?…

- GoogleTags, UDs
- BabelNet?
- NER issues?
- Temporal issues?
- Work with Leonel?
- Work on implicatives/factives in Portuguese?