Information Extraction in the Talk of Europe Creative Camp
Computational Support in eHumanities
Proof of concept produced during CLARIN’s Creative Camp Talk of Europe
Wim Peters, Adam Funk
University of Sheffield, UK
[email protected]@sheffield.ac.uk
CLARIN’s Creative Camp Talk of Europe
• Our main aim in this event:
• Term identification and structuring in ToE and UK Parliament data
• Linking ToE and UK Parliament terminology
• Automatic enrichment of ToE data set
• http://linkedpolitics.ops.few.vu.nl/home
Data set 1
• Talk of Europe data set
• Plenary debates of the European Parliament as Linked Open Data
• http://linkedpolitics.ops.few.vu.nl/
ParlParse
• Speeches scraped from UK Parliamentary web site
• Converted into structured XML representations
Output
• For terms in each data set:
– Terms
– Term hierarchies
– Term clusters
– Sentence-based sentiment context
• Between data sets:
– Relatedness between ToE and UK Parliament terms
• To identify and extract relevant information from the source material, we use the GATE architecture for the production of semantic metadata in the form of text annotations.
• GATE is a framework for language engineering applications, which supports efficient and robust text processing including functionality for both manual and automatic annotation.
• It is highly scalable and has been applied in many large text processing projects.
• It is an open source desktop application written in Java that provides a user interface for professional linguists and text engineers to bring together a wide variety of natural language processing tools and apply them to a set of documents.
General Architecture for Text Engineering
• General Architecture for Text Engineering (GATE)
• Open source framework which supports plug-in NLP components to process a corpus of text.
http://gate.ac.uk/
Free system download and training courses
LEX 2014, Ravenna, Italy
General Architecture for Text Engineering
Advantages
• Reproducibility
• Reusability
• Flexibility
• Customisability to scholarly requirements regarding research questions and analysis methodology
• http://www.gate.ac.uk
Term Extraction
• TermRaider
• http://www.dcs.shef.ac.uk/~wim/termraider.html
• Automatically provides domain-specific noun phrase term candidates from a text corpus together with a statistically derived termhood score.
• Possible terms are filtered by means of a multi-word-unit grammar that defines the possible sequences of part of speech tags constituting noun phrases.
• It computes various termhood scores such as Kyoto Domain Relevance and term frequency/inverse document frequency (TF-IDF). The scores indicate the salience of each term candidate for each document in the corpus.
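The filtering step can be illustrated with a minimal sketch. It assumes tokens are already part-of-speech tagged with Penn Treebank style tags, and the pattern below (optional adjectives or nouns followed by a head noun) is a hypothetical stand-in, not TermRaider's actual multi-word-unit grammar.

```python
import re

# Hypothetical noun phrase pattern over POS tags (assumption, not TermRaider's grammar):
# any number of adjectives/nouns, then a head noun.
NP_PATTERN = re.compile(r"(?:(?:JJ|NN|NNS) )*(?:NN|NNS)")

def candidate_terms(tagged):
    """Return every contiguous token span whose tag sequence matches NP_PATTERN."""
    terms = []
    for start in range(len(tagged)):
        for end in range(start + 1, len(tagged) + 1):
            tags = " ".join(tag for _, tag in tagged[start:end])
            if NP_PATTERN.fullmatch(tags):
                terms.append(" ".join(word for word, _ in tagged[start:end]))
    return terms

# Illustrative input: (token, tag) pairs for one phrase.
sentence = [("efficient", "JJ"), ("control", "NN"), ("of", "IN"),
            ("EU", "NNP"), ("fund", "NN")]
print(candidate_terms(sentence))
```

A richer grammar would also admit prepositional attachments such as "control of EU fund"; the sketch leaves them out to keep the pattern readable.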
KYOTO domain relevance score
• df * (1 + nh)
– df: number of documents in the current corpus containing the term
– nh: number of hyponymic term candidates
• W. Bosma and P. Vossen. Bootstrapping language-neutral term extraction. In 7th Language Resources and Evaluation Conference (LREC), Valletta, Malta (2010)
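As a rough illustration, the raw score df * (1 + nh) can be computed as below. The function name and the toy data are assumptions made for this sketch, and any rescaling to a 0 to 100 range is not described on the slides and is therefore omitted.

```python
def kyoto_relevance(term, documents, hyponyms):
    """Raw Kyoto domain relevance: df * (1 + nh).

    documents: list of sets of term candidates, one set per document.
    hyponyms:  dict mapping a term to its hyponymic term candidates.
    """
    df = sum(1 for doc in documents if term in doc)  # document frequency
    nh = len(hyponyms.get(term, ()))                 # number of hyponyms
    return df * (1 + nh)

# Toy data (illustrative only).
documents = [
    {"fight", "fight against corruption"},
    {"fight", "fight against serious crime"},
    {"control"},
]
hyponyms = {"fight": ["fight against corruption", "fight against serious crime"]}

print(kyoto_relevance("fight", documents, hyponyms))    # df = 2, nh = 2
print(kyoto_relevance("control", documents, hyponyms))  # df = 1, nh = 0
```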
Term Relatedness 1: Hyponyms (rdf: skos:narrowerTransitive)
• Hierarchical relations between terms based on head phrase matching
• fight – fight against all form of intolerance
• fight – fight against serious crime and terrorism
• fight – fight against all form of intolerance and discrimination
• fight – fight against illegal drug and the organised crime
• fight – fight against corruption and organised crime
• control – efficient control
• efficient control of EU fund
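Head phrase matching could be approximated as in the sketch below. The heuristic (the head phrase is everything before the first preposition) and the short preposition list are assumptions for illustration; the slides do not specify how heads are actually detected.

```python
# Small illustrative preposition list (assumption).
PREPOSITIONS = {"of", "against", "for", "in", "on", "with"}

def head_phrase(term):
    """Tokens before the first preposition, taken here as the head phrase."""
    tokens = term.split()
    for i, tok in enumerate(tokens):
        if tok in PREPOSITIONS:
            return tokens[:i]
    return tokens

def is_narrower(narrow, broad):
    """Heuristic test: is `narrow` a hyponym of `broad`?"""
    if narrow == broad:
        return False
    n, b = narrow.split(), broad.split()
    # Case 1: broad term extended by a post-modifier ("fight" -> "fight against ...").
    if len(n) > len(b) and n[:len(b)] == b and n[len(b)] in PREPOSITIONS:
        return True
    # Case 2: broad term extended by pre-modifiers ("control" -> "efficient control").
    head = head_phrase(narrow)
    return len(head) > len(b) and head[-len(b):] == b

terms = ["fight", "fight against serious crime and terrorism", "control",
         "efficient control", "efficient control of EU fund", "organised crime"]
pairs = [(b, n) for b in terms for n in terms if is_narrower(n, b)]
for broad, narrow in pairs:
    print(broad, "->", narrow)
```

Note that "organised crime" is not linked to "fight against corruption and organised crime", because it sits inside the post-modifier rather than in the head phrase.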
Term relatedness 2: Clusters
• Compute Pointwise Mutual Information (PMI)
– Pair-wise association score for terms that co-occur within a context window (in our case sentences)
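A minimal sketch of sentence-level PMI follows, using the standard definition log2(p(x, y) / (p(x) p(y))) with probabilities estimated from sentence counts. The function name and toy sentences are illustrative; how raw PMI is mapped onto the 0 to 100 scale used for thresholding is not stated on the slides and is left out.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    """PMI for every pair of terms that co-occur in at least one sentence.

    sentences: list of lists of terms found in each sentence.
    Returns a dict keyed by alphabetically ordered term pairs.
    """
    n = len(sentences)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in sentences:
        unique = sorted(set(terms))  # count each term once per sentence
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return {
        (a, b): math.log2((count / n) / ((term_counts[a] / n) * (term_counts[b] / n)))
        for (a, b), count in pair_counts.items()
    }

# Toy sentences (illustrative only).
sentences = [
    ["human rights", "freedom"],
    ["human rights", "freedom"],
    ["human rights", "vote"],
    ["vote", "report"],
]
scores = pmi_scores(sentences)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```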
Cluster creation
• Simple clique algorithm
• https://en.wikipedia.org/wiki/Cluster_analysis
• Each cluster member (a term candidate with a Kyoto Domain Relevance score of > 70/100) is connected to all other cluster members by means of a PMI score > 70/100
– Result: “statistical thesaurus”
– strongly associated groups of words
– Use: enhance data exploration by expanding searches with related terms (query expansion)
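The cluster condition (every member linked to every other member above a threshold) corresponds to finding cliques in a graph. The slides do not say which clique algorithm was used; the sketch below substitutes the standard Bron-Kerbosch enumeration of maximal cliques, and all names, scores and the 0 to 100 scale of the inputs are assumptions.

```python
def build_graph(pmi, relevance, min_pmi=70, min_relevance=70):
    """Keep terms above the relevance threshold; link pairs above the PMI threshold."""
    eligible = {t for t, score in relevance.items() if score > min_relevance}
    graph = {t: set() for t in eligible}
    for (a, b), score in pmi.items():
        if a in eligible and b in eligible and score > min_pmi:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def maximal_cliques(graph):
    """Bron-Kerbosch without pivoting; graph maps node -> set of neighbours."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return [c for c in cliques if len(c) > 1]  # drop isolated terms

# Toy scores on an assumed 0-100 scale (illustrative only).
relevance = {"human rights": 90, "freedom": 85, "vote": 80, "report": 40}
pmi = {
    ("freedom", "human rights"): 88,
    ("human rights", "vote"): 75,
    ("freedom", "vote"): 72,
    ("report", "vote"): 95,
}
clusters = maximal_cliques(build_graph(pmi, relevance))
print([sorted(c) for c in clusters])
```

For query expansion, a search for one cluster member would then be widened to the other members of its clusters.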
Clusters including “human rights”: ToE data
(manually highlighted elements indicative of contrast with UK perspective)
• \end\vote\commission\network\programme\funding\proposal\report\text\level\service\freedom\fund\concern\president\access\basis\internet\enforcement\example\instrument\plastic\money\EU policy
• \recommendation\position\level\change\community\right\part\approach\discussion\dossier\regard\opinion\policy\force\negotiation\account\public\opportunity\fight
Clusters including “human rights”: UK data
(manually highlighted elements indicative of contrast with EU perspective)
• \foreign\press\answer\election
• \realise\MPs\politician\consequence\claim\interest\lesson\pension\employment
• \incentive\accountability\movement\treatment\word\young people\assessment\
Term Relatedness 3: Links between ToE and UK terms
(rdf: skos:related)
• For now the link is limited to orthographic overlap of terms’ canonical forms
– Lemmatised
– decapitalised
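Linking by canonical form can be sketched as follows. The plural-stripping lemmatiser is a deliberately naive placeholder (an assumption for this example); in practice a proper lemmatiser, such as the one in the GATE pipeline, would supply the lemmas.

```python
def naive_lemma(token):
    """Toy lemmatiser (assumption): strip a plural 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def canonical(term, lemmatise=naive_lemma):
    """Lower-case and lemmatise each token of a term."""
    return " ".join(lemmatise(tok.lower()) for tok in term.split())

def link_terms(toe_terms, uk_terms):
    """Pair ToE and UK terms whose canonical forms are identical."""
    uk_index = {}
    for term in uk_terms:
        uk_index.setdefault(canonical(term), []).append(term)
    return [(toe, uk) for toe in toe_terms for uk in uk_index.get(canonical(toe), [])]

links = link_terms(["Human Rights", "EU fund"], ["human right", "pensions"])
print(links)
```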
Sentiment Context for Terminology
• Sentences have a sentiment value of positive, negative or neutral
• This allows the exploration of the emotional load of the context in which terminology is used
Why RDF output?
• Standard knowledge representation
• Queryable in SPARQL
• Slots additional knowledge into the Talk of Europe data model
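To make the output format concrete, the sketch below serialises two relations as N-Triples using the SKOS properties named on the earlier slides. The base namespace and URI layout are hypothetical; the actual identifiers are defined by the Talk of Europe data model and the OWL model listed under Data.

```python
from urllib.parse import quote

SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/terms/"  # hypothetical namespace, not the real one

def term_uri(dataset, term):
    """Build an illustrative URI for a term in a data set ('toe' or 'uk')."""
    return f"<{BASE}{dataset}/{quote(term)}>"

def triple(subject, skos_property, obj):
    """Serialise one SKOS relation as an N-Triples line."""
    return f"{subject} <{SKOS}{skos_property}> {obj} ."

lines = [
    # Hyponym relation within the ToE data set.
    triple(term_uri("toe", "fight"), "narrowerTransitive",
           term_uri("toe", "fight against corruption and organised crime")),
    # Link between a ToE term and a UK term with the same canonical form.
    triple(term_uri("toe", "human right"), "related",
           term_uri("uk", "human right")),
]
print("\n".join(lines))
```

Once loaded into a triple store, relations of this kind can be retrieved with a SPARQL pattern such as `?term skos:related ?ukTerm`.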
Coverage of results
• Proof of concept
• European Parliament
– 2 months (6546 speeches)
– 7900 term candidates
• UK Parliament
– 1 month (January 2014, 7571 UK speeches)
– 28000 term candidates
• Around 750000 triples
• 2900 relations between EU and UK terminology
Usability of data and methodology
• Assists further exploration of parliamentarians’ styles, priorities and perspectives through term usage and context
– E.g. compare cluster members of terms in order to detect contrastive perspectives between ToE and UK terminological use
– (see “human rights” example)
• Flexible methodology, re-usable on other data
Data
• RDF
• http://www.dcs.shef.ac.uk/~wim/TalkOfEurope-Gate-Terms.zip
• OWL model
• http://www.dcs.shef.ac.uk/~wim/toe-data-model.owl