Information Extraction in the Talk of Europe Creative Camp

Computational Support in eHumanities

Proof of concept produced during CLARIN’s Creative Camp Talk of Europe

Wim Peters, Adam Funk

University of Sheffield, UK


CLARIN’s Creative Camp Talk of Europe

• Our main aim in this event:

• Term identification and structuring in ToE and UK Parliament data

• Linking ToE and UK Parliament terminology

• Automatic enrichment of ToE data set

• http://linkedpolitics.ops.few.vu.nl/home

Data set 1

• Talk of Europe data set

• Plenary debates of the European Parliament as Linked Open Data

• http://linkedpolitics.ops.few.vu.nl/

Data set 2

• UK Parliamentary Archives

• http://www.parliament.uk/business/publications/parliamentary-archives/

ParlParse

• Speeches scraped from UK Parliamentary web site

• Converted into structured XML representations

http://parser.theyworkforyou.com/

Workflow

Output

• For terms in each data set:

– Terms

– Term hierarchies

– Term clusters

– Sentence-based sentiment context

• Between data sets:

– Relatedness between terms

• To identify and extract relevant information from the source material, we use the GATE architecture for the production of semantic metadata in the form of text annotations.

• GATE is a framework for language engineering applications, which supports efficient and robust text processing including functionality for both manual and automatic annotation.

• It is highly scalable and has been applied in many large text processing projects.

• It is an open source desktop application written in Java that provides a user interface for professional linguists and text engineers to bring together a wide variety of natural language processing tools and apply them to a set of documents.

General Architecture for Text Engineering

• General Architecture for Text Engineering (GATE)

• Open source framework which supports plug-in NLP components to process a corpus of text.

http://gate.ac.uk/

Free system download and training courses

LEX 2014, Ravenna, Italy


Advantages

• Reproducibility

• Reusability

• Flexibility

• Customisability to scholarly requirements regarding research questions and analysis methodology

• http://www.gate.ac.uk


Text Annotations

Term Extraction

• TermRaider

• http://www.dcs.shef.ac.uk/~wim/termraider.html

• Automatically provides domain-specific noun phrase term candidates from a text corpus together with a statistically derived termhood score.

• Possible terms are filtered by means of a multi-word-unit grammar that defines the possible sequences of part of speech tags constituting noun phrases.

• It computes various termhood scores such as Kyoto Domain Relevance and term frequency/inverse document frequency (TF-IDF). The scores indicate the salience of each term candidate for each document in the corpus.
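The part-of-speech filter described above can be illustrated with a small sketch. It assumes the text has already been tokenised and tagged with Penn Treebank style tags, and it uses a hypothetical pattern (optional adjectives followed by one or more nouns); the actual TermRaider multi-word-unit grammar is not shown in these slides and is likely richer.

```python
import re

# Hypothetical noun phrase pattern over coarse tags:
# zero or more adjectives (J) followed by one or more nouns (N).
NP_PATTERN = re.compile(r"(?:J )*(?:N )+")


def coarse(tag):
    """Map a Penn Treebank style tag to a one-letter coarse class."""
    if tag.startswith("JJ"):
        return "J"
    if tag.startswith("NN"):
        return "N"
    return "O"


def candidate_terms(tagged_tokens):
    """Return the word sequences whose tag sequence matches NP_PATTERN.

    tagged_tokens is a list of (word, tag) pairs for one sentence.
    """
    # Each token contributes exactly two characters (letter + space),
    # so a character offset divided by 2 gives a token index.
    tag_string = "".join(coarse(tag) + " " for _, tag in tagged_tokens)
    terms = []
    for match in NP_PATTERN.finditer(tag_string):
        start, end = match.start() // 2, match.end() // 2
        terms.append(" ".join(word for word, _ in tagged_tokens[start:end]))
    return terms


sentence = [("the", "DT"), ("efficient", "JJ"), ("control", "NN"),
            ("of", "IN"), ("EU", "NNP"), ("funds", "NNS"),
            ("is", "VBZ"), ("vital", "JJ")]
print(candidate_terms(sentence))
```

Encoding the tag sequence as a string keeps the grammar readable as a regular expression; a production grammar would normally also allow prepositional post-modifiers such as "fight against corruption".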

KYOTO domain relevance score

• df × (1 + nh)

– df: number of documents in the current corpus containing the term

– nh: number of hyponymic term candidates

• W. Bosma and P. Vossen. Bootstrapping language-neutral term extraction. In 7th Language Resources and Evaluation Conference (LREC), Valletta, Malta (2010)
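A minimal sketch of the score as defined on this slide, assuming each document is represented as a set of term candidates and the hyponyms of each term are already known. The function and variable names are illustrative only, and the result is the raw score; any rescaling to a 0 to 100 range is not shown here.

```python
def kyoto_domain_relevance(term, documents, hyponyms):
    """Raw score df * (1 + nh) for one term candidate.

    documents: list of sets of term candidates, one set per document.
    hyponyms:  dict mapping a term to its hyponymic term candidates.
    """
    # df: number of documents that contain the term
    df = sum(1 for doc_terms in documents if term in doc_terms)
    # nh: number of hyponymic term candidates of the term
    nh = len(hyponyms.get(term, ()))
    return df * (1 + nh)


documents = [{"fight", "control"}, {"fight"}, {"control"}]
hyponyms = {"fight": ["fight against corruption",
                      "fight against serious crime"]}
print(kyoto_domain_relevance("fight", documents, hyponyms))
print(kyoto_domain_relevance("control", documents, hyponyms))
```

A term with many hyponyms is rewarded because it behaves like a productive domain concept rather than an incidental noun phrase.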

TF-IDF (Wikipedia)
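For illustration, one common textbook variant of this score is sketched below: raw term count multiplied by the natural logarithm of the inverse document frequency. This is an assumption; the exact weighting variant used in the project is not stated in these slides.

```python
import math


def tf_idf(term, document, corpus):
    """Raw-count TF times log inverse document frequency.

    document: list of term occurrences in one document.
    corpus:   list of such documents.
    """
    tf = document.count(term)
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        # A term that occurs nowhere has no meaningful score.
        return 0.0
    return tf * math.log(len(corpus) / df)


corpus = [["fight", "fight", "crime"], ["crime", "vote"], ["vote"]]
print(tf_idf("fight", corpus[0], corpus))
print(tf_idf("crime", corpus[0], corpus))
```

"fight" scores higher than "crime" in the first document because it is both more frequent there and rarer across the corpus.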

Term Relatedness 1: Hyponyms (rdf: skos:narrowerTransitive)

• Hierarchical relations between terms based on head phrase matching

• fight – fight against all form of intolerance

• fight – fight against serious crime and terrorism

• fight – fight against all form of intolerance and discrimination

• fight – fight against illegal drug and the organised crime

• fight – fight against corruption and organised crime

• control – efficient control

• efficient control – efficient control of EU fund
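The head phrase matching illustrated by these examples can be approximated as follows. This is a simplified sketch, not the project's implementation: it assumes terms are already in lemmatised form, treats everything before the first preposition as the head phrase, and uses a small hypothetical preposition list.

```python
# Hypothetical, deliberately small list of prepositions.
PREPOSITIONS = {"of", "against", "for", "in", "on", "to", "with"}


def head_phrase(term):
    """Tokens of the term up to (not including) the first preposition."""
    tokens = term.split()
    for i, token in enumerate(tokens):
        if token.lower() in PREPOSITIONS:
            return tokens[:i]
    return tokens


def is_hyponym(candidate, broader):
    """True if `broader` is a suffix of the head phrase of `candidate`."""
    if candidate == broader:
        return False
    broader_tokens = broader.split()
    if not broader_tokens:
        return False
    head = head_phrase(candidate)
    return (len(broader_tokens) <= len(head)
            and head[-len(broader_tokens):] == broader_tokens)


def build_hierarchy(terms):
    """Map each term to the list of its (transitive) hyponyms."""
    hierarchy = {}
    for broader in terms:
        narrower = [t for t in terms if is_hyponym(t, broader)]
        if narrower:
            hierarchy[broader] = narrower
    return hierarchy


terms = ["fight", "fight against corruption and organised crime",
         "control", "efficient control", "efficient control of EU fund"]
for broader, narrower in build_hierarchy(terms).items():
    print(broader, "->", narrower)
```

Because "control" matches the head of both "efficient control" and "efficient control of EU fund", the resulting relation is transitive, in line with skos:narrowerTransitive.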

Term relatedness 2: Clusters

• Compute Pointwise Mutual Information (PMI)

– Pair-wise association score for terms that co-occur within a context window (in our case sentences)
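A minimal sketch of sentence-level PMI, assuming each sentence has already been reduced to the set of term candidates it contains. Probabilities are estimated as simple relative frequencies over sentences; the scores are raw PMI values in bits, and any normalisation to a 0 to 100 scale is not shown.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(sentences):
    """PMI for every pair of terms that co-occur in at least one sentence.

    sentences: list of collections of terms, one per sentence.
    Returns a dict keyed by alphabetically ordered (term_a, term_b) pairs.
    """
    n = len(sentences)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in sentences:
        unique = sorted(set(terms))
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    scores = {}
    for (a, b), joint in pair_counts.items():
        p_joint = joint / n
        p_a = term_counts[a] / n
        p_b = term_counts[b] / n
        # PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) )
        scores[(a, b)] = math.log2(p_joint / (p_a * p_b))
    return scores


sentences = [{"human right", "freedom"}, {"human right", "freedom"},
             {"budget"}, {"budget", "freedom"}]
for pair, score in sorted(pmi_scores(sentences).items()):
    print(pair, round(score, 3))
```

Positive values indicate that two terms co-occur more often than chance would predict; pairs that never share a sentence receive no score at all.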

Cluster creation

• Simple clique algorithm

• https://en.wikipedia.org/wiki/Cluster_analysis

• Each cluster member (a term candidate with Kyoto Domain Relevance score of > 70/100) is connected to all other cluster members by means of a PMI score > 70/100

– Result: “statistical thesaurus”

– strongly associated groups of words

– Use: enhance data exploration by expanding searches with related terms (query expansion)
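The clustering criterion above amounts to finding cliques in a graph whose nodes are high-scoring terms and whose edges are high-scoring PMI pairs. The sketch below enumerates maximal cliques with the basic Bron-Kerbosch procedure, which is an assumption on our part: the slides only say "simple clique algorithm". Scores are assumed to be already normalised to a 0 to 100 scale, and the `expand_query` helper is a hypothetical illustration of query expansion.

```python
def find_clusters(term_scores, pair_scores, threshold=70):
    """Maximal cliques of terms linked by pair scores above the threshold."""
    nodes = {t for t, score in term_scores.items() if score > threshold}
    neighbours = {t: set() for t in nodes}
    for (a, b), score in pair_scores.items():
        if score > threshold and a in nodes and b in nodes and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters = []

    def expand(clique, candidates, excluded):
        # Basic Bron-Kerbosch recursion without pivoting.
        if not candidates and not excluded:
            if len(clique) > 1:
                clusters.append(sorted(clique))
            return
        for term in list(candidates):
            expand(clique | {term},
                   candidates & neighbours[term],
                   excluded & neighbours[term])
            candidates = candidates - {term}
            excluded = excluded | {term}

    expand(set(), set(nodes), set())
    return sorted(clusters)


def expand_query(term, clusters):
    """All terms that share at least one cluster with the query term."""
    related = set()
    for cluster in clusters:
        if term in cluster:
            related.update(cluster)
    related.discard(term)
    return sorted(related)


term_scores = {"human right": 90, "freedom": 85, "internet": 80,
               "plastic": 40, "vote": 75}
pair_scores = {("human right", "freedom"): 80,
               ("human right", "internet"): 75,
               ("freedom", "internet"): 90,
               ("vote", "freedom"): 72,
               ("vote", "human right"): 50,
               ("plastic", "freedom"): 95}
clusters = find_clusters(term_scores, pair_scores)
print(clusters)
print(expand_query("freedom", clusters))
```

Requiring every member to be linked to every other member keeps clusters tight, at the cost of producing overlapping clusters when a term such as "freedom" belongs to several strongly associated groups.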


Clusters including “human rights”: ToE data

(manually highlighted elements indicative of contrast with UK perspective)

• \end\vote\commission\network\programme\funding\proposal\report\text\level\service\freedom\fund\concern\president\access\basis\internet\enforcement\example\instrument\plastic\money\EU policy

• \recommendation\position\level\change\community\right\part\approach\discussion\dossier\regard\opinion\policy\force\negotiation\account\public\opportunity\fight

Clusters including “human rights”: UK data

(manually highlighted elements indicative of contrast with EU perspective)

• \foreign\press\answer\election

• \realise\MPs\politician\consequence\claim\interest\lesson\pension\employment

• \incentive\accountability\movement\treatment\word\young people\assessment\

Term Relatedness 3: Links between ToE and UK terms

(rdf: skos:related)

• For now the link is limited to orthographic overlap of terms’ canonical forms

– Lemmatised

– Decapitalised
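Linking by canonical form can be sketched as below. The lemmatiser here is a hypothetical stand-in that only strips simple English plural endings; the project would rely on a proper lemmatiser in the GATE pipeline, which is not reproduced here.

```python
def naive_lemmatise(token):
    """Hypothetical stand-in for a real lemmatiser (plural stripping only)."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def canonical_form(term, lemmatise=naive_lemmatise):
    """Lowercase the term, then lemmatise each token."""
    return " ".join(lemmatise(token) for token in term.lower().split())


def link_terms(toe_terms, uk_terms, lemmatise=naive_lemmatise):
    """Pairs (ToE term, UK term) whose canonical forms are identical."""
    uk_index = {}
    for term in uk_terms:
        uk_index.setdefault(canonical_form(term, lemmatise), []).append(term)
    links = []
    for term in toe_terms:
        for match in uk_index.get(canonical_form(term, lemmatise), []):
            links.append((term, match))
    return links


print(link_terms(["Human Rights", "EU funds"], ["human right", "pensions"]))
```

Indexing one term list by canonical form avoids comparing every ToE term with every UK term.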

Sentiment Context for Terminology

• Sentences have a sentiment value of positive, negative or neutral

• This allows the exploration of the emotional load of the context in which terminology is used

Added RDF

Why RDF output?

• Standard knowledge representation

• Queryable in SPARQL

• Slots additional knowledge into the Talk of Europe data model
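As an illustration of what such output could look like, the sketch below serialises term links as N-Triples using the skos:related property mentioned earlier. The base URI and the URI layout are hypothetical placeholders; the actual identifiers are defined by the project's own data model.

```python
from urllib.parse import quote

SKOS = "http://www.w3.org/2004/02/skos/core#"
# Hypothetical base URI, used here for illustration only.
BASE = "http://example.org/term/"


def term_uri(dataset, term):
    """Build an angle-bracketed URI for a term in a given data set."""
    return f"<{BASE}{dataset}/{quote(term)}>"


def related_triples(links):
    """One skos:related N-Triples line per (ToE term, UK term) pair."""
    return [
        f"{term_uri('toe', toe_term)} <{SKOS}related> {term_uri('uk', uk_term)} ."
        for toe_term, uk_term in links
    ]


for line in related_triples([("human right", "human right")]):
    print(line)
```

Triples in this form can be loaded into any RDF store and queried with SPARQL alongside the existing Talk of Europe data.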

Coverage of results

• Proof of concept

• European Parliament

– 2 months (6546 speeches)

– 7900 term candidates

• UK Parliament

– 1 month (January 2014, 7571 UK speeches)

– 28000 term candidates

• Around 750000 triples

• 2900 relations between EU and UK terminology

Usability of data and methodology

• Assists further exploration of parliamentarians’ styles, priorities and perspectives through term usage and context

– E.g. compare cluster members of terms in order to detect contrastive perspectives between ToE and UK terminological use

– (see “human rights” example)

• Flexible methodology, re-usable on other data

Data

• RDF

• http://www.dcs.shef.ac.uk/~wim/TalkOfEurope-Gate-Terms.zip

• OWL model

• http://www.dcs.shef.ac.uk/~wim/toe-data-model.owl
