BiographyNed eScience Center 21 March 2013


Page 1: BiographyNed eScience Center 21 March 2013. Why a good case for eScience? Involves big data with high complexity Rich meta data joining diverse textual

BiographyNed

eScience Center 21 March 2013


Why a good case for eScience?

• Involves big data with high complexity
• Rich metadata joining diverse textual sources and selections of data
• Incomplete and noisy
• Potential to investigate difficult questions, e.g.:
– How did the current Dutch elite develop from the colonial past?
• Biographies may represent different views and realities and thus answers to questions:
– hero or villain
– 2.8 textual sources per person


What will we do?

• Develop generic text mining technology that converts textual data to structured data
– Taking into account nature of historical text

• Enrich and externally link data repository of Dutch biographies

• Develop visualizations and interactions on the data set to support historical research

• Develop a range of cases that demonstrate the possibilities and impossibilities of the data set and technology


Nature of eHumanities

[Diagram labels: Patterns in data; Value; Interpretation]
• Line composition in paintings; Cubism
• Twitter patterns during elections; Democratic participation


[Diagram labels: Patterns in data; Value; Interpretation; Narratives; Cases: persons/objects/events]
• Line composition in paintings; Twitter patterns during elections
• Cubism; Democratic participation
• 19th-century Japanese prints; Biographical descriptions of Prince Bernhard
• The rise of the Japanese middle class; German nobles in the Interbellum


Statistics on available information

[Bar chart: individuals with available information (%), per field: Name, Category, Gender, Date of Death, Date of Birth, Place of Birth, Place of Death, Occupation, Religion, Father, Mother, Claim to Fame, Partner, Text. Axis: percentage, 0 to 120.]


Textual Information per person

Information                         Numbers
Average XML-files per individual    2.79
Texts                               78.75%
Words (total/person)                288.83
Words (longest text/person)         229.04
Words (total/text)                  366.76
Words (longest text)/texts          290.83
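
The per-person and per-text figures above appear to be linked by the 78.75% share. The short check below assumes (this is not stated on the slide) that 78.75% is the share of individuals with at least one text, that per-person averages run over all individuals, and that per-text averages run only over those with text. Under that reading, dividing by the share reproduces the per-text numbers to within rounding.

```python
# Assumption: 78.75% is the share of individuals that have a text at all.
share_with_text = 0.7875
words_per_person = 288.83
longest_per_person = 229.04

# Averages restricted to individuals that actually have text.
words_per_text = words_per_person / share_with_text
longest_per_text = longest_per_person / share_with_text
```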


Availability of Information in the portal

[Bar chart per field: Partner, Mother, Father, Claim to Fame, Religion, Occupation, Date of birth, Place of birth, Date of death, Place of death, Category, Name. Axis: 0 to 90000. Legend: Information Absent; Text available.]


Presence of information for governors of Dutch Indies (% on 71 individuals)

[Bar chart per field: marriage, multiple marriage, partners, Children (number), Children (names), Age (start function), Place of Birth, Place of Death, Studies, Previous career, Reason job end, Last job, Family connections, Religion. Axis: 0 to 100. Legend: metadata; text.]


The Historical Perspective

• History and Biography
• Where do eScience and History meet?

• Use Cases


Historical Research

The Art and Science of History: Drawing up a narrative from primary and secondary sources which approximates historical reality as well as possible.


Building Blocks and Concrete

• Building blocks: facts derived mainly from archival findings and existing literature

• Concrete: the methods historians use to put them together into a narrative/synthesis.

• The Narrative: a historical synthesis which cannot be scientifically proven (only made likely), based on facts which can be proven or falsified. There is necessarily a creative element in drawing up a narrative.


Example: Grand Pensionary Johan de Witt (1625-1672)

• Building blocks: born in 1625; son of Jacob and Anna van den Corput; appointed grand pensionary in 1653; murdered in The Hague in 1672; enemy of William (III) of Orange; William of Orange rewarded one of the instigators of the murder

• Concrete: (logic) Based on these last data it is likely that William ordered the death of Johan

• Narrative: William probably ordered the death of Johan <= proposition based on facts and reasoning


The House of History


The Importance of Provenance

The only way to falsify presented historical facts is by going back to the original source(s) and looking at those sources critically.

It is highly important to know exactly where each piece of information comes from.


Our Sources Here

• The Metadata: building blocks

• The entries in biographical dictionaries themselves: short historical narratives


Status of Biography in Academia and Society

• Despite improved efforts this century to embed biography in academic theories and methods, some (e.g. some social historians) still do not consider it a worthy academic discipline, finding it too anecdotal and limited.

• Biography is the most popular non-fiction genre in bookstores (from both academic and lay authors)


Where do eScience and History meet? (I)

“And when the capsule biography of an individual is combined with 50,000 others, many of them relatively obscure, […] and when they are all powerfully searchable online, the social historian’s grumbles about biography’s limitations as an approach to historical study dissolves into nothingness.”

(Brian Harrison, 2004, former editor of the Oxford Dictionary of National Biography)


Where do eScience and History meet? (II)

A. Quantitative analyses of a larger group of people (prosopography). Surpassing the anecdotal.

B. Finding relations/networks between people which are otherwise hard to detect


Where do eScience and History meet? (III)

C. Insight into historiography and historical selectivity. Who was described/included and why?
“Undoubtedly I have deprived many interesting women by not including them. The only thing I can say to defend myself is this: history writing is also a process of ruthless selection.” (Els Kloek, head of the Biography Portal and main author of 1001 vrouwen)

D. Thematic research. E.g.: When did the discovery of America start to influence people’s lives?


BiographyNed Use Cases

In the initial stages of the research, a list of possible historical questions within one of those four themes was drawn up (subject to change). The demonstrator should be able to answer these questions, or at least point in a direction/trend.


Case I: Making life easier: Group portrait of the Governors-General

• Highest official in the Dutch Indies, 1610-1949
• 71 men (still a relatively small group)
• What can we say about these men as a group?
• Who was appointed and what qualities did he have to have?
• Etc.


Case I: data mining

• Family connections (parents/wife/children, other relevant connections <= patronage)

• Place of Birth
• Education
• Religion
• Career (patterns)
• Age at appointment
• Duration of holding the office
• Reason for leaving the office
• Place of Death


Case I: Time and Effort

More than 1 full week to manually mine this information from the Biography Portal. Can a historian do this with (almost) the same results in under one hour if helped by the demonstrator?


Case II: Making things possible: The Dutch Nation & Identity

• Who were selected to be included in National Biographical Dictionaries and why? (what was their claim to fame?)

• Are there different perspectives on the same person over time and how can this be explained?
• Who was deemed most important? (based on the length of the entries)
• What time periods are most represented?
• Is there a difference in claim to fame for people from different periods in history, or between men and women?
• Which words are used most often and can we link them to national identities?


Case II: More Questions …

• What events are mentioned most often and what does that say about the status quaestionis of how the Dutch see/saw themselves?

• What are the differences in the answers to these questions between several national biographical dictionaries?

• Are people and events described or appreciated differently over time? Does the perspective change?

• How does this relate to biographical dictionaries, nations and identities elsewhere in Europe?


Conversion to Linked Data


A crash course on Linked Data

Online machine readable data with links
• Simple facts called ‘RDF Triples’

Thorbecke > hasBirthPlace > Zwolle

Some technology concepts:
• Schemas: To structure LD
• RDF Stores: To store LD
• SPARQL: To access LD

Huge growth in the past years:
• More than 300 data sources
• More than 30 billion triples
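
As a rough illustration of the triple idea, the sketch below stores facts as (subject, predicate, object) tuples and answers a pattern query in the spirit of SPARQL. It is plain Python, not a real RDF store. The `match` helper and every predicate name other than `hasBirthPlace` are assumptions made for the example; the facts themselves are taken from these slides.

```python
# Facts as (subject, predicate, object) triples. The first triple is the
# slide's own example; the other predicate names are illustrative only.
TRIPLES = [
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
    ("Thorbecke", "hasBirthYear", "1798"),
    ("Johan de Witt", "hasBirthYear", "1625"),
    ("Johan de Witt", "hasDeathPlace", "The Hague"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return every triple that agrees with the given pattern.

    None acts as a wildcard, much like a variable in a SPARQL pattern.
    """
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# "Where was Thorbecke born?"
birth_places = [o for (_, _, o) in match(TRIPLES, "Thorbecke", "hasBirthPlace")]
# "Who has a recorded birth year?"
with_birth_year = sorted({s for (s, _, _) in match(TRIPLES, predicate="hasBirthYear")})
```

A real setup would keep the triples in an RDF store and express the same patterns as SPARQL queries.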


The conversion process

Data Preservation

Purely syntactic conversion
• Preserve the original structure of the data
• Prevent loss of information
• Allow for reinterpretation of the original data in the future


The conversion process

Conversion steps:
• Retrieval of XML dump of the Biography Portal
• Initial conversion to ‘crude’ RDF
– Using ClioPatria and the XMLRDF tool for ClioPatria
• RDF restructuring
• Linking to other sources
– Essential step in the ‘Linked Data’ philosophy
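
To make the ‘crude’, purely syntactic step more concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not the XMLRDF tool and does not reproduce its output. The XML record, the node identifiers and the predicate names (`elementName`, `hasChild`, `text`) are all assumptions, since the slides do not show the actual dump format.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the real Biography Portal dump is not shown in the
# slides, so element and attribute names here are assumptions.
SAMPLE = """
<biography id="bio1">
  <person>
    <name>Thorbecke</name>
    <birth when="1798-01-14" place="Zwolle"/>
  </person>
</biography>
"""


def crude_triples(xml_text):
    """Turn every element, attribute and text node into a triple.

    Nothing is interpreted or dropped: the XML nesting is mirrored by
    'hasChild' links, so the data can be restructured in a later step.
    """
    triples = []
    counter = 0

    def visit(element, parent_id):
        nonlocal counter
        counter += 1
        node_id = f"node{counter}"
        triples.append((node_id, "elementName", element.tag))
        if parent_id is not None:
            triples.append((parent_id, "hasChild", node_id))
        for key, value in element.attrib.items():
            triples.append((node_id, key, value))
        text = (element.text or "").strip()
        if text:
            triples.append((node_id, "text", text))
        for child in element:
            visit(child, node_id)

    visit(ET.fromstring(xml_text.strip()), None)
    return triples


triples = crude_triples(SAMPLE)
```

Because the structure is copied rather than interpreted, a later restructuring step can still reinterpret the original data.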


The conversion process

Data schema:
• Based on the structure of the original XML files
• Needs to facilitate the coupling of different biographies of the same person, without compromising the original data
• Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc.
• Compatible with existing schemas such as the Europeana Data Model, PROV, RDAgr2, FOAF, DC terms


BiographyNed schema

[Diagram: example for the person Thorbecke. Node labels, in the order they appear: Biographical Description; Provenance Meta Data (NNBW); Person Meta Data (“Thorbecke”); Biography Parts; Event: Birth, 1798; Biographical Description; Enrichment: NLP Tool; Person Meta Data; Event: Birth, Zwolle, 1798-01-14.]

Example text in the diagram: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…” (“Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German…”)


Retrieving Information from Text


The texts in the Biography Portal

• Collection of biographical dictionaries
• Dutch, including from the 19th and early 20th century and even older quotes
• Sources (different dictionaries/collections) have their own style
• Metadata available (though large differences in completeness)


Challenges and Advantages

• Challenges:
– Little work on NLP and biographies
– Performance of Dutch NLP tools on variations of Dutch
• Advantages:
– High-quality metadata covering several categories of information (supervised machine learning)
– Within sources, clear and similar structure of texts


General Approach

• Start by using advantages:
– Use metadata to label information
– A basic IR system can be built using sentence number and lemmas as features
• Enhance performance with NLP tools
• Build upon information retrieved in the first steps to tackle more challenging tasks
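
One way to read "use metadata to label information" is sketched below: a sentence is labelled with a metadata category when it literally contains that category's value. This is an assumption about the labelling rule, not the project's documented method. The function names, the metadata keys and the English example text are invented; lowercased tokens stand in for lemmas.

```python
import re


def split_sentences(text):
    """Very rough sentence splitter; a real system would use an NLP tool."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentences(text, metadata):
    """Label each sentence with the metadata categories whose value it mentions.

    Returns one dict per sentence holding the two feature types named on
    the slide: the sentence number and its tokens (a stand-in for lemmas).
    """
    examples = []
    for number, sentence in enumerate(split_sentences(text), start=1):
        tokens = re.findall(r"\w+", sentence.lower())
        labels = sorted(
            category
            for category, value in metadata.items()
            if value.lower() in sentence.lower()
        )
        examples.append({"sentence_number": number, "tokens": tokens, "labels": labels})
    return examples


# Invented English example text; the facts come from the De Witt slide.
TEXT = "Johan de Witt was born in 1625. He was murdered in The Hague in 1672."
METADATA = {"birth_year": "1625", "death_year": "1672", "death_place": "The Hague"}

examples = label_sentences(TEXT, METADATA)
```

The labelled sentences could then serve as training data for a supervised classifier, with no manual annotation needed for this first step.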


A Basic System

• Supervised Machine Learning
• Two-step identification process (Wu and Weld 2007; 2010, Fader et al. 2011)
– Identify sentence that contains information
– Sequence tagging to identify information within the sentence
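
The two steps can be sketched as a toy pipeline. This is not the method of the cited papers, and neither step uses a real learner. Step 1 scores sentences with smoothed per-token log-odds learned from a handful of invented training sentences; step 2 uses a regular expression as a stand-in for a trained sequence tagger. All names and training data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    return re.findall(r"\w+", sentence.lower())


def train_sentence_scorer(labelled):
    """Step 1 model: smoothed log-odds of each token for positive sentences.

    `labelled` is a list of (sentence, is_positive) pairs.
    """
    pos, neg = Counter(), Counter()
    for sentence, is_positive in labelled:
        (pos if is_positive else neg).update(set(tokenize(sentence)))
    n_pos = sum(1 for _, y in labelled if y)
    n_neg = len(labelled) - n_pos
    return {
        tok: math.log((pos[tok] + 1) / (n_pos + 2))
        - math.log((neg[tok] + 1) / (n_neg + 2))
        for tok in set(pos) | set(neg)
    }


def pick_sentence(sentences, weights):
    """Step 1: return the sentence most likely to contain the information."""
    return max(
        sentences,
        key=lambda s: sum(weights.get(t, 0.0) for t in set(tokenize(s))),
    )


def tag_year(sentence):
    """Step 2: BIO-tag the tokens. A regex stands in for a trained tagger."""
    tagged = []
    for tok in sentence.split():
        word = tok.strip(".,;:()")
        tagged.append((word, "B-YEAR" if re.fullmatch(r"1\d{3}", word) else "O"))
    return tagged


# Invented training sentences for the category "birth".
TRAIN = [
    ("He was born in Zwolle in 1798.", True),
    ("She was born in Amsterdam.", True),
    ("He was appointed grand pensionary in 1653.", False),
    ("He studied law in Leiden.", False),
]
SENTENCES = [
    "Johan de Witt was born in 1625.",
    "He was murdered in The Hague in 1672.",
]

weights = train_sentence_scorer(TRAIN)
best = pick_sentence(SENTENCES, weights)
tags = tag_year(best)
```

Splitting the task this way lets the tagger look only at sentences that are likely to hold the target information.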


Adding NLP

• Location & Date recognition (GeoNames)
• (other) Named Entities (VIAF enhanced with names from metadata)
• Depending on performance of the system, we’ll work on:
– Chunking, multiword recognition
– Parsing
– Word Sense Disambiguation


Metadata & Project Goals

• Duplicate detection (metadata and text)
• Events/Network discovery
– Education (begin, end, location)
– Occupation (begin, end, location)
– Relations (parents, partners)

• Temporal relations between events


Output first system

• Better coverage of categories mentioned above

• A timeline for a person’s life (birth, education, occupation, locations, death)

• Named Entities in text (dates, locations, persons)


Beyond the first system

The information provided by the first system can be used to:

1. Identify alternative descriptions of events (same time, location and/or participants)

2. Identify relations between events (same locations & time, consequent events, same participants, etc.)

3. Build initial networks of people
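
A minimal sketch of the first item, assuming the first system outputs one record per extracted event. The record layout and the source labels are hypothetical. Events from different sources count as alternative descriptions when participant, event type, year and place agree.

```python
from collections import defaultdict

# Illustrative event records as a first system might output them. The two
# source labels are placeholders, not actual dictionary names.
EVENTS = [
    {"source": "source_A", "person": "Thorbecke", "type": "birth",
     "date": "1798-01-14", "place": "Zwolle"},
    {"source": "source_B", "person": "Thorbecke", "type": "birth",
     "date": "1798", "place": "Zwolle"},
    {"source": "source_A", "person": "Johan de Witt", "type": "death",
     "date": "1672", "place": "The Hague"},
]


def alternative_descriptions(events):
    """Group events that share participant, type, year and place."""
    groups = defaultdict(list)
    for event in events:
        # Compare on the year only, since sources differ in date precision.
        key = (event["person"], event["type"], event["date"][:4], event["place"])
        groups[key].append(event["source"])
    # Keep only events described by more than one source.
    return {key: sources for key, sources in groups.items() if len(sources) > 1}


shared = alternative_descriptions(EVENTS)
```

Relaxing the key, for instance dropping the place, would trade precision for recall when sources disagree or leave a field empty.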


Methodological issues and text interpretation

• Results should be reproducible
– Code release (including scripts, configurations, …)
– Documentation
– Open source data
• The setup should be modular
– Combine output of different tools
– Flexible choice of methods used


Evaluation Challenges (1/2)

• How to evaluate the extraction tools?
• Partial evaluation using metadata (10-fold cross-validation), but:
1. No precise indication of precision or recall (incomplete metadata…)
2. Biographies with rich metadata are not necessarily representative

Manually annotated data needed!
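
The sketch below shows why metadata-based evaluation is only partial. It pairs a plain 10-fold split with a precision/recall computation in which predictions for persons without metadata simply cannot be judged. The function names, person ids and values are invented for illustration.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; every item lands in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def precision_recall(predicted, gold):
    """Compare extracted values with metadata values for the same persons.

    `predicted` and `gold` map a person id to a value. Persons missing from
    `gold` have no metadata, so their predictions cannot be judged at all.
    Returns (precision, recall, number of unjudged predictions).
    """
    judged = {p: v for p, v in predicted.items() if p in gold}
    correct = sum(1 for p, v in judged.items() if gold[p] == v)
    precision = correct / len(judged) if judged else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, len(predicted) - len(judged)


# Invented example: metadata exists for p1 and p2 only.
gold = {"p1": "Zwolle", "p2": "The Hague"}
predicted = {"p1": "Zwolle", "p2": "Leiden", "p3": "Delft"}
precision, recall, unjudged = precision_recall(predicted, gold)

splits = list(k_fold_splits(list(range(25)), k=10))
```

The prediction for `p3` may be right or wrong; without metadata or manual annotation there is no way to tell, which is the first problem listed above.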


Evaluation Challenges (2/2)

• How to compare the performance of NLP tools?
– Little work on biographies, little or none on Dutch ones…
– How hard are older texts? Can we quantify?

Systematic comparison:
• English biographies (Wikipedia)
• Dutch biographies (Wikipedia)
• Biographies from the portal


Reproducibility/Replication

• What do results mean if they cannot be reproduced?

• What variation in results can be expected based on details not mentioned in papers?

• Which information is needed to replicate results or find the origin of differences?

Paper submitted to ACL 2013 (joint work with Marieke van Erp and others)


Representations (tools)

• How to represent and combine output of different tools?
– Compatibility (easy to convert output of external NLP tools)
– Flexibility (be able to contain alternative representations and interpretations)

Integrate representations in NIF (joint work with Jesper Hoeksema and Willem van Hage)


Representation (events)

• How to combine knowledge from the NLP community and Linked Data community?
– Combination of textual information with external resources
– Complete representation of information from text (location, retrieval method)

Paper submitted to workshop on Events: Definition, detection, coreference and representation (joint work with Marieke van Erp, Willem van Hage, Sara Tonelli, and others)


Current state of affairs

• Basic system using sentence number and lemmas for the main metadata categories (evaluation ongoing)

• Module for labeling locations and dates in text (adaptations to be made for modularity)

• Annotation effort started for evaluation (selection of approximately 700 texts)


Demonstrator


Interface: Focus

• The interface should be easy to use
• The demonstrator should inspire historians to undertake new research and give direction, rather than being the ‘closing factor’ in their research
• The interface should allow users to ‘fine-tune’ the results returned upon an initial action


Interface: Options

• Query composition
• Faceted browsing
• A combination


Interface: Query composition

• Drop-down boxes to select ‘Verbs’, data elements and relations


Interface: Faceted browsing

• No explicit querying, but convergence of the data through browsing and selecting
• Provides better feedback to the user
• Allows for more direct and easier adjustment of the selected data


Interface: Faceted browsing


Interface: A combination

• Query composition combined with faceted browsing
• Create new facets by defining a query
– The result of the query is available as a subset of the data by selecting the defined facet
– As such, combinable with other facets
• Method to integrate ‘open’ querying of the data into a general interface and visualization
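
The idea of turning a query into a facet can be sketched as follows. This is only an illustration of the concept, not the demonstrator's design. The `FacetBrowser` class, its method names and the record fields are hypothetical; the example values come from earlier slides.

```python
# Illustrative records; field names are assumptions. The missing birth
# place mimics the incomplete metadata described earlier.
PEOPLE = [
    {"name": "Thorbecke", "birth_place": "Zwolle", "birth_year": 1798},
    {"name": "Johan de Witt", "birth_place": None, "birth_year": 1625},
]


class FacetBrowser:
    """Facets are named predicates; selecting several narrows the data."""

    def __init__(self, records):
        self.records = records
        self.facets = {}

    def add_value_facet(self, name, field, value):
        """Ordinary facet: records whose field equals a given value."""
        self.facets[name] = lambda record: record.get(field) == value

    def add_query_facet(self, name, query):
        """New facet defined by a query: any function from record to bool."""
        self.facets[name] = query

    def select(self, *names):
        """Return the records that satisfy every selected facet."""
        return [r for r in self.records if all(self.facets[n](r) for n in names)]


browser = FacetBrowser(PEOPLE)
browser.add_value_facet("born in Zwolle", "birth_place", "Zwolle")
browser.add_query_facet("born before 1700", lambda r: r["birth_year"] < 1700)

before_1700 = [r["name"] for r in browser.select("born before 1700")]
both = browser.select("born in Zwolle", "born before 1700")
```

Because a query facet is stored like any other facet, it combines with the existing ones through the same selection step.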


Interface: A combination

[Diagram labels: Question Analysis; Selection Process; Results; Data; Facets]


Interface: Demonstrator

Time and place are primary elements

[Diagram labels: Results; ?]


Questions