1 15 July 2007 (c) M.Greengrass Data Extraction Across Multiple Text Datasets for Arts and Humanities Research Mark Greengrass University of Sheffield Armadillo



Data Extraction Across Multiple Text Datasets for Arts and Humanities Research

Mark Greengrass, University of Sheffield

Armadillo


Response to the RePAH questionnaire (2005-6), aggregate of all Arts and Humanities respondents (RePAH: A User Requirements Analysis Report (2006), p. 102).


RePAH: A User Requirements Analysis… (2006), p. 109


Some Distinctive Features of Historians’ Approach to their Evidence

• Promiscuous range of sources consulted

• Firm distinction between primary and secondary sources

• Complex dialogue between existing historiography and constitutive source materials

• Reiterative process of open interrogation of source materials

• A ‘coherent’ narrative is (generally) one composed from more than one source


Historians’ Database Challenge

• Growing number of (mainly text-based) historical datasets in electronic media, furnished from a wide variety of providers

• These datasets utilise a variety of different historical sources

• They contain varying amounts of encoded information (dependent on the historical question being asked by the PI, and on the constraints of the particular source being used)

• The information is encoded in different ways

• The delivery formats used also vary widely


Sources

Metropolitan London in the 1690s (IHR)

House of Lords Journals (BOPCRIS)

St. Martin’s Settlement Exams Index (WESTCAT)

The Marine Society Registers

Collage image database

Guildhall Library

Eighteenth Century Fire Insurance Policies

Selected Criminal Records

TNA

John Strype’s “Survey…”

Prerogative Court of Canterbury Wills

The Westminster Historical Database

Harben’s Dictionary of London

The Proceedings of the Old Bailey

AHDS Deposits

http://www.motco.com


The Old Bailey Proceedings: XML

<trial><p>

<person> <defend gender="m"><given>William</given><surname>Mawn</surname></defend> </person> was Tryed for <off> <theft type="animals">stealing a Bay Gelding price 20 l.</theft> </off> from one <victim gender="m"><given>Thomas</given><surname>Lane</surname></victim> out of Berkshire on the <cd>25th of April</cd>. The Witness swore that the Horse was found in the Prisoner's custody in Smithfield, which the Prosecutor owned to be his. The Prisoner could not produce any Evidence to prove that he came honestly by the Horse only produc'd a Felonious person, that was no stranger to Newgate, who went under the Notion of his Man, he declared that the Prisoner bought the Horse upon the Road beyond Uxbridge. The Prisoners being found in several faultering stories, he was found <verdict> <guilty>Guilty</guilty> </verdict>.</p> <p> <punish><death><note type="editorial">[Death. See summary.]</note></death></punish> </p> </trial>
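Because the trial record above is tagged, its encoded facts can be read out mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the record is an abridged copy of the one above (the narrative is shortened), and the function names and the output field names are illustrative assumptions, not part of the Old Bailey project's own tooling.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the trial record shown above (narrative text shortened).
TRIAL_XML = """<trial><p>
<person> <defend gender="m"><given>William</given><surname>Mawn</surname></defend> </person>
was Tryed for <off> <theft type="animals">stealing a Bay Gelding price 20 l.</theft> </off>
from one <victim gender="m"><given>Thomas</given><surname>Lane</surname></victim>
on the <cd>25th of April</cd>. He was found <verdict> <guilty>Guilty</guilty> </verdict>.</p>
<p> <punish><death><note type="editorial">[Death. See summary.]</note></death></punish> </p>
</trial>"""


def full_name(element):
    """Join the <given> and <surname> children of a person-like element."""
    return f"{element.findtext('given')} {element.findtext('surname')}"


def extract_trial(xml_text):
    """Pull the encoded facts out of one <trial> element (hypothetical helper)."""
    trial = ET.fromstring(xml_text)
    defendant = trial.find(".//defend")
    victim = trial.find(".//victim")
    offence = trial.find(".//off/*")        # first child of <off>, e.g. <theft>
    verdict = trial.find(".//verdict/*")    # first child of <verdict>, e.g. <guilty>
    punishment = trial.find(".//punish/*")  # first child of <punish>, e.g. <death>
    return {
        "defendant": full_name(defendant),
        "defendant_gender": defendant.get("gender"),
        "victim": full_name(victim),
        "offence": offence.tag,
        "offence_type": offence.get("type"),
        "offence_date": trial.findtext(".//cd"),
        "verdict": verdict.tag,
        "punishment": punishment.tag,
    }


record = extract_trial(TRIAL_XML)
print(record)
```

In this encoding the category of offence, verdict and punishment is carried by the element name itself, which is why the sketch reads the tag rather than the text.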


Canterbury Wills: Delimited Text

2530553 W Agnes Kervill or Kervytt
2530553 W Andrew Bridham London
2530553 W Andrew Pykeman London
2530553 W Austin Hawkyns
2530553 W Cecilia Foster
2530553 W Christian Chepman
2530553 W Christian Cust
2530553 W David Syadine Bristol,
2530553 W Edmund Bybbesworth
2530553 W Edward Wellys Hadley,
2530553 W Ellen Lacy Widow Saint Pe
2530553 W Gerard Heshull
2530553 W Guy Shuldham
2530553 W Helmingus Leget
2530553 W Henry Porter
2530553 W Henry Warlegh Keynesha
2530553 W Henry Wellis
2530553 W Hugh Caundyssh
2530553 W Hugh Geynesburgh Rector
2530553 W Isabelle Woodhill
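The wills listing carries far less structure than the tagged trial. As a rough sketch, each record can be split into a numeric reference, a one-letter code and a free-text remainder. The field names (ref, code, text) and the assumption that the leading fields are whitespace-separated are mine; the original delimiter is not visible in the listing.

```python
import re

# A few records copied from the listing above, one per line.
SAMPLE = """2530553 W Agnes Kervill or Kervytt
2530553 W Andrew Bridham London
2530553 W Henry Porter"""

# Assumed layout: numeric reference, one-letter record code, then free text.
RECORD = re.compile(r"^(?P<ref>\d+)\s+(?P<code>[A-Z])\s+(?P<text>.+)$")


def parse_wills(listing):
    """Turn each matching line into a dict; silently skip lines that do not match."""
    records = []
    for line in listing.splitlines():
        match = RECORD.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


wills = parse_wills(SAMPLE)
for will in wills:
    print(will["ref"], will["code"], will["text"])
```

Nothing in the free text marks where a name ends and a place begins ("Andrew Bridham London"), so the sketch deliberately leaves that remainder unsplit.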


The Issues

Can the technologies developed for the ‘semantic web’ help us:

• To structure the (different) encoded information across varying sources in a way that the user community will find (research) fruitful?

• To understand the way in which these different sources relate to one another, such that they can be used in an intelligent fashion?

• To ‘bootstrap’ relevant historical/semantic information from one source, by using another?


15 July 2007 (c) Oscar Corcho (with acknowledgement)

Data ‘Sharing’ and Data ‘Re-use’

Reuse means building new applications by assembling components that have already been built

Sharing is when different applications use the same resources


15 July 2007 (c) O. Corcho (with acknowledgement)

Interaction Problem

Representing knowledge for the purpose of solving some problem is strongly affected by the nature of the problem and the inference strategy to be applied to the problem

Bylander T, Chandrasekaran B (1988) Generic tasks in knowledge-based reasoning: the right level of abstraction for knowledge acquisition. In: Gaines BR, Boose JH (eds) Knowledge Acquisition for Knowledge-Based Systems. Academic Press, London, pp 65–77

Problem Solving Methods: describe the reasoning process of a dataset (‘Knowledge-Based System’) in a domain-independent manner

Ontologies: describe domain knowledge in a generic way and provide agreed understanding of a domain


15 July 2007 (c) O. Corcho (with acknowledgement)

Definitions of an Ontology

1. “An ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary”

Neches R, Fikes RE, Finin T, Gruber TR, Senator T, Swartout WR (1991) Enabling technology for knowledge sharing. AI Magazine 12(3):36–56

2. “An ontology is an explicit specification of a conceptualization”

Gruber TR (1993a) A translation approach to portable ontology specification. Knowledge Acquisition 5(2):199–220

3. “An ontology is a formal, explicit specification of a shared conceptualization”

Studer R, Benjamins VR, Fensel D (1998) Knowledge Engineering: Principles and Methods. IEEE Transactions on Data and Knowledge Engineering 25(1-2):161–197

4. “A logical theory which gives an explicit, partial account of a conceptualization”

Guarino N, Giaretta P (1995) Ontologies and Knowledge Bases: Towards a Terminological Clarification. In: Mars N (ed) Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing (KBKS’95). University of Twente, Enschede, The Netherlands. IOS Press, Amsterdam, The Netherlands, pp 25–32

5. “A set of logical axioms designed to account for the intended meaning of a vocabulary”

Guarino N (1998) Formal Ontology in Information Systems. In: Guarino N (ed) 1st International Conference on Formal Ontology in Information Systems (FOIS’98). Trento, Italy. IOS Press, Amsterdam, pp 3–15


Key Components of an Ontology

• Concepts: organized in taxonomies

• Relations: R: C1 x C2 x ... x Cn-1 x Cn (e.g. Subclass-of: Concept1 x Concept2; Connected to: Component1 x Component2)

• Functions: F: C1 x C2 x ... x Cn-1 --> Cn (e.g. Mother-of: Person --> Women; Price of a used car: Model x Year x Kilometers --> Price)

• Axioms: sentences which are always true

• Instances: elements
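The components listed above can be illustrated with plain data structures: a relation as a set of tuples, a function as a mapping, an axiom as a check that must always hold. This is only a toy sketch; the concept and instance names (Agent, Mary, John, Ann) are hypothetical and chosen for illustration.

```python
from typing import Dict, Set, Tuple

# Concepts arranged in a taxonomy: the binary relation Subclass-of as a set of pairs.
SUBCLASS_OF: Set[Tuple[str, str]] = {
    ("Woman", "Person"),
    ("Man", "Person"),
    ("Person", "Agent"),
}

# Instances: elements assigned to a concept.
INSTANCE_OF: Dict[str, str] = {"Mary": "Woman", "Ann": "Woman", "John": "Man"}

# A function is a relation whose last element is unique for the preceding ones.
# Mother-of: Person --> Women, stored here as a mapping from child to mother.
MOTHER_OF: Dict[str, str] = {"John": "Mary", "Ann": "Mary"}


def is_subclass(child: str, ancestor: str) -> bool:
    """Follow Subclass-of transitively (assumes the taxonomy has no cycles)."""
    if child == ancestor:
        return True
    return any(
        is_subclass(parent, ancestor)
        for sub, parent in SUBCLASS_OF
        if sub == child
    )


def check_mother_axiom() -> bool:
    """Axiom: every value of Mother-of is an instance of Woman."""
    return all(
        is_subclass(INSTANCE_OF[mother], "Woman") for mother in MOTHER_OF.values()
    )


print(is_subclass("Woman", "Agent"), check_mother_axiom())
```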


15 July 2007 (c) M.Greengrass, after Corcho

Semantic Continuum and Formality

• Implicit: shared human consensus (e.g. language)

• Informal [explicit]: text descriptions (e.g. dictionaries)

• Formal (for humans): semantics hardwired; used at runtime (e.g. library catalogues)

• Formal [for machines]: semantics processed and used at runtime (e.g. see below)


http://www.vicodi.org


Primary sources (historical documents; images; artefacts) in electronic media

Web-based ‘secondary’ historical writing

‘top-down ontologies’ (generated from discipline-accepted taxonomies)

‘bottom-up ontologies’ (generated from a representative sample of canonical data)

‘middle-out ontologies’ (generated by intelligent iteration)


John Wilkins, An Essay towards a Real Character and a Philosophical Language (1668)


Armadillo – a Semantic Agent

Retrieves information according to pre-agreed ontologies

Takes account of deviations in spelling, typographic formatting and contextual information

Uses delimited fields and tagged data as ‘oracles’: these provide firm instantiations of elements in an ontology, which can then be applied to electronic materials that have no such structure
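The ‘oracle’ idea can be sketched in a few lines: names that are firmly identified in structured data seed a lookup, which is then matched approximately against untagged text to allow for spelling variation. This is not Armadillo's own algorithm; the approximate matching uses Python's standard difflib as a stand-in, the untagged sentence is invented for illustration, and the 0.8 cutoff is an arbitrary assumption.

```python
import difflib
import re

# 'Oracle': surnames taken from structured data, here the tagged names in the
# Old Bailey XML sample (Mawn, Lane).
ORACLE_SURNAMES = ["Mawn", "Lane"]

# Hypothetical untagged sentence with variant spellings, invented for illustration.
UNTAGGED_TEXT = "One Willm. Mawne of Smithfield sold a gelding to Tho. Layne."


def find_candidates(text, known_names, cutoff=0.8):
    """Return (token, matched oracle name) pairs for capitalised words."""
    hits = []
    for token in re.findall(r"[A-Z][a-z]+", text):
        # Best oracle name whose similarity ratio reaches the cutoff, if any.
        close = difflib.get_close_matches(token, known_names, n=1, cutoff=cutoff)
        if close:
            hits.append((token, close[0]))
    return hits


candidates = find_candidates(UNTAGGED_TEXT, ORACLE_SURNAMES)
print(candidates)
```

A real agent would also need the contextual evidence mentioned above (dates, places, roles) before treating a close spelling as the same person.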


<p>CENTRAL CRIMINAL COURT,</p><p>Held on Monday, December 17th, 1866, and following days,</p><p><sc>BEFORE THE RIGHT HON.</sc> <lc><name role="judiciary" given="THOMAS" surname="GABRIEL" sex="m" age="na">THOMAS GABRIEL</name>, LORD MAYOR</lc> of the City of London; Sir <sc><name role="judiciary" given="JOHN" surname="MELLOR" sex="m" age="na">JOHN MELLOR</name></sc>, Knt., one of the Justices of Her Majesty's Court of Queen's Bench; <sc><name role="judiciary" given="WILLIAM TAYLOR" surname="COPELAND" sex="m" age="na">WILLIAM TAYLOR COPELAND</name></sc>, Esq., <sc><name role="judiciary" given="THOMAS" surname="CHALLIS" sex="m" age="na">THOMAS CHALLIS</name></sc>, Esq., <sc>THOMAS QUESTED FINNIS</sc>, Esq., Sir <sc><name role="judiciary" given="ROBERT WALTER" surname="CARDEN" sex="m" age="na">ROBERT WALTER CARDEN</name></sc>, Knt., and <sc><name role="judiciary" given="WILLIAM" surname="LAWRENCE" sex="m" age="na">WILLIAM LAWRENCE</name></sc>, Esq., Aldermen of the said City;

Automated Text-Mining, used for tagging purposes in Central Criminal Court records


<p>CENTRAL CRIMINAL COURT,</p><p>Held on Monday, July 22nd, 1912, and following days.</p><p>Before the Right Hon. Sir <lc>THOMAS BOOR CROSBY, M.D., LORD MAYOR</lc> of the said City of London; the Right Hon. Lord <sc>COLERIDGE</sc>, one of the Justices of His Majesty's High Court; Sir <sc><name role="judiciary" given="HENRY" surname="KNIGHT" sex="m" age="na">HENRY KNIGHT</name></sc>, Knight; Sir <sc><name role="judiciary" given="HORATIO" surname="DAVIES" sex="m" age="na">HORATIO DAVIES</name></sc>, K.C.M.G.; Sir <sc><name role="judiciary" given="JOHN" surname="POUND" sex="m" age="na">JOHN POUND</name></sc>, Bart.; Sir <sc>GEORGE W. TRUSCOTT</sc>, Bart.; Sir <sc><name role="judiciary" given="CHARLES" surname="JOHNSTON" sex="m" age="na">CHARLES JOHNSTON</name></sc>, Knight; and Sir <sc>HORACE B. MARSHALL</sc>, Knight, LL.D., Aldermen of the said City; Sir <sc>FORREST FULTON</sc>, Knight, K.C., Recorder of the said City; Sir <sc>FK. ALBERT BOSANQUET</sc>, K.C., Common Serjeant of the said City;

Automated Text-Mining, used for tagging purposes in Central Criminal Court records – with less success!

Not identified
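One simple way to measure how much the automated tagging missed is to list the small-caps spans that contain no name element. The sketch below runs on a shortened excerpt of the 1912 header; treating every small-caps span in this header as a personal name is an assumption drawn from the two samples, and the function name is hypothetical.

```python
import re

# Shortened excerpt of the 1912 header shown above.
HEADER = (
    'Sir <sc><name role="judiciary" given="HENRY" surname="KNIGHT" sex="m" '
    'age="na">HENRY KNIGHT</name></sc>, Knight; '
    'Sir <sc><name role="judiciary" given="JOHN" surname="POUND" sex="m" '
    'age="na">JOHN POUND</name></sc>, Bart.; '
    "Sir <sc>GEORGE W. TRUSCOTT</sc>, Bart.; "
    "Sir <sc>FK. ALBERT BOSANQUET</sc>, K.C."
)

# Non-greedy match so each small-caps span is captured separately.
SMALL_CAPS = re.compile(r"<sc>(.*?)</sc>")


def untagged_names(markup):
    """List small-caps spans that contain no <name> element."""
    return [span for span in SMALL_CAPS.findall(markup) if "<name" not in span]


missed = untagged_names(HEADER)
print(missed)
```

Run over a whole session header, the same check gives a quick count of identified against unidentified names for each year.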