data mining for scholarship

1

The Million Book Challenge: data mining for scholarship

Alastair Dunning

Digitisation Programme Manager, JISC

0203 006 6065

[email protected]

2

From Keyhole to Open Door

Many scholars still approach digitised data with the ‘search+browse’ paradigm

All digitised resources are initially constructed in this way

E.g. British Library 19th-Century Newspapers – over 1m pages of text, billions of words – keyword searches tend to reveal lists of 100s of pages

Yet digitised resources can be analysed in their entirety

3

Open Door possibilities

Machine translation

Analogue to text – e.g. identifying footnotes within text, spotting the beginning and end of entries, encyclopedias, and gazetteers

Information Extraction

(Semantic) recognition of people, places, dates, organizations, citations etc.

For scholars, new types of research in understanding primary (or secondary) sources

4

A Case Study: 17th-Cent. News

Thanks to Ian Gregory and Andrew Hardie (University of Lancaster) for this study

Lancaster Newsbooks Corpus

800,000 words of 1650s English newsbooks: every surviving newsbook from mid-Dec 1653 to end of May 1654

Freely available via AHDS Catalogue

Methodological and technical problems exist – skirted over here

5

1. Recognising geographies

Extracting individual “mentions” of place names from the corpus

Proper nouns are identified via part-of-speech tagging, a well-established technology within linguistics

6

In the Patrick of Liverpoole – which we lately recovered from the Brest men of War – was one Walter Roche – who was to carry her to Brest – and he informed us - that there are these Ships following belonging to Brest – who do so vex us in these Seas – viz.

 In_II the_AT Patrick_NP1 of_IO <reg orig="Liverpoole"> Liverpool_NP1 </reg> ,_, which_DDQ we_PPIS2 lately_RR recovered_VVN from_II the_AT Brest_JJT men_NN2 of_IO War_NN1 ,_, was_VBDZ one_MC1 Walter_NP1 Roche_NP1 ,_, who_PNQS was_VBDZ to_TO carry_VVI her_PPHO1 to_II Brest_NP1 ,_, and_CC he_PPHS1 informed_VVD us_PPIO2 ,_, that_CST there_EX are_VBR these_DD2 Ships_NN2 following_II belonging_VVG to_II Brest_NP1 ,_, who_PNQS do_VD0 so_RR vex_VVI us_PPIO2 in_II these_DD2 Seas_NN2 ,_, viz._REX
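A minimal sketch of how proper nouns might be pulled out of text tagged in the `word_TAG` style shown above. It assumes that tags beginning with `NP` mark proper nouns (as `NP1` does in the sample) and that inline markup such as `<reg …>` can simply be discarded; the function names are hypothetical and this is not the project's own code.

```python
import re

# Sample shortened from the tagged passage on the slide.
TAGGED = (
    'In_II the_AT Patrick_NP1 of_IO <reg orig="Liverpoole"> Liverpool_NP1 </reg> ,_, '
    "which_DDQ we_PPIS2 lately_RR recovered_VVN from_II the_AT Brest_JJT men_NN2 "
    "of_IO War_NN1 ,_, was_VBDZ one_MC1 Walter_NP1 Roche_NP1 ,_, who_PNQS was_VBDZ "
    "to_TO carry_VVI her_PPHO1 to_II Brest_NP1"
)


def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs, ignoring XML-style markup."""
    plain = re.sub(r"<[^>]+>", " ", text)  # drop <reg ...> and </reg>
    pairs = []
    for token in plain.split():
        word, sep, tag = token.rpartition("_")  # split on the LAST underscore
        if sep:
            pairs.append((word, tag))
    return pairs


def proper_nouns(pairs):
    """Keep words whose tag starts with 'NP' (assumed to mean proper noun)."""
    return [word for word, tag in pairs if tag.startswith("NP")]


print(proper_nouns(parse_tagged(TAGGED)))
```

The result still mixes place names (Liverpool, Brest) with a ship and personal names (Patrick, Walter, Roche), and misses the first "Brest", which the tagger labelled `JJT`. This is why a further filtering step against a gazetteer is needed.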

7

2. Extracting place names and assigning co-ordinates

Proper nouns compared to a gazetteer

We chose http://www.world-gazetteer.com

Places outside Europe filtered out

SQL database relational join

Filters out non-place-name proper nouns

Problem: duplicate place names (e.g. Newcastle in Ireland)

Each instance of a place name is associated with (one or more) sets of coordinates
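The relational join can be sketched with an in-memory SQLite database. The table names, column names and rounded coordinates below are illustrative assumptions, not the project's schema or data from the gazetteer named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mentions (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE gazetteer (name TEXT, country TEXT, lat REAL, lon REAL);
    """
)

# Proper nouns as they might come out of the tagging step (hypothetical sample).
conn.executemany(
    "INSERT INTO mentions (word) VALUES (?)",
    [("Liverpool",), ("Walter",), ("Roche",), ("Brest",), ("Newcastle",)],
)

# Hypothetical gazetteer rows with approximate, rounded coordinates.
conn.executemany(
    "INSERT INTO gazetteer VALUES (?, ?, ?, ?)",
    [
        ("Liverpool", "UK", 53.41, -2.98),
        ("Brest", "France", 48.39, -4.49),
        ("Newcastle", "UK", 54.98, -1.61),
        ("Newcastle", "Ireland", 52.45, -9.05),
    ],
)

# The inner join keeps only proper nouns that match a gazetteer entry,
# so personal names such as "Walter" and "Roche" drop out.
rows = conn.execute(
    """
    SELECT m.id, m.word, g.country, g.lat, g.lon
    FROM mentions AS m
    JOIN gazetteer AS g ON g.name = m.word
    ORDER BY m.id, g.country
    """
).fetchall()

for row in rows:
    print(row)
```

The single mention of "Newcastle" comes back twice, once per candidate location, which is the duplicate place-name problem noted on the slide.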

8

3. And on to GIS…

(GIS – Geographical Information System)

[Figures: Google Earth; ArcGIS; density smoothing in ArcGIS]
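As a rough illustration of what density smoothing does, the sketch below spreads each place's mention count over a coarse grid with a Gaussian kernel. It does not reproduce the ArcGIS tool: the kernel, the bandwidth, the mention counts and the use of plain degree differences as distances are all assumptions made for brevity.

```python
import math


def kernel_density(points, grid_lats, grid_lons, bandwidth=1.0):
    """Return a grid (list of rows) of smoothed mention density.

    points: (lat, lon, weight) tuples, where weight is a mention count.
    Distances are measured in degrees, which is crude but adequate for a sketch.
    """
    grid = []
    for lat in grid_lats:
        row = []
        for lon in grid_lons:
            total = 0.0
            for p_lat, p_lon, weight in points:
                d2 = (lat - p_lat) ** 2 + (lon - p_lon) ** 2
                total += weight * math.exp(-d2 / (2 * bandwidth ** 2))
            row.append(total)
        grid.append(row)
    return grid


# Hypothetical input: two places with made-up mention counts.
points = [(53.41, -2.98, 3), (48.39, -4.49, 5)]
grid_lats = [48.0, 50.0, 52.0, 54.0]
grid_lons = [-6.0, -4.0, -2.0]

grid = kernel_density(points, grid_lats, grid_lons)
for lat, row in zip(grid_lats, grid):
    print(lat, [round(value, 2) for value in row])
```

Cells close to heavily mentioned places receive the highest values, which is what produces the "hot spots" on a smoothed map.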

9

4. Mapping by theme

What is being discussed in relation to each mention of each place-name?

We cannot tell just from the dates and co-ordinates

Solution: concordance + semantic tagging

USAS (UCREL Semantic Analysis System, University of Lancaster)

Finding all terms related to a theme, e.g. money, cash, sterling, pound.

10

Identifying thematic associations (a): semantic tags in immediate context

<hit_word>Dunkirk</hit_word> <text>DutchDiurn03</text>

Word: of  a   rich   Fleet  from  Dunkirk  ,  consisting  of  about  forty
Tag:  Z5  Z5  I1:1u  Z2     Z5    Z2          A1:8u       Z5  A13:4  N1
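One way to turn such a concordance line into thematic associations is to count the semantic categories that occur within a fixed window around the place name. The sketch below transcribes the word and tag pairs from the example above exactly as they appear on the slide. The window size, the function names and the theme lookup (limited to the three categories named in this talk) are assumptions.

```python
import re
from collections import Counter

# (word, semantic tag) pairs from the concordance line; the comma carries no tag.
TOKENS = [
    ("of", "Z5"), ("a", "Z5"), ("rich", "I1:1u"), ("Fleet", "Z2"), ("from", "Z5"),
    ("Dunkirk", "Z2"), (",", None), ("consisting", "A1:8u"), ("of", "Z5"),
    ("about", "A13:4"), ("forty", "N1"),
]

# Top-level categories mentioned in the talk; other categories are ignored here.
THEMES = {"I1": "money", "G1": "government", "G3": "warfare"}


def category(tag):
    """Reduce a tag such as 'I1:1u' or 'A13:4' to its leading category ('I1', 'A13')."""
    match = re.match(r"[A-Z]\d+", tag)
    return match.group(0) if match else None


def themes_near(tokens, place, window=5):
    """Count themes among the tokens within `window` words of each mention of `place`."""
    counts = Counter()
    for i, (word, _) in enumerate(tokens):
        if word != place:
            continue
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            tag = tokens[j][1]
            if j == i or tag is None:
                continue
            theme = THEMES.get(category(tag))
            if theme:
                counts[theme] += 1
    return counts


print(themes_near(TOKENS, "Dunkirk"))
```

Here "rich" (category I1) falls inside the window, so this mention of Dunkirk is associated with the money theme. Summing such counts per place gives the figures that can then be mapped.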

11

Mapping war…

Tag: G3 “warfare, etc”: 780 mentions

Problems:

• March – 18 mentions, 2 places

• Munster – 12 mentions, 3 places

• Newcastle – 5 mentions in west of Ireland

• Manchester

• Middleton – 63 mentions, General in a rebellion in Scotland

• Whalley – 10 mentions, General in a regiment of horse

12

Mapping money and government

I1 Money: 140 mentions

G1 Government: 293 mentions

13

1m Books Challenge (I)

Such analysis could be done on any dataset

Concept developed by Greg Crane, Tufts University, Director of Perseus Project

Taken up by six funding bodies to create international grant competition (from US, UK, Canada, Germany) – name t.b.c.

Competition to forge international partnerships to undertake the type of work highlighted in the case study

14

1m Books Challenge (II)

Will involve scholars, computer scientists, information managers and publishers

Competition is seeking to open up publishers’ content to allow for this type of analysis

Competition due to open in January 2009; significant time built into the call to allow for relationships with publishers to be developed

15

Technical & Legal Issues (I)

Scholars need to have a copy of the corpus / dataset to be analysed

Difficulties in actually transferring large corpora

Obvious IPR risks – material could be passed on

Or publishers need to make entire corpus available online

Technically complex; requires powerful infrastructure; how does online content interact with analytical tools?

16

Technical & Legal Issues (II)

Experiments need to be repeatable

Peer-review demands that other scholars have access to a corpus to review peers’ analyses

Where are enriched datasets stored?

Proliferating number of enriched datasets

Demand for delivering enriched datasets (or parts of them) for research and teaching

Who owns the IPR in an enriched data set?

Original publisher – yes. Researcher? Software, thesaurus and gazetteer creators?

How does this work for records, images, maps, audio, video?

17

Significant Challenges

Significant challenges exist for all stakeholders

But possibilities for exploiting investment in creating digital content

And potential for new avenues of research which scholars will wish to explore

‘Million Books Challenge’ will help explore some of these issues
