data mining for scholarship
DESCRIPTION
An introduction to text mining on digital content
TRANSCRIPT
1
The Million Book Challenge: data mining
for scholarship
Alastair Dunning
Digitisation Programme Manager, JISC
0203 006 6065
2
From Keyhole to Open Door
Many scholars still approach digitised data with the ‘search+browse’ paradigm
All digitised resources are initially constructed in this way
E.g. British Library 19th-Century Newspapers – over 1m pages of text, billions of words – keyword searches tend to return lists of hundreds of pages
Yet digitised resources can be analysed in their entirety
3
Open Door possibilities
Machine translation
Analogue to text – e.g. identifying footnotes within text, spotting the beginning and end of entries, encyclopedias, and gazetteers
Information extraction – (semantic) recognition of people, places, dates, organizations, citations etc.
For scholars, new types of research in understanding primary (or secondary) sources
4
A Case Study: 17th-Cent. News
Thanks to Ian Gregory, Andrew Hardie (University of Lancaster) for this study
Lancaster Newsbooks Corpus
800,000 words of 1650s English newsbooks: every surviving newsbook from mid-Dec 1653 to end of May 1654
Freely available via AHDS Catalogue
Methodological and technical problems exist – skirted over here
5
1. Recognising geographies
Extracting individual “mentions” of place names from the corpus
The identification of proper nouns is accomplished via part-of-speech tagging, a well established technology within linguistics
6
In the Patrick of Liverpoole – which we lately recovered from the Brest men of War – was one Walter Roche – who was to carry her to Brest – and he informed us – that there are these Ships following belonging to Brest – who do so vex us in these Seas – viz.
<p> In_II the_AT <em> Patrick_NP1 </em> of_IO <em> <reg orig="Liverpoole"> Liverpool_NP1 </reg> </em> ,_, which_DDQ we_PPIS2 lately_RR recovered_VVN from_II the_AT <em> Brest_JJT </em> men_NN2 of_IO War_NN1 ,_, was_VBDZ one_MC1 <em> Walter_NP1 Roche_NP1 </em> ,_, who_PNQS was_VBDZ to_TO carry_VVI her_PPHO1 to_II <em> Brest_NP1 </em> ,_, and_CC he_PPHS1 informed_VVD us_PPIO2 ,_, that_CST there_EX are_VBR these_DD2 Ships_NN2 following_II belonging_VVG to_II <em> Brest_NP1 </em> ,_, who_PNQS do_VD0 so_RR vex_VVI us_PPIO2 in_II these_DD2 Seas_NN2 ,_, <em> viz._REX </em> </p>
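As a minimal sketch of how proper nouns might be pulled out of tagged text like the excerpt above: the code below assumes every token has the form word_TAG and that proper nouns carry tags beginning with NP, as in the sample. The function name and the regular expression are illustrative only, not the tooling used in the study.

```python
import re

# A token is word_TAG; markup such as <em> or <reg ...> contains no
# word_TAG pair and is therefore skipped, as are punctuation tokens like ",_,".
TOKEN = re.compile(r"([^\s<>_]+)_([A-Z0-9]+)")

def proper_nouns(tagged_text, tag_prefix="NP"):
    """Return the words whose part-of-speech tag starts with tag_prefix."""
    return [word for word, tag in TOKEN.findall(tagged_text)
            if tag.startswith(tag_prefix)]

sample = ('In_II the_AT <em> Patrick_NP1 </em> of_IO <em> '
          '<reg orig="Liverpoole"> Liverpool_NP1 </reg> </em> ,_, '
          'which_DDQ we_PPIS2 lately_RR recovered_VVN from_II the_AT '
          '<em> Brest_JJT </em> men_NN2 of_IO War_NN1')

print(proper_nouns(sample))  # ['Patrick', 'Liverpool']
```

Note that "Patrick" (a ship) is returned alongside "Liverpool", and "Brest" is missed here because it is tagged JJT in this phrase: tagging alone finds proper nouns, not places.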
7
2. Extracting place names and assigning co-ordinates
Proper nouns compared to a gazetteer
We chose http://www.world-gazetteer.com
Places outside Europe filtered out
SQL database relational join
Filters out non-place-name proper nouns
Problem: duplicate place names (e.g. Newcastle in Ireland)
Each instance of a place name is associated with (one or more) sets of coordinates
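The relational join can be sketched roughly as follows, using an in-memory SQLite database. The table names, column names and sample rows are hypothetical, and the coordinates are approximate illustrative values; the actual schema used in the study is not given here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mentions (id INTEGER PRIMARY KEY, word TEXT, source TEXT);
    CREATE TABLE gazetteer (name TEXT, country TEXT, lat REAL, lon REAL);
""")

# Hypothetical proper nouns extracted from the corpus.
conn.executemany(
    "INSERT INTO mentions (word, source) VALUES (?, ?)",
    [("Liverpool", "doc1"), ("Patrick", "doc1"), ("Newcastle", "doc2")],
)

# Hypothetical gazetteer rows (approximate coordinates).
conn.executemany(
    "INSERT INTO gazetteer VALUES (?, ?, ?, ?)",
    [
        ("Liverpool", "UK", 53.41, -2.98),
        ("Newcastle", "UK", 54.97, -1.61),
        ("Newcastle", "Ireland", 52.45, -9.06),
    ],
)

# Inner join: proper nouns with no gazetteer entry ("Patrick") drop out,
# while an ambiguous name ("Newcastle") returns one row per candidate place.
rows = conn.execute("""
    SELECT m.id, m.word, g.country, g.lat, g.lon
    FROM mentions AS m
    JOIN gazetteer AS g ON g.name = m.word
    ORDER BY m.id, g.country
""").fetchall()

for row in rows:
    print(row)
```

The inner join does both jobs named on the slide at once: it discards non-place-name proper nouns and attaches one or more coordinate sets to each remaining mention.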
8
3. And on to GIS…
(GIS – Geographical Information System)
[Images: Google Earth; ArcGIS; density smoothing in ArcGIS]
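To give a rough idea of what density smoothing does, the sketch below sums a Gaussian kernel from every mention at each node of a regular grid. This is a generic kernel-density illustration, not ArcGIS's implementation; the function name, the bandwidth and the treatment of latitude/longitude as flat coordinates are all simplifying assumptions.

```python
import math

def density_grid(points, lat_range, lon_range, steps=5, bandwidth=1.0):
    """Sum a Gaussian kernel from every point at each grid node.

    points: list of (lat, lon) pairs, one per place-name mention.
    steps must be at least 2. Degrees are treated as planar units.
    """
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    grid = []
    for i in range(steps):
        lat = lat_min + (lat_max - lat_min) * i / (steps - 1)
        row = []
        for j in range(steps):
            lon = lon_min + (lon_max - lon_min) * j / (steps - 1)
            total = 0.0
            for p_lat, p_lon in points:
                d2 = (lat - p_lat) ** 2 + (lon - p_lon) ** 2
                total += math.exp(-d2 / (2 * bandwidth ** 2))
            row.append(total)
        grid.append(row)
    return grid

# Two mentions at one place, one mention at another (made-up coordinates).
grid = density_grid([(0.0, 0.0), (0.0, 0.0), (4.0, 4.0)], (0.0, 4.0), (0.0, 4.0))
```

Grid nodes near many mentions receive high values and nodes far from any mention receive values close to zero, which is what produces the smoothed "hot spot" surface on a map.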
9
4. Mapping by theme
What is being discussed in relation to each mention of each place-name?
We cannot tell just from the dates and co-ordinates
Solution: concordance + semantic tagging
USAS system (UCREL Semantic Analysis System, Lancaster University)
Finding all terms related to a theme, e.g. money, cash, sterling, pound.
10
Identifying thematic associations (a): semantic tags in immediate context
<hit_word>Dunkirk</hit_word> <text>DutchDiurn03</text>
Left context:   of    a     rich     Fleet   from
                Z5    Z5    I1:1u    Z2      Z5
Hit word:       Dunkirk
                Z2
Right context:  ,   consisting   of    about    forty
                    A1:8u        Z5    A13:4    N1
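A minimal sketch of checking the immediate context of a hit word for thematic tags follows. The pairing of words with tags is read off the Dunkirk example above and is an assumption, as are the function names, the window size and the way the top-level category is cut from a tag such as I1:1u.

```python
import re

def top_level(tag):
    """Reduce a tag such as 'I1:1u' or 'A13:4' to its top-level code ('I1', 'A13')."""
    match = re.match(r"[A-Z]\d+", tag)
    return match.group() if match else None

def context_themes(tokens, hit_index, window=5, themes=("I1", "G1", "G3")):
    """Return the themes whose tag occurs within `window` tokens of the hit word."""
    start = max(0, hit_index - window)
    neighbours = tokens[start:hit_index] + tokens[hit_index + 1:hit_index + 1 + window]
    found = []
    for _word, tag in neighbours:
        code = top_level(tag)
        if code in themes and code not in found:
            found.append(code)
    return found

# (word, tag) pairs as read from the Dunkirk concordance line.
tokens = [
    ("of", "Z5"), ("a", "Z5"), ("rich", "I1:1u"), ("Fleet", "Z2"), ("from", "Z5"),
    ("Dunkirk", "Z2"),
    ("consisting", "A1:8u"), ("of", "Z5"), ("about", "A13:4"), ("forty", "N1"),
]

print(context_themes(tokens, hit_index=5))  # ['I1']
```

Here the mention of Dunkirk is associated with the money theme (I1) because "rich" falls inside the window. Matching on the extracted top-level code rather than a plain prefix avoids confusing, say, A1 with A13.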
11
Mapping war…
Tag: G3 “warfare, etc” – 780 mentions
Problems:
• March – 18 mentions, 2 places
• Munster – 12 mentions, 3 places
• Newcastle – 5 mentions in west of Ireland
• Manchester
• Middleton – 63 mentions, General in a rebellion in Scotland
• Whalley – 10 mentions, General in a regiment of horse
12
Mapping money and government
I1 Money – 140 mentions
G1 Government – 293 mentions
13
1m Books Challenge (I)
Such analysis could be done on any dataset
Concept developed by Greg Crane, Tufts University, Director of Perseus Project
Taken up by six funding bodies to create international grant competition (from US, UK, Canada, Germany) – name t.b.c.
Competition to forge international partnerships to undertake the type of work highlighted in the case study
14
1m Books Challenge (II)
Will involve scholars, computer scientists, information managers and publishers
Competition is seeking to open up publishers’ content to allow for this type of analysis
Competition due to open in January 2009; significant time built into the call to allow for relationships with publishers to be developed
15
Technical & Legal Issues (I)
Scholars need to have a copy of the corpus / dataset to be analysed
Difficulties in actually transferring large corpora
Obvious IPR risks – material could be passed on
Or publishers need to make entire corpus available online
Technically complex; requires powerful infrastructure; how does online content interact with analytical tools?
16
Technical & Legal Issues (II)
Experiments need to be repeatable
Peer-review demands that other scholars have access to a corpus to review peers’ analyses
Where are enriched datasets stored?
Proliferating number of enriched datasets
Demand for delivering enriched datasets (or parts of them) for research and teaching
Who owns the IPR in an enriched dataset?
Original publisher – yes. Researcher? Software, thesaurus and gazetteer creators?
How does this work for records, images, maps, audio, video?
17
Significant Challenges
Significant challenges exist for all stakeholders
But possibilities for exploiting investment in creating digital content
And potential for new avenues of research which scholars will wish to explore
‘Million Books Challenge’ will help explore some of these issues