Discourse on Disk: Tales from the Scottish Corpus, David Beavan, University of Glasgow, 2007
DESCRIPTION
Discourse on Disk: Tales from the Scottish Corpus. The story of the building of the Scottish Corpus of Texts and Speech, from 2007. Web site: http://www.scottishcorpus.ac.uk/
TRANSCRIPT
Scottish Corpus
§ Scottish Corpus of Texts & Speech (SCOTS)
§ Modern day (1945 to present) § Scottish English and Scots § Written and spoken documents § Wide range of genres § 4,000,000 word corpus § 20% (800,000 word) spoken
Scottish Corpus
§ Externally funded
• 2 years: Engineering & Physical Sciences Research Council (EPSRC), in collaboration with the Language Technology Group, University of Edinburgh • 3 years: Arts & Humanities Research Council (AHRC)
§ Project completed June 2007
Distinctive features
§ Free – no cost § Free – no access limitations § Entirely internet delivered § Must solicit all corpus contents § Legal concerns
• Copyright • Data protection
Comparison
Columns: SCOTS, ICE-GB, BNC
Cost: Free, £750, £500
Words: 4m, 1m, 100m
Regional: ✓
Spoken: ✓ ✓ ✓
Audio: ✓ ✓
CD-ROM: ✓ ✓
Internet: ✓ *
* 3rd party tools available
Corpus building
§ Contact management § Tracking of paperwork § Mail merges § Analysis and ad-hoc reports § Ensure compliance with legal requirements § Entry of corpus data
§ All under one roof
Website
§ Access to corpus data § Search and browse functions § Common analysis features
§ Focus on usability § Employ new web technologies § Mash it up
Transcriptions
§ Orthographic transcription § Multiple speakers
• Overlap • Interruption
§ Synchronise • Orthographic transcription • Audio/video footage
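A minimal sketch of how time-aligned orthographic transcription might be modelled, so that overlapping speech can be detected and the transcript can follow audio/video playback. The `Utterance` fields, the function names, and the sample timings are all assumptions made for illustration; the slides do not describe the actual SCOTS data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One stretch of speech by one speaker (hypothetical structure)."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def overlaps(a: Utterance, b: Utterance) -> bool:
    """True if two different speakers are talking at the same time."""
    return a.speaker != b.speaker and a.start < b.end and b.start < a.end


def overlapping_pairs(utterances):
    """Return every pair of utterances that overlap in time."""
    ordered = sorted(utterances, key=lambda u: u.start)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # later utterances start even later, so stop here
            if overlaps(a, b):
                pairs.append((a, b))
    return pairs


def active_at(utterances, t: float):
    """Utterances to highlight when playback reaches time t (seconds)."""
    return [u for u in utterances if u.start <= t < u.end]


# Invented sample data: speaker M2 starts before F1 has finished.
SAMPLE = [
    Utterance("F1", 0.0, 2.5, "first turn"),
    Utterance("M2", 2.0, 4.0, "second turn"),
    Utterance("F1", 4.0, 6.0, "third turn"),
]
```

Storing start and end times per utterance, rather than one timestamp per line, is what lets overlap and interruption be derived from the timings instead of being marked up by hand.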
Maps
§ Geographic data held • Place of birth/residence • Place of birth for mother/father
§ Normalise locations § Source map imagery § Incorporate navigation
§ Google Maps mashup
Is this Glasgow?
Flat x/x xx Fergus Drive North Kelvinside Glasgow Strathclyde Gxx xxx
xx Dumbarton Road Partick North Lanarkshire Gxx xxx
xx Great Western Road Anniesland Glasgow Gxx xxx
Maps
§ Locations • Postcodes too local • Counties too general • Cities/towns/villages good compromise
§ Ordnance Survey Gazetteer • Automatically find equivalents • Manually assign ‘awkward’ cases
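The two-step approach above (automatic gazetteer match, then manual assignment of awkward cases) could be sketched as follows. The tiny `GAZETTEER` and `MANUAL` tables and the function name are hypothetical stand-ins for illustration; they are not the Ordnance Survey Gazetteer data or the project's actual code.

```python
# Hypothetical stand-in for a gazetteer of cities, towns and villages.
GAZETTEER = {
    "glasgow": "Glasgow",
    "edinburgh": "Edinburgh",
    "aberdeen": "Aberdeen",
}

# Hypothetical manual assignments for 'awkward' cases, e.g. an address
# that names only a district and gives a misleading county.
MANUAL = {
    "partick": "Glasgow",
}


def normalise_location(address_lines):
    """Map a free-form address to a city/town/village, or None."""
    cleaned = [line.strip().lower() for line in address_lines]
    # Step 1: automatic match against the gazetteer.
    for line in cleaned:
        if line in GAZETTEER:
            return GAZETTEER[line]
    # Step 2: fall back to manually assigned equivalents.
    for line in cleaned:
        if line in MANUAL:
            return MANUAL[line]
    return None  # left for a person to resolve


first = normalise_location(
    ["Flat x/x", "xx Fergus Drive", "North Kelvinside", "Glasgow", "Strathclyde", "Gxx xxx"]
)
second = normalise_location(
    ["xx Dumbarton Road", "Partick", "North Lanarkshire", "Gxx xxx"]
)
```

With these assumed tables, both addresses resolve to Glasgow: the first through the gazetteer, the second only through the manual table.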
Public awareness
§ Media • Press releases • Magazine articles • Radio interviews • Online articles
§ Web • Open up entire corpus to search engines • Submit site to directories and catalogues
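The slides do not say how the corpus was opened up to search engines. One common approach, shown here purely as an assumption, is to publish a sitemap that lists one crawlable URL per document. The base URL, the `/document/<id>` path pattern, and the function name are hypothetical; only the sitemap namespace comes from the public Sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, document_ids):
    """Return sitemap XML with one <url> entry per corpus document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for doc_id in document_ids:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        # Hypothetical URL pattern, for illustration only.
        loc.text = f"{base_url}/document/{doc_id}"
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap("https://example.org/corpus", [1, 2])
```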
Popular authors
burns 99
852 66
begg 22
corbett 15
leonard 15
the driver 15
blackhall 14
gibbon 14
scott 14
harrison 12
Popular documents
SKARRS 1599
een fur d pyoorists 1490
The Annals of Arbuthnott Part One 763
A Blether an wee Bevvy 703
Buchan Words and Ways 626
A Bit of a Wally 554
BBC Voices Recording: Glasgow 542
Learning Through Reading and Writing: Informational Texts 535
A Sair Fecht 508
Katie Morag and the Two Grandmothers 476
Web hits
[Chart: web hits over time. Vertical axis: 0 to 1,000,000 in steps of 100,000. Horizontal axis: Aug-04 to Nov-07.]
Emails
“I'm curious as to why none of my Scots work is included in the corpus. I'm fairly well known as a Scots-writing poet and anthologist. I recorded poems for <person> some years ago and assumed they would be here. Maybe I'm not searching properly but <surname> seems to be a non-person here.”
“<long list of books> … Which one of us is missing these wonderful works? I hope it's me for if it's you this corpus is sorely under developed.”
Emails
“I don't know how to spell the word - it sounds like "bahouchie" and I think it means "behind" or rudely -"bum". What is the correct spelling and is the meaning correct?”
“This is just a note to thank the people who worked so very hard to bring this online resource to the world. It's a fabulous help, and I, for one, am incredibly grateful! Thank you.”
“Think I'm going to have to download the whole thing very very soon...”