Spoken corpora for learners: some trials and tribulations
Aston corpus symposium
Guy Aston
SSLMIT, University of Bologna at Forlì
Corpora for a specific learning context
Interpreting public speaking
Not only scripted speech
Not only monologue
Interpreting as multi-tasking
students are 90% female
reducing processing effort
corpora as a source of variable multi-word units (chunks: Pawley & Syder 1983, 2008)
What speech corpora?
Understanding and producing public speaking
Not informal conversation
Not read aloud written-to-be-spoken texts
Transcripts needed
1 million words @ 200 wpm = £4000
Publicly available transcripts?
Publicly available sound files?
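The transcription estimate can be unpacked as a rough sketch. Only the three figures on the slide (1 million words, 200 wpm, £4000) come from the talk; the hourly rate is derived here rather than quoted, and the variable names are illustrative.

```python
# Figures from the slide; everything else is derived from them.
WORDS = 1_000_000
WORDS_PER_MINUTE = 200
TOTAL_COST_GBP = 4000

audio_minutes = WORDS / WORDS_PER_MINUTE            # minutes of speech to transcribe
audio_hours = audio_minutes / 60                    # the same amount in hours
cost_per_audio_hour = TOTAL_COST_GBP / audio_hours  # implied rate, not stated in the talk

print(f"{audio_hours:.1f} hours of audio, about £{cost_per_audio_hour:.0f} per hour of audio")
```

On these assumptions a million words is roughly 83 hours of speech, which is why publicly available transcripts are attractive.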
Some possibilities?
Parliamentary proceedings
Generally read/prepared
Published proceedings (Hansard, Europarl) heavily regularised
Public lectures (LSE)
“Transcripts” usually lecturers’ written texts
BNC spoken (other spoken materials)
now rather old
not very large (7M)
very mixed (meetings, talks, lectures, lessons, tutorials, interviews, medical consultations, sports broadcasts ...)
no sound
BASE, MICASE (1.6M)
mixed: lectures/seminars from across academia
transcripts variously readable (punctuation)
BASE transcript (XSL transform)
then there 's the environmental impact which we've (0.3) touched
on before (2.9) agriculture (0.3) when it 's practised (0.2) has raised (0.5) quite a considerable debate in in quite a number of ways (0.4) and (0.2) in most developed countries we do the work for intensive (0.5) agricultural production (1.2) there is very little (0.5) in this country which you could call a natural environment (0.5) it depends er (0.3) to a despite (0.6) people protesting (0.4) about (0.3) er (0.5) the effects on the countryside the countryside is almost completely agricultural (0.2) production it 's certainly (0.3) framed and (0.3) a product of agricultural production (0.7) so we do <laughing/> have to <normal/> (0.2) to bear this in mind (0.2) when you look at other countries and you talk about them (0.3) not destroying their rainforests and all the rest of it which i (0.6) fully sympathize with (0.2) you have to remember that 's exactly what we have done (0.7) we have taken out all our forests (0.4) and put in agriculture
Or ...
Voice to text?
Official transcripts of broadcast talk
Larry King show
Thank you Auntie
Radio 4 transcripts
Read aloud written texts (From our own correspondent)
Heavily edited pieces of interviews (Analysis, File on 4, You and yours)
Money box (highly technical and UK specific)
Andrew Marr (interviews with UK politicians)
Any Questions (panel discussions of topical questions, live)
Copyright limitations!
Any questions?
Weekly. Standard format
Spontaneous speech (panellists don’t know questions in advance)
Chair + 4 different panellists each episode
Topical (though mainly UK politics)
Streamed audio (available for 4 weeks) recordable
Transcripts (available for 4 years)
My AQ corpus
Official transcripts of 150 episodes (1.3M tokens)
May 2005-May 2008
50 transcripts checked against audio
Converted to TEI-XML (similar markup to BNC spoken)
participants: role, sex, affiliation
sections (questions)
utterances/speakers
sentence-like objects
vocals and events: laughing, clapping, booing etc
POS tagged and lemmatised (CLAWS7)
Indexed and interrogable (Xaira)
same interface as BNC-XML
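As a rough illustration of the kind of markup listed above, the sketch below builds one participant record and one utterance with Python's standard library. The element names (person, affiliation, u, s, vocal) are taken from TEI/BNC-style practice; the attribute layout, the speaker id and the wording are assumptions made for the example, not the corpus's actual encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical participant record: id, role, sex and affiliation values are invented.
person = ET.Element("person", {"id": "P1", "role": "panellist", "sex": "f"})
ET.SubElement(person, "affiliation").text = "Example Party"

# One utterance containing a sentence-like object followed by a vocal event.
u = ET.Element("u", {"who": "P1"})
s = ET.SubElement(u, "s", {"n": "1"})
s.text = "I would like to see that happen"
ET.SubElement(u, "vocal", {"desc": "laughing"})

utterance_xml = ET.tostring(u, encoding="unicode")
print(ET.tostring(person, encoding="unicode"))
print(utterance_xml)
```

Keeping speaker attributes in a separate participant record and pointing to it from each utterance is what makes queries by role or sex possible.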
Transcript problems
Inter-transcriber differences in
html markup use
ignorance
regularisations/omissions
repetitions/false starts
pauses/fillers
overlaps
spelling and punctuation (s/z, contracted forms, punctuation use)
use of abbreviations/symbols (UK)
but designed for readability
A recent example
One of two of them very necessary. For eg stopping MPs in London from
claiming expenses for second homes that frankly they don’t need but most crucially effective was this proposal from Gordon Brown that essentially MPs should have their London living allowance not claimed with receipts published in a transparent way but simply delivered by turning up with some sort of per deum rate and the idea that all of this row that we have had over the last year or two about MPs’ expenses and lack of transparency and accountability should bed delivered and solved by simply getting rid of all our claims completely and having no transparency so that we simply end up being awarded money for turning up without even having to submit bills is a nonsense. and it would import into Westminster politics which has already got enough problems with MPs’ expenses the worse features of Brussels politics in which precisely this happens and you see MEPs turning up on the day that they are supposed to be going home signing in to make sure they get the allowance and then scooting off on the next plane.
Advantages for learners
Shared context simplifies interpretation
Moderately readable transcripts
not too many errors
not enough punctuation
Fairly followable against audio
alignment?
Just about large (and small) enough for fairly common lexis?
in one go (1)
lifeblood (1)
very (ADJ) (116)
.*ly important (120)
back + up (VERB) (13)
big + issue (60)
I’d/would + like to see (88)
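A query such as `.*ly important` can be approximated outside the corpus tool with a regular expression. This is a minimal sketch using Python's `re` module, not Xaira's own query syntax, and the sample sentence is invented for illustration.

```python
import re

# Invented sample text, for illustration only.
sample = "It is really important, vitally important, and very important to ask."

# Word-level reading of '.*ly important': any word ending in -ly, then 'important'.
pattern = re.compile(r"\b(\w+ly) important\b", re.IGNORECASE)

hits = pattern.findall(sample)
print(hits)  # the -ly words found before 'important'
```

Here "very important" is not matched, since "very" does not end in -ly; in the corpus it is covered by the separate `very (ADJ)` query.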
The next problem
False starts
Learners didn’t use them (one-word-at-a-time)
Relate (presumably) to lexical units (chunks)
Stop and start the unit again
The next problem
Learners make relatively few false starts: they pause, think, and then continue
performance fragmented
False starts perceived as performance flaws by interpreting teachers
Relation to multi-word chunks?
no pauses in chunks? (Adolphs et al 2007)
false starts should involve chunk restarts?
The In our time experiment
Weekly chaired discussion with three experts on historical/philosophical/scientific/literary topics, apparently unedited
15 pairs of students transcribed an episode of their choice as a vacation task
transcribe half, check other’s half
Transcription guidelines
punctuate using intonation
include: long (>1s) pauses, repetitions, false starts, fillers, overlaps
Transcripts checked against audio
marked up in TEI-XML
indexed with Xaira
The In our time corpus
15 episodes (10 hours)
42 speakers
1592 utterances
5050 sentence-like objects
140,000 tokens
not POS tagged/lemmatised
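The corpus figures allow a few back-of-envelope rates. The sketch below uses only the numbers on the slide; "10 hours" is presumably rounded, so the results are approximate, and the variable names are illustrative.

```python
# Figures from the slide.
HOURS = 10
UTTERANCES = 1592
S_UNITS = 5050       # sentence-like objects
TOKENS = 140_000

tokens_per_minute = TOKENS / (HOURS * 60)    # overall speech rate
tokens_per_utterance = TOKENS / UTTERANCES   # mean turn length
tokens_per_s_unit = TOKENS / S_UNITS         # mean sentence-like object length

print(f"{tokens_per_minute:.0f} tokens/min, "
      f"{tokens_per_utterance:.0f} tokens/utterance, "
      f"{tokens_per_s_unit:.1f} tokens/sentence-like object")
```

The long mean utterance length is consistent with a chaired discussion among experts rather than informal conversation.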
Advantages
less data to deal with
data known
someone in the group should recognise/be able to explain the context
But is it any use?
quite good for
general nouns/verbs (often textually cohesive)
people 393
time 390
thing/things 361
happen / happened / happening / happens 167
idea 165
features associated with markup in transcript
textual colligations (Hoey 2005)
discontinuities (self-corrections / false starts)
utterance-initial
well 145
yes 124
yeah 65
no 26
15 agreements
Syntactic/intonational discontinuities
Breakings off (sentence-final) 41
You wanted to say –
False starts (sentence-medial) c.3500 (6 per minute)
Word truncations 270
Self-corrections
Can you develop that Steven, why Freud got there, and why then it reached out to quite a lot of people, er and seemed to affect – afflict them so much?
Self-repetitions
So that again there 's this – this idea that when – when Vesalius, let 's
say, looks at the structures that he finds through – through dissection, he 's still trying to understand them in the main prevailing model that has been – has been existing for a thousand years, the Galenic model.
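Self-repetitions of this kind can be picked out automatically once the break is marked with a dash. The following is a minimal sketch, assuming whitespace-separated tokens and a dash token at each break; the function name and the four-word limit are choices made for the example, not part of the original procedure.

```python
MARKERS = {"–", "-"}  # break markers as they appear in the transcripts

def norm(token):
    """Lower-case a token and drop surrounding punctuation."""
    return token.strip(",.?!").lower()

def repeated_segments(text, max_len=4):
    """Return the longest segment repeated across each break marker."""
    tokens = text.split()
    found = []
    for i, tok in enumerate(tokens):
        if tok not in MARKERS:
            continue
        best = None
        for n in range(1, max_len + 1):
            before = [norm(t) for t in tokens[max(0, i - n):i]]
            after = [norm(t) for t in tokens[i + 1:i + 1 + n]]
            if len(before) == n and before == after:
                best = " ".join(before)  # keep the longest match so far
        if best:
            found.append(best)
    return found

example = ("So that again there 's this – this idea that when – when Vesalius, "
           "let 's say, looks at the structures that he finds through – through "
           "dissection, he 's still trying to understand them in the main "
           "prevailing model that has been – has been existing for a thousand "
           "years, the Galenic model.")
print(repeated_segments(example))
```

A break where nothing is repeated, as in "affect – afflict", returns no segment and would count as a correction rather than a repetition.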
So ...
Where do speakers get stuck?
What do speakers do when they get stuck?
The last word: where speakers get stuck
just a typical list of function words?
is you er i we so was he this
the of and that a to in it ’s
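Finding the last word before each break is a simple count. This sketch assumes the same conventions as the transcripts quoted earlier (whitespace-separated tokens, a dash at each break); the sample string is stitched together from fragments of those examples and the function name is illustrative.

```python
from collections import Counter

MARKERS = {"–", "-"}  # break markers as they appear in the transcripts

def last_words(text):
    """Count the word immediately preceding each break marker."""
    tokens = text.split()
    return Counter(
        tokens[i - 1].strip(",.?!").lower()
        for i, tok in enumerate(tokens)
        if tok in MARKERS and i > 0
    )

sample = ("there 's this – this idea that when – when Vesalius looks at "
          "what has been – has been existing")
print(last_words(sample).most_common())
```

Run over a whole corpus, the top of this list is what the slide asks about: whether it is anything more than the usual high-frequency function words.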
the discontinuities
Repetitions 300, Corrections 99
Repeated segments: the 255, of the 14, one of the 2, in the 13, at the 2, on the 2, to the 2, and the 1, beyond the 1, but the 1, by the 1, for the 1, from the 1, is the 1, they are the 1
a discontinuities
Repetitions 144, Corrections 56
Repeated segments: a 106, it ’s a 11, in a 7, there’s a 3, as a 3, is a 3, of a 3, has a 2, for a 2, and a 1, just a 1, with a 1, there’s a sort of a 1
that discontinuities
Repetitions 106, Corrections 43
Repeated segments: that 101, I suppose that 1, in that 1, within that 1, did that 1, you have that 1
of discontinuities
Repetitions 120, Corrections 22
Repeated segments: of 101, by the end of 2, sort of 2, what sort of 1, array of 1, one of 1, there’s an age of 1, at the base of 1
we discontinuities
Repetitions 52, Corrections 23
Repeated segments: we 46, what we 2, if we 1, that we 1, where we 1, which we 1
is discontinuities
Repetitions 76, Corrections 40
Repeated segments: is 67, this is 4, there is 2, it is 2, that is 1
’s discontinuities
Repetitions 67, Corrections 44
Repeated segments: it ’s 45, if it ’s 1, there ’s 11, that ’s 6, here ’s 1, let ’s 1, what ’s 1, which he ’s 1
n=10-19
about – about 4
at – at 9
for – for 10
epigram for – epigram for 1
guide for – guide for 1
how – how 11
I ’m – I’m 11
which – which 9
been – been 2
has been – has been 1
had been – had been 1
can – can 1
we can – we can 2
you can – you can 1
have
we have – we have 1
they have – they have 1
our – our 7
say – say 1
when – when 7
would – would 3
I would – I would 1
if – if 9
it’s as if – it’s as if 2
just – just 2
I mean – I mean 3
not – not 2
I ’m not – I’m not 1
it ’s not – it’s not 1
on – on 8
a book on – a book on 1
one – one 10
what one – what one 1
these – these 8
were – were 5
who – who 11
Overall
discontinuities tend to follow function words
discontinuities involve repetition rather than correction
repetitions are one or two words, very few more
why?
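The claim that discontinuities involve repetition rather than correction can be checked against the per-word counts given on the earlier slides. This sketch only re-adds those published figures; the overall share is computed here and is not a number quoted in the talk.

```python
# (repetitions, corrections) per word, copied from the per-word slides.
counts = {
    "the": (300, 99),
    "a": (144, 56),
    "that": (106, 43),
    "of": (120, 22),
    "we": (52, 23),
    "is": (76, 40),
    "'s": (67, 44),
}

total_rep = sum(rep for rep, _ in counts.values())
total_cor = sum(cor for _, cor in counts.values())
share = total_rep / (total_rep + total_cor)  # proportion of repetitions overall

for word, (rep, cor) in counts.items():
    print(f"{word:5} {rep / (rep + cor):.0%} repetitions")
print(f"overall {share:.1%} repetitions")
```

For these seven words repetitions outnumber corrections by well over two to one, although the ratio varies from word to word.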
But my students have learned to backtrack more when they get stuck
Suggestions please
Clusters
Recurrent sequences – chunks?
Chunks which particular words occur in
Only fairly frequent words (small corpus)
lie – lie behind/with/underneath (phrasal verbs an obvious eg: which ones are used frequently?)
Is this worth learning?
Because small and homogeneous, possible to identify recurrent patterns for common words in this genre (public speaking)
look
have a(nother) look, say look
think -> think of -> can(’t) think of
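Clusters around a given word can be found by counting the recurrent word sequences that contain it. This is a minimal sketch of that idea, not the corpus tool's own cluster function; the function name and the sample sentence are invented for illustration.

```python
from collections import Counter

def clusters(tokens, node, n=3):
    """Count n-word sequences that contain the node word."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if node in gram:
            counts[" ".join(gram)] += 1
    return counts

# Invented sample, for illustration only.
tokens = "let us have a look at this and then have a look at that".split()
look_clusters = clusters(tokens, "look")
for sequence, freq in look_clusters.most_common():
    print(freq, sequence)
```

With a small, homogeneous corpus the recurrent sequences for a common word such as look stand out quickly, which is the point made on the slide.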
With sound, we wouldn’t need good transcripts
at the end of the day
and all the rest of it
I think vs going to
and all the names (Tony Blair, Gordon Brown, David Cameron)
Nesi, H. & H. Basturkmen 2007. “Lexical bundles and discourse signalling in academic lectures”. In T. Nevalainen & S-K Tanskanen (eds) Lexical cohesion and corpus linguistics. Benjamins. 23-44.
Beeching, K. 1997. Applied linguistics
C-ORAL-ROM
Wray, A. 2002. Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.
Irina Dahlmann, Svenja Adolphs, Tom Rodden
BAAL 2007, Edinburgh
Multi-word units, fluency and pause annotation in spoken corpora
EXTRACT (Learner corpus):
S1: What level what level of English do you want? Do you want a very high level of English?
S2: I think I in the future
S1: Mm…mmm
S2: and I can communication with USA people and UK people
S1: Mm…mmm
S2: very fluencily[sic] …
Reviewed Corpora (pause and transcription information from the respective accompanying corpus websites if not stated otherwise)
ANDOSL Australian National Database of Spoken Language
BASE British Academic Spoken English Corpus
BNC British National Corpus, spoken
CHILDES Child Language Data Exchange System
CHRISTINE (http://www.grsampson.net/RChristine.html)
COLT Bergen Corpus of London Teenage Language
ICE International Corpus of English
LINDSEI-Ger Louvain International Database of Spoken English Interlanguage, German component (see Brand & Kämmerer 2006)
LLC London-Lund Corpus of Spoken English
MICASE Michigan Corpus of Academic Spoken English
TRAINS (http://www.cs.rochester.edu/research/cisd/projects/trains/)
WCSNZE Wellington Corpus of Spoken New Zealand English
VOICE Vienna-Oxford International Corpus of English
The average number of words per fluent unit is about six (Pawley and Syder 2000: 195)
utterance- and sentence-initial well