Spoken corpora for learners: some trials and tribulations Aston corpus symposium Guy Aston SSLMIT University of Bologna at Forlì guy@sslmit.unibo.it

Post on 21-Dec-2015

Page 1:

Spoken corpora for learners: some trials and tribulations
Aston corpus symposium

Guy Aston
SSLMIT, University of Bologna at Forlì
guy@sslmit.unibo.it

Page 2:

Corpora for a specific learning context
Interpreting public speaking
  Not only scripted speech
  Not only monologue
Interpreting as multi-tasking
  students are 90% female
  reducing processing effort
  corpora as a source of variable multi-word units (chunks: Pawley & Syder 1983, 2008)

Page 3:

What speech corpora?
Understanding and producing public speaking
  Not informal conversation
  Not read-aloud, written-to-be-spoken texts
Transcripts needed
  1 million words @ 200 wpm = £4000
Publicly available transcripts?
Publicly available sound files?
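As a rough check on the transcription estimate, the slide's figures can be worked through directly. This is a minimal sketch: it assumes the £4000 is the cost of transcribing the whole 1 million words, and the per-minute rate it prints is only what those numbers imply, not a quoted price.

```python
# Figures taken from the slide; everything derived from them is an estimate.
WORDS = 1_000_000      # target corpus size in words
WPM = 200              # speaking rate used on the slide, words per minute
COST_GBP = 4000        # cost figure given on the slide


def audio_minutes(words, wpm):
    """Minutes of recorded speech needed to yield `words` at `wpm`."""
    return words / wpm


minutes = audio_minutes(WORDS, WPM)
hours = minutes / 60
implied_rate = COST_GBP / minutes   # pounds per audio minute (derived, not quoted)

print(f"{minutes:.0f} minutes = {hours:.1f} hours of audio")
print(f"implied cost: £{implied_rate:.2f} per audio minute")
```

On these assumptions a million-word corpus corresponds to a little over 83 hours of speech.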

Page 4:

Some possibilities?
Parliamentary proceedings
  Generally read/prepared
  Published proceedings (Hansard, Europarl) heavily regularised
Public lectures (LSE)
  “Transcripts” usually lecturers’ written texts
BNC spoken (other spoken materials)
  now rather old
  not very large (7M)
  very mixed (meetings, talks, lectures, lessons, tutorials, interviews, medical consultations, sports broadcasts ...)
  no sound
BASE, MICASE (1.6M)
  mixed: lectures/seminars from across academia
  transcripts variously readable (punctuation)

Page 5:

BASE transcript (XSL transform)

then there 's the environmental impact which we've (0.3) touched on before (2.9) agriculture (0.3) when it 's practised (0.2) has raised (0.5) quite a considerable debate in in quite a number of ways (0.4) and (0.2) in most developed countries we do the work for intensive (0.5) agricultural production (1.2) there is very little (0.5) in this country which you could call a natural environment (0.5) it depends er (0.3) to a despite (0.6) people protesting (0.4) about (0.3) er (0.5) the effects on the countryside the countryside is almost completely agricultural (0.2) production it 's certainly (0.3) framed and (0.3) a product of agricultural production (0.7) so we do <laughing/> have to <normal/> (0.2) to bear this in mind (0.2) when you look at other countries and you talk about them (0.3) not destroying their rainforests and all the rest of it which i (0.6) fully sympathize with (0.2) you have to remember that 's exactly what we have done (0.7) we have taken out all our forests (0.4) and put in agriculture
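The timed pauses and empty tags in this extract make it hard to read but easy to process. The sketch below shows one way to separate the two layers, assuming pauses always appear as a number of seconds in round brackets and voice-quality shifts as empty tags such as <laughing/>. The function names are illustrative, not part of any BASE tooling.

```python
import re

PAUSE = re.compile(r"\((\d+(?:\.\d+)?)\)")   # timed pause, e.g. (0.3)
TAG = re.compile(r"<[^>]+/>")                # empty tag, e.g. <laughing/>


def pause_lengths(text):
    """Return every timed pause in the text, in seconds."""
    return [float(p) for p in PAUSE.findall(text)]


def plain_text(text):
    """Drop pause timings and empty tags, then normalise whitespace."""
    text = TAG.sub(" ", text)
    text = PAUSE.sub(" ", text)
    return " ".join(text.split())


sample = ("then there 's the environmental impact which we've (0.3) touched "
          "on before (2.9) agriculture (0.3) when it 's practised")

pauses = pause_lengths(sample)
long_pauses = [p for p in pauses if p > 1.0]   # same >1s threshold as the later guidelines
readable = plain_text(sample)
print(pauses, long_pauses)
print(readable)
```

Keeping the pause values as a separate list means they stay available for fluency analysis while learners read the cleaned text.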

Page 6:

Or ...
Voice to text?
Official transcripts of broadcast talk
  Larry King show

Page 7:

Thank you Auntie
Radio 4 transcripts
  Read-aloud written texts (From our own correspondent)
  Heavily edited pieces of interviews (Analysis, File on 4, You and yours)
  Money box (highly technical and UK specific)
  Andrew Marr (interviews with UK politicians)
  Any Questions (panel discussions of topical questions, live)
Copyright limitations!

Page 8:

Any questions?
  Weekly. Standard format
  Spontaneous speech (panellists don’t know questions in advance)
  Chair + 4 different panellists each episode
  Topical (though mainly UK politics)
  Streamed audio (available for 4 weeks), recordable
  Transcripts (available for 4 years)

Page 9:

My AQ corpus
Official transcripts of 150 episodes (1.3M tokens)
  May 2005-May 2008
  50 transcripts checked against audio
Converted to TEI-XML (similar markup to BNC spoken)
  participants: role, sex, affiliation
  sections (questions)
  utterances/speakers
  sentence-like objects
  vocals and events: laughing, clapping, booing etc
POS tagged and lemmatised (CLAWS7)
Indexed and interrogable (Xaira)
  same interface as BNC-XML
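To give a feel for the markup layers listed above, here is a minimal sketch that builds one utterance with sentence-like units and a non-verbal event. It assumes BNC-style element names (u with a who attribute, s with a running number, event with a desc attribute). The speaker identifier and the sentence are invented for illustration, and the actual AQ corpus encoding may differ.

```python
import xml.etree.ElementTree as ET


def make_utterance(speaker_id, sentences, event=None):
    """Build a <u> element holding numbered <s> units and an optional <event>."""
    u = ET.Element("u", {"who": speaker_id})
    for n, text in enumerate(sentences, start=1):
        s = ET.SubElement(u, "s", {"n": str(n)})
        s.text = text
    if event is not None:
        # Audience reactions such as clapping or booing are recorded as events.
        ET.SubElement(u, "event", {"desc": event})
    return u


# Hypothetical speaker id and wording, not taken from the corpus.
utterance = make_utterance("PANELLIST1", ["Well I think that is a nonsense."],
                           event="applause")
xml_string = ET.tostring(utterance, encoding="unicode")
print(xml_string)
```

Once utterances carry a speaker reference, queries can be restricted by the participant attributes (role, sex, affiliation) recorded in the header.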

Page 10:

Transcript problems
Inter-transcriber differences in
  html markup use
  ignorance
  regularisations/omissions
    repetitions/false starts
    pauses/fillers
    overlaps
  spelling and punctuation (s/z, contracted forms, punctuation use)
  use of abbreviations/symbols (UK)
but designed for readability

Page 11:

A recent example

One of two of them very necessary. For eg stopping MPs in London from claiming expenses for second homes that frankly they don’t need but most crucially effective was this proposal from Gordon Brown that essentially MPs should have their London living allowance not claimed with receipts published in a transparent way but simply delivered by turning up with some sort of per deum rate and the idea that all of this row that we have had over the last year or two about MPs’ expenses and lack of transparency and accountability should bed delivered and solved by simply getting rid of all our claims completely and having no transparency so that we simply end up being awarded money for turning up without even having to submit bills is a nonsense. and it would import into Westminster politics which has already got enough problems with MPs’ expenses the worse features of Brussels politics in which precisely this happens and you see MEPs turning up on the day that they are supposed to be going home signing in to make sure they get the allowance and then scooting off on the next plane.

Page 12:

Advantages for learners
Shared context simplifies interpretation
Moderately readable transcripts
  not too many errors
  not enough punctuation
Fairly followable against audio
  alignment?

Page 13:

Just about large (and small) enough for fairly common lexis?
  in one go (1)
  lifeblood (1)
  very (ADJ) (116)
  .*ly important (120)
  back + up (VERB) (13)
  big + issue (60)
  I’d/would + like to see (88)
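A pattern query such as ".*ly important" can be approximated outside a corpus tool with an ordinary regular expression. The sketch below is a stand-in under stated assumptions: it searches running text rather than an indexed corpus, it writes the word-level wildcard as \w+ly so that a match cannot run across word boundaries, and the sample sentences are invented.

```python
import re
from collections import Counter

# Word ending in -ly immediately followed by "important".
LY_IMPORTANT = re.compile(r"\b(\w+ly) important\b", re.IGNORECASE)

# Invented sample text, standing in for corpus transcripts.
sample = ("This is a really important issue. It is terribly important, "
          "and vitally important too. It is not important at all.")

adverbs = [m.lower() for m in LY_IMPORTANT.findall(sample)]
frequencies = Counter(adverbs)
print(adverbs)
print(frequencies.most_common())
```

Listing which adverbs fill the slot, and how often, is what turns a raw hit count into a learnable set of variants.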

Page 14:

The next problem
False starts
  Learners didn’t use them (one-word-at-a-time)
  Relate (presumably) to lexical units (chunks)
  Stop and start the unit again

Page 15:

The next problem
Learners make relatively few false starts: they pause, think, and then continue
  performance fragmented
False starts perceived as performance flaws by interpreting teachers
Relation to multi-word chunks?
  no pauses in chunks? (Adolphs et al 2007)
  false starts should involve chunk restarts?

Page 16:

The In our time experiment
Weekly chaired discussion with three experts on historical/philosophical/scientific/literary topics, apparently unedited
15 pairs of students transcribed an episode of their choice as a vacation task
  transcribe half, check other’s half
Transcription guidelines
  punctuate using intonation
  include: long (>1s) pauses, repetitions, false starts, fillers, overlaps
Transcripts
  checked against audio
  marked up in TEI-XML
  indexed with Xaira

Page 17:

The In our time corpus
  15 episodes (10 hours)
  42 speakers
  1592 utterances
  5050 sentence-like objects
  140,000 tokens
  not POS tagged/lemmatised
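The corpus totals can be turned into a few rates that are easier to compare with other spoken data. This is a simple worked calculation from the slide's figures. It assumes the 10 hours and 140,000 tokens are round numbers, so the results are approximate, and the token count may include items other than words.

```python
# Totals as given on the slide.
EPISODES = 15
HOURS = 10
UTTERANCES = 1592
S_UNITS = 5050        # sentence-like objects
TOKENS = 140_000

total_minutes = HOURS * 60

minutes_per_episode = total_minutes / EPISODES
tokens_per_minute = TOKENS / total_minutes
tokens_per_utterance = TOKENS / UTTERANCES
tokens_per_s_unit = TOKENS / S_UNITS

print(f"{minutes_per_episode:.0f} minutes per episode")
print(f"{tokens_per_minute:.0f} tokens per minute")
print(f"{tokens_per_utterance:.1f} tokens per utterance")
print(f"{tokens_per_s_unit:.1f} tokens per sentence-like object")
```

The long average utterance suggests extended turns rather than rapid exchange, which fits the aim of modelling public speaking rather than conversation.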

Page 18:

Advantages
  less data to deal with
  data known
    someone in the group should recognise/be able to explain the context

Page 19:

But is it any use?
quite good for
  general nouns/verbs (often textually cohesive)
    people 393
    time 390
    thing/things 361
    happen / happened / happening / happens 167
    idea 165
  features associated with markup in transcript
    textual colligations (Hoey 2005)
    discontinuities (self-corrections / false starts)

Page 20:

utterance-initial
  well 145
  yes 124
  yeah 65
  no 26

15 agreements

Page 21:

Syntactic/intonational discontinuities
Breakings off (sentence-final) 41
  You wanted to say –
False starts (sentence-medial) c.3500 (6 per minute)
  Word truncations 270
  Self-corrections
    Can you develop that Steven, why Freud got there, and why then it reached out to quite a lot of people, er and seemed to affect – afflict them so much?
  Self-repetitions
    So that again there 's this – this idea that when – when Vesalius, let 's say, looks at the structures that he finds through – through dissection, he 's still trying to understand them in the main prevailing model that has been – has been existing for a thousand years, the Galenic model.
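Because the transcription marks each discontinuity with a dash, self-repetitions can be separated from self-corrections mechanically. The sketch below assumes a space-separated en dash marks every break, and it treats a break as a repetition when the words just before the dash recur immediately after it. The function name and the five-word limit are illustrative choices, not the procedure used for the corpus.

```python
def repeated_segments(transcript, dash="–", max_len=5):
    """Return the longest repeated segment at each dash, skipping non-repetitions."""
    tokens = transcript.split()
    found = []
    for i, tok in enumerate(tokens):
        if tok != dash:
            continue
        # Try the longest candidate first so "has been" beats "been".
        longest = min(max_len, i, len(tokens) - i - 1)
        for k in range(longest, 0, -1):
            before = [t.lower() for t in tokens[i - k:i]]
            after = [t.lower() for t in tokens[i + 1:i + 1 + k]]
            if before == after:
                found.append(" ".join(before))
                break
    return found


repetition = ("So that again there 's this – this idea that when – when Vesalius, "
              "let 's say, looks at the structures that he finds through – through "
              "dissection, he 's still trying to understand them in the main "
              "prevailing model that has been – has been existing for a thousand "
              "years, the Galenic model.")
correction = ("Can you develop that Steven, why Freud got there, and why then it "
              "reached out to quite a lot of people, er and seemed to affect – "
              "afflict them so much?")

segments = repeated_segments(repetition)
no_segments = repeated_segments(correction)
print(segments)
print(no_segments)
```

A dash with no repeated material on either side, as in "affect – afflict", is left for the corrections count.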

Page 22:

So ...
  Where do speakers get stuck?
  What do speakers do when they get stuck?

Page 23:

The last word - where speakers get stuck
just a typical list of function words?
  is you er i we so was he this
  the of and that a to in it s
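A list like this can be produced by tallying the word that immediately precedes each marked break. The following is a minimal sketch under the same assumption as the transcription convention, that a space-separated en dash marks a discontinuity. The helper name is hypothetical and the three examples come from the discontinuities slide.

```python
from collections import Counter


def last_words_before_break(transcripts, dash="–"):
    """Count the word immediately before each dash across all transcripts."""
    counts = Counter()
    for text in transcripts:
        tokens = text.split()
        for i, tok in enumerate(tokens):
            if tok == dash and i > 0:
                # Lower-case and drop trailing punctuation before counting.
                counts[tokens[i - 1].lower().strip(".,?!")] += 1
    return counts


examples = [
    "You wanted to say –",
    "er and seemed to affect – afflict them so much?",
    "there 's this – this idea that when – when Vesalius, let 's say, looks at "
    "the structures that he finds through – through dissection",
]

last_words = last_words_before_break(examples)
print(last_words.most_common())
```

Run over the whole corpus, the top of such a tally would show whether breaks really cluster after function words.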

Page 24:

the discontinuities
Repetitions 300
Corrections 99
Repeated segment
  the 255
  of the 14
    one of the 2
  in the 13
  at the 2
  on the 2
  to the 2
  and the 1
  beyond the 1
  but the 1
  by the 1
  for the 1
  from the 1
  is the 1
  they are the 1

Page 25:

a discontinuities
Repetitions 144
Corrections 56
Repeated segments
  a 106
  it ’s a 11
  in a 7
  there ’s a 3
  as a 3
  is a 3
  of a 3
  has a 2
  for a 2
  and a 1
  just a 1
  with a 1
  there ’s a sort of a 1

Page 26:

that discontinuities
Repetitions 106
Corrections 43
Repeated segments
  that 101
  I suppose that 1
  in that 1
  within that 1
  did that 1
  you have that 1

Page 27:

of discontinuities
Repetitions 120
Corrections 22
Repeated segments
  of 101
  by the end of 2
  sort of 2
    what sort of 1
  array of 1
  one of 1
  there ’s an age of 1
  at the base of 1

Page 28:

we discontinuities
Repetitions 52
Corrections 23
Repeated segments
  we 46
  what we 2
  if we 1
  that we 1
  where we 1
  which we 1

Page 29:

is discontinuities
Repetitions 76
Corrections 40
Repeated segments
  is 67
  this is 4
  there is 2
  it is 2
  that is 1

Page 30:

’s discontinuities
Repetitions 67
Corrections 44
Repeated segments
  it ’s 45
    if it ’s 1
  there ’s 11
  that ’s 6
  here ’s 1
  let ’s 1
  what ’s 1
  which he ’s 1

Page 31:

n=10-19
  about – about 4
  at – at 9
  for – for 10
    epigram for – epigram for 1
    guide for – guide for 1
  how – how 11
  I ’m – I ’m 11
  which – which 9
  been – been 2
    has been – has been 1
    had been – had been 1
  can – can 1
    we can – we can 2
    you can – you can 1
  have
    we have – we have 1
    they have – they have 1
  our – our 7
  say – say 1
  when – when 7
  would – would 3
    I would – I would 1
  if – if 9
    it ’s as if – it ’s as if 2
  just – just 2
  I mean – I mean 3
  not – not 2
    I ’m not – I ’m not 1
    it ’s not – it ’s not 1
  on – on 8
    a book on – a book on 1
  one – one 10
    what one – what one 1
  these – these 8
  were – were 5
  who – who 11

Page 32:

Overall
  discontinuities tend to follow function words
  discontinuities involve repetition rather than correction
  repetitions are one or two words, very few more
    why?
But my students have learned to backtrack more when they get stuck

Page 33:

Suggestions please

Page 36:

Clusters
Recurrent sequences – chunks?
Chunks which particular words occur in
  Only fairly frequent words (small corpus)
  lie – lie behind/with/underneath (phrasal verbs an obvious eg: which ones are used frequently?)
Is this worth learning?

Page 37:

Because small and homogeneous, possible to identify recurrent patterns for common words in this genre (public speaking)
  look
    have a(nother) look, say look
  think -> think of -> can(’t) think of
With sound, we wouldn’t need good transcripts

Page 38:

at the end of the day
and all the rest of it
I think vs going to
and all the names (Tony Blair, Gordon Brown, David Cameron)

Page 39:

Nesi, H. & H. Basturkmen 2007. “Lexical bundles and discourse signalling in academic lectures”. In T. Nevalainen & S-K Tanskanen (eds) Lexical cohesion and corpus linguistics. Benjamins. 23-44.
Beeching, K. 1997. Applied linguistics
C-ORAL-ROM
Wray, A. 2002. Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.

Page 40:

Irina Dahlmann, Svenja Adolphs, Tom Rodden
BAAL 2007, Edinburgh
Multi-word units, fluency and pause annotation in spoken corpora

EXTRACT (Learner corpus):
  S1: What level what level of English do you want? Do you want a very high level of English?
  S2: I think I in the future
  S1: Mm…mmm
  S2: and I can communication with USA people and UK people
  S1: Mm…mmm
  S2: very fluencily[sic] …

Reviewed Corpora (pause and transcription information from the respective accompanying corpus websites if not stated otherwise)
  ANDOSL Australian National Database of Spoken Language
  BASE British Academic Spoken English Corpus
  BNC British National Corpus, spoken
  CHILDES Child Language Data Exchange System
  CHRISTINE (http://www.grsampson.net/RChristine.html)
  COLT Bergen Corpus of London Teenage Language
  ICE International Corpus of English
  LINDSEI-Ger Louvain International Database of Spoken English Interlanguage, German component (see Brand & Kämmerer 2006)
  LLC London-Lund Corpus of Spoken English
  MICASE Michigan Corpus of Academic Spoken English
  TRAINS (http://www.cs.rochester.edu/research/cisd/projects/trains/)
  WCSNZE Wellington Corpus of Spoken New Zealand English
  VOICE Vienna-Oxford International Corpus of English

Page 41:

The average number of words per fluent unit is about six (Pawley and Syder 2000: 195)

Page 42:

utterance- and sentence-initial well