Overview of corpus work in Norway
Norsk talekorpus (in preparation)
Bergen (GK)
Oslo (JBJ and HGS)
(Others: Colt, Uno, …)
20.08.2002 2
Norsk talekorpus (‘Norwegian Speech Corpus’)
(Project in preparation)
• Cooperation between Oslo (JBJ, HGS, Arne Torp, Ruth Fjeld), Trondheim (Tor A. Åfarli) and Bergen (GK)
• Application for funding submitted to the Norwegian Research Council in June 02
• Aim: To create a representative corpus of Norwegian speech that can be part of the national language corpus that is now being planned
Talesøk (‘Speech Search’)
A tool for automated search in recorded speech
Gjert Kristoffersen
Scandinavian Dept., University of Bergen
Agder University College
Organization of project
• Cooperation between Aksis and GK
  – Aksis: Division of Unifob, a research institute associated with the University of Bergen
• Funding
  – Supported by the Meltzer foundation (1999) and the Faculty of Arts, University of Bergen (2001, 2002)
• Persons
  – Knut Hofland, Aksis
  – Gjert Kristoffersen, Dept. of Scandinavian
Aim of project
• To provide a means for obtaining efficient access to speech data from recordings
  – Basic ideological point: In speech research, only the sound itself represents primary data. Any transcription of the sound is an analysis of the data, and therefore of secondary status
Use
• Useful in at least two fields
  – Testing of hypotheses in theoretical studies
    • Intuition is not enough; theoretical studies also need real speech data
  – Quantitative studies of language variation
Website
• http://www.hf.uib.no/i/Nordisk/talekorpus/Hovedside.htm
Use of Talesøk (all quantitative variation studies)
• Finished projects
  – Vibeke Notland (master's thesis published on the net)
• Work in progress
  – Dissertations
    • Magnhild Selaas, Reidunn Hernes, Ragnhild Haugen (University of Bergen), Unn Røyneland (University of Oslo), Randi Solheim (Norwegian University of Science and Technology)
  – Master's thesis
    • Anne Marit Budal
Architecture
• Digitized recording aligned with a transcription of the recording
• The transcription is conceived as a representation of structural properties underlying the spoken text
• The transcription can therefore be used as
  – a tool for searching for instantiations of these structural properties
  – a means of direct access to these instantiations
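The alignment idea can be illustrated with a minimal Python sketch. It assumes, hypothetically, that the transcription is stored as time-stamped segments; the names `Segment` and `find_word` and the sample utterances are invented for illustration and do not describe the actual Talesøk implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One stretch of transcription aligned with the recording (times in seconds)."""
    start: float
    end: float
    text: str


def find_word(segments, word):
    """Return the (start, end) time spans of all segments containing `word` as a whole token."""
    target = word.lower()
    return [(s.start, s.end) for s in segments if target in s.text.lower().split()]


# Hypothetical toy data, not taken from any real recording.
segments = [
    Segment(0.0, 2.4, "vi bur i ein einebolig"),
    Segment(2.4, 5.1, "det er ikkje langt herifrå"),
    Segment(5.1, 7.8, "ein einebolig med hage"),
]

# Each hit is a time span that a player could use to jump straight to the sound.
hits = find_word(segments, "einebolig")
print(hits)
```

The search runs on the transcription, but what it returns are positions in the recording, so the sound remains the primary data.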
Recordings
• Must be digitized
• Recordings of running text, such as conversations, sociolinguistic interviews etc.
Transcription
Basic requirements of transcription
• Must give rapid and efficient access to the primary data, i.e. the recorded sound
• The transcription code must secure maximal consistency and cost efficiency, given the fact that transcription work is time consuming, and therefore expensive
Basic requirements of transcriptions
• If a transcription is to serve as a representation of structural properties underlying the text, it should potentially serve as:
  – Phonological representation
    • Segmental organization
    • Prosodic organization
  – Morphological representation
  – Syntactic representation
Basic requirements of transcriptions
• Transcriptions must also allow for lexical searches, i.e. searches for realizations of particular lexical items
Choice of transcription code
• Phonetic transcription?
• Phonemic transcription?
• Orthographic transcription?
Phonetic transcription
• Transcribers must be trained in phonetics
• Extremely time consuming and therefore extremely costly
• Consistency both within and across transcribers is very difficult to achieve
• Cannot serve as a representation of structural properties beyond the phonetics itself
• While phonetics is gradual, a ‘phonetic’ transcription is in essence categorial.
? Phonemic transcription
• Presupposes an unequivocal phonemic analysis of the variety to be transcribed
• Transcribers should ideally be phonologically trained
• Cannot serve as a representation of structural properties beyond segmental phonology unless it is manually tagged for additional properties
! Orthographic transcription
• Orthography is (reasonably) well known by most people, hence
  – minimal amount of training of transcribers is required
  – maximal efficiency and consistency, and therefore maximal cost efficiency can be obtained
  – can in principle serve as a basis for automatic morphological tagging and perhaps for syntactic parsing
Relationship between orthography and underlying structure
• Segmental phonology
  – Orthographic transcription can serve as a phonological representation to the extent that there exists a non-arbitrary relationship between the graphemic and the phonemic level
• Morphology
  – Orthographic transcriptions can be tagged automatically
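The segmental point can be sketched in Python: where spelling maps predictably onto sound, a pattern over graphemes picks out candidate sites for a phonological variable. The function name `candidate_sites`, the sample sentence and the choice of word-initial "kj" as the search pattern are illustrative assumptions, not features of the actual search tool.

```python
import re


def candidate_sites(tokens, grapheme_pattern):
    """Return (index, token) pairs for tokens whose spelling matches the grapheme pattern."""
    rx = re.compile(grapheme_pattern)
    return [(i, tok) for i, tok in enumerate(tokens) if rx.search(tok)]


# Hypothetical orthographic transcription, split into word tokens.
tokens = "eg kjenner han som kjøpte huset".split()

# Word-initial <kj> is used here as a stand-in for the sound it normally spells.
sites = candidate_sites(tokens, r"^kj")
print(sites)
```

The hits are only candidates: the pattern finds where the sound is expected, and the actual realization still has to be checked in the recording.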
Relationship between orthography and underlying structure
• Lexis
  – Any word can be searched for in an orthographic transcription. Tagging in addition gives access to different realizations of the same lexeme
• Syntax
  – Efficient syntactic analysis presupposes automatic parsing. To the extent that this is feasible, the input must be in orthographic form
Relationship between orthography and underlying structure
• Prosody
  – Can to a certain extent be inferred from the orthography. Presupposes mapping between transcription and a lexicon provided with information about stress and tone.
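The mapping between transcription and a prosodic lexicon can be sketched in Python as a simple lookup. The lexicon below is a two-entry toy built on the familiar accent pair bønder / bønner; the field names and the `annotate` function are assumptions made for illustration, and tone assignment in a real lexicon would have to be set per dialect.

```python
from typing import NamedTuple


class Prosody(NamedTuple):
    stressed_syllable: int  # 1-based position of the stressed syllable
    tone: int               # tonal accent, 1 or 2


# Hypothetical toy lexicon; a real one would cover the whole vocabulary.
LEXICON = {
    "bønder": Prosody(stressed_syllable=1, tone=1),
    "bønner": Prosody(stressed_syllable=1, tone=2),
}


def annotate(tokens):
    """Pair each token with its lexicon entry, or None when the word is not listed."""
    return [(tok, LEXICON.get(tok.lower())) for tok in tokens]


annotated = annotate(["bønner", "og", "bønder"])
for token, prosody in annotated:
    print(token, prosody)
```

Words missing from the lexicon come back as `None`, which marks exactly the cases where prosody cannot be inferred from the orthography.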
Transcription tool
• Praat
  – http://fonsg3.let.uva.nl/praat/
  – allows online transcription with automatic insertion of time codes
Transcribing Norwegian speech
• Trade-off between phonology and morphology
  – Phonology profits from transcription of morpho-phonological variation
  – Morphological tagging is facilitated by minimizing transcription of morpho-phonological variation
Transcribing Norwegian speech
• Which norm?
  – Bokmål?
  – Nynorsk?
• Which subnorm?
  – Radical or moderate bokmål?
  – Moderate or conservative nynorsk?
Bokmål or nynorsk?
• Nynorsk closer to West Norwegian, bokmål to East Norwegian
• People do not respect the division of lexis between the two norms in their speech
• Example
  – Einebolig (nyn: einebustad, bokm: enebolig).
A hypernorm
• For any word in the recording, choose from the two norms the orthographic form of the stem that most closely fits the spoken form to be transcribed
  – ‘Mixed’ compounds and derivations
    • einebolig (eine (nyn) + bolig (bm))
    • bestemtheit (bestemt (bm) + -heit (nyn))
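The selection rule can be sketched in Python. In practice the transcriber makes this choice by ear; the sketch swaps in string similarity from the standard library's `difflib` as a stand-in, and assumes that the spoken form is available as a rough orthographic rendering. The function name `closest_form` is hypothetical.

```python
from difflib import SequenceMatcher


def closest_form(spoken, candidates):
    """Pick the candidate spelling most similar to the (roughly rendered) spoken form."""
    return max(candidates, key=lambda form: SequenceMatcher(None, spoken, form).ratio())


# Stems of the compound from the example: bokmål 'enebolig', nynorsk 'einebustad'.
first = closest_form("eine", ["ene", "eine"])
second = closest_form("bolig", ["bolig", "bustad"])
compound = first + second
print(compound)

# Derivational suffix: bokmål '-het' versus nynorsk '-heit'.
suffix = closest_form("heit", ["het", "heit"])
print("bestemt" + suffix)
```

Each part of the word is matched separately, which is what produces the mixed forms: one stem from nynorsk, the other from bokmål.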
A hypernorm
• Inflectional endings and pronouns are transcribed consistently either in bokmål or nynorsk, depending on the dialect
Advantage of hypernorm
• Phonological searches are facilitated, because the match between orthography and speech is optimized
• A ‘hypertagger’ must be developed that can analyze texts that contain both nynorsk and bokmål forms.
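What such a tagger has to cope with can be sketched in Python as a lookup across two lexicons. This is only a dictionary lookup, not a real tagger: the tiny lexicons, the tag labels and the function name `hypertag` are all assumptions made for illustration.

```python
# Hypothetical toy lexicons mapping word forms to part-of-speech tags.
BOKMAAL = {"jeg": "PRON", "ikke": "ADV", "bolig": "NOUN"}
NYNORSK = {"eg": "PRON", "ikkje": "ADV", "bustad": "NOUN"}


def hypertag(tokens):
    """Tag each token using whichever norm lists it, and record which norms matched."""
    tagged = []
    for tok in tokens:
        key = tok.lower()
        norms = [name for name, lexicon in (("bm", BOKMAAL), ("nn", NYNORSK)) if key in lexicon]
        tag = BOKMAAL.get(key) or NYNORSK.get(key)  # None if neither norm lists the form
        tagged.append((tok, tag, norms))
    return tagged


# A mixed utterance: nynorsk pronoun, bokmål adverb and noun, one unknown form.
result = hypertag(["eg", "ikke", "bolig", "xyz"])
for row in result:
    print(row)
```

A single-norm tagger would fail on either "eg" or "ikke" in the same sentence; consulting both norms side by side is the minimum the hypernorm requires.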
‘The Bergen Corpus’
• Ca. 500,000 words
• A bundle of small corpora that have not yet been fully integrated into one corpus
• Recordings are all sociolinguistic interviews
• Dialects: Mostly West Norwegian
• Access: Restricted, but in principle accessible via the net