LREC 2006 May. 24-26 Genoa, Italy
Oriental COCOSDA: Past, Present and Future
Shuichi ITAHASHI
National Institute of Informatics (NII), Tokyo, Japan
AIST, Tsukuba, Japan
Chiu-yu TSENG
Academia Sinica, Taipei, Taiwan
Satoshi NAKAMURA
ATR Spoken Language Communication Res. Labs., Kyoto, Japan
Contents
1. Necessity of Speech Corpora
2. Organizations for Speech Corpora
3. Asian Languages
4. Brief History
5. Goals & Strategies
6. Regional Activities
7. Conclusion
Necessity of Speech Corpus
[Diagram] Speech data + related information
↑ Speech research → objectivity of research
→ Openness to the public
↓ Preservation of spoken language data → preserving cultural legacy
Organizing Creation & Utilization of Speech Corpora
Creating speech corpora is costly.
Using them requires a system for distributing corpora.
Several activities started in the early 1990s:
1991 COCOSDA
1992 LDC in the U.S.A.
1995 ELRA in Europe
COCOSDA
International Coordinating Committee on Speech Databases and Speech I/O Systems Assessment
Workshops held annually at Interspeech
COCOSDA promotes the development of spoken language corpora for building and/or evaluating spoken language technology and offers coordination of projects and research efforts to improve their efficiency.
Features of Asian Languages
1. Many languages belong to different language families.
2. Variety of orthographic systems
Various letters/characters used
3. Some tonal languages
4. No space between words in some languages
5. Non-unique romanization systems
Language Families of Asian Languages
1. Austronesian (1268 languages): Malay, Indonesian, etc.
2. Sino-Tibetan (403): Chinese, Tibetan, Burmese, etc.
3. Austro-Asiatic (169): Khmer, Vietnamese, etc.
4. Tai-Kadai (76): Thai, Lao, etc.
5. Dravidian (73): Tamil, Telugu, etc.
6. Altaic (66): Mongolian, Turkic, Korean, etc.
7. Japanese (12): Japanese, Ryukyuan, etc.
cf. Indo-European (449). Language counts from Ethnologue.com.
Letters, Tone & Word Order
1. Native scripts: Burmese, Chinese, Japanese, Khmer, Korean, Thai, etc.
2. Latin letters: Indonesian, Malay, Vietnamese, etc.
3. Tonal languages: Burmese, Chinese, Lao, Thai, Vietnamese, etc.
4. Word order: SOV, SVO, VSO, VOS
Word boundary in text
1. No space between words: Burmese,
Chinese, Japanese, Khmer, Lao, Thai, etc.
2. Space between words: Indonesian, Malay,
Mongolian, Vietnamese, etc.
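The practical consequence of the two groups above can be illustrated with a minimal Python sketch. The sample phrases and the helper name `whitespace_tokens` are illustrative assumptions, not part of the original slides; the point is only that splitting on spaces recovers words for a language written with spaces but returns a single unsegmented chunk for one written without them.

```python
def whitespace_tokens(text):
    """Naive tokenizer: split the text on whitespace only."""
    return text.split()

# Illustrative sample phrases (assumptions, not taken from the slides).
samples = {
    "Indonesian": "selamat pagi",       # "good morning", written with a space
    "Japanese": "おはようございます",     # "good morning", written without spaces
    "Chinese": "今天天气很好",            # "the weather is nice today", no spaces
}

# Count how many tokens the naive tokenizer finds in each phrase.
token_counts = {
    lang: len(whitespace_tokens(text)) for lang, text in samples.items()
}

for lang, count in token_counts.items():
    print(f"{lang}: {count} whitespace-delimited token(s)")
```

For text in the first group, a corpus therefore typically needs a separate word segmentation step before word-level annotation.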
Asian Activities
1994, 1997 Oriental COCOSDA
1999 GSK (Language Resource Association) in Japan
2001 SITEC in Korea (Speech Information Technology & Industry Promotion Center)
2002 Chinese LDC
CCC (Chinese Corpus Consortium) in China
2006 NII-SRC in Japan (National Institute of Informatics, Speech Resources Consortium)
Oriental COCOSDA
Proposed in 1994 to exchange ideas, share information, and discuss regional issues on spoken language processing (SLP).
Preparatory meeting in Hong Kong in 1997.
Annual workshops held since 1998 in Japan,
Taiwan, China, Korea, Thailand, Singapore,
India, Indonesia.
Necessity of Oriental COCOSDA
Asia is a multilingual region.
Linguistic diversity is greater than in Europe.
Speech research was emerging.
Speech corpora were required.
Cooperation among countries was necessary.
Organizations for speech corpora were needed.
Oriental COCOSDA Mission
To exchange ideas, share information, and discuss regional matters on the creation, utilization, and dissemination of spoken language corpora of oriental languages and on assessment methods for speech input/output systems; and
To promote speech research on oriental languages.
Goals of Oriental COCOSDA
1. Initiating a Speech Resources Consortium in each country.
2. Establishing an Asian network among the consortia.
3. Creating a multilingual corpus of semantically similar content.
Strategies of Oriental COCOSDA
1. Foundation of Oriental COCOSDA
Forum of speech corpora
2. Establishment of Regional Consortia:
GSK, SITEC, Chinese LDC, CCC,
NII-SRC
3. Collaboration among the consortia
Oriental COCOSDA Organization
Convenor: Chiu-yu TSENG (2006-)
S. ITAHASHI (1998-2005)
Advisory members:
Three from China, Japan, Korea
Committee members: 21 from 10 regions including
China, Hong Kong, India, Indonesia, Japan, Korea,
Mongolia, Singapore, Taiwan, Thailand.
International Workshop on East-Asian Language Resources and Evaluation
- Oriental COCOSDA WORKSHOP -
1998 1st Meeting, Tsukuba, Japan (30 papers, 54 participants)
1999 2nd Meeting, Taipei, Taiwan (44, 120)
2000 3rd Meeting, Beijing, China (8, 20)
2001 4th Meeting, Taejon, Korea (11, 25)
2002 5th Meeting, Hua Hin, Thailand (24, 96) + SNLP
2003 6th Meeting, Sentosa, Singapore (28, 60) + PACLIC
2004 7th Meeting, Delhi, India (55, 150) + iSTEPS, iSTRANS
2005 8th Meeting, Jakarta, Indonesia (24, 65)
Oriental COCOSDA Organizers
T. F. Zheng (China)
S. S. Agrawal (India)
Thanaruk T. (Thailand)
K. T. Lua (Singapore)
S. Itahashi (Japan)
L. S. Lee (Taiwan)
C. K. Chan (Hong Kong)
H. Riza (Indonesia)
Y-J Lee (Korea)
Participation
0. China, Japan, Korea, Taiwan (CJKTw), Hong Kong (HK)
1. CJKTw
2. CJKTw, Thailand (Th), France (F), U.S.A.
3. CJKTw, Th, Mongolia (Mg)
4. CJKTw, Th, Australia (Au)
5. CJKTw, Th, India (Id), Indonesia (Is), Guam
6. CJKTw, Th, Id, Is, Singapore (S)
7. CJKTw, Id, Is, S, Au, F, U.S.A.
8. CJKTw, Th, Is, Malaysia, Mg, HK
Some Regional Activities
Japan, Korea, China, Hong Kong, Mongolia, Singapore, Taiwan, Thailand, India, Indonesia
Japanese Activities
GSK: Language Resource Association
Launched in 1999
Reorganized as an NPO in 2003
Three-year project accepted in 2005
Emphasizing written text corpora
NII-SRC launched in 2006 for speech corpora
Standardization in Japan
1) Open Software Tools: Julius, Galatea, etc.
2) Standard of Speech Synthesis System Performance Evaluation Methods, by JEITA (2003)
3) Standard of Symbols for Japanese Text-To-Speech Synthesizer, by JEIDA (2000)
JEITA: Japan Electronics and Information Technology Industries Association
JEIDA: Japan Electronic Industry Development Association
Korea
SITEC (Speech Information Technology & Industry Promotion Center)
Founded in 2001 (Korean counterpart of LDC/ELRA)
Wonkwang University as host organization (7 full-time staff)
Chinese LDC
Launched in 2002
Creation of linguistic corpora
Management & distribution of language resources
Promotion of sharing language resources
*Chinese Corpus Consortium (CCC)
Future Prospects: Global Speech Corpus
Digits, digit strings, days of the week, months, time, salutations, yes/no, well-known proper nouns (person names, cities, companies), well-known stories, phonetically-balanced sentences, etc., common to all languages.
Utterance Content
Items widely understood in the world:
10 Digits, 12 Months of the year,
7 Days of the week, 4 Words on Weather,
6 Phrases of Greetings, 3 Words of Replies,
4 Words on time.
“North Wind” from Aesop’s Fables
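As a quick check on the size of the prompt list, the item counts above can be tallied. This is a minimal sketch: the dictionary keys and the function name `total_prompt_items` are hypothetical labels. Only the counts come from the slide, and the Aesop passage is left out of the tally because it is a read passage, not a word or phrase item.

```python
# Item counts per category, as listed on the slide.
# The category keys are hypothetical labels chosen for this sketch.
ITEM_COUNTS = {
    "digits": 10,
    "months": 12,
    "days_of_week": 7,
    "weather_words": 4,
    "greeting_phrases": 6,
    "reply_words": 3,
    "time_words": 4,
}

def total_prompt_items(counts):
    """Sum the number of word/phrase items across all categories."""
    return sum(counts.values())

print(total_prompt_items(ITEM_COUNTS))  # 46 items, plus the story passage
```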
Features of the proposed corpus
Containing various Asian Languages
With the same semantic content
Recorded in a sound-proof room
Future of Oriental COCOSDA
1. Collaboration among regional activities
2. Cooperative creation of speech corpora
3. Promotion of speech research in Asia
Future conference sites: Malaysia, Vietnam, Mongolia, Xinjiang Uygur Autonomous Region of China
Conclusion
1. Importance of speech corpora for promoting speech research.
2. Role of organizations for speech corpus creation and distribution.
4. GSK, NII-SRC, SITEC, Chinese LDC, and CCC are expected to advance speech corpus creation and distribution together with Oriental COCOSDA in East Asia.
http://www.slc.atr.jp/o-cocosda/
Oriental COCOSDA 2006
9-11 Dec. 2006
Universiti Sains Malaysia
Penang, Malaysia
Abstract submission: Aug. 5
Notification of acceptance: Aug. 26
Final manuscript: Sep. 30
http://www.usm.my/o-cocosda/