The Rosetta Project ALL Language Archive
A Project of the Long Now Foundation & A National Science Digital Library
www.rosettaproject.org
Presented by:
Laura Buszard-Welcher
The Rosetta Project / University of California, Berkeley
Primary Goals
• Support the documentation of the world’s nearly 7000 languages through building
– A digital archive of language documentation
– A linguistically sophisticated site that is also useful and interesting for the general public
– Networks of speakers, educators, linguists
• Contributes to the effort to document endangered languages
• Promotes linguistic diversity by educating the public about languages with small numbers of speakers.
Secondary Goals
• Support metadata standardization and interoperability
– OLAC
– EMELD
• Develop tools for collaborative linguistic research
– Endangered Language Query Room
– Wordlist Tool
– Collaborative document editing/creation (new site)
Roles
• The Long Now Foundation
– Parent organization of The Rosetta Project
– Projects, seminars on topics that foster long term thinking
• The National Science Digital Library
– U.S. National Science Foundation Program
– Goal is to bring online high quality STEM (Science, Technology, Engineering, and Math) resources for education
– Sponsor of Rosetta Project (NSF 333727)
• Stanford University
– Online and offline storage of Rosetta materials
Project History: The 1000 Language Archive
• Initiated by The Long Now Foundation
• Wanted to experiment with new microetching technology, looking for suitable content
• Decided to collect basic descriptive information for 1000 of the world’s approximately 7000 languages
Why language information?
• Most natural human languages are products of millennia of human history (therefore a good long term thinking project)
• Repositories of cultural information
• Languages showcase
– Human intellectual sophistication
– Cultural diversity
• To draw attention to the critical issue of language endangerment
The Rosetta Disk
• Next generation microfiche
• Micro-etched 2" nickel disk at densities of up to 200,000 page images per disk
• Developed by Los Alamos Laboratories and Norsam Technologies
• Reading the disk requires a microscope, either optical or electron, depending on the density of encoding
The Rosetta Stone
• Not us! (196 BC)
• Parallel text written in three scripts:
– Hieroglyphic
– Demotic (script form)
– Greek
• The key to deciphering Egyptian Hieroglyphs
Design of the Disk
• Original design has human-eye readable text (Genesis text) and micro-etched text inside an index
• New design has human-eye readable text (instructions) on one side and microetched images on the reverse
In-House Scanning
• HP CapShare Scanners
• Scan printed page in multiple passes, any direction
• Page is ‘assembled’ into one image
• Stores about 50 pages at a time (300 dpi bitmap .tif)
• Uploads numerically sequenced images to computer by infrared port
In-House Scanning
• Minolta PS 7000 Overhead
• Bitmap and grayscale scans up to 600 dpi
• Multiple sizes, orientations
• Single page / double page spread (good for text collections with verso annotations)
• Best for fragile books, manuscripts that would be damaged by hand scanning
Categories of Collection (1)
Ethnologue description: General information from www.ethnologue.com about language affiliation, where spoken, number of speakers, dialects, alternate language names.
General description: General description of the language. Origin and current distribution of language, number of speakers, family, typology, history, etc.
Maps: Maps of the geographic distribution of a language and its relationship to other languages in the region.
Orthography: Writing system(s) of the language with any accompanying guide to pronunciation, use, etc.
Phonology: A description of the basic sound units in a language (phonemes) and how they combine to form utterances.
Categories of Collection (2)
Grammar: How a language combines the smallest units of meaning (morphemes) to create words and words to create sentences.
Core Word Lists: A common word list of 100 or 200 terms typically collected in linguistic fieldwork (“Swadesh Lists”), often used for comparative purposes.
Numbers: A description of the numbering system(s) in a language with a list of basic terms.
Parallel Texts: A common text with translation for each language. Initially Genesis Chapters 1-3 (a commonly collected text). Now also the UN Declaration of Human Rights.
Glossed Texts: Transcribed indigenous texts with word glosses, free translations and grammatical markup.
Language Curation
Ethnologue Description
General Description (1651)
Maps (376)
Orthography (1052)
Phonology (1731)
Grammar (1167)
Core Word Lists (3098)
Numbering Systems (215)
Main Parallel Texts (1109)
Glossed Vernacular Texts (869)
Rosetta Project Web Site
• Welcome
• Search for a language
• Language overview page
• Browse (by name, family, country)
• Wordlist tool
Projects
• Endangered Language Query Rooms
• Digital Online Curation Services for Endangered Language Archives (DOCS)
• Wordlist Tool
• LangGator
Potawatomi Query Room
Re: Bozho
by Donald Perrot (host) on July 9 2004, 8:53 PM
Nmedagwe'ndan e'gi nebye'ge'yen ngom. Neaseno ndesh ne kas ge' nin, mine E'shkanabe' e'nda ge' nin.
[I like what you have written. I am called Neaseno (Southwind) myself, and I live in Escanaba, MI.]
Re: Bozho
by Justin Neely on September 7 2004, 1:16 PM
Bozho Neaseno mine Lameen Zagnenibi ndeznekas. Nishnabe ndaw ipi Bodewadmi ndaw. Shi shi ban nee yek ndebendagwes. Zego ndotem. Kansas City,Mo ndoch bya. Eskanabe edayen ge nin. Bama pi ngom Zagnenibi ndeznekas
[Hello Neaseno and Lameen my name is Zagnenibi. I’m Native and Potawatomi. I belong to the Citizen Band. I’m Crane Clan. I’m from Kansas City, Missouri. I also live in Escanaba. Bye for now, Zagnenibi.]
Taking Conversational Risks
by [TL] on July 17 2004, 10:30 AM
mbesuk onago ngi zhyamen . nseze wgi bye tot i jiman ewi nepamshkamen be gishek. wabek nuwi zhya men ibe eje shna mbesuk . ngi wabmak gode chemokmanuk demojgewat. wabek nin gezhe ni demojgeyan gnebech. bama mine mtego
[I went to the lake yesterday. My brother brought a canoe so we could float around all day. Tomorrow we’ll go there to the lake. I saw the white folks fishing. Tomorrow I’ll fish too, maybe. So long for now, Mtego.]
Re: onago egi zhejkeyak
by [JN] on July 17 2004, 8:12 PM
mbesek ndazhya ngom. Mbish ksenyak shode. Nedwendan ode Mbish gshatek. Megwa Nwinebyege ode bodewadmi kiktowenen bama. Megwetch Zagnenibi nin se
[I should go to the lake today. The water is cold here. I wish the water were warm. I’ll write more of this Potawatomi conversation later. Thanks, yours truly Zagnenibi.]
Factors in query room success
Factor              Nias                Potawatomi
Speech community    500,000             ~25 native
Robust use          On Nias             Nowhere
Diaspora            Indonesia, West     US, Ontario
Internet access     Only in diaspora    US-normal
Online community    Preexisting         Preexisting
Rooms requested     By speaker          By speaker
DOCS Project
• Digital Online Curation Services for Endangered Language Archives
• Many small language archives are beginning to digitize their materials
• Lack technical infrastructure to bring resources online
• Goal is to provide access through Rosetta
DOCS Project Archives
• Endangered Language Fund (ELF)
• Survey for California and Other American Indian Languages (SCOIL)
• The Alaska Native Languages Center (ANLC)
• Max-Planck Institute for Evolutionary Anthropology (Leipzig)
Wordlist Tool
• Swadesh lists (100, 200, 207 terms) from:
– Tryon's Comparative Austronesian Dictionary (rekeyed)
– Tim Usher's Indo-Pacific database (2002 version)
– Paul Whitehouse's Australian and New Guinea database (2002 version)
– George Starostin's Dravidian database
– Ilya Peiros' Mon Khmer database
• Total of 1,384 languages, 3,090 lists online
• Additional 3000 lists, up to 1850 terms per list, most 300-500 words in length.
LangGator
• A linguistic “Wayback Machine”
• Language resource location and aggregation
– Use alternate language names, spellings
    • Deutsch, Hochdeutsch, High German, Allemande
    • Fadicca, Fadicha, Fedija, Fadija, Fiadidja, Fiyadikkya, and Fedicca
– Character identification (inventory, distribution)
    • Dera (Chadic, Nigeria)
    • Dera (Trans-New Guinea, Indonesia)
– Seed crawler with Wordlist terms (see previous slide), weighted towards longer terms
• Archiving through Internet Archive
• Serve results through the Rosetta site
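The name handling and seed weighting described on the LangGator slide could look roughly like the Python sketch below. It is only an illustration, not the project's code: the function names, the choice to key each group by its first listed name, and the use of plain term length as the weight are all assumptions. The names themselves are taken from the slide.

```python
# Alternate names copied from the slide. Keying each group by its first
# name is an assumption made only for this sketch.
NAME_GROUPS = [
    ["Deutsch", "Hochdeutsch", "High German", "Allemande"],
    ["Fadicca", "Fadicha", "Fedija", "Fadija", "Fiadidja", "Fiyadikkya", "Fedicca"],
]

# Two distinct languages that share one name (family, country), from the slide.
HOMONYMS = {
    "Dera": [("Chadic", "Nigeria"), ("Trans-New Guinea", "Indonesia")],
}


def build_name_index(groups):
    """Map every lowercased alternate name to the first name in its group."""
    index = {}
    for group in groups:
        for name in group:
            index[name.lower()] = group[0]
    return index


def candidates_for(name):
    """Return the (family, country) candidates when a name is ambiguous, else []."""
    candidates = HOMONYMS.get(name, [])
    return candidates if len(candidates) > 1 else []


def weight_seed_terms(terms):
    """Order crawler seed terms so that longer terms come first.

    Using term length as the weight is an assumption; the slide only says
    the seeds are "weighted towards longer terms".
    """
    return sorted(terms, key=lambda term: (-len(term), term))


if __name__ == "__main__":
    index = build_name_index(NAME_GROUPS)
    print(index["hochdeutsch"])
    print(candidates_for("Dera"))
    print(weight_seed_terms(["ik", "wasser", "hund"]))
```

A name such as Dera cannot be resolved from the name alone, which is where other evidence such as the character inventory of a page would have to come in.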
Collaborations
• Electronic Metastructure for Endangered Languages Data (E-MELD)
• General Ontology for Linguistic Description (GOLD)
• Open Language Archives Community (OLAC)
E-MELD
• Electronic Metastructure for Endangered Language Data
• School of Best Practice http://emeld.org/school/index.html
– Guidelines and examples for putting linguistic data into best practice digital formats
– XML with XML Schema or DTD
– Mapping terminology to ontology (GOLD)
• FIELD lexical database tool http://emeld.org/tools/field/beta/
– Online collaborative tool to build linguistic dictionaries, backed by ontology (GOLD)
GOLD
• General Ontology for Linguistic Description
• Built in OWL (Web Ontology Language), linked to SUMO (Suggested Upper Merged Ontology)
• Best practice resources should include a mapping between the researcher’s terms, and a standard set, known as the ‘profile’
– ‘independent’ (mine) = ‘main clause’ (GOLD)
– ‘obviative’ (mine) = ‘fourth person’ (GOLD)
• The standard terminology set can then allow sophisticated searches across disparate resources.
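As a rough illustration of how a profile lets one search span resources that use different terminology, here is a minimal Python sketch. The two term pairs come from the slide. The resource names, the 'passive' annotation, and the functions are hypothetical, and a real GOLD profile would be expressed against the ontology rather than as a plain dictionary.

```python
# Profile: researcher's term -> standard term (the two pairs from the slide).
PROFILE = {
    "independent": "main clause",
    "obviative": "fourth person",
}

# Hypothetical resources: name -> (profile, annotation terms used in the data).
RESOURCES = {
    "resource_a": (PROFILE, ["independent", "obviative"]),
    "resource_b": ({}, ["main clause"]),  # already uses the standard term
    "resource_c": ({}, ["passive"]),      # illustrative term, not from the slide
}


def to_standard(term, profile):
    """Translate a researcher's term to the standard term, if the profile maps it."""
    return profile.get(term, term)


def search(resources, standard_term):
    """Return the names of resources that use the standard term after mapping."""
    hits = []
    for name, (profile, annotations) in resources.items():
        mapped = {to_standard(term, profile) for term in annotations}
        if standard_term in mapped:
            hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    print(search(RESOURCES, "main clause"))
```

Searching for 'main clause' finds both the resource that says 'independent' and the one that already uses the standard term, without either author having to rename anything in the data.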
OLAC
• Open Language Archives Community
• Set of 23 metadata elements and controlled vocabularies (based on Dublin Core)
– Subject.language (language described, rather than audience language) uses SIL language codes
– Type.linguistic (grammar, lexicon, text)
– IMDI (ISLE Metadata Initiative) has 135 elements
• Recommended extensions (Discourse Types, Linguistic Field, Participant roles)
• Enables searches across a network of archives that use OLAC metadata http://www.language-archives.org/tools/search/
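To make the search idea concrete, the Python sketch below filters a handful of records on the two elements named on the slide. It is only a sketch under stated assumptions: the records, titles, and function are hypothetical, the language codes are placeholders rather than real SIL code assignments, and actual OLAC records are XML documents, not Python dictionaries.

```python
# Hypothetical records pooled from several archives. "AAA" and "BBB" are
# placeholder codes, not meant as real SIL code assignments.
RECORDS = [
    {"title": "record 1", "Subject.language": "AAA", "Type.linguistic": "lexicon"},
    {"title": "record 2", "Subject.language": "AAA", "Type.linguistic": "grammar"},
    {"title": "record 3", "Subject.language": "BBB", "Type.linguistic": "text"},
]


def find_records(records, language=None, linguistic_type=None):
    """Return titles of records matching the given element values.

    A filter left as None is not applied.
    """
    hits = []
    for record in records:
        if language is not None and record.get("Subject.language") != language:
            continue
        if linguistic_type is not None and record.get("Type.linguistic") != linguistic_type:
            continue
        hits.append(record["title"])
    return hits


if __name__ == "__main__":
    print(find_records(RECORDS, language="AAA"))
    print(find_records(RECORDS, language="AAA", linguistic_type="grammar"))
```

Because every archive uses the same element names and the same code set, one query of this kind can run unchanged over records from all of them.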
URLs
• Electronic Metastructure for Endangered Language Data (E-MELD) http://www.emeld.org (School of Best Practice, FIELD Tool).
• Endangered Language Query Rooms http://rosettaproject.org:8080/emeldbase/.
• The Ethnologue http://www.ethnologue.com.
• General Ontology for Linguistic Description (GOLD) http://www.linguistics-ontology.org.
• ISLE MetaData Initiative (IMDI) http://www.mpi.nl/IMDI/.
• National Science Digital Library (NSDL) http://nsdl.org
• Open Language Archives Community (OLAC) http://www.language-archives.org.
• The Rosetta Project, http://www.rosettaproject.org/live. A preview of the new Web site (currently under construction) is available at http://preview.rosettaproject.org.