The Rosetta Project ALL Language Archive. A Project of the Long Now Foundation & A National Science Digital Library. www.rosettaproject.org. Presented by: Laura Buszard-Welcher, The Rosetta Project / University of California, Berkeley

Upload: silvester-gibbs

Post on 26-Dec-2015


The Rosetta Project ALL Language Archive

A Project of the Long Now Foundation & A National Science Digital Library

www.rosettaproject.org

Presented by:

Laura Buszard-Welcher

The Rosetta Project / University of California, Berkeley

Primary Goals
• Support the documentation of the world’s nearly 7000 languages through building
– A digital archive of language documentation
– A linguistically sophisticated site that is also useful and interesting for the general public
– Networks of speakers, educators, linguists
• Contributes to the effort to document endangered languages
• Promotes linguistic diversity by educating the public about languages with small numbers of speakers.

Secondary Goals
• Support metadata standardization and interoperability
– OLAC
– EMELD
• Develop tools for collaborative linguistic research
– Endangered Language Query Room
– Wordlist Tool
– Collaborative document editing/creation (new site)

Roles
• The Long Now Foundation
– Parent organization of The Rosetta Project
– Projects, seminars on topics that foster long term thinking
• The National Science Digital Library
– U.S. National Science Foundation Program
– Goal is to bring online high quality STEM (Science, Technology, Engineering, and Math) resources for education
– Sponsor of Rosetta Project (NSF 333727)
• Stanford University
– Online and offline storage of Rosetta materials

The Long Now Foundation

The National Science Digital Library

Stanford University Libraries

Project History: The 1000 Language Archive
• Initiated by The Long Now Foundation
• Wanted to experiment with new microetching technology, looking for suitable content
• Decided to collect basic descriptive information for 1000 of the world’s approximately 7000 languages

Why language information?
• Most natural human languages are products of millennia of human history (therefore a good long term thinking project)
• Repositories of cultural information
• Languages showcase
– Human intellectual sophistication
– Cultural diversity
• To draw attention to the critical issue of language endangerment

The Rosetta Disk
• Next generation microfiche
• Micro-etched 2" nickel disk at densities of up to 200,000 page images per disk
• Developed by Los Alamos Laboratories and Norsam Technologies
• Reading the disk requires a microscope, either optical or electron, depending on the density of encoding

The Rosetta Stone
• Not us! (196 BC)
• Parallel text written in three scripts:
– Hieroglyphic
– Demotic (script form)
– Greek
• The key to deciphering Egyptian Hieroglyphs

Rosetta Stone Language Learning Software

(Also not us!)

Design of the Disk

• Original design has human-eye readable text (Genesis text) and micro-etched text inside an index

• New design has human-eye readable text (instructions) on one side and microetched images on the reverse

In-House Scanning
• HP CapShare Scanners
• Scan printed page in multiple passes, any direction
• Page is ‘assembled’ into one image
• Stores about 50 pages at a time (300 dpi bitmap .tif)
• Uploads numerically sequenced images to computer by infrared port

In-House Scanning
• Minolta PS 7000 Overhead
• Bitmap and grayscale scans up to 600 dpi
• Multiple sizes, orientations
• Single page / double page spread (good for text collections with verso annotations)
• Best for fragile books, manuscripts that would be damaged by hand scanning

Categories of Collection (1)
Ethnologue description: General information from www.ethnologue.com about language affiliation, where spoken, number of speakers, dialects, alternate language names.
General description: General description of the language. Origin and current distribution of language, number of speakers, family, typology, history, etc.
Maps: Maps of the geographic distribution of a language and its relationship to other languages in the region.
Orthography: Writing system(s) of the language with any accompanying guide to pronunciation, use, etc.
Phonology: A description of the basic sound units in a language (phonemes) and how they combine to form utterances.

Categories of Collection (2)
Grammar: How a language combines the smallest units of meaning (morphemes) to create words and words to create sentences.
Core Word Lists: A common word list of 100 or 200 terms typically collected in linguistic fieldwork (“Swadesh Lists”), often used for comparative purposes.
Numbers: A description of the numbering system(s) in a language with a list of basic terms.
Parallel Texts: A common text with translation for each language. Initially Genesis Chapters 1-3 (a commonly collected text). Now also the UN Declaration of Human Rights.
Glossed Texts: Transcribed indigenous texts with word glosses, free translations and grammatical markup.

Language Curation

Ethnologue Description
General Description (1651)
Maps (376)
Orthography (1052)
Phonology (1731)
Grammar (1167)
Core Word Lists (3098)
Numbering Systems (215)
Main Parallel Texts (1109)
Glossed Vernacular Texts (869)

Rosetta Project Web Site

• Welcome

• Search for a language

• Language overview page

• Browse (by name, family, country)

• Wordlist tool

Welcome

Search

Language Overview

Browse

Projects

• Endangered Language Query Rooms

• Digital Online Curation Services for Endangered Language Archives (DOCS)

• Wordlist Tool

• LangGator

Endangered Language Query Rooms

http://emeld.rosettaproject.org/

Query Room Virtual Keyboard

Potawatomi Query Room

Re: Bozho
by Donald Perrot (host) on July 9 2004, 8:53 PM
Nmedagwe'ndan e'gi nebye'ge'yen ngom. Neaseno ndesh ne kas ge' nin, mine E'shkanabe' e'nda ge' nin.
[I like what you have written. I am called Neaseno (Southwind) myself, and I live in Escanaba, MI.]

Re: Bozho
by Justin Neely on September 7 2004, 1:16 PM
Bozho Neaseno mine Lameen Zagnenibi ndeznekas. Nishnabe ndaw ipi Bodewadmi ndaw. Shi shi ban nee yek ndebendagwes. Zego ndotem. Kansas City,Mo ndoch bya. Eskanabe edayen ge nin. Bama pi ngom Zagnenibi ndeznekas
[Hello Neaseno and Lameen my name is Zagnenibi. I’m Native and Potawatomi. I belong to the Citizen Band. I’m Crane Clan. I’m from Kansas City, Missouri. I also live in Escanaba. Bye for now, Zagnenibi.]

Taking Conversational Risks
by [TL] on July 17 2004, 10:30 AM
mbesuk onago ngi zhyamen . nseze wgi bye tot i jiman ewi nepamshkamen be gishek. wabek nuwi zhya men ibe eje shna mbesuk . ngi wabmak gode chemokmanuk demojgewat. wabek nin gezhe ni demojgeyan gnebech. bama mine mtego
[I went to the lake yesterday. My brother brought a canoe so we could float around all day. Tomorrow we’ll go there to the lake. I saw the white folks fishing. Tomorrow I’ll fish too, maybe. So long for now, Mtego.]

Re: onago egi zhejkeyak
by [JN] on July 17 2004, 8:12 PM
mbesek ndazhya ngom. Mbish ksenyak shode. Nedwendan ode Mbish gshatek. Megwa Nwinebyege ode bodewadmi kiktowenen bama. Megwetch Zagnenibi nin se
[I should go to the lake today. The water is cold here. I wish the water were warm. I’ll write more of this Potawatomi conversation later. Thanks, yours truly Zagnenibi.]

Factors in query room success

                    Nias                Potawatomi
Speech community    500,000             ~25 native
Robust use          On Nias             Nowhere
Diaspora            Indonesia, West     US, Ontario
Internet access     Only in diaspora    US-normal
Online community    Preexisting         Preexisting
Rooms requested     By speaker          By speaker

DOCS Project
• Digital Online Curation Services for Endangered Language Archives
• Many small language archives are beginning to digitize their materials
• Lack technical infrastructure to bring resources online
• Goal is to provide access through Rosetta

DOCS Project Archives
• Endangered Language Fund (ELF)

• Survey for California and Other American Indian Languages (SCOIL)

• The Alaska Native Languages Center (ANLC)

• Max-Planck Institute for Evolutionary Anthropology (Leipzig)

Wordlist Tool
• Swadesh lists (100, 200, 207 terms) from:
– Tryon's Comparative Austronesian Dictionary (rekeyed)
– Tim Usher's Indo-Pacific database (2002 version)
– Paul Whitehouse's Australian and New Guinea database (2002 version)
– George Starostin's Dravidian database
– Ilya Peiros' Mon Khmer database
• Total of 1,384 languages, 3,090 lists online
• Additional 3000 lists, up to 1850 terms per list, most 300-500 words in length.

LangGator
• A linguistic “Wayback Machine”
• Language resource location and aggregation
– Use alternate language names, spellings
  • Deutsch, Hochdeutsch, High German, Allemande
  • Fadicca, Fadicha, Fedija, Fadija, Fiadidja, Fiyadikkya, and Fedicca
– Character identification (inventory, distribution)
  • Dera (Chadic, Nigeria)
  • Dera (Trans-New Guinea, Indonesia)
– Seed crawler with Wordlist terms (see previous slide), weighted towards longer terms
• Archiving through Internet Archive
• Serve results through the Rosetta site
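
The name-variant and seed-term ideas on this slide can be sketched roughly as below. This is only an illustration, not LangGator's actual code: the function names, the canonical key "German", the sample German words, and the hard "longest first" ranking (standing in for "weighted towards longer terms") are all assumptions.

```python
# Illustrative sketch only; not the LangGator implementation.
# Alternate names are taken from the slide; the canonical key is an assumption.
ALTERNATE_NAMES = {
    "German": ["Deutsch", "Hochdeutsch", "High German", "Allemande"],
}


def name_variants(language):
    """Return the language name followed by its known alternate names."""
    return [language] + ALTERNATE_NAMES.get(language, [])


def rank_seed_terms(terms, k):
    """Pick k distinct seed terms, longest first (ties broken alphabetically)."""
    return sorted(set(terms), key=lambda t: (-len(t), t))[:k]


def build_queries(language, wordlist, k=3):
    """Build one search query per name variant, each carrying the seed terms."""
    seeds = " ".join(rank_seed_terms(wordlist, k))
    return ['"{}" {}'.format(name, seeds) for name in name_variants(language)]


# Sample wordlist terms (ordinary German words, chosen for illustration).
words = ["ich", "Wasser", "Hund", "Stein"]
for query in build_queries("German", words, k=2):
    print(query)
```

Longer terms are preferred here on the assumption that they are less likely to collide with words in unrelated languages, which is one plausible reading of the slide.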

Collaborations

• Electronic Metastructure for Endangered Languages Data (E-MELD)

• General Ontology for Linguistic Description (GOLD)

• Open Language Archives Community (OLAC)

E-MELD
• Electronic Metastructure for Endangered Language Data
• School of Best Practice http://emeld.org/school/index.html
– Guidelines and examples for putting linguistic data into best practice digital formats
– XML with XML Schema or DTD
– Mapping terminology to ontology (GOLD)
• FIELD lexical database tool http://emeld.org/tools/field/beta/
– Online collaborative tool to build linguistic dictionaries, backed by ontology (GOLD)
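
As a rough illustration of what "XML with terminology mapped to an ontology" might look like, the sketch below builds one lexical entry with Python's standard library. The element and attribute names (entry, headword, gloss, category, standard) are hypothetical and do not come from E-MELD or the FIELD tool, and the headword and gloss are placeholders.

```python
import xml.etree.ElementTree as ET


def make_entry(headword, gloss, local_term, standard_term):
    """Build a lexical entry whose local grammatical term points at a standard term."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "gloss", {"lang": "en"}).text = gloss
    # The local label stays as the element text; the standard term is an attribute.
    ET.SubElement(entry, "category", {"standard": standard_term}).text = local_term
    return entry


# Placeholder headword and gloss; the term pair is the one used on the GOLD slide.
entry = make_entry("HEADWORD", "GLOSS", "obviative", "fourth person")
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Keeping the researcher's own label alongside the standard term means nothing is lost in the mapping, while a tool that only knows the standard vocabulary can still find the entry.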

GOLD
• General Ontology for Linguistic Description
• Built in OWL (Web Ontology Language), linked to SUMO (Suggested Upper Merged Ontology)
• Best practice resources should include a mapping between the researcher’s terms and a standard set, known as the ‘profile’
– ‘independent’ (mine) = ‘main clause’ (GOLD)
– ‘obviative’ (mine) = ‘fourth person’ (GOLD)
• The standard terminology set can then allow sophisticated searches across disparate resources.
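
A minimal sketch of how such a profile could support search across resources, assuming each resource carries its own term mapping. The first profile uses the two pairs from the slide; the second profile, the resource names, and the function names are invented for illustration and are not part of GOLD.

```python
# Profile from the slide: researcher's term -> standard (GOLD) term.
PROFILE_MINE = {
    "independent": "main clause",
    "obviative": "fourth person",
}

# A second, invented profile, to show two resources with different local terms.
PROFILE_OTHER = {
    "4th person": "fourth person",
}


def normalize(tags, profile):
    """Translate local tags into standard terms; unmapped tags pass through."""
    return {profile.get(tag, tag) for tag in tags}


def search(resources, standard_term):
    """Return the names of resources whose normalized tags include the term."""
    return [
        name
        for name, (tags, profile) in resources.items()
        if standard_term in normalize(tags, profile)
    ]


# Hypothetical resources: name -> (local tags, profile used by that resource).
resources = {
    "my_grammar_notes": (["obviative", "independent"], PROFILE_MINE),
    "other_text_corpus": (["4th person"], PROFILE_OTHER),
}

print(search(resources, "fourth person"))
```

A single query on the standard term finds both resources even though neither uses that term locally, which is the kind of search across disparate resources the slide describes.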

GOLD Community Model

OLAC
• Open Language Archives Community
• Set of 23 metadata elements and controlled vocabularies (based on Dublin Core)
– Subject.language (language described, rather than audience language) uses SIL language codes
– Type.linguistic (grammar, lexicon, text)
– IMDI (ISLE Metadata Initiative) has 135 elements
• Recommended extensions (Discourse Types, Linguistic Field, Participant roles)
• Enables searches across a network of archives that use OLAC metadata http://www.language-archives.org/tools/search/
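
The value of shared metadata elements can be sketched as follows: if every archive describes its holdings with the same two fields, one filter works across all of them. This is an assumption-laden toy, not the OLAC schema or its search service. The Record class, the archive names, and the titles are hypothetical, and the language codes "xxa" and "xxb" are placeholders rather than real SIL codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    archive: str            # which archive holds the item (hypothetical names)
    title: str
    subject_language: str   # language described, as a code (placeholders here)
    type_linguistic: str    # "grammar", "lexicon", or "text"


def search(records, language=None, ling_type=None):
    """Filter records from any archive on the shared metadata fields."""
    return [
        r
        for r in records
        if (language is None or r.subject_language == language)
        and (ling_type is None or r.type_linguistic == ling_type)
    ]


records = [
    Record("Archive A", "Sketch grammar", "xxa", "grammar"),
    Record("Archive B", "Word list", "xxa", "lexicon"),
    Record("Archive B", "Collected stories", "xxb", "text"),
]

for hit in search(records, language="xxa"):
    print(hit.archive, "-", hit.title)
```

Because both archives fill in the same fields with values from a controlled vocabulary, the query does not need to know anything archive-specific.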

URLs
• Electronic Metastructure for Endangered Language Data (E-MELD) http://www.emeld.org (School of Best Practice, FIELD Tool)
• Endangered Language Query Rooms http://rosettaproject.org:8080/emeldbase/
• The Ethnologue http://www.ethnologue.com
• General Ontology for Linguistic Description (GOLD) http://www.linguistics-ontology.org
• ISLE MetaData Initiative (IMDI) http://www.mpi.nl/IMDI/
• National Science Digital Library (NSDL) http://nsdl.org
• Open Language Archives Community (OLAC) http://www.language-archives.org
• The Rosetta Project, http://www.rosettaproject.org/live. A preview of the new Web site (currently under construction) is available at http://preview.rosettaproject.org.

Credits
• This project is funded by the US National Science Digital Library (NSF 333727)