MPI TLA DIN A0 2011-11-16

The Language Archive
Max Planck Institute for Psycholinguistics

Language Data, Experts, Collaboration, Tools, Projects

TLA's Mission
• digitize and archive language resources
• support access to language resources
• develop tools, services and infrastructures
• set up regional archives worldwide
• organize education and training activities
• give help and support

The Language Archive
Max Planck Institute for Psycholinguistics
P.O. Box 310, 6500 AH Nijmegen
Wundtlaan 1, 6525 XD Nijmegen
The Netherlands
Phone: (+31) (0)24 - 352 19 11
Fax: (+31) (0)24 - 352 12 13
Email: [email protected]
www.mpi.nl/tla

State of the Archive
• 60+ terabytes, 500,000+ files
• 73,000+ metadata sessions
• 20,000+ hours of audio/video recordings
• 60,000+ annotation files
• 4.5 million+ annotated segments
• 45+ lexica
• speech, multimodal, acquisition, multilingual, language and cognition, brain imaging, ethnological and other data

TLA is jointly funded by the Max-Planck-Society, the Berlin-Brandenburg Academy of Sciences and the Royal Netherlands Academy of Arts and Sciences, with substantial contributions by the Volkswagen Foundation, the European Commission, the German Ministry for Education and Research, the Dutch Science Foundation and the Max Planck Institute for Psycholinguistics.

November 2011

Language Archiving Technology (LAT)

TLA builds on a large archive of language resources, including primary data (multimedia recordings), secondary data (annotations, lexica, comments, etc.) and metadata. To guard against loss, the Archive is copied to various locations, including a growing number of regional archives, in a way that preserves relations, contexts and provenance information.

Keeping the data interpretable in the long run requires adherence to standards and a continuous curation procedure.
Access to the data follows the spirit of the Live Archives idea and is regulated by a code of conduct and other agreements. It is guaranteed to those who hold access permissions for the individual resources; depositors define these permissions at one of four levels, ranging from fully open to closed.

Besides the fieldwork data of about 60 DOBES projects, TLA continues to digitize and archive an increasing amount of other language-related data. The archive currently holds data on more than 200 languages.

Archive Technology

The LAT software suite, started in 2000 with the multimedia annotation tool ELAN and the IMDI metadata infrastructure, covers about 15 components and tools. It is continuously being debugged, adapted and extended. It includes:
• tools for Resource Creation & Organization (ELAN, LEXUS, IMDI/CMDI, ARBIL, AV Recognizers)
• tools for Management, Upload & Infrastructure (LAMUS, IMDI/CMDI, AMS, COSIX, HANDLE, REPLIX)
• tools for basic and complex resource access (IMDI/CMDI, VLO, ANNEX, IMEX, LEXUS, GIS, TROVA, VICOS)

Copies of the Archive:
• 2 Computer Centers in Munich (one from MPG)
• 2 Computer Centers in Göttingen (one from MPG)
• 2 Copies at MPI Nijmegen

Activities

TLA is involved in a number of initiatives devoted to archiving digital language data, to improving the technologies used to create, manage and access language data, and to building infrastructures that facilitate cross-institutional and cross-corpus access. The resulting infrastructures will allow researchers to build virtual collections and workflows, improving data access in the direction of eHumanities usage scenarios. TLA also contributes to standardization in ISO TC37/SC4 (www.tc37sc4.org), which aims at facilitating interoperability in the language resources domain.

Past Projects: MUMIS, INTERA, ISLE, LIRICS, DAM-LR (all EC), CGN (NWO), HARVE, INTER, ROR (all MPG), REPLIX, (DEISA, CLARIN-EU).
Running Projects: DOBES (VWS), CLARIN (NL, DE), DASISH, INNET, CLARA, EUDAT (all EC), AVATecH, (MPG-FhG), RELISH (DFG/NEH).
Data Life Cycle Support (diagram labels)
Phases: preparation, integration, utilization
• RELcat / ISOcat: Ontology management framework
• Archive federation Infrastructures
• Data Archiving and Copying
• IMDI / CMDI / GIS / VLO: Metadata Browsing & Searching
• IMDI / CMDI / ARBIL: Data Organization, Metadata Description
• ELAN / LEXUS: Annotation + Lexicon
• ANNEX / LEXUS / IMEX / TROVA: Complex Access via Web
• VICOS: Semantic Access and Enrichment
• LAMUS: Data Uploading and Management
• Access Management

DOBES: Dokumentation Bedrohter Sprachen / Documentation of Endangered Languages

Languages and language groups (map labels): DéĮine, Beaver, Hoocąk, Wichita, Chontal, Lacandón, Aikanã/Kwazá, Tsafiki, People of the Center, Cashinahua, Baure, Movima, Yuracaré, Uru-Chipaya, Chaco Languages, Marquesan, Tuamotuan, Minderico, Bainouk, Laal, Beezen, Bubia / Isubu, Bakola, Tima, Oyda, ǂAkhoe Haiǁom, Taa, Lower Sorbian, Kola-Sámi, Enets / Nenets, Svan / Udi / Tsova-Tush, Gorani, Khinalug, Semoq Beri / Batek, Semang, Totoli, Waima‘a, Wooi, Teop, Saliba / Logea, Savosavo, Vurës / Vera‘a, Iwaidja, Jaminjung, Nen/Tonda, Ambrym Languages, Tofa, Even, Salar / Monguor, Chintang / Puma, Tangsa / Tai / Singpho, Kurumba Languages, Sri Lanka Malay, Katxuyana, Mawé, Trumai, Kuikuro, Awetí, Bakairí, Ache

Legend: Regional archives, DOBES, MPI Archive

