1
David Nathan
Endangered Languages Archive
SOAS University of London
LingDy, Feb 15, 2013
ELAR and Digital Archiving for Documentation of Endangered Languages
2
What is a digital language archive?
a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material
has policies and processes for acquiring, cataloguing, preserving, disseminating, and migrating (updating formats)
a platform for building and supporting relationships between data providers and data users
4
Why is language archiving different?
what is a language? unlike business data, it is not conventionalised (like $, age, year of publication etc) – what and how to code?
varying and competing expectations
5
And endangered languages archiving?
extremely diverse context – languages, cultures, communities, individuals, projects
typical source - fieldworkers
typical materials - documentation
difficult for archive staff to manage sensitivities and restrictions
6
What can a language archive offer?
Security - keep your electronic materials safe
Preservation - store your materials for the long term
Discovery - help others to find out about your materials, and you to find out about users
Protocols - respect and implement sensitivities, restrictions
Sharing - share results of your work, if appropriate
Acknowledgement - create citable acknowledgement
Mobilisation - create usable language materials
Quality and standards - advice for assuring your materials are of the highest quality and robust standards
7
There are different kinds of language archives
from local to global - different coverage, contexts, methods, collection policies
consider placing your materials in more than one …
there are also sites for aggregating different archives’ holdings, eg Virtual Language Observatory, OLAC
8
Why digital?
preservation: digitisation is the only way that audio and video (non-symbolic material) can be preserved for the future … because it can be copied and transmitted with zero loss
also good for cataloguing, sharing, dissemination, repurposing
9
Digital disadvantages
digital data is fragile and ephemeral
cost (human, equipment, maintenance)
requires strategy and luck to get right
preservation depends on file and data formats
formats depend on tools and software
some formats require particular software (can we archive the software?)
formats: prefer standard, stable, open, explicit, long-lasting
some materials may have to be ‘migrated’
10
What do depositors have to do?
select and contact an archive
prepare materials
  select
  structure
  suitable encodings and formats
complete metadata, metadocumentation, agreements
send materials to archive(s)
work with archive during curation etc
ongoing management, updating, dissemination
11
OAIS model
OAIS archives define three types of ‘packages’: ingestion, archive, dissemination
[Diagram: Producers → Ingestion → Archive → Dissemination → Designated communities]
12
ELAR - architecture
reduced boundaries between depositors, users and archive: users add, update content; negotiate access
[Diagram: Users & Producers and the Archive, linked by: request, give access, contribute, edit]
13
Redefining the digital EL archive
a platform for developing and conducting relationships between knowledge producers and knowledge users – a social networking archive
level the playing field between researchers and community members/other stakeholders
encourage, recognise and cater for diversity
14
Data management and archiving
use good data management practices whether or not you plan to archive materials
document decisions, steps, conventions, structures, encodings
appropriate and conventional data encoding methods (e.g. Unicode)
be explicit and consistent
plan for flowing data, working with others, across different systems (cf Bird and Simons, ‘Seven Dimensions of Portability’)
good data management practices will make a future archiving process easier and better
15
Users and potential users
depositors – deposit, access or update materials
speakers and their descendants
other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc
other “stakeholders”, eg educationalists, funders
journalists and the wider public
16
ELAR facts and figures
archived collections: ~200
online (published) collections: 150
average collection size: about 80 GB
online data bundles: ~25,000
online bundles access: unrestricted 10,000, restricted 15,000
total number of files held: around 200,000
total volume of files held: around 10 TB
registered users: ~800
annual number of website "hits": 230,000
17
ELAR facts and figures – users
increasing number of community members, including Aleut (Canada), Tai-Ahom, Wadar (India), Burushaski (Pakistan), Serrano, Cahuilla, Arapaho (USA), Iraqi Jewish (Iraq), Saami (Finland), Wabena (Tanzania), Torwali (Pakistan), Hani, Bai (China), Irish
comments: “I found your site while looking up my grandmother, and i found her on your site speaking our language. and i would love for my children her great grandchildren to hear our language coming from her.”
many interdisciplinary researchers, particularly archivists and anthropologists
19
Why is this important?
over 50% of the world’s 7000 languages:
  are endangered
  likely to cease to be spoken this century
  little or nothing known about the majority of them
language documentations and the archives that support, preserve, and disseminate them, will become the means of transmission of many languages
20
A perfect storm?
documentation methods expose sensitivities & vulnerabilities
documentation performed by and for linguists and “others”
“big data” – resources channeled to analysis, broader audiences
“open data” – push for unmoderated access
21
Protocol
the sensitivities and access restrictions associated with EL resources
need to be discussed, collected and recorded in the field
22
Protocol and access control
principles:
  granularity – file, bundle or collection
  access is a relation between object and user
  protocol values can be changed over time
ELAR’s URCS system
  User
  Researcher
  Community member
  Subscriber
23
ELAR’s protocol values
U – resource available to all registered users
R – resource available to users registered as researchers
C – resource available to users endorsed as members of relevant language community
S – resource available to users who have been given individual access rights for that resource
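The four protocol values, and the principle that access is a relation between object and user, can be illustrated with a minimal sketch. This is not ELAR's implementation: the class names, fields, and the choice to check each value independently (rather than treating the values as a hierarchy) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # All fields are hypothetical; ELAR's real user model is not described here.
    name: str
    registered: bool = True
    researcher: bool = False
    communities: set = field(default_factory=set)  # communities the user is endorsed for


@dataclass
class Resource:
    deposit: str
    protocol: str                                   # one of "U", "R", "C", "S"
    community: str = ""                             # relevant language community
    subscribers: set = field(default_factory=set)   # users given individual rights


def can_access(user: User, resource: Resource) -> bool:
    """Decide access from the (user, resource) pair, one rule per protocol value."""
    if not user.registered:
        return False
    if resource.protocol == "U":
        return True
    if resource.protocol == "R":
        return user.researcher
    if resource.protocol == "C":
        return resource.community in user.communities
    if resource.protocol == "S":
        return user.name in resource.subscribers
    raise ValueError(f"unknown protocol value: {resource.protocol!r}")


# Hypothetical example: a subscriber-only resource
visitor = User("xx")
restricted = Resource("example-deposit", "S")
before = can_access(visitor, restricted)   # no individual rights yet
restricted.subscribers.add("xx")           # depositor approves the application
after = can_access(visitor, restricted)
```

Because the protocol value lives on the resource and the rights are looked up per user, a depositor can change a value over time without touching the files themselves.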
27
User xx has just applied for access to restricted material in the deposit solega-107128. The following message was attached to the application:
"Hi [depositor],
Please delegate me for access to the material on Solegas."
Subscription application: formal
28
This email is to inform you that user xx's application for access to restricted material in the deposit musgrave2007tulehu has just been approved. The depositor included the following note to the user:
"The researcher is known to me personally and I know that his interest is legitimate."
Subscription response: formal
29
User xx has just applied for access to restricted material in the deposit budd2008beirebo. The following message was attached to the application:
"I'm xx. I like to learn Bislama language, but never heard what it sounds like. Am very curious "
Subscription application: “curious”
30
User xx has just applied for access to restricted material in the deposit verstraete2010paman. The following message was attached to the application:
"I am currently doing my masters in Linguistics and I'm researching on an endangered language in Malaysia. I would like to see a sample of the data from the fieldwork since I'm not use to this yet. I hope that I can gain more understanding in carrying out the fieldwork."
Subscription application: establish credentials and reason
31
This email is to inform you that user xx's application for access to restricted material in the deposit verstraete2010paman has just been rejected. The depositor included the following note:
"Dear xx,
I am sorry we cannot give you access to this deposit. The Lamalama community has asked us to restrict access to community members.
With best wishes,
[depositor]"
Subscription response: rejected, with reason
32
This email is to inform you that user xx’s application for access to restricted material in the deposit caballero2009raramuri has just been approved. The depositor included the following note to the user:
"Please let me know if you're looking for any specific materials or if you have any questions."
Subscription response: offering further help
33
This email is to inform you that user xx's application for access to restricted material in the deposit kunbarlang-389 has just been approved. The depositor included the following note to the user:
"Hi xx
I've approved your access to this collection, but you should know that there is an update in the material I've just deposited, with much more information on both music and texts. I'd be happy to give you access to that when it is processed.
Next time I come to London (October or November this year) I'd be happy to meet up if you would like to discuss."
Response: further info and offer to meet
34
What can you archive (at ELAR)?
media - audio, video
graphics - images, scans
texts - fieldnotes, grammars, description, analysis
structured data - aligned and annotated transcriptions, databases, lexica
metadata, metadocumentation - contextual information about the materials, both structured and unstructured
35
Archive objects
an “object” could be a file, a set of files, a directory, or a set of files with their relationships explicitly defined
like other archives, ELAR uses a set principle, which we call “bundles” (like DoBeS’ sessions)
See bundles at ELAR
36
Archive objects
[Diagram: ELAR holds Collections; each Collection contains Bundles; each Bundle contains Files]
37
What is required to make a deposit?
resource(s) for an endangered language
  it could be just one file
catalogue / metadata
  deposit form view
existing deposits can also be updated, added to, and metadata added/modified
38
Archive material should be selected
example: Depositor’s question: How much video can I archive?
answer: ...
39
How can I deliver data?
hard disks
  we return them
  we also send them out
flash cards and USB sticks
email
  good for samples for evaluation
  OK for most text materials
Dropbox etc
a web upload facility may be provided
one day we can download from your server
40
What about CDs and DVDs?
we have found CDs, and especially DVDs, to be very unreliable
  DVD fail rate > 10%
cause confusion as files are allocated to fit on disks, not according to corpus structure
create a lot of work for depositors and for ELAR
41
Express yourself - Metadata
metadata is data about data containers data about data
its functions
  for identification, management, retrieval of data
  provides the context and understanding of that data
  carries those understandings into the future, and to others
42
Express yourself - Metadata
metadata reflects the knowledge and practices of data providers
… and therefore defines and constrains audiences and usages for the data
all value-adding to recordings of events (annotations, transcriptions, translations, glosses, comments, interpretations, part of speech tagging etc) can be considered metadata
data and metadata lie on a spectrum and depend on how they are used rather than being absolutely different things
43
Express yourself - Metadata
distinguish between metadata scheme (eg set of categories) and the way that scheme is expressed

relational (filename: sessions.xls)

ID   audio          transcription
1    TRS00065.wav   bjt_02.txt
2    TRS00066.wav   krs_43.txt

tagged (filename: sessions.xml)

<sessions>
  <session id="1">
    <audio>TRS00065.wav</audio>
    <transcription>bjt_02.txt</transcription>
  </session>
  <session id="2">
    <audio>TRS00066.wav</audio>
    <transcription>krs_43.txt</transcription>
  </session>
</sessions>
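To make the scheme/expression distinction concrete, here is a minimal Python sketch that takes the same two sessions as rows (the relational form) and writes them as XML (the tagged form), then reads them back. The function names are illustrative and not part of any ELAR tool; only the standard library is used.

```python
import xml.etree.ElementTree as ET

# The relational expression: one row per session, same categories as the slide
rows = [
    {"ID": "1", "audio": "TRS00065.wav", "transcription": "bjt_02.txt"},
    {"ID": "2", "audio": "TRS00066.wav", "transcription": "krs_43.txt"},
]


def rows_to_xml(rows):
    """Express the rows as a <sessions> element tree (the tagged form)."""
    root = ET.Element("sessions")
    for row in rows:
        session = ET.SubElement(root, "session", id=row["ID"])
        ET.SubElement(session, "audio").text = row["audio"]
        ET.SubElement(session, "transcription").text = row["transcription"]
    return root


def xml_to_rows(root):
    """Recover the rows from the tagged form."""
    return [
        {
            "ID": session.get("id"),
            "audio": session.findtext("audio"),
            "transcription": session.findtext("transcription"),
        }
        for session in root.findall("session")
    ]


xml_text = ET.tostring(rows_to_xml(rows), encoding="unicode")
recovered = xml_to_rows(ET.fromstring(xml_text))
```

The round trip returns the original rows, which is the point: the scheme (ID, audio, transcription) is unchanged, and only its expression differs.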
45
Express yourself - Metadata
example: you could choose categories from OLAC, IMDI etc schemes or formulate your own
this would be a scheme of logical categories (speaker, location, date etc)
you could express these in different language(s)
you could structure the categories and values in different ways, eg as spreadsheet, database, XML
46
Express yourself - Metadata
you need to choose
  a set of metadata categories applying across whole collection
  + metadata categories that apply to particular types of objects (eg transcriptions, video), or to individual objects
  + ways of expressing and encoding all that metadata
50
Potential sources of metadata
deposit form
spreadsheets
MS Word tables, CSV etc
IMDI and OLAC XML files
custom XML
notes, correspondence and reports
filenames
direct input to ELAR interface
audio files
images (/captions)
meta-metadata files
About 80% of most frequently occurring categories can be mapped to OLAC
count  term              OLAC
20     language          Subject.language
17     date              Date
17     description       Description
16     id                Identifier
16     speaker           Contributor
16     title             Title
15     format            Format
13     type              Type
12     creator           Creator
12     file name         Identifier
12     notes             -
11     rights            Rights
10     duration          Coverage
9      content           Description
9      contributor       Contributor
9      name              Contributor
9      relation          Relation
8      age               -
8      comment           -
8      genre             Type.linguistic
8      subject.language  Subject.language
7      date recorded     Date
7      document 1        -
7      gender            -
7      place             Coverage
6      directory         Identifier
5      location          Coverage
5      rec_date          Date
5      recorder          Contributor
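A mapping like the one tabulated above can be applied mechanically. The sketch below is illustrative only: the dictionary holds a subset of the term pairs from the table, while the function name and the sample record values are hypothetical. Terms with no OLAC equivalent are kept separately rather than discarded.

```python
# Subset of the depositor-term -> OLAC-term pairs listed in the table
OLAC_MAP = {
    "language": "Subject.language",
    "date": "Date",
    "speaker": "Contributor",
    "file name": "Identifier",
    "genre": "Type.linguistic",
    "place": "Coverage",
    "rec_date": "Date",
    "recorder": "Contributor",
}


def map_record(record):
    """Split a depositor's record into OLAC-mapped and unmapped categories."""
    mapped, unmapped = {}, {}
    for term, value in record.items():
        olac_term = OLAC_MAP.get(term.strip().lower())
        if olac_term is None:
            unmapped[term] = value          # keep it: it may matter to users
        else:
            # several depositor terms can map to one OLAC term
            mapped.setdefault(olac_term, []).append(value)
    return mapped, unmapped


# Hypothetical record using depositor-chosen category names
record = {"Speaker": "xx", "recorder": "yy", "rec_date": "2013-02-15", "gender": "f"}
mapped, unmapped = map_record(record)
```

Here `speaker` and `recorder` both land under `Contributor`, and `gender` stays in the unmapped set, mirroring the roughly 20% of categories the table leaves without an OLAC term.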
53
Depositors also add categories such as:
detailed locations
metadata in Spanish
indigenous genres and titles (eg of songs)
parents’ and spouse’s mother tongues, birthplaces
number of children, their language competence
L2, L3 and competencies
languages heard
clan/moiety
occupation
education level
54
… more metadata:
date left home country
photos (/captions) of consultants, field sessions etc
equipment
microphone
workflow status
naming and organisational codes and principles
recorder/linguist experience level
biography and project description (“meta-documentation”)
57
[Bar chart: frequency (axis 0 to 25) of depositor-supplied metadata categories, from language, speaker, creator, duration, relation, subject.language, place and recorder through to dialect, equipment, indigenous title, readme, session_name, toolbox id, filepath and name of the item (in spanish/english)]
62
Discussion and conclusions
for endangered language documentation, the metadata framework is to be discovered, not predefined (cf Jeff Wallman, TBRC)
63
MD and resource discovery
“discovery” is not neutral: what is emphasized/distilled? who gains? who does the work?
MD is also about the distribution of labor and resources
64
MD and users
MD is more responsible for the form, presentation, and usage of documentation than generally acknowledged
MD should be equally accessible to and relevant for community members – it may even be more relevant to them than any “linguistic” data
65
Common metadata standards
OLAC: Open Language Archives Community: Title, Identifier, Creator, Contributor, Language, Subject.language, Date, Description, Format, Type, Rights, Coverage, Relation
IMDI: ISLE Metadata Initiative
  more categories, software specific
ELAR: for endangered language documentation, metadata framework is to be discovered, not predefined
66
Types of metadata
people metadata – creator’s / participants’ details
descriptive metadata – content of data
administrative metadata – eg. who did what when, relationships between objects, IPR and permissions
structural metadata – how collection and its objects are organised, associated, formatted
preservation metadata – character encoding, file format
access and usage protocols
67
Examples
example - XLS
example - XML
example – key
example – key XML
example – summary and requests
example - notes
68
Meta-documentation
Nathan (2010): “think of metadata as meta-documentation, the documentation of your data itself, and the conditions (linguistic, social, physical, technical, historical, biographical) under which it was produced. Such meta-documentation should be as rich and appropriate as the documentary materials themselves.”
69
Meta-documentation
identity of stakeholders involved, and their roles
attitudes of language consultants, towards their languages and towards the documenter and documentation project
relationships with consultants and community (Good 2010 mentions what he called ‘the 4 Cs’: ‘contact, consent, compensation, culture’);
goals and methodology of researcher, including research methods and tools, corpus theorisation (Woodbury 2011), theoretical assumptions behind annotation, potential for revitalisation
70
Meta-documentation
project and researcher biography: knowledge and experience of the researcher and consultants (eg. researcher’s knowledge at beginning of project, what training researcher and consultants received)
for funded projects: grant application, reports, email communications
agreements entered into – formal or informal (eg. Memorandum of Understanding, compensation arrangements), and promises made to stakeholders
relationships between this and other projects
71
Formats/encoding
format choices at these levels:
  representation of information
  representation of characters
  how characters are assembled into files (file formats)
72
Characters
use UTF-8 (an encoding of Unicode / ISO 10646)
be aware of using characters outside ASCII (common US keyboard characters) – these can break if UTF-8 is not used
distinguish character encoding and fonts (a font is simply a set of images for a “character set”)
  something may be coded perfectly in UTF-8 but there is no suitable font applied
  some fonts may display special characters correctly but this does not mean that encoding is correct
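The difference between characters and their encoding can be seen in a few lines of Python. This is a small sketch; the sample word is hypothetical, and the helper name is chosen for illustration.

```python
def non_ascii(text):
    """Return the distinct characters in text that fall outside ASCII."""
    return sorted({ch for ch in text if ord(ch) > 127})


word = "ŋəm"  # hypothetical transcription containing eng and schwa

utf8_bytes = word.encode("utf-8")        # 5 bytes: 2 for eng, 2 for schwa, 1 for m
round_trip = utf8_bytes.decode("utf-8")  # decoding as UTF-8 gives the same string back

# Reading the same bytes with the wrong encoding does not fail, it silently
# produces different characters: this is how non-ASCII text "breaks"
wrong = utf8_bytes.decode("latin-1")

# A strict ASCII encoder refuses the text outright
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

None of this involves fonts: the bytes are right or wrong regardless of whether a font on the viewing machine can draw the characters.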
73
File formats
audio
  WAV (what if original is not WAV??)
  resolution: 16 bit, 44.1 kHz, stereo or better
video
  changing frequently
  MPEG4 or MTS/H264/AVCHD
  aspect, resolution: depends on project
  get advice from archive before depositing
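The audio guideline above (WAV, 16 bit, 44.1 kHz or better) can be checked before sending files to an archive. The following is a rough sketch using Python's standard `wave` module; the function names are illustrative, the check treats "or better" as applying to bit depth and sample rate only (an assumption), and the silent test file is generated in memory purely so the example is self-contained.

```python
import io
import wave


def wav_summary(fileobj):
    """Read channel count, bit depth and sample rate from a WAV file or file object."""
    with wave.open(fileobj, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "bits": w.getsampwidth() * 8,   # sample width is reported in bytes
            "rate": w.getframerate(),
        }


def meets_guideline(summary, min_bits=16, min_rate=44100):
    """Assumed reading of the guideline: at least 16 bit and 44.1 kHz."""
    return summary["bits"] >= min_bits and summary["rate"] >= min_rate


def make_silent_wav(channels=2, sampwidth=2, rate=44100, frames=100):
    """Build a tiny silent WAV in memory (stand-in for a real recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sampwidth)
        w.setframerate(rate)
        w.writeframes(b"\x00" * (channels * sampwidth * frames))
    buf.seek(0)
    return buf


good = wav_summary(make_silent_wav())
low = wav_summary(make_silent_wav(channels=1, sampwidth=1, rate=22050))
```

With real recordings, `wav_summary` can be given a file path instead of an in-memory buffer. A file that fails the check is a prompt to ask the archive for advice, not a reason to upsample.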
74
File formats
images
  TIFF **OR** original from device
  resolution: archive quality is 300dpi or better
75
File formats
text
  best is plain text
  PDF/A often acceptable, may pose problem
  if MS-Word or ODF, check with archive
structured data (spreadsheets, databases)
  original format should be supplied
  provide a preservable derivative as well (eg csv, PDF)
common linguistic software (ELAN, Transcriber, Toolbox, Praat etc)
  their file formats are generally preservable
76
Can I still use MS Word?
ELAR no longer accepts MS Word files
but Word is still useful
  quicker to type up
  useful tables, functions, macros etc
solutions
  think “text only”
  tables as spreadsheets (are they bad too?)
  (advanced) complex materials formatted as styles, then export as marked up PDF/A – but not a perfect solution
77
My cells have multiple values!
example: keywords
  this is probably OK, as keywords are atomic
  just consistently use a suitable delimiter
  e.g. use comma - if data values cannot have commas
  ELAR recommends double pipe “||”
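A consistently used delimiter makes multi-valued cells trivial to process later. The sketch below assumes the double pipe recommended above; the function names and sample keywords are hypothetical.

```python
DELIM = "||"  # the double pipe delimiter recommended on this slide


def split_cell(cell):
    """Split a multi-valued cell into a list of trimmed, non-empty values."""
    return [value.strip() for value in cell.split(DELIM) if value.strip()]


def join_values(values):
    """Join values into one cell, refusing values that contain the delimiter."""
    for value in values:
        if DELIM in value:
            raise ValueError(f"value contains the delimiter: {value!r}")
    return DELIM.join(values)


# Hypothetical keyword cell; note that one keyword safely contains a comma
keywords = split_cell("song || narrative||ritual, public")
cell = join_values(keywords)
```

The comma inside "ritual, public" survives intact, which is the reason to prefer a delimiter that cannot occur in the data values.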
78
My cells have multiple values!
example: speakers in a recording
  speakers are probably not atomic – they have other attributes
  create a separate “speakers” sheet
  give each speaker an ID (number or initials)
  use the IDs in the original sheet, with delimiter (implements one to many)
  (advanced) or make another sheet to associate recordings with speakers (implements many to many)
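The two-sheet approach can be sketched with plain Python structures standing in for spreadsheet sheets. The audio filenames are the ones used earlier in these slides; the speaker IDs, names and attributes are hypothetical, as are the function names.

```python
DELIM = "||"

# "speakers" sheet: one row per speaker, keyed by ID (values are hypothetical)
speakers = {
    "S1": {"name": "Speaker One", "gender": "f"},
    "S2": {"name": "Speaker Two", "gender": "m"},
}

# original sheet: speaker IDs joined by the delimiter (one to many)
recordings = [
    {"audio": "TRS00065.wav", "speakers": "S1||S2"},
    {"audio": "TRS00066.wav", "speakers": "S2"},
]


def speakers_in(recording):
    """Look up the full speaker rows for one recording."""
    ids = [i.strip() for i in recording["speakers"].split(DELIM) if i.strip()]
    return [speakers[i] for i in ids]


# (advanced) a separate link sheet: one (recording, speaker) pair per row,
# which supports lookups in both directions (many to many)
links = [
    (rec["audio"], sid.strip())
    for rec in recordings
    for sid in rec["speakers"].split(DELIM)
]


def recordings_of(speaker_id):
    """List the recordings a given speaker appears in."""
    return [audio for audio, sid in links if sid == speaker_id]
```

Speaker attributes are stored once, in the speakers sheet, so correcting a name or adding an attribute does not require editing every recording row.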
79
Standards
we have already mentioned some standards – UTF-8, WAV etc
there are other relevant standards, eg
  ISO 639-3 (language/dialect names)
  metadata systems
you can also establish project-local standards, eg
  to handle special characters (eg \e = schwa)
  data field names
document them! – for your usage and for correspondence to wider standards
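A documented project-local convention can double as a conversion table. The sketch below takes the `\e` = schwa example from this slide and maps it to the corresponding Unicode character; the table name, function name and sample word are hypothetical.

```python
# Project-local codes and the Unicode characters they stand for.
# Only the \e = schwa pair comes from the slide; a real project would
# list all of its conventions here, which also documents them.
LOCAL_CODES = {
    "\\e": "\u0259",  # \e = schwa (U+0259)
}


def to_unicode(text, codes=LOCAL_CODES):
    """Replace each project-local code with its Unicode character."""
    for code, char in codes.items():
        text = text.replace(code, char)
    return text


converted = to_unicode(r"t\em")  # hypothetical word typed with the local code
```

Keeping the table in one place means the correspondence to the wider standard (here, Unicode) is recorded explicitly rather than living only in the fieldworker's memory.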