Archiving and the work flow of field work

Nicholas Thieberger, Pacific and Regional Archive for Digital Sources in Endangered Cultures

Post on 21-Dec-2015


Nicholas Thieberger, Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)

Department of Linguistics & Applied Linguistics, The University of Melbourne Vic 3010, Australia

LSA Archiving tutorial, January 2005

Language archiving is an integral part of language documentation. The documents linguists produce are meant to endure and to be available to the people we record and their communities, as well as to fellow researchers, well into the future and, we hope, forever. Archiving is no longer something we do at the end of our fieldwork: it is now apparent that it can be integrated into everyday language documentation work and that it is a crucial aspect of documentary linguistics. We have learned to separate form and content in the representation of linguistic data, and recent technological advances have pointed to the importance of planning data management and workflow for ethnographic recording, which in turn has facilitated an expansion in documentary linguistics and archiving. Recordings should always be of high quality, but it is in the context of small and endangered cultures and languages that the quality of recording takes on new significance (quality here refers both to the content and the form of the recording). If we are the only recorders of the last remaining speakers or performers then, right from the moment of recording, we must be concerned with making good documents which will be placed into a suitable archive for storage and discovery. Thus we can distinguish archival practice, which will be the main focus of this paper, from archival storage in a repository.

I discuss a workflow that builds in the development of archival data, and show that making the initial recordings and their digital representation citable by means of a persistent identifier allows further work to be located with reference to that primary data. Typically this further work involves annotation of the data and the construction of dictionaries, and in all such derived material the content is plain text, structured to allow it to endure into the future. Further description of the data with standard metadata terms allows its discovery in the long term. All of this facilitates repatriation of the data to the communities from which it originates, as they are able to locate the data once it has been archived.

Archives have an image of being repositories of old stuff. Usually old stuff that comes from old people. And in our case it is old stuff from old people on old languages. I asked a colleague if he was considering depositing with our archive and he said "Did I look as if I was going to die any minute when you last saw me?" For him, as for many people it seems, archiving is something done at the end of one's career when there is time to go back and fill in gaps and make the whole data more presentable. This view of archiving has it that boxes of stuff can be delivered to the archive sometime after the linguist has finished with them and will then be held in perpetuity. The recent focus of linguistic archives, informed by the discussion of language documentation, is that the stuff deposited must be of sufficient quality and sufficiently well-described that it can be useful into the future.

Language Archives

An integral part of language documentation

The locus for supporting documentary linguistic activity: "Ark-Hive" (David Nathan, ELAR)

Need to develop archival methods for linguistic fieldworkers

Producing archival material is something we, as ordinary working linguists (OWLs), should do all the time and, further, the possibilities provided by new technologies allow us to incorporate archival issues into our everyday practice, to the benefit both of our analysis and of the use of our recordings and intellectual outputs. Current archives are training and providing advice in response to the need for such a service in our community, that is, the community of documentary linguists. These archives are primarily trusted long-term repositories that take well-structured data and provide the infrastructure for securely holding and locating it over time. An archive is also the point of reference for a network of practitioners who want advice on how to proceed. It is the archive’s role to agree on standards that seem most appropriate and to assist in their adoption by the broader community. My observation of our own and other such archives suggests we are all acting as a locus for documentary activity, and as proponents for new methods - what has been called an ‘ark-hive’ by David Nathan of ELAR. As none of us has the resources to edit items in our collections, we rely on the depositors to produce material that is well-formed from an archival point of view. Such data has an explicit structure, encoded, for example, by tags (as in a Shoebox lexical file) or by stand-off markup (as in time-aligned transcripts). It is also provided in a non-proprietary form that can be read on any platform.
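To make "explicit structure encoded by tags" concrete, here is a minimal sketch in Python of reading one backslash-coded record into a list of fields. It is an illustration only, not part of Shoebox or of any archive's tooling. The markers (\lx, \ps, \ge, \de) follow common Shoebox dictionary practice, and the entry itself is invented placeholder text rather than data from any language.

```python
def parse_backslash_record(text):
    """Parse one backslash-coded record into a list of (marker, value) pairs.

    A line starting with a backslash opens a new field; any other
    non-empty line is treated as a continuation of the previous field.
    """
    fields = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Split "\marker value" at the first space.
            marker, _, value = line[1:].partition(" ")
            fields.append((marker, value.strip()))
        elif fields:
            # Continuation line: join it onto the previous field's value.
            marker, value = fields[-1]
            fields[-1] = (marker, (value + " " + line).strip())
    return fields


# Invented placeholder entry, used only to show the format.
SAMPLE = r"""
\lx example
\ps n
\ge sample gloss
\de a made-up entry used only
to show a continuation line
"""

record = parse_backslash_record(SAMPLE)
for marker, value in record:
    print(marker, "=", value)
```

Because the structure is carried by plain-text markers rather than by fonts or styles, a few lines of code on any platform can recover it, which is the property that makes such files suitable for archiving.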

The fact that the best current working tools for transcription and time-alignment are coming out of this same effort, for example IMDI and Elan from the MPI in Nijmegen, or Transcriber from LIMSI (with strong support from OLAC via the Linguistic Data Consortium) indicates that archives are central to the promotion of new technologies as a means for ensuring that normal linguistic fieldwork will result in the best possible archival form.

Perspective of PARADISEC

This paper is written from the perspective of Paradisec, a young digital archive based in virtual space between Sydney, Melbourne and Canberra in Australia. Paradisec was established 18 months ago by a group of linguists and musicologists concerned at the lack of a repository for material recorded outside of Australia by Australian researchers. For those working with Indigenous Australian languages there is a national archive, the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS), which has been operating since the 1960s. National Australian cultural institutions like the National Library and the National Film and Sound Archive do not have a mandate to keep field recordings from outside of Australia. We were especially concerned about audiotapes recorded since the 1950s that were not being stored in any suitable repository and were physically deteriorating. Thus the initial focus was on the preservation of existing, so-called ‘legacy’ material, and we have so far digitised some 660 hours or 1.1 terabytes of data. However, once we started processing these tapes, it was clear that there was a huge demand from current researchers wanting to work with their data in a digital form and wanting high-quality archival representation of their media before they conducted most of their analysis.

The coercive archive

Attached to a funding body

Obligatory deposit of recorded material

Enforces data formats by contractual requirement

At this point it is useful to distinguish two kinds of existing archives, which I will characterise as coercive and non-coercive. The coercive archive is part of a funding agency and so has some means and ability to enforce standards on depositors, as is the case with the ELDP/ELAR or DOBES. When grants are provided for language documentation, the form of the recordings and their associated descriptive and analytical apparatus can be prescribed by these funding bodies, who have also been providing training in the use of these methods. Further, the funding body can contractually bind the recipient to lodge this material with the archive. These archives typically house newly recorded material, often recorded in a digital form and so not requiring conversion before being archived.

PARADISEC is currently not in a position to fund researchers, and so the appeal to depositors has to be pitched differently. We encourage practitioners (who we take to mainly include linguists, musicologists, and indigenous language workers) to deposit media material by ensuring that they will have a high-quality digital version of their data in the short term. If an archival form of the file is created first and is then used as the basis for the subsequent effort of transcription and time-aligning, the resulting work has a citable source that should persist into the future. We have been encouraging postgraduate students to lodge their tapes with PARADISEC as soon as they return from fieldwork. We digitise or capture their data and provide both an archival (usually 96 kHz/24-bit BWF) and a representational (linear MP3) copy with its persistent identifier in our collection. This gives them a digital file to work with, but more importantly it gives them a citable form of archival data with persistent identification. Their intellectual effort of annotating this primary data can then build on a firm foundation for both their own immediate goal (typically a dissertation) and the long-term need of having richly annotated primary data safely archived.
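As a rough guide to what an archival copy at 96 kHz/24-bit means for storage, the following Python sketch works out the size of one hour of uncompressed audio. It assumes linear PCM samples and ignores the small header that a BWF file adds. The channel count is an assumption, since it is not stated above, so both mono and stereo are shown.

```python
def pcm_bytes_per_hour(sample_rate_hz, bit_depth, channels):
    """Size in bytes of one hour of uncompressed linear PCM audio."""
    bytes_per_sample = bit_depth // 8
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return bytes_per_second * 3600


# 96 kHz / 24-bit, for an assumed mono and an assumed stereo recording.
mono = pcm_bytes_per_hour(96_000, 24, 1)
stereo = pcm_bytes_per_hour(96_000, 24, 2)

print(f"mono:   {mono / 1e9:.2f} GB per hour")
print(f"stereo: {stereo / 1e9:.2f} GB per hour")
```

On these assumptions an hour of archival audio takes roughly one to two gigabytes, which is why the archival copy is paired with a much smaller representational copy for everyday work.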

I learned the hard way that using a non-archival digital file as the basis for transcription and analysis results in a mismatch in timecodes when an archival file is later produced. I had digitised my analog field cassettes myself in 1998 by connecting a tape player to a computer and producing fairly poor digital copies. I then annotated these using the program called SoundIndex from LACITO as part of my documentation of the Oceanic language South Efate. A few years later the tapes were digitised at a higher resolution and the timecodes in my earlier versions did not align in a simple way with those of the new files. I transcribed around eighteen hours of audio altogether and I was under some time constraint as the work was to result in a dissertation in the form of a documentary grammar with a time-aligned media corpus. The non-archival forms of transcript may be corrected one day, but the clear lesson is that creation of an archival form of digital data is best done before the analysis begins.

At PARADISEC we also spend considerable time with many old tapes, preparing them for data transfer by cleaning and, in some cases, baking them under vacuum. We assign persistent identification and create an enduring citable form of the data as part of the archival accession, and we run training workshops of half a day to several days’ duration on the use of software tools and on data management. We use these as a means of advocating a workflow for language documentation that builds archiving into the normal everyday work of the OWL rather than being an onerous addition, or a task left until the weight of the cumulative research effort becomes unbearable at the end of a researcher’s working life.

The paucity of material related to many Australian indigenous languages is a great motivator for the current generation of researchers to ensure that the records that we leave behind will be of more use than those we have had to work with.

Both coercive and non-coercive archives rely on the relationships they have established with their communities, both depositors and users. In general, the benefits of depositing are clear, in particular as we are digitising analog tapes and holding copies at no cost for members of our consortium. The ability to be ‘trusted’, as a repository should be, arises from a number of factors, but a key for us has been the ability to provide advice and training to ensure the quality, both technical and in content, of recordings and associated derived material (transcripts, glosses, dictionaries etc.). The rationale is that if we want high-quality recordings and well-structured archival data, then we have to provide training in its creation. We run workshops in using Shoebox, still the only tool that creates structured lexical files, and, as wonderful tools like Transcriber and Elan are produced by our colleagues in Europe, we introduce them to a community of users in our region at occasional workshops, both in our universities and in community-based language centres.

http://www.linguistics.unimelb.edu.au/rnld.html

Resource Network for Linguistic Diversity

To assist in training, we have cooperated in the establishment of a network for providing support to language workers, called the Resource Network for Linguistic Diversity. We use the mailing list associated with this network to discuss emerging methods and tools, and we also provide a FAQ page and an archive of the list discussion (kindly provided by LinguistList).

Deposit of data in an archive presupposes that the depositor has sought and received permission from their interlocutors. Ideally a written consent form, provided by the researcher and signed by the speaker, would specify the uses to which the recordings could be put. Each item in the archive is accompanied by a deposit form filled out by the depositor or their executor that outlines conditions on use of the material.

The ability to enforce standards on depositors extends to the description of the data, or the metadata that allows the data to be discovered. Again, a coercive funding agency can insist on highly detailed metadata descriptions, as we see with the fine-grained IMDI metadata set. For legacy data, which PARADISEC mainly deals with, the quality of metadata can be quite variable, often no more than a few lines on a tape box, together with contextual information about the collection from which the item will be identified. At PARADISEC we use a cataloging system that provides a description of the item as well as of the process it undergoes from accession. All of this metadata can be output in various forms, one of which is the OLAC metadata set. We would like to take this opportunity to thank OLAC for the work it has put into developing a metadata system and for the support we have received in establishing a static metadata repository that we update periodically from our catalog. Exporting to OLAC metadata has increased the visibility, and so the discoverability, of the material in our collection, and its ease of use has meant that we were able to move our metadata system from nothing to an Open Archives Initiative conformant metadata repository in a few months.

Filenaming conventions

Collection-item-file.extension

AB1-002-A.wav

AB1/AB1-002/AB1-002-A.wav
AB1/AB1-002/AB1-002-A.mp3
AB1/AB1-002/AB1-002-A.xml
AB1/AB1-002/AB1-002-FN1.tif
AB1/AB1-002/AB1-002-FN1.jpg

We also provide users with a spreadsheet with our metadata headings and are hoping to implement a web entry form for users in the next year. We encourage users to develop a persistent naming convention using fairly standard ASCII characters and to avoid unnecessarily long names. If we can then take the user’s names for their own files and incorporate them into our persistent identification it makes it much easier to keep track of the relationships between the notes and the media files. Our persistent filenames follow the directory structure of the mass storage system on which the files will reside, and are composed of a collection identifier, followed by an item identifier and then a specific local identifier (like ‘A’ or ‘B’ for the side of a tape). These are then followed by a three-letter extension indicating the filetype.
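The naming scheme just described can be sketched in a few lines of Python. This is an illustration of the convention as stated here (collection, item, local identifier, three-character extension, with the directory path mirroring the name), not the archive's actual software. The regular expression, including the set of characters it accepts, is an assumption, and the function names are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed pattern: three hyphen-separated ASCII parts and a
# three-character extension, e.g. "AB1-002-A.wav".
ID_PATTERN = re.compile(
    r"^(?P<collection>[A-Za-z0-9]+)"
    r"-(?P<item>[A-Za-z0-9]+)"
    r"-(?P<local>[A-Za-z0-9]+)"
    r"\.(?P<ext>[A-Za-z0-9]{3})$"
)


def parse_filename(name):
    """Split a persistent filename into its named parts."""
    match = ID_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a collection-item-file.extension name: {name!r}")
    return match.groupdict()


def storage_path(name):
    """Build the directory path implied by the filename itself."""
    parts = parse_filename(name)
    collection = parts["collection"]
    item_dir = f"{collection}-{parts['item']}"
    return PurePosixPath(collection) / item_dir / name


print(parse_filename("AB1-002-A.wav"))
print(storage_path("AB1-002-A.wav"))
print(storage_path("AB1-002-FN1.tif"))
```

Because the path can be derived from the name alone, a file that has strayed from its directory can still be returned to its place, and a name containing spaces or other unexpected characters is rejected early rather than silently stored.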

Working with legacy material means that we see what small additional steps a researcher could have taken to make their recordings more useful. Obviously collections vary greatly in the accompanying documentation. In some cases there is no specific information about the tapes we have located in a box or filing cabinet, and, while there may be accompanying fieldnotes, we do not have the time or the personnel to work through fieldnotes and to establish their relationships to field recordings. Simple descriptive metadata allows us and potential researchers to locate the relevant material and to reintegrate it with fieldnotes.

<hasTranscript>/<isTranscriptOf> proposed addition to OLAC/Dublin Core <relation> element

Page image of Stephen Wurm’s fieldnotes on Aiwo, Solomon Islands, SAW2-018-00005

Recently we have been taking representational images of transcripts found in association with tapes recorded thirty years ago. These are typescript or handwritten manuscripts that belong together with tape recordings. We give the images the same name as the tape, differing in the extension (.wav for the audio and .jpg for the image). Furthermore, the metadata notes the relation between these two types of information, using the element <relation>, for which we propose the additional refinement of <isTranscriptOf>/<isTranscribedBy>.
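A minimal sketch of how such a pair of relations might be generated is given below in Python. It derives the image name from the audio name by swapping the extension, as described above, and emits one <relation> element for each direction of the link. The refinement names are the ones proposed here and are not part of any published OLAC or Dublin Core schema. The "refinement" attribute used to carry them is purely hypothetical, as are the function names.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def transcript_image_name(audio_name):
    """Same name as the tape file, differing only in the extension."""
    return PurePosixPath(audio_name).with_suffix(".jpg").name


def transcript_relations(audio_name):
    """Return (relation for the image's record, relation for the audio's record)."""
    image_name = transcript_image_name(audio_name)

    # In the transcript image's metadata: which recording it transcribes.
    on_image = ET.Element("relation", {"refinement": "isTranscriptOf"})
    on_image.text = audio_name

    # In the recording's metadata: which image holds its transcript.
    on_audio = ET.Element("relation", {"refinement": "isTranscribedBy"})
    on_audio.text = image_name

    return on_image, on_audio


on_image, on_audio = transcript_relations("AB1-002-A.wav")
print(ET.tostring(on_image, encoding="unicode"))
print(ET.tostring(on_audio, encoding="unicode"))
```

Recording the link in both directions means that a reader who finds either the tape or the transcript can be led to the other.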

Everyday archiving

Creation of optimal archival forms of data through normal linguistic practice, e.g.

• Well-structured data (e.g. backslash codes, XML)

• Annotation with time-alignment

• Citation of archival data, which implies
  - persistent identification and location
  - interactive use of archival data

Everyday archiving

Creation of optimal archival forms of data through normal linguistic practice, e.g.

• Tracking relationships within the data
  - speaker - tape - transcript - text

• Consent/deposit forms clarify intellectual property issues

Example Workflow

[Workflow diagram. Box labels as extracted:]

tape: analogue, digitised / digital, captured

archival digital file

descriptive metadata added

transcribed and linked (using e.g. Transcriber or Elan)

Media corpus instantiates links to media (e.g. Audiamus): concordance of texts, navigation tool

output to e.g. Shoebox for interlinearising

Texts, dictionary etc

archived with PARADISEC (label appears twice in the diagram)

Contrast and compare

Data
Previously: Analog
Current: Digital

Copyright in material clarified
Previously: Rarely
Current: Consent forms signed by interlocutors (because deposit in an archive is envisaged as part of the process)

Filenames
Previously: Arbitrary
Current: Persistent identifiers

Contrast and compare

Data structure
Previously: No explicit structure (implicitly marked by fonts and styles)
Current: Explicit structure is used as the basis for derived forms (e.g. as in lexical files in Shoebox)

Archival accession of primary data
Previously: After use of the material by the researcher (typically post retirement or after death of the researcher)
Current: Incremental accession, ideally before use of the material by the researcher

Contrast and compare

Annotation of primary media
Previously: Little done, usually by hand
Current: More comprehensive annotation, using time-alignment and interlinearising

Archival accession of annotations
Previously: Typically post retirement or after death of the researcher
Current: Work in progress archivable and overwritten by subsequent versions (safe backup)

Contrast and compare

Persistent identification to support citation forms of data
Previously: Maybe in fieldworker’s notes, hampered by lack of discoverability
Current: Assigned by archive and persistent identifier resolved to an item in the archive

Metadata standard
Previously: Library/MARC (large existing infrastructure)
Current: DC / OLAC (support for small, collector-based archives)

Contrast and compare

Metadata discovery
Previously: Library catalogs (not always interoperable)
Current: Open Archives Initiative, subject specialised searches

Contrast and compare

Persistence of data
Previously: Analog tape in one location
Current: Digital simulacra/copies (LOCKSS)

Relation between items
Previously: Ignored or treated in catalog
Current: Treated in metadata and instantiated where possible (e.g. tape/transcript)

Contrast and compare

Repatriation of copies
Previously: Copies of tapes provided from a single location
Current: Digital copies of tape/transcript in linked form, available for download from the web

Conclusion

• Secure long-term storage of well-described linguistic records is crucial to language documentation.

• Archives do not have the resources to prepare data.

• Training is essential for the integration of an archival sensibility into a linguist’s fieldwork methods.

• It is up to the fieldworker to produce archival material from their fieldwork.