digital data archives in the humanities
Post on 13-Jan-2016
Pacific and Regional Archive for Digital Sources in Endangered Cultures
Digital data archives in the humanities
Linda Barwick, University of Sydney
APAN Semantic Web workshop, Bangkok, 27 January 2005
Issues for participating in the semantic web: the case of PARADISEC
Endangered languages
•Over 2000 of the world’s 6000 languages in the Asia-Pacific region
•Number likely to fall to a few hundred by 2100 (UNESCO)
•Australian researchers active in region since 1950s - making unique recordings of unrepeatable events
•Recordings now themselves endangered (format obsolescence, media deterioration, loss of metadata)
PARADISEC’s mission
•To preserve and make accessible Australian researchers’ field recordings of endangered languages and musics from the Asia-Pacific region
•Preservation: to adopt world’s best practice standards and formats to maximise sustainability and future useability of the collection
•Access: to take advantage of emerging information and communication technologies to maximise access to our collection by both researchers and cultural heritage communities
PARADISEC structure
•CIs: Cliff Goddard, Hugh de Ferranti
•CIs: Steve Bird, Nick Evans, Cathy Falk, Janet Fletcher, John Hajek
•CIs: Andrew Pawley, John Bowden, Malcolm Ross, Alan Rumsey
•CIs: William Foley, Allan Marett, Jane Simpson
•Project Manager (Metadata guru): Nick Thieberger
•Audio Archiving Unit - Director: Linda Barwick; Audio: Frank Davey; Project Liaison: Amanda Harris
•Store account - web interface: Stuart Hungerford
Networking
•Main campuses (University of Sydney, University of Melbourne, Australian National University) connected by Grangenet (next generation research network, 10Gbps connections)
•Pay subscription, not traffic costs
•Satellite campus UNE connected by AARnet (Australian research and education network - currently billed traffic cost, 155Mbps connection)
•Both with connections to APAN community (Asia Pacific Advanced Networks) - potential for linking to regional and international R&E networks - potential traffic costs an issue
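
To put the two link speeds in perspective, the following sketch estimates how long it would take to move one terabyte of audio over each connection. It is a back-of-the-envelope calculation only: the function name is illustrative, decimal terabytes are assumed, and protocol overhead and contention are ignored, so real transfers would be slower.

```python
# Rough transfer-time estimate for the two networks named on the slide.
# Assumptions (not from the talk): decimal units, full link speed available,
# no protocol overhead.

def transfer_hours(size_tb: float, link_mbps: float) -> float:
    """Hours needed to move size_tb terabytes over a link of link_mbps megabits/s."""
    bits = size_tb * 1e12 * 8              # terabytes -> bits
    seconds = bits / (link_mbps * 1e6)     # megabits/s -> bits/s
    return seconds / 3600


if __name__ == "__main__":
    for name, mbps in [("Grangenet (10Gbps)", 10_000), ("AARnet (155Mbps)", 155)]:
        print(f"{name}: {transfer_hours(1.0, mbps):.2f} hours per TB")
```

On these assumptions a terabyte takes well under an hour on the 10Gbps link and more than half a day on the 155Mbps link, which is one reason billed traffic matters for the satellite campus.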
Storage
•Australian Partnership for Advanced Computing National Facility Mass Data Storage System - Hierarchical Storage Manager system
•Funded by consortium of Australian higher education bodies
•Tape robot system - can handle 1.2PB
•PARADISEC will add 2-3TB per year once satellite ingest commissioned
•Current horizon of facility 2008 - project PARADISEC collection up to 9TB by then
•Will need to apply to host material/share data from other collections
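
The 9TB projection can be sanity-checked with simple arithmetic. The sketch below assumes a starting size of 1.1TB (the repository total reported for 21 January 2005 elsewhere in this talk) and about three years of growth up to the 2008 horizon; the three-year span and the function name are assumptions made for illustration.

```python
# Linear growth projection for the collection size.
# Assumed inputs: 1.1 TB at the start of 2005, growth of 2-3 TB per year,
# three years until the facility's 2008 horizon.

def projected_size_tb(current_tb: float, growth_tb_per_year: float, years: float) -> float:
    """Projected collection size after `years` of constant yearly growth."""
    return current_tb + growth_tb_per_year * years


if __name__ == "__main__":
    low = projected_size_tb(1.1, 2.0, 3)
    high = projected_size_tb(1.1, 3.0, 3)
    print(f"Projected size by 2008: {low:.1f} to {high:.1f} TB")
    print(f"Share of a 1.2 PB tape system at the high end: {high / 1200:.2%}")
```

Under these assumptions the stated 9TB falls inside the projected range, and even the high end is under one percent of the tape system's 1.2PB capacity.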
Software
•Initial metadata database in Filemaker Pro 6 with periodic XML dumps for OLAC static harvesting
•Currently being ported to MySQL/PHP to allow dynamic harvesting and other functionality
•Python software for managing repository and website (Stuart Hungerford, ANU)
•Developing Java-based geographic search interface (TimeMap)
•All based on Open Source tools
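
As a rough illustration of what a periodic XML dump of catalogue metadata could look like, the sketch below serialises one item to a small Dublin Core-style record using only the Python standard library. This is not PARADISEC's actual export code or the full OLAC schema: the `record` wrapper element, the chosen fields, and the identifier `XX1-001` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace; real OLAC records add their own extensions
# and an OAI wrapper, which are omitted in this simplified sketch.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def item_to_xml(item: dict) -> str:
    """Serialise a catalogue item (a plain dict) to a minimal XML record."""
    record = ET.Element("record")  # hypothetical wrapper element
    for field in ("identifier", "title", "language", "date"):
        if field in item:
            element = ET.SubElement(record, f"{{{DC_NS}}}{field}")
            element.text = str(item[field])
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical item; the values are placeholders, not real catalogue data.
    sample = {"identifier": "XX1-001", "title": "Sample recording", "language": "Example"}
    print(item_to_xml(sample))
```

A static dump of this kind has to be regenerated whenever the catalogue changes, which is the limitation the move to a database-backed dynamic harvest is meant to remove.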
Audio Ingest
•Initially ingested as raw WAV on AudioCube 5 Dell 670 workstations running Wavelab (2005 will add remote Pyramix workstations)
•Masters 24-bit 96kHz Broadcast WAV Format (uncompressed audio with encapsulated metadata)
•Some lower rate (e.g. if digital original 16-bit 48kHz from DAT)
•WAV > BWF by Quadriga audio archiving software
•Derivatives produced by batch processing - CD-audio quality (16-bit, 44.1kHz) and mp3 quality (128kbps)
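
The storage cost of each format follows directly from its sample rate and bit depth. The sketch below works this out per hour of audio, assuming two-channel (stereo) recordings, which the slide does not state, and reading the mp3 figure as 128 kilobits per second. Function names are illustrative.

```python
# Bytes per hour of audio for the master and derivative formats.
# Assumptions: stereo recordings; mp3 at a constant 128 kbit/s;
# file headers and embedded metadata are ignored.

def pcm_bytes_per_hour(sample_rate_hz: int, bit_depth: int, channels: int = 2) -> int:
    """Uncompressed PCM size for one hour of audio."""
    bytes_per_sample = bit_depth // 8
    return sample_rate_hz * bytes_per_sample * channels * 3600


def compressed_bytes_per_hour(bitrate_kbps: int) -> int:
    """Size of one hour of constant-bitrate compressed audio."""
    return bitrate_kbps * 1000 // 8 * 3600


if __name__ == "__main__":
    formats = {
        "Master (24-bit, 96kHz)": pcm_bytes_per_hour(96_000, 24),
        "CD quality (16-bit, 44.1kHz)": pcm_bytes_per_hour(44_100, 16),
        "mp3 (128kbps)": compressed_bytes_per_hour(128),
    }
    for name, size in formats.items():
        print(f"{name}: {size / 1e9:.2f} GB per hour")
```

On these assumptions a master takes roughly 2GB per hour, a little over three times the CD-quality derivative and about 36 times the mp3.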
Digital preservation
•“Azoulay” server partitioned for working files and archive partition for sealed masters - current capacity 750GB (>3TB in 2005)
•Sealed masters archived to 100GB data tapes on University of Sydney LTO Mass Data Storage System (high-low watermark script) - duplicate data tapes kept at 2 locations on campus
•Sealed masters mirrored to Australian Partnership for Advanced Computing national Store facility (Canberra) nightly
•Password-protected online access to Store facility
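
The slide names a high-low watermark script without describing it. In hierarchical storage setups this generally means that once disk usage passes a high threshold, files are moved to tape until usage drops below a low threshold. The sketch below illustrates that selection logic only; the thresholds, the oldest-first policy, and the function name are assumptions, not details of the University of Sydney script.

```python
from typing import List, Tuple

# Each file is described as (name, size_in_bytes, modification_time).
FileInfo = Tuple[str, int, float]


def files_to_migrate(files: List[FileInfo], capacity_bytes: int,
                     high: float = 0.9, low: float = 0.7) -> List[str]:
    """Pick files to move to tape once usage exceeds the high watermark.

    Files are chosen oldest first (an assumed policy) until usage falls
    to or below the low watermark. Returns the selected file names.
    """
    used = sum(size for _, size, _ in files)
    if used <= high * capacity_bytes:
        return []  # below the high watermark: nothing to do
    selected = []
    for name, size, _ in sorted(files, key=lambda f: f[2]):
        if used <= low * capacity_bytes:
            break
        selected.append(name)
        used -= size
    return selected


if __name__ == "__main__":
    # Hypothetical partition of 100 units holding three sealed masters.
    demo = [("a.wav", 40, 1.0), ("b.wav", 30, 2.0), ("c.wav", 25, 3.0)]
    print(files_to_migrate(demo, capacity_bytes=100))
```

Using two thresholds rather than one avoids triggering a migration every time a single new file pushes usage over the limit.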
Data repository contents
•Repository totals 21 January 2005
•total files: 2714
•total items: 847
•total size: 1.1TB
•total hours audio: 668 hours
•file types: .wav, .mp3 (1210); .tif (171), .jpg (46), .pdf (34), .txt (3), .rtf (8), .xml (32)
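
The per-type counts can be cross-checked against the reported total. The sketch below assumes that the figure of 1210 applies to .wav and .mp3 separately (one mp3 derivative per master); with that reading the counts add up to the stated 2714 files. The average-size figure is a simple derived estimate, not a number from the talk.

```python
# Cross-check of the repository totals for 21 January 2005.
# Assumption: the count 1210 applies to each of .wav and .mp3.
file_counts = {
    ".wav": 1210, ".mp3": 1210, ".tif": 171, ".jpg": 46,
    ".pdf": 34, ".txt": 3, ".rtf": 8, ".xml": 32,
}

REPORTED_TOTAL_FILES = 2714
REPORTED_SIZE_TB = 1.1
REPORTED_AUDIO_HOURS = 668

total_files = sum(file_counts.values())
# Average storage per hour of audio, counting all file types together.
avg_gb_per_hour = REPORTED_SIZE_TB * 1000 / REPORTED_AUDIO_HOURS

if __name__ == "__main__":
    print(f"Sum of per-type counts: {total_files}")
    print(f"Matches reported total: {total_files == REPORTED_TOTAL_FILES}")
    print(f"Average storage per audio hour: {avg_gb_per_hour:.2f} GB")
```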
Data repository collections
Bradley (5hr), Brotchie (15hr), Capell (9hr)*, Corris (6hr), Crowther (2hr), Donohue (3hr), Dutton (266hr), Fedden (7hr), Foley (23hr), Gardner (56hr), Kartomi (2hr)*, Loughnane (9hr), Lawton (3hr), Laycock (29hr)
McElhanon (41hr), McIntyre (10hr), Margetts (17hr), Poignant (2hr), Rumsey (20hr), San Roque (1hr), Sam (6hr), Tepano (19hr), Thieberger (39hr), Toulmin (35hr), Voorhoeve (33hr), Wurm (11)*, Evans (Hons thesis), Thieberger (PhD thesis)
* Ingestion ongoing January 2005
PDSC Jan 2005: AB1, AC1, AM2, AM3, AM4, AR1, AR2, BE1, CLV1, DB1, DG3, DL1, KM1, LS1, LSR1, MC1, MC2, MD1, MK2, MT1, NT1, P130_19, RL1, RL2, RP1, SAW2, SF1, TD1, TT1, WF1
PARADISEC Repository Languages November 2004
PAPUA N. GUINEA: Abau, Ambonese Pidgin, Angoram (Kanduanuin), Angoram (Moim dialect), Aomie, Arapesh, Arifama, Aunalei, Auwim, Awomo, Ba, Balawaia, Barai, Baruga, Barupu (Warapu), Be'anivia, Biage, Bibo, Binandere, Bodinumu, Boera, Boine, Boku, Boridi, Bouxula, BratMomire, Buin, Burum, Chimba, Chirima, Daga, Darava, Dawawa, Dedua, Dima, Dimadima, Dina, Doga, Domu, Doromu, Doura, Efogi, Efogi Dialects, Emo, Enivilogo, Fore, Fuyugey, Gabadi, Ginuman, Gwedena, Herei, Hiae Motu, Hiri Motu, Hube, Hula, I'ai, Ikega, Ioma, Isaka (Krisa), Kaipi, Kairi, Kambot, Kanga, Karama, Karawari Lg (Ambinwari), Karukaru, Kâte, Kinalaknga, Kimi, Kiriwina, Koiari, Koita, Koitabu, Kokila, Kokoro, Komba, Kopar, Koriki, Koriko, Kosorong, Kovai, Kovio, Kubuirubu, Kuman, Kumukio, Kuni, Kunimaipa, Kwale, Laimodo, Mada'a, Magi, Mâgobineng, Magore, Maisin, Maiwa, Managalas, Manam, Manubara, Manumu, Mapei, Mapena, Mari, Maria, Mekeo, Melpa, Mian, Mid-Wahgi, Migabac, Mindik, Miniafa, Mogoni, Mom, Mor, Motu, Muhiang Arapesh, Nabak, Naga, Namanadza, Naoro, Nara, New Ireland Pidgin, Ngala, Nomu, Notu, Ondoro, One (Onne), Onjab, Ono, Opao, Orokaiva, Orokolo, Ouma, Paiwa, Police Motu, Porome, Qld Pidgin, Rabuka, Raepa Tati, Saliba, Samo, Sene, Sepik Tok Pisin, Sialum, Sinaugoro, Sona, Suau, Suku, Surai, Taboro, Tairuma, Tauade, Tobo, Tok Pisin, Tolai, Uberi, Ubir, Ubir Gonjoe, Vesilogo, Vioribaiwa, Wamora, Wangun, Wiga, Wosera, Yele, Yewudu, Yimas, Yoba
COOK ISLANDS: Rarotongan, Pukapuka
FRENCH POLYNESIA: Tahitian
CHILE: Rapa Nui
PALAU: Palauan
SOLOMONS: Babatana, Ririo, Ruviana, Varese, Lau, Santa Cruz
INDONESIA: Asmat, Brat, Hatam, Inanwatan, Manikion, Moi, Ningrum, Sahu, Sebyar, Tinam, Todahe, Tok Pisin, Yahadian
INDIA: Rajbangsi
NEW CALEDONIA: Dehu
VANUATU: South Efate, Bislama, Lelepa
FIJI: Lauan
TONGA: Tongan
Sample item interface
Sample catalog metadata
Metadata January 2005
•1800 items (recordings or theses) digitised or assessed for digitisation (1629 findable online via metadata repository)
•254 languages from 39 countries in Asia-Pacific
•Cassettes: 1256 hours
•Reel-to-reel tapes: 417,356 metres of tape
•Video: 356 hours
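
Reel-to-reel holdings are given in metres, while cassettes and video are given in hours. A length of tape can be converted to playing time once a tape speed is chosen. The sketch below assumes two common consumer speeds, 7.5 and 3.75 inches per second, and a single pass over the tape; the actual speeds and track layouts of the tapes in the collection are not stated in the talk, so the results are indicative only.

```python
# Convert a length of reel-to-reel tape into playing time.
# Assumptions: constant tape speed, one pass over the tape (no extra
# sides or tracks); the speeds below are illustrative.

METRES_PER_INCH = 0.0254


def tape_hours(length_m: float, speed_ips: float) -> float:
    """Playing time in hours for length_m metres of tape at speed_ips inches/s."""
    metres_per_second = speed_ips * METRES_PER_INCH
    return length_m / metres_per_second / 3600


if __name__ == "__main__":
    total_metres = 417_356  # figure from the January 2005 metadata slide
    for ips in (7.5, 3.75):
        print(f"At {ips} ips: about {tape_hours(total_metres, ips):.0f} hours")
```

Halving the tape speed doubles the playing time, so the estimate is very sensitive to the speed assumption.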
Open Language Archives Community (OLAC)
http://www.language-archives.org
•Sub-community of Open Archives Initiative
•Worldwide virtual library of language resources
•PARADISEC one of 29 participating archives
AIMS
•develop consensus on best current practice for digital archiving of language resources
•develop network of interoperating repositories & services for housing & accessing such resources
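
Interoperation in the Open Archives Initiative world happens through OAI-PMH, in which a harvester requests records from a repository with simple HTTP queries. The sketch below builds a `ListRecords` request URL without sending it. The base URL is a placeholder, and the `olac` metadata prefix is assumed to be the one an OLAC data provider exposes; neither is taken from the talk.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "olac",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL.

    from_date, if given, asks only for records changed since that date
    (ISO format, e.g. "2005-01-01"), which allows incremental harvests.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder endpoint; substitute the repository's real OAI base URL.
    print(list_records_url("http://example.org/oai"))
    print(list_records_url("http://example.org/oai", from_date="2005-01-01"))
```

A repository that answers such requests from a live database supports dynamic harvesting, whereas a static repository publishes a single XML file for harvesters to fetch.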
Metadata OLAC harvest
lacito.vjf.cnrs.fr/archivage
www.uaf.edu/anlc/
emeld.org
www.ailla.utexas.org
paradisec.org.au
www.arts.auckland.ac.nz/ant
www.aiatsis.gov.au
www.hrelp.org/archive/
www.mpi.nl/DOBES
DELAMAN connections www.delaman.org
General Ontology for Linguistic Description
Music Description Ontologies?
•Much more complicated situation because of commercial music industry interests
•Most ontologies designed for commercial music (albums, tracks, composers etc) or Western music notation (diatonic scale etc)
•Most recent ethnomusicological discourse concentrates on social context rather than description or analysis and suspicious of universalist approaches
•Some current initiatives e.g. EU MusicNetwork
Issues for semantic web
•Small-scale specialist archive with few staff and precarious funding - not resourced for huge amount of work for RDF markup
•Curator-intensive - cannot be readily automated
•Need to motivate and involve researchers and communities in description as well as high-level ICT advisors
•Present highest priority salvage of endangered media
•Lack of appropriate ontologies especially for music
But…
•We have a good foundation - well-structured data and metadata (for whole-item level) conforming to international standards
•We are in conversation with international disciplinary communities through OLAC, EMELD, DELAMAN
•Our collection is of high cultural heritage and scholarly value, of interest to international community
•We are motivated to learn more from other large-scale distributed digital data archives
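
Because item-level metadata is already structured, a first step towards RDF could in principle be mechanical. The sketch below turns one item's fields into N-Triples statements using Dublin Core element URIs as predicates. It illustrates the idea only: the subject URI, the field names, and the sample values are hypothetical, and the richer description the semantic web calls for (linguistic or musical ontologies) is exactly the part that cannot be generated this way.

```python
# Emit item-level metadata as N-Triples with Dublin Core predicates.
# The subject URI scheme and the sample item are hypothetical.

DC = "http://purl.org/dc/elements/1.1/"


def to_ntriples(subject_uri: str, metadata: dict) -> str:
    """Return one N-Triples line per metadata field, with literal objects."""
    lines = []
    for field, value in metadata.items():
        # Escape backslashes and double quotes inside the literal.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject_uri}> <{DC}{field}> "{escaped}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    item = {"title": "Sample recording", "language": "Example"}
    print(to_ntriples("http://example.org/item/XX1-001", item))
```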
PARADISEC gratefully acknowledges support from:
•Partner Universities (Sydney, Melbourne, ANU, UNE)
•Australian Research Council LIEF scheme
•Australian Partnership for Sustainable Repositories (SORRT testbed)
•Australian Partnership for Advanced Computing
•Grangenet
•ANU Internet Futures
Contact us
•http://www.paradisec.org.au
•Linda.Barwick@paradisec.org.au (Director)
•Nicholas.Thieberger@paradisec.org.au (Project Manager)
Relevant URLs
•PARADISEC website http://paradisec.org.au/
•PARADISEC repository login http://store.apac.edu.au/cgi-bin/pdsc-v3.0.cgi/login
•PARADISEC streaming trial http://paradisec.org.au/streamingtrial.html
•Transcript page image trial http://www.austehc.unimelb.edu.au/~gavan/lana/hdms.htm
•EMELD General Ontology for Linguistic Description http://www.emeld.org/tools/ontology.cfm