Digital Asset Management and Publication with LadyBird
Eric James, programmer/analyst, Library IT, Yale University [email protected]
12 July 2013
What is LadyBird?
• Bebop song by Tadd Dameron
• First Lady, Lyndon B. Johnson presidency
• Old dog from King of the Hill
• Digital asset management tool
LadyBird - Digital Asset Management Tool
From its origin, LadyBird has been a system that processes metadata and temporarily houses digital assets to be published. It provides a configurable system for migrating digital objects and collections, normalizing metadata, and preserving and publishing content.
It was initially written in Microsoft .NET and C#, hosted on Windows Server 2008 using Microsoft SQL Server 2008.
Some work has been done on Java modules (for import).
Wish list: migrate to JRuby/Rails.
LadyBird components
• Web interface
• Job processing engine - imports
• Export processing engine - exports
• Bag creation
• Heartbeat monitor
• Application cleanup system
• This presentation focuses on the workflow and concepts involved in publishing digital objects with metadata to Fedora
LadyBird concepts I
• Core of the application is the object table
• Collection: departments within the library and Yale (this comes into play later when discussing the c# tables)
• Project: projects specific to a collection
• An object belongs to a project and a project belongs to a collection
• Currently 16 collections with 34 projects and 1.53 million objects
• We call objects "oids"; technically "oid" is the object id column of the object table, but we tend to use it to describe the whole ball of wax
• User table: the cataloger is registered, and role and permission settings are used throughout the app
LadyBird concepts II
• Processing objects is all about the spreadsheet
• Each row is an object
• Each column represents either functions or metadata
• Function examples:
• {F1} is the object as identified by oid (primary key of the object table); if left blank, that is the signal to create a new oid
• {F4} parent oid (for complex objects)
• {F40} can have the value PUBLISH, telling LadyBird to auto-publish this object
• Metadata examples: {FDID=58} call number, {FDID=262} Host, creator, etc.
The cataloger can take advantage of Excel functionality (like repeating fields) to quickly create a spreadsheet for batch import.
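
The column conventions above can be sketched in a few lines of Ruby. This is only an illustration, not LadyBird's import code: the helper names `classify_header` and `parse_job`, the sample call numbers, and the case-insensitive match on PUBLISH are all assumptions.

```ruby
# Hypothetical helper: classify one spreadsheet column header.
# "{F<n>}" marks a function column, "{FDID=<n>}" a metadata column.
def classify_header(header)
  if (m = header.match(/\A\{F(\d+)\}\z/))
    [:function, m[1].to_i]
  elsif (m = header.match(/\A\{FDID=(\d+)\}\z/))
    [:metadata, m[1].to_i]
  else
    [:unknown, nil]
  end
end

# Parse tab-delimited text (first line = headers) into one hash per object row.
def parse_job(text)
  lines = text.lines.map(&:chomp).reject(&:empty?)
  headers = lines.shift.split("\t", -1).map { |h| classify_header(h) }
  lines.map do |line|
    row = { functions: {}, metadata: {} }
    line.split("\t", -1).each_with_index do |value, i|
      kind, id = headers[i]
      next if kind.nil? || kind == :unknown
      (kind == :function ? row[:functions] : row[:metadata])[id] = value
    end
    # A blank {F1} is the signal to create a new oid.
    row[:create_new_oid] = row[:functions][1].to_s.empty?
    # Assumption: the PUBLISH flag in {F40} is matched case-insensitively.
    row[:auto_publish] = row[:functions][40].to_s.upcase == "PUBLISH"
    row
  end
end

# Illustrative data only: the call numbers are made up.
SAMPLE = "{F1}\t{F4}\t{F40}\t{FDID=58}\n" \
         "\t\tPUBLISH\tMS 123\n" \
         "10684347\t\t\tMS 124\n"

parse_job(SAMPLE).each { |row| p row }
```

The first sample row has a blank {F1}, so it would create a new oid and auto-publish; the second updates an existing oid.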
LadyBird concepts III
field_definition (fdid) table (230 metadata fields)
51 Cataloger
52 Record source
53 Record date
54 Record modified date
55 Record ID
56 Local record ID
57 Local record ID, other
58 Call number
59 Accession number
60 Box
The values are either strings or acid values (more on acids later)
LadyBird concepts IV
• Import tables: all about the spreadsheets, though you can also import MARC or EAD records by bibid, barcode, or handle; in that case the records are deserialized into fdids, and any spreadsheet data overrides the records
im_job (1 master row per spreadsheet)
Im_job_exHead (column headers from the spreadsheet)
im_job_contents (values)
Im_files (for files)
import_checksum (for files)
im_job_contents_history
• Job tracking (overall tracking associates an imported oid with a specific job)
trk_project
trk_job
trk_job_contents
trk_oid
LadyBird concepts V
• The C# tables: c for "current", # for each collection
• The "metadata home": data imported to the im tables is finally transferred here
• There is a set of tables for each collection.
Ex: # = 13 (collection: Hydra, project: Hydra Test)
c13 - master list of oids
c13_strings
c13_longstrings
c13_acid
Each row basically contains an oid/fdid/value; thus, given an oid, one can get all metadata fields for that object as rows from these tables. Each row also has a favid for additional values associated with the fdid.
There are also corresponding p# tables, p for "past", that keep an audit trail of any updates to specific oids.
The C# tables are designed for high volume. Exploring better options, such as hashing.
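
To make the oid/fdid/value row layout concrete, here is a minimal in-memory sketch. The struct, the helper name `metadata_for`, and the sample values are assumptions for illustration; the real data lives in SQL Server tables such as c13_strings.

```ruby
# One row of a c#-style table as described above (member names assumed).
MetadataRow = Struct.new(:oid, :fdid, :value)

# Collect every value stored for one oid, grouped by fdid.
def metadata_for(rows, oid)
  rows.select { |r| r.oid == oid }
      .group_by(&:fdid)
      .transform_values { |matches| matches.map(&:value) }
end

# Illustrative rows: fdid 58 = Call number, fdid 60 = Box; values are made up.
ROWS = [
  MetadataRow.new(10684347, 58, "MS 123"),
  MetadataRow.new(10684347, 60, "Box 1"),
  MetadataRow.new(10684347, 60, "Box 2"),
  MetadataRow.new(10684348, 58, "MS 124"),
].freeze

p metadata_for(ROWS, 10684347)
```

Because each value is its own row, a repeating field simply appears as several rows with the same oid and fdid.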
LadyBird concepts VI
• Acid: authority control, a system for using controlled vocabulary for metadata fields
Fdid 62 = Host, Creator
acid    fdid  value
126434  62    Luhan, Mabel Dodge, 1879-1962
126626  62    Dobbs, Arthur, 1689-1765
126628  62    Filson, John, ca. 1747-1788
126630  62    Thomson, Charles, 1729-1824
126632  62    Hutchins, Thomas, 1730-1789
126635  62    Adair, James, ca. 1709-1783
So if, for an oid row in the spreadsheet, the fdid 62 column is given the value 126635, that field resolves to Adair, James, ca. 1709-1783.
Currently 155,415 values.
Potential for more sophisticated uses with linked data.
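
Acid resolution as described above amounts to a keyed lookup. The sketch below uses two rows from the sample table; the helper name `resolve` and the fallback rule (anything that is not a registered acid for that fdid is kept as a plain string) are assumptions, not documented LadyBird behavior.

```ruby
# Two sample rows from the acid table above: acid => [fdid, value].
ACIDS = {
  126434 => [62, "Luhan, Mabel Dodge, 1879-1962"],
  126635 => [62, "Adair, James, ca. 1709-1783"],
}.freeze

# Resolve a spreadsheet cell for a given fdid.
# Assumption: a cell that is not an acid registered for this fdid
# is returned unchanged as a plain string value.
def resolve(fdid, cell)
  return cell unless cell.match?(/\A\d+\z/)
  acid_fdid, value = ACIDS[cell.to_i]
  acid_fdid == fdid ? value : cell
end

puts resolve(62, "126635")
```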
LadyBird sample workflow start
• Workstation mounted with a job folder for both import and export
Windows: \\birdcage.library.yale.edu\project25\import\
Mac: SMB://birdcage.library.yale.edu/project25//import//
Windows: \\birdcage.library.yale.edu\project25\export\
Mac: SMB://birdcage.library.yale.edu/project25//export//
• Project25 corresponds to the project table
• Create a folder in the import directory and drag files into folders or subfolders
• LadyBird will then detect that folder and create a job for it under the "Dashboard" menu selection
LadyBird dashboard
Add digital object to folder
Go to dashboard and process this folder
Receive email confirmation
Subject: LadyBird Import Complete job: test_open_rep
Your import has been processed.
test_open_rep
Visit your dashboard in Ladybird for your most recent jobs.
http://ladybird.library.yale.edu/user_jobs.aspx
View job: http://ladybird.library.yale.edu/user_jobs.aspx?qa=query&qid=12307
* A jobcomplete.txt file with the time is added to the import folder so the app knows that the directory is complete
View job
View set
New object->Metadata (form)
Or from View Set, "Export as Job"
Receive export email confirmation
Subject: LadyBird Export Ready
Your export is ready. \\birdcage\project25\export\ermadmix_46371_06262013_165116.xls
Spreadsheet - fill in and save as tab-delimited text file
Import
Import Email Confirmation
Subject: LadyBird Import Complete job: ermadmix_import_062613_171134
Your import has been processed.
ermadmix_import_062613_171134
Visit your dashboard in Ladybird for your most recent jobs.
http://ladybird.library.yale.edu/user_jobs.aspx
View job: http://ladybird.library.yale.edu/user_jobs.aspx?qa=query&qid=12313
Publish
• Publishes automatically if {F40}=publish
• Or can use the interface to check file and metadata and explicitly click the publish button
Publish (behind the scenes)
• Oid is added to the hydra table with date (when added) and date published (when processing complete) timestamps
id     oid       date                     date published
…      …         …                        …
39176  10684347  2013-06-26 16:01:11.043  2013-06-26 17:14:05.900
39177  10684348  2013-06-26 16:01:11.043  2013-06-26 17:14:07.457
39178  10684349  2013-06-26 16:01:11.043  2013-06-26 17:14:09.017
39179  10684350  2013-06-26 16:01:11.043  2013-06-26 17:14:10.577
39180  10684351  2013-06-26 16:01:11.043  2013-06-26 17:14:12.137
39181  10684352  2013-06-26 16:01:11.043  2013-06-26 17:14:13.697
…      …         …                        …
oid added to hydra_publish table
Key fields:
hpid: 23703
hcmid: 2
cid: 9
Pid: 27
Oid: 10681633
_oid: 0
zindex: 0
hydraID: null
dateReady: 2013-06-26 16:01:55.430
dateHydraStart: null
Rows for oid added to hydra_publish_path table
Key fields w/ example:
hppid: 139004
Hpid: 26340
Type: jp2
pathHTTP: http://lbfiles.library.yale.edu/10684274.jp2
pathUNC: \\storage.yale.edu\home\ladybird-801001-yul\ladybird\project27\publish\dl\10684274\1758.02.00.00_page1.jp2
Md5: 35433b00ca9de2cdaed275c455339090
controlGroup: M
mimeType: image/jp2
Dsid: jp2
ingestMethod: filepath
oidPointer: null
Hydra_publish_path - typical files
xml rights (hydra rights)
xml metadata (MODS descMetadata)
xml access (home-grown granular rights)
pdf (transcript YIPP)
pdf2 (annotated transcript YIPP)
jp2 (derivative)
jpg (derivatives)
tif (master)
descMetadata - creation
There is a service (C# class and methods), called upon hydra publish, that iterates through all the fdids for an oid and uses the XML DOM to create a MODS file. This is basically a mapping of field definitions to the MODS schema.
There is the potential to map the fdids to any metadata format.
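
The production service is a C# class that is not shown in the slides. As a rough sketch of the same idea in Ruby, the code below walks a set of fdid values and builds a MODS document through a DOM API (REXML). The fdid-to-element mapping in `FDID_TO_MODS`, the helper name `build_mods`, and the sample call number are hypothetical; only the fdid numbers and the MODS namespace come from known sources.

```ruby
require "rexml/document"

MODS_NS = "http://www.loc.gov/mods/v3"

# Hypothetical mapping: fdid => element path to create under <mods>.
# The real LadyBird mapping is not shown in the slides.
FDID_TO_MODS = {
  58 => %w[classification], # Call number (illustrative choice of element)
  62 => %w[name namePart],  # Host, Creator
}.freeze

# Build a MODS document from { fdid => [values] } for one oid.
def build_mods(fields)
  doc = REXML::Document.new
  root = doc.add_element("mods", "xmlns" => MODS_NS)
  fields.each do |fdid, values|
    path = FDID_TO_MODS[fdid]
    next if path.nil? # skip fdids that have no mapping
    values.each do |value|
      # Create the nested elements for this value, then set the leaf text.
      leaf = path.inject(root) { |parent, name| parent.add_element(name) }
      leaf.text = value
    end
  end
  doc
end

mods = build_mods({ 58 => ["MS 123"], 62 => ["Adair, James, ca. 1709-1783"] })
xml = +""
mods.write(xml)
puts xml
```

Swapping in a different mapping table is what would let the same loop target another metadata format.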
accessMetadata
Rights metadata
Transition into fedora hydra world
select * from hydra_content_model
id  date                     uid  contentModel
1   2013-04-25 08:50:20.043  1    simple
2   2013-04-25 08:50:26.350  1    complexParent
3   2013-04-25 08:50:30.420  1    complexChild
ContentModel maps to ActiveFedora model
Transition into fedora hydra world II
select * from hydra_content_model_ds
id  date                     uid  hcmid  dsid            ingMethod  required
1   2013-04-25 08:56:11.670  1    1      accessMetadata  pullHTTP   y
2   2013-04-25 08:56:11.670  1    1      descMetadata    pullHTTP   y
3   2013-04-25 08:56:11.670  1    1      rightsMetadata  pullHTTP   y
4   2013-04-25 08:56:11.670  1    1      tif             filepath   y
5   2013-04-25 08:56:11.670  1    1      jp2             filepath   y
6   2013-04-25 08:56:11.670  1    1      jpg             filepath   y
7   2013-04-25 08:56:11.670  1    2      accessMetadata  pullHTTP   y
8   2013-04-25 08:56:11.670  1    2      descMetadata    pullHTTP   y
9   2013-04-25 08:56:11.670  1    2      rightsMetadata  pullHTTP   y
10  2013-04-25 08:56:11.670  1    2      tif             filepath   n
11  2013-04-25 08:56:11.673  1    2      jp2             filepath   n
12  2013-04-25 08:56:11.673  1    2      jpg             filepath   n
13  2013-04-25 08:56:11.673  1    3      accessMetadata  pullHTTP   y
14  2013-04-25 08:56:11.673  1    3      descMetadata    pullHTTP   y
15  2013-04-25 08:56:11.673  1    3      rightsMetadata  pullHTTP   y
16  2013-04-25 08:56:11.673  1    3      tif             filepath   y
17  2013-04-25 08:56:11.673  1    3      jp2             filepath   y
18  2013-04-25 08:56:11.673  1    3      jpg             filepath   y
19  2013-05-31 10:48:25.620  1    2      oidPointer      pointer    n
20  2013-06-07 11:03:24.537  1    2      pdf             filepath   n
21  2013-06-07 11:03:52.933  1    2      pdf2            filepath   n
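
One way the required column could be used is to check, before ingest, that every required datastream for a content model has been staged. The slides do not say that the ingest task does this, so treat the check and the helper name `missing_required` as assumptions; the dsid values are the hcmid 2 (complexParent) rows from the table above.

```ruby
# dsid => required?, copied from the hcmid 2 (complexParent) rows above.
COMPLEX_PARENT_DS = {
  "accessMetadata" => true,
  "descMetadata"   => true,
  "rightsMetadata" => true,
  "tif"            => false,
  "jp2"            => false,
  "jpg"            => false,
  "oidPointer"     => false,
  "pdf"            => false,
  "pdf2"           => false,
}.freeze

# Return the required dsids that are absent from the staged datastreams.
def missing_required(model_ds, staged_dsids)
  model_ds.select { |_dsid, required| required }.keys - staged_dsids
end

p missing_required(COMPLEX_PARENT_DS, %w[descMetadata jp2])
```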
Example - simple content model
require "active-fedora"

# Content model for a single image with derivatives (no hierarchy)
class Simple < ActiveFedora::Base
  belongs_to :collection, :property => :is_member_of

  # Datastreams attached to every Simple object
  has_metadata :name => 'descMetadata', :type => Hydra::Datastream::SimpleMods
  has_metadata :name => 'accessMetadata', :type => Hydra::Datastream::AccessConditions
  has_metadata :name => 'rightsMetadata', :type => Hydra::Datastream::Rights
  has_metadata :name => 'propertyMetadata', :type => Hydra::Datastream::Properties

  # LadyBird identifiers, stored in the properties datastream
  delegate :oid, :to => "propertyMetadata", :unique => true
  delegate :projid, :to => "propertyMetadata", :unique => true
  delegate :cid, :to => "propertyMetadata", :unique => true
  delegate :zindex, :to => "propertyMetadata", :unique => true
  delegate :parentoid, :to => "propertyMetadata", :unique => true
end
Example - Properties Datastream
require 'active_fedora'

module Hydra
  module Datastream
    class Properties < ActiveFedora::OmDatastream
      # ERJ note: ladybird pid = projid, ladybird _oid = parentoid
      set_terminology do |t|
        t.root(:path => "root")
        t.oid(:path => "oid")
        t.cid(:path => "cid")
        t.projid(:path => "projid")
        t.zindex(:path => "zindex")
        t.parentoid(:path => "parentoid")
        t.ztotal(:path => "ztotal")
        t.oidpointer(:path => "oidpointer")
      end

      # Add the LadyBird identifiers to the Solr document
      def to_solr(solr_doc = Hash.new)
        super(solr_doc)
        solr_doc['oid_isi'] = oid
        solr_doc['cid_isi'] = cid
        solr_doc['projid_isi'] = projid
        solr_doc['zindex_isi'] = zindex
        solr_doc['parentoid_isi'] = parentoid
        solr_doc['ztotal_isi'] = ztotal
        solr_doc['oidpointer_isi'] = oidpointer
        solr_doc
      end
    end
  end
end
Workflow review
1. Add folder with files to import folder
2. Process folder. This will create the records in the database (oids, job tracking, c# instances, and file derivatives)
3. Export spreadsheet. This will create a spreadsheet template for the folder of files in (1)
4. Fill in metadata in spreadsheet - the main cataloging task.
5. Import spreadsheet. This will ultimately populate the c# tables with metadata from the oid rows of the spreadsheet.
6. Publish to hydra. This will create the hydra table rows with serialized metadata files (MODS, access rights), and stage files in storage for ingest.
Ingest task
• Set up within a hydra project
• gem 'tiny_tds' to connect to the ladybird SQL Server database
app/models (objects)
• collection.rb - maps to pid (project) in ladybird, parent to simple.rb and complex_parent.rb
• simple.rb - 1 image w/derivatives, no hierarchy
• complex_parent.rb - parent to a set of images (like a book or image set)
• complex_child.rb - 1 image w/derivatives (like a page)
These relate to the hydra_content_model table
app/models (datastreams)
• coll_properties.rb
• properties.rb
• rights.rb
• access_conditions.rb
• simple_mods.rb
simple_mods.rb - indexing
rake yulhy4:ingest I
Properties:
• SQL Server connection config
• Mount of ladybird storage
Uses the hydra_publish table as a queue (driven by this query until done):
• select top 1 a.hpid,a.oid,a.cid,a.pid,b.contentModel,a._oid from dbo.hydra_publish a, dbo.hydra_content_model b where a.dateHydraStart is null and a.dateReady is not null and a._oid=0 and a.hcmid is not null and a.hcmid=b.hcmid and a.action='insert' order by a.dateReady
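
The queue pattern can be illustrated without a database. The sketch below mirrors the filter and ordering of the query above on in-memory rows; it omits the join to hydra_content_model, and it assumes that stamping dateHydraStart is what takes a row out of the queue. The struct, the helper names `next_ready` and `drain`, and the second sample row are hypothetical.

```ruby
require "time"

# In-memory stand-in for hydra_publish rows (the real task queries SQL Server).
# parent_oid stands for the _oid column.
QueueRow = Struct.new(:hpid, :oid, :parent_oid, :hcmid, :action,
                      :date_ready, :date_hydra_start)

# Same filter and ordering as the "select top 1 ..." query:
# the oldest ready, unstarted, top-level insert row, or nil when drained.
def next_ready(rows)
  candidates = rows.select do |r|
    r.date_hydra_start.nil? && !r.date_ready.nil? &&
      r.parent_oid == 0 && !r.hcmid.nil? && r.action == "insert"
  end
  candidates.min_by(&:date_ready)
end

# Process rows one at a time until the queue is empty.
def drain(rows, now: -> { Time.now })
  processed = []
  while (row = next_ready(rows))
    # Assumption: stamping the start date removes the row from the queue.
    row.date_hydra_start = now.call
    yield row if block_given?
    processed << row.hpid
  end
  processed
end

queue = [
  QueueRow.new(23703, 10681633, 0, 2, "insert",
               Time.parse("2013-06-26 16:01:55.430"), nil),
  # Hypothetical second row, ready slightly earlier.
  QueueRow.new(23702, 10681632, 0, 1, "insert",
               Time.parse("2013-06-26 16:01:50.000"), nil),
]
drain(queue) { |row| puts "ingesting oid #{row.oid}" }
```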
rake yulhy4:ingest II
ActiveFedora ingest
Create new object based on content model
obj = Simple.new
obj = ComplexParent.new
obj = ComplexChild.new
rake yulhy4:ingest III
Iterate through all datastreams for the content model
• select hcmds.dsid as dsid, hcmds.ingestMethod as ingestMethod, hcmds.required as required from dbo.hydra_content_model hcm, dbo.hydra_content_model_ds hcmds where hcm.contentModel = '#{contentModel}' and hcm.hcmid = hcmds.hcmid
For each row in the above query, get the datastream info for the oid
• select type,pathHTTP,pathUNC,md5,controlGroup,mimeType,dsid,OIDpointer from dbo.hydra_publish_path where hpid=#{i["hpid"]} and dsid='#{dsid}'
Verify checksums and use ActiveFedora to ingest datastreams
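
The checksum step can be sketched with Ruby's standard Digest library: compute the MD5 of the staged file and compare it with the md5 column from hydra_publish_path. The helper name `checksum_ok?` and the temporary test file are illustrative; the actual rake task is not shown in the slides.

```ruby
require "digest/md5"
require "tempfile"

# Compare a file's MD5 with the md5 value stored in hydra_publish_path.
# The comparison ignores letter case and surrounding whitespace.
def checksum_ok?(path, expected_md5)
  Digest::MD5.file(path).hexdigest.casecmp?(expected_md5.strip)
end

# Illustrative check against a temporary file standing in for a staged datastream.
Tempfile.create("datastream") do |f|
  f.write("hello")
  f.flush
  puts checksum_ok?(f.path, "5d41402abc4b2a76b9719d911017c592")
end
```

A mismatch would be the point to stop before handing the file to ActiveFedora.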
rake yulhy4:ingest IV
Add ladybird-specific info to properties datastream
• oid
• cid
• pid
• zindex
• _oid
Add hierarchical info to RELS-EXT
• Simple and complex_parent - is_member_of a collection
• Complex_child - is a member of a complex_parent
Some discussion about adding more linked data.
rake yulhy4:ingest V
rake yulhy4:ingest VI
Blacklight
Review
Future
Hydra_publish - revise already ingested content
• action='update'
• action='insert'
Archivematica (by Artefactual)
• Replace the ingest task with a custom workflow
• GUI interface
• Human decision points and manual processing
• Technical metadata generation (FITS)
• Provenance (JHOVE)
• Issues - how to employ OAIS packages (SIP, AIP, DIP) for objects without a natural package structure?
Contributors
• Eric James
• Lakeisha Robinson
• Kalee Sprague
• Osman Din
• Jay Terray
• Rebekeh Irwin
• Mike Friscia
Thank you