
Digital Asset Management and Publication with LadyBird


Eric James
Programmer/Analyst, Library IT
Yale University
[email protected]

12 July 2013


What is LadyBird?

• Bebop song by Tadd Dameron
• First Lady, Lyndon B. Johnson presidency
• Old dog from King of the Hill
• Digital asset management tool

LadyBird - Digital Asset Management Tool

From its origin, LadyBird has been a system that processes metadata and temporarily houses digital assets to be published. It provides a configurable system for migrating digital objects and collections, normalizing metadata, and preserving and publishing content.

It was initially written in Microsoft .NET and C#, hosted on Windows 2008 using Microsoft SQL Server 2008.

Some work has been done on Java modules (for import).
Wish list: migrate to JRuby/Rails.

LadyBird components

• Web interface
• Job processing engine - imports
• Export processing engine - exports
• Bag creation
• Heartbeat monitor
• Application cleanup system

This presentation focuses on the workflow and concepts involved in publishing digital objects with metadata to Fedora.

LadyBird concepts I

• Core of the application is the object table
• Collection - departments within the library and Yale (comes into play later when discussing the c# tables)
• Project - projects specific to a collection
• An object belongs to a project and a project belongs to a collection
• Currently 16 collections with 34 projects and 1.53 million objects
• We call objects "oids"; technically "oid" means the object id column of the object table, but we tend to use it to describe the whole ball of wax
• User table - the cataloger is registered, and role and permission settings are used throughout the app

LadyBird concepts II

• Processing objects is all about the spreadsheet
• Each row is an object
• Each column represents either a function or a metadata field
• Function examples:
  {F1} is the object as identified by oid (primary key of the object table); if left blank, that is the signal to create a new oid
  {F4} is the parent oid (for complex objects)
  {F40} can have the value PUBLISH, telling LadyBird to auto-publish this object
• Metadata examples: {FDID=58} call number, {FDID=262} Host, creator, etc.

The cataloger can take advantage of Excel functionality (like repeating fields) to quickly create a spreadsheet for batch import.
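The column conventions above can be sketched in Ruby. This is only an illustration of the idea, not LadyBird's import code (which is C#): the header syntax ({F1}, {FDID=58}) comes from the slide, while the method names, the hash layout, and the case-insensitive PUBLISH check are assumptions.

```ruby
# Hypothetical sketch: classify spreadsheet columns as functions ({F1}) or
# metadata fields ({FDID=58}). Only the header syntax comes from the slides;
# method names and the result layout are invented for illustration.

FUNCTION_HEADER = /\A\{F(\d+)\}\z/
FDID_HEADER     = /\A\{FDID=(\d+)\}\z/

# Turn one tab-delimited row into { functions: {...}, metadata: {...} },
# keyed by function number and fdid respectively.
def parse_row(header_line, row_line)
  headers = header_line.chomp.split("\t")
  values  = row_line.chomp.split("\t", -1) # -1 keeps trailing blank cells
  parsed  = { functions: {}, metadata: {} }

  headers.each_with_index do |header, i|
    value = values[i].to_s.strip
    if (m = FUNCTION_HEADER.match(header))
      parsed[:functions][m[1].to_i] = value
    elsif (m = FDID_HEADER.match(header))
      parsed[:metadata][m[1].to_i] = value unless value.empty?
    end
  end
  parsed
end

# A blank {F1} signals that a new oid should be created.
def new_object?(parsed)
  parsed[:functions].fetch(1, "").empty?
end

# {F40} = PUBLISH asks for automatic publication (compared case-insensitively,
# which is an assumption).
def auto_publish?(parsed)
  parsed[:functions].fetch(40, "").casecmp?("PUBLISH")
end
```

For a row with a blank {F1}, PUBLISH in {F40}, and a made-up call number in {FDID=58}, `new_object?` and `auto_publish?` both return true and the metadata hash holds the call number under key 58.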

LadyBird concepts III

field_definition (fdid) table (230 metadata fields)

51 Cataloger
52 Record source
53 Record date
54 Record modified date
55 Record ID
56 Local record ID
57 Local record ID, other
58 Call number
59 Accession number
60 Box

The values are either strings or acid values (more on acids later)

LadyBird concepts IV

• Import tables - all about the spreadsheets, though you can also import MARC or EAD records by bibid, barcode, or handle. In that case the records are deserialized into fdids, and any spreadsheet data overrides the records.

im_job (1 master row per spreadsheet)
im_job_exHead (column headers from the spreadsheet)
im_job_contents (values)
im_files (for files)
import_checksum (for files)
im_job_contents_history

• Job tracking (overall tracking associates an imported oid with a specific job)

trk_project
trk_job
trk_job_contents
trk_oid

LadyBird concepts V

• The c# tables - "c" for "current", "#" for each collection
• The "metadata home" - data imported to the im tables is finally transferred here
• There is a set of tables for each collection

Ex: # = 13 (collection: Hydra, project: Hydra Test)

c13 - master list of oids
c13_strings
c13_longstrings
c13_acid

Each row basically contains an oid/fdid/value, so given an oid one can get all metadata fields for that object as rows from this table. It also has a favid for additional values associated with the fdid.

There are also corresponding p# tables ("p" for "past") that keep an audit trail of any updates to specific oids.

The c# tables are designed for high volume. Exploring better options, such as hashing.
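As a rough illustration of the oid/fdid/value layout, the sketch below collects such rows into one metadata hash per object. It runs against an in-memory array rather than SQL Server; the struct, the method name, and the sample oids and values are all made up for the example.

```ruby
# Hypothetical sketch: collect oid/fdid/value rows (as they might come back
# from a c#_strings table) into one metadata hash per object. The row layout
# and sample values are assumptions, not the real schema.

MetadataRow = Struct.new(:oid, :fdid, :value)

# Group all rows for one oid by fdid; repeated fdids keep every value.
def metadata_for(rows, oid)
  rows.select { |row| row.oid == oid }
      .group_by(&:fdid)
      .transform_values { |matches| matches.map(&:value) }
end

# Made-up sample rows: fdid 58 is call number and 60 is box in the
# field_definition listing.
SAMPLE_ROWS = [
  MetadataRow.new(1001, 58, "Example call number"),
  MetadataRow.new(1001, 60, "Box 1"),
  MetadataRow.new(1001, 60, "Box 2"),
  MetadataRow.new(1002, 58, "Another call number")
].freeze
```

Calling `metadata_for(SAMPLE_ROWS, 1001)` returns a hash keyed by fdid, with both box values kept under key 60.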

LadyBird concepts VI

• Acid - authority control - a system for using controlled vocabulary for metadata fields

Fdid 62 = Host, Creator

acid    fdid  value
126434  62    Luhan, Mabel Dodge, 1879-1962
126626  62    Dobbs, Arthur, 1689-1765
126628  62    Filson, John, ca. 1747-1788
126630  62    Thomson, Charles, 1729-1824
126632  62    Hutchins, Thomas, 1730-1789
126635  62    Adair, James, ca. 1709-1783

So if, for an oid row in the spreadsheet, the fdid 62 column is given the value 126635, that field resolves to "Adair, James, ca. 1709-1783".

Currently 155,415 values.
Potential for more sophisticated uses with linked data.
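The acid lookup described above amounts to resolving an id to an authorized string. A minimal Ruby sketch follows, using a hash seeded with two of the rows shown on the slide as a stand-in for the acid table. The method name and the rule that an acid must belong to the same fdid are assumptions.

```ruby
# Hypothetical sketch of acid resolution. The lookup table is an in-memory
# stand-in for the acid table, seeded with two rows shown on the slide.

ACID_TABLE = {
  126434 => { fdid: 62, value: "Luhan, Mabel Dodge, 1879-1962" },
  126635 => { fdid: 62, value: "Adair, James, ca. 1709-1783" }
}.freeze

# Resolve a spreadsheet cell for an acid-controlled field. Returns the
# authorized string, or nil when the cell is not a known acid or the acid
# belongs to a different fdid (that second check is an assumption).
def resolve_acid(fdid, cell)
  entry = ACID_TABLE[Integer(cell, exception: false)]
  return nil unless entry && entry[:fdid] == fdid

  entry[:value]
end
```

With this table, `resolve_acid(62, "126635")` returns the Adair heading, while an unknown or non-numeric cell returns nil.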

LadyBird sample workflow start

• Workstation mounted with a job folder for both import and export

Windows: \\birdcage.library.yale.edu\project25\import\
Mac: SMB://birdcage.library.yale.edu/project25//import//
Windows: \\birdcage.library.yale.edu\project25\export\
Mac: SMB://birdcage.library.yale.edu/project25//export//

• Project25 corresponds to a project in the project table
• Create a folder in the import directory and drag files into folders or subfolders
• LadyBird will then detect that folder and create a job for it under the "Dashboard" menu selection

LadyBird dashboard

Add digital object to folder

Go to dashboard and process this folder

Receive email confirmation

Subject: LadyBird Import Complete job: test_open_rep

Your import has been processed.
test_open_rep
Visit your dashboard in Ladybird for your most recent jobs.
http://ladybird.library.yale.edu/user_jobs.aspx

View job: http://ladybird.library.yale.edu/user_jobs.aspx?qa=query&qid=12307

* A jobcomplete.txt file containing the time is added to the import folder so the app knows that the directory is complete.
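The jobcomplete.txt marker is a simple "done" flag on the folder. A Ruby sketch of that pattern is below. The file name comes from the slide, but the ISO 8601 timestamp format and both method names are assumptions, and the real application is written in C#.

```ruby
require "time" # for Time#iso8601

# Marker file name as given on the slide.
JOB_MARKER = "jobcomplete.txt"

# True when the import folder has already been processed.
def job_complete?(folder)
  File.exist?(File.join(folder, JOB_MARKER))
end

# Write the marker with the completion time. The ISO 8601 format is an
# assumption; the slides only say the file contains "the time".
def mark_job_complete(folder, now = Time.now)
  File.write(File.join(folder, JOB_MARKER), now.iso8601)
end
```

A folder watcher could then skip any directory for which `job_complete?` returns true.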

View job

View set

New object -> Metadata (form)

Or from View Set, "Export as Job"

Receive export email confirmation

Subject: LadyBird Export Ready

Your export is ready.
\\birdcage\project25\export\ermadmix_46371_06262013_165116.xls

Spreadsheet - fill in and save as tab-delimited text file

Import

Import email confirmation

Subject: LadyBird Import Complete job: ermadmix_import_062613_171134

Your import has been processed.
ermadmix_import_062613_171134
Visit your dashboard in Ladybird for your most recent jobs.
http://ladybird.library.yale.edu/user_jobs.aspx

View job: http://ladybird.library.yale.edu/user_jobs.aspx?qa=query&qid=12313

Publish

• Publishes automatically if {F40}=publish
• Or use the interface to check the file and metadata and explicitly click the publish button

Publish (behind the scenes)

• Oid is added to the hydra table with date (when added) and date published (when processing is complete) timestamps

id     oid       date                     date published
…      …         …                        …
39176  10684347  2013-06-26 16:01:11.043  2013-06-26 17:14:05.900
39177  10684348  2013-06-26 16:01:11.043  2013-06-26 17:14:07.457
39178  10684349  2013-06-26 16:01:11.043  2013-06-26 17:14:09.017
39179  10684350  2013-06-26 16:01:11.043  2013-06-26 17:14:10.577
39180  10684351  2013-06-26 16:01:11.043  2013-06-26 17:14:12.137
39181  10684352  2013-06-26 16:01:11.043  2013-06-26 17:14:13.697
…      …         …                        …

oid added to hydra_publish table

Key fields:
hpid: 23703
hcmid: 2
cid: 9
pid: 27
oid: 10681633
_oid: 0
zindex: 0
hydraID: null
dateReady: 2013-06-26 16:01:55.430
dateHydraStart: null

Rows for oid added to hydra_publish_path table

Key fields w/ example:
hppid: 139004
hpid: 26340
type: jp2
pathHTTP: http://lbfiles.library.yale.edu/10684274.jp2
pathUNC: \\storage.yale.edu\home\ladybird-801001-yul\ladybird\project27\publish\dl\10684274\1758.02.00.00_page1.jp2
md5: 35433b00ca9de2cdaed275c455339090
controlGroup: M
mimeType: image/jp2
dsid: jp2
ingestMethod: filepath
oidPointer: null

hydra_publish_path - typical files

xml rights (hydra rights)
xml metadata (MODS descMetadata)
xml access (home grown granular rights)
pdf (transcript YIPP)
pdf2 (annotated transcript YIPP)
jp2 (derivative)
jpg (derivatives)
tif (master)

descMetadata - creation

There is a service (a C# class and methods), called upon hydra publish, that iterates through all the fdids for an oid and uses the XML DOM to create a MODS file. This is basically a mapping of field definitions to the MODS schema.

There is the potential to map the fdids to any metadata format.
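The fdid-to-MODS mapping idea can be sketched in Ruby with REXML standing in for the C# XML DOM. The actual mapping used by the service is not shown in these slides, so the two entries below (call number to location/shelfLocator, record ID to recordInfo/recordIdentifier) are illustrative assumptions, as is the method name.

```ruby
require "rexml/document"

MODS_NS = "http://www.loc.gov/mods/v3"

# Hypothetical fdid -> MODS element path mapping. The real service's mapping
# is not shown in the slides; these two entries are for illustration only.
FDID_TO_MODS = {
  58 => %w[location shelfLocator],      # call number
  55 => %w[recordInfo recordIdentifier] # record ID
}.freeze

# Build a MODS document from a { fdid => value } hash by walking the mapping
# and creating one nested element path per mapped field.
def build_mods(metadata)
  doc  = REXML::Document.new
  root = doc.add_element("mods", "xmlns" => MODS_NS)

  metadata.each do |fdid, value|
    path = FDID_TO_MODS[fdid]
    next unless path # unmapped fdids are skipped in this sketch

    node = path.inject(root) { |parent, name| parent.add_element(name) }
    node.text = value
  end
  doc
end
```

Swapping `FDID_TO_MODS` for a different table is what would let the same loop target another metadata format.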

accessMetadata

Rights metadata

Transition into fedora hydra world

select * from hydra_content_model

id  date                     uid  contentModel
1   2013-04-25 08:50:20.043  1    simple
2   2013-04-25 08:50:26.350  1    complexParent
3   2013-04-25 08:50:30.420  1    complexChild

contentModel maps to an ActiveFedora model

Transition into fedora hydra world II

select * from hydra_content_model_ds

id  date                     uid  hcmid  dsid            ingMethod  required
1   2013-04-25 08:56:11.670  1    1      accessMetadata  pullHTTP   y
2   2013-04-25 08:56:11.670  1    1      descMetadata    pullHTTP   y
3   2013-04-25 08:56:11.670  1    1      rightsMetadata  pullHTTP   y
4   2013-04-25 08:56:11.670  1    1      tif             filepath   y
5   2013-04-25 08:56:11.670  1    1      jp2             filepath   y
6   2013-04-25 08:56:11.670  1    1      jpg             filepath   y
7   2013-04-25 08:56:11.670  1    2      accessMetadata  pullHTTP   y
8   2013-04-25 08:56:11.670  1    2      descMetadata    pullHTTP   y
9   2013-04-25 08:56:11.670  1    2      rightsMetadata  pullHTTP   y
10  2013-04-25 08:56:11.670  1    2      tif             filepath   n
11  2013-04-25 08:56:11.673  1    2      jp2             filepath   n
12  2013-04-25 08:56:11.673  1    2      jpg             filepath   n
13  2013-04-25 08:56:11.673  1    3      accessMetadata  pullHTTP   y
14  2013-04-25 08:56:11.673  1    3      descMetadata    pullHTTP   y
15  2013-04-25 08:56:11.673  1    3      rightsMetadata  pullHTTP   y
16  2013-04-25 08:56:11.673  1    3      tif             filepath   y
17  2013-04-25 08:56:11.673  1    3      jp2             filepath   y
18  2013-04-25 08:56:11.673  1    3      jpg             filepath   y
19  2013-05-31 10:48:25.620  1    2      oidPointer      pointer    n
20  2013-06-07 11:03:24.537  1    2      pdf             filepath   n
21  2013-06-07 11:03:52.933  1    2      pdf2            filepath   n
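One way to read the "required" column is as a pre-ingest check: every datastream marked y must have a staged file before the object is ingested. The Ruby sketch below applies that reading to some of the complexParent (hcmid 2) rows from the listing. The check itself and all names in it are assumptions about how the column might be used.

```ruby
# Datastream definitions for one content model. The rows are copied from the
# hydra_content_model_ds listing for hcmid 2 (complexParent); the checking
# logic is an assumption about how "required" might be used.
DatastreamDef = Struct.new(:dsid, :ingest_method, :required)

COMPLEX_PARENT_DATASTREAMS = [
  DatastreamDef.new("accessMetadata", "pullHTTP", true),
  DatastreamDef.new("descMetadata", "pullHTTP", true),
  DatastreamDef.new("rightsMetadata", "pullHTTP", true),
  DatastreamDef.new("tif", "filepath", false),
  DatastreamDef.new("jp2", "filepath", false),
  DatastreamDef.new("jpg", "filepath", false)
].freeze

# Return the dsids that are marked required but are not among the staged ones.
def missing_required(definitions, staged_dsids)
  definitions.select(&:required).map(&:dsid) - staged_dsids
end
```

An empty result would mean the object has everything its content model demands.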

Example - simple content model

require "active-fedora"

class Simple < ActiveFedora::Base
  # RELS-EXT relationship to the parent collection
  belongs_to :collection, :property => :is_member_of

  # Datastreams attached to every simple object
  has_metadata :name => 'descMetadata', :type => Hydra::Datastream::SimpleMods
  has_metadata :name => 'accessMetadata', :type => Hydra::Datastream::AccessConditions
  has_metadata :name => 'rightsMetadata', :type => Hydra::Datastream::Rights
  has_metadata :name => 'propertyMetadata', :type => Hydra::Datastream::Properties

  # Expose the ladybird properties as single-valued attributes on the object
  delegate :oid, :to => "propertyMetadata", :unique => true
  delegate :projid, :to => "propertyMetadata", :unique => true
  delegate :cid, :to => "propertyMetadata", :unique => true
  delegate :zindex, :to => "propertyMetadata", :unique => true
  delegate :parentoid, :to => "propertyMetadata", :unique => true
end

Example - Properties datastream

require 'active_fedora'

module Hydra
  module Datastream
    class Properties < ActiveFedora::OmDatastream

      # ERJ note: ladybird pid = projid, ladybird _oid = parentoid
      set_terminology do |t|
        t.root(:path => "root")

        t.oid(:path => "oid")
        t.cid(:path => "cid")
        t.projid(:path => "projid")
        t.zindex(:path => "zindex")
        t.parentoid(:path => "parentoid")
        t.ztotal(:path => "ztotal")
        t.oidpointer(:path => "oidpointer")
      end

      def to_solr(solr_doc = Hash.new)
        super(solr_doc)
        solr_doc['oid_isi'] = oid
        solr_doc['cid_isi'] = cid
        solr_doc['projid_isi'] = projid
        solr_doc['zindex_isi'] = zindex
        solr_doc['parentoid_isi'] = parentoid
        solr_doc['ztotal_isi'] = ztotal
        solr_doc['oidpointer_isi'] = oidpointer
        solr_doc
      end
    end
  end
end

Workflow review

1. Add folder with files to import folder.
2. Process folder. This will create the records in the database (oids, job tracking, c# instances, and file derivatives).
3. Export spreadsheet. This will create a spreadsheet template for the folder of files in (1).
4. Fill in metadata in spreadsheet - the main cataloging task.
5. Import spreadsheet. This will ultimately populate the c# tables with metadata from the oid rows of the spreadsheet.
6. Publish to hydra. This will populate the hydra tables, with serialized metadata files (MODS, access rights), and stage files in storage for ingest.

Ingest task

• Set up within a hydra project
• gem 'tiny_tds' to connect to the ladybird SQL Server database

app/models (objects)

• collection.rb - maps to pid (project) in ladybird, parent to simple.rb and complex_parent.rb
• simple.rb - 1 image w/derivatives, no hierarchy
• complex_parent.rb - parent to a set of images (like a book or image set)
• complex_child.rb - 1 image w/derivatives (like a page)

These relate to the hydra_content_model table

app/models (datastreams)

• coll_properties.rb
• properties.rb
• rights.rb
• access_conditions.rb
• simple_mods.rb

simple_mods.rb - indexing

rake yulhy4:ingest I

Properties:
• SQL Server connection config
• Mount of ladybird storage

Uses the hydra_publish table as a queue (driven by this query until done):

select top 1 a.hpid, a.oid, a.cid, a.pid, b.contentModel, a._oid
from dbo.hydra_publish a, dbo.hydra_content_model b
where a.dateHydraStart is null
  and a.dateReady is not null
  and a._oid = 0
  and a.hcmid is not null
  and a.hcmid = b.hcmid
  and a.action = 'insert'
order by a.dateReady
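The table-as-queue pattern can be sketched in plain Ruby by applying the same predicate as the query to in-memory rows. This is a simulation only: it does not use tiny_tds, the struct and method names are invented, and stamping the start time to claim a row is an assumption about when dateHydraStart gets set.

```ruby
# In-memory stand-in for hydra_publish rows. parent_oid plays the role of the
# _oid column (0 means the row is not a child object).
QueueRow = Struct.new(:hpid, :oid, :parent_oid, :hcmid, :action,
                      :date_ready, :date_hydra_start, keyword_init: true)

# Same filter and ordering as the "select top 1 ..." query: oldest ready,
# unstarted, top-level insert.
def next_ready(rows)
  rows.select do |r|
    r.date_hydra_start.nil? && !r.date_ready.nil? && r.parent_oid == 0 &&
      !r.hcmid.nil? && r.action == "insert"
  end.min_by(&:date_ready)
end

# Repeat until the query comes back empty. Stamping date_hydra_start claims
# the row so it is not selected again (an assumption, see lead-in).
def drain_queue(rows)
  processed = []
  while (row = next_ready(rows))
    row.date_hydra_start = Time.now
    yield row if block_given?
    processed << row.hpid
  end
  processed
end
```

`drain_queue` returns the hpids in the order they were picked up, which should follow dateReady.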

rake yulhy4:ingest II

ActiveFedora ingest

Create new object based on content model:
obj = Simple.new
obj = ComplexParent.new
obj = ComplexChild.new
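Choosing the class from the contentModel value could look like the sketch below. The three empty classes are stand-ins so the example runs on its own; in the Hydra project they would be the ActiveFedora models under app/models. The lookup hash and method name are assumptions.

```ruby
# Stand-in classes so the sketch is self-contained; the real ones are the
# ActiveFedora models (simple.rb, complex_parent.rb, complex_child.rb).
class Simple; end
class ComplexParent; end
class ComplexChild; end

# contentModel values as they appear in the hydra_content_model table.
MODEL_CLASSES = {
  "simple"        => Simple,
  "complexParent" => ComplexParent,
  "complexChild"  => ComplexChild
}.freeze

# Instantiate the model for a contentModel string, failing loudly on a value
# the table does not define.
def build_object(content_model)
  klass = MODEL_CLASSES.fetch(content_model) do
    raise ArgumentError, "unknown content model: #{content_model}"
  end
  klass.new
end
```

An explicit hash keeps the set of allowed models visible and avoids turning arbitrary database strings into class names.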

rake yulhy4:ingest III

Iterate through all datastreams for the content model:

select hcmds.dsid as dsid, hcmds.ingestMethod as ingestMethod, hcmds.required as required
from dbo.hydra_content_model hcm, dbo.hydra_content_model_ds hcmds
where hcm.contentModel = '#{contentModel}'
  and hcm.hcmid = hcmds.hcmid

For each row from the above query, get the datastream info for the oid:

select type, pathHTTP, pathUNC, md5, controlGroup, mimeType, dsid, OIDpointer
from dbo.hydra_publish_path
where hpid = #{i["hpid"]}
  and dsid = '#{dsid}'

Verify checksums and use ActiveFedora to ingest the datastreams
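The checksum step can be sketched with Ruby's standard Digest library: compute the MD5 of the staged file and compare it with the md5 value from hydra_publish_path. The method name and the case-insensitive comparison are assumptions. How the rake task actually performs the check is not shown in the slides.

```ruby
require "digest/md5"

# Compare a staged file's MD5 with the md5 column from hydra_publish_path.
# Comparison ignores case and surrounding whitespace (an assumption about how
# the stored value might be formatted).
def checksum_matches?(path, expected_md5)
  Digest::MD5.file(path).hexdigest.casecmp?(expected_md5.to_s.strip)
end
```

A datastream would only be handed to ActiveFedora when this returns true; a mismatch suggests the file changed or was truncated between publish and ingest.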

rake yulhy4:ingest IV

Add ladybird-specific info to the properties datastream:
• oid
• cid
• pid
• zindex
• _oid

Add hierarchical info to RELS-EXT:
• simple and complex_parent - is_member_of a collection
• complex_child - is_member_of a complex_parent

Some discussion about adding more linked data.

rake yulhy4:ingest V

rake yulhy4:ingest VI

Blacklight

Review

Future

hydra_publish - revise already ingested content
• action='update'
• action='insert'

Archivematica (by Artefactual)
• Replace the ingest task with a custom workflow
• GUI interface
• Human decision points and manual processing
• Technical metadata generation (FITS)
• Provenance (jhove)
• Issues - how to employ OAIS packages (SIP, AIP, DIP) for objects without a natural package structure?

Contributors

• Eric James
• Lakeisha Robinson
• Kalee Sprague
• Osman Din
• Jay Terray
• Rebekeh Irwin
• Mike Friscia

Thank you
