TRANSCRIPT
Toro 1
EMu on a Diet
Yale campus
Peabody Collections: Counts & Functional Cataloguing Unit
• Anthropology 325,000 Lot
• Botany 350,000 Individual
• Entomology 1,000,000 Lot
• Invertebrate Paleontology 300,000 Lot
• Invertebrate Zoology 300,000 Lot
• Mineralogy 35,000 Individual
• Paleobotany 150,000 Individual
• Scientific Instruments 2,000 Individual
• Vertebrate Paleontology 125,000 Individual
• Vertebrate Zoology 185,000 Lot / Individual
2.7 million database-able units => ~11 million items
Peabody Collections: Functional Units Databased
• Anthropology 325,000 90 %
• Botany 350,000 1 %
• Entomology 1,000,000 3 %
• Invertebrate Paleontology 300,000 60 %
• Invertebrate Zoology 300,000 25 %
• Mineralogy 35,000 85 %
• Paleobotany 150,000 60 %
• Scientific Instruments 2,000 100 %
• Vertebrate Paleontology 125,000 60 %
• Vertebrate Zoology 185,000 95 %
990,000 of 2.7 million => 37 % overall
The four YPM buildings
Peabody (YPM)
Environmental Science Center (ESC)
Geology / Geophysics (KGL)
175 Whitney (Anthropology)
VZ: Kristof Zyskowski (Vert. Zool. – ESC)
VZ: Greg Watkins-Colwell (Vert. Zool. – ESC)
HSI: Shae Trewin (Scientific Instruments – KGL)
VP: Mary Ann Turner (Vert. Paleo. – KGL / YPM)
ANT: Maureen DaRos (Anthro. – YPM / 175 Whitney)
Chart: % Databased vs. Collection Size (in 1000s of items)
Chart: % Databased vs. Collection Size (in 1000s of items), highlighting Botany, Entomology, Invertebrate Paleontology, and Invertebrate Zoology
Peabody Collections: Approximate Digital Timeline
• 1991 Systems Office created & staffed
• 1992 Argus collections databasing initiative started
• 1994 Gopher services launched for collections data
• 1997 Gopher mothballed, Web / HTTP services launched
• 1998 Physical move of many collections “begins”
• 2002 Physical move of many collections “ends”
• 2003 Search for Argus successor commences
• 2003 Informatics Office created & staffed
• 2004 KE EMu to succeed Argus, data migration begins
• 2005 Argus data migration ends, go-live in KE EMu
Big events
Physical move in ’98–’02 (primarily neontological disciplines)
EMu migration in ’05 (all disciplines went live simultaneously)
What do you do …
… when your EMu is out of shape & sluggish ?
The Peabody Museum Presents
What clued us in that we should put our EMu on a diet ?
Area of Server Occupied by Catalogue
980 megabytes in Argus
10,400 megabytes in EMu
Default EMu “cron” maintenance job schedule
Calendar grid (Mo Tu We Th Fr Sa Su × late night / workday / evening slots); legend: emulutsrebuild, emumaintenance batch, emumaintenance compact
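As a point of reference, a weekly maintenance crontab along these lines might look something like the sketch below. The day and time slots, and the assumption that the commands sit on the emu account's PATH, are illustrative only; the command names (emulutsrebuild, emumaintenance batch, emumaintenance compact) are the ones named on the slide.

    # Illustrative default weekly EMu maintenance crontab -- days and times are assumptions
    # min hour dom mon dow  command
      0   22   *   *   2    emumaintenance batch      # Tuesday late night: batch maintenance
      0   22   *   *   4    emumaintenance batch      # Thursday late night: batch maintenance
      0   23   *   *   6    emumaintenance compact    # Saturday late night: compact the databases
      0   20   *   *   0    emulutsrebuild            # Sunday evening: rebuild lookup lists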
Three Fabulously Easy Steps !
• 1. The Legacy Data Burnoff ( best quick loss plan ever ! )
• 2. The Darwin Core Binge & Purge ( eat the big enchilada and still end up thin ! )
• 3. The Validation Code SlimDing ( your Texpress metabolism is your friend ! )
1. The Legacy Data Burnoff
Anatomy of the ecatalogue database
File Name Function
~/emu/data/ecatalogue/data the actual data
~/emu/data/ecatalogue/rec indexing (part)
~/emu/data/ecatalogue/seg indexing (part)
The combined size of these was 10.4 GB: about 4 GB in data and roughly 3 GB in each of rec and seg
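A quick way to see where that bulk sits is to look at the three files directly. A minimal sketch, assuming the ~/emu/data layout listed above and run as the emu user:

    # Sizes of the ecatalogue table and its two index files
    ls -lh ~/emu/data/ecatalogue/data ~/emu/data/ecatalogue/rec ~/emu/data/ecatalogue/seg
    # Per-module totals across the whole data tree
    du -sh ~/emu/data/*/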
(980 MB in Argus vs. 10,400 MB in EMu)
The ecatalogue database was a rate limiter
typical EMu data directory: 23 files, 2 subdirs
Closer Assessment of Legacy Data
In 2005 we initially adopted many of the existing data-element formats from the USNM’s EMu client, to allow rapid development of the Peabody’s modules by KE prior to migration; the Legacy Data fields were among them
sites – round 2: constant data, lengthy prefixes
sites – round 2: data of temporary use in migration
Chart: catalogue – round 2 (data, rec, seg file sizes)
How did we do the Legacy Data Burnoff in 2005 ?
Repetitive scripting of texexport & texload jobs
Conducted around a million updates of records
Manually adjusted nightly cron jobs to accommodate
Did the work at night over a six-month period
Watched the process closely to keep from filling server disks
Charts: ecatalogue file sizes (data, rec, seg) across successive crunches
Crunch 2: delete nulls from AdmOriginalData
Crunch 3: also shorten labels on AdmOriginalData
Crunch 4: also delete prefixes on AdmOriginalData
Wow ! 55 % reduction !
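The slides don't show the crunch scripts themselves; the sketch below is only a rough illustration of the export / transform / reload cycle they describe. The texexport and texload invocations are left as comments because their options are site-specific, and the dump format, label name, and prefix shown are hypothetical stand-ins, not the Peabody's actual data.

    # 1. Dump the catalogue to a flat working file (exact texexport options are site-specific)
    #    texexport ... ecatalogue > ecat_dump.txt
    # 2. Crunch the AdmOriginalData blob:
    #    - drop label lines whose value is empty ("nulls")
    #    - shorten a verbose legacy label (hypothetical label name)
    #    - strip a constant locality prefix (hypothetical prefix)
    sed -e '/^[A-Za-z ]*: *$/d' \
        -e 's/^Original Collecting Event Remarks:/CollEvt:/' \
        -e 's/U\.S\.A\., Connecticut, //' \
        ecat_dump.txt > ecat_crunched.txt
    # 3. Reload the crunched records and let the nightly maintenance reclaim the space
    #    texload ... ecatalogue < ecat_crunched.txt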
2. The Darwin Core Binge & Purge
Charles Darwin, 1809-1882
Natural History Metadata Standard
“ DwC ”
Affords interoperability of different database systems
Widely used in collaborative informatics initiatives
Circa 40-50 fields depending on particular version
Directly analogous to the Dublin Core standard
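As a concrete illustration (not drawn from the slides), a handful of commonly used DwC field names in a toy tab-delimited record might look like this; the values are invented and the fields shown are only a small slice of the full term set:

    # A toy Darwin Core style record -- the field names are real DwC terms, the values are made up
    printf 'InstitutionCode\tCollectionCode\tCatalogNumber\tScientificName\tCountry\tLocality\n'
    printf 'YPM\tIZ\t012345\tHomarus americanus\tUnited States\tLong Island Sound\n'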
Populate DwC fields at 3.2.02 upgrade in 2006… so what ?
IZ Department: total characters existing data 43,941,006
IZ Department: est. new DwC characters 20,000,000
IZ Department: est. expansion factor 45 %
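That 45 % figure is simply the estimated new DwC characters taken as a fraction of the existing data; a quick check:

    # Estimated relative growth from populating the DwC fields
    echo 'scale=3; 20000000 / 43941006' | bc    # prints .455, i.e. roughly 45 %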
We’re about to gain back most of the pounds we just lost in the Legacy Data Burnoff !
Charts: catalogue – round 2 (data, rec, seg file sizes) before and after actions in ecollectionevents, eparties, and ecatalogue
ExtendedData vs. SummaryData
The ExtendedData field is a full duplication of the IRN + SummaryData fields… delete the ExtendedData field and use SummaryData when in “thumbnail mode” on records
Populate DwC fields at 3.2.02 upgrade… so what ?
IZ Department: total characters existing data 43,707,277
IZ Department: total new DwC characters 22,358,461
IZ Department: actual expansion factor −0.1 %
Some pain, but NO weight gain !
3. The Validation Code SlimDing
We’ve taken off the easiest pounds… any other fields to trim ? Some sneakily subversive Texpress tricks
Can history of query behavior by users help identify some EMu soft spots ?
If so, can we slip EMu a “dynamic diet pill” into its computer code ?
texadmin
EMu actions in the background you don’t see:
…you make certain common types of changes to any record in any EMu module
…and automatic changes then propagate via “emuload” to numerous records in linked modules
…those linked modules can grow a lot and slow EMu significantly between maintenance runs
Why not harness EMu’s continuously ravenous appetite for pushing local copies of linked fields into remote modules… and put it to work slimming for us !
Need to first understand how different EMu queries work
Drag and Drop Query: checks the link field
Straight Text Entry Query: instead checks a local copy of the SummaryData from the linked record that has been inserted into the catalogue
EMu’s audit log - gigantic activity trail
How often do users employ these two very different query strategies, on what fields, and are there distinctly divergent patterns ?
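The transcript doesn't show how the audit trail was actually tallied. One rough way to get a first answer, assuming the relevant audit records have been exported to a flat text file with one query description per line (a hypothetical format, not EMu's native audit schema), might be:

    # queries.txt: hypothetical one-line-per-query export of a week of catalogue audit records
    grep -c 'irn='        queries.txt   # hypothetical marker for drag & drop (link field) queries
    grep -c 'SummaryData' queries.txt   # hypothetical marker for straight text-entry queries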
catalogue audit
In this one-week sample, only 7 of 52 queries for accessions from inside the catalogue module used text queries; the other 45 were drag & drops
Of those 7 text queries, every one asked for a primary id number for the accession, or the numeric piece of that number, but not for any other type of data from within those accessions
Over a full year of catalogue audit data, less than 1 % of all the queries into accessions used anything other than the primary id of the accession record as the keyword(s).
This is where we gain our SlimDing advantage !
We don’t need more than the primary id of the accession record in the local copy of the accession module data stored in the catalogue module.
This pattern also held true for queries launched from the catalogue against the bibliography and loans modules !
Charts: Catalogue Database size over time
Catalogue module lost another 19 % of its bulk over a couple of months !
Charts: Internal Movements Database size over time
Internal movements dropped from 550 MB down to 200 MB… a 65 % reduction !
Default EMu “cron” maintenance job schedule, revisited
Calendar grid (Mo Tu We Th Fr Sa Su × late night / workday / evening slots); legend: emulutsrebuild, emumaintenance batch, emumaintenance compact; asterisks mark the revised slots, and a quick backup has been added
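The exact changes the Peabody made to its schedule aren't legible in the transcript. Purely as an illustration, a lightened week with a quick backup slotted in ahead of the weekly compact might look like the sketch below; the days, times, backup command, and /home/emu path are all assumptions.

    # Illustrative revised crontab: fewer heavy passes, plus a quick backup before the compact
    # min hour dom mon dow  command
      0   21   *   *   6    tar czf /backup/emu-data-$(date +\%Y\%m\%d).tar.gz -C /home/emu data   # quick backup of the data tree
      0   23   *   *   6    emumaintenance compact    # weekly compact, after the backup
      0   22   *   *   3    emumaintenance batch      # single midweek batch maintenance
      0   20   *   *   0    emulutsrebuild            # Sunday evening lookup-list rebuild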
A Happy EMu Means Happy Campers
finis