TRANSCRIPT
Toro 1
EMu on a Diet
Yale campus
Peabody Collections: Counts & Functional Cataloguing Unit
• Anthropology 325,000 Lot
• Botany 350,000 Individual
• Entomology 1,000,000 Lot
• Invertebrate Paleontology 300,000 Lot
• Invertebrate Zoology 300,000 Lot
• Mineralogy 35,000 Individual
• Paleobotany 150,000 Individual
• Scientific Instruments 2,000 Individual
• Vertebrate Paleontology 125,000 Individual
• Vertebrate Zoology 185,000 Lot / Individual
2.7 million database-able units => ~11 million items
Peabody Collections: Functional Units Databased
• Anthropology 325,000 90 %
• Botany 350,000 1 %
• Entomology 1,000,000 3 %
• Invertebrate Paleontology 300,000 60 %
• Invertebrate Zoology 300,000 25 %
• Mineralogy 35,000 85 %
• Paleobotany 150,000 60 %
• Scientific Instruments 2,000 100 %
• Vertebrate Paleontology 125,000 60 %
• Vertebrate Zoology 185,000 95 %
990,000 of 2.7 million => 37 % overall
The four YPM buildings
Peabody (YPM)
Environmental Science Center (ESC)
Geology / Geophysics (KGL)
175 Whitney (Anthropology)
VZ: Kristof Zyskowski (Vert. Zool. – ESC)
VZ: Greg Watkins-Colwell (Vert. Zool. – ESC)
HSI: Shae Trewin (Scientific Instruments – KGL)
VP: Mary Ann Turner (Vert. Paleo. – KGL / YPM)
ANT: Maureen DaRos (Anthro. – YPM / 175 Whitney)
Chart: % Databased vs. Collection Size (in 1000s of items)
Chart: % Databased vs. Collection Size (in 1000s of items), highlighting Botany, Entomology, Invertebrate Paleontology, and Invertebrate Zoology
Peabody Collections: Approximate Digital Timeline
• 1991 Systems Office created & staffed
• 1992 Argus collections databasing initiative started
• 1994 Gopher services launched for collections data
• 1997 Gopher mothballed, Web / HTTP services launched
• 1998 Physical move of many collections “begins”
• 2002 Physical move of many collections “ends”
• 2003 Search for Argus successor commences
• 2003 Informatics Office created & staffed
• 2004 KE EMu to succeed Argus, data migration begins
• 2005 Argus data migration ends, go-live in KE EMu
Big events
Physical move in ’98–’02 (primarily neontological disciplines)
EMu migration in ’05 (all disciplines went live simultaneously)
What do you do …
… when your EMu is out of shape & sluggish ?
The Peabody Museum Presents
What clued us in that we should put our EMu on a diet ?
Area of Server Occupied by Catalogue
980 megabytes in Argus
10,400 megabytes in EMu
Default EMu “cron” maintenance job schedule
Calendar grid (Mo Tu We Th Fr Sa Su × late night / workday / evening slots); legend: emulutsrebuild, emumaintenance batch, emumaintenance compact
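As a point of reference, a weekly maintenance crontab along these lines might look something like the sketch below. The day and time slots, and the assumption that the commands sit on the emu account's PATH, are illustrative only; the command names (emulutsrebuild, emumaintenance batch, emumaintenance compact) are the ones named on the slide.

    # Illustrative default weekly EMu maintenance crontab -- days and times are assumptions
    # min hour dom mon dow  command
      0   22   *   *   2    emumaintenance batch      # Tuesday late night: batch maintenance
      0   22   *   *   4    emumaintenance batch      # Thursday late night: batch maintenance
      0   23   *   *   6    emumaintenance compact    # Saturday late night: compact the databases
      0   20   *   *   0    emulutsrebuild            # Sunday evening: rebuild lookup lists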
Three Fabulously Easy Steps !
• 1. The Legacy Data Burnoff ( best quick loss plan ever ! )
• 2. The Darwin Core Binge & Purge ( eat the big enchilada and still end up thin ! )
• 3. The Validation Code SlimDing ( your Texpress metabolism is your friend ! )
1. The Legacy Data Burnoff
Anatomy of the ecatalogue database
File Name Function
~/emu/data/ecatalogue/data the actual data
~/emu/data/ecatalogue/rec indexing (part)
~/emu/data/ecatalogue/seg indexing (part)
The combined size of these was 10.4 GB: about 4 GB in data and roughly 3 GB in each of rec and seg
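A quick way to see where that bulk sits is to look at the three files directly. A minimal sketch, assuming the ~/emu/data layout listed above and run as the emu user:

    # Sizes of the ecatalogue table and its two index files
    ls -lh ~/emu/data/ecatalogue/data ~/emu/data/ecatalogue/rec ~/emu/data/ecatalogue/seg
    # Per-module totals across the whole data tree
    du -sh ~/emu/data/*/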
(980 MB in Argus vs. 10,400 MB in EMu)
The ecatalogue database was a rate limiter
typical EMu data directory: 23 files, 2 subdirs
Closer Assessment of Legacy Data
In 2005 we initially adopted many of the existing data-element formats from the USNM’s EMu client, to allow rapid development of the Peabody’s modules by KE prior to migration; the Legacy Data fields were among them
sites – round 2: constant data, lengthy prefixes
sites – round 2: data of temporary use in migration
Chart: catalogue – round 2 (data, rec, seg file sizes)
How did we do the Legacy Data Burnoff in 2005 ?
Repetitive scripting of texexport & texload jobs
Conducted around a million updates of records
Manually adjusted nightly cron jobs to accommodate
Did the work at night over a six-month period
Watched the process closely to keep from filling server disks
Charts: ecatalogue file sizes (data, rec, seg) across successive crunches
Crunch 2: delete nulls from AdmOriginalData
Crunch 3: also shorten labels on AdmOriginalData
Crunch 4: also delete prefixes on AdmOriginalData
Wow ! 55 % reduction !
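The slides don't show the crunch scripts themselves; the sketch below is only a rough illustration of the export / transform / reload cycle they describe. The texexport and texload invocations are left as comments because their options are site-specific, and the dump format, label name, and prefix shown are hypothetical stand-ins, not the Peabody's actual data.

    # 1. Dump the catalogue to a flat working file (exact texexport options are site-specific)
    #    texexport ... ecatalogue > ecat_dump.txt
    # 2. Crunch the AdmOriginalData blob:
    #    - drop label lines whose value is empty ("nulls")
    #    - shorten a verbose legacy label (hypothetical label name)
    #    - strip a constant locality prefix (hypothetical prefix)
    sed -e '/^[A-Za-z ]*: *$/d' \
        -e 's/^Original Collecting Event Remarks:/CollEvt:/' \
        -e 's/U\.S\.A\., Connecticut, //' \
        ecat_dump.txt > ecat_crunched.txt
    # 3. Reload the crunched records and let the nightly maintenance reclaim the space
    #    texload ... ecatalogue < ecat_crunched.txt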
2. The Darwin Core Binge & Purge
Charles Darwin, 1809-1882
Natural History Metadata Standard
“ DwC ”
Affords interoperability of different database systems
Widely used in collaborative informatics initiatives
Circa 40-50 fields depending on particular version
Directly analogous to the Dublin Core standard
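As a concrete illustration (not drawn from the slides), a handful of commonly used DwC field names in a toy tab-delimited record might look like this; the values are invented and the fields shown are only a small slice of the full term set:

    # A toy Darwin Core style record -- the field names are real DwC terms, the values are made up
    printf 'InstitutionCode\tCollectionCode\tCatalogNumber\tScientificName\tCountry\tLocality\n'
    printf 'YPM\tIZ\t012345\tHomarus americanus\tUnited States\tLong Island Sound\n'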
Populate DwC fields at 3.2.02 upgrade in 2006… so what ?
IZ Department: total characters existing data 43,941,006
IZ Department: est. new DwC characters 20,000,000
IZ Department: est. expansion factor 45 %
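That 45 % figure is simply the estimated new DwC characters taken as a fraction of the existing data; a quick check:

    # Estimated relative growth from populating the DwC fields
    echo 'scale=3; 20000000 / 43941006' | bc    # prints .455, i.e. roughly 45 %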
We’re about to gain back most of the pounds we just lost in the Legacy Data Burnoff !
Charts: catalogue – round 2 (data, rec, seg file sizes) before and after actions in ecollectionevents, eparties, and ecatalogue
ExtendedData vs. SummaryData
The ExtendedData field is a full duplication of the IRN + SummaryData fields… delete the ExtendedData field and use SummaryData when in “thumbnail mode” on records
Populate DwC fields at 3.2.02 upgrade… so what ?
IZ Department: total characters existing data 43,707,277
IZ Department: total new DwC characters 22,358,461
IZ Department: actual expansion factor −0.1 %
Some pain, but NO weight gain !
3. The Validation Code SlimDing
We’ve taken off the easiest pounds… any other fields to trim ? Some sneakily subversive Texpress tricks
Can history of query behavior by users help identify some EMu soft spots ?
If so, can we slip EMu a “dynamic diet pill” into its computer code ?
texadmin
EMu actions in the background you don’t see:
…you make certain common types of changes to any record in any EMu module
…and automatic changes then propagate via “emuload” to numerous records in linked modules
…those linked modules can grow a lot and slow EMu significantly between maintenance runs
Why not harness EMu’s continuously ravenous appetite for pushing local copies of linked fields into remote modules… and put it to work slimming for us !
Need to first understand how different EMu queries work
Drag and Drop Query: checks the link field
Straight Text Entry Query: instead checks a local copy of the SummaryData from the linked record that has been inserted into the catalogue
EMu’s audit log - gigantic activity trail
How often do users employ these two very different query strategies, on what fields, and are there distinctly divergent patterns ?
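The transcript doesn't show how the audit trail was actually tallied. One rough way to get a first answer, assuming the relevant audit records have been exported to a flat text file with one query description per line (a hypothetical format, not EMu's native audit schema), might be:

    # queries.txt: hypothetical one-line-per-query export of a week of catalogue audit records
    grep -c 'irn='        queries.txt   # hypothetical marker for drag & drop (link field) queries
    grep -c 'SummaryData' queries.txt   # hypothetical marker for straight text-entry queries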
catalogue audit
In this one-week sample, only 7 of 52 queries for accessions from inside the catalogue module used text queries; the other 45 were drag & drops
Of those 7 text queries, every one asked for a primary id number for the accession, or the numeric piece of that number, but not for any other type of data from within those accessions
Over a full year of catalogue audit data, less than 1 % of all the queries into accessions used anything other than the primary id of the accession record as the keyword(s).
This is where we gain our SlimDing advantage !
We don’t need more than the primary id of the accession record in the local copy of the accession module data stored in the catalogue module.
This pattern also held true for queries launched from the catalogue against the bibliography and loans modules !
Charts: Catalogue Database size over time
Catalogue module lost another 19 % of its bulk over a couple of months !
Charts: Internal Movements Database size over time
Internal movements dropped from 550 MB down to 200 MB… a 65 % reduction !
Default EMu “cron” maintenance job schedule, revisited
Calendar grid (Mo Tu We Th Fr Sa Su × late night / workday / evening slots); legend: emulutsrebuild, emumaintenance batch, emumaintenance compact; asterisks mark the revised slots, and a quick backup has been added
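The exact changes the Peabody made to its schedule aren't legible in the transcript. Purely as an illustration, a lightened week with a quick backup slotted in ahead of the weekly compact might look like the sketch below; the days, times, backup command, and /home/emu path are all assumptions.

    # Illustrative revised crontab: fewer heavy passes, plus a quick backup before the compact
    # min hour dom mon dow  command
      0   21   *   *   6    tar czf /backup/emu-data-$(date +\%Y\%m\%d).tar.gz -C /home/emu data   # quick backup of the data tree
      0   23   *   *   6    emumaintenance compact    # weekly compact, after the backup
      0   22   *   *   3    emumaintenance batch      # single midweek batch maintenance
      0   20   *   *   0    emulutsrebuild            # Sunday evening lookup-list rebuild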
A Happy EMu Means Happy Campers
finis