kathryn lybarger @zemkat eluna 2014 (montreal) #eluna2014 may 1, 2014

42
Mining the Blob: There's Gold in the Directory! Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Upload: august-greer

Post on 31-Dec-2015

219 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Mining the Blob: There's Gold in the Directory!

Kathryn Lybarger @zemkatELUNA 2014 (Montreal) #ELUNA2014May 1, 2014

Page 2: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

MARC

Page 3: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014
Page 4: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Patron searching

Page 5: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Staff searching

Page 6: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Using what?

Though the catalog:was created from raw MARCactually contains MARC still

We are using indexesHigh-speed access to common elementsRelationship between elementsSome elements grouped into one indexSome elements modified to be more useful

Page 7: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014
Page 8: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014
Page 9: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Modified fields : call number

MFHD_MASTER.DISPLAY_CALL_NOCall number as it would displayPS7 .A6 1937PS1000.A8 G53 1856

MFHD_MASTER.NORMALIZED_CALL_NONormalized spacing so that things sort properlyPS 7 A 6 1937PS 1000 A 8 G 53 1856

Page 10: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Grouped elements

Many indexed fields (including all 6XX) are indexed together in BIB_INDEXYou can distinguish between them using the INDEX_CODE

Page 11: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Functions

String functionsMID – extract only part of a field

MID(FIELD, start, length)LEFT, RIGHT – extract left or right side

Aggregate functionsMIN, MAX – minimum or maximumCOUNT – how many match?

Page 12: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

A common conversation…

How do I get this data?Just use the common backbone.

How about this other data?Link in these more obscure tables.

And this data?Use functions to extract and group.

And what about this other thing?Oh… you’ll have to use the BLOB. Sorry.

Page 13: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014
Page 14: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

The BLOB

Binary Large ObjectA lump of binary data stored as a single entity in a databaseNot indexed into its individual meaningful partsOften, slower than pre-defined indexes

In VoyagerBIB_DATA (and friends)

BIB_ID – record IDSEGMENT – actual binary data (990 bytes)SEQNUM – which segment (record may be longer)

Page 15: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

BLOB functions

FunctionsGetAuthBlob(AUTH_ID)GetBibBlob(BIB_ID)GetMFHDBlob(MFHD_ID)

Examples:GetField(GetBibBlob([BIB_TEXT].[BIB_ID]),’6’,1)GetSubField(GetFieldRaw(GetBibBlob([BIB_TEXT].[BIB_ID]),’650’,1),’x’,2)

Page 16: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

MARC (binary) structure

“Binary” is a bit misleadingMostly readable charactersSome control characters

Subfield delimiter, end-of-field, end-of-record

MARC record structureLeaderDirectory

An index to the variable fieldsVariable fields

Page 17: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

MARC in a text editor

Page 18: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

MARC leader

First 24 characters of the MARC recordSome of these will always be the same

10 – indicator count (always 2 – ind1, ind2)11 – subfield code length (always 2, $b)

Some vary with the record00-04 – record length12-16 – base address of data

Page 19: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

A few questions

What is the shortest record in the catalog?

What is the longest record in the catalog?

(are these records any good?)

Page 20: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Shortest record (SQL)

MID(BIB_DATA.RECORD_SEGMENT, 1, 5)First five characters of the segmentWe want the smallest one of theseOnly check the first segment of a record

SELECT MIN(MID(RECORD_SEGMENT, 1, 5)) FROM BIB_DATA WHERE SEQNUM=‘1’;

This only gives the lengthWe want the actual shortest record(s)

Page 21: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Shortest record

SELECT BIB_ID FROM BIB_DATA WHERE SUBSTR(RECORD_SEGMENT,1,5) = (SELECT MIN(SUBSTR(RECORD_SEGMENT,1,5)) FROM BIB_DATA WHERE SEQNUM='1');

Page 22: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

GitHub

To avoid filling slides with SQL queries, I’ve made them available in a GitHub repository

I encourage you to try my queries in Access, and let me know what you find in your catalog!

http://github.com/zemkat/Voyager/queries

Page 23: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Our smallest unsuppressed record

Page 24: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

More small unsuppressed…

Page 25: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Largest unsuppressed record?

416 links – yikes!

Page 26: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

MARC directory

Occurs right after the leaderFrom byte 24 to (base address of data) -1

Each variable field has a triplet (12 bytes) in the directory

(three bytes) – tag(next four bytes) – length of field(next five bytes) – starting position

Variable fields have indicators, subfield codes, subfield data

Page 27: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

How many fields?

Directory length =base data address- 24 (for leader)- 1 (field terminator for leader)- 1 (field terminator for directory)+ 1 (since we start counting at zero)

Directory length / 12 = number of variable fields

A record with this leader has 22 variable fields:01012nam a22002897 4500

Page 28: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

A few questions

Which record has the most fields?

Which record has the fewest fields?

(are these records any good?)

Page 29: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014
Page 30: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

All large or small?

Page 31: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Size outliers

Too many fieldsLinks to every issue of a serialISBNs for every volumeTitles of every Slovak folk song ever

Too few fieldsProvisional order records that never got overlaidNon-Roman alphabet “[ARABIC CHARACTERS]”Mystery from past catalog

Page 32: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Too many problems!

What are the smallest records with call numbers in their holdings?

What are the smallest records with OCLC numbers?

Smallest records by location?

Page 33: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Tag Report

Which fields are we using? (how many records does each appear in?)Are we using any that we shouldn’t?

What percentage of our records does each field appear in?

SoftwareRuns on linux, outputs Excel filesOpen source, available on github

Page 34: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014
Page 35: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Missing or repeating fields

The Voyager tag table specifies:What is mandatoryWhat may repeat

Tag table can be bypassed:Preference in Cataloging moduleBulk import

How to find missing mandatory fields? (GDC!)Repeating fields that shouldn’t? (GDC!)

Page 36: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Triple Nickel

How are we using this tag in Voyager?

Are we using it consistently?

How would this look in our OPAC display?

SoftwareRuns on linux, outputs Excel filesTwo sheets: raw, groupedOpen source, available on github

Page 37: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

506 report

Grouped to show common usage:

Page 38: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Sorted to show variance

Restrict to subscribers.Restricted subscribers.Restricted to subscribers.Restricted to scubscribers.Restricted to subesribers.Restricted to subsacribers.Restricted to subscibers.Restricted to subscribres.

Restricted to subscribrs.Restricted to subsribers.Restricted to suscribers.Restricted to users.Restrictged to subscribers.Restrictricted to subscribersRestrocted to subscribers.Restrticted to subscribers.

Page 39: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Record growth

OCLC Bibliographic Notification(now in Worldshare Metadata Collection Manager)Informs you which of your records have changed in OCLCDownload those records

ChallengesAvoid overriding local improvementsToo many records change to look at them all!How much have they changed (for us) ?

Page 40: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Record growth

Focus on:Which records have grown the most?(Byte size or number of fields)

Compare size of new record against equivalent in Voyager

Sort this list to only look at large improvements

(software unreleased)

Page 41: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

EigenRecords

What do our MARC records look like, on average?

What types do we have? How do they vary?

Investigate using principal component analysis of directory data

(in progress)

Page 42: Kathryn Lybarger @zemkat ELUNA 2014 (Montreal) #ELUNA2014 May 1, 2014

Any questions?

Kathryn Lybarger @[email protected]

GitHub:http://github.com/zemkat/Voyager/

Blogs:http://pc.blog.zemows.orghttp://problem-cataloger.tumblr.comhttp://library-computer.tumblr.com