Mining the Blob: There's Gold in the Directory!
Kathryn Lybarger @zemkat
EBUG 2014 (Morehead, Ky.) #EBUG2014
June 20, 2014

MARC

Patron searching

Staff searching

Searching what?

- Though the catalog:
  - was created from raw MARC
  - actually contains that MARC still
- When we search, we are searching indexes
  - High-speed access to common elements
  - Relationship between elements
  - Some elements grouped into one index
  - Some elements modified to be more useful

Modified fields: call number

- MFHD_MASTER.DISPLAY_CALL_NO
  - Call number as it would display
  - PS7 .A6 1937
  - PS1000.A8 G53 1856
- MFHD_MASTER.NORMALIZED_CALL_NO
  - Normalized spacing so that things sort properly
  - PS 7 A 6 1937
  - PS 1000 A 8 G 53 1856
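The relationship between the two columns can be illustrated with a rough Python sketch. This is not Voyager's actual normalization routine, which the talk does not describe; `normalize_call_no` is a hypothetical helper that only splits a call number into alternating runs of letters and digits, which happens to reproduce the two examples above as transcribed.

```python
import re

def normalize_call_no(display: str) -> str:
    """Split a call number into letter runs and digit runs joined by single spaces.

    Hypothetical illustration only: punctuation such as '.' is simply dropped,
    so decimal class numbers would not be handled correctly.
    """
    return " ".join(re.findall(r"[A-Za-z]+|[0-9]+", display))

examples = ["PS7 .A6 1937", "PS1000.A8 G53 1856"]
normalized = [normalize_call_no(c) for c in examples]
```

A real normalizer would presumably also pad the class number to a fixed width, since "PS 1000" sorts before "PS 7" as plain text. Any such padding is not visible in the examples as transcribed, so it is left out here.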

Grouped elements

- Many indexed fields (including all 6XX) are indexed together in BIB_INDEX
- You can distinguish between them using the INDEX_CODE

Functions

- String functions
  - MID – extract only part of a field: MID(FIELD, start, length)
  - LEFT, RIGHT – extract left or right side
- Aggregate functions
  - MIN, MAX – minimum or maximum
  - COUNT – how many match?

A common conversation…

- How do I get this data?
  - Just use the common backbone.
- How about this other data?
  - Link in these more obscure tables.
- And this data?
  - Use functions to extract and group.
- And what about this other thing?
  - Oh… you’ll have to use the BLOB. Sorry.

The BLOB

- Binary Large Object
  - A lump of binary data stored as a single entity in a database
  - Not indexed into its individual meaningful parts
  - Much slower than pre-defined indexes
- In Voyager: BIB_DATA
  - BIB_ID – record ID
  - RECORD_SEGMENT – the actual binary data (990 bytes)
  - SEQNUM – which segment (a record may be longer than one segment)
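Because a record can span several RECORD_SEGMENT rows, the whole record has to be put back together in SEQNUM order before it can be read. A minimal Python sketch, assuming the rows for one BIB_ID have already been fetched as (SEQNUM, RECORD_SEGMENT) pairs; `assemble_record` and the sample segments are hypothetical, and plain strings stand in for the binary data.

```python
def assemble_record(rows):
    """Join the segments of a single record in SEQNUM order.

    rows: iterable of (seqnum, segment) pairs, all for the same BIB_ID.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    return "".join(segment for _, segment in ordered)

# Hypothetical segments for one record, deliberately out of order.
rows = [(2, "second part"), (1, "first part, "), (3, ", third part")]
record = assemble_record(rows)
```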

BLOB functions

- Functions
  - GetAuthBlob(AUTH_ID)
  - GetBibBlob(BIB_ID)
  - GetMFHDBlob(MFHD_ID)
- Examples:
  - GetField(GetBibBlob([BIB_TEXT].[BIB_ID]),'6',1)
  - GetSubField(GetFieldRaw(GetBibBlob([BIB_TEXT].[BIB_ID]),'650',1),'x',2)

Binary structure

- “Binary” is a bit misleading
  - Mostly readable characters
  - Some control characters: subfield delimiter, end-of-field, end-of-record
- MARC structure
  - Leader
  - Directory: an index to the variable fields
  - Variable fields
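The three control characters can be seen at work in a small Python sketch. The byte values (0x1F, 0x1E, 0x1D) are the ones defined for MARC 21 record structure; the 245 field content and the helper name `split_subfields` are made up for illustration.

```python
SUBFIELD_DELIMITER = "\x1f"  # introduces each subfield code
FIELD_TERMINATOR = "\x1e"    # ends the directory and each variable field
RECORD_TERMINATOR = "\x1d"   # ends the whole record

def split_subfields(field):
    """Return (indicators, [(code, value), ...]) for one variable data field.

    Expects the field without its trailing field terminator.
    """
    indicators, *chunks = field.split(SUBFIELD_DELIMITER)
    return indicators, [(chunk[0], chunk[1:]) for chunk in chunks if chunk]

# Hypothetical 245 field: two indicators, then subfields $a and $b.
field = (
    "10"
    + SUBFIELD_DELIMITER + "aMining the blob :"
    + SUBFIELD_DELIMITER + "bthere's gold in the directory!"
)
indicators, subfields = split_subfields(field)
```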

MARC in a text editor

MARC leader

- First 24 characters of the MARC record
- Some of these will always be the same
  - 10 – indicator count (always 2 – ind1, ind2)
  - 11 – subfield code length (always 2, e.g. $b)
- Some vary with the record
  - 00-04 – record length
  - 12-16 – base address of data
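The leader positions listed above translate directly into string slices. A minimal Python sketch; `parse_leader` is a hypothetical helper, and the sample leader is the one quoted later in the talk, written here with two blanks at positions 18-19 on the assumption that the transcript collapsed them.

```python
def parse_leader(leader):
    """Pull the numeric parts out of a 24-character MARC leader."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 characters")
    return {
        "record_length": int(leader[0:5]),        # positions 00-04
        "indicator_count": int(leader[10]),       # position 10
        "subfield_code_length": int(leader[11]),  # position 11
        "base_address": int(leader[12:17]),       # positions 12-16
    }

# Leader from the talk; the double blank before 4500 is an assumption.
info = parse_leader("01012nam a22002897  4500")
```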

A few questions

- What is the shortest record in the catalog?
- What is the longest record in the catalog?
- (are these records any good?)

Shortest record (SQL)

- MID(BIB_DATA.RECORD_SEGMENT, 1, 5)
  - First five characters of the segment (the record length from the leader)
  - We want the smallest one of these
  - Only check the first segment of a record
- SELECT MIN(MID(RECORD_SEGMENT, 1, 5)) FROM BIB_DATA WHERE SEQNUM='1';
- This only gives the length
  - We want the actual shortest record(s)

Shortest record

SELECT BIB_ID FROM BIB_DATA WHERE SEQNUM='1' AND SUBSTR(RECORD_SEGMENT,1,5) = (SELECT MIN(SUBSTR(RECORD_SEGMENT,1,5)) FROM BIB_DATA WHERE SEQNUM='1');
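To try the query pattern without a Voyager database, it can be run against a throwaway SQLite table from Python. This is only a sketch: the table contents are invented, SQLite stands in for Oracle/Access, and the first-segment restriction is applied to both the inner and the outer query so that a continuation segment that happens to start with digits is never matched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BIB_DATA (BIB_ID INTEGER, SEQNUM INTEGER, RECORD_SEGMENT TEXT)"
)
# Invented rows: only the first five characters of each first segment matter.
conn.executemany(
    "INSERT INTO BIB_DATA VALUES (?, ?, ?)",
    [
        (1, 1, "01012nam a22..."),
        (1, 2, "00312 looks like a length but is a continuation"),
        (2, 1, "00312nam a22..."),
        (3, 1, "00312cam a22..."),
        (4, 1, "02500nam a22..."),
    ],
)

query = """
    SELECT BIB_ID FROM BIB_DATA
    WHERE SEQNUM = 1
      AND SUBSTR(RECORD_SEGMENT, 1, 5) =
          (SELECT MIN(SUBSTR(RECORD_SEGMENT, 1, 5)) FROM BIB_DATA WHERE SEQNUM = 1)
    ORDER BY BIB_ID
"""
shortest = [row[0] for row in conn.execute(query)]
```

Taking MIN of a text value works here because the record length is always five zero-padded digits, so alphabetical order and numeric order agree.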

GitHub

- To avoid filling slides with SQL queries, I’ve made them available in a GitHub repository
- I encourage you to try my queries in Access, and let me know what you find in your catalog!
- http://github.com/zemkat/Voyager/

Our smallest unsuppressed record

More small unsuppressed…

Largest unsuppressed record?

- 416 links – yikes!

MARC directory

- Occurs right after the leader
  - From byte 24 to (base address of data) - 1
- Each variable field has a triplet (12 bytes) in the directory
  - (three bytes) – tag
  - (next four bytes) – length of field
  - (next five bytes) – starting position (relative to the base address of data)
- Variable fields have indicators, subfield codes, field data
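The directory layout above is enough to locate any field without scanning the whole record. The following Python sketch builds a tiny two-field record and then reads it back through its directory. `build_record`, `parse_directory`, and `get_field` are hypothetical helpers written for this illustration, the field contents are invented, and character offsets stand in for byte offsets (the two only agree for plain ASCII data).

```python
FIELD_TERMINATOR = "\x1e"
RECORD_TERMINATOR = "\x1d"

def build_record(fields):
    """Build a minimal MARC-like record from (tag, content) pairs (ASCII only)."""
    directory, data = "", ""
    for tag, content in fields:
        body = content + FIELD_TERMINATOR
        # Directory entry: 3-char tag, 4-digit length, 5-digit starting position.
        directory += f"{tag}{len(body):04d}{len(data):05d}"
        data += body
    base = 24 + len(directory) + 1   # leader + directory + its terminator
    total = base + len(data) + 1     # plus the record terminator
    leader = f"{total:05d}nam a22{base:05d}7  4500"
    return leader + directory + FIELD_TERMINATOR + data + RECORD_TERMINATOR

def parse_directory(record):
    """Return a list of (tag, length, start) triplets from the directory."""
    base = int(record[12:17])
    directory = record[24:base - 1]
    return [
        (directory[i:i + 3], int(directory[i + 3:i + 7]), int(directory[i + 7:i + 12]))
        for i in range(0, len(directory), 12)
    ]

def get_field(record, tag):
    """Return the first field with this tag (without its terminator), or None."""
    base = int(record[12:17])
    for entry_tag, length, start in parse_directory(record):
        if entry_tag == tag:
            return record[base + start:base + start + length - 1]
    return None

record = build_record([("001", "12345"), ("245", "10\x1faMining the blob")])
entries = parse_directory(record)
title_field = get_field(record, "245")
```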

How many fields?

- Directory length = (base address of data) - 24 (for the leader) - 1 (for the field terminator that ends the directory)
  - The leader itself has no field terminator, and because positions are counted from zero, the base address already equals the number of bytes before the data
- Directory length / 12 = number of variable fields
- A record with this leader has 22 variable fields: 01012nam a22002897 4500
  - Base address of data is 00289, so (289 - 24 - 1) / 12 = 264 / 12 = 22
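The field count can be checked in a few lines of Python. `count_variable_fields` is a hypothetical helper; it only reads leader positions 12-16, so the leader can be used exactly as it appears above.

```python
def count_variable_fields(leader):
    """Number of directory entries implied by the base address in the leader."""
    base_address = int(leader[12:17])
    # 24 bytes of leader plus the one-byte terminator that ends the directory.
    directory_length = base_address - 24 - 1
    if directory_length < 0 or directory_length % 12:
        raise ValueError("base address does not match a whole number of entries")
    return directory_length // 12

field_count = count_variable_fields("01012nam a22002897 4500")
```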

A few questions

- Which record has the most fields?
- Which record has the fewest fields?
- (are these records any good?)

Size outliers

- Too many fields
  - Links to every issue of a serial
  - ISBNs for every volume
  - Titles of every Slovak folk song ever
- Too few fields
  - Provisional order records that never got overlaid
  - Non-Roman alphabet “[ARABIC CHARACTERS]”
  - Mystery from past catalog

Other tables to join in…

- What are the smallest records with call numbers in their holdings?
- What are the smallest records with OCLC numbers?
- Smallest records by location?

Tag Report

- Which fields are we using? (how many records does each appear in?)
  - Are we using any that we shouldn’t?
- What percentage of our records does each field appear in?
- Software
  - Runs on Linux, outputs Excel files
  - Open source, available on GitHub
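The counting behind such a report is simple once each record's tags have been read from its directory. The Python sketch below is not the Tag Report software itself; `tag_report` is a hypothetical function and the sample tag lists are invented.

```python
from collections import Counter

def tag_report(records):
    """Map each tag to (number of records containing it, percentage of records).

    records: list of tag lists, one list per record.
    """
    total = len(records)
    if total == 0:
        return {}
    counts = Counter()
    for tags in records:
        counts.update(set(tags))  # count a record once even if the tag repeats
    return {tag: (n, 100.0 * n / total) for tag, n in sorted(counts.items())}

# Invented sample: four records and the tags in each directory.
records = [
    ["001", "245", "650", "650"],
    ["001", "245"],
    ["001", "245", "856"],
    ["001", "100", "245"],
]
report = tag_report(records)
```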

Missing or repeating

- The Voyager tag table specifies:
  - What is mandatory
  - What may repeat
- Tag table can be bypassed:
  - Preference in Cataloging module
  - Bulk import
- How to find missing mandatory fields?
  - Repeating fields that shouldn’t?
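One way to answer both questions is to compare each record's directory tags against a small rule set. In this Python sketch the two rule sets are hypothetical stand-ins for a real tag table, and `check_tags` is an illustrative name, not part of any Voyager tool.

```python
from collections import Counter

# Hypothetical rules standing in for the tag table.
MANDATORY = {"001", "245"}
NON_REPEATABLE = {"001", "100", "245"}

def check_tags(tags):
    """Return (missing mandatory tags, non-repeatable tags that repeat)."""
    counts = Counter(tags)
    missing = sorted(MANDATORY - set(counts))
    repeated = sorted(tag for tag in NON_REPEATABLE if counts[tag] > 1)
    return missing, repeated

doubled_title = check_tags(["001", "245", "245", "650"])
no_title = check_tags(["100", "650"])
```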

Triple Nickel – all instances of a field

- How are we using this tag in Voyager?
- Are we using it consistently? (grouping)
- Are we using it enough that we should add it to our OPAC display?
- Software
  - Runs on Linux, outputs Excel files
  - Open source, available on GitHub

Record growth

- OCLC Bibliographic Notification (now in WorldShare Metadata Collection Manager)
  - Informs you which of your records have changed in OCLC
  - Download those records
- Problems
  - Avoid overriding local improvements
  - Too many records change to look at them all!
  - How much have they changed (for us)?

Record growth

- Focus on: which records have grown the most? (byte size or number of fields)
- Compare size of new record against equivalent in Voyager
- Sort this list to only look at large improvements
- (software unreleased)
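Since the software is unreleased, the following is only a guess at the shape of the comparison, not the author's implementation. It assumes sizes (bytes or field counts) have already been collected into two dictionaries keyed by a shared record identifier; `growth_report`, the identifiers, and the numbers are all hypothetical.

```python
def growth_report(current_sizes, new_sizes, min_growth=0):
    """List (record_id, old, new, growth) for records that grew, largest first.

    Records that appear only in new_sizes are skipped, since there is
    nothing in the catalog to compare them against.
    """
    rows = []
    for record_id, new in new_sizes.items():
        old = current_sizes.get(record_id)
        if old is None:
            continue
        growth = new - old
        if growth >= min_growth:
            rows.append((record_id, old, new, growth))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Invented sizes in bytes: the catalog copy versus the newly downloaded copy.
in_voyager = {"rec1": 800, "rec2": 1200, "rec3": 450}
from_oclc = {"rec1": 2100, "rec2": 1210, "rec3": 440, "rec4": 999}
big_changes = growth_report(in_voyager, from_oclc, min_growth=100)
```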

EigenRecords

- What do our MARC records look like, on average?
- What types do we have? How do they vary?
- Investigate using principal component analysis of directory data
- (in progress)

Any questions?

- Kathryn Lybarger, [email protected]
- Twitter: @zemkat
- Blogs:
  - http://pc.blog.zemows.org
  - http://problem-cataloger.tumblr.com