data clean-up: is there a better way?
DESCRIPTION
Moving data about library resources among systems often engenders data cleanup processes. What is the best way to clean up data? Which tools and skills for non-programmers can help? See how University of California, Riverside Libraries tackle this issue, then share tips and techniques in an open forum.TRANSCRIPT
Data Clean-Up:
Is there a Better Way?
Margaret Hogarth
ER&L, 2/2/2010© Camille Pissaro, 1882, “Peasant Woman Digging,” ARTstor 40-06-12/42 1
To Be Covered:
• Commonalities• Issues• Deficits• Excel• Access• MarcEdit• Global Update• Data Quality Policy• Suggestions?
© The Metropolitan Museum of Art, “Book of the Knowledge of Ingenious Mechanical Devices by al-Jazari,” ARTstor
2
Commonalities
• Data sources• Tools• Data clean-up• Capture and use
issues• Needed technical
skills
bonobokids.com
3
General Approach
• Keep original data original; copy to a new worksheet
• Save frequently• Use meaningful
file/worksheet names• Folder of “final”
documents© John Vink, 2005, "CAMBODIA," ARTstor PAR287761
4
Issues
• Import issues• System limitations• Dirty data• Non-standardized
data
• Standardized but variable data
• Application-related issues• Others?
© Miguel Rio Branco, 1985, "BRAZIL," ARTstor PAR52903
5
Deficits
• Time• Staff• Budget• Skills• Development• Confidence• Big-picture view• Systems
© John Murphy, “Men Digging," ARTstor DMCC.1993.4
6
System Limitations
• Field character limits• Report field limits
7
Import => Excel, Zeros
Problem:
• Loss of trailing zero• Loss of leading zeros
Solution:• Import > Delimited >
Tab > Text format
Example:• 1944826 =>1944-8260• 14826 => 0001-4826
8
Add Hyphens to ISSNs - 1
• Sort by ISSN to make sure leading zeros are intact.
• In a new column type the formula =MID(A1,1,4)&”-”&MID(A1,5,4)
• Syntax: =MID(text,start_num,num_chars)
9
Add Hyphens to ISSNs - 2
• Or use Cell Formatting
• Select range• Format Cells
(CTRL+1) > Number tab > Custom > 0000-0000 > OK
10
Import => Excel, Numbers
Problem:• ID # garbled
Solution:• Choose Number format• Remove decimal places
11
Import => Excel, Commas
Problem:
Solution:
Comma
12
Restore Dropped Leading Zeros
• Change to TEXT format
• Sort by ISSN• Use
=CONCATENATE(“000”,A1) to add zeros
13
Excel: Remove Quotes
• Select column• Find & Select >
Replace > Find what: [space]” > Replace with: [leave blank]
14
Remove Non-Printable Characters• Use TRIM (removes
ASCII value 32 = space character, except single spaces between words)
• Use CLEAN (ASCII codes 0-31, Unicode 127, 129, 141, 143, 144, 157
• Use SUBSTITUTE for higher codes
Book of Hours, c. 1440 “Use of Tornai,” © ARTstor, Rawl.liturg.e.14_roll314.1_frame3
15
ASCII Characters
http://en.wikipedia.org/wiki/ASCII 12/22/2009
16
Access: Subscript Out of Range - 1
Try These Steps:• Check for spaces in
column headings• Use TRIM (removes
ASCII value 32 = space character)
• Use CLEAN for ASCII code 0-31
• Delete empty right columns
• Remove empty “used” cells:
1.Find end of "used cells": CTRL+SHIFT+END
2.Select all empty “used” cells > Edit > Clear > All or Edit > Delete. Save the file.
17
Access: Subscript Out of Range - 2
• Copy and paste cells into a new workbook. Save. Import into Access.
• Or, save file as CSV, import into Access. = Will see data error.
Unknown, circa 1885-1900, “Sorting Mail,” © ARTstor MC212-D-99
18
Access: Type Conversion Failure
• Make sure data types in fields match data types in columns.
• Data like ISBNs are text but can be “read” like numbers.
• Add top row with correct data/type: XXX for ISBN
19
Access: Remove Quotes
• Search for records
with “”: Criteria: LIKE "*" & Chr(34) & "*"
• Replace([SomeField],Chr(34),"") will replace a quotation mark (") with a zero-length string
© Erich Lessing, Bayeaux Tapestry, c. 1070-80, ARTstor 31-01-01/23
20
Access: ISSN Issues
• Find too-short ISSNs: Len([FieldName])<n [9 is good here]
• Find ISSNs without hyphens: SELECT table.field, table.field FROM table WHERE (((table.field) Not Like "*-*));
© Eve Arnold, 1979, “Hsishuang Panna Weeding,” ARTstor
21
Access & ARL Stats - 1
• =Sum([YTD Total]) Sum of article downloads in COUNTER Journal 1 report.
• =[Jan-09]+[Feb-09]+[Mar-09]+[Apr-09]+[May-09]+[Jun-09] Sum of Jan-Jun 2009 COUNTER J1.
• =[Jul-09]+[Aug-09]+[Sep-09]+[Oct-09]+[Nov-09]+[Dec-09] Sum of Jul-Dec 2009 COUNTER J1.
22
Access & ARL Stats - 2
• RowCount:Count(*) Number of titles in a set.
• =[YTD Total]*[Cost] Annual cost-per-use.
• Access Expressions: http://office.microsoft.com/en-us/access/HA011814491033.aspx
23
Access or Excel?
Access:• Relational• Large amount of data• Primary key• Many people working• Long text strings
Excel:• Non-relational• Mostly numeric• Calculations/Statistics
Nelson, Emma. 2010. Using Access or Excel to Manage Your Data. http://office.microsoft.com/en-us/help/HA010429181033.aspxSee also:Microsoft. 2010. Examples of Expressions. http://office.microsoft.com/en-us/access/HA011814491033.aspx
24
XML - Excel
• Excel can interpret XML
• Data > Get External Data > From XML Data Import
• Format without affecting source data
25
Later Excel: Activate Developer tab through Office logo (upper left)
MarcEdit: XML - 1
• Convert large XML files to Excel
• Specify input, output files
• Choose MARC21XML => MARC
26
MarcEdit: XML - 2• Choose display fields,
input, output files• View, format in Excel
27
MarcEdit: MARC - 1
• Convert large files to local practices
DELETE existing
999 field
910 field(s)…
Copy 035 to 001
Remove (Sc-P) Prefix from 001
ADD
910 |aDEL SCP ; jc ; 2009/7/8
Field:910
Field data: \\$aDEL SCP ; jc ; 2009/7/8
998 |an
Field: 998
Field data: \\$an
28
MarcEdit: MARC - 2
29
MarcEdit Information
• By Terry Reese
• http://oregonstate.edu/~reeset/marcedit/html/
• MARCEDIT-L listserv at https://listserv.gmu.edu/cgi-bin/wa?SUBED1=marcedit-l&A=1
• Regular updates
• Tutorials, templates, scripts
30
ILS: “Global Update”• For records within ILS system
• For universal changes
• “Check website for coverage.”
31
A Better Way? Macros
• microsoft.public.excel (General Excel group) http://groups.google.com/groups/dir?sel=33606583&hl=en
• OzGrid Forum (Excel tips and VBA macros) http://www.ozgrid.com/forum/
• http://www.lib-stats.org.uk/ (statistics listserv)
[Courtesy Tansy Matthews]
32
Data Quality
Strategies to Improve Data Quality:
1. Identify problems
2. Treat data as an asset
3. Implement quality systems
Principle Activities for Data:
• Acquire• Store• Use
33
Poor Quality Data Indicators
• Uncorrected errors• Redundant data/
processes• Lack of data for
strategizing• Frustration with data,
data supplier, IT (c) 2006, SCALA, “Shoes,” 6th century BCE, ARTstor
34
Treat Data as an Asset
• Inventory data assets• Data = dynamic;
process = asset• Align responsibilities:
acquire, store, use data.
• Establish customer-supplier relationships for data. © The Metropolitan Museum of
Art, “Tournament,” late 16th century, ARTstor
35
Apply Quality Principles
• Create and keep a customer
• Detect and correct errors
• Determine root cause of defects
• Manage the process
• Communicate results• Audit supplier
performance
36
Library Data Quality Policy - 1
Suppliers/Creators:• Understand users,
uses, & requirements• Ensure requirements
are met• Manage data creation
process
Data Processors:• Avoid duplication• Safeguard data• Make data accessible• Promote data quality
in IT
37
Library Data Quality Policy - 2
Users:• Define requirements,
work with suppliers• Provide feedback• Interpret data
correctly• Use data legitimately• Protect privacy
Logistics:• Determine master
systems• Understand system
limitations• Accessible storage• Match inputs with
needs• Identify key keepers
38
Digging for Coal, The Illustrated Bartsch, vol 85, 1486, ARTstor 8586.1486/154
Other Techniques?
39
Bibliography• Microsoft, 2009. Top Ten Ways to Clean Your Data.
http://office.microsoft.com/en-us/excel/HA102218401033.aspx, accessed 12/18/2009.
• Use error checking to convert numbers that are stored as text to numbers. http://office.microsoft.com/en-us/excel/HP012167611033.aspx, accessed 12/22/2009.
• Apply a number format to numbers that are stored as text http://office.microsoft.com/en-us/excel/HP012167611033.aspx
• Redman, Thomas C. 1995. Improve Data Quality for Competitive Advantage. Sloan Management Review, 36:2, 99-107.
• Rothschiller, Chad. 2007. Manipulating and Massaging Data in Excel. http://blogs.msdn.com/excel/archive/2007/11/12/manipulating-and-massaging-data-in-excel.aspx 12/18/2009.
• Spencer, John. March 6, 2008. Find/ /replace characters like quotes http://www.eggheadcafe.com/software/aspnet/31782118/findreplace-characters-l.aspx 12/22/2009
40