data clean-up: is there a better way?

40
Data Clean- Up: Is there a Better Way? Margaret Hogarth ER&L, 2/2/2010 © Camille Pissaro, 1882, “Peasant Woman Digging,” ARTstor 40- 06-12/42 1

Upload: electronic-resources-libraries

Post on 17-Jan-2015

5.986 views

Category:

Education


3 download

DESCRIPTION

Moving data about library resources among systems often engenders data cleanup processes. What is the best way to clean up data? Which tools and skills for non-programmers can help? See how University of California, Riverside Libraries tackle this issue, then share tips and techniques in an open forum.

TRANSCRIPT

Page 1: Data Clean-up: Is There A Better Way?

Data Clean-Up:

Is there a Better Way?

Margaret Hogarth

ER&L, 2/2/2010© Camille Pissaro, 1882, “Peasant Woman Digging,” ARTstor 40-06-12/42 1

Page 2: Data Clean-up: Is There A Better Way?

To Be Covered:

• Commonalities• Issues• Deficits• Excel• Access• MarcEdit• Global Update• Data Quality Policy• Suggestions?

© The Metropolitan Museum of Art, “Book of the Knowledge of Ingenious Mechanical Devices by al-Jazari,” ARTstor

2

Page 3: Data Clean-up: Is There A Better Way?

Commonalities

• Data sources• Tools• Data clean-up• Capture and use

issues• Needed technical

skills

bonobokids.com

3

Page 4: Data Clean-up: Is There A Better Way?

General Approach

• Keep original data original; copy to a new worksheet

• Save frequently• Use meaningful

file/worksheet names• Folder of “final”

documents© John Vink, 2005, "CAMBODIA," ARTstor PAR287761

4

Page 5: Data Clean-up: Is There A Better Way?

Issues

• Import issues• System limitations• Dirty data• Non-standardized

data

• Standardized but variable data

• Application-related issues• Others?

© Miguel Rio Branco, 1985, "BRAZIL," ARTstor PAR52903

5

Page 6: Data Clean-up: Is There A Better Way?

Deficits

• Time• Staff• Budget• Skills• Development• Confidence• Big-picture view• Systems

© John Murphy, “Men Digging," ARTstor DMCC.1993.4

6

Page 7: Data Clean-up: Is There A Better Way?

System Limitations

• Field character limits• Report field limits

7

Page 8: Data Clean-up: Is There A Better Way?

Import => Excel, Zeros

Problem:

• Loss of trailing zero• Loss of leading zeros

Solution:• Import > Delimited >

Tab > Text format

Example:• 1944826 =>1944-8260• 14826 => 0001-4826

8

Page 9: Data Clean-up: Is There A Better Way?

Add Hyphens to ISSNs - 1

• Sort by ISSN to make sure leading zeros are intact.

• In a new column type the formula =MID(A1,1,4)&”-”&MID(A1,5,4)

• Syntax: =MID(text,start_num,num_chars)

9

Page 10: Data Clean-up: Is There A Better Way?

Add Hyphens to ISSNs - 2

• Or use Cell Formatting

• Select range• Format Cells

(CTRL+1) > Number tab > Custom > 0000-0000 > OK

10

Page 11: Data Clean-up: Is There A Better Way?

Import => Excel, Numbers

Problem:• ID # garbled

Solution:• Choose Number format• Remove decimal places

11

Page 12: Data Clean-up: Is There A Better Way?

Import => Excel, Commas

Problem:

Solution:

Comma

12

Page 13: Data Clean-up: Is There A Better Way?

Restore Dropped Leading Zeros

• Change to TEXT format

• Sort by ISSN• Use

=CONCATENATE(“000”,A1) to add zeros

13

Page 14: Data Clean-up: Is There A Better Way?

Excel: Remove Quotes

• Select column• Find & Select >

Replace > Find what: [space]” > Replace with: [leave blank]

14

Page 15: Data Clean-up: Is There A Better Way?

Remove Non-Printable Characters• Use TRIM (removes

ASCII value 32 = space character, except single spaces between words)

• Use CLEAN (ASCII codes 0-31, Unicode 127, 129, 141, 143, 144, 157

• Use SUBSTITUTE for higher codes

Book of Hours, c. 1440 “Use of Tornai,” © ARTstor, Rawl.liturg.e.14_roll314.1_frame3

15

Page 16: Data Clean-up: Is There A Better Way?

ASCII Characters

http://en.wikipedia.org/wiki/ASCII 12/22/2009

16

Page 17: Data Clean-up: Is There A Better Way?

Access: Subscript Out of Range - 1

Try These Steps:• Check for spaces in

column headings• Use TRIM (removes

ASCII value 32 = space character)

• Use CLEAN for ASCII code 0-31

• Delete empty right columns

• Remove empty “used” cells:

1.Find end of "used cells": CTRL+SHIFT+END

2.Select all empty “used” cells > Edit > Clear > All or Edit > Delete. Save the file.

17

Page 18: Data Clean-up: Is There A Better Way?

Access: Subscript Out of Range - 2

• Copy and paste cells into a new workbook. Save. Import into Access.

• Or, save file as CSV, import into Access. = Will see data error.

Unknown, circa 1885-1900, “Sorting Mail,” © ARTstor MC212-D-99

18

Page 19: Data Clean-up: Is There A Better Way?

Access: Type Conversion Failure

• Make sure data types in fields match data types in columns.

• Data like ISBNs are text but can be “read” like numbers.

• Add top row with correct data/type: XXX for ISBN

19

Page 20: Data Clean-up: Is There A Better Way?

Access: Remove Quotes

• Search for records

with “”: Criteria: LIKE "*" & Chr(34) & "*"

• Replace([SomeField],Chr(34),"") will replace a quotation mark (") with a zero-length string

© Erich Lessing, Bayeaux Tapestry, c. 1070-80, ARTstor 31-01-01/23

20

Page 21: Data Clean-up: Is There A Better Way?

Access: ISSN Issues

• Find too-short ISSNs: Len([FieldName])<n [9 is good here]

• Find ISSNs without hyphens: SELECT table.field, table.field FROM table WHERE (((table.field) Not Like "*-*));

© Eve Arnold, 1979, “Hsishuang Panna Weeding,” ARTstor

21

Page 22: Data Clean-up: Is There A Better Way?

Access & ARL Stats - 1

• =Sum([YTD Total]) Sum of article downloads in COUNTER Journal 1 report.

• =[Jan-09]+[Feb-09]+[Mar-09]+[Apr-09]+[May-09]+[Jun-09] Sum of Jan-Jun 2009 COUNTER J1.

• =[Jul-09]+[Aug-09]+[Sep-09]+[Oct-09]+[Nov-09]+[Dec-09] Sum of Jul-Dec 2009 COUNTER J1.

22

Page 23: Data Clean-up: Is There A Better Way?

Access & ARL Stats - 2

• RowCount:Count(*) Number of titles in a set.

• =[YTD Total]*[Cost] Annual cost-per-use.

• Access Expressions: http://office.microsoft.com/en-us/access/HA011814491033.aspx

23

Page 24: Data Clean-up: Is There A Better Way?

Access or Excel?

Access:• Relational• Large amount of data• Primary key• Many people working• Long text strings

Excel:• Non-relational• Mostly numeric• Calculations/Statistics

Nelson, Emma. 2010. Using Access or Excel to Manage Your Data. http://office.microsoft.com/en-us/help/HA010429181033.aspxSee also:Microsoft. 2010. Examples of Expressions. http://office.microsoft.com/en-us/access/HA011814491033.aspx

24

Page 25: Data Clean-up: Is There A Better Way?

XML - Excel

• Excel can interpret XML

• Data > Get External Data > From XML Data Import

• Format without affecting source data

25

Later Excel: Activate Developer tab through Office logo (upper left)

Page 26: Data Clean-up: Is There A Better Way?

MarcEdit: XML - 1

• Convert large XML files to Excel

• Specify input, output files

• Choose MARC21XML => MARC

26

Page 27: Data Clean-up: Is There A Better Way?

MarcEdit: XML - 2• Choose display fields,

input, output files• View, format in Excel

27

Page 28: Data Clean-up: Is There A Better Way?

MarcEdit: MARC - 1

• Convert large files to local practices

DELETE existing

999 field

910 field(s)…

Copy 035 to 001

Remove (Sc-P) Prefix from 001

ADD

 910 |aDEL SCP ; jc ; 2009/7/8

Field:910

Field data: \\$aDEL SCP ; jc ; 2009/7/8

998 |an

Field: 998

Field data: \\$an

28

Page 29: Data Clean-up: Is There A Better Way?

MarcEdit: MARC - 2

29

Page 30: Data Clean-up: Is There A Better Way?

MarcEdit Information

• By Terry Reese

• http://oregonstate.edu/~reeset/marcedit/html/

• MARCEDIT-L listserv at  https://listserv.gmu.edu/cgi-bin/wa?SUBED1=marcedit-l&A=1   

• Regular updates

• Tutorials, templates, scripts

30

Page 31: Data Clean-up: Is There A Better Way?

ILS: “Global Update”• For records within ILS system

• For universal changes

• “Check website for coverage.”

31

Page 32: Data Clean-up: Is There A Better Way?

A Better Way? Macros

• microsoft.public.excel (General Excel group) http://groups.google.com/groups/dir?sel=33606583&hl=en

• OzGrid Forum (Excel tips and VBA macros) http://www.ozgrid.com/forum/

• http://www.lib-stats.org.uk/ (statistics listserv)

[Courtesy Tansy Matthews]

32

Page 33: Data Clean-up: Is There A Better Way?

Data Quality

Strategies to Improve Data Quality:

1. Identify problems

2. Treat data as an asset

3. Implement quality systems

Principle Activities for Data:

• Acquire• Store• Use

33

Page 34: Data Clean-up: Is There A Better Way?

Poor Quality Data Indicators

• Uncorrected errors• Redundant data/

processes• Lack of data for

strategizing• Frustration with data,

data supplier, IT (c) 2006, SCALA, “Shoes,” 6th century BCE, ARTstor

34

Page 35: Data Clean-up: Is There A Better Way?

Treat Data as an Asset

• Inventory data assets• Data = dynamic;

process = asset• Align responsibilities:

acquire, store, use data.

• Establish customer-supplier relationships for data. © The Metropolitan Museum of

Art, “Tournament,” late 16th century, ARTstor

35

Page 36: Data Clean-up: Is There A Better Way?

Apply Quality Principles

• Create and keep a customer

• Detect and correct errors

• Determine root cause of defects

• Manage the process

• Communicate results• Audit supplier

performance

36

Page 37: Data Clean-up: Is There A Better Way?

Library Data Quality Policy - 1

Suppliers/Creators:• Understand users,

uses, & requirements• Ensure requirements

are met• Manage data creation

process

Data Processors:• Avoid duplication• Safeguard data• Make data accessible• Promote data quality

in IT

37

Page 38: Data Clean-up: Is There A Better Way?

Library Data Quality Policy - 2

Users:• Define requirements,

work with suppliers• Provide feedback• Interpret data

correctly• Use data legitimately• Protect privacy

Logistics:• Determine master

systems• Understand system

limitations• Accessible storage• Match inputs with

needs• Identify key keepers

38

Page 39: Data Clean-up: Is There A Better Way?

[email protected]

Digging for Coal, The Illustrated Bartsch, vol 85, 1486, ARTstor 8586.1486/154

Other Techniques?

39

Page 40: Data Clean-up: Is There A Better Way?

Bibliography• Microsoft, 2009. Top Ten Ways to Clean Your Data.

http://office.microsoft.com/en-us/excel/HA102218401033.aspx, accessed 12/18/2009.

• Use error checking to convert numbers that are stored as text to numbers. http://office.microsoft.com/en-us/excel/HP012167611033.aspx, accessed 12/22/2009.

• Apply a number format to numbers that are stored as text http://office.microsoft.com/en-us/excel/HP012167611033.aspx

• Redman, Thomas C. 1995. Improve Data Quality for Competitive Advantage. Sloan Management Review, 36:2, 99-107.

• Rothschiller, Chad. 2007. Manipulating and Massaging Data in Excel. http://blogs.msdn.com/excel/archive/2007/11/12/manipulating-and-massaging-data-in-excel.aspx 12/18/2009.

• Spencer, John. March 6, 2008. Find/ /replace characters like quotes http://www.eggheadcafe.com/software/aspnet/31782118/findreplace-characters-l.aspx 12/22/2009

40