data clean-up: is there a better way?

Post on 17-Jan-2015

5.986 Views

Category:

Education

3 Downloads

Preview:

Click to see full reader

DESCRIPTION

Moving data about library resources among systems often engenders data cleanup processes. What is the best way to clean up data? Which tools and skills for non-programmers can help? See how University of California, Riverside Libraries tackle this issue, then share tips and techniques in an open forum.

TRANSCRIPT

Data Clean-Up:

Is there a Better Way?

Margaret Hogarth

ER&L, 2/2/2010© Camille Pissaro, 1882, “Peasant Woman Digging,” ARTstor 40-06-12/42 1

To Be Covered:

• Commonalities• Issues• Deficits• Excel• Access• MarcEdit• Global Update• Data Quality Policy• Suggestions?

© The Metropolitan Museum of Art, “Book of the Knowledge of Ingenious Mechanical Devices by al-Jazari,” ARTstor

2

Commonalities

• Data sources• Tools• Data clean-up• Capture and use

issues• Needed technical

skills

bonobokids.com

3

General Approach

• Keep original data original; copy to a new worksheet

• Save frequently• Use meaningful

file/worksheet names• Folder of “final”

documents© John Vink, 2005, "CAMBODIA," ARTstor PAR287761

4

Issues

• Import issues• System limitations• Dirty data• Non-standardized

data

• Standardized but variable data

• Application-related issues• Others?

© Miguel Rio Branco, 1985, "BRAZIL," ARTstor PAR52903

5

Deficits

• Time• Staff• Budget• Skills• Development• Confidence• Big-picture view• Systems

© John Murphy, “Men Digging," ARTstor DMCC.1993.4

6

System Limitations

• Field character limits• Report field limits

7

Import => Excel, Zeros

Problem:

• Loss of trailing zero• Loss of leading zeros

Solution:• Import > Delimited >

Tab > Text format

Example:• 1944826 =>1944-8260• 14826 => 0001-4826

8

Add Hyphens to ISSNs - 1

• Sort by ISSN to make sure leading zeros are intact.

• In a new column type the formula =MID(A1,1,4)&”-”&MID(A1,5,4)

• Syntax: =MID(text,start_num,num_chars)

9

Add Hyphens to ISSNs - 2

• Or use Cell Formatting

• Select range• Format Cells

(CTRL+1) > Number tab > Custom > 0000-0000 > OK

10

Import => Excel, Numbers

Problem:• ID # garbled

Solution:• Choose Number format• Remove decimal places

11

Import => Excel, Commas

Problem:

Solution:

Comma

12

Restore Dropped Leading Zeros

• Change to TEXT format

• Sort by ISSN• Use

=CONCATENATE(“000”,A1) to add zeros

13

Excel: Remove Quotes

• Select column• Find & Select >

Replace > Find what: [space]” > Replace with: [leave blank]

14

Remove Non-Printable Characters• Use TRIM (removes

ASCII value 32 = space character, except single spaces between words)

• Use CLEAN (ASCII codes 0-31, Unicode 127, 129, 141, 143, 144, 157

• Use SUBSTITUTE for higher codes

Book of Hours, c. 1440 “Use of Tornai,” © ARTstor, Rawl.liturg.e.14_roll314.1_frame3

15

ASCII Characters

http://en.wikipedia.org/wiki/ASCII 12/22/2009

16

Access: Subscript Out of Range - 1

Try These Steps:• Check for spaces in

column headings• Use TRIM (removes

ASCII value 32 = space character)

• Use CLEAN for ASCII code 0-31

• Delete empty right columns

• Remove empty “used” cells:

1.Find end of "used cells": CTRL+SHIFT+END

2.Select all empty “used” cells > Edit > Clear > All or Edit > Delete. Save the file.

17

Access: Subscript Out of Range - 2

• Copy and paste cells into a new workbook. Save. Import into Access.

• Or, save file as CSV, import into Access. = Will see data error.

Unknown, circa 1885-1900, “Sorting Mail,” © ARTstor MC212-D-99

18

Access: Type Conversion Failure

• Make sure data types in fields match data types in columns.

• Data like ISBNs are text but can be “read” like numbers.

• Add top row with correct data/type: XXX for ISBN

19

Access: Remove Quotes

• Search for records

with “”: Criteria: LIKE "*" & Chr(34) & "*"

• Replace([SomeField],Chr(34),"") will replace a quotation mark (") with a zero-length string

© Erich Lessing, Bayeaux Tapestry, c. 1070-80, ARTstor 31-01-01/23

20

Access: ISSN Issues

• Find too-short ISSNs: Len([FieldName])<n [9 is good here]

• Find ISSNs without hyphens: SELECT table.field, table.field FROM table WHERE (((table.field) Not Like "*-*));

© Eve Arnold, 1979, “Hsishuang Panna Weeding,” ARTstor

21

Access & ARL Stats - 1

• =Sum([YTD Total]) Sum of article downloads in COUNTER Journal 1 report.

• =[Jan-09]+[Feb-09]+[Mar-09]+[Apr-09]+[May-09]+[Jun-09] Sum of Jan-Jun 2009 COUNTER J1.

• =[Jul-09]+[Aug-09]+[Sep-09]+[Oct-09]+[Nov-09]+[Dec-09] Sum of Jul-Dec 2009 COUNTER J1.

22

Access & ARL Stats - 2

• RowCount:Count(*) Number of titles in a set.

• =[YTD Total]*[Cost] Annual cost-per-use.

• Access Expressions: http://office.microsoft.com/en-us/access/HA011814491033.aspx

23

Access or Excel?

Access:• Relational• Large amount of data• Primary key• Many people working• Long text strings

Excel:• Non-relational• Mostly numeric• Calculations/Statistics

Nelson, Emma. 2010. Using Access or Excel to Manage Your Data. http://office.microsoft.com/en-us/help/HA010429181033.aspxSee also:Microsoft. 2010. Examples of Expressions. http://office.microsoft.com/en-us/access/HA011814491033.aspx

24

XML - Excel

• Excel can interpret XML

• Data > Get External Data > From XML Data Import

• Format without affecting source data

25

Later Excel: Activate Developer tab through Office logo (upper left)

MarcEdit: XML - 1

• Convert large XML files to Excel

• Specify input, output files

• Choose MARC21XML => MARC

26

MarcEdit: XML - 2• Choose display fields,

input, output files• View, format in Excel

27

MarcEdit: MARC - 1

• Convert large files to local practices

DELETE existing

999 field

910 field(s)…

Copy 035 to 001

Remove (Sc-P) Prefix from 001

ADD

 910 |aDEL SCP ; jc ; 2009/7/8

Field:910

Field data: \\$aDEL SCP ; jc ; 2009/7/8

998 |an

Field: 998

Field data: \\$an

28

MarcEdit: MARC - 2

29

MarcEdit Information

• By Terry Reese

• http://oregonstate.edu/~reeset/marcedit/html/

• MARCEDIT-L listserv at  https://listserv.gmu.edu/cgi-bin/wa?SUBED1=marcedit-l&A=1   

• Regular updates

• Tutorials, templates, scripts

30

ILS: “Global Update”• For records within ILS system

• For universal changes

• “Check website for coverage.”

31

A Better Way? Macros

• microsoft.public.excel (General Excel group) http://groups.google.com/groups/dir?sel=33606583&hl=en

• OzGrid Forum (Excel tips and VBA macros) http://www.ozgrid.com/forum/

• http://www.lib-stats.org.uk/ (statistics listserv)

[Courtesy Tansy Matthews]

32

Data Quality

Strategies to Improve Data Quality:

1. Identify problems

2. Treat data as an asset

3. Implement quality systems

Principle Activities for Data:

• Acquire• Store• Use

33

Poor Quality Data Indicators

• Uncorrected errors• Redundant data/

processes• Lack of data for

strategizing• Frustration with data,

data supplier, IT (c) 2006, SCALA, “Shoes,” 6th century BCE, ARTstor

34

Treat Data as an Asset

• Inventory data assets• Data = dynamic;

process = asset• Align responsibilities:

acquire, store, use data.

• Establish customer-supplier relationships for data. © The Metropolitan Museum of

Art, “Tournament,” late 16th century, ARTstor

35

Apply Quality Principles

• Create and keep a customer

• Detect and correct errors

• Determine root cause of defects

• Manage the process

• Communicate results• Audit supplier

performance

36

Library Data Quality Policy - 1

Suppliers/Creators:• Understand users,

uses, & requirements• Ensure requirements

are met• Manage data creation

process

Data Processors:• Avoid duplication• Safeguard data• Make data accessible• Promote data quality

in IT

37

Library Data Quality Policy - 2

Users:• Define requirements,

work with suppliers• Provide feedback• Interpret data

correctly• Use data legitimately• Protect privacy

Logistics:• Determine master

systems• Understand system

limitations• Accessible storage• Match inputs with

needs• Identify key keepers

38

Margaret.hogarth@ucr.edu951-827-2937

Digging for Coal, The Illustrated Bartsch, vol 85, 1486, ARTstor 8586.1486/154

Other Techniques?

39

Bibliography• Microsoft, 2009. Top Ten Ways to Clean Your Data.

http://office.microsoft.com/en-us/excel/HA102218401033.aspx, accessed 12/18/2009.

• Use error checking to convert numbers that are stored as text to numbers. http://office.microsoft.com/en-us/excel/HP012167611033.aspx, accessed 12/22/2009.

• Apply a number format to numbers that are stored as text http://office.microsoft.com/en-us/excel/HP012167611033.aspx

• Redman, Thomas C. 1995. Improve Data Quality for Competitive Advantage. Sloan Management Review, 36:2, 99-107.

• Rothschiller, Chad. 2007. Manipulating and Massaging Data in Excel. http://blogs.msdn.com/excel/archive/2007/11/12/manipulating-and-massaging-data-in-excel.aspx 12/18/2009.

• Spencer, John. March 6, 2008. Find/ /replace characters like quotes http://www.eggheadcafe.com/software/aspnet/31782118/findreplace-characters-l.aspx 12/22/2009

40

top related