beautiful research data (structured data and open refine)

Beautiful Research Data Kirsta Stapelfeldt, Coordinator UTSC Library’s Digital Scholarship Unit

Upload: digital-scholarship-unit-at-the-utsc-library

Post on 15-Jan-2015


DESCRIPTION

http://serai.utsc.utoronto.ca/rrsi2014 "Unlike traditional academic conferences, the Roots & Routes Summer Institute features a combination of informal presentations, seminar-style discussions of shared materials, hands-on workshops on a variety of digital tools, and small-group project development sessions. The institute welcomes participants from a range of disciplines with an interest in engaging with digital scholarship; technical experience is not a requirement. Graduate students (MA and PhD), postdoctoral fellows and faculty are all encouraged to apply."

TRANSCRIPT

Page 1: Beautiful Research Data (Structured Data and Open Refine)

Beautiful Research Data

Kirsta Stapelfeldt, Coordinator

UTSC Library’s Digital Scholarship Unit

Page 2: Beautiful Research Data (Structured Data and Open Refine)

In this presentation

● Part One: preparing to create machine-readable data at the onset of a research endeavour

● Part Two: working with “messy” datasets

Page 3: Beautiful Research Data (Structured Data and Open Refine)

Benefits of machine-readable data

● Easier to query for new insights

● Easier to mount in a computing environment

● Easier to share with others

Page 4: Beautiful Research Data (Structured Data and Open Refine)

Just a .csv + Fusion Tables

● Fusion Tables is an experimental, web-based Chrome app

● Took a spreadsheet that Natalie has been working on and loaded it into the app

● Results have not been massaged at all

● We can expect additional benefits from having structured data in the future

Page 5: Beautiful Research Data (Structured Data and Open Refine)

Part One

In which you have no research data... yet

Page 6: Beautiful Research Data (Structured Data and Open Refine)

Best Case Scenario

You start by utilizing some best practices

4 Pieces of low-hanging fruit...

Page 7: Beautiful Research Data (Structured Data and Open Refine)

1. No Word documents

● Use a database (even a spreadsheet), not .doc files

● Avoid a lot of style information in your research documents (such as bolding and italicizing text, or moving things to other areas of the page using the tab key or spacebar)

● Why?

Page 8: Beautiful Research Data (Structured Data and Open Refine)

Look beyond the surface.

&nbsp; &nbsp; &nbsp; &nbsp; no thank you!

http://www.bartleby.com/103/33.html

Page 9: Beautiful Research Data (Structured Data and Open Refine)

Beauty is more than browser deep

http://www.gutenberg.org/ebooks/18827

Page 10: Beautiful Research Data (Structured Data and Open Refine)

2. Use consistent formats for elements such as date & language

● e.g. dates recorded in one consistent format wherever possible (such as 05/25/2014)
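
A minimal Python sketch of this practice, assuming the source dates are month/day/year strings like the one on this slide. The function name `to_iso_date` is illustrative, not part of the presentation. It converts each date to ISO 8601 (YYYY-MM-DD), which is unambiguous across locales and sorts correctly as plain text.

```python
from datetime import datetime


def to_iso_date(value: str, source_format: str = "%m/%d/%Y") -> str:
    """Convert a date string in a known source format to ISO 8601 (YYYY-MM-DD).

    Raises ValueError if the value does not match the assumed format,
    so inconsistent entries are caught rather than silently kept.
    """
    return datetime.strptime(value.strip(), source_format).date().isoformat()


print(to_iso_date("05/25/2014"))
```

Entries that raise `ValueError` are the ones recorded in some other format and are worth reviewing by hand.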

Page 11: Beautiful Research Data (Structured Data and Open Refine)

3. Taxonomies & Standards

● Use controlled vocabularies for keywords, place names, and relevant person names

o Using an open format for a place name can make geocoding much easier

o Stay consistent in a given language

Page 12: Beautiful Research Data (Structured Data and Open Refine)

4. Text Encoding

● Ensure you are using Unicode (UTF-8)

● How do you know?

o Notepad can be your friend

o Test a sample between systems
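
One way to test a sample programmatically is sketched below in Python. The helper name `is_valid_utf8` is an assumption made for illustration. It only checks that the bytes decode cleanly as UTF-8; plain ASCII text also passes, and a clean decode does not prove that every character is the one you intended.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word saved in two encodings: UTF-8 decodes cleanly,
# the Latin-1 bytes do not form valid UTF-8.
utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("latin-1")

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

To check a file, read it in binary mode (for example `open(path, "rb").read()`) and pass the result to the function.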

Page 13: Beautiful Research Data (Structured Data and Open Refine)

http://www.string-functions.com/encodingerror.aspx

Page 14: Beautiful Research Data (Structured Data and Open Refine)

Changing the way you think about your research process

Draw a picture

Page 15: Beautiful Research Data (Structured Data and Open Refine)

1. Think small.

Atomistic information (what is the smallest meaningful unit of information you are collecting?)

For example:

● A person’s name, religion, and DOB

● Mention of a location or name

● Repeated occurrence

Page 16: Beautiful Research Data (Structured Data and Open Refine)

2. Connect the dots.

What are the relationships between your data elements?

Useful tool: The Entity Relationship Diagram
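
The relationships an ERD captures can also be sketched in code. The Python example below is illustrative only: the `Person` and `Mention` entities, their fields, and the sample values are hypothetical and are not taken from any content model in this presentation. It shows a one-to-many relationship (one person, many mentions) built from atomistic fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mention:
    """One occurrence of a person in a source (hypothetical entity)."""
    source_title: str
    page: int


@dataclass
class Person:
    """A person with atomistic attributes and zero or more mentions."""
    name: str
    religion: str
    date_of_birth: str  # ISO 8601 text, e.g. "1600-01-31"
    mentions: List[Mention] = field(default_factory=list)


# Made-up sample values, for illustration only.
person = Person(name="Example Person", religion="Unknown", date_of_birth="1600-01-31")
person.mentions.append(Mention(source_title="Example Register", page=12))
person.mentions.append(Mention(source_title="Example Letter", page=3))

print(len(person.mentions))
```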

Page 17: Beautiful Research Data (Structured Data and Open Refine)

Draft Dragomans Content Model

Page 18: Beautiful Research Data (Structured Data and Open Refine)

Crow’s Foot Notation

Exercise - Building an ERD

Page 20: Beautiful Research Data (Structured Data and Open Refine)

Part Two

Your data is a mess

Page 21: Beautiful Research Data (Structured Data and Open Refine)

Tools for dealing with messy data

● Regular Expressions

● Open Refine

Page 22: Beautiful Research Data (Structured Data and Open Refine)

Regular Expressions: Find & Replace on Steroids

● Available in most productivity suites (iWork, Microsoft Word, Libre Office/Open Office)

● Syntax often differs a little across systems

Page 23: Beautiful Research Data (Structured Data and Open Refine)

“The regular expression (?<=\.) {2,}(?=[A-Z]) matches at least two spaces occurring after period (.) and before an upper case letter as highlighted in the text above.”
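
The pattern quoted on this slide can be tried out in Python, whose `re` module supports the same lookbehind and lookahead syntax. This is a small sketch; the function name and sample sentence are made up for illustration.

```python
import re

# Pattern quoted on the slide: two or more spaces that follow a period
# and precede an upper-case letter.
EXTRA_SPACES = re.compile(r"(?<=\.) {2,}(?=[A-Z])")


def collapse_sentence_spacing(text: str) -> str:
    """Replace runs of 2+ spaces between sentences with a single space."""
    return EXTRA_SPACES.sub(" ", text)


sample = "First sentence.  Second sentence.    Third sentence."
print(collapse_sentence_spacing(sample))
```

Because the period and the capital letter sit inside lookaround groups, they are checked but not replaced; only the spaces themselves are matched.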

Page 26: Beautiful Research Data (Structured Data and Open Refine)

Open Refine

● Similar to spreadsheet software

● Installed on your computer, but used through your browser

● “Power Tool” for messy data

What follows draws heavily on this lesson: http://programminghistorian.org/lessons/cleaning-data-with-openrefine (thanks to Seth van Hooland, Ruben Verborgh, and Max De Wilde)

Page 27: Beautiful Research Data (Structured Data and Open Refine)

Base Assumption of Open Refine

● You have “structured data”

● Some consistent and machine-readable logic has been applied to your data

o Excel, .csv, XML

● You may have structured data and not know it

o Check export options from any software you regularly use

Page 28: Beautiful Research Data (Structured Data and Open Refine)

1. Remove duplicates

2. Remove blanks

3. Make data atomistic (smallest meaningful unit)

4. Keep terms/formats consistent

Page 30: Beautiful Research Data (Structured Data and Open Refine)

Set appropriate options and “Create Project”

Page 31: Beautiful Research Data (Structured Data and Open Refine)

Project is created with 75,814 rows.

Page 32: Beautiful Research Data (Structured Data and Open Refine)

1. Look for Blank Records

See if any RecordIDs are blank by using a numeric facet

Page 33: Beautiful Research Data (Structured Data and Open Refine)

“Non-numeric” rows are blank.

Page 34: Beautiful Research Data (Structured Data and Open Refine)

Hovering over the cell makes an “edit” link visible

Page 35: Beautiful Research Data (Structured Data and Open Refine)

The “blank” fields actually contained a single whitespace character. You can delete the whitespace and then select “Apply to All Identical Cells”.
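
The same idea outside OpenRefine, as a hedged Python sketch: the function name, the list-of-dicts layout, and the sample records are assumptions made for illustration. Note that it strips all surrounding whitespace in the column, which goes slightly beyond deleting a single space.

```python
def clean_blank_cells(rows, column):
    """Strip surrounding whitespace in one column and report which rows end up blank.

    `rows` is a list of dicts (one per record); it is modified in place.
    Returns the indexes of rows whose cell is empty after stripping.
    """
    blank_indexes = []
    for index, row in enumerate(rows):
        value = (row.get(column) or "").strip()
        row[column] = value
        if value == "":
            blank_indexes.append(index)
    return blank_indexes


# Made-up records: the second "Record ID" holds a single space, not a true blank.
records = [{"Record ID": "101"}, {"Record ID": " "}, {"Record ID": "102 "}]
print(clean_blank_cells(records, "Record ID"))
```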

Page 36: Beautiful Research Data (Structured Data and Open Refine)

A confirmation message will always show up noting what you’ve done, and giving you a chance to “undo”

Page 37: Beautiful Research Data (Structured Data and Open Refine)

2. Look for Duplicate Records using Record ID (since it should be unique)

Page 38: Beautiful Research Data (Structured Data and Open Refine)

Sorting is a visual tool only unless you “Reorder rows permanently”

Page 39: Beautiful Research Data (Structured Data and Open Refine)

After sorting, “Blank down” blanks out each “Record ID” cell that repeats the value in the row above it, so only the first instance of a duplicated ID keeps its value.

Page 40: Beautiful Research Data (Structured Data and Open Refine)

Then, we can facet the “Record ID” column by blank records.

Page 41: Beautiful Research Data (Structured Data and Open Refine)

The “true” facet contains all the blank records.

Page 42: Beautiful Research Data (Structured Data and Open Refine)

Clicking the “true” link will narrow to the blank records, which can then be removed.
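
For comparison, the end result of these steps (one row kept per Record ID) can be sketched in Python. This is not how OpenRefine works internally; the function name and sample rows are hypothetical.

```python
def drop_duplicate_records(rows, key="Record ID"):
    """Keep the first row for each key value and drop later repeats.

    Unlike the sort / "Blank down" / facet sequence, this does not need
    the rows to be sorted first; it remembers which IDs it has seen.
    """
    seen = set()
    unique_rows = []
    for row in rows:
        record_id = row[key]
        if record_id in seen:
            continue  # a later instance of an ID we already kept
        seen.add(record_id)
        unique_rows.append(row)
    return unique_rows


# Made-up rows: Record ID "7" appears twice.
rows = [
    {"Record ID": "7", "Title": "First copy"},
    {"Record ID": "8", "Title": "Another record"},
    {"Record ID": "7", "Title": "Second copy"},
]
print(len(drop_duplicate_records(rows)))
```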

Page 43: Beautiful Research Data (Structured Data and Open Refine)

3. Make data atomistic

“Categories” contains multiple categories in a single cell, separated by the “|” character

Page 44: Beautiful Research Data (Structured Data and Open Refine)

You can tell the system to split the cells using this character.

Page 45: Beautiful Research Data (Structured Data and Open Refine)

Now only single categories appear.
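
A rough Python equivalent of splitting multi-valued cells is sketched below. It is a simplification made for illustration: the function name and sample row are hypothetical, and here the other columns are copied onto every new row, which may differ from how OpenRefine lays out the resulting rows.

```python
def split_multi_valued(rows, column, separator="|"):
    """Return new rows with one value per row for the given column.

    Each input row is repeated once per separated value; other columns are copied.
    Empty pieces (e.g. from a trailing separator) are skipped, and a row whose
    cell has no values at all is kept unchanged.
    """
    result = []
    for row in rows:
        values = [v.strip() for v in row[column].split(separator) if v.strip()]
        if not values:
            result.append(dict(row))
            continue
        for value in values:
            new_row = dict(row)
            new_row[column] = value
            result.append(new_row)
    return result


# Made-up record with three categories packed into one cell.
packed = [{"Record ID": "7", "Categories": "Glass|Photographs|Negatives"}]
for row in split_multi_valued(packed, "Categories"):
    print(row["Record ID"], row["Categories"])
```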

Page 46: Beautiful Research Data (Structured Data and Open Refine)

4. Make terms consistent

Creating a text facet on “Categories” brings up all the options in this column.

We can “cluster” to detect similar terms that might have variances in spelling or capitalization.

Page 47: Beautiful Research Data (Structured Data and Open Refine)

This interface allows you to select which term is authoritative. You can then merge terms together.
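
To give a sense of how clustering can find such variants, here is a simplified Python sketch of one approach, key collision on a "fingerprint" of each term. It is an approximation written for illustration, not OpenRefine's actual implementation, and the sample category values are made up.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Normalize a term so that trivial variants produce the same key."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster_terms(terms):
    """Group distinct terms whose fingerprints collide; return only real clusters."""
    groups = defaultdict(set)
    for term in terms:
        groups[fingerprint(term)].add(term)
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]


# Made-up category values with capitalization, spacing and punctuation variants.
categories = ["Glass plates", "glass plates ", "Plates, glass", "Photographs"]
print(cluster_terms(categories))
```

Each cluster is only a suggestion: a person still decides which spelling is authoritative before merging.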

Page 48: Beautiful Research Data (Structured Data and Open Refine)

A couple of additional features...

The “Undo/Redo” tab lets you step back, one change at a time, as far as the creation of your project if you make a mistake.

Page 49: Beautiful Research Data (Structured Data and Open Refine)

A “text filter” lets you search within a column (by regular expression too!)

Page 50: Beautiful Research Data (Structured Data and Open Refine)

Refine has its own expression language, GREL, whose functions can be used to transform data.

Page 51: Beautiful Research Data (Structured Data and Open Refine)

A full list of these functions is available on GitHub:

https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

Page 52: Beautiful Research Data (Structured Data and Open Refine)

Finally, projects can be exported as Refine projects, but also in a number of additional structured formats.

Do this frequently.

Page 53: Beautiful Research Data (Structured Data and Open Refine)

Structured data is beautiful data. Make a plan to create structured data during your research.

Clean legacy data, or data you inherit, by becoming a regular expression (regex) expert and/or using a tool like OpenRefine.

Page 54: Beautiful Research Data (Structured Data and Open Refine)

Go to your library or ITS department to see if you can get support. Thanks for listening to me!