session 05 cleaning and exploring

36
Cleaning and Exploring Data Datascience session 5

Upload: bodaceacat

Post on 11-Apr-2017

141 views

Category:

Data & Analytics


1 download

TRANSCRIPT

Page 1: Session 05 cleaning and exploring

Cleaning and Exploring Data

Datascience session 5

Page 2: Session 05 cleaning and exploring

Lab 5: your 5-7 things

Data cleaning

Basic data cleaning with Python

Using OpenRefine

Exploring Data

The Pandas library

The Seaborn library

The R language

Page 3: Session 05 cleaning and exploring

Data Cleaning

Page 4: Session 05 cleaning and exploring

Algorithms want their data to be:

Machine-readable

Consistent format (e.g. text is all lowercase)

Consistent labels (e.g. use M/F, Male/Female, 0/1/2, but not *all* of these)

No whitespace hiding in number or text cells

No junk characters

No strange outliers (e.g. 200 year old living people)

In vectors and matrices

Normalised

Page 5: Session 05 cleaning and exploring

Cleaning with Python

Page 6: Session 05 cleaning and exploring

Cleaning Strings

Removing capitals and whitespace:

mystring = " CApiTalIsaTion Sucks "

mystring.lower().strip()

original text is - CApiTalIsaTion Sucks -lowercased text is - capitalisation sucks -Text without whitespace is -capitalisation sucks-

Page 7: Session 05 cleaning and exploring

Regular Expressions: repeated spaces

There’s a repeated space in capitalisation sucks

import re

re.sub(r'\s', '.', 'this is a string')re.sub(r'\s+', '.', 'this is a string')

'this.is..a.string''this.is.a.string'

Page 8: Session 05 cleaning and exploring

Regular Expressions: junk

import re

string1 = “This is a! sentence&& with junk!@“

cleanstring1 = re.sub(r'[^\w ]', '', string1)

This is a sentence with junk

Page 9: Session 05 cleaning and exploring

Converting Date/Times

European vs American? Name of month vs number? Python comes with a bunch of date reformatting libraries that can convert between these. For example:

import datetime

date_string = “14/03/48"

datetime.datetime.strptime(date_string, ‘%m/%d/%y').strftime('%m/%d/%Y')

Page 10: Session 05 cleaning and exploring

Cleaning with Open Refine

Page 11: Session 05 cleaning and exploring

Our input file

Page 12: Session 05 cleaning and exploring

Getting started

Page 13: Session 05 cleaning and exploring

Inputting data

Page 14: Session 05 cleaning and exploring

Cleaning up the import

Page 15: Session 05 cleaning and exploring

The imported data

Page 16: Session 05 cleaning and exploring

Cleaning up columns

Page 17: Session 05 cleaning and exploring

Facets

Page 18: Session 05 cleaning and exploring

Exploring Data

Page 19: Session 05 cleaning and exploring

Exploring Data

Eyeball your data

Plot your data - visually look for trends and outliers

Get the basics statistics (mean, sd etc) of your data

Create pivot tables to help understand how columns interact

Do more cleaning if you need to (e.g. those outliers)

Page 20: Session 05 cleaning and exploring

Exploring with Pandas

Page 21: Session 05 cleaning and exploring

Reading in data files with Pandas

read_csv

read_excel

read_sql

read_json

read_html

read_stata

read_clipboard

import pandas as pddf = pd.read_stata('example_data/AG_SEC12A.dta')

Page 22: Session 05 cleaning and exploring

Eyeballing rows

How many rows are there in this dataset?

len(df)

What do my data rows look like?

df.head(5)

df.tail()

df[10:20]

Page 23: Session 05 cleaning and exploring

Eyeballing columns

What’s in these columns?

df[‘sourceid’]

df[[‘sourceid’,’ag12a_01','ag12a_02_2']]

What’s in the columns when these are true?

df[df.ag12a_01 == ‘YES’]

df[(df.ag12a_01 == 'YES') & (df.ag12a_02_1 == 'NO')]

Page 24: Session 05 cleaning and exploring

Summarising columns

What are my column names and types?

df.columns

df.dtypes

Which labels do I have in this column?

df['ag12a_03'].unique()

df['ag12a_03'].value_counts()

What are my columns’ mean, standard deviation etc?

df.describe

Page 25: Session 05 cleaning and exploring

Pivot Tables: Combining data from one dataframe

● pd.pivot_table(df, index=[‘sourceid’, ‘ag12a_03’])

Page 26: Session 05 cleaning and exploring

Merge: Combining data from multiple frameslongnames = pd.DataFrame({ 'country' : pd.Series(['United States of America', 'Zaire', 'Egypt']), 'longname' : pd.Series([True, True, False])})

merged_data = pd.merge( left=popstats, right=longnames, left_on='Country/territory of residence', right_on='country')merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]

Page 27: Session 05 cleaning and exploring

Left Joins: Keep everything from the left table… longnames = pd.DataFrame({ 'country' : pd.Series(['United States of America', 'Zaire', 'Egypt']), 'longname' : pd.Series([True, True, False])})

merged_data = pd.merge( left=popstats, right=longnames, how='left', left_on='Country/territory of residence', right_on='country')merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]

Page 28: Session 05 cleaning and exploring

Normalising

Use pd.stack()

Page 29: Session 05 cleaning and exploring

The Seaborn Library

Page 30: Session 05 cleaning and exploring

The Iris dataset

import seaborn as sns

iris = sns.load_dataset('iris')

Page 31: Session 05 cleaning and exploring

Visualising Iris data with Seabornsns.pairplot(iris, hue='species', size=2)

Page 32: Session 05 cleaning and exploring

Exploring with R

Page 33: Session 05 cleaning and exploring

R

Matrix analysis (similar to Pandas)

Good at:

Rapid statistical analysis (4000+ R libraries)

Rapidly-created static graphics

Not so good at:

Non-statistical things (e.g. GIS data analysis)

Page 34: Session 05 cleaning and exploring

Running R code

● Running R files:

○ From the terminal window: “R <myscript.r —no-save”

○ From inside another R program: source('myscript.r')

● Writing your own R code:

○ iPython notebooks: create “R” notebook (instead of python3)

○ Terminal window: type “r” (and “q()” to quit)

○ Rstudio: click on Rstudio tool

Page 35: Session 05 cleaning and exploring

Exercises

Page 36: Session 05 cleaning and exploring

Code

Try running the Python and R code in the 5.x set of notebooks