introduction to data cleaning with spreadsheets

Post on 21-Aug-2014

DESCRIPTION

Presented at School of Data training conducted in collaboration with the Open Data PH Taskforce in the Philippines, May 2014.

TRANSCRIPT

An Introduction to data cleaning with spreadsheets

Anders Pedersen, @anpe

School of Data

Spreadsheets: The beginning of each and every data story

• Which were the top growth sectors in this quarter?

• How did crime in the capital region in 2013 compare to 2012?

• Is there a housing bubble waiting around the corner?

It is time for journalists themselves to tame this beast called spreadsheets!

Spreadsheets: Excel or Google Docs

Some basic terminology

• Data is organized in rows and columns (rows go across the page, columns go top down)

• Each field holding data is called a cell

• Rows are numbered

• Columns are referred to by letters

• Each cell has a column and a row, or a specific code (e.g. A1 is the top left cell)
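As a small illustration of the cell codes described above, the following Python sketch splits a reference such as A1 into its column letters and row number. The function name parse_cell_reference is made up for this example and is not part of any spreadsheet tool.

```python
import re

def parse_cell_reference(ref):
    """Split a reference such as 'A1' into (column_letters, row_number, column_index).

    Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    letters = match.group(1).upper()
    row = int(match.group(2))
    # Column letters count like a base-26 number without a zero: A=1 ... Z=26, AA=27.
    column_index = 0
    for letter in letters:
        column_index = column_index * 26 + (ord(letter) - ord("A") + 1)
    return letters, row, column_index

print(parse_cell_reference("A1"))  # top left cell: column A, row 1
print(parse_cell_reference("H2"))  # column H is the 8th column
```

Rows are plain numbers, while columns use letters, which is why the column part needs the extra conversion step.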

Some key features to explore today

• Sorting and filtering

• Basic formulas

• Pivot tables

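Tricky bits:

- Don’t include summary rows (such as totals) in the data you feed to a pivot table

- Pivot tables may not pick up changes you make to your data afterwards, so check that the selected range is still right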
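To see why summary rows are a problem, here is a minimal Python sketch with invented numbers. If a "Total" line stays in the data, any further summing counts every value twice.

```python
# Hypothetical rows: (region, enrollment). The last row is a summary line
# that someone added at the bottom of the sheet.
rows = [
    ("Region A", 100),
    ("Region B", 250),
    ("Total", 350),
]

def grand_total(data):
    """Add up the value column, the way a pivot table's grand total would."""
    return sum(value for _, value in data)

with_summary = grand_total(rows)
without_summary = grand_total([row for row in rows if row[0] != "Total"])

print(with_summary)     # 700: the real data plus the summary row
print(without_summary)  # 350: the real data only
```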

Data sources for exercise

• Education: Secondary school enrollment for 2012 from Data.gov.ph http://data.gov.ph/catalogue/dataset/sy-2012-enrollment-data-secondary

Sorting - finding the best and the worst

• The 10 best paid sectors

• The 10 oldest cities

• The 10 poorest countries

• …

• If Excel is a toolbox for journalists, sorting is the hammer!

How to sort

• 1) Mark all your data

• 2) In the Data tab, go to Sort range

• 3) Check the "Data has header row" check box

• 4) Select the column you want to sort by
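The sorting steps above can be sketched in Python as well. This is only an illustration: the school names and numbers are invented, and sort_rows is a hypothetical helper, not a spreadsheet function.

```python
import csv
import io

# Hypothetical data, invented for illustration. The first line is the header row.
raw = """school,enrollment
School A,420
School B,1310
School C,75
School D,980
"""

# DictReader treats the first line as the header, like the header row check box.
rows = list(csv.DictReader(io.StringIO(raw)))

def sort_rows(rows, column, descending=True):
    """Return the rows ordered by a numeric column."""
    return sorted(rows, key=lambda row: float(row[column]), reverse=descending)

top_two = sort_rows(rows, "enrollment")[:2]
bottom_two = sort_rows(rows, "enrollment", descending=False)[:2]

print([row["school"] for row in top_two])     # largest enrollment first
print([row["school"] for row in bottom_two])  # smallest enrollment first
```

Taking the first few rows after sorting is all a "top 10" list is; sorting the other way gives the bottom of the table.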

Filtering - getting a better sense of your data

• 1) Turn on Filtering via the Data tab (Data → Filter)

• 2) Filter options now appear at the top

• 3) Now click on the blue triangular arrow

• 4) Select the section you wish to filter

• 5) A green arrow will now appear on top of the column
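What the filter does behind the scenes can be sketched in a few lines of Python. The rows and the helper names filter_options and filter_rows are assumptions made up for this example.

```python
# Hypothetical rows, invented for illustration.
rows = [
    {"region": "Region A", "school": "School A", "enrollment": 420},
    {"region": "Region B", "school": "School B", "enrollment": 1310},
    {"region": "Region A", "school": "School C", "enrollment": 75},
]

def filter_options(rows, column):
    """List the distinct values in a column, like the options shown in the filter menu."""
    return sorted({row[column] for row in rows})

def filter_rows(rows, column, keep):
    """Keep only the rows whose value in `column` is one of the selected values."""
    keep = set(keep)
    return [row for row in rows if row[column] in keep]

print(filter_options(rows, "region"))
region_a = filter_rows(rows, "region", ["Region A"])
print([row["school"] for row in region_a])
```

Filtering hides rows rather than deleting them, so the original list stays untouched here too.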

Moving forward!

• Sorting and filtering - check!

• Basic formulas

• Pivot tables

Basic formulas

• Let us now try to sum up some of the values in the dataset…

• What is it good for: when you do analysis and when you need to check if calculations by your colleagues are right

• Go to column H: in the second row (cell H2), type “=sum(f2:g2)”

• We now have a sum

• Now try calculating the average of the same cells instead: “=average(f2:g2)”
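For readers who want to double-check the sum and average formulas outside the spreadsheet, here is a minimal Python sketch. The values standing in for cells F2 and G2, and the "reported" total, are invented for illustration.

```python
from statistics import mean

# Hypothetical values for cells F2 and G2.
f2, g2 = 120, 80

total = sum([f2, g2])     # what a sum over F2 and G2 adds up to
average = mean([f2, g2])  # what "=average(f2:g2)" works out to

# Checking a colleague's calculation: compare their figure with your own.
reported_total = 210      # hypothetical figure from a colleague
matches = (reported_total == total)

print(total, average, matches)
```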

• You can also copy your calculations across cells
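When a formula is copied down a column, the row numbers in its cell references shift along with it. The Python sketch below imitates that behaviour in a simplified way. The function fill_down is hypothetical, and it would mishandle function names that end in a digit.

```python
import re

def fill_down(formula, rows_down):
    """Shift the row number of each cell reference in a formula by rows_down.

    Simplified illustration: any run of letters followed directly by digits
    is treated as a cell reference.
    """
    def shift(match):
        return f"{match.group(1)}{int(match.group(2)) + rows_down}"
    return re.sub(r"([A-Za-z]+)(\d+)", shift, formula)

# Copying the formula from row 2 into rows 3 and 4.
for offset in (1, 2):
    print(fill_down("=sum(f2:g2)", offset))
```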

Now only Pivot tables to go

• Sorting and filtering - check!

• Basic formulas - check!

• Pivot tables

Pivot tables

• Finding stories inside datasets

• Particularly well suited to organised datasets with clear categories and sub-categories

• Mark the full area of the dataset

• Go to Data → Pivot table report

• Pivot tables allow you to work with rows, columns, values and filters

• We start by dropping a column header into Rows

• Then we drop one of our value columns into Values

• We now have a nice summary of the budget for each department
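What the pivot table has just done can be sketched in Python: group the rows by one column and add up another. The budget lines below are invented, and pivot_sum is a hypothetical helper, not the spreadsheet's own implementation.

```python
from collections import defaultdict

# Hypothetical budget lines, invented for illustration.
rows = [
    {"department": "Education", "line": "Schools", "budget": 500},
    {"department": "Education", "line": "Textbooks", "budget": 120},
    {"department": "Health", "line": "Hospitals", "budget": 300},
    {"department": "Health", "line": "Vaccines", "budget": 80},
]

def pivot_sum(rows, row_field, value_field):
    """Group rows by `row_field` (the Rows box) and sum `value_field` (the Values box)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[row_field]] += row[value_field]
    return dict(totals)

summary = pivot_sum(rows, "department", "budget")
print(summary)
```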

Filtering pivot tables

• We can now go ahead and filter the pivot table

• Add the column you wish to filter by

• Then select one or more categories within the column you wish to keep
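A filtered pivot table can be thought of as "drop the rows you do not want, then summarise the rest". The following Python sketch shows that order of operations with invented data. The names filtered_pivot and region are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical budget lines, invented for illustration.
rows = [
    {"department": "Education", "region": "North", "budget": 500},
    {"department": "Education", "region": "South", "budget": 120},
    {"department": "Health", "region": "North", "budget": 300},
    {"department": "Health", "region": "South", "budget": 80},
]

def filtered_pivot(rows, row_field, value_field, filter_field, keep):
    """Sum `value_field` per `row_field`, using only rows whose `filter_field` is in `keep`."""
    keep = set(keep)
    totals = defaultdict(int)
    for row in rows:
        if row[filter_field] in keep:  # the filter is applied before summing
            totals[row[row_field]] += row[value_field]
    return dict(totals)

north_only = filtered_pivot(rows, "department", "budget", "region", ["North"])
print(north_only)
```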

Pivot tables

• We can finally add several value columns to the pivot table
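Adding several value columns simply means keeping one running total per column for each group. Here is a minimal Python sketch, assuming two invented columns named budget_2013 and budget_2014 and a hypothetical helper pivot_many.

```python
# Hypothetical budget lines with two value columns, invented for illustration.
rows = [
    {"department": "Education", "budget_2013": 450, "budget_2014": 500},
    {"department": "Education", "budget_2013": 100, "budget_2014": 120},
    {"department": "Health", "budget_2013": 310, "budget_2014": 300},
]

def pivot_many(rows, row_field, value_fields):
    """Sum several value columns side by side for each group in `row_field`."""
    summary = {}
    for row in rows:
        # Start every new group with a zero total for each value column.
        totals = summary.setdefault(row[row_field], dict.fromkeys(value_fields, 0))
        for field in value_fields:
            totals[field] += row[field]
    return summary

by_department = pivot_many(rows, "department", ["budget_2013", "budget_2014"])
print(by_department)
```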

Exercises

• Find the sectors of the national budget that grew the most in percentage terms

• Identify the budget lines that had the biggest absolute increase in the budget

• Generate a pivot table based on the national budget comparing 2014 and 2013 in specific sectors
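The first two exercises rest on two different calculations: absolute change (new minus old) and percentage change (the change divided by the old value, times 100). The Python sketch below shows both on invented figures, not on the real national budget. The sector names and amounts are assumptions.

```python
# Hypothetical sector budgets as (2013, 2014) pairs, invented for illustration.
sectors = {
    "Education": (200, 260),
    "Health": (150, 165),
    "Defense": (400, 480),
}

def absolute_change(old, new):
    """How much the amount grew, in the same unit as the budget."""
    return new - old

def percent_change(old, new):
    """Growth relative to the starting amount, in percent."""
    if old == 0:
        raise ValueError("percentage change is undefined when the old value is 0")
    return (new - old) * 100 / old

biggest_absolute = max(sectors, key=lambda name: absolute_change(*sectors[name]))
biggest_percent = max(sectors, key=lambda name: percent_change(*sectors[name]))

print(biggest_absolute, biggest_percent)
```

In this made-up example the two rankings disagree: the largest sector gains the most money, while a smaller one grows fastest in relative terms. That difference is often where the story is.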
