data 100 · Ømanipulated using linear algebra ws fields/attributes/ features/columns. how are...

54
Data 100 Lecture 4: Data Cleaning & Exploratory Data Analysis Slides by: Joseph E. Gonzalez, Deb Nolan, Joe Hellerstein & Fernando Perez [email protected] [email protected] [email protected] [email protected] ?

Upload: others

Post on 20-Oct-2019

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Data 100Lecture 4: Data Cleaning &Exploratory Data Analysis

Slides by:

Joseph E. Gonzalez, Deb Nolan, Joe Hellerstein & Fernando Perez

[email protected]

[email protected]

[email protected]

[email protected]

?

Page 2: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Last Weekhttps://www.nbcnews.com/news/world/giant-pandas-are-no-longer-endangered-n643336

JupyterNotebooks

Page 3: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Pandas and Jupyter Notebooks

Ø Reviewed Jupyter Notebook Environment

Ø Introduced DataFrame conceptsØ Series: A named column of data with an indexØ Indexes: The mapping from keys to rowsØ DataFrame: collection of series with common index

Ø Dataframe access methodsØ Filtering on predicates and slicingØ df.loc: location by index labelØ df.iloc: location by integer addressØ groupby & pivot (we will review these again today)

Page 4: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Today

Page 5: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Congratulations!

You have collectedor been given a box of data?

What do you do next?

Page 6: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

?Question &Problem

Formulation

Data Acquisition

Exploratory Data Analysis

Predictionand

Inference

Page 7: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Data Acquisition

Exploratory Data Analysis

Topics For Lecture TodayØ Understanding the Data

Ø Data Cleaning Ø Exploratory Data Analysis (EDA)Ø Basic data visualization

Ø Common Data Anomalies Ø … and how to fix them

Page 8: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Exploratory DataAnalysis

Data Cleaning

… the infinite loop of data science.

Page 9: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Ø The process of transforming raw data to facilitate subsequent analysis

Ø Data cleaning often addressesØ structure / formattingØ missing or corrupted valuesØ unit conversionØ encoding text as numbersØ …

Ø Sadly data cleaning is a big part of data science…

Data Cleaning

Page 10: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

subsequent analysis

Ø Data cleaning often addressesØ structure / formattingØ missing or corrupted valuesØ unit conversionØ encoding text as numbersØ …

Ø Sadly data cleaning is a big part of data science…

Page 11: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Data Cleaning

… the infinite loop of data science.

Exploratory Data Analysis

Page 12: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

The process of transforming, visualizing, and summarizing data to:

Ø Build/confirm understanding of the data and its provenanceØ Identify and address potential issues in the dataØ Inform the subsequent analysisØ discover potential hypothesis … (be careful)

Ø EDA is an open ended analysisØ Be willing to find something surprising

Exploratory Data Analysis (EDA)

“Getting to know the data”

Page 13: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Data Analysis & Statistics, Tukey 1965Image from LIFE Magazine

John TukeyPrinceton Mathematician & Statistician

Introduced Ø Fast Fourier TransformØ “Bit” : binary digitØ Exploratory Data Analysis

Early Data Scientist

Page 14: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Data Analysis & Statistics, Tukey 1965Image from LIFE Magazine

EDA is like detective work

“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those that we believe to be there.”

Page 15: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

50 Years of Data Science D. Donoho, 2017

“More than 50 years ago, John Tukey called for a reformation of academic statistics. In ‘The Future of Data Analysis’, he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or ‘data analysis’…

50 Years of Data Science – D. Donohohttps://www.tandfonline.com/doi/abs/10.1080/10618600.2017.1384734

Page 16: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

What should we look for?

Page 17: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Key Data Properties to Consider in EDA

Ø Structure -- the “shape” of a data file

Ø Granularity -- how fine/coarse is each datum

Ø Scope -- how (in)complete is the data

Ø Temporality -- how is the data situated in time

Ø Faithfulness -- how well does the data capture “reality”

Page 18: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Key Data Properties to Consider in EDA

Ø Structure -- the “shape” of a data file

Ø Granularity -- how fine/coarse is each datum

Ø Scope -- how (in)complete is the data

Ø Temporality -- how is the data situated in time

Ø Faithfulness -- how well does the data capture “reality”

Page 19: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Rectangular DataWe prefer rectangular data for data analysis (why?)Ø Regular structures are easy manipulate and analyzeØ A big part of data cleaning is about

transforming data to be more rectangular

Two kinds of rectangular data: Tables and Matrices (what are the differences?)

1. Tables (a.k.a. data-frames in R/Python and relations in SQL)Ø Named columns with different typesØ Manipulated using data transformation languages (map, filter, group by, join, …)

2. MatricesØ Numeric data of the same typeØ Manipulated using linear algebra

Reco

rds/

Row

s

Fields/Attributes/Features/Columns

Page 20: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

How are these data files formatted?TSVTab separated values

CSVComma separated values

JSON

Which is the best?

Page 21: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Comma and Tab Separated Values FilesØ Tabular data where

Ø records are delimited by a newline: “\n”, “\r\n”Ø Fields are delimited by ‘,’ (comma) or ‘\t’ (tab)

Ø Very Common!

Ø Issues?Ø Commas, tabs

in recordsØ QuotingØ …

Page 22: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

JavaScript Object Notation (JSON)

Ø Widely used file format for nested dataØ Natural maps to python dictionaries (many tools for loading)Ø Strict formatting ”quoting” addresses some issues in CSV/TSV

Ø IssuesØ Each record can have different fieldsØ Nesting means records can contain records à complicated

Page 23: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

XML (another kind of nested data)<catalog>

<plant type='a'><common>Bloodroot</common><botanical>Sanguinaria canadensis</botanical><zone>4</zone><light>Mostly Shady</light><price>2.44</price><availability>03/15/2006</availability><description>

<color>white</color><petals>true</petals>

</description><indoor>true</indoor>

</plant>…</catalog>

Nested structure

We will study XML later in the class

Page 24: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)”

169.237.6.168 - - [8/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

Log data Is this a csv file? tsv?JSON/XML?

Page 25: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Data can be split across files and reference other data.

Page 26: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Structure: KeysØ Often data will reference other

pieces of data

Ø Primary key: the column or set of columns in a table that determine the values of the remaining columnsØ Primary keys are uniqueØ Examples: SSN, ProductIDs, …

Ø Foreign keys: the column or sets of columns that reference primary keys in other tables.

OrderNum ProdID Quantity1 42 31 999 22 42 1

OrderNum CustID Date1 171345 8/21/20172 281139 8/30/2017

ProdID Cost42 3.14999 2.72

Purchases.csv

Products.csv

Orders.csv

CustID Addr171345 Harmon.. 281139 Main ..

Customers.csv

Foreign Key

Primary Key

Page 27: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Merging/joining data across tables

Page 28: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Joining two tablesOrderNum ProdID Name1 42 Gum2 999 NullFood2 42 Towel

OrderId Cust Name Date1 Joe 8/21/20172 Arthur 8/14/2017

x

OrderNum ProdID Name OrderId Cust Name Date1 42 Gum 1 Joe 8/21/20171 42 Gum 2 Arthur 8/14/20172 999 NullFood 1 Joe 8/21/20172 999 NullFood 2 Arthur 8/14/20172 42 Towel 1 Joe 8/21/20172 42 Towel 2 Arthur 8/14/2017

Left “key” Right “key”

Drop rows that don’t match on the key

Page 29: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Joining two tablesOrderNum ProdID Name1 42 Gum2 999 NullFood2 42 Towel

OrderId Cust Name Date1 Joe 8/21/20172 Arthur 8/14/2017

x

OrderNum ProdID Name OrderId Cust Name Date1 42 Gum 1 Joe 8/21/20171 42 Gum 2 Arthur 8/14/20172 999 NullFood 1 Joe 8/21/20172 999 NullFood 2 Arthur 8/14/20172 42 Towel 1 Joe 8/21/20172 42 Towel 2 Arthur 8/14/2017

Left “key” Right “key”

Drop rows that don’t match on the key

OrderNum ProdID Name OrderId Cust Name Date1 42 Gum 1 Joe 8/21/20172 999 NullFood 2 Arthur 8/14/20172 42 Towel 2 Arthur 8/14/2017

Page 30: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Pandas Merge Demo

https://www.popsci.com/pandas-have-cute-markings-because-their-food-supply-sucks

Page 31: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Questions to ask about StructureØ Are the data in a standard format or encoding?

Ø Tabular data: CSV, TSV, Excel, SQLØ Nested data: JSON or XML

Ø Are the data organized in “records”?Ø No: Can we define records by parsing the data?

Ø Are the data nested? (records contained within records…)Ø Yes: Can we reasonably un-nest the data?

Ø Does the data reference other data?Ø Yes: can we join/merge the data

Ø What are the fields in each record?Ø How are they encoded? (e.g., strings, numbers, binary, dates …)Ø What is the type of the data?

Page 32: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Kinds of

Quantitative DataCategorical Data

Ordinal Nominal

Data

Examples:• Price• Quantity• Temperature• Date• …

Numbers with meaning ratios or intervals.

Examples:• Preferences• Level of education• …

Examples:• Political Affiliation• Product Type• Cal Id• …

Categories with orders but no consistent meaning if

magnitudes or intervals

Categories with no specific ordering.

Note that data categorical data can also be numbers and quantitative data

may be stored as strings.

Page 33: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Structure: Field TypesØ Quantitative Data: data with meaningful differences or ratios

Ø Continuous: weight, temperature, volumeØ Discrete: counts, …Ø Visualization: histograms and box plots

Ø Ordinal Data: data where relative order mattersØ Differences between entries may not be the sameØ Examples:

Ø level of education: [BS, MS, PhD]Ø Preferences: [Dislike, Like, Must Have]

Ø Visualization: Bar charts (sorted)

Ø Nominal Data: data with no numerical meaningØ Examples: names, political affiliation, eye color, Ø It may be encoded as numbers …Ø Visualization: Bar charts

Page 34: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

QuizØPrice in dollars of a product?

Ø (A) Quantitative, (B) Ordinal, (C) Nominal

Ø Star Rating on Yelp?Ø (A) Quantitative, (B) Ordinal, (C) Nominal

ØDate an item was sold?Ø (A) Quantitative, (B) Ordinal, (C) Nominal

ØWhat is your Credit Card Number?Ø (A) Quantitative, (B) Ordinal, (C) Nominal

http://bit.ly/ds100-fa18-eda

Page 35: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

QuizØPrice in dollars of a product?

Ø (A) Quantitative, (B) Ordinal, (C) Nominal

Ø Star Rating on Yelp?Ø (A) Quantitative, (B) Ordinal, (C) Nominal

ØDate an item was sold?Ø (A) Quantitative, (B) Ordinal, (C) Nominal

ØWhat is your Credit Card Number?Ø (A) Quantitative, (B) Ordinal, (C) Nominal

http://bit.ly/ds100-fa18-eda

Page 36: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Key Data Properties to Consider in EDA

Ø Structure -- the “shape” of a data file

Ø Granularity -- how fine/coarse is each datum

Ø Scope -- how (in)complete is the data

Ø Temporality -- how is the data situated in time

Ø Faithfulness -- how well does the data capture “reality”

Page 37: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Key Data Properties to Consider in EDA

Ø Structure -- the “shape” of a data file

Ø Granularity -- how fine/coarse is each datum

Ø Scope -- how (in)complete is the data

Ø Temporality -- how is the data situated in time

Ø Faithfulness -- how well does the data capture “reality”

Page 38: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

GranularityØ What does each record represent?

Ø Examples: a purchase, a person, a group of users

Ø Do all records capture granularity at the same level?Ø Some data will include summaries as records

Ø If the data are coarse how was it aggregated?Ø Sampling, averaging, …

Ø What kinds of aggregation is possible/desirable? Ø From individual people to demographic groups? Ø From individual events to totals across time or regions?Ø Hierarchies (city/county/state, second/minute/hour/days)

Ø Understanding and manipulating granularity can help reveal patterns.

Page 39: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Granularity and KeysØ The primary key defines what the

record represents à Granularity

Ø What is the granularity of theseexample tables?Ø Purchases.csv: PK=(OrderNum + ProdID)

è Each Item in an orderØ Orders.csv: PK = OrderNum à an order

Ø How might we adjust the granularity?Ø Aggregation: count, mean, median, var,

groupby, pivot …

OrderNum ProdID Quantity1 42 31 999 22 42 1

OrderNum CustID Date1 171345 8/21/20172 281139 8/30/2017

ProdID Cost42 3.14999 2.72

Purchases.csv

Products.csv

Orders.csv

CustID Addr171345 Harmon.. 281139 Main ..

Customers.csv

Page 40: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Reviewing Group By and Pivot

Page 41: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Manipulating Granularity: Group By

A 3

B 1

C 4

A 1

B 5

C 9

B 6

C 5

Key Data

A 2

A 3

B 1

C 4

A 1

B 5

C 9

B 6

C 5

A 2

Page 42: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Manipulating Granularity: Group ByKey Data

A 3A 1A 2

A 3

B 1

C 4

A 1

B 5

C 9

B 6

C 5

A 2

B 1

C 4

B 5

C 9

B 6

C 5

Page 43: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Manipulating Granularity: Group ByKey Data

A 3A 1A 2

A 3

B 1

C 4

A 1

B 5

C 9

B 6

C 5

A 2

B 1

C 4

B 5

C 9

B 6

C 5

Split intoGroups

Page 44: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Manipulating Granularity: Group ByKey Data

A 3A 1A 2

A 3

B 1

C 4

A 1

B 5

C 9

B 6

C 5

A 2

B 1

C 4

B 5

C 9

B 6

C 5

Split intoGroups

AggregateFunction

AggregateFunction

AggregateFunction

A 6

B 12

C 18

A 6

B 12

C 18

Page 45: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Manipulating Granularity: Group ByKey Data

A 3A 1A 2

A 3

B 1

C 4

A 1

B 5

C 9

B 6

C 5

A 2

B 1

C 4

B 5

C 9

B 6

C 5

Split intoGroups

AggregateFunction

AggregateFunction

AggregateFunction

A 6

B 12C 18

A 6B 12

C 18

MergeResults

Page 46: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Manipulating Granularity: Pivot Key

R Data

B 1V

C 4U

A 1V

B 5U

C 9V

A 2U

B 6V

D 5U

KeyC

A 3U

B 1V

C 4U

A 1V

B 5U

C 9V

A 2U

B 6V

D 5U

A 3U

Page 47: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Manipulating Granularity: Pivot Key

R Data

B 1V

C 4U

A 1V

B 5U

C 9VA 2U

B 6V

D 5U

KeyC

A 3U

C 4U

A 1V

B 5U

C 9V

B 1VB 6V

D 5U

A 2UA 3U

Split intoGroups

AggregateFunction A 5U

AggregateFunction A 1V

AggregateFunction B 5U

AggregateFunction B 7V

AggregateFunction C 4U

AggregateFunction C 9V

AggregateFunction D 5U

Page 48: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Manipulating Granularity: Pivot Aggregate

Function A 5U

AggregateFunction A 1V

AggregateFunction B 5U

AggregateFunction B 7V

AggregateFunction C 4U

AggregateFunction C 9V

AggregateFunction D 5U

A 5U

A 1V

B 5U

B 7V

C 4U

C 9V

D 5U

Page 49: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Manipulating Granularity: Pivot Aggregate

Function A 5U

AggregateFunction A 1V

AggregateFunction B 5U

AggregateFunction B 7V

AggregateFunction C 4U

AggregateFunction C 9V

AggregateFunction D 5U

A 5

U

A 1

V

B 5

U

7VC 4

U9

VD 5

U

V

Need to address missing values

Page 50: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Demohttp://abcnews.go.com/Lifestyle/silly-baby-panda-falls-flat-face-public-debut/story?id=42481478

Page 51: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Key Data Properties to Consider in EDA

Ø Structure -- the “shape” of a data file

Ø Granularity -- how fine/coarse is each datum

Ø Scope -- how (in)complete is the data

Ø Temporality -- how is the data situated in time

Ø Faithfulness -- how well does the data capture “reality”

Page 52: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Key Data Properties to Consider in EDA

Ø Structure -- the “shape” of a data file

Ø Granularity -- how fine/coarse is each datum

Ø Scope -- how (in)complete is the data

Ø Temporality -- how is the data situated in time

Ø Faithfulness -- how well does the data capture “reality”

Page 53: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

Scope

Ø Does my data cover my area of interest?Ø Example: I am interested in studying crime in California but I

only have Berkeley crime data.

Ø Is my data too expansive?Ø Example: I am interested in student grades for DS100 but have

student grades for all statistics classes.Ø Solution: Filtering à Implications on sample?

Ø If the data is a sample I may have poor coverage after filtering …

Ø Does my data cover the right time frame?Ø More on this in temporality …

Page 54: Data 100 · ØManipulated using linear algebra ws Fields/Attributes/ Features/Columns. How are these data files formatted? TSV Tab separated values CSV Comma separated values JSON

To be continued … In the next lecture