Processing Big Data - ULisboa (users.isr.tecnico.ulisboa.pt/~csoares/bigdatadocs/eda.pdf)
Processing Big Data: Exploratory Data Analysis
Exploratory Data Analysis
What it is
Any method of looking at data that does not include formal statistical modeling and inference
NYC Open Data: Emergency Response Incidents
Why it matters
Confirmatory statistical analyses are based on models.
$y = Ax + \mathcal{N}(0, \sigma^2)$
Structural component
Random component
Signal (structural component) and noise (random component)
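A minimal sketch of the model above, assuming a scalar A and made-up parameter values (the names `simulate`, `a`, `sigma` and `seed` are illustrative, not from the slides). It draws a few observations and keeps the structural and random parts separate, so the signal and the noise can be inspected on their own.

```python
import random

def simulate(a, xs, sigma, seed=0):
    """Draw y = a*x + noise for each x, with noise ~ N(0, sigma^2)."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    structural = [a * x for x in xs]               # signal: A x
    noise = [rng.gauss(0.0, sigma) for _ in xs]    # random component
    ys = [s + n for s, n in zip(structural, noise)]
    return structural, noise, ys

# Hypothetical inputs, chosen only for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
structural, noise, ys = simulate(a=2.0, xs=xs, sigma=0.5)
for x, s, n, y in zip(xs, structural, noise, ys):
    print(f"x={x:.1f}  signal={s:.2f}  noise={n:+.2f}  y={y:.2f}")
```

With `sigma=0` the observations coincide with the structural component; the larger `sigma` is, the harder the signal is to see in the raw values.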
Models are not perfect representations of the real world.
But some are close enough to be useful!
https://commons.wikimedia.org/w/index.php?curid=521370
What is close enough to reality?
Statistical inference always depends on model assumptions about the data.
Use EDA for:
• Detecting data noise
• Checking assumptions
• Selecting data models
• Determining relationships between the explanatory variables
• Determining relationships between explanatory and outcome variables
Techniques
Look at the raw data
• What values do the variables take?
• How often do the variables take those values?
Compute summary statistics
• Look at the top and bottom of your data.
• How much data is missing?
• How noisy is the data?
Visualize
• Show comparisons
• Show structure
• Show multivariate data
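The first two techniques can be sketched with the Python standard library alone. The sample rows and column names below are hypothetical stand-ins for a real CSV file; the `profile` helper is an illustrative name, not something from the slides. It counts which values each variable takes, how often, and how many entries are missing, reading one row at a time.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a real CSV file.
SAMPLE = """agency,complaint_type,borough
NYPD,Noise,BROOKLYN
DOT,Pothole,
NYPD,Noise,QUEENS
DSNY,,BROOKLYN
"""

def profile(rows):
    """Count values and missing (empty) entries per column, row by row."""
    values = {}
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1
            else:
                values.setdefault(column, Counter())[value] += 1
    return values, missing

values, missing = profile(csv.DictReader(io.StringIO(SAMPLE)))
for column, counts in values.items():
    print(column, counts.most_common(3), "missing:", missing[column])
```

For a real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle, which `csv.DictReader` also reads lazily.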
311 Service Requests from 2010 to Present
Non-emergency City services and information
How to: Look at the raw data
Data sets are in general huge
Do not load a huge file into memory
Example: ~500M tweets/day x 140 B x 7 days = ~500 GB of RAM
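The estimate can be checked with a few lines of arithmetic. The figures are the ones on the slide; the variable names are illustrative. The product comes to 490 GB in decimal units, which the slide rounds to about 500 GB.

```python
tweets_per_day = 500_000_000   # ~500M tweets/day (from the slide)
bytes_per_tweet = 140          # 140 B per tweet (from the slide)
days = 7

total_bytes = tweets_per_day * bytes_per_tweet * days
print(total_bytes / 1e9, "GB")      # decimal gigabytes
print(total_bytes / 2**30, "GiB")   # binary gibibytes
```

This counts raw text only; any in-memory representation would add overhead on top of it.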
Old-school shell tools for raw data inspection
See more at https://pixorblog.wordpress.com/2016/06/24/csv-files-and-bash/
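The slide's own list of tools is not preserved in this transcript. As a hedged illustration of the same idea, here is a Python sketch of two operations that shell tools such as `head` and `wc -l` typically cover: peeking at the first lines and counting lines, both without loading the file into memory. The helper names and the throwaway demo file are assumptions made for the example.

```python
import itertools
import os
import tempfile

def head(path, n=5):
    """Return the first n lines without reading the rest of the file."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in itertools.islice(f, n)]

def count_lines(path):
    """Count lines by streaming, so memory use stays constant."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

# Small throwaway file standing in for a huge data set.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("id,value\n")
    for i in range(1000):
        tmp.write(f"{i},{i * i}\n")
    demo_path = tmp.name

first_lines = head(demo_path, 3)
total_lines = count_lines(demo_path)
print(first_lines)
print(total_lines)
os.remove(demo_path)   # clean up the demo file
```

Both helpers iterate over the file object, which yields one line at a time, so the same code works on files far larger than available RAM.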
Summary stats
Mean, mode, median
Variance
Range, max, min
Correlation
…
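A minimal sketch of these statistics using Python's `statistics` module, on made-up numbers. The `summarize` and `pearson` helpers are illustrative names; the correlation is computed from its definition rather than through a library call.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def summarize(xs):
    """Collect the basic summary statistics of one variable."""
    return {
        "mean": statistics.fmean(xs),
        "median": statistics.median(xs),
        "mode": statistics.mode(xs),
        "variance": statistics.variance(xs),   # sample variance (n - 1)
        "min": min(xs),
        "max": max(xs),
        "range": max(xs) - min(xs),
    }

# Made-up values, for illustration only.
a = [2, 4, 4, 5, 7, 9, 11]
b = [1, 3, 2, 5, 6, 8, 10]
summary = summarize(a)
print(summary)
print("correlation(a, b) =", round(pearson(a, b), 3))
```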
Beware of summary stats
Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician. 27 (1): 17–21. Figure from Wikipedia.
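The point of Anscombe's quartet can be reproduced numerically. The sketch below uses the four data sets from the 1973 paper and computes the mean of x, the mean of y, and the correlation for each; the helper name `pearson` and the `stats` dictionary are illustrative. The numbers come out nearly identical even though the four scatter plots look completely different.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Anscombe's quartet; data sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

stats = {}
for name, (xs, ys) in anscombe.items():
    stats[name] = (statistics.fmean(xs), statistics.fmean(ys), pearson(xs, ys))
    mean_x, mean_y, r = stats[name]
    print(f"{name:>3}: mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, r = {r:.3f}")
```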
https://www.autodeskresearch.com/publications/samestats
Data Visualization
Data points across some features
Features across all data points
Histograms
…
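A histogram needs no plotting library for a first look. This is a rough text-only sketch on made-up values, assuming fixed-width bins; the `histogram` helper and the bin width of 5 are illustrative choices.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin [k*w, (k+1)*w)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up measurements, for illustration only.
data = [1, 2, 2, 3, 5, 7, 8, 8, 9, 12, 13, 21]
width = 5
bins = histogram(data, bin_width=width)
for start, count in bins.items():
    print(f"{start:>3}-{start + width - 1:<3} {'#' * count}")
```

Empty bins (here 15 to 19) are simply absent from the result, so a gap in the printed bin labels is itself a hint about the shape of the data.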
Followup
Do you have the right question?
Refine your question
Do you have the right data?
Do you need other data? More data?
Collect data according to needs