Data Science: Not Just for Big Data
DESCRIPTION
From the webinar presentation "Data Science: Not Just for Big Data", hosted by Kalido and presented by David Smith, Data Scientist at Revolution Analytics, and Gregory Piatetsky, Editor, KDnuggets. These are the slides for David Smith's portion of the presentation. Watch the full webinar at: http://www.kalido.com/data-science.htm
TRANSCRIPT
Revolution Confidential
Data Science: Not just for big data!
David Smith
Revolution Analytics
@revodavid
October 16, 2013
Big Data: the new oil?
Photo: Sarah&Boston (flickr: pocheco) Creative Commons BY-SA 2.0
Big Data is just raw material
Data Distillation
  Extract quantities of interest
  Find complete cases
  Derive missing information
Big Data Pitfalls
  Data cleanliness & accuracy
  Observational bias: Do the data I have represent the population I’m interested in?
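The distillation steps on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the records, the field names, and the rule `total = price * quantity` are all hypothetical assumptions chosen to make the steps concrete.

```python
# Hypothetical raw records (assumed for illustration); None marks a missing value.
records = [
    {"id": 1, "price": 10.0, "quantity": 3, "total": 30.0},
    {"id": 2, "price": 4.0, "quantity": None, "total": 20.0},
    {"id": 3, "price": None, "quantity": 2, "total": None},
    {"id": 4, "price": 2.5, "quantity": 4, "total": None},
]

FIELDS = ("price", "quantity", "total")


def derive_missing(rec):
    """Fill one missing field when the other two determine it.

    Assumes the (hypothetical) relationship total = price * quantity.
    """
    rec = dict(rec)  # work on a copy so the raw data stays untouched
    p, q, t = rec["price"], rec["quantity"], rec["total"]
    if t is None and p is not None and q is not None:
        rec["total"] = p * q
    elif q is None and p and t is not None:
        rec["quantity"] = t / p
    elif p is None and q and t is not None:
        rec["price"] = t / q
    return rec


def is_complete(rec):
    """A complete case has a value for every field of interest."""
    return all(rec[f] is not None for f in FIELDS)


# Derive missing information, keep complete cases, extract the quantity of interest.
derived = [derive_missing(r) for r in records]
complete = [r for r in derived if is_complete(r)]
totals = [r["total"] for r in complete]

print([r["id"] for r in complete], totals)
```

Record 3 cannot be repaired because two of its three fields are missing, so it drops out of the complete cases. Whether dropping it biases the result is exactly the observational-bias question the slide raises.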
Surveys & Experiments
Even with Big Data, the data you need isn’t always in the building!
… so ask (survey)!
  Survey design
  Stratified sampling
… or experiment!
  A/B Testing
  Experimental Design
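As a rough illustration of the stratified sampling mentioned above, the sketch below draws the same fraction from every stratum, so a small group is not lost by chance. The function name `stratified_sample`, the population, and the `region` field are hypothetical and not taken from the presentation.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from each stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        # Take at least one member so no stratum disappears from the sample.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 900 people in one region, 100 in another.
population = [
    {"id": i, "region": "north" if i < 900 else "south"} for i in range(1000)
]

sample = stratified_sample(population, key=lambda r: r["region"], fraction=0.1)
counts = Counter(row["region"] for row in sample)
print(counts)
```

A simple random sample of 100 would contain about 10 people from the smaller region on average, but the count would vary from draw to draw. Stratifying fixes it at exactly 10.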
Data Exploration & Visualization
Limited by pixels
  Big data = a big black blob
Extract signal from noise
  Aggregations
  Heat maps
  Smoothing
  Small multiples
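One way to picture the aggregation idea: rather than plotting 100,000 overlapping points, count them in a coarse grid and shade each cell by its count. The sketch below is a minimal text-only version. The synthetic data, the helper names `bin_counts` and `render`, and the assumption that all points lie in the unit square are illustrative choices, not part of the slides.

```python
import random


def bin_counts(points, n_bins, lo=0.0, hi=1.0):
    """Aggregate (x, y) points into an n_bins x n_bins grid of counts.

    Assumes every coordinate lies in [lo, hi].
    """
    width = (hi - lo) / n_bins
    grid = [[0] * n_bins for _ in range(n_bins)]
    for x, y in points:
        # min() keeps values equal to hi inside the last bin.
        col = min(int((x - lo) / width), n_bins - 1)
        row = min(int((y - lo) / width), n_bins - 1)
        grid[row][col] += 1
    return grid


def render(grid):
    """Draw the grid as text, with denser cells shown by darker characters."""
    shades = " .:*#"
    peak = max(max(row) for row in grid) or 1
    lines = []
    for row in reversed(grid):  # largest y values go on the top line
        lines.append(
            "".join(shades[cell * (len(shades) - 1) // peak] for cell in row)
        )
    return "\n".join(lines)


# Synthetic data (an assumption): y follows x plus noise, clipped to [0, 1].
rng = random.Random(1)
points = []
for _ in range(100_000):
    x = rng.random()
    y = min(max(x + rng.gauss(0, 0.1), 0.0), 1.0)
    points.append((x, y))

grid = bin_counts(points, n_bins=10)
print(render(grid))
```

The 100 cell counts carry the same broad pattern as the 100,000 raw points, and they fit in any number of pixels.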
Statistical Modeling & Forecasting
You don’t always need big data
  Sampling can help with observational bias
Model selection
  Feature extraction
  Confounding? Interactions?
Model validation
  Overfitting
Prediction
  Extrapolation
  Confidence
http://xkcd.com/605/
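To make the overfitting and validation points concrete, here is a small holdout check. It is only a sketch under stated assumptions: the data are synthetic (a sine curve plus noise), and k-nearest-neighbour averaging is a stand-in model picked for brevity, not a method named in the presentation.

```python
import math
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, data, k):
    """Mean squared error of the k-NN model on the given data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(42)


def make_data(n):
    """Synthetic data (an assumption): sine signal plus Gaussian noise."""
    out = []
    for _ in range(n):
        x = rng.random()
        out.append((x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.5)))
    return out


train, holdout = make_data(200), make_data(200)

# For each k, record (error on the training data, error on held-out data).
results = {k: (mse(train, train, k), mse(train, holdout, k)) for k in (1, 15)}
for k, (train_err, holdout_err) in results.items():
    print(f"k={k:2d}  train MSE={train_err:.3f}  holdout MSE={holdout_err:.3f}")
```

With k=1 the model memorizes the training set, so its training error is zero by construction. Only the holdout error shows how much of that fit is noise, which is why the error that matters is the one measured on data the model has not seen.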
Summary
Big Data is great, but think of it as the “raw materials” for data science
  After refining, “big” isn’t always so “Big”
Use statistical insight to avoid pitfalls:
  Inferences: Observational bias / Sampling bias
  Predictions: Confounding / Overfitting
  Think about variances and means (risk!)
Some data scientists may miss these issues
  Look for statistical expertise
Further reading: ComputerWorld: 12 predictive analytics screw-ups