driscoll bi sig_15_jun2010
WINNING WITH BIG DATA
Michael Driscoll, @dataspora
SDForum BI SIG, June 15, 2010
Secrets of the Successful
Data Scientist
WHY DATA MATTERS NOW
THE INDUSTRIAL AGE OF DATA
WHAT IS BIG DATA?
Data that is distributed.
class    size          manage with                    how it fits                     examples
small    < 10 GB       Excel, R                       fits in one machine’s memory    thousands of sales figures
medium   10 GB - 1 TB  indexed files, monolithic DB   fits on one machine’s disk      millions of web pages
big      > 1 TB        Hadoop, distributed DBs        stored across many machines     billions of web clicks
WHAT IS DATA SCIENCE?
WHY DATA SCIENCE IS SEXY
+ =
“The sexy job in the next ten years will be statisticians…” - Hal Varian
data model
1000 bytes 2 bytes
9 WAYS TO WIN WITH DATA
1. CHOOSE THE RIGHT TOOL
You don’t need a chainsaw to cut butter.
2. COMPRESS EVERYTHING
The world is IO-bound.
mysqldump -u myuser -pmypass sourceDB | gzip | \
  ssh [email protected] "cat - | gunzip | mysql -u myuser -pmypass targetDB"
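A rough way to see why compression pays off when IO is the bottleneck is to gzip some repetitive tabular data and compare sizes. The sketch below uses Python's standard gzip module; the rows are made up for illustration and stand in for a database dump.

```python
import gzip

# Hypothetical, highly repetitive CSV rows standing in for a database dump.
rows = "".join(f"{i},widget,9.99\n" for i in range(10_000)).encode("utf-8")

# Compress before sending over the wire, decompress on the other side.
compressed = gzip.compress(rows)
restored = gzip.decompress(compressed)

print(f"raw: {len(rows)} bytes, gzipped: {len(compressed)} bytes")
```

The actual ratio depends heavily on the data; text dumps with repeated column values tend to compress well.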
3. SPLIT UP YOUR DATA
Split, apply, combine.
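A minimal sketch of the split, apply, combine pattern in plain Python, assuming records small enough to group in memory. The helper name split_apply_combine and the sales figures are hypothetical, chosen only to show the three steps.

```python
from collections import defaultdict

def split_apply_combine(records, key, apply):
    """Group records by key, run apply on each group, collect the results."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)            # split
    return {k: apply(g) for k, g in groups.items()}   # apply, then combine

# Hypothetical (region, amount) sales records.
sales = [("west", 10), ("east", 5), ("west", 7), ("east", 1)]

totals = split_apply_combine(
    sales,
    key=lambda r: r[0],
    apply=lambda group: sum(amount for _, amount in group),
)
print(totals)
```

Because each group is processed independently, the apply step is the part that can be farmed out to many machines.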
4. WORK WITH SAMPLES
Big Data is heavy, samples are light.
perl -ne "print if (rand() < 0.01)" data.csv > sample.csv
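The same idea as the one-liner above, sketched in Python: keep each line independently with probability 0.01. The function name sample_lines and the seed parameter are assumptions added here so that a sample can be reproduced.

```python
import random

def sample_lines(lines, rate=0.01, seed=None):
    """Keep each line independently with probability `rate`."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]

# Hypothetical input standing in for the rows of data.csv.
lines = [f"row {i}\n" for i in range(100_000)]

sample = sample_lines(lines, rate=0.01, seed=42)
print(len(sample), "of", len(lines), "lines kept")
```

The sample size is random, not exactly 1% of the input; it only averages out to that rate.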
5. USE STATISTICS
6. COPY FROM OTHERS
Use open source.
git clone git://github.com/kevinweil/hadoop-lzo
7. ESCHEW CHART TYPOLOGIES
Charts are compositions, not containers.
8. COLOR WITH CARE
Color can enhance or insult.
9. TELL A STORY
People are listening.
ONE SUCCESS STORY
WHY DO TELCO CUSTOMERS LEAVE?
Sign up Leave
Goal: “less churn.”
DATA: BILLIONS OF CALLS
… and millions of callers.
DOES CALL QUALITY MATTER?
… a difference, but not significant.
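One common way to decide whether a difference in churn rates is significant is a two-proportion z-test. The sketch below is not the analysis from the talk; the test choice, the function name, and the counts are all assumptions made for illustration.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: churners among customers with poor vs. good call quality.
z, p_value = two_proportion_z_test(x1=30, n1=1000, x2=25, n2=1000)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With these made-up counts the churn rates differ, but the p-value is well above the usual 0.05 threshold, which is the "a difference, but not significant" situation.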
Hmmm...
WHAT ABOUT SOCIAL NETWORKS?
… but is it predictive?
BUILD THE CALL GRAPH
EVOLUTION OF A CALL GRAPH: April, May, June, July
700% INCREASE IN CHURN
when a cancellation occurs in a call network.
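A minimal sketch of how a comparison like this could be set up, assuming call records are available as (caller, callee) pairs. The function names and the toy data are hypothetical; they illustrate the mechanics only and do not reproduce the result reported here.

```python
from collections import defaultdict

def build_call_graph(calls):
    """Undirected graph: two people are linked if either called the other."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph

def churn_rates_by_exposure(graph, churned_before, churned_after):
    """Later churn rate for people with vs. without an earlier-churned contact."""
    counts = {True: [0, 0], False: [0, 0]}  # exposed -> [people, churners]
    for person, neighbors in graph.items():
        if person in churned_before:
            continue  # already gone, cannot churn again
        exposed = bool(neighbors & churned_before)
        counts[exposed][0] += 1
        counts[exposed][1] += person in churned_after
    rate = lambda people, churners: churners / people if people else 0.0
    return rate(*counts[True]), rate(*counts[False])

# Toy call records and churn sets, made up for illustration.
calls = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f"), ("f", "g")]
graph = build_call_graph(calls)
exposed_rate, unexposed_rate = churn_rates_by_exposure(
    graph, churned_before={"a"}, churned_after={"b"}
)
print(exposed_rate, unexposed_rate)
```

On real data the two groups would need to be large enough for the difference in rates to be tested for significance.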
FINAL THOUGHTS
THE BIG DATA STACK
Big Data Dedicated RDBMS
Analytics (R, SPSS, SAS, SAP)
Data Products (Content Filters, Rec Engines)
Data
Actions
Insights
THANKS! QUESTIONS?
Michael [email protected]
@dataspora on Twitter
http://www.dataspora.com/blog