Data Analysis Making Big Data Work
David Chiu
2014/11/24
About Me
Founder of LargitData
Ex-Trend Micro Engineer
ywchiu.com
Big Data & Data Science
US Election Prediction
World Cup Prediction
Hurricane Prediction
Data Science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Being A Data Scientist, You Need to Know
That Much? Seriously?
Statistics
Single Variable, Multi Variable, ANOVA
Data Munging
Data Extraction, Transformation, Loading
Data Visualization
Figure, Business Intelligence
Required Skills
What You Probably Need Is A Team
Business Analyst: Knowing how to use different tools under different circumstances
Statistician: How to process big data?
DBA: How to deal with unstructured data
Software Engineer: Knowing how to use statistics
Four Dimensions
Single Machine: Memory, R, Local File
Cloud: Distributed, Hadoop, HDFS
Statistics: Analysis, Linear Algebra
Architect: Management, Standard
Concept: MapReduce, Linear Algebra, Logistic Regression
Tool: Hadoop, PostgreSQL, R
Analyst: How to use these tools
Hackers: R, Python, Java
“80% are doing summing and averaging”
Content
1. Data Munging
2. Data Analysis
3. Interpret Result
What Do Data Scientists Do?
Application of Data Analysis
Text Mining
Classify Spam Mail
Build Index
Data Search Engine
Social Network Analysis
Finding Opinion Leaders
Recommendation System
What do users like?
Opinion Mining
Positive/Negative Opinion
Fraud Analysis
Credit Card Fraud
Feed Data to the Computer
Make the Computer Do the Analysis
Let the Computer Predict for You
Predictive Analysis
Learn from experience (data) to predict future behavior
What to Predict?
e.g. Who is likely to click on that ad?
For What?
e.g. Decide which ad to show according to click probability and revenue.
Predictive Analysis
Do customers who buy beer also buy Pampers?
People who browse telephone rate plans are likely to switch vendors
People who belong to the same group tend to have the same telecom vendor
Surprising Conclusion
Based on personal behavior, a predictive model can use personal
characteristics to generate a probabilistic score: the higher
the score, the more likely the behavior.
Predictive Model
Linear Model
e.g. For a cosmetic ad, we can give a 90% weight to female
customers and a 10% weight to male customers. Multiplying each
weight by the click probability (15%) gives the probability score:
Female 13.5%, Male 1.5%
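The arithmetic in the cosmetic-ad example can be checked with a short Python sketch. The names segment_weights, base_click_probability and click_score are illustrative and not from the slides; the sketch only multiplies each segment weight by the overall click probability.

```python
# Illustrative names; the numbers come from the cosmetic-ad example.
segment_weights = {"female": 0.90, "male": 0.10}
base_click_probability = 0.15


def click_score(segment: str) -> float:
    """Score = segment weight x overall click probability."""
    return segment_weights[segment] * base_click_probability


for segment in segment_weights:
    print(f"{segment}: {click_score(segment):.1%}")
```

A real linear model would learn its weights from data; here they are fixed by hand, as on the slide.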
Rule Model
e.g.
If the user is female ("She")
And income is over 30k
And the user has not seen the ad yet
Then the click rate is 11%
Simple Predictive Model
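The rule model can be written as a plain conditional. This is a minimal sketch: the function name and argument names are hypothetical, "She" is read as a female user, and the slide does not say what rate applies when the rule does not fire, so the sketch returns None in that case.

```python
from typing import Optional


def rule_click_rate(gender: str, income: float, has_seen_ad: bool) -> Optional[float]:
    """Return the 11% click rate when every condition of the rule holds.

    Returns None when the rule does not apply (the slide gives no fallback rate).
    """
    if gender == "female" and income > 30_000 and not has_seen_ad:
        return 0.11
    return None


print(rule_click_rate("female", 45_000, has_seen_ad=False))
print(rule_click_rate("male", 45_000, has_seen_ad=False))
```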
Induction
From the specific to the general
A computer program is said to learn from experience E with respect to
some task T and some performance measure P, if its performance on T,
as measured by P, improves with experience E
-- Tom Mitchell (1998)
Discover an effective model
Start from a simple model
Update the model as new data is fed in
Keep improving predictive power
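The loop of starting simple and updating with data can be illustrated with a running click-rate estimate. This is only a sketch under stated assumptions: the class name, the initial guess of 0.5 and the click log are all hypothetical, and a real learner would update many parameters rather than one number.

```python
class ClickRateEstimator:
    """Running estimate of a click rate, refined one observation at a time."""

    def __init__(self, initial_guess: float = 0.5) -> None:
        self.estimate = initial_guess
        self.count = 1  # the initial guess counts as one pseudo-observation

    def update(self, clicked: bool) -> float:
        """Move the estimate toward the new observation (incremental mean)."""
        self.count += 1
        self.estimate += (float(clicked) - self.estimate) / self.count
        return self.estimate


model = ClickRateEstimator()
for clicked in [True, False, False, False]:  # hypothetical click log
    model.update(clicked)
print(round(model.estimate, 3))
```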
Machine Learning
Statistical Analysis
Regression Analysis
Clustering
Classification
Recommendation
Text Mining
Application
Image recognition
Decision Tree (step 1)
Split: Rate > 1,299/Month (Yes / No)
Leaves: probability to switch vendor 15%; probability to switch vendor 3%
Decision Tree (step 2)
Split: Rate > 1,299/Month (Yes / No)
One branch: probability to switch vendor 3%
Other branch, split: Income > 22,000 (Yes / No)
Leaves: probability to switch vendor 10%; probability to switch vendor 22%
Decision Tree (step 3)
Split: Rate > 1,299/Month (Yes / No)
One branch, split: Income > 22,000 (Yes / No)
Leaves: probability to switch vendor 10%; probability to switch vendor 22%
Other branch, split: Free for intranet (Yes / No)
Leaves: probability to switch vendor 1%; probability to switch vendor 7%
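A decision tree like the one on these slides is just nested conditions. The sketch below is hedged: the transcript does not preserve which leaf hangs on the Yes branch and which on the No branch, so the assignment here (higher rate leads to the income split, and the first-listed leaf is the Yes leaf) is an assumption, as are the function and argument names.

```python
def switch_probability(monthly_rate: float, income: float, free_intranet: bool) -> float:
    """Probability of switching vendor, read off the three-split tree.

    ASSUMPTION: the Yes/No placement of each leaf is not recoverable from
    the transcript; this mapping is one plausible reading.
    """
    if monthly_rate > 1299:
        # Branch that was 15% before being split on income
        return 0.10 if income > 22_000 else 0.22
    # Branch that was 3% before being split on free intranet calls
    return 0.01 if free_intranet else 0.07


print(switch_probability(1500, 30_000, free_intranet=False))
print(switch_probability(999, 30_000, free_intranet=True))
```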
Supervised Learning
Regression
Classification
Unsupervised Learning
Dimension Reduction
Clustering
Machine Learning
Supervised Learning
Classification
e.g. Stock prediction on bull/bear market
Regression
e.g. Price prediction
Supervised Learning
Dimension Reduction
e.g. Making a new index
Clustering
e.g. Customer Segmentation
Unsupervised Learning
Lift
The better the lift, the greater the cost?
The more decision rules, the more campaigns?
Design a strategy for each persona?
The lift for 4 campaigns?
The lift for 20 campaigns?
Lift
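The slides do not define lift. It is commonly computed as the response rate inside a targeted segment divided by the overall response rate. A minimal sketch, with a hypothetical function name and hypothetical rates:

```python
def lift(segment_rate: float, baseline_rate: float) -> float:
    """Ratio of a segment's response rate to the overall response rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return segment_rate / baseline_rate


# Hypothetical numbers: 15% response in the targeted segment, 5% overall.
print(round(lift(0.15, 0.05), 2))
```

A lift above 1 means the segment responds more often than the population as a whole, which is what a targeted campaign is trying to achieve.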
Can we use the production rate of butter to
predict stock market?
Overfitting
Treating noise as information
Over-assumption
Over-interpretation
What an overfitted model learns is not the truth
Like memorizing all the answers to a single test.
Overfitting
Testing Model
Use external data or a held-out part of the data as the testing dataset
Traditional Analysis Tool
Statistics On The Fly
Built-in Math and Graphic Function
Free and Open Source
http://cran.r-project.org/src/base/
R Language
Functional Programming
Use Function Definition To Retrieve Answer
Interpreted Language
Statistics On the Fly
Object Oriented Language
S3 and S4 Methods
R Language
Most Used Analytic Language
Most popular languages are R, Python (39%), SQL (37%), SAS (20%).
By Gregory Piatetsky, Aug 27, 2013.
Data Scientist in Google and Apple Use R
What is your programming language of choice, R,
Python or something else?
“I use R, and occasionally matlab, for data analysis. There is
a large, active and extremely knowledgeable R community at
Google.” http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/
“Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred” http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook
Discover which customers are likely to churn
Customer Churn Analysis
Account Information
state
account length
area code
phone number
User Behavior
international plan
voice mail plan, number vmail messages
total day minutes, total day calls, total day charge
total eve minutes, total eve calls, total eve charge
total night minutes, total night calls, total night charge
total intl minutes, total intl calls, total intl charge
number customer service calls
Target
Churn (Yes/No)
Data Description
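> install.packages("C50")
> library(C50)
> # Load the churn data (newer C50 releases may no longer bundle this dataset)
> data(churn)
> str(churnTrain)
> # Drop columns that are not used as predictors
> churnTrain <- churnTrain[, !names(churnTrain) %in% c("state", "area_code", "account_length")]
> # Randomly assign each row to group 1 (about 70%) or group 2 (about 30%)
> set.seed(2)
> ind <- sample(2, nrow(churnTrain), replace = TRUE, prob = c(0.7, 0.3))
> trainset <- churnTrain[ind == 1, ]
> testset <- churnTrain[ind == 2, ]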
Split data into training and testing datasets
70% as training dataset, 30% as testing dataset
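The same kind of split can be sketched in Python. As with the R call to sample(), each row is assigned independently, so the proportions are only approximately 70/30. The function name is hypothetical, the rows are placeholders, and Python's seed 2 does not reproduce R's random draws.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=2):
    """Assign each row independently to the training or testing set."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))  # placeholder for real records
train, test = split_rows(rows)
print(len(train), len(test))
```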
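library(rpart)
# Fit a classification tree that predicts churn from all remaining columns
churn.rp <- rpart(churn ~ ., data = trainset)
# Draw the tree and label each node with its class counts
plot(churn.rp, margin = 0.1)
text(churn.rp, all = TRUE, use.n = TRUE)
Build Classifier
Classification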
> predictions <- predict(churn.rp, testset, type="class")
> table(testset$churn, predictions)
Prediction Result (rows: actual churn, columns: predicted)
pred   no    yes
no     859   18
yes    41    100
> library(caret)
> confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics
predictions yes no
yes 100 18
no 41 859
Accuracy : 0.942
95% CI : (0.9259, 0.9556)
No Information Rate : 0.8615
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7393
Mcnemar's Test P-Value : 0.004181
Sensitivity : 0.70922
Specificity : 0.97948
Pos Pred Value : 0.84746
Neg Pred Value : 0.95444
Prevalence : 0.13851
Detection Rate : 0.09823
Detection Prevalence : 0.11591
Balanced Accuracy : 0.84435
'Positive' Class : yes
Use Confusion Matrix
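The statistics in the confusion matrix report follow directly from the four cell counts. The sketch below recomputes a few of them in Python with "yes" (churn) as the positive class; the variable names are illustrative.

```python
# Cell counts from the confusion matrix, positive class = "yes" (churn)
tp, fn = 100, 41   # actual yes: predicted yes / predicted no
fp, tn = 18, 859   # actual no:  predicted yes / predicted no

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)          # share of churners that were caught
specificity = tn / (tn + fp)          # share of non-churners correctly kept
pos_pred_value = tp / (tp + fp)
neg_pred_value = tn / (tn + fn)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"Accuracy          {accuracy:.3f}")
print(f"Sensitivity       {sensitivity:.5f}")
print(f"Specificity       {specificity:.5f}")
print(f"Balanced Accuracy {balanced_accuracy:.5f}")
```

Accuracy looks high partly because most customers do not churn; sensitivity shows that about 29% of actual churners are still missed.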
Use Testing Data to Validate Result
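library(ROCR)
# Predicted class probabilities; column 1 is the first factor level of churn
predictions <- predict(churn.rp, testset, type = "prob")
pred.to.roc <- predictions[, 1]
# The last column of testset holds the true churn labels
pred.rocr <- prediction(pred.to.roc, as.factor(testset[, (dim(testset)[[2]])]))
# Area under the ROC curve
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
# True positive rate against false positive rate
perf.tpr.rocr <- performance(pred.rocr, "tpr", "fpr")
plot(perf.tpr.rocr, colorize = TRUE, main = paste("AUC:", (perf.rocr@y.values)))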
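AUC can be read as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python. It is not how ROCR is implemented, and the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical churn scores and true labels (1 = churned)
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(round(roc_auc(scores, labels), 3))
```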
Finding Most Important Variable
library(rminer)
# Fit an SVM, then rank the inputs by sensitivity analysis
model <- fit(churn ~ ., trainset, model = "svm")
VariableImportance <- Importance(model, trainset, method = "sensv")
L <- list(runs = 1, sen = t(VariableImportance$imp), sresponses = VariableImportance$sresponses)
mgraph(L, graph = "IMP", leg = names(trainset), col = "gray", Grid = 10)
Dynamic Language
Execution at runtime
Dynamic Type
Interpreted Language
See the result after execution
OOP
Python Language
Cross Platform (Python VM)
Third-Party Resources
(Data Analysis, Graphics, Website Development)
Simple and easy to learn
Benefit of Python
Data Analysis
SciPy
NumPy
scikit-learn
pandas
Companies That Use Python
Use InfoLite Tool To Extract DOM
Use Python To Build Up Dashboard
Monitor Social Media and News
Monitor posts on social media
Configure keywords and alerts
Use a line plot to show daily post statistics
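The counting and alerting steps behind such a dashboard can be sketched in a few lines of Python. Everything here is hypothetical: the post records, the keyword list and the function names are made up for illustration, and the real dashboard's code is not shown in the slides.

```python
from collections import Counter
from datetime import date

# Hypothetical posts: (publication date, text)
posts = [
    (date(2014, 11, 22), "New phone released today"),
    (date(2014, 11, 22), "Service outage reported in Taipei"),
    (date(2014, 11, 23), "Quarterly earnings beat expectations"),
]
alert_keywords = {"outage", "recall"}


def daily_counts(posts):
    """Number of posts per day: the series a line plot would show."""
    return Counter(day for day, _ in posts)


def keyword_alerts(posts, keywords):
    """Posts whose text contains any configured keyword (case-insensitive)."""
    return [(day, text) for day, text in posts
            if any(k in text.lower() for k in keywords)]


counts = daily_counts(posts)
for day in sorted(counts):
    print(day.isoformat(), counts[day])
for day, text in keyword_alerts(posts, alert_keywords):
    print("ALERT", day.isoformat(), text)
```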
Apple Daily, NOWnews, udn, Central News Agency, Storm Media, and other financial media
Daily Statistics Report
Examine Associated Articles
Configure Alert and Keyword
Configure Monitor Channel
Track Specific Article
Have You Learned Big Data?
The 3Vs of Big Data
Product Centric
Customer Centric
Product Centric vs. Customer Centric
Customer Centric?
http://goo.gl/iuy4lY
Personal Recommendation
Knowing Who You Are?
Personal recommendation
Customer relation management
Knowing What the Future Looks Like?
From the history, we can see the future
Predictive analysis
Knowing What is Hidden Beneath?
Correlation, Correlation, Correlation
So… What is Big Data?
So… How To Analyze?
Apache Project – From Yahoo
Feature
Extensible
Cost Effective
Flexible
Highly Fault Tolerant
Hadoop
Hadoop Ecosystem
HDFS
MR IMPALA HBASE
PIG HIVE
SQOOP FLUME
HUE, Oozie, Mahout
Tools for different scale
Size | Classification | Tools
Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...
KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael
GBs – TBs – PBs | Big Data | Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout
Amazon
Recommendation System
Javascript
Flume
HDFS
HBase Pig
Mahout
Item-Based
User-Based
Monitor User Rating
Send User Behavior to Backend
Use Flume To Collect Streaming Data
From /tmp/postlog.txt To /user/cloudera/flume
JSON sample data
{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}
Demo Code
-- Load the JSON records with an explicit schema
second_table = LOAD 'second_table.json'
    USING JsonLoader('food:chararray, person:chararray, amount:int');
Use Pig To Load JSON
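For comparison, the same three sample records can be parsed outside Hadoop with Python's standard json module. This sketch mirrors the schema given to JsonLoader (food and person as strings, amount as an integer); the variable names are illustrative.

```python
import json

# The three sample records, one JSON object per line
raw = """{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}"""

records = []
for line in raw.splitlines():
    rec = json.loads(line)
    # Coerce each field to the type declared in the Pig schema
    records.append((str(rec["food"]), str(rec["person"]), int(rec["amount"])))

for food, person, amount in records:
    print(food, person, amount)
```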
Build Recommendation Model
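The deck builds its recommendation model with Mahout. To show the idea behind item-based collaborative filtering, here is a small standalone Python sketch using cosine similarity between items. It is not Mahout's implementation, and the ratings and function names are hypothetical.

```python
from math import sqrt

# Hypothetical user ratings: user -> {item: rating}
ratings = {
    "alice": {"tacos": 5, "soup": 3},
    "sarah": {"tacos": 4, "soup": 2, "grilled_cheese": 5},
    "alex": {"soup": 4, "grilled_cheese": 4},
}


def item_vectors(ratings):
    """Invert the ratings into item -> {user: rating}."""
    vectors = {}
    for user, items in ratings.items():
        for item, score in items.items():
            vectors.setdefault(item, {})[user] = score
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(user, ratings):
    """Rank unseen items by similarity to the items the user already rated."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, vector in vectors.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(vector, vectors[item]) * rating for item, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("alice", ratings))
```

A user-based variant would compare users' rating vectors instead of items' and recommend what similar users liked.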
$ hbase shell
> create 'mydata', 'mycf'
Build Table In HBase
Examine Data In HDFS
Use Pig To Transfer Data Into HBase
Examine Data In HBase
Build API
Recommendation System
Focus on algorithms
Divide and Conquer, Trie, Collaborative Filtering
Be an expert in a single programming language
But know which tools and algorithms you can use to
solve your problem
Define your role
Statistician
Software engineer
What You Should Do
Website:
largitdata.com
ywchiu.com
Email:
Contacts