data analysis - making big data work

Data Analysis Making Big Data Work David Chiu 2014/11/24

Upload: david-chiu

Post on 21-Apr-2017

Category:

Data & Analytics


Page 1: Data Analysis - Making Big Data Work

Data Analysis Making Big Data Work

David Chiu

2014/11/24

Page 2: Data Analysis - Making Big Data Work

About Me

Founder of LargitData

Ex-Trend Micro Engineer

ywchiu.com

Page 3: Data Analysis - Making Big Data Work

Big Data & Data Science

Page 4: Data Analysis - Making Big Data Work

US Election Prediction

Page 5: Data Analysis - Making Big Data Work

World Cup Prediction

Page 6: Data Analysis - Making Big Data Work

Hurricane Prediction

Page 7: Data Analysis - Making Big Data Work

Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 8: Data Analysis - Making Big Data Work
Page 9: Data Analysis - Making Big Data Work

To Be a Data Scientist, You Need to Know

That Much? Seriously?

Page 10: Data Analysis - Making Big Data Work

Statistics

Single Variable, Multivariable, ANOVA

Data Munging

Data Extraction, Transformation, Loading

Data Visualization

Figure, Business Intelligence

Required Skills

Page 11: Data Analysis - Making Big Data Work

What You Probably Need Is A Team

Business Analyst: Knowing how to use different tools under different circumstances

Statistician: How to process big data?

DBA: How to deal with unstructured data?

Software Engineer: Knowing how to use statistics

Page 12: Data Analysis - Making Big Data Work

Four Dimensions

Single Machine Memory R Local File

Cloud Distributed Hadoop HDFS

Statistics Analysis Linear Algebra

Architect Management Standard

Concept MapReduce Linear Algebra Logistic Regression

Tool Hadoop PostgreSQL R

Analyst How to use these tools

Hackers R Python Java

Page 13: Data Analysis - Making Big Data Work

“80% are doing summing and averaging”

Content

1. Data Munging

2. Data Analysis

3. Interpret Result

What Do Data Scientists Do?

Page 14: Data Analysis - Making Big Data Work

Application of Data Analysis

Text Mining

Classify Spam Mail

Build Index

Data Search Engine

Social Network Analysis

Finding Opinion Leaders

Recommendation System

What do users like?

Opinion Mining

Positive/Negative Opinion

Fraud Analysis

Credit Card Fraud

Page 15: Data Analysis - Making Big Data Work

Feed data to the computer

Make the Computer Do the Analysis

Page 16: Data Analysis - Making Big Data Work

Let the Computer Predict For You

Page 17: Data Analysis - Making Big Data Work

Predictive Analysis

Learn from experience (data) to predict future behavior

What to Predict?

e.g. Who is likely to click on that ad?

For What?

e.g. Use the click probability and the revenue to decide which ad to show.

Predictive Analysis
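
A minimal sketch of the "For What?" step, assuming the decision is made by expected revenue per impression (click probability times revenue per click). The ad names and numbers are hypothetical, not from the talk.

```python
# Hypothetical ads: predicted click probability and revenue per click.
ads = [
    {"name": "ad_a", "click_prob": 0.15, "revenue_per_click": 0.40},
    {"name": "ad_b", "click_prob": 0.05, "revenue_per_click": 2.00},
    {"name": "ad_c", "click_prob": 0.30, "revenue_per_click": 0.10},
]


def expected_revenue(ad):
    """Expected revenue of one impression: P(click) * revenue per click."""
    return ad["click_prob"] * ad["revenue_per_click"]


def pick_ad(candidates):
    """Choose the ad with the highest expected revenue."""
    return max(candidates, key=expected_revenue)


best = pick_ad(ads)
print(best["name"], round(expected_revenue(best), 3))
```

Note that the ad most likely to be clicked (ad_c here) is not necessarily the one worth showing.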

Page 18: Data Analysis - Making Big Data Work

Do customers who buy beer also buy Pampers?

People who browse phone rate plans are likely to switch vendors

People in the same group tend to use the same telecom vendor

Surprising Conclusions

Page 19: Data Analysis - Making Big Data Work

Based on personal behavior, a predictive model uses personal characteristics to generate a probabilistic score: the higher the score, the more likely the behavior.

Predictive Model

Page 20: Data Analysis - Making Big Data Work

Linear Model

e.g. For a cosmetics ad, we can give a 90% weight to female customers and a 10% weight to male customers. Multiplying by the click probability (15%) gives the probability score: Female 13.5%, Male 1.5%

Rule Model

e.g.

If the user is female

And income is over 30k

And the user hasn't seen the ad yet

Then the click rate is 11%

Simple Predictive Model
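
The two toy models on this slide can be written out as below. This is only a sketch: the function names are made up, and the rule model returns `None` for users the slide gives no rate for.

```python
BASE_CLICK_PROB = 0.15                          # click probability from the slide
GENDER_WEIGHT = {"female": 0.90, "male": 0.10}  # weights from the slide


def linear_score(gender):
    """Linear model: weight * base click probability."""
    return GENDER_WEIGHT[gender] * BASE_CLICK_PROB


def rule_score(gender, income, has_seen_ad):
    """Rule model: 11% click rate when all three conditions hold."""
    if gender == "female" and income > 30_000 and not has_seen_ad:
        return 0.11
    return None  # the slide states no rate for other users


print(round(linear_score("female"), 3), round(linear_score("male"), 3))
print(rule_score("female", 35_000, has_seen_ad=False))
```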

Page 21: Data Analysis - Making Big Data Work

Induction

From the specific to the general

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."

-- Tom Mitchell (1998)

Discover an effective model

Start from a simple model

Update the model as data is fed in

Keep improving predictive power

Machine Learning
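
A small illustration of the loop above, assuming a one-parameter model y = w * x trained by stochastic gradient steps. The data, learning rate, and function names are hypothetical; the point is only that the performance measure P (mean squared error) improves as more experience E is fed in.

```python
def mean_squared_error(w, samples):
    """Performance measure P: average squared prediction error."""
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)


def train(samples, learning_rate=0.01, epochs=20):
    """Start from a simple model (w = 0) and update it on every example."""
    w = 0.0
    history = []
    for _ in range(epochs):
        for x, y in samples:
            error = w * x - y
            w -= learning_rate * error * x  # step against the error gradient
        history.append(mean_squared_error(w, samples))
    return w, history


# Hypothetical experience E: points on the line y = 2x.
samples = [(x, 2.0 * x) for x in range(1, 6)]
w, history = train(samples)
print(round(w, 3), history[0], history[-1])
```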

Page 22: Data Analysis - Making Big Data Work

Statistical Analysis

Regression Analysis

Clustering

Classification

Recommendation

Text Mining

Application


Page 23: Data Analysis - Making Big Data Work

Image recognition

Page 24: Data Analysis - Making Big Data Work

Decision Tree

Rate > 1,299/Month?

    Yes: Probability to switch vendor 15%

    No: Probability to switch vendor 3%

Page 25: Data Analysis - Making Big Data Work

Decision Tree

Rate > 1,299/Month?

    Yes: Income > 22,000?

        Yes: Probability to switch vendor 10%

        No: Probability to switch vendor 22%

    No: Probability to switch vendor 3%

Page 26: Data Analysis - Making Big Data Work

Decision Tree

Rate > 1,299/Month?

    Yes: Income > 22,000?

        Yes: Probability to switch vendor 10%

        No: Probability to switch vendor 22%

    No: Free for intranet?

        Yes: Probability to switch vendor 1%

        No: Probability to switch vendor 7%
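
The final tree reads naturally as nested conditions. The sketch below assumes the branch layout as read from the slide diagram (Yes on the left, No on the right); the function and parameter names are hypothetical.

```python
def switch_probability(monthly_rate, income, free_for_intranet):
    """Probability that a customer switches vendor, following the three-level tree."""
    if monthly_rate > 1299:
        # High-rate customers are split by income.
        return 0.10 if income > 22000 else 0.22
    # Lower-rate customers are split by the "free for intranet" flag.
    return 0.01 if free_for_intranet else 0.07


print(switch_probability(1500, 20000, free_for_intranet=False))
print(switch_probability(900, 50000, free_for_intranet=True))
```

Each added split refines one leaf of the previous tree: the single 15% leaf becomes 10% and 22%, and the 3% leaf becomes 1% and 7%.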

Page 27: Data Analysis - Making Big Data Work

Supervised Learning

Regression

Classification

Unsupervised Learning

Dimension Reduction

Clustering

Machine Learning

Page 28: Data Analysis - Making Big Data Work

Supervised Learning

Page 29: Data Analysis - Making Big Data Work

Classification

e.g. Predicting a bull or bear market

Regression

e.g. Price prediction

Supervised Learning

Page 30: Data Analysis - Making Big Data Work

Dimension Reduction

e.g. Making a new index

Clustering

e.g. Customer Segmentation

Unsupervised Learning

Page 31: Data Analysis - Making Big Data Work

Lift

The better the lift, the greater the cost?

The more decision rules, the more campaigns?

Design a strategy for each persona?

The lift for 4 campaigns?

The lift for 20 campaigns?

Lift
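
The slide does not define lift; the sketch below assumes the usual definition, the response rate in the group the model targets divided by the overall response rate. The campaign numbers are hypothetical.

```python
def response_rate(responders, total):
    """Share of contacted customers who responded."""
    return responders / total


def lift(target_responders, target_total, all_responders, all_total):
    """Lift = response rate in the targeted group / overall response rate."""
    return response_rate(target_responders, target_total) / response_rate(
        all_responders, all_total
    )


# Hypothetical campaign: 1,000 customers overall, 200 of them targeted by the model.
campaign_lift = lift(
    target_responders=60, target_total=200, all_responders=100, all_total=1000
)
print(round(campaign_lift, 2))
```

A lift above 1 means the targeted group responds more often than a randomly chosen group of the same size would.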

Page 32: Data Analysis - Making Big Data Work

Can we use butter production to predict the stock market?

Overfitting

Page 33: Data Analysis - Making Big Data Work

Using noise as information

Over-assumption

Over-interpretation

What an overfitted model learns is not the truth

Like memorizing all the answers to a single test

Overfitting

Page 34: Data Analysis - Making Big Data Work

Testing Model

Use external data or a held-out part of the data as the testing dataset
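
A toy illustration of why held-out data matters, under made-up assumptions: the customers, the 70/30 split, and both "models" are hypothetical. One model memorizes the answer for every training customer id (noise), the other uses a single rule on the monthly rate.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=2):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(predict(r) == r["churn"] for r in rows) / len(rows)


# Hypothetical customers: churn depends only on the monthly rate.
customers = [
    {"id": i, "rate": 800 + 50 * i, "churn": "yes" if 800 + 50 * i > 1299 else "no"}
    for i in range(20)
]
train, test = holdout_split(customers)

# Overfitted "model": memorizes the answer for every training customer id.
memory = {r["id"]: r["churn"] for r in train}


def memorizer(row):
    return memory.get(row["id"], "no")  # unseen customers get a default guess


# Simple model: one rule on a feature that generalizes.
def rate_rule(row):
    return "yes" if row["rate"] > 1299 else "no"


print("memorizer train/test:", accuracy(memorizer, train), accuracy(memorizer, test))
print("rate rule train/test:", accuracy(rate_rule, train), accuracy(rate_rule, test))
```

The memorizer is perfect on the data it has seen, which says nothing about new customers; only the score on the held-out rows shows that.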

Page 35: Data Analysis - Making Big Data Work

Traditional Analysis Tool

Page 36: Data Analysis - Making Big Data Work

Statistics On The Fly

Built-in Math and Graphics Functions

Free and Open Source

http://cran.r-project.org/src/base/

R Language


Page 37: Data Analysis - Making Big Data Work

Functional Programming

Use Function Definition To Retrieve Answer

Interpreted Language

Statistics On the Fly

Object Oriented Language

S3 and S4 Methods

R Language

Page 38: Data Analysis - Making Big Data Work

Most Used Analytic Language

Most popular languages are R, Python (39%), SQL (37%), and SAS (20%).

By Gregory Piatetsky, Aug 27, 2013.

Page 39: Data Analysis - Making Big Data Work

Kaggle

http://www.kaggle.com/

Most often used language in Kaggle competitions

Page 40: Data Analysis - Making Big Data Work

Data Scientists at Google and Apple Use R

What is your programming language of choice, R, Python or something else?

“I use R, and occasionally matlab, for data analysis. There is a large, active and extremely knowledgeable R community at Google.”

http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/

“Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred”

http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook

Page 41: Data Analysis - Making Big Data Work

Which customers are likely to churn?

Customer Churn Analysis

Page 42: Data Analysis - Making Big Data Work

Account Information

state

account length

area code

phone number

User Behavior

international plan

voice mail plan, number vmail messages

total day minutes, total day calls, total day charge

total eve minutes, total eve calls, total eve charge

total night minutes, total night calls, total night charge

total intl minutes, total intl calls, total intl charge

number customer service calls

Target

Churn (Yes/No)

Data Description

Page 43: Data Analysis - Making Big Data Work

> install.packages("C50")
> library(C50)
> data(churn)   # loads churnTrain and churnTest
> str(churnTrain)
> # drop columns that are not used as predictors
> churnTrain = churnTrain[, !names(churnTrain) %in% c("state", "area_code", "account_length")]
> set.seed(2)   # make the random split reproducible
> # assign each row to group 1 (about 70%) or group 2 (about 30%)
> ind <- sample(2, nrow(churnTrain), replace = TRUE, prob=c(0.7, 0.3))
> trainset = churnTrain[ind == 1,]
> testset = churnTrain[ind == 2,]

Split data into training and testing datasets

70% as training dataset, 30% as testing dataset

Page 44: Data Analysis - Making Big Data Work

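library(rpart)   # rpart() comes from the rpart package
churn.rp <- rpart(churn ~ ., data=trainset)   # fit a classification tree on all predictors
plot(churn.rp, margin= 0.1)
text(churn.rp, all=TRUE, use.n = TRUE)   # label the nodes and show observation counts

Build Classifier

Classification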

Page 45: Data Analysis - Making Big Data Work

> predictions <- predict(churn.rp, testset, type="class")
> table(testset$churn, predictions)

Prediction Result (rows: actual churn, columns: predicted)

pred    no   yes
no     859    18
yes     41   100

Page 46: Data Analysis - Making Big Data Work

> library(caret)   # confusionMatrix() comes from the caret package
> confusionMatrix(table(predictions, testset$churn))

Confusion Matrix and Statistics

predictions   yes    no
        yes   100    18
        no     41   859

Accuracy : 0.942

95% CI : (0.9259, 0.9556)

No Information Rate : 0.8615

P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.7393

Mcnemar's Test P-Value : 0.004181

Sensitivity : 0.70922

Specificity : 0.97948

Pos Pred Value : 0.84746

Neg Pred Value : 0.95444

Prevalence : 0.13851

Detection Rate : 0.09823

Detection Prevalence : 0.11591

Balanced Accuracy : 0.84435

'Positive' Class : yes

Use Confusion Matrix
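
The statistics above follow directly from the four cells of the matrix. As a cross-check, here is a short sketch that recomputes them with the standard formulas, taking "yes" as the positive class as in the output; the variable names are illustrative.

```python
# Cells of the confusion matrix above ("yes" is the positive class).
tp, fp = 100, 18   # predicted yes: actually yes / actually no
fn, tn = 41, 859   # predicted no:  actually yes / actually no
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)            # share of real churners that were caught
specificity = tn / (tn + fp)            # share of non-churners correctly kept
pos_pred_value = tp / (tp + fp)
neg_pred_value = tn / (tn + fn)
prevalence = (tp + fn) / total
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [
    ("Accuracy", accuracy), ("Kappa", kappa), ("Sensitivity", sensitivity),
    ("Specificity", specificity), ("Pos Pred Value", pos_pred_value),
    ("Neg Pred Value", neg_pred_value), ("Prevalence", prevalence),
    ("Balanced Accuracy", balanced_accuracy),
]:
    print(f"{name}: {value:.5f}")
```

The high accuracy hides the weaker sensitivity: about 29% of the customers who actually churned were predicted to stay.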

Page 47: Data Analysis - Making Big Data Work

Use Testing Data to Validate Result

library(ROCR)   # prediction() and performance() come from the ROCR package
predictions <- predict(churn.rp, testset, type="prob")
pred.to.roc <- predictions[, 1]   # probability of the first class level
pred.rocr <- prediction(pred.to.roc, as.factor(testset[,(dim(testset)[[2]])]))   # last column holds the churn label
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
perf.tpr.rocr <- performance(pred.rocr, "tpr","fpr")
plot(perf.tpr.rocr, colorize=T,main=paste("AUC:",(perf.rocr@y.values)))
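
To show what the AUC in the plot title measures, here is a sketch that computes it by pairwise comparison: the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. This is not how ROCR computes it internally, and the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking of positives vs. negatives."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical churn scores and true labels (1 = churned).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc(scores, labels), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a model that scores every churner above every non-churner.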

Page 48: Data Analysis - Making Big Data Work

Finding the Most Important Variables

library(rminer)   # fit(), Importance() and mgraph() come from the rminer package
model=fit(churn~.,trainset,model="svm")
VariableImportance=Importance(model,trainset,method="sensv")
L=list(runs=1,sen=t(VariableImportance$imp),sresponses=VariableImportance$sresponses)
mgraph(L,graph="IMP",leg=names(trainset),col="gray",Grid=10)

Page 49: Data Analysis - Making Big Data Work

Dynamic Language

Execution at runtime

Dynamic Type

Interpreted Language

See the result after execution

OOP

Python Language


Page 50: Data Analysis - Making Big Data Work

Cross-Platform (Python VM)

Third-Party Resources

(Data Analysis, Graphics, Website Development)

Simple and easy to learn

Benefits of Python

Page 51: Data Analysis - Making Big Data Work

Data Analysis

SciPy

NumPy

scikit-learn

pandas

Page 52: Data Analysis - Making Big Data Work

Companies That Use Python

Page 53: Data Analysis - Making Big Data Work

Use InfoLite Tool To Extract DOM

Page 54: Data Analysis - Making Big Data Work

Use Python To Build Up Dashboard

Page 55: Data Analysis - Making Big Data Work

Monitor Social Media and News

Monitor posts on social media

Configure keywords and alerts

Use a line plot to show daily post statistics

Apple Daily, nownews, udn, Central News Agency (中央), Storm Media (風傳媒), and other financial media

Page 56: Data Analysis - Making Big Data Work

Daily Statistics Report


Page 57: Data Analysis - Making Big Data Work

Examine Associated Articles

Page 58: Data Analysis - Making Big Data Work

Configure Alert and Keyword


Page 59: Data Analysis - Making Big Data Work

Configure Monitor Channel


Page 60: Data Analysis - Making Big Data Work

Track Specific Article


Page 61: Data Analysis - Making Big Data Work

Have You Learned Big Data?


Page 62: Data Analysis - Making Big Data Work
Page 63: Data Analysis - Making Big Data Work

The 3Vs of Big Data

Page 64: Data Analysis - Making Big Data Work
Page 65: Data Analysis - Making Big Data Work

Product Centric vs. Customer Centric

Product Centric

Customer Centric

Page 66: Data Analysis - Making Big Data Work

Customer Centric?

http://goo.gl/iuy4lY

Page 67: Data Analysis - Making Big Data Work

Personal Recommendation

Page 68: Data Analysis - Making Big Data Work

Knowing Who You Are?

Personal recommendation

Customer relationship management

Knowing What the Future Looks Like?

From the history, we can see the future

Predictive analysis

Knowing What Is Hidden Beneath?

Correlation, Correlation, Correlation

So… What is Big Data?

Page 69: Data Analysis - Making Big Data Work

So… How To Analyze?

Page 70: Data Analysis - Making Big Data Work

Apache Project – From Yahoo

Features

Extensible

Cost Effective

Flexible

Highly Fault Tolerant

Hadoop

Page 71: Data Analysis - Making Big Data Work

Hadoop Ecosystem

HDFS

MR IMPALA HBASE

PIG HIVE

SQOOP FLUME

HUE, Oozie, Mahout

Page 72: Data Analysis - Making Big Data Work

Tools for Different Scales

Size: Lines (Classification: Sample Data)

    Analysis and Visualisation: Whiteboard, Bash, ...

Size: KBs – low MBs (Classification: Prototype Data)

    Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...

Size: MBs – low GBs (Classification: Online Data)

    Storage: MySQL (DBs), ...

    Analysis: NumPy, SciPy, Pandas, Weka, ...

    Visualisation: Flare, AmCharts, Raphael

Size: GBs – TBs – PBs (Classification: Big Data)

    Storage: HDFS, HBase, Cassandra, ...

    Analysis: Hive, Giraph, Hama, Mahout

Page 73: Data Analysis - Making Big Data Work

Amazon

Page 74: Data Analysis - Making Big Data Work

Facebook

Page 75: Data Analysis - Making Big Data Work

Recommendation System

Javascript

Flume

HDFS

HBase Pig

Mahout

Page 76: Data Analysis - Making Big Data Work

Item-Based

Page 77: Data Analysis - Making Big Data Work

User-Based
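
The slides name the two collaborative filtering flavors without spelling them out. A minimal sketch of the difference, assuming cosine similarity over co-rated entries as the similarity measure (the talk does not say which measure its system uses) and made-up ratings:

```python
from math import sqrt

# Hypothetical user -> {item: rating} data; names and scores are made up.
ratings = {
    "Alice": {"Tacos": 5, "Tomato Soup": 3, "Grilled Cheese": 4},
    "Sarah": {"Tacos": 4, "Tomato Soup": 2, "Grilled Cheese": 5},
    "Alex": {"Tacos": 1, "Tomato Soup": 5},
}


def cosine(a, b):
    """Cosine similarity over the keys that both rating dicts share."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(a[k] ** 2 for k in shared))
    norm_b = sqrt(sum(b[k] ** 2 for k in shared))
    return dot / (norm_a * norm_b)


def by_item(user_ratings):
    """Transpose user -> item ratings into item -> user ratings."""
    items = {}
    for user, scores in user_ratings.items():
        for item, score in scores.items():
            items.setdefault(item, {})[user] = score
    return items


# User-based: compare people by the items they both rated.
user_sim = cosine(ratings["Alice"], ratings["Sarah"])

# Item-based: compare items by the users who rated both.
items = by_item(ratings)
item_sim = cosine(items["Tacos"], items["Grilled Cheese"])

print(round(user_sim, 3), round(item_sim, 3))
```

User-based filtering recommends what similar people liked; item-based filtering recommends items similar to the ones a user already rated highly.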

Page 78: Data Analysis - Making Big Data Work

Monitor User Rating

Page 79: Data Analysis - Making Big Data Work

Send User Behavior to Backend

Page 80: Data Analysis - Making Big Data Work

Use Flume To Collect Streaming Data

From /tmp/postlog.txt To /user/cloudera/flume

Page 81: Data Analysis - Making Big Data Work

JSON sample data

{"food":"Tacos", "person":"Alice", "amount":3}

{"food":"Tomato Soup", "person":"Sarah", "amount":2}

{"food":"Grilled Cheese", "person":"Alex", "amount":5}

Demo Code

second_table = LOAD 'second_table.json'
    USING JsonLoader('food:chararray, person:chararray, amount:int');

Use Pig To Load JSON
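
For readers without a Pig installation, the same load step can be mimicked in plain Python: one JSON object per line, read into typed tuples that follow the schema in the Pig statement. This is an illustrative equivalent, not part of the demo; `load_json_lines` and `SCHEMA` are hypothetical names.

```python
import json

# The sample records from the slide, one JSON object per line.
SAMPLE = """\
{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}
"""

# Mirrors 'food:chararray, person:chararray, amount:int' from the Pig script.
SCHEMA = {"food": str, "person": str, "amount": int}


def load_json_lines(text, schema):
    """Parse line-delimited JSON into tuples ordered and typed by the schema."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        rows.append(tuple(cast(record[field]) for field, cast in schema.items()))
    return rows


second_table = load_json_lines(SAMPLE, SCHEMA)
for row in second_table:
    print(row)
```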

Page 82: Data Analysis - Making Big Data Work

Build Recommendation Model

Page 83: Data Analysis - Making Big Data Work

$ hbase shell

> create 'mydata', 'mycf'

Build Table In HBase

Page 84: Data Analysis - Making Big Data Work

Examine Data In HDFS

Page 85: Data Analysis - Making Big Data Work

Use Pig To Transfer Data Into HBase

Page 86: Data Analysis - Making Big Data Work

Examine Data In HBase

Page 87: Data Analysis - Making Big Data Work

Build API

Page 88: Data Analysis - Making Big Data Work

Recommendation System

Page 89: Data Analysis - Making Big Data Work

Focus on algorithms

Divide and Conquer, Trie, Collaborative Filtering

Be an expert in a single programming language

But know which tools and algorithms you can use to solve your problem

Define your role

Statistician

Software engineer

What You Should Do

Page 91: Data Analysis - Making Big Data Work