Data Analysis Making Big Data Work

David Chiu

2014/11/24

About Me

Founder of LargitData

Ex-Trend Micro Engineer

ywchiu.com

Big Data & Data Science

US Election Prediction

World Cup Prediction

Hurricane Prediction

Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Being A Data Scientist, You Need to Know

That Much? Seriously?

Statistics

Single Variable, Multi Variable, ANOVA

Data Munging

Data Extraction, Transformation, Loading

Data Visualization

Figure, Business Intelligence

Required Skills

What You Probably Need Is A Team

Business Analyst: Knowing how to use different tools under different circumstances

Statistician: How to process big data?

DBA: How to deal with unstructured data

Software Engineer: Knowing how to use statistics

Four Dimensions

Single Machine: Memory, R, Local File

Cloud: Distributed, Hadoop, HDFS

Statistics: Analysis, Linear Algebra

Architect: Management, Standard

Concept: MapReduce, Linear Algebra, Logistic Regression

Tool: Hadoop, PostgreSQL, R

Analyst: How to use these tools

Hackers: R, Python, Java

“80% are doing summing and averaging”

Content

1. Data Munging

2. Data Analysis

3. Interpret Result

What Do Data Scientists Do?

Application of Data Analysis

Text Mining

Classify Spam Mail

Build Index

Data Search Engine

Social Network Analysis

Finding Opinion Leaders

Recommendation System

What do users like?

Opinion Mining

Positive/Negative Opinion

Fraud Analysis

Credit Card Fraud

Feed Data to the Computer

Make the Computer Do the Analysis

Let the Computer Predict for You

Predictive Analysis

Learn from experience (data) to predict future behavior

What to predict?

e.g. Who is likely to click on that ad?

For what?

e.g. Decide which ad to show based on click probability and revenue.

Predictive Analysis

Will customers who buy beer also buy Pampers?

People who browse telephone rate plans are likely to switch vendors

People who belong to the same group tend to use the same telecom vendor

Surprising Conclusion

A predictive model uses a person's characteristics and behavior to generate a probabilistic score: the higher the score, the more likely the behavior.

Predictive Model

Linear Model

e.g. For a cosmetics ad, we can give a 90% weight to female customers and a 10% weight to male customers. Multiplying each weight by the click probability (15%) gives the probability score:

Female 13.5%, Male 1.5%
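The arithmetic in the cosmetics example can be reproduced in a few lines. This is only a minimal Python sketch: the function name `segment_scores` is illustrative and not from the talk, and it assumes the score is simply the segment weight multiplied by the overall click probability.

```python
def segment_scores(click_probability, weights):
    """Return a score per segment: segment weight * overall click probability."""
    return {segment: weight * click_probability
            for segment, weight in weights.items()}

# Values from the slide: 15% click probability, 90% / 10% weights.
scores = segment_scores(0.15, {"female": 0.9, "male": 0.1})
for segment, score in scores.items():
    print(f"{segment}: {score:.1%}")
```

The two printed scores match the 13.5% and 1.5% quoted on the slide.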

Rule Model

If the user is female

And income is over 30k

And has not seen the ad yet

Then the click rate is 11%
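A rule model like this maps directly onto a conditional. The sketch below is one possible reading, not the speaker's code: it assumes "30k" means 30,000 in an unspecified currency, and the function and parameter names are made up for illustration. The slide gives no rate for users who fall outside the rule, so the function returns `None` for them.

```python
def rule_click_rate(gender, income, has_seen_ad):
    """Return the click rate from the rule, or None when the rule does not apply."""
    if gender == "female" and income > 30_000 and not has_seen_ad:
        return 0.11
    return None  # the slide states no rate for other users

print(rule_click_rate("female", 45_000, False))  # rule applies
print(rule_click_rate("male", 45_000, False))    # rule does not apply
```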

Simple Predictive Model

Induction

From detail to general

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

-- Tom Mitchell (1998)

Discover an effective model

Start from a simple model

Update the model as new data is fed in

Keep improving predictive power

Machine Learning

Statistical Analysis

Regression Analysis

Clustering

Classification

Recommendation

Text Mining

Application

Image recognition

Decision Tree

Target: probability to switch vendor

Split 1: Rate > 1,299/Month (Yes / No)

Split 2: Income > 22,000 (Yes / No)

Split 3: Free for intranet (Yes / No)
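A decision tree is just a chain of yes/no tests that ends in a prediction. The Python sketch below shows how such a tree can be stored and walked. The three tests come from the slides, but the way they are nested, the leaf labels, and the field names (`monthly_rate`, `income`, `free_intranet_calls`) are hypothetical, since the diagram itself did not survive.

```python
# Internal nodes test one feature; leaves are plain strings.
# Nesting and leaf labels are hypothetical, chosen only so the sketch runs.
TREE = {
    "test": lambda c: c["monthly_rate"] > 1299,
    "yes": {
        "test": lambda c: c["income"] > 22000,
        "yes": "unlikely to switch",
        "no": {
            "test": lambda c: c["free_intranet_calls"],
            "yes": "unlikely to switch",
            "no": "likely to switch",
        },
    },
    "no": "unlikely to switch",
}

def classify(node, customer):
    """Walk the tree from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        branch = "yes" if node["test"](customer) else "no"
        node = node[branch]
    return node

customer = {"monthly_rate": 1500, "income": 18000, "free_intranet_calls": False}
print(classify(TREE, customer))
```

In practice the tree is learned from data rather than written by hand.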

Supervised Learning

Regression

Classification

Unsupervised Learning

Dimension Reduction

Clustering

Machine Learning

Supervised Learning

Classification

e.g. Stock prediction on bull/bear market

Regression

e.g. Price prediction

Supervised Learning

Dimension Reduction

e.g. Making a new index

Clustering

e.g. Customer Segmentation

Unsupervised Learning

The better the lift, the greater the cost?

The more decision rules, the more campaigns?

Design a strategy for each persona?

The lift for 4 campaigns?

The lift for 20 campaigns?

Can we use the production rate of butter to predict the stock market?

Overfitting

Treating noise as information

Over-assumption

Over-interpretation

What an overfitted model learns is not the truth

Like memorizing all the answers to a single test.

Overfitting

Testing Model

Use external data or a held-out part of the data as the testing dataset
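The "memorize all the answers" picture can be made concrete with a held-out test set. The sketch below uses made-up data (the 1,299 threshold, the 10% label noise, and all names are assumptions for illustration). A "memorizer" that stores every training answer looks perfect on the data it has seen, while a simple rule is the one that holds up on the held-out rows.

```python
import random

random.seed(2)

# Hypothetical data: (customer_id, monthly_rate, churned).
# Churn is more likely above a rate of 1299, with 10% label noise.
def make_customer(customer_id):
    rate = random.randint(500, 2500)
    churned = rate > 1299
    if random.random() < 0.1:
        churned = not churned
    return (customer_id, rate, churned)

data = [make_customer(i) for i in range(1000)]

# Hold out roughly 30% of the rows as the testing dataset.
train, test = [], []
for row in data:
    (train if random.random() < 0.7 else test).append(row)

# Overfitted "model": memorize every training answer by customer id.
memory = {cid: churned for cid, _, churned in train}
majority = sum(memory.values()) > len(memory) / 2

def memorizer(cid, rate):
    return memory.get(cid, majority)  # unseen customers get the majority label

def simple_rule(cid, rate):
    return rate > 1299

def accuracy(model, rows):
    return sum(model(cid, rate) == churned for cid, rate, churned in rows) / len(rows)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(f"{name}: train={accuracy(model, train):.3f} test={accuracy(model, test):.3f}")
```

Only the test column says anything about how a model will behave on new customers.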

Traditional Analysis Tool

Statistics On The Fly

Built-in Math and Graphic Function

Free and Open Source

http://cran.r-project.org/src/base/

R Language

Functional Programming

Use Function Definition To Retrieve Answer

Interpreted Language

Statistics On the Fly

Object Oriented Language

S3 and S4 Methods

R Language

Most Used Analytic Language

Most popular languages are R, Python (39%), SQL (37%), SAS (20%).

By Gregory Piatetsky, Aug 27,

Kaggle

http://www.kaggle.com/

Most often used language in Kaggle competitions

Data Scientists at Google and Apple Use R

What is your programming language of choice, R,

Python or something else?

“I use R, and occasionally matlab, for data analysis. There is a large, active and extremely knowledgeable R community at Google.”

http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/

“Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred”

http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook

Discover which customers are likely to churn

Customer Churn Analysis

Account Information

account length

area code

phone number

User Behavior

international plan

voice mail plan, number vmail messages

total day minutes, total day calls, total day charge

total eve minutes, total eve calls, total eve charge

total night minutes, total night calls, total night charge

total intl minutes, total intl calls, total intl charge

number customer service calls

Target

Churn (Yes/No)

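Data Description

> install.packages("C50")
> library(C50)
> # Load the churn data (recent C50 releases may no longer ship this dataset)
> data(churn)
> str(churnTrain)
> # Drop columns that are not used as predictors
> churnTrain = churnTrain[, !names(churnTrain) %in% c("state", "area_code", "account_length")]
> set.seed(2)
> # Assign each row to group 1 (about 70%) or group 2 (about 30%)
> ind <- sample(2, nrow(churnTrain), replace = TRUE, prob = c(0.7, 0.3))
> trainset = churnTrain[ind == 1, ]
> testset = churnTrain[ind == 2, ]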

Split data into training and testing datasets

70% as training dataset, 30% as testing dataset

library(rpart)

# Fit a classification tree using all remaining columns as predictors
churn.rp <- rpart(churn ~ ., data=trainset)
plot(churn.rp, margin= 0.1)
text(churn.rp, all=TRUE, use.n = TRUE)

Build Classifier

Classification

> predictions <- predict(churn.rp, testset, type="class")
> table(testset$churn, predictions)

Prediction Result

pred    no  yes
no     859   18
yes     41  100

> library(caret)
> confusionMatrix(table(predictions, testset$churn))

Confusion Matrix and Statistics

predictions  yes   no
yes          100   18
no            41  859

Accuracy : 0.942

95% CI : (0.9259, 0.9556)

No Information Rate : 0.8615

P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.7393

Mcnemar's Test P-Value : 0.004181

Sensitivity : 0.70922

Specificity : 0.97948

Pos Pred Value : 0.84746

Neg Pred Value : 0.95444

Prevalence : 0.13851

Detection Rate : 0.09823

Detection Prevalence : 0.11591

Balanced Accuracy : 0.84435

'Positive' Class : yes

Use Confusion Matrix
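Most of the statistics in this report follow directly from the four counts in the confusion matrix. As a rough check, the Python sketch below recomputes them by hand. The variable names are illustrative, and the kappa formula is the standard Cohen's kappa, which is assumed here to be what the report prints.

```python
# Counts read off the confusion matrix above ("yes" is the positive class).
tp, fp = 100, 18   # predicted yes: actually yes, actually no
fn, tn = 41, 859   # predicted no: actually yes, actually no
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)      # share of real churners that were caught
specificity = tn / (tn + fp)      # share of non-churners left alone
pos_pred_value = tp / (tp + fp)   # precision of the "yes" predictions
neg_pred_value = tn / (tn + fn)
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what chance alone would produce.
expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

metrics = {
    "Accuracy": accuracy,
    "Kappa": kappa,
    "Sensitivity": sensitivity,
    "Specificity": specificity,
    "Pos Pred Value": pos_pred_value,
    "Neg Pred Value": neg_pred_value,
    "Balanced Accuracy": balanced_accuracy,
}
for name, value in metrics.items():
    print(f"{name:<18} {value:.4f}")
```

Note the gap between accuracy (about 94%) and sensitivity (about 71%): because only about 14% of customers churn, a model can score well overall while still missing many churners.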

Use Testing Data to Validate Result

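library(ROCR)

# Class probabilities instead of hard labels
predictions <- predict(churn.rp, testset, type="prob")
# Column 1 holds the probability of the first class level
pred.to.roc <- predictions[, 1]
# The target (churn) is the last column of testset
pred.rocr <- prediction(pred.to.roc, as.factor(testset[,(dim(testset)[[2]])]))
# Area under the ROC curve
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
# True positive rate against false positive rate
perf.tpr.rocr <- performance(pred.rocr, "tpr","fpr")
plot(perf.tpr.rocr, colorize=TRUE, main=paste("AUC:",(perf.rocr@y.values)))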
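The AUC shown in the plot title has a simple interpretation: it is the chance that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The Python sketch below computes it that way on a tiny, hypothetical set of scores. It illustrates the definition only and is not how ROCR computes the value internally.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical churn probabilities and true labels (True = churned).
scores = [0.9, 0.8, 0.3, 0.1]
labels = [True, False, True, False]
print(auc(scores, labels))  # 0.75: three of the four pairs are ordered correctly
```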

Finding Most Important Variable

library(rminer)

model=fit(churn~.,trainset,model="svm")
# Sensitivity analysis to measure variable importance
VariableImportance=Importance(model,trainset,method="sensv")
L=list(runs=1,sen=t(VariableImportance$imp),sresponses=VariableImportance$sresponses)
mgraph(L,graph="IMP",leg=names(trainset),col="gray",Grid=10)

Dynamic Language

Execution at runtime

Dynamic Type

Interpreted Language

See the result after execution

Python Language

Cross Platform (Python VM)

Third-Party Resources

(Data Analysis, Graphics, Website Development)

Simple, and easy to learn

Benefit of Python

Data Analysis

Scikit-learn

Pandas

Companies That Use Python

Use InfoLite Tool To Extract DOM

Use Python To Build Up Dashboard

Monitor Social Media and News

Monitor post on social media

Configure keyword and alert

Use line plot to show daily post statistics

Apple Daily, nownews, udn, Central (中央), Storm Media, and other financial media

Daily Statistics Report

Examine Associated Articles

Configure Alert and Keyword

Configure Monitor Channel

Track Specific Article

Have You Learned Big Data?

The 3Vs of Big Data

Product Centric

Customer Centric

Product Centric vs. Customer Centric

Customer Centric?

http://goo.gl/iuy4lY

Personal Recommendation

Knowing Who You Are?

Personal recommendation

Customer relation management

Knowing What the Future Looks Like?

From the history, we can see the future

Predictive analysis

Knowing What is Hidden Beneath?

Correlation, Correlation, Correlation

So… What is Big Data?

So… How To Analyze?

Apache Project – From Yahoo

Feature

Extensible

Cost Effective

Flexible

Highly Fault Tolerant

Hadoop

Hadoop Ecosystem

MR IMPALA HBASE

PIG HIVE

SQOOP FLUME

HUE, Oozie, Mahout

Tools for Different Scales

Size | Classification | Tools

Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...

KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...

MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael

GBs – TBs – PBs | Big Data | Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout

Amazon

Facebook

Recommendation System

Javascript

HBase Pig

Mahout

Item-Based

User-Based
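The item-based idea can be sketched without any framework: items are similar when the same users rate them alike, and a user is recommended unrated items that resemble the ones they already rated highly. The Python below is only an illustration of that idea on hypothetical ratings. It is not Mahout's implementation, and all names and numbers are made up.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}. Not data from the talk.
ratings = {
    "u1": {"item_a": 5, "item_b": 3},
    "u2": {"item_a": 4, "item_b": 2, "item_c": 5},
    "u3": {"item_b": 4, "item_c": 2},
}

def item_vector(item):
    """Ratings for one item, keyed by user."""
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(user):
    """Score unrated items by their similarity to items the user already rated."""
    items = {i for r in ratings.values() for i in r}
    seen = ratings[user]
    scores = {}
    for candidate in items - set(seen):
        sims = [(cosine(item_vector(candidate), item_vector(i)), r)
                for i, r in seen.items()]
        weight = sum(s for s, _ in sims)
        if weight:
            # Similarity-weighted average of the user's own ratings
            scores[candidate] = sum(s * r for s, r in sims) / weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(recommend("u1"))
```

A user-based variant swaps the roles: it finds users with similar rating vectors and recommends what those neighbours liked.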

Monitor User Rating

Send User Behavior to Backend

Use Flume To Collect Streaming Data

From /tmp/postlog.txt To /user/cloudera/flume

JSON sample data

{"food":"Tacos", "person":"Alice", "amount":3}

{"food":"Tomato Soup", "person":"Sarah", "amount":2}

{"food":"Grilled Cheese", "person":"Alex", "amount":5}
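Each line of the sample is a complete JSON object, so the file can be processed one line at a time. As a quick way to inspect such records outside Hadoop, the Python sketch below parses the three lines from the slide and sums the `amount` field. Holding the lines in a list rather than reading a file is an assumption made to keep the example self-contained.

```python
import json

# The three records from the slide, one JSON object per line.
lines = [
    '{"food":"Tacos", "person":"Alice", "amount":3}',
    '{"food":"Tomato Soup", "person":"Sarah", "amount":2}',
    '{"food":"Grilled Cheese", "person":"Alex", "amount":5}',
]

records = [json.loads(line) for line in lines]
total = sum(r["amount"] for r in records)

for r in records:
    print(r["person"], r["food"], r["amount"])
print("total amount:", total)  # 3 + 2 + 5 = 10
```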

Demo Code

second_table = LOAD 'second_table.json'
    USING JsonLoader('food:chararray, person:chararray, amount:int');

Use Pig To Load JSON

Build Recommendation Model

$ hbase shell

> create 'mydata', 'mycf'

Build Table In HBase

Examine Data In HDFS

Use Pig To Transfer Data Into HBase

Examine Data In HBase

Build API

Recommendation System

Focus on algorithms

Divide and Conquer, Trie, Collaborative Filtering

Be an expert in a single programming language

But know what tools and algorithms you can use to solve your problem

Define your role

Statistician

Software engineer

What You Should Do

Website:

largitdata.com

ywchiu.com

Email:

david@largitdata.com

tr.ywchiu@gmail.com

Contacts
