data analysis - making big data work

Post on 21-Apr-2017

Data Analysis Making Big Data Work

David Chiu

2014/11/24

About Me

Founder of LargitData

Ex-Trend Micro Engineer

ywchiu.com

Big Data & Data Science

US Election Prediction

World Cup Prediction

Hurricane Prediction

Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Being A Data Scientist, You Need to Know

That Much? Seriously?

Statistics

Single Variable, Multivariable, ANOVA

Data Munging

Data Extraction, Transformation, Loading

Data Visualization

Figure, Business Intelligence

Required Skills

What You Probably Need Is A Team

Business Analyst: Knowing how to use different tools under different circumstances

Statistician: How to process big data?

DBA: How to deal with unstructured data

Software Engineer: Knowing how to use statistics

Four Dimensions

Single Machine: Memory, R, Local File

Cloud: Distributed, Hadoop, HDFS

Statistics: Analysis, Linear Algebra

Architect: Management, Standard

Concept: MapReduce, Linear Algebra, Logistic Regression

Tool: Hadoop, PostgreSQL, R

Analyst: How to use these tools

Hackers: R, Python, Java

“80% are doing summing and averaging”

Content

1. Data Munging

2. Data Analysis

3. Interpret Results

What Do Data Scientists Do?

Application of Data Analysis

Text Mining

Classify Spam Mail

Build Index

Data Search Engine

Social Network Analysis

Finding Opinion Leaders

Recommendation System

What do users like?

Opinion Mining

Positive/Negative Opinion

Fraud Analysis

Credit Card Fraud

Feed Data to the Computer

Make the Computer Do the Analysis

Let the Computer Predict for You

Predictive Analysis

Learn from experience (data) to predict future behavior

What to Predict?

e.g. Who is likely to click on that ad?

For What?

e.g. Decide which ad to show according to its click probability and revenue.
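
One way to read "click probability and revenue" is as an expected-revenue comparison. The sketch below is only an illustration: the ad names, probabilities, and revenue figures are hypothetical and do not come from the slides.

```python
# Hypothetical ads: (name, estimated click probability, revenue per click).
ads = [
    ("ad_a", 0.15, 1.00),
    ("ad_b", 0.05, 4.00),
    ("ad_c", 0.10, 1.50),
]


def expected_revenue(click_probability, revenue_per_click):
    """Expected revenue of showing an ad once."""
    return click_probability * revenue_per_click


# Pick the ad whose expected revenue per impression is highest.
best_ad = max(ads, key=lambda ad: expected_revenue(ad[1], ad[2]))
print(best_ad[0])
```

Here the ad with the lowest click probability can still win if each click is worth enough, which is why both quantities are needed.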

Predictive Analysis

Will customers who buy beer also buy Pampers?

People who browse telephone rate plans are likely to switch vendors

People in the same group tend to have the same telecom vendor

Surprising Conclusion

Based on personal behavior, a predictive model can use personal characteristics to generate a probabilistic score: the higher the score, the more likely the behavior.

Predictive Model

Linear Model

e.g. For a cosmetics ad, we can give a 90% weight to female customers and a 10% weight to male customers. Multiplying each weight by the click probability (15%) gives the probability score:

Female 13.5%, Male 1.5%
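
The arithmetic behind the two scores can be checked with a few lines of Python. The weights and the click probability are the ones on the slide; the variable names are only illustrative.

```python
click_probability = 0.15                  # overall click probability from the slide
weights = {"female": 0.90, "male": 0.10}  # group weights from the slide

# Linear model: score = weight * click probability.
scores = {group: weight * click_probability for group, weight in weights.items()}

for group, score in scores.items():
    print(f"{group}: {score:.1%}")
```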

Rule Model

e.g.

If the user is female

and income is over 30k

and the user has not seen the ad yet

then the click rate is 11%
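
A rule model like this maps directly onto a conditional. The sketch below encodes the single rule from the slide; the function name and the `default_rate` fallback are assumptions, since the slide gives no click rate for users who do not match the rule.

```python
def rule_click_rate(gender, income, has_seen_ad, default_rate=None):
    """Return the click rate predicted by the single rule on the slide.

    default_rate is a hypothetical fallback for users the rule does not cover.
    """
    if gender == "female" and income > 30_000 and not has_seen_ad:
        return 0.11
    return default_rate


print(rule_click_rate("female", 35_000, has_seen_ad=False))
print(rule_click_rate("male", 35_000, has_seen_ad=False))
```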

Simple Predictive Model

Induction

From the specific to the general

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E

-- Tom Mitchell (1998)

Discover an effective model

Start from a simple model

Update the model as data is fed in

Keep improving predictive power
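
As a toy illustration of "start simple, update with data", the sketch below keeps a running estimate of a click rate and refines it with each observation. The class name, the initial guess, and the observations are all hypothetical; a real model would have more parameters, but the update loop has the same shape.

```python
class ClickRateModel:
    """Start from a simple guess and refine it with every observation."""

    def __init__(self, initial_rate=0.5):
        self.rate = initial_rate  # initial guess, replaced once data arrives
        self.count = 0

    def update(self, clicked):
        # Running mean: move the estimate toward the new observation.
        self.count += 1
        self.rate += (float(clicked) - self.rate) / self.count


model = ClickRateModel()
observations = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # hypothetical clicks (1) and non-clicks (0)
for clicked in observations:
    model.update(clicked)

print(round(model.rate, 2))
```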

Machine Learning

Statistical Analysis

Regression Analysis

Clustering

Classification

Recommendation

Text Mining

Application

Image recognition

Decision Tree

Split: Rate > 1,299/Month (Yes/No); leaf probabilities to switch vendor: 15% and 3%

Decision Tree

Splits: Rate > 1,299/Month (Yes/No), then Income > 22,000 (Yes/No); leaf probabilities to switch vendor: 3%, 10%, and 22%

Decision Tree

Splits: Rate > 1,299/Month (Yes/No), then Income > 22,000 (Yes/No) and Free for intranet (Yes/No); leaf probabilities to switch vendor: 10%, 22%, 1%, and 7%
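
The final tree can be written as nested conditions. The transcript does not preserve which leaf hangs off which branch, so the assignment below is an assumption: the "Yes" side of the rate split leads to the income split, the "No" side to the intranet split, and the first leaf listed for each split is taken as its "Yes" leaf. The function and parameter names are illustrative.

```python
def switch_probability(rate_per_month, income, free_intranet):
    """Probability to switch vendor under the ASSUMED branch layout."""
    if rate_per_month > 1299:        # Rate > 1,299/Month?
        if income > 22000:           # Income > 22,000?
            return 0.10              # assumed "Yes" leaf
        return 0.22                  # assumed "No" leaf
    if free_intranet:                # Free for intranet?
        return 0.01                  # assumed "Yes" leaf
    return 0.07                      # assumed "No" leaf


print(switch_probability(1500, 18000, free_intranet=False))
```

Each added split replaces one coarse leaf probability with two finer ones, which is how the tree grows across the three slides.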

Supervised Learning

Regression

Classification

Unsupervised Learning

Dimension Reduction

Clustering

Machine Learning

Supervised Learning

Classification

e.g. Stock prediction on bull/bear market

Regression

e.g. Price prediction

Supervised Learning

Dimension Reduction

e.g. Making a new index

Clustering

e.g. Customer Segmentation

Unsupervised Learning

Lift

The better the lift, the greater the cost?

The more decision rules, the more campaigns?

Design a strategy for each persona?

The lift for 4 campaigns?

The lift for 20 campaigns?

Lift
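
Lift is commonly defined as the response rate in the targeted group divided by the response rate of the whole population. The sketch below uses that definition with hypothetical counts; the slides do not give campaign numbers.

```python
def lift(targeted_responders, targeted_size, total_responders, total_size):
    """Response rate of the targeted group relative to the overall response rate."""
    targeted_rate = targeted_responders / targeted_size
    baseline_rate = total_responders / total_size
    return targeted_rate / baseline_rate


# Hypothetical campaign: 1,000 customers, 100 respond overall;
# the model targets 200 customers, of whom 60 respond.
campaign_lift = lift(60, 200, 100, 1000)
print(round(campaign_lift, 2))
```

A lift above 1 means the targeted group responds more often than a random selection would.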

Can we use the production rate of butter to predict the stock market?

Overfitting

Treating noise as information

Over-assumption

Over-interpretation

What an overfit model learns is not the truth

Like memorizing all the answers to a single test.

Overfitting
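
The "memorize all the answers" analogy can be made concrete with a deliberately overfit model. In this sketch the data points (monthly rate, churned or not) are hypothetical, and the memorizer is a caricature rather than any particular algorithm.

```python
# Hypothetical (monthly rate, churned) pairs.
train = [(100, 0), (200, 0), (300, 0), (1400, 1), (1500, 1), (1600, 1)]
test = [(150, 0), (250, 0), (1450, 1), (1550, 1)]

memory = dict(train)


def memorizer(rate):
    """Overfit model: recalls training answers, guesses 0 for anything unseen."""
    return memory.get(rate, 0)


def threshold_rule(rate):
    """Simple general rule: predict churn when the rate is high."""
    return 1 if rate > 1299 else 0


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


for name, model in [("memorizer", memorizer), ("threshold rule", threshold_rule)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

Both models are perfect on the training data, but only the general rule holds up on data it has not seen.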

Testing Model

Use external data, or a held-out part of the data, as the testing dataset
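
Holding out part of the data can be sketched as a shuffled split. The function name, the 70/30 default, and the seed are illustrative choices, and the rows are placeholder values.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=2):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))             # placeholder records
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

The model is fitted on the training part only; the testing part is kept aside to measure how well it generalizes.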

Traditional Analysis Tool

Statistics On The Fly

Built-in Math and Graphic Function

Free and Open Source

http://cran.r-project.org/src/base/

R Language

Functional Programming

Use Function Definition To Retrieve Answer

Interpreted Language

Statistics On the Fly

Object Oriented Language

S3 and S4 Methods

R Language

Most Used Analytic Language

Most popular languages are R, Python (39%), SQL (37%), SAS (20%).

By Gregory Piatetsky, Aug 27, 2013.

Kaggle

http://www.kaggle.com/

Most often used language in Kaggle competitions

Data Scientists at Google and Apple Use R

What is your programming language of choice, R, Python or something else?

“I use R, and occasionally matlab, for data analysis. There is a large, active and extremely knowledgeable R community at Google.” http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/

“Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred” http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook

Discover which customers are likely to churn

Customer Churn Analysis

Account Information

state

account length

area code

phone number

User Behavior

international plan

voice mail plan, number vmail messages

total day minutes, total day calls, total day charge

total eve minutes, total eve calls, total eve charge

total night minutes, total night calls, total night charge

total intl minutes, total intl calls, total intl charge

number customer service calls

Target

Churn (Yes/No)

Data Description

> install.packages("C50")
> library(C50)
> data(churn)
> str(churnTrain)
> churnTrain = churnTrain[, !names(churnTrain) %in% c("state", "area_code", "account_length")]
> set.seed(2)
> ind <- sample(2, nrow(churnTrain), replace = TRUE, prob = c(0.7, 0.3))
> trainset = churnTrain[ind == 1, ]
> testset = churnTrain[ind == 2, ]

Split Data into Training and Testing Datasets

70% as training dataset, 30% as testing dataset

library(rpart)
churn.rp <- rpart(churn ~ ., data = trainset)
plot(churn.rp, margin = 0.1)
text(churn.rp, all = TRUE, use.n = TRUE)

Build Classifier

Classification

> predictions <- predict(churn.rp, testset, type="class")

> table(testset$churn, predictions)

Prediction Result

pred    no  yes
no     859   18
yes     41  100

> library(caret)
> confusionMatrix(table(predictions, testset$churn))

Confusion Matrix and Statistics

predictions  yes   no
yes          100   18
no            41  859

Accuracy : 0.942

95% CI : (0.9259, 0.9556)

No Information Rate : 0.8615

P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.7393

Mcnemar's Test P-Value : 0.004181

Sensitivity : 0.70922

Specificity : 0.97948

Pos Pred Value : 0.84746

Neg Pred Value : 0.95444

Prevalence : 0.13851

Detection Rate : 0.09823

Detection Prevalence : 0.11591

Balanced Accuracy : 0.84435

'Positive' Class : yes

Use Confusion Matrix

Use Testing Data to Validate Result
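
The headline statistics in the caret output follow directly from the four cells of the confusion matrix. The sketch below recomputes a few of them from the counts shown above, with "yes" as the positive class; the variable names are illustrative.

```python
# Counts from the confusion matrix above ("yes" is the positive class).
tp = 100   # predicted yes, actually yes
fp = 18    # predicted yes, actually no
fn = 41    # predicted no, actually yes
tn = 859   # predicted no, actually no

accuracy = (tp + tn) / (tp + fp + fn + tn)
sensitivity = tp / (tp + fn)            # share of actual churners that were caught
specificity = tn / (tn + fp)            # share of non-churners correctly left alone
pos_pred_value = tp / (tp + fp)         # share of "yes" predictions that were right
balanced_accuracy = (sensitivity + specificity) / 2

print(round(accuracy, 3), round(sensitivity, 5), round(specificity, 5))
print(round(pos_pred_value, 5), round(balanced_accuracy, 5))
```

Accuracy looks high partly because most customers do not churn; sensitivity shows that roughly three in ten actual churners are still missed.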

library(ROCR)
# Class probabilities; column 1 is the first class level ("yes" in the confusion matrix above)
predictions <- predict(churn.rp, testset, type = "prob")
pred.to.roc <- predictions[, 1]
# The last column of testset holds the churn label
pred.rocr <- prediction(pred.to.roc, as.factor(testset[, (dim(testset)[[2]])]))
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
perf.tpr.rocr <- performance(pred.rocr, "tpr", "fpr")
plot(perf.tpr.rocr, colorize = TRUE, main = paste("AUC:", (perf.rocr@y.values)))

Finding Most Important Variable

library(rminer)
model = fit(churn ~ ., trainset, model = "svm")
VariableImportance = Importance(model, trainset, method = "sensv")
L = list(runs = 1, sen = t(VariableImportance$imp), sresponses = VariableImportance$sresponses)
mgraph(L, graph = "IMP", leg = names(trainset), col = "gray", Grid = 10)

Dynamic Language

Execution at runtime

Dynamic Type

Interpreted Language

See the result after execution

OOP

Python Language

Cross Platform (Python VM)

Third-Party Resources

(Data Analysis, Graphics, Website Development)

Simple and easy to learn

Benefit of Python

Data Analysis

SciPy

NumPy

scikit-learn

pandas

Companies That Use Python

Use InfoLite Tool To Extract DOM

Use Python To Build Up Dashboard

Monitor Social Media and News

Monitor posts on social media

Configure keywords and alerts

Use line plot to show daily post statistics

Apple Daily, nownews, udn, Central News Agency, and Storm Media, plus other financial media

Daily Statistics Report

Examine Associated Articles

Configure Alert and Keyword

Configure Monitor Channel

Track Specific Article

Have You Learned Big Data?

The 3Vs of Big Data

Product Centric

Customer Centric

Product Centric vs. Customer Centric

Customer Centric?

http://goo.gl/iuy4lY

Personal Recommendation

Knowing Who You Are?

Personal recommendation

Customer relation management

Knowing What the Future Looks Like?

From the history, we can see the future

Predictive analysis

Knowing What is Hidden Beneath?

Correlation, Correlation, Correlation

So… What is Big Data?

So… How To Analyze?

Apache Project – From Yahoo

Feature

Extensible

Cost Effective

Flexible

Highly Fault Tolerant

Hadoop

Hadoop Ecosystem

HDFS

MR IMPALA HBASE

PIG HIVE

SQOOP FLUME

HUE, Oozie, Mahout

Tools for different scale

Size / Classification / Tools

Lines / Sample Data
Analysis and Visualisation: Whiteboard, Bash, ...

KBs – low MBs / Prototype Data
Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...

MBs – low GBs / Online Data
Storage: MySQL (DBs), ...
Analysis: NumPy, SciPy, Pandas, Weka, ...
Visualisation: Flare, AmCharts, Raphael

GBs – TBs – PBs / Big Data
Storage: HDFS, HBase, Cassandra, ...
Analysis: Hive, Giraph, Hama, Mahout

Amazon

Facebook

Recommendation System

Javascript

Flume

HDFS

HBase Pig

Mahout

Item-Based

User-Based

Monitor User Rating

Send User Behavior to Backend

Use Flume To Collect Streaming Data

From /tmp/postlog.txt To /user/cloudera/flume

JSON sample data

{"food":"Tacos", "person":"Alice", "amount":3}

{"food":"Tomato Soup", "person":"Sarah", "amount":2}

{"food":"Grilled Cheese", "person":"Alex", "amount":5}

Demo Code

second_table = LOAD 'second_table.json'
    USING JsonLoader('food:chararray, person:chararray, amount:int');

Use Pig To Load JSON
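
For a quick local check of the same records before they go through Pig, the JSON lines can be parsed with Python's standard `json` module. The three lines are the sample data from the slide; the per-person totals are only an illustrative aggregation.

```python
import json

# The three sample records from the slide, one JSON object per line.
lines = [
    '{"food":"Tacos", "person":"Alice", "amount":3}',
    '{"food":"Tomato Soup", "person":"Sarah", "amount":2}',
    '{"food":"Grilled Cheese", "person":"Alex", "amount":5}',
]

records = [json.loads(line) for line in lines]

# Same three fields that the Pig schema declares: food, person, amount.
amount_by_person = {record["person"]: record["amount"] for record in records}
total_amount = sum(record["amount"] for record in records)

print(amount_by_person)
print(total_amount)
```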

Build Recommendation Model
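
The slides build the model with Mahout; the sketch below is not Mahout's implementation, only a minimal user-based collaborative filtering example to show the idea. The ratings are hypothetical, and similarity is computed over shared items only, which is a simplification.

```python
from math import sqrt

# Hypothetical user -> {item: rating} data.
ratings = {
    "Alice": {"Tacos": 5, "Tomato Soup": 3, "Grilled Cheese": 4},
    "Sarah": {"Tacos": 4, "Tomato Soup": 2, "Grilled Cheese": 5, "Ramen": 5},
    "Alex": {"Tacos": 1, "Tomato Soup": 5, "Ramen": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(a[item] ** 2 for item in shared))
    norm_b = sqrt(sum(b[item] ** 2 for item in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Predict ratings for unseen items as a similarity-weighted average."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
                weights[item] = weights.get(item, 0.0) + similarity
    predicted = [(scores[item] / weights[item], item)
                 for item in scores if weights[item] > 0]
    return sorted(predicted, reverse=True)


print(recommend("Alice", ratings))
```

Users whose past ratings resemble the target user's carry more weight, so the prediction for an unseen item leans toward what similar users thought of it.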

$ hbase shell

> create 'mydata', 'mycf'

Build Table In HBase

Examine Data In HDFS

Use Pig To Transfer Data Into HBase

Examine Data In HBase

Build API

Recommendation System

Focus on algorithms

Divide and Conquer, Trie, Collaborative Filtering

Be an expert in a single programming language

But know what tools and algorithms you can use to solve your problem

Define your role

Statistician

Software engineer

What You Should Do

Website:

largitdata.com

ywchiu.com

Email:

david@largitdata.com

tr.ywchiu@gmail.com

Contacts
