hadoop dev 01

NYC Data Science Academy: Hadoop Application Development with Real Cases

Upload: vivian-s-zhang

Post on 24-Apr-2015


Page 1: Hadoop dev 01

NYC Data Science AcademyHadoop Application Development with Real Cases

Hadoop Application Development with Real Cases

Page 2: Hadoop dev 01

Multi-layer Model

Page 3: Hadoop dev 01

Data Pyramid and Roles

Business personnel

ETL Engineer

Data Warehouse Engineer

Analyst

Data Visualization Engineer

IT Support: Operations and Maintenance, Programmer

Page 4: Hadoop dev 01

Data Analysis

Analyze collected data with statistical methods for a defined purpose, then interpret the results and act on them

Page 5: Hadoop dev 01

Data Mining

Data mining is a technique for retrieving information hidden in data. It is a process that applies knowledge-discovery algorithms to large databases and presents the resulting associations to users.

Origins: Hypothesis Testing, Pattern Recognition, Artificial Intelligence, Machine Learning

Common Data Mining Projects: Association Rules, Clustering, Outlier Analysis

Case: Beer and Diapers

Science: Detecting Novel Associations in Large Data Sets
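
The beer-and-diapers case is usually told as an association rule. As a minimal sketch of how such a rule can be scored, the Python below computes support, confidence, and lift over a handful of baskets. The baskets and the function names are assumptions made up for illustration; they are not data or code from the course.

```python
# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's baseline support.

    A value above 1 suggests the items co-occur more often than
    they would if they were bought independently.
    """
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


if __name__ == "__main__":
    print("support(diapers, beer):", support({"diapers", "beer"}, baskets))
    print("confidence(diapers -> beer):", confidence({"diapers"}, {"beer"}, baskets))
    print("lift(diapers -> beer):", lift({"diapers"}, {"beer"}, baskets))
```

On real transaction logs the same three measures are typically computed only for itemsets that pass a minimum support, which is what frequent-itemset algorithms are designed to find efficiently.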

Page 6: Hadoop dev 01

Business Intelligence

BI = Data Warehouses (Storage) + Data Analysis and Data Mining (Analysis) + Reports (Presentation)

Our course

Page 7: Hadoop dev 01

Data Analysis Algorithms

Popular Algorithms

Page 8: Hadoop dev 01

Regression

Page 9: Hadoop dev 01

Time Series Analysis

Page 10: Hadoop dev 01

Classifier

Page 11: Hadoop dev 01

Clustering

Page 12: Hadoop dev 01

Association Rules

Page 13: Hadoop dev 01

Data Analysis

Data Analysis Tools

Page 14: Hadoop dev 01

Popular Data Analysis Tools Ranking

Page 15: Hadoop dev 01

Data Analysis Stages

Stage 1: Dominated by business personnel

Stage 2: Dominated by both business personnel and analysts

Stage 3: Dominated by analysts

Page 16: Hadoop dev 01

Data Analysis in stage 1

Business staff set all the requirements and most of the analysis plans

Based on experience, business staff select features and set thresholds, IT staff search for and integrate the data, and analysts produce the reports

Feature selection and the choice of thresholds are based on experience and personal knowledge

Suitable for simple cases; the analysis technique is equivalent to the simplest decision tree

Business staff have valuable experience and are hard to replace; analysts only produce charts and are easily replaced

This is common in traditional industries
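
To make the "simplest decision tree" comparison concrete, here is a minimal Python sketch of a single hand-set threshold rule, which behaves like a decision tree of depth one. The feature name `monthly_spend`, the threshold of 100, and the customer records are all hypothetical; in stage 1 such values would come from business experience rather than from a fitted model.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """One feature and one cut-off, both chosen by hand."""

    feature: str
    threshold: float

    def flag(self, record: dict) -> bool:
        """Return True when the record's feature value exceeds the threshold."""
        return record[self.feature] > self.threshold


# Hypothetical rule and records, invented for illustration only.
rule = ThresholdRule(feature="monthly_spend", threshold=100.0)
customers = [
    {"id": "A", "monthly_spend": 150.0},
    {"id": "B", "monthly_spend": 40.0},
    {"id": "C", "monthly_spend": 100.0},
]

flagged = [c["id"] for c in customers if rule.flag(c)]

if __name__ == "__main__":
    print("flagged customers:", flagged)
```

Nothing here is learned from data, which is why this style of analysis only suits simple cases.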

Page 17: Hadoop dev 01

Data Analysis in stage 2

More complex. Business staff can analyze a small number of data records but cannot work out all the features and the relationships among them. They have no experience with large numbers of samples.

Analysts step in to clean the data, select features, and finally build a suitable model to solve the problem.

Business staff and analysts can evaluate the results together, so success is very likely. Analysts prefer this stage because their ability and value are confirmed.

Page 18: Hadoop dev 01

Spammers in WordPress

Page 19: Hadoop dev 01

Data Analysis in stage 3

Business staff have no experience with the case and cannot offer any useful prior knowledge

Analysts use various tools and models to mine the data, trying to make interesting discoveries

This is the analyst's ideal world, but it is likely to fail

Business staff cannot get involved, and they dislike this stage

Page 20: Hadoop dev 01

Step Forward

The first stage (gold on the ground) -> the second stage (gold beneath the ground) -> the third stage (gold deeply buried)

If analysts are reckless, business staff will be reluctant to help

Data analysis is rooted in the business background. The goal of analysis is to increase profit. Successful analysis cannot be separated from the business

An interesting topic is more important than the model

Page 21: Hadoop dev 01

What is Big Data

Page 22: Hadoop dev 01

Features of Big Data

Page 23: Hadoop dev 01

Challenges for Analysts

Insertion and queries both become bottlenecks as the amount of data grows

The trend toward integrating analysis results into user-facing applications demands faster real-time computation and shorter response times

More complex models require more expensive computation

Page 24: Hadoop dev 01

Dilemma of Traditional Data Analysis Tools

R, SAS, and SPSS are experimental tools

The data size they can handle is restricted by the memory size

An Oracle database can hold a large volume of data, but it lacks specialized, fast analysis capability

Sampling is a limited solution; it is not useful for clustering or recommendation systems

Solution: a Hadoop cluster and Map-Reduce parallel computing

Page 25: Hadoop dev 01

Case 1: Analysis and monitoring for a telecommunications company

Page 26: Hadoop dev 01

Case 1: Analysis and monitoring for a telecommunications company

Configuration of the original database server: HP minicomputer, 128 GB memory, 48-core CPU, RAC with two nodes, one node for insertion and the other for queries

Storage: HP virtual storage, over 1000 disks

Architecture: Oracle RAC with two nodes

Bottlenecks: 1. Insertion 2. Query

Page 27: Hadoop dev 01

Case 2: DNA database

Page 28: Hadoop dev 01

Case 3: Social analysis, activity fingerprint detection

Public voice mail intersect

              IMSI 1   IMSI 2   ……   IMSI n   Total call duration
User A IMSI   20%      12%      ……   5%       365
User B IMSI   15%      13%      ……   2%       310

Public SMS intersect

              IMSI 1   IMSI 2   ……   IMSI n   Monthly SMS count
User A IMSI   50%      10%      ……   5%       200
User B IMSI   20%      13%      ……   2%       260

Public base station

              CGI 1    CGI 2    ……   CGI n    Shutdown
User A IMSI   20%      12%      ……   5%       20%
User B IMSI   15%      13%      ……   2%       5%

Public fingerprint (feature vectors)

               User A                       User B
Voice mail     (0.2, 0.12, …, 0.05)         (0.15, 0.13, …, 0.02)
SMS            (0.5, 0.1, …, 0.05)          (0.2, 0.13, …, 0.02)
Base station   (0.2, 0.12, …, 0.05, 0.2)    (0.15, 0.13, …, 0.02, 0.05)

Page 29: Hadoop dev 01

Case 3: Social analysis, activity fingerprint detection

Let θ be the angle between the two fingerprint vectors.

When θ equals 90°, the two vectors are independent

When θ equals 0°, the two vectors are perfectly dependent

The closer θ is to 0°, the more dependent the vectors are
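
One common way to obtain the angle between two fingerprint vectors is through the cosine of that angle, the dot product divided by the product of the vector lengths. The Python below is a minimal sketch under that assumption; the slide does not say how the course computes the angle, and the function name and example vectors are hypothetical.

```python
import math


def angle_degrees(u, v):
    """Angle in degrees between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    # Hypothetical short vectors, not the truncated ones from the slide.
    print(angle_degrees((1.0, 0.0), (0.0, 1.0)))  # orthogonal directions
    print(angle_degrees((0.2, 0.1), (0.4, 0.2)))  # same direction, different scale
```

Because the angle ignores vector length, two accounts with different total activity but the same contact proportions still come out as close to 0°, which is the property the duplicate-user detection relies on.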

Page 30: Hadoop dev 01

Case 3: Social analysis, VIP detection

Page 31: Hadoop dev 01

The solution analysts are looking for

Eliminates the bottlenecks completely for the foreseeable future

Migrates existing techniques smoothly, for example SQL and R

The cost of the new platform: hardware and software, redevelopment, skills training, maintenance

Page 32: Hadoop dev 01

Path to Big Data

Page 33: Hadoop dev 01

Idea of Hadoop

Page 34: Hadoop dev 01

Map-Reduce Programming

Page 35: Hadoop dev 01

Map-Reduce program for meteorological data analysis
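
The slide's own program is not in the transcript. As a rough sketch of what a Map-Reduce job over weather readings can look like, the Python below simulates the map, shuffle, and reduce phases in a single process to find the maximum temperature per year. The "year,temperature" input format, the sample readings, and the function names are assumptions for illustration; a real Hadoop job would run the map and reduce functions on separate nodes.

```python
from collections import defaultdict


def map_record(line):
    """Map step: turn one 'year,temperature' line into a (year, temp) pair."""
    year, temp = line.split(",")
    yield year, int(temp)


def reduce_year(year, temps):
    """Reduce step: keep the highest temperature seen for one year."""
    return year, max(temps)


def run_job(lines):
    """Run map, then group values by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)
    return dict(reduce_year(key, values) for key, values in sorted(groups.items()))


# Hypothetical readings, invented for illustration only.
readings = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]
result = run_job(readings)

if __name__ == "__main__":
    print(result)
```

The key design point is that `reduce_year` only ever sees the values for one key, so the reduce work for different years can run in parallel.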

Page 36: Hadoop dev 01

Map-Reduce implementation for popular algorithms

Page 37: Hadoop dev 01

Map-Reduce implementation for popular algorithms

Page 38: Hadoop dev 01

Why not Hadoop?

Java?

Hard to control?

Hard to integrate data?

Hadoop vs Oracle

Page 39: Hadoop dev 01

Analysis on a Hadoop system

Mainstream: Java programs

Lightweight scripting language: Pig

Smooth migration from SQL: Hive

NoSQL: HBase

Page 40: Hadoop dev 01

Family of Hadoop

Page 41: Hadoop dev 01

Pig

Pig can be treated as client software for Hadoop: it connects to Hadoop and runs analyses

Pig is convenient for users unfamiliar with Java; it uses a SQL-like language, Pig Latin, that works with data flows

Pig Latin can perform sorting, filtering, summation, grouping, and joins, and lets users define custom functions. It is a lightweight scripting language for data manipulation and analysis

Pig can be treated as the mapping from Pig Latin to Map-Reduce

Page 42: Hadoop dev 01

Hive

A data warehouse tool that can turn the primary data structures in Hadoop into Hive tables

Supports HiveQL, a language almost identical to SQL; it offers the same functionality as SQL except for updating and indexing, and can be treated as the mapping from SQL to Map-Reduce

Offers interfaces for the shell, JDBC/ODBC, Thrift, and Web
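
To illustrate the idea of mapping SQL to Map-Reduce, the Python sketch below shows conceptually how a query such as `SELECT city, COUNT(*) FROM visits GROUP BY city` can be expressed as a map phase and a reduce phase. The `visits` table, its rows, and the function names are hypothetical, and this is not a description of the execution plan Hive actually generates.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of a table visits(user, city), invented for illustration.
visits = [("u1", "NYC"), ("u2", "Boston"), ("u3", "NYC"), ("u1", "NYC")]


def map_phase(rows):
    """Emit one (city, 1) pair per row; the GROUP BY column becomes the key."""
    return [(city, 1) for _user, city in rows]


def reduce_phase(pairs):
    """Sort pairs by key (the shuffle), then sum the ones for each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        city: sum(count for _city, count in group)
        for city, group in groupby(ordered, key=itemgetter(0))
    }


counts = reduce_phase(map_phase(visits))

if __name__ == "__main__":
    print(counts)
```

The GROUP BY column plays the role of the Map-Reduce key and the aggregate function plays the role of the reducer, which is why many SQL aggregations translate naturally.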

Page 43: Hadoop dev 01

Features of Mahout

Mahout provides scalable machine learning algorithms (M-R implementations), but the Hadoop platform is not required. The core library also has efficient single-machine algorithms

Mature and popular algorithms are

1. Frequent Itemset Mining

2. Clustering

3. Classifier

4. Recommendation System

5. Frequent Subgraph Mining

Page 44: Hadoop dev 01

Reference Textbooks

Page 45: Hadoop dev 01

Reference Textbooks

Page 46: Hadoop dev 01

Reference Textbooks

Page 47: Hadoop dev 01

Reference Textbooks

Page 48: Hadoop dev 01

Typical Experiment Environment (with server)

Server: ESXi, capable of hosting multiple virtual machines and of running 3 machines at the same time

PC: Linux or Windows+Cygwin; Linux can be standalone or a virtual machine

SSH: use the ssh command under Linux, or SecureCRT or PuTTY under Windows, to connect to the remote Linux server

VMware client: management of ESXi

Hadoop: use version 1.x or 2.x

Page 49: Hadoop dev 01

Typical Experiment Environment (with only a PC or laptop running Windows)

At least 4 GB memory; 64-bit Windows is preferred, because a 32-bit machine can use only a little more than 3 GB of memory

Install VMware Workstation or VirtualBox

Deploy 3 virtual machines and run them at the same time. If only two VMs can run, treat the host as a node (via Cygwin) and use bridged networking for the virtual network

Install Linux and Java

Older computers could consider a pseudo-distributed environment

Page 50: Hadoop dev 01

Experiment Environment

Deploy Pig

Deploy Hive

Deploy Mahout

Page 51: Hadoop dev 01

List of Cases of the Course

Analysis of a high-volume website log system; retrieving KPI data (Map-Reduce)

LBS application for a telecommunications company; analysis of the traces of users' mobile phones (Map-Reduce)

User analysis for a telecommunications company; labeling duplicated users by the fingerprint of calls (Map-Reduce)

Recommendation system for an e-commerce company (Map-Reduce)

Complicated recommendation system application (Mahout)

Social network; distance between users; community detection (Pig)

Importance of nodes in a social network (Map-Reduce)

Application of clustering algorithms; analysis of VIPs (Map-Reduce, Mahout)

Financial data analysis; retrieving reverse repurchase information from historical data (Hive)

Setting stock strategies with data analysis (Map-Reduce, Hive)

GPS application; sign-in data analysis (Pig)

Implementation and optimization of sorting on Map-Reduce

Middleware development; cooperation of multiple Hadoop clusters