NYC Data Science Academy
Hadoop Application Development with Real Cases
Multi-layer Model
Data Pyramid and Roles
Business personnel
ETL engineer
Data warehouse engineer
Analyst
Data visualization engineer
IT support: operations and maintenance, programmer
Data Analysis
Analyze collected data with statistical methods for a defined purpose, then
interpret the results and act on them
Data Mining
Data mining is a technique for retrieving hidden information from data. It is a
process that applies knowledge-discovery algorithms to large databases and shows the
resulting associations to users.
Original ideas: hypothesis testing, pattern recognition, artificial intelligence, machine
learning
Common data mining projects: association rules, clustering, outlier analysis
Case: beer and diapers
Science: Detecting Novel Associations in Large Data Sets
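As a rough illustration of the association-rule idea behind the beer-and-diapers case, the sketch below computes the two usual rule measures, support and confidence. The baskets are hypothetical and invented for this example, and the function names `support` and `confidence` are this sketch's own, not part of any library mentioned in the course.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0


# Hypothetical shopping baskets, made up for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

# Measures for the rule "diapers -> beer".
print(support(baskets, {"beer", "diapers"}))
print(confidence(baskets, {"diapers"}, {"beer"}))
```

A rule is usually reported only when both measures exceed thresholds chosen for the business question at hand.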
Business Intelligence
BI = Data Warehouses (Storage) + Data Analysis and Data Mining (Analysis) + Reports (Presentation)
Our course
Data Analysis Algorithms
Popular Algorithms
Regression
Time Series Analysis
Classifier
Clustering
Association Rules
Data Analysis
Data Analysis Tools
Popular Data Analysis Tools Ranking
Data Analysis Stages
Stage 1: Dominated by business personnel
Stage 2: Dominated by both business personnel and analysts
Stage 3: Dominated by analysts
Data Analysis in Stage 1
Business staff set all the requirements and most of the analysis plans
Drawing on experience, business staff select the features and set the
thresholds; IT staff search for and integrate the data; analysts produce the
report
Feature selection and the choice of thresholds are based on experience
and personal knowledge
Suitable for simple cases; the analysis technique is equivalent to the
simplest decision tree
Business staff have valuable experience and are hard to replace;
analysts only produce charts and are easily replaced
This is common in traditional industries
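To make the "simplest decision tree" remark concrete, a stage 1 analysis can be pictured as a single hand-picked feature compared against a single hand-picked threshold. The sketch below assumes a hypothetical feature `monthly_spend` and a hypothetical cutoff of 100.0; neither comes from the course material.

```python
def stage_one_rule(record, feature, threshold):
    """Depth-one decision tree: one feature, one experience-based threshold."""
    return "flag" if record[feature] >= threshold else "ignore"


# Hypothetical customer records, made up for illustration only.
customers = [
    {"id": 1, "monthly_spend": 120.0},
    {"id": 2, "monthly_spend": 35.0},
]

# The threshold is chosen by business staff, not learned from the data.
labels = [stage_one_rule(c, "monthly_spend", 100.0) for c in customers]
print(labels)
```

Nothing here is fitted to data, which is why the value in this stage lies in the business knowledge that picks the feature and the cutoff.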
Data Analysis in Stage 2
The problems are more complex. Business staff can analyze a small
number of data records but cannot identify all the
features and the relationships among them. They have no
experience with large numbers of samples.
Analysts step in to clean the data, select features, and finally
build a suitable model to solve the problem.
Business staff and analysts can evaluate the results
together, so success is very likely. Analysts prefer this stage
because their ability and value are recognized.
Spammers in WordPress
Data Analysis in Stage 3
Business staff have no experience with
the case and cannot offer any useful
prior knowledge
Analysts use various tools and
models to mine the data, hoping to
make interesting discoveries
This is the analyst's ideal world, but it is
likely to fail
Business staff cannot get involved, and
they dislike this stage
Step Forward
The first stage (gold on the ground) -> the
second stage (gold beneath the ground) -> the
third stage (gold deeply buried)
If analysts are reckless, business staff will be
reluctant to help
Data analysis is rooted in the business
background. The goal of analysis is to increase
profit. Successful analysis cannot be separated from
the business
An interesting topic is more important than the
model
What is Big Data
Features of Big Data
Challenges for Analysts
Both insertion and querying become bottlenecks as the amount of
data grows
The trend toward integrating analysis results into user-facing applications calls for
faster real-time computation and shorter response times
More complex models require more expensive computation
Dilemma of Traditional Data Analysis Tools
R, SAS, and SPSS are experimental tools
The data size they can handle is restricted by memory size
An Oracle database can hold large volumes of data, but it lacks specialized,
fast analysis capabilities
Sampling is a limited solution; it is not useful for clustering or
recommendation systems
Solution: a Hadoop cluster and Map-Reduce parallel computing
Case 1: Analysis and monitoring for a telecommunications company
Configuration of the original database server: HP minicomputer, 128 GB
memory, 48-core CPU, RAC with two nodes, one node for insertion and the
other for queries
Storage: HP virtual storage, over 1000 disks
Architecture: Oracle RAC with two nodes
Bottlenecks: 1. Insertion 2. Query
Case 2: DNA database
Case 3: Social analysis, activity fingerprint detection
April 11, 2023
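| Public voice mail intersect | IMSI 1 | IMSI 2 | …… | IMSI n | Total call duration |
| --- | --- | --- | --- | --- | --- |
| User A IMSI | 20% | 12% | …… | 5% | 365 |
| User B IMSI | 15% | 13% | …… | 2% | 310 |

| Public SMS intersect | IMSI 1 | IMSI 2 | …… | IMSI n | Monthly SMS count |
| --- | --- | --- | --- | --- | --- |
| User A IMSI | 50% | 10% | …… | 5% | 200 |
| User B IMSI | 20% | 13% | …… | 2% | 260 |

| Public base station | CGI 1 | CGI 2 | …… | CGI n | Shutdown |
| --- | --- | --- | --- | --- | --- |
| User A IMSI | 20% | 12% | …… | 5% | 20% |
| User B IMSI | 15% | 13% | …… | 2% | 5% |

| Public fingerprint (feature vector) | User A | User B |
| --- | --- | --- |
| Voice | (0.2, 0.12, …, 0.05) | (0.15, 0.13, …, 0.02) |
| SMS | (0.5, 0.1, …, 0.05) | (0.2, 0.13, …, 0.02) |
| Base station | (0.2, 0.12, …, 0.05, 0.2) | (0.15, 0.13, …, 0.02, 0.05) |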
Let θ be the angle between two fingerprint vectors
When θ equals 90°, the two vectors are independent
When θ equals 0°, the two vectors are perfectly dependent
The closer θ is to 0°, the more dependent the vectors are
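The angle between two fingerprint vectors can be obtained from their dot product. The sketch below is one plausible way to compute it, not necessarily the method used in the course; the function name `angle_degrees` is this example's own, and the sample vectors keep only the components visible on the slide, so the printed value is illustrative only.

```python
import math


def angle_degrees(u, v):
    """Angle between vectors u and v in degrees, via the cosine formula."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


# Assumption: the elided components are dropped, leaving three per user.
user_a = [0.2, 0.12, 0.05]
user_b = [0.15, 0.13, 0.02]
print(angle_degrees(user_a, user_b))
```

A small angle suggests that two IMSIs share a similar activity pattern and may belong to the same person; the cutoff for "small" would have to be chosen from labeled examples.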
Case 3: Social analysis, VIP detection
The Solution Analysts Look Forward To
Eliminates the bottleneck for the foreseeable future
Carries over existing techniques smoothly, for example SQL and R
Cost of the new platform: hardware and software, redevelopment, skill
training, maintenance
Path to Big Data
Idea of Hadoop
Map-Reduce Programming
Map-Reduce program for meteorological data analysis
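As a minimal sketch of what such a job might look like, the code below simulates the map, shuffle, and reduce phases in plain Python to find the maximum temperature per year. The input format (one `year,temperature` pair per line), the sample readings, and the function names are all assumptions made for this example; a real Hadoop job would be written against the Hadoop API and would read the actual weather record format.

```python
from collections import defaultdict


def map_record(line):
    """Map step: parse one 'year,temperature' line into a (year, temp) pair."""
    year, temp = line.strip().split(",")
    yield year, int(temp)


def reduce_year(year, temps):
    """Reduce step: keep the maximum temperature seen for one year."""
    return year, max(temps)


def run_job(lines):
    """Simulate map -> shuffle (group by key) -> reduce on one machine."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reduce_year(k, v) for k, v in sorted(groups.items()))


# Hypothetical readings, made up for illustration only.
readings = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]
print(run_job(readings))
```

On a cluster, the map calls run in parallel on different input blocks and each reduce call sees all values for its key, which is what lets the same logic scale past one machine's memory.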
Map-Reduce implementation for popular algorithms
Why not Hadoop?
Java?
Hard to control?
Hard to integrate data?
Hadoop vs Oracle
Analysis on a Hadoop System
Mainstream: Java programs
Lightweight scripting language: Pig
Smooth migration from SQL: Hive
NoSQL: HBase
Family of Hadoop
Pig
Pig can be treated as a client
for Hadoop: it connects to
Hadoop and runs analyses
Pig is convenient for users
unfamiliar with Java; it uses a SQL-like
language, Pig Latin, oriented toward
data flows
Pig Latin can perform sorting,
filtering, summation, grouping, and
joins, and supports user-defined
functions. It is a lightweight
scripting language for data operations
and analysis
Pig can be treated as a
mapping from Pig Latin to
Map-Reduce
Hive
A data warehouse tool that can turn
the primary data structures in
Hadoop into tables in Hive
Supports HiveQL, a language
almost the same as SQL; it offers
the same functions as SQL
except updating and indexing, and
can be treated as a mapping
from SQL to Map-Reduce
Offers interfaces for the
shell, JDBC/ODBC, Thrift, and the Web
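To picture the "mapping from SQL to Map-Reduce", consider a query such as `SELECT city, COUNT(*) FROM visits WHERE duration_minutes > 30 GROUP BY city`. The table `visits`, its columns, and the rows below are hypothetical, and the code is only a conceptual sketch in plain Python; it does not reproduce the plan that Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of a hypothetical table visits(city, duration_minutes).
visits = [
    ("NYC", 45),
    ("NYC", 10),
    ("Boston", 60),
    ("NYC", 90),
    ("Boston", 5),
]


def map_phase(rows, min_duration):
    """WHERE and projection: keep matching rows, emit (group key, 1)."""
    for city, duration in rows:
        if duration > min_duration:
            yield city, 1


def reduce_phase(pairs):
    """GROUP BY and COUNT(*): sum the emitted ones for each key."""
    counts = defaultdict(int)
    for city, one in pairs:
        counts[city] += one
    return dict(counts)


result = reduce_phase(map_phase(visits, min_duration=30))
print(result)
```

Row-level work (filtering, projection) fits in the map phase, while anything that needs all rows of a group (counting, summing) has to wait for the reduce phase.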
Features of Mahout
Mahout provides scalable machine
learning algorithms (Map-Reduce
implementations), but the Hadoop
platform is not required: the core
library also has efficient algorithms
for a single machine
Mature and popular algorithms are
1. Frequent itemset mining
2. Clustering
3. Classification
4. Recommendation systems
5. Frequent subgraph mining
Reference Textbooks
Typical Experiment Environment (with server)
Server: ESXi, able to host multiple virtual machines and run
3 of them at the same time
PC: Linux or Windows + Cygwin; Linux can be standalone or a virtual
machine
SSH: use the ssh command under Linux, and SecureCRT or PuTTY under
Windows, to connect to the remote Linux server
VMware client: management of ESXi
Hadoop: use version 1.x or 2.x
Typical Experiment Environment (with only a PC or laptop running Windows)
At least 4 GB of memory; 64-bit Windows is preferred, because a 32-bit machine
can use only a little more than 3 GB of memory
Install VMware Workstation or VirtualBox
Deploy 3 virtual machines and run them at the same time. If only two VMs can
run, treat the host as a node (via Cygwin), and use bridged networking for the
virtual network
Install Linux and Java
On older computers, consider a pseudo-distributed environment
Experiment Environment
Deploy Pig
Deploy Hive
Deploy Mahout
List of Cases of the Course
Analysis of a high-volume website log system; retrieving KPI data (Map-Reduce)
LBS application for a telecommunications company; analysis of users' mobile
phone traces (Map-Reduce)
User analysis for a telecommunications company; labeling duplicate users by the
fingerprint of their calls (Map-Reduce)
Recommendation system for an e-commerce company (Map-Reduce)
Complex recommendation system application (Mahout)
Social network; distance between users; community detection (Pig)
Importance of nodes in a social network (Map-Reduce)
Application of clustering algorithms; analysis of VIPs (Map-Reduce, Mahout)
Financial data analysis; retrieving reverse repurchase information from historical
data (Hive)
Setting stock strategies with data analysis (Map-Reduce, Hive)
GPS application; sign-in data analysis (Pig)
Implementation and optimization of sorting on Map-Reduce
Middleware development; cooperation of multiple Hadoop clusters
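For the sorting case in the list above, one common approach to a global sort on Map-Reduce is to partition keys by range so that each reducer sorts only its own range and the outputs concatenate into a fully sorted result. The sketch below simulates that idea on one machine; the boundary values and function names are assumptions for illustration, and the course may use a different technique.

```python
import bisect


def partition(values, boundaries):
    """Route each value to the partition whose key range contains it.

    boundaries must be sorted; n boundaries give n + 1 partitions.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        parts[bisect.bisect_right(boundaries, v)].append(v)
    return parts


def total_sort(values, boundaries):
    """Sort each range partition independently, then concatenate in order."""
    out = []
    for part in partition(values, boundaries):
        out.extend(sorted(part))  # each simulated reducer sorts its own range
    return out


# Hypothetical keys and split points, made up for illustration only.
print(total_sort([5, 3, 9, 1, 7, 2], boundaries=[3, 6]))
```

The optimization work in such a job usually lies in choosing boundaries that give the reducers roughly equal shares, for example by sampling the input first.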