Introduction to Big Data and Hadoop
TRANSCRIPT
Edureka Contact : [email protected] www.edureka.co/big-data-and-hadoop
Objectives
At the end of this session, you will understand:
Big Data Introduction
Use Cases of Big Data in Multiple Industry Verticals
Hadoop and Its Eco-System
Hadoop Architecture
Learning Path for Developers, Administrators, Testing Professionals and Aspiring Data Scientists
Unstructured Data is Exploding
Source: Twitter
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
IBM’s Definition of Big Data
Annie’s Introduction
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you guys think and answer my questions.
Annie’s Question
Map the following to the corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)
Annie’s Answer
Ans.
» XML files, e-mail body: Semi-structured data
» Audio, Video, Images, Archived documents: Unstructured data
» Data from Enterprise systems (ERP, CRM etc.): Structured data
Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Common Big Data Customer Scenarios
Web and e-tailing
» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection
Telecommunications
» Customer Churn Prevention
» Network Performance Optimization
» Call Detail Record (CDR) Analysis
» Analysing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice
Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety
Common Big Data Customer Scenarios (Contd.)
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
Banks and Financial services
» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis
Retail
» Point of Sales Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
Why DFS?
Read 1 TB Data
1 Machine: 4 I/O Channels, Each Channel – 100 MB/s
10 Machines: 4 I/O Channels per Machine, Each Channel – 100 MB/s
Why DFS? (Contd.)
Time to read 1 TB of data:
1 Machine: 43 Minutes
10 Machines: 4.3 Minutes
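The timings above follow from dividing the data size by the combined channel throughput. A minimal Python sketch of that arithmetic, assuming 1 TB is counted as 1024 x 1024 MB and that every channel on every machine reads in parallel with no coordination overhead. The function name `read_time_minutes` is hypothetical and used only for illustration.

```python
def read_time_minutes(data_tb, machines, channels_per_machine=4, mb_per_s_per_channel=100):
    """Estimate the minutes needed to read data_tb terabytes in parallel."""
    data_mb = data_tb * 1024 * 1024  # assumption: 1 TB = 1024 * 1024 MB
    # Aggregate throughput across all machines and all of their I/O channels
    throughput_mb_per_s = machines * channels_per_machine * mb_per_s_per_channel
    return data_mb / throughput_mb_per_s / 60


if __name__ == "__main__":
    for machines in (1, 10):
        minutes = read_time_minutes(1, machines)
        print(f"{machines} machine(s): {minutes:.1f} minutes")
```

Under these assumptions the estimate comes to roughly 43.7 minutes for one machine and 4.4 minutes for ten, which the slide shows as 43 and 4.3. The point is the tenfold reduction from spreading the read across machines.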
Hadoop Cluster: A Typical Use Case
Active NameNode
RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 Cores, Ethernet: 3 x 10 Gb/s, OS: 64-bit CentOS, Power: Redundant Power Supply
Secondary NameNode
RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 Cores, Ethernet: 3 x 10 Gb/s, OS: 64-bit CentOS, Power: Redundant Power Supply
DataNode
RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 Cores, Ethernet: 3 x 10 Gb/s, OS: 64-bit CentOS
DataNode
RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 Cores, Ethernet: 3 x 10 Gb/s, OS: 64-bit CentOS
StandBy NameNode
RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 Cores, Ethernet: 3 x 10 Gb/s, OS: 64-bit CentOS, Power: Redundant Power Supply
Hidden Treasure
Insight into data can provide Business Advantage.
Some key early indicators can mean fortunes to a business.
More Precise Analysis with more data.
Case Study: Sears Holding Corporation
*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process customer activity and sales data.
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Limitations of Existing Data Analytics Architecture
Layers, from data source to consumer: Instrumentation → Collection → Storage only Grid (Original Raw Data, Mostly Append) → ETL Compute Grid → RDBMS (Aggregated Data) → BI Reports + Interactive Apps
Storage: 90% of the ~2PB archived
Processing: A meagre 10% of the ~2PB data is available for BI
1. Can’t explore original high fidelity raw data
2. Moving data to compute doesn’t scale
3. Premature data death
Solution: A Combined Storage Compute Layer
Layers, from data source to consumer: Instrumentation → Collection → Hadoop: Storage + Compute Grid (Mostly Append) → RDBMS (Aggregated Data) → BI Reports + Interactive Apps
Both Storage and Processing: Entire ~2PB Data is available for processing
No Data Archiving
1. Data Exploration & Advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% that was available with its existing non-Hadoop solutions.
Annie’s Question
Hadoop is a framework that allows for the distributed processing of:
» Small Data Sets
» Large Data Sets
Annie’s Answer
Ans. Large Data Sets. Hadoop is also capable of processing small data sets. However, to experience its true power, one needs data in the terabyte range, because this is where an RDBMS takes hours or fails, whereas Hadoop does the same work in a couple of minutes.
Hadoop Ecosystem
Hadoop 2.0
Pig Latin: Data Analysis
Hive: DW System
Mahout: Machine Learning
Apache Oozie (Workflow)
HBase
MapReduce Framework
Other YARN Frameworks (MPI, GRAPH)
YARN: Cluster Resource Management
HDFS (Hadoop Distributed File System)
Flume: Unstructured or Semi-structured Data
Sqoop: Structured Data
Hadoop Cluster: Facebook
We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
Currently we have 2 major clusters:
» A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
» A 300-machine cluster with 2400 cores and about 3 PB raw storage.
» Each (commodity) node has 8 cores and 12 TB of storage.
» We are heavy users of both streaming as well as the Java APIs. We have built a higher level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
YARN – Moving beyond MapReduce
BATCH (MapReduce)
INTERACTIVE (Tez)
ONLINE (HBase)
STREAMING (Storm, S4, …)
GRAPH (Giraph)
IN-MEMORY (Spark)
HPC MPI (OpenMPI)
OTHER (Search, Weave..)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
No daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
Pseudo-Distributed Mode
Hadoop daemons run on the local machine.
Fully-Distributed Mode
Hadoop daemons run on a cluster of machines.
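Standalone mode is where a MapReduce program is usually tried out first. As a rough illustration of the programming model such a program follows, the map, shuffle and reduce phases can be imitated inside a single Python process. This is a sketch of the idea only, not Hadoop's API: all function names are hypothetical, and word counting is an assumed example task.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, imitating the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data and hadoop", "hadoop stores big data"]
    print(word_count(sample))
```

On a real cluster the framework distributes the map and reduce work across machines and performs the grouping itself. Here everything runs in one process, which loosely mirrors how standalone mode keeps everything in a single JVM.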
Big Data Learning Path
Big Data and Hadoop
Developer/Testing
• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark
Administration
• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization
Data Analyst
• Hadoop Essentials
• Expertise in R
• Statistics Skills
• Machine Learning
MapReduce Design Patterns
Apache Spark & Scala
Apache Cassandra
Linux Administration
Hadoop Administration
Data Science
Business Analytics Using R
Advanced Predictive Modelling in R
Talend for Big Data
Data Visualization Using Tableau
Learning Path to Certification
Course
LIVE Online Class
Class Recording in LMS
24/7 Post Class Support
Module Wise Quiz and Assignment
Verifiable Certificate
Project Work
1. Assistance from Peers and Support team
2. Review for Certification
Further Reading
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Assignment
Referring to the documents in the LMS under Assignment, solve the problem below.
How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
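One way to approach this is to invert the read-time arithmetic from the "Why DFS?" slides. The Python sketch below assumes each DataNode matches that example (4 I/O channels at 100 MB/s each), that 1 TB is counted as 1024 x 1024 MB, and that reads scale perfectly across nodes. The LMS documents may specify different hardware, so the default parameters are placeholders to adjust. The function name `datanodes_needed` is hypothetical.

```python
import math


def datanodes_needed(data_tb, minutes, channels_per_node=4, mb_per_s_per_channel=100):
    """Estimate how many DataNodes are needed to read data_tb within the given minutes."""
    data_mb = data_tb * 1024 * 1024  # assumption: 1 TB = 1024 * 1024 MB
    # Throughput the whole cluster must sustain to finish in time
    required_mb_per_s = data_mb / (minutes * 60)
    # Throughput one DataNode contributes (assumed hardware, see lead-in)
    per_node_mb_per_s = channels_per_node * mb_per_s_per_channel
    # Round up: a fraction of a node still requires a whole node
    return math.ceil(required_mb_per_s / per_node_mb_per_s)


if __name__ == "__main__":
    print(datanodes_needed(data_tb=100, minutes=5))
```

The estimate ignores replication, network limits and job overhead, so it is best read as a lower bound.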
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make the course better!
Please spare a few minutes to take the survey after the webinar.
Survey