Hadoop admin presentation demo | Introduction | Basics



Page 1: Hadoop admin presentation demo | Introduction | Basics

Introduction to Hadoop

By Laxmi Edi, M.Tech (Ph.D)

Page 2: Hadoop admin presentation demo | Introduction | Basics

Topics

• What is Big Data?
• Limitations of the existing solutions
• Solving the problem with Hadoop
• Introduction to Hadoop
• Hadoop Eco-System
• Hadoop Core Components
• HDFS Architecture
• MapReduce Job execution
• Anatomy of a File Write and Read


Page 3: Hadoop admin presentation demo | Introduction | Basics

What Is Big Data?

• Lots of data (terabytes or petabytes).

• Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.

• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

• Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.


Page 4: Hadoop admin presentation demo | Introduction | Basics

NYSE generates about one terabyte of new trade data per day, which is used to perform stock trading analytics and determine trends for optimal trades.


Page 5: Hadoop admin presentation demo | Introduction | Basics

Where does Big Data come from?

The next question is where this Big Data originates, i.e., what makes up Big Data. Basically, the data comes from everywhere:

• sensors used to gather climate information
• posts to social media sites
• digital pictures and videos
• software logs, cameras
• microphones
• scans of government documents
• GPS trails
• purchase transaction records
• cell phone GPS signals
• traffic
• and many more.

All these together constitute Big Data.


Page 7: Hadoop admin presentation demo | Introduction | Basics

Big Data Characteristics

1. Volume: How gigantic the data is. It could amount to hundreds of terabytes or even petabytes of information. For instance, 15 terabytes of Facebook posts or 400 billion annual medical records could mean Big Data!

2. Velocity: The rate at which data flows into companies. Big data requires fast processing, and the time factor plays a crucial role in several organizations. For instance, processing 2 million records at a stock exchange, or evaluating the results of hundreds of thousands of students who applied for competitive exams, could mean Big Data!

3. Variety: Big Data may not belong to a specific format. It could be in any form: structured, unstructured, text, images, audio, video, log files, emails, simulations, 3D models, etc. New research shows that a substantial amount of an organization's data is not numeric; however, such data is equally important for the decision-making process. So organizations need to think beyond stock records, documents, personnel files, finances, etc.

Page 8: Hadoop admin presentation demo | Introduction | Basics

Question


Map the following to the corresponding data type:

-XML Files

-Word Docs, PDF files, Text files

-E-Mail body

-Data from Enterprise systems (ERP, CRM etc.)

Page 9: Hadoop admin presentation demo | Introduction | Basics

Answer


XML Files -> Semi-structured data

Word Docs, PDF files, Text files -> Unstructured Data

E-Mail body -> Unstructured Data

Data from Enterprise systems (ERP, CRM etc.) -> Structured Data

Page 10: Hadoop admin presentation demo | Introduction | Basics

Big Data Customer Scenarios


Web and e-tailing:
• Recommendation Engines
• Ad Targeting
• Search Quality
• Abuse and Click Fraud Detection

Telecommunications:
• Customer Churn Prevention
• Network Performance Optimization
• Calling Data Record (CDR) Analysis
• Analyzing Network to Predict Failure

Page 11: Hadoop admin presentation demo | Introduction | Basics

Big Data Customer Scenarios


Government:
• Fraud Detection and Cyber Security
• Welfare Schemes
• Justice

Healthcare & Life Sciences:
• Health information exchange
• Gene sequencing
• Serialization
• Healthcare service quality improvements
• Drug Safety

Page 12: Hadoop admin presentation demo | Introduction | Basics

Big Data Customer Scenarios


Banks and Financial services:
• Modeling True Risk
• Threat Analysis
• Fraud Detection
• Trade Surveillance
• Credit Scoring and Analysis

Retail:
• Point of Sales Transaction Analysis
• Customer Churn Analysis
• Sentiment Analysis

Page 14: Hadoop admin presentation demo | Introduction | Basics

What Is Hadoop


Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management framework with scale-out storage and distributed processing.
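
To make that "simple programming model" concrete, here is a minimal word-count sketch against the classic org.apache.hadoop.mapreduce API. It is a hedged illustration, not part of the original slides; the class names and tokenizing logic are my own.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative word-count classes; names are not from the slides.
    public class WordCount {

      // Mapper: for each input line, emit a (word, 1) pair per token.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sum the counts collected for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }

The mapper emits a pair for every token and the reducer sums the counts per word; the framework takes care of distribution, shuffling, and retries.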

Page 15: Hadoop admin presentation demo | Introduction | Basics

Why Hadoop?


Key features – Why Hadoop?

1. Flexible
2. Scalable
3. Building a more efficient data economy
4. Robust ecosystem
5. Hadoop is getting more "real-time"!
6. Cost effective
7. Upcoming technologies using Hadoop
8. Hadoop is getting cloudy!

Page 16: Hadoop admin presentation demo | Introduction | Basics

Question


Hadoop is a framework that allows for the distributed processing of:

-Small Data Sets

-Large Data Sets

Page 17: Hadoop admin presentation demo | Introduction | Basics

Answer


Large Data Sets. Hadoop is also capable of processing small data sets; however, to experience its true power, one needs data on the order of terabytes, because this is where an RDBMS takes hours or fails outright, whereas Hadoop does the same in a couple of minutes.

Page 19: Hadoop admin presentation demo | Introduction | Basics

Machine Learning with Mahout


• Mahout is a data mining library.
• It takes the most popular data mining algorithms for performing clustering, regression testing, and statistical modeling and implements them using the MapReduce model.

Page 20: Hadoop admin presentation demo | Introduction | Basics

Hadoop Core Components


Hadoop is a system for large-scale data processing. It has two main components:

HDFS – Hadoop Distributed File System (Storage)
• Distributed across "nodes"
• Natively redundant
• NameNode tracks block locations

MapReduce (Processing)
• Splits a task across processors "near" the data and assembles the results
• Self-healing, high-bandwidth clustered storage
• JobTracker manages the TaskTrackers
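
As a hedged sketch of how the two components meet in practice, the driver below wires the illustrative WordCount classes from the earlier example into a job: input is read from HDFS, processing is handled by the MapReduce layer, and output lands back in HDFS. The paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative driver; wires the WordCount classes sketched earlier.
    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // existing HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged as a jar, it would typically be submitted with the hadoop jar command, passing an input directory and a not-yet-existing output directory as arguments.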

Page 21: Hadoop admin presentation demo | Introduction | Basics

Hadoop Core Components (Contd.)


Page 22: Hadoop admin presentation demo | Introduction | Basics

HDFS Architecture

HDFS has a master/slave architecture.

An HDFS cluster consists of a single master node (the NameNode), a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of slave nodes (the DataNodes), usually one per node in the cluster, which manage the storage attached to the nodes that they run on.

Page 23: Hadoop admin presentation demo | Introduction | Basics

Main Components Of HDFS


NameNode:
• master of the system
• maintains and manages the blocks which are present on the DataNodes

DataNodes:
• slaves which are deployed on each machine and provide the actual storage
• responsible for serving read and write requests for the clients
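
A small, hedged sketch of the NameNode's metadata role using the standard FileSystem client API: the client asks for a file's block locations, and the hosts returned are the DataNodes that store each block. The path is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath, if present.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Illustrative path; the block-location query is answered by the NameNode.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset " + block.getOffset()
              + " length " + block.getLength()
              + " hosts " + String.join(",", block.getHosts()));
        }
      }
    }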

Page 24: Hadoop admin presentation demo | Introduction | Basics

HDFS - Read Anatomy


1. The client requests the file.

2. The NameNode (NN) checks the permissions and sends back the list of blocks and, for each block, the list of DataNodes holding it (including the port number to talk to).

3-6. The "DFSClient" class on the client side picks up the first block and requests it from the first DataNode on the list. It tries two times; if there is no response, it adds that DataNode to a "deadnodes" list and requests the block from the next DataNode on the list.

7-8. After a successful read of all the blocks, "DFSClient" sends the deadnodes list back to the NN so it can take action.

Page 25: Hadoop admin presentation demo | Introduction | Basics

HDFS - Write Anatomy


Create and Write of an HDFS file

• Creating and writing a file is more complicated than reading an HDFS file.

• Here also, the NameNode (NN) never writes any data directly to the DataNodes (DN). As per its role, it only manages the namespace and inodes.

• The client has to write directly to a DataNode. However, each DataNode has to acknowledge receipt of each block back to the client and the NameNode. Each DataNode also passes the block on to the next DataNode to write, which means the client has to transmit a block only to the first DataNode; the rest of the block movement is handled inside the cluster.
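
A minimal, hedged write sketch of that flow using the standard client API: create() registers the file with the NameNode, and the bytes written to the stream go to the first DataNode, which replicates them down the pipeline. Path and payload are illustrative.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteToHdfs {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/output.txt"); // illustrative path
        // create() contacts the NameNode to create the file entry (inode).
        try (FSDataOutputStream out = fs.create(file)) {
          // Data goes to the first DataNode; the pipeline replicates it onward.
          out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
      }
    }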

Page 26: Hadoop admin presentation demo | Introduction | Basics

THANK YOU for attending this demo of

www.kerneltraining.com/course-cat/big-data