Download - Hadoop and Big Data Overview
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.1
Prabhu Thukkaram
Director, Product DevelopmentOracle Complex Processing & SOA SuiteFeb 28, 2014
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.2
What is Big Data ?
HDFS
Map Reduce
HBaseColumnar DB
PIG Hive
ETL Tools BI Reporting
Self Healing Clustered Storage System
Distributed Data Processing
Higher level abstraction
Top-level interfaces
Structured Data,, Excel, etcUnstructured & Semi-Structured Data, Web Logs,
Images, etcSQOOP
Zoo Keeper
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.3
HBase Quick Overview Relational Database – Product Table
Product Id Price SKU Inventory Count
0001 1300 SKU0001 10
0002 2800 SKU0002 25
0003 5600 SKU0003 8
Ideal for OLTP transactions Faster writes and record updates But slow for OLAP
E.g. Select sum(InventoryCount) from Products; Reason:- Data for column “InventoryCount” is not contiguous
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.4
HBase Quick Overview HBase – Product Table
Product Id 0001 0002 0003
Price 1300 2800 5600
SKU SKU0001 SKU0002 SKU0003
Inventory Count
10 25 8
Extremely fast for OLAP E.g. Select sum(InventoryCount) from Products; Supports big data analytics – single table with million columns and
billion rows
43
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.5
Hadoop Cluster
Master
Job Tracker
Name Node
Slave 1
Task Tracker
Data Node
Slave 2
Task Tracker
Data Node
Name Node Does not store data, maintains
directory tree of all files in the cluster
Tracks data blocks of a file across the cluster
Client apps like Hadoop Shell/CLI talk to Name Node to locate, create, move, rename, and delete a file
Returns a list of Data Node servers where data lives
Single point of failure, addressed in Hadoop V2 or YARN
Hadoop Shell, CLI, etc
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.6
Hadoop/HDFS cluster
Master
Job Tracker
Name Node
Slave 1
Task Tracker
Data Node
Slave 2
Task Tracker
Data Node
Data Node Stores & replicates data on the file
system
Connects to Name Node on startup & responds to file system operations
Hadoop Shell/CLI clients can talk directly to Data Node if they know the location of the data
Data Nodes talk to each other when replicating data
Hadoop Shell, CLI, etc
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.7
Writing a file to HDFS
Hadoop Client
NameNode
DataNode 1 DataNode 6DataNode 5 ……
File.txt
Blk A
Blk B
Blk C
Wants to write Blocks A, B, C of File.txt
Ok, write to data nodes 1,5,6
Blk A Blk B Blk CBlk A
Replication of Blk A
Client consults Name Node
Client writes block directly to one Data Node
That Data Node replicates the block
Client writes next block and the cycle repeats
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.8
Hadoop/HDFS cluster
Master
Job Tracker
Name Node
Slave 1
Task Tracker
Data Node
Slave 2
Task Tracker
Data Node
Job Tracker – Accepts MR jobs from Clients Contacts & submits MR tasks to
Task Trackers on located Data Nodes
Monitors Task Tracker nodes for heartbeat/failures and resubmits to a different Task Tracker as needed
Updates the status of a job when complete
Single point of failure, fixed in MRV2 or YARN
Note:- Jobs run as batch and clients can retrieve the status by querying the Job Tracker
Hadoop Shell, CLI, etc
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.9
Hadoop/HDFS cluster
Master
Job Tracker
Name Node
Slave 1
Task Tracker
Data Node
Slave 2
Task Tracker
Data Node
Task Tracker – Accepts map, & reduce tasks from Job Tracker A predefined set of slots
determine the number of tasks it can accept
Spawns a separate JVM process for the task
Notifies the Job Tracker when the process finishesHadoop
Shell, CLI, etc
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.10
Map Reduce Example
• User submits a MR job/jar for input file URL hdfs://File.txt to Job Submitter on the Client JVM. Job Submitter Client contacts Job Tracker to obtain a Job Id
• Job Tracker creates Map Tasks based on the number of input splits. Reduce jobs are defined by the job itself, configured or in API call setNumReduceTasks()
• Client contacts Name Node to compute input splits. Copies the job jar, computed input splits, etc. to Job Tracker’s file system with a directory named after the Job Id. Submits the Job. Job Tracker adds job to its Queue.
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.11
Map Reduce Example
• Map Task splits records and passes each record to user’s Map logic code
• In above example, user’s Map logic tokenizes each record to generate one or more key value pairs.
• Output of a Map task is partitioned as per the defined # of reducers, shuffled, and sorted. Each partition output is then routed to its Reducer
• Reducer merges partition output from other Map tasks in the cluster and calls user’s reduce logic
• M/R guarantees that the input to every Reducer is sorted by key
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.12
Map Reduce – Under the Hood
Source: Hadoop, The Definitive Guide.
User’s Map Logic
Record Split
Sorted & partitioned
Single Map Task
Merged
User’s Reduce Logic
Single Reduce Task
From other Map Tasks
To other Reducers
Data Block
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.13
Map Reduce – Summary
Map – Process of organizing entity data as key-value pairs. Key could be Customer Id, Purchase Order Id, etc.
M/R Framework – Ensures all data relevant to an entity/key is delivered to a single Reducer
Reduce – Process of aggregating data related to an Entity and deducing information.
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.14
Risk Analysis - Large no of CC txns, no direct deposit into checking over the last two months implies the customer is unemployed and at high risk of defaulting
Map Reduce - Risk Analysis
HDFS Global View of
Customer
Credit Card Txns
Chat Session
Checking account deposits and withdrawals
Map Phase
Reduce Phase
Risk Score
Gathers all data (CC txns, chats, withdrawals, etc.) pertaining to a single customer.
Data in HDFS changes frequently and hence the need to reevaluate risk using batch jobs. Reevaluated results are written to a DB to enable business decisions. E.g. To approve a new credit card or loan
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.15
Hadoop V1 - Limitations Cluster resource management is tightly coupled to Map Reduce Job
Tracker Job Tracker Functionality in V1
Cluster resource management Application life-cycle Management (Job Scheduling/Re-scheduling/Monitoring)
Can only run Map Reduce applications, poor utilization of cluster
Need to run other kinds of applications – Real-time, Graph, Messaging, etc.
Scalability & single point of failure (Name Node & Job Tracker)
Lack of wire compatibility protocols
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.16
Hadoop MRV2 or YARN
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.17
Hadoop MRV2 or YARN
Job Tracker in V1 split into Global Resource Manager
Application Master per Job Request
Node Manager
Application - classic MR job or a DAG of jobs
Resource Manager
Node Manager
ContainerApp
Master
Node Manager
ContainerApp
Master
Node Manager
Container Container
Client Client
Job StatusJob SubmissionNode StatusResource Request
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.18
Hadoop MRV2 or YARN
Resource Manager
Resource Manager Ultimate authority for managing and scheduling resources in
cluster
Works with the Node Manager to track and utilize available containers
Container is the unit of resource in YARN. E.g. 2 Cores & 2 GB memory, Disk, etc.
Accepts Jobs from clients and delegates it to an Application Master
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.19
Hadoop MRV2 or YARN
App Master
Application Master Negotiates the required resources for job with RM
Tracks job status and monitors progress
By shifting job control to App Master in local slaves, YARN provides better scale out and fault tolerance
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.20
Big Data Overview - Next