Transcript
Page 1: Hadoop and Big Data Overview

Copyright © 2014, Oracle and/or its affiliates. All rights reserved.

Prabhu Thukkaram

Director, Product Development, Oracle Complex Processing & SOA Suite
Feb 28, 2014

Page 2: Hadoop and Big Data Overview


What is Big Data?

[Diagram: the Hadoop stack]
• HDFS – self-healing, clustered storage system
• Map Reduce – distributed data processing
• HBase – columnar DB
• Pig, Hive – higher-level abstraction
• ETL Tools, BI Reporting – top-level interfaces
• Data sources – structured data (Excel, etc.) brought in via Sqoop; unstructured and semi-structured data (web logs, images, etc.)
• ZooKeeper – coordination service

Page 3: Hadoop and Big Data Overview


HBase Quick Overview
Relational Database – Product Table

Product Id | Price | SKU     | Inventory Count
0001       | 1300  | SKU0001 | 10
0002       | 2800  | SKU0002 | 25
0003       | 5600  | SKU0003 | 8

• Ideal for OLTP transactions: faster writes and record updates
• But slow for OLAP, e.g. Select sum(InventoryCount) from Products;
• Reason: data for the column "InventoryCount" is not contiguous

Page 4: Hadoop and Big Data Overview


HBase Quick Overview
HBase – Product Table

Product Id      | 0001    | 0002    | 0003
Price           | 1300    | 2800    | 5600
SKU             | SKU0001 | SKU0002 | SKU0003
Inventory Count | 10      | 25      | 8

• Extremely fast for OLAP, e.g. Select sum(InventoryCount) from Products; (a Java client sketch follows below)
• Supports big data analytics – a single table with a million columns and a billion rows

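For concreteness, here is a minimal sketch of the OLAP query above using the HBase Java client API. The column-family name "d" and the string encoding of Inventory Count are assumptions made for illustration; they are not taken from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class InventorySum {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table products = conn.getTable(TableName.valueOf("Products"))) {
      // Scan only the InventoryCount column; the other columns are never read,
      // which is what makes the columnar layout fast for this aggregate.
      Scan scan = new Scan();
      scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("InventoryCount"));
      long sum = 0;
      try (ResultScanner rows = products.getScanner(scan)) {
        for (Result row : rows) {
          byte[] value = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("InventoryCount"));
          sum += Long.parseLong(Bytes.toString(value)); // assumes counts stored as strings
        }
      }
      System.out.println("sum(InventoryCount) = " + sum);
    }
  }
}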

Page 5: Hadoop and Big Data Overview


Hadoop Cluster

[Diagram: Master node running the Job Tracker and Name Node; Slave 1 and Slave 2 each running a Task Tracker and a Data Node; Hadoop Shell/CLI clients connect to the cluster.]

Name Node
• Does not store data; maintains the directory tree of all files in the cluster
• Tracks the data blocks of a file across the cluster
• Client apps such as the Hadoop Shell/CLI talk to the Name Node to locate, create, move, rename, and delete a file (see the sketch after this list)
• Returns a list of Data Node servers where the data lives
• Single point of failure, addressed in Hadoop V2 (YARN)

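A rough client-side sketch of the Name Node operations listed above, using the standard org.apache.hadoop.fs.FileSystem API; the paths under /user/demo are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeOps {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS in core-site.xml points the client at the Name Node.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Locate: list entries from the directory tree kept by the Name Node.
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath() + " " + status.getLen());
    }

    // Rename and delete are metadata operations handled by the Name Node.
    fs.rename(new Path("/user/demo/File.txt"), new Path("/user/demo/File-old.txt"));
    fs.delete(new Path("/user/demo/File-old.txt"), false);

    fs.close();
  }
}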

Page 6: Hadoop and Big Data Overview


Hadoop/HDFS cluster

[Diagram: Hadoop cluster as on Page 5.]

Data Node
• Stores and replicates data on the file system
• Connects to the Name Node on startup and responds to file-system operations
• Hadoop Shell/CLI clients can talk directly to a Data Node if they know the location of the data (a read sketch follows below)
• Data Nodes talk to each other when replicating data

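A minimal read sketch, again with the FileSystem API: open() consults the Name Node for block locations, and the bytes are then read directly from the Data Nodes holding them. The path is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromDataNodes {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() asks the Name Node for block locations; the bytes themselves
    // are streamed directly from the Data Nodes that hold the blocks.
    try (FSDataInputStream in = fs.open(new Path("/user/demo/File.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}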

Page 7: Hadoop and Big Data Overview


Writing a file to HDFS

[Diagram: a Hadoop client wants to write Blocks A, B, and C of File.txt; the Name Node replies "OK, write to Data Nodes 1, 5, 6"; the client writes Block A to one Data Node, which then replicates Block A to the other Data Nodes.]

• Client consults the Name Node
• Client writes the block directly to one Data Node
• That Data Node replicates the block
• Client writes the next block and the cycle repeats (a client-side write sketch follows below)
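A sketch of the same flow from the client's point of view: the client only writes to a stream, while HDFS handles block placement and the replication pipeline described above. The path and replication factor are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The client sees a single output stream; HDFS splits it into blocks,
    // asks the Name Node where each block should go, and the receiving
    // Data Node forwards the block along the replication pipeline.
    Path target = new Path("/user/demo/File.txt");
    short replication = 3; // matches the usual dfs.replication default
    try (FSDataOutputStream out = fs.create(target, replication)) {
      out.writeBytes("hello hdfs\n");
    }
    fs.close();
  }
}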

Page 8: Hadoop and Big Data Overview


Hadoop/HDFS cluster

[Diagram: Hadoop cluster as on Page 5.]

Job Tracker
• Accepts MR jobs from clients
• Contacts and submits MR tasks to the Task Trackers on the located Data Nodes
• Monitors Task Tracker nodes for heartbeats/failures and resubmits tasks to a different Task Tracker as needed
• Updates the status of a job when complete
• Single point of failure, fixed in MRv2 (YARN)
Note: jobs run as a batch, and clients can retrieve the status by querying the Job Tracker (see the sketch below).

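A sketch of a client querying the Job Tracker for the status of a previously submitted batch job, using the MRv1 JobClient API; the job id shown is hypothetical.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class PollJobTracker {
  public static void main(String[] args) throws Exception {
    // Connects to the Job Tracker configured in mapred-site.xml (MRv1).
    JobClient client = new JobClient(new JobConf());
    // Hypothetical id of a batch job submitted earlier.
    RunningJob job = client.getJob(JobID.forName("job_201402280001_0001"));
    if (job == null) {
      System.out.println("job not found");
      return;
    }
    while (!job.isComplete()) {
      System.out.printf("map %.0f%%, reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
  }
}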

Page 9: Hadoop and Big Data Overview


Hadoop/HDFS cluster

[Diagram: Hadoop cluster as on Page 5.]

Task Tracker
• Accepts map and reduce tasks from the Job Tracker
• A predefined set of slots determines the number of tasks it can accept
• Spawns a separate JVM process for each task
• Notifies the Job Tracker when the process finishes

Page 10: Hadoop and Big Data Overview


Map Reduce Example

• The user submits an MR job/jar for input file URL hdfs://File.txt to the Job Submitter on the client JVM. The Job Submitter contacts the Job Tracker to obtain a Job Id.
• The Job Tracker creates Map tasks based on the number of input splits. The number of Reduce tasks is defined by the job itself, either configured or set via the API call setNumReduceTasks().
• The client contacts the Name Node to compute the input splits, copies the job jar, computed input splits, etc. to the Job Tracker's file system under a directory named after the Job Id, and submits the job. The Job Tracker adds the job to its queue. (A minimal driver sketch follows below.)
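A minimal driver sketch of this submission path, using the org.apache.hadoop.mapreduce.Job API. It uses stock identity Mapper/Reducer classes so the block stays self-contained; the HDFS paths are hypothetical, and real map/reduce logic is sketched on the next page.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DemoDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "demo");
    job.setJarByClass(DemoDriver.class);   // jar is copied to the Job Tracker's file system

    // Identity mapper/reducer just to show the submission mechanics.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Input splits are computed from this path with help from the Name Node.
    FileInputFormat.addInputPath(job, new Path("/user/demo/File.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));

    job.setNumReduceTasks(2);              // number of reduce tasks is set by the job
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}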

Page 11: Hadoop and Big Data Overview


Map Reduce Example

• The Map task splits records and passes each record to the user's Map logic code.
• In the above example, the user's Map logic tokenizes each record to generate one or more key-value pairs.
• The output of a Map task is partitioned according to the defined number of reducers, shuffled, and sorted. Each partition's output is then routed to its Reducer.
• The Reducer merges partition output from the other Map tasks in the cluster and calls the user's reduce logic.
• M/R guarantees that the input to every Reducer is sorted by key. (A word-count style Mapper/Reducer sketch follows below.)
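A classic word-count style sketch of the user's Map and Reduce logic described above; the class names are illustrative, and they would be wired into a driver via setMapperClass()/setReducerClass().

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map logic: tokenize each record into (word, 1) key-value pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(record.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce logic: all values for a key arrive together, sorted by key,
  // so the reducer simply sums the counts for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}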

Page 12: Hadoop and Big Data Overview


Map Reduce – Under the Hood

Source: Hadoop, The Definitive Guide.

[Diagram: within a single Map task, a data block is split into records, passed through the user's Map logic, and the output is sorted and partitioned; within a single Reduce task, partitions from this and other Map tasks are merged and passed to the user's Reduce logic, while other partitions are routed to other Reducers.]
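To make the "sorted & partitioned" step concrete, here is a small sketch of a Partitioner; it mirrors the behaviour of Hadoop's default hash partitioning, so the same key always goes to the same Reducer.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Hash the key into one of numReduceTasks partitions; each partition
    // is routed to a single Reducer.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}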

Page 13: Hadoop and Big Data Overview


Map Reduce – Summary

Map – The process of organizing entity data as key-value pairs. The key could be a Customer Id, Purchase Order Id, etc.

M/R Framework – Ensures all data relevant to an entity/key is delivered to a single Reducer.

Reduce – The process of aggregating data related to an entity and deducing information.

Page 14: Hadoop and Big Data Overview


Map Reduce – Risk Analysis

Risk analysis: a large number of CC txns and no direct deposit into checking over the last two months implies the customer is unemployed and at high risk of defaulting.

[Diagram: HDFS holds a global view of the customer (credit card txns, chat sessions, checking account deposits and withdrawals); the Map phase feeds this data into the Reduce phase, which produces a risk score.]

• Gathers all data (CC txns, chats, withdrawals, etc.) pertaining to a single customer.
• Data in HDFS changes frequently, hence the need to re-evaluate risk using batch jobs. Re-evaluated results are written to a DB to enable business decisions, e.g. whether to approve a new credit card or loan. (A hypothetical map/reduce sketch follows below.)
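A purely hypothetical sketch of how such a job might look. The record format (a CSV line of customer id, record type, detail) and the scoring rule are invented for illustration and are not part of the slides.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RiskAnalysis {

  // Map: key every record by customer id so the framework delivers all of a
  // customer's data (CC txns, chats, deposits, ...) to a single Reducer.
  public static class CustomerKeyMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed line format: "customerId,recordType,detail", e.g. "c42,CC_TXN,129.00".
      String[] fields = line.toString().split(",", 2);
      if (fields.length == 2) {
        context.write(new Text(fields[0]), new Text(fields[1]));
      }
    }
  }

  // Reduce: aggregate the customer's records and deduce a simple risk score.
  public static class RiskScoreReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text customerId, Iterable<Text> records, Context context)
        throws IOException, InterruptedException {
      long ccTxns = 0;
      boolean hasDirectDeposit = false;
      for (Text record : records) {
        if (record.toString().startsWith("CC_TXN")) ccTxns++;
        if (record.toString().startsWith("DEPOSIT")) hasDirectDeposit = true;
      }
      // Mirrors the slide's rule: many CC txns and no deposits => high risk.
      String risk = (ccTxns > 100 && !hasDirectDeposit) ? "HIGH" : "NORMAL";
      context.write(customerId, new Text(risk));
    }
  }
}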

Page 15: Hadoop and Big Data Overview


Hadoop V1 – Limitations
• Cluster resource management is tightly coupled to the Map Reduce Job Tracker
• Job Tracker functionality in V1: cluster resource management plus application life-cycle management (job scheduling, re-scheduling, monitoring)
• Can only run Map Reduce applications, leading to poor utilization of the cluster
• Need to run other kinds of applications: real-time, graph, messaging, etc.
• Scalability and single points of failure (Name Node & Job Tracker)
• Lack of wire-compatible protocols

Page 16: Hadoop and Big Data Overview


Hadoop MRV2 or YARN

Page 17: Hadoop and Big Data Overview


Hadoop MRV2 or YARN

The Job Tracker in V1 is split into:
• a global Resource Manager
• an Application Master per job request
• a Node Manager on each node
An application is a classic MR job or a DAG of jobs.

[Diagram: clients submit jobs to the Resource Manager; Node Managers report node status; App Masters run in containers on Node Managers and send resource requests to the Resource Manager; the remaining containers run application tasks. Arrows: job submission, job status, node status, resource request.]

Page 18: Hadoop and Big Data Overview


Hadoop MRV2 or YARN

Resource Manager
• Ultimate authority for managing and scheduling resources in the cluster
• Works with the Node Manager to track and utilize available containers
• A container is the unit of resource in YARN, e.g. 2 cores & 2 GB memory, disk, etc. (see the sketch below)
• Accepts jobs from clients and delegates them to an Application Master
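A hedged sketch of how an Application Master might express the "2 cores & 2 GB" container request using the YARN AMRMClient API; registration with the Resource Manager and the allocate() heartbeat loop are omitted for brevity.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RequestContainer {
  public static void main(String[] args) throws Exception {
    // An Application Master uses this client to talk to the Resource Manager.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // A container is the unit of resource: here 2 GB of memory and 2 vcores.
    Resource capability = Resource.newInstance(2048, 2);
    ContainerRequest request =
        new ContainerRequest(capability, null, null, Priority.newInstance(0));
    rmClient.addContainerRequest(request);

    // A real AM would first call registerApplicationMaster(...), then call
    // allocate() in a heartbeat loop and launch tasks in the granted containers.
    rmClient.stop();
  }
}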

Page 19: Hadoop and Big Data Overview


Hadoop MRV2 or YARN

Application Master
• Negotiates the required resources for the job with the RM
• Tracks job status and monitors progress
• By shifting job control to an App Master on the local slaves, YARN provides better scale-out and fault tolerance

Page 20: Hadoop and Big Data Overview


Big Data Overview - Next

