Big Data Overview & Hadoop for DBA’s

Satyendra Pasalapudi, Associate Practice Director, Apps Associates LLC

© Copyright 2016. Apps Associates LLC.




www.ora-search.com


History of Data Management Systems

[Figure: timeline of data management systems by decade: 1940-50, 1950-60, 1960-70, 1970-80, 1980-90, 1990-2000, 2000-2010]

• Pre-computer technologies: printing press, Dewey decimal system, punched cards
• Storage and access: magnetic tape, “flat” (sequential) files, magnetic disk, Indexed-Sequential Access Mechanism (ISAM)
• Data models: hierarchical model, network model, relational model defined
• Systems shown: IMS, IDMS, ADABAS, System R, Oracle V2, Ingres, dBase, DB2, Informix, Sybase, SQL Server, Access, Postgres, MySQL, Cassandra, Hadoop, Vertica, Riak, HBase, Dynamo, MongoDB, Redis, VoltDB, Hana, Neo4J, Aerospike


The Role of Data is Changing


Until now, the questions you asked drove the data model.

The new model is to collect as much data as possible: the “Data-First Philosophy”.


Data is the new raw material for any business, on par with capital, people, labor


Characteristics of Big Data


Big Data Challenge

Cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming

Data sources: ERP, CRM, RFID, Website, Network Switches, Social Media, Billing


Hybrid Cloud Framework

[Figure labels: HR, FIN, SCOM, SALES, PROCUREMENT, PLANNING, DW / BI]


Big data Eco System


Not Easy to Get Analytic Value at Fast Enough Pace

Data Uncertainty
• Not familiar and overwhelming
• Potential value not obvious
• Requires significant manipulation

Tool Complexity
• Early Hadoop tools only for experts
• Existing BI tools not designed for Hadoop
• Emerging solutions lack broad capabilities

Consequences
• 80% effort typically spent on evaluating and preparing data
• Overly dependent on scarce and highly skilled resources

Source : Oracle


Key Challenges in Managing Big Data

Source: Informatica Study, May 2013

Addressed by Oracle Big Data Discovery


Sample of Big Data Use Cases Today

• MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness; Cross Sell
• COMMUNICATIONS: Location-based advertising
• EDUCATION & RESEARCH: Experiment sensor analysis
• RETAIL / CPG: Sentiment analysis; Hot products; Optimized Marketing
• HEALTH CARE: Patient sensors, monitoring, EHRs; Quality of care
• LIFE SCIENCES: Clinical trials; Genomics
• HIGH TECHNOLOGY / INDUSTRIAL MFG.: Mfg quality; Warranty analysis
• OIL & GAS: Drilling exploration sensor analysis
• FINANCIAL SERVICES: Risk & portfolio analysis; New products
• AUTOMOTIVE: Auto sensors reporting location, problems
• GAMES: Adjust to player behavior; In-Game Ads
• LAW ENFORCEMENT & DEFENSE: Threat analysis - social media monitoring, photo analysis
• TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; Customer sentiment
• UTILITIES: Smart Meter analysis for network capacity,
• ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; Web-site optimization

What is the main difference in this data? Volume, Velocity, Variety

These Characteristics Challenge Your Existing Architecture


Big Data Verticals

• Media/Advertising: Targeted Advertising; Image and Video Processing
• Oil & Gas: Seismic Analysis
• Retail: Recommend; Transactions Analysis
• Life Sciences: Genome Analysis
• Financial Services: Monte Carlo Simulations; Risk Analysis
• Security: Anti-virus; Fraud Detection; Image Recognition
• Social Network/Gaming: User Demographics; Usage analysis; In-game metrics


Sample Enterprise Big Data Architecture

[Figure: components of a sample enterprise big data architecture]

• ERP & in-house CRM
• Web Server
• Operational RDBMS (Oracle, SQL Server, …)
• Web DBMS (MySQL, Mongo, Cassandra)
• Hadoop
• In-memory processing (Spark)
• Data Warehouse RDBMS (Oracle, Teradata …)
• In-memory Analytics (HANA, Exalytics …)
• Analytic/BI software (SAS, Tableau)


Enterprise Data Hub / Data Lake / Data Reservoir


Hadoop Data Reservoir Momentum

• Hadoop Revenue and Forecast: 49% CAGR, 2013-2018
• Big Data Infrastructure Market: $20.7b in 2018
• Big Data Software Market: $9b in 2018

[Figure labels: Existing Sources, Emerging Sources, Data Reservoir, Data Warehouse]

Source : Oracle


Traditional Systems Under Pressure

• Silos of Data
• Costly to Scale
• Constrained Schemas
• …and difficult to manage new data

[Figure: three layers]

• APPLICATIONS: Business Analytics; Custom Applications; Packaged Applications
• DATA SYSTEM: RDBMS; EDW; MPP
• SOURCES: Existing Sources (CRM, ERP,…)
• New Data Types: Clickstream; Geolocation; Sentiment, Web Data; Sensor, Machine Data (IoT); Unstructured docs, emails; Server logs


Hadoop Enabled the Modern Data Architecture

Common Data Set, Multiple Applications
• Optionally land all data in a single cluster
• Batch, interactive & real-time use cases
• Support multi-tenant access, processing & segmentation of data

YARN: Architectural Center of Hadoop
• Consistent security, governance & operations
• Ecosystem applications run natively in Hadoop

[Figure: three layers]

• APPLICATIONS: Business Analytics; Custom Applications; Packaged Applications
• DATA SYSTEM: RDBMS, EDW, MPP; YARN: Data Operating System (Interactive, Real-Time, Batch) on HDFS (Hadoop Distributed File System), nodes 1 … N
• SOURCES: EXISTING Systems; Clickstream; Web & Social; Geolocation; Sensor & Machine; Server Logs; Unstructured


Big Data Footprint & Scope of Architecture

[Figure: four layers]

• BIG DATA APPLICATIONS: BY INDUSTRY & LINE OF BUSINESS
• BUSINESS ANALYTICS: DISCOVERY; BUSINESS ANALYTICS
• BIG DATA MANAGEMENT: DATA RESERVOIR; DATA WAREHOUSE
• SOURCES


Architecture Vision: Common Emerging Platform Pattern

[Figure labels]

• Data Warehouse / Data Marts
• Business Intelligence Tools
• ERP, CRM & Other Transactional Apps
• Historic Source of Truth
• Reporting, Query and Analysis Tools
• Information Discovery Engine
• Advanced Analytics
• Website Logs & Data
• NoSQL DB
• Sensors
• Hadoop High Volume Distributed File System
• Structured Data
• Semi-structured Data
• Real-Time Analytics and Recommendations
• Recommend Location & User Profile
• R, SAS
• Discoveries


Potential Oracle Products in the Footprint

[Figure labels]

• Endeca Information Discovery on Exalytics
• Cloudera HDFS on Big Data Appliance
• Reliable, Available, Secure Source of Truth
• Fast, Intuitive Data Discovery
• Website Logs & Data
• Oracle NoSQL DB
• Real-time Recommendations
• Analyst Friendly Reporting Query & Analysis Tools
• Unstructured Data Analysis
• Sensors
• Oracle Database DW on Exadata
• Oracle BI Foundation Suite, Hyperion on Exalytics
• Oracle ERP & CRM Solutions on Exadata
• Oracle Real-Time Decisions
• Structured Data Analysis
• Big Data Connectors
• ODI
• OEP
• Advanced Analytics, In-Memory, Big Data SQL
• R, SAS


Oracle’s Unified Information Management


Oracle Big Data Management System

[Figure labels]

• SOURCES
• Oracle Database
• Oracle Industry Models
• Oracle Advanced Analytics
• Oracle Spatial & Graph
• Big Data Appliance
• Cloudera Hadoop
• Oracle NoSQL Database
• Oracle R Advanced Analytics for Hadoop
• Oracle R Distribution
• Oracle Database
• Oracle Advanced Security
• Oracle Advanced Analytics
• Oracle Spatial & Graph
• Oracle Exadata
• Oracle Big Data Connectors
• Oracle Data Integrator
• Oracle Big Data SQL


Oracle Big Data Management System


We Need Tools Built Specifically for Big Data


Hadoop and its Ecosystem

• Scale out Easily

• Parallel Computing

• Commodity Hardware

• Solves some Problems

• Complex to Run

• Special Skills to Maintain

Cassandra


ETL for Unstructured Data


ETL for Structured Data


Hadoop Design Principles

• System shall manage and heal itself

– Automatically and transparently route around failure

– Speculatively execute redundant tasks if certain nodes are detected to be slow

• Performance shall scale linearly

– Proportional change in capacity with resource change

• Compute should move to data

– Lower latency, lower bandwidth

• Simple core, modular and extensible


Hadoop History

• Dec 2004 – Google MapReduce paper published

• July 2005 – Nutch uses MapReduce

• Feb 2006 – Starts as a Lucene subproject

• Apr 2007 – Yahoo! on 1000-node cluster

• Jan 2008 – An Apache Top Level Project

• Jul 2008 – A 4000 node test cluster

• May 2009 – Hadoop sorts Petabyte in 17 hours


Google File System (GFS)

Map Reduce BigTable

Google Applications

Google Software Architecture (circa 2005)


[Figure: MapReduce data flow. From Start, many Map tasks run in parallel and their outputs feed the Reduce tasks.]
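The map-then-reduce flow in this figure can be sketched in a few lines of plain Python. This is only an illustration of the programming model on a single machine, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key so each key reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different node; here they run in sequence.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data node"])
```

On a real cluster the map calls would run on the nodes that hold the input blocks, and the shuffle would move data over the network; the sketch keeps only the logical steps.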


Hadoop Ecosystem

[Figure: Hadoop ecosystem stack]

• HDFS (Hadoop Distributed File System)
• HBase (key-value store)
• MapReduce (Job Scheduling/Execution System), with Streaming/Pipes APIs
• Data Access
• Sqoop
• Flume
• Client Access
• Hue
• Hive (SQL)
• Pig (PL/SQL)
• ZooKeeper (Coordination)
• Chukwa (Monitoring)
• Data Mining: Mahout
• Orchestration: Oozie
• Java Virtual Machine
• OS – Redhat, Suse, Ubuntu, Windows
• Commodity Hardware
• Networking


Hadoop – Simplified View

• MPP (Massively Parallel) hardware running database-like software

• “Data” is stored in parts, across multiple worker nodes

• “Work” operates in parallel, on the different parts of the table

Controller Worker Nodes
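A minimal sketch of this controller/worker idea, assuming threads stand in for worker nodes and a simple sum stands in for the "work". The names `partition`, `worker_sum` and `parallel_total` are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_workers):
    """Controller side: spread the rows across workers round-robin."""
    parts = [[] for _ in range(num_workers)]
    for index, row in enumerate(rows):
        parts[index % num_workers].append(row)
    return parts


def worker_sum(part):
    """Worker side: each worker only touches its own part of the data."""
    return sum(part)


def parallel_total(rows, num_workers=4):
    parts = partition(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_sum, parts))
    # The controller combines the partial results from all workers.
    return sum(partial_sums)


total = parallel_total(list(range(1, 101)))
```

The point of the design is that adding workers shrinks each part, so the per-worker work drops while the combining step stays small.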



HDFS Architecture

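[Figure: HDFS architecture. Clients send metadata ops to the Namenode, which holds the metadata (Name, replicas..), for example (/home/foo/data,6. .. Clients read and write blocks on Datanodes spread across Rack1 and Rack2. The Namenode issues block ops to the Datanodes, and blocks are replicated between Datanodes.]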


HDFS – Highly Available

[Figure: a Head Node and four data nodes (Data 1 to Data 4). The Head Node maps MYFILE.TXT to block1, block2 and block3, and each entry points to the matching block stored on the data nodes.]


Namenode and Datanodes

Master/slave architecture

An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.

There are a number of DataNodes, usually one per node in the cluster.

The DataNodes manage storage attached to the nodes that they run on.

HDFS exposes a file system namespace and allows user data to be stored in files.

A file is split into one or more blocks, and these blocks are stored in a set of DataNodes.

DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the Namenode.
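The split between metadata and data can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the HDFS API: the classes `NameNode` and `DataNode`, the block id format, and the round-robin placement are all hypothetical, and replication is left out.

```python
class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Holds only metadata: the namespace and where each block lives."""

    def __init__(self):
        self.namespace = {}  # path -> ordered list of block ids
        self.locations = {}  # block id -> DataNode that stores it

    def add_file(self, path, chunks, datanodes):
        block_ids = []
        for index, chunk in enumerate(chunks):
            block_id = f"{path}#blk{index}"
            node = datanodes[index % len(datanodes)]  # assumed placement rule
            node.blocks[block_id] = chunk             # data lives on the DataNode
            self.locations[block_id] = node           # NameNode records metadata only
            block_ids.append(block_id)
        self.namespace[path] = block_ids

    def block_locations(self, path):
        return [(block_id, self.locations[block_id]) for block_id in self.namespace[path]]


def read_file(namenode, path):
    # Metadata comes from the NameNode; the bytes come from the DataNodes.
    return b"".join(node.blocks[block_id] for block_id, node in namenode.block_locations(path))


nodes = [DataNode("dn1"), DataNode("dn2")]
namenode = NameNode()
namenode.add_file("/home/foo/data", [b"abc", b"def", b"g"], nodes)
content = read_file(namenode, "/home/foo/data")
```

In this sketch the NameNode never stores file bytes, which mirrors why a single NameNode can serve a cluster with many DataNodes.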


Hadoop 1 – Job & Task Trackers

Master Node - The majority of Hadoop deployments consist of several master node instances. Having more than one master node helps eliminate the risk of a single point of failure.

NameNode - These processes are charged with storing a directory tree of all files in the Hadoop Distributed File System (HDFS). They also keep track of where the file data is kept within the cluster. Client applications contact NameNodes when they need to locate a file, or to add, copy, or delete a file.

DataNodes - The DataNode stores data in HDFS and is responsible for replicating data across the cluster. DataNodes interact with client applications once the NameNode has supplied the DataNode's address.

WorkerNode - Unlike master nodes, whose numbers we can count on one hand, a representative Hadoop deployment consists of dozens or hundreds of worker nodes, which provide enough processing power to analyze a few hundred terabytes all the way up to one petabyte. Each worker node includes a DataNode as well as a Task Tracker.


Map Reduce

Job Tracker / MapReduce Workload Management Layer - This process is assigned to interact with client applications. It is responsible for distributing MapReduce tasks to particular nodes within a cluster. This engine coordinates all aspects of Hadoop such as scheduling and launching jobs.

Task Tracker - This is a process in the cluster that is capable of receiving tasks (including Map, Reduce, and Shuffle) from a Job Tracker.


Data Replication Similar to that of ASM

HDFS is designed to store very large files across machines in a large cluster.

Each file is a sequence of blocks.

All blocks in the file except the last are of the same size.

Blocks are replicated for fault tolerance.

Block size and replicas are configurable per file.

The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.

BlockReport contains all the blocks on a Datanode.


Replica Placement & Rack Aware

The placement of the replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from other distributed file systems.

Rack-aware replica placement:

• Goal: improve reliability, availability and network bandwidth utilization

• There are many racks, and communication between racks goes through switches.

• Network bandwidth between machines on the same rack is greater than between machines on different racks.

• The Namenode determines the rack id for each DataNode.

• Placing every replica on a unique rack is simple but non-optimal: writes are expensive.

• With a replication factor of 3, replicas are placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.
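The three-replica placement described on this slide can be sketched as a small function. This is an illustration only: `place_replicas` and the `topology` dictionary are hypothetical names, the sketch always picks the first eligible node, and it assumes the local rack has at least two nodes and that a second rack exists.

```python
def place_replicas(writer, topology):
    """Pick three nodes for one block.

    topology maps a rack id to the list of node names on that rack;
    writer is the node that is writing the block.
    """
    # Rack that contains the writer (the "local rack").
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # Second replica: a different node on the same rack (cheap, intra-rack traffic).
    same_rack_peer = next(node for node in topology[local_rack] if node != writer)
    # Third replica: a node on another rack, so a rack failure cannot lose the block.
    remote_rack = next(rack for rack in topology if rack != local_rack)
    remote_node = topology[remote_rack][0]
    return [writer, same_rack_peer, remote_node]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", topology)
```

Only one of the three copies crosses a rack boundary, which is the trade-off between write cost and rack-level fault tolerance that the slide describes.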


Replica Selection

• Replica selection for READ operation: HDFS tries to minimize the bandwidth consumption and latency.

• If there is a replica on the Reader node then that is preferred.

• HDFS cluster may span multiple data centers: replica in the local data center is preferred over the remote one.
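One way to picture this preference order is a distance function over node locations. The sketch below is an assumption-laden illustration: locations are modeled as `(data_center, rack, node)` tuples and the distance values 0, 2, 4 and 6 are arbitrary; only their ordering matters.

```python
def distance(reader, replica):
    """Smaller means closer. Locations are (data_center, rack, node) tuples."""
    if reader == replica:
        return 0  # replica is on the reader node itself
    if reader[:2] == replica[:2]:
        return 2  # same rack, different node
    if reader[0] == replica[0]:
        return 4  # same data center, different rack
    return 6      # remote data center


def choose_replica(reader, replicas):
    """Return the replica with the lowest distance to the reader."""
    return min(replicas, key=lambda replica: distance(reader, replica))


reader = ("dc1", "r1", "n1")
candidates = [("dc2", "r9", "n7"), ("dc1", "r2", "n5"), ("dc1", "r1", "n2")]
chosen = choose_replica(reader, candidates)
```

With these candidates the same-rack replica wins over the one on another rack and the one in the remote data center.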


Hadoop Components

• Hadoop is bundled with two independent components

– HDFS (Hadoop Distributed File System)

• Designed for scaling in terms of storage and IO bandwidth

– MR framework (MapReduce)

• Designed for scaling in terms of performance


Understanding file structure

• Example: a 1 GB file

• The file is split into blocks

• Each block is typically 64MB

• Each block is stored as two files: one holding the data and a second holding metadata and the checksum
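The arithmetic behind this slide is worth spelling out. The sketch below assumes the 64MB block size from the slide and a replication factor of 3; `split_into_blocks` is a hypothetical helper, not an HDFS call.

```python
def split_into_blocks(file_size, block_size=64 * 1024**2):
    """Return the size in bytes of each block for a file of file_size bytes.

    All blocks are full except possibly the last one, which only
    takes the space it actually needs.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


one_gb = 1024**3
blocks = split_into_blocks(one_gb)
replicas_stored = len(blocks) * 3  # assumed replication factor of 3

small_file = split_into_blocks(100 * 1024**2)  # a 100MB file
```

A 1 GB file therefore becomes 16 blocks, or 48 stored block replicas at a replication factor of 3, while a 100MB file becomes one full 64MB block plus one 36MB block.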


Hadoop Processes

• Processes running on Hadoop

– NameNode

– DataNode

– Secondary NameNode

– Task Tracker

– Job Tracker


NameNode

• Single point of contact

• HDFS master

• Holds meta information

– List of files and directories

– Location of blocks

• Single node per cluster

– Cluster can have thousands of DataNodes and tens of thousands of HDFS clients.


DataNode

• Can execute multiple tasks concurrently

• Holds actual data blocks, checksum and generation stamp

• If block is half full, needs only half of the space of full block

• At start-up, connects to the NameNode and performs a handshake

• No binding to IP address or port, uses Storage ID

• Sends heartbeat to NameNode

DataNode Storage ID: XYZ001


Communication

[Figure: heartbeat exchange between the NameNode and DataNodes with Storage IDs XYZ001, XYZ002 and XYZ003]

Heartbeat: every 3 seconds, each DataNode tells the NameNode “I AM ALIVE” and reports:

• Total storage capacity
• Fraction of storage in use
• Number of data transfers currently in progress

Reply: the NameNode instructs the DataNode to:

• Replicate block to other node
• Remove local block replica
• Send immediate block report
• Shut down the node

No heartbeat for 10 minutes: the NameNode treats the DataNode as out of service.
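The heartbeat bookkeeping on this slide can be sketched as a small class. This is a toy model, not NameNode code: `HeartbeatMonitor`, its method names, and the command strings are hypothetical, timestamps are passed in explicitly as seconds, and treating a node as failed after 10 silent minutes is the assumption taken from the slide.

```python
DEAD_AFTER_SECONDS = 10 * 60  # "No heartbeat for 10 minutes"


class HeartbeatMonitor:
    """Tracks the last heartbeat per Storage ID and queues reply commands."""

    def __init__(self):
        self.last_seen = {}  # storage id -> time of the last heartbeat
        self.stats = {}      # storage id -> (capacity, used fraction, transfers)
        self.pending = {}    # storage id -> commands to send with the next reply

    def queue_command(self, storage_id, command):
        self.pending.setdefault(storage_id, []).append(command)

    def heartbeat(self, storage_id, now, capacity, used_fraction, transfers):
        """Record one heartbeat and return the reply (queued commands, if any)."""
        self.last_seen[storage_id] = now
        self.stats[storage_id] = (capacity, used_fraction, transfers)
        return self.pending.pop(storage_id, [])

    def silent_nodes(self, now):
        """Storage IDs that have not sent a heartbeat within the timeout."""
        return sorted(
            storage_id
            for storage_id, seen in self.last_seen.items()
            if now - seen > DEAD_AFTER_SECONDS
        )


monitor = HeartbeatMonitor()
monitor.queue_command("XYZ001", "send_block_report")
reply = monitor.heartbeat("XYZ001", now=0, capacity=1000, used_fraction=0.4, transfers=2)
monitor.heartbeat("XYZ002", now=598, capacity=1000, used_fraction=0.1, transfers=0)
silent = monitor.silent_nodes(now=601)
```

Note that the NameNode never calls the DataNode directly in this model; instructions ride back on the heartbeat reply, as the slide's "Reply" arrow suggests.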


Coordination in a distributed system

• Coordination: An act that multiple nodes must perform together.

• Examples:

– Group membership

– Locking

– Publisher/Subscriber

– Leader Election

– Synchronization

• Getting node coordination correct is very hard!


Introducing ZooKeeper

“ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers.” - ZooKeeper Wiki

ZooKeeper is much more than a distributed lock server!


What is ZooKeeper?

• An open source, high-performance coordination service for distributed applications.

• Exposes common services through a simple interface, so developers don't have to write them from scratch:

– naming

– configuration management

– locks & synchronization

– group services

• Build your own on it for specific needs.
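To make "build your own on it" concrete, here is a toy in-memory namespace with a leader election built on top. It is not the ZooKeeper client API: `TinyRegistry`, its methods, and the `/election/n_` path are hypothetical, and the election rule (lowest sequence number wins) is a simplified version of the commonly described recipe.

```python
class TinyRegistry:
    """In-memory stand-in for a hierarchical namespace of data registers."""

    def __init__(self):
        self.nodes = {"/": b""}
        self.counter = 0

    def create(self, path, data=b"", sequential=False):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if sequential:
            # Append an increasing, zero-padded sequence number to the name.
            path = f"{path}{self.counter:010d}"
            self.counter += 1
        if path in self.nodes:
            raise ValueError(f"{path} already exists")
        self.nodes[path] = data
        return path

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        names = (p[len(prefix):] for p in self.nodes if p.startswith(prefix))
        return sorted(name for name in names if name and "/" not in name)

    def delete(self, path):
        del self.nodes[path]


def leader(registry, election_path):
    """The candidate with the lowest sequence number is the leader."""
    candidates = registry.children(election_path)
    if not candidates:
        return None
    return registry.nodes[f"{election_path}/{candidates[0]}"]


registry = TinyRegistry()
registry.create("/election")
first = registry.create("/election/n_", b"worker-a", sequential=True)
second = registry.create("/election/n_", b"worker-b", sequential=True)
leader_before = leader(registry, "/election")
registry.delete(first)  # simulate the current leader going away
leader_after = leader(registry, "/election")
```

A real coordination service would also remove a candidate's node automatically when its session ends and notify the other candidates; the sketch leaves both out.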


HDFS Distributions


Real Time BI

• Speed, agility, and intelligence are competitive advantages that nearly all organizations seek.

• Existing Traditional Reporting Systems provide information after 24 – 36 hours.

• To support Operational Users and influence what should happen next, the data should be available in real time to know what is happening now.


[Figure: evolution from Hadoop 1 to Hadoop 2]

Hadoop w/ MapReduce (timeline labels: 2006, 2009)

• MapReduce, largely batch processing, on HDFS (Hadoop Distributed File System), nodes 1 … N
• Silo’d clusters
• Largely batch system
• Difficult to integrate

MR-279: YARN

Hadoop 2 & YARN based Architecture (October 23, 2013)

• YARN: Data Operating System on HDFS (Hadoop Distributed File System), nodes 1 … N
• Interactive, Real-Time, Batch
• Enabled the Modern Data Architecture


Hadoop 2.0: Multi Use Data Platform

Batch, Interactive, Realtime, Online, Streaming, …

HADOOP 2 stack:

• Redundant, Reliable Storage (HDFS)
• Efficient Cluster Resource Management & Shared Services (YARN)
• Standard Query Processing: Hive
• Batch: MapReduce
• Online Data Processing
• Interactive: Tez
• Real Time Stream Processing
• Others


Hadoop 2.0 with YARN


Resource Manager/Node Manager Components


Problems with this approach in Hadoop 1.0

It limits scalability: the JobTracker runs on a single machine and performs several tasks:

1) Resource management

2) Job and task scheduling

3) Monitoring

Although many machines (DataNodes) are available, they are not used for this work, which limits scalability.

Availability issue: in Hadoop 1.0, the JobTracker is a single point of failure. If the JobTracker fails, all jobs must restart.

Distinct map slots and reduce slots

Limitation in running non-MapReduce Application


YARN Architecture

Resource Manager:

Arbitrates the division of resources among all the applications in the system. The Resource Manager has a pluggable scheduler component, which is responsible for allocating resources to the various running applications.

Node Manager:

A per-machine agent that runs on the slave nodes. It is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network), and reporting it to the Resource Manager.

Application Master:

A per-application component that negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors progress.

Container:

The unit of allocation, incorporating resource elements such as memory, CPU, disk, and network, used to execute a specific task of the application (similar to the map/reduce slots in MRv1).
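The idea of a pluggable scheduler can be sketched as a Resource Manager that accepts any allocation policy as a function. The two policies below are toy examples written for illustration, not the schedulers that ship with YARN, and the class and field names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Request:
    app: str        # hypothetical application name
    memory_mb: int  # memory asked for by one container


# A scheduler takes the pending requests and the free memory, and returns the grants.
Scheduler = Callable[[List[Request], int], List[Request]]


def fifo_scheduler(requests: List[Request], free_mb: int) -> List[Request]:
    """Toy FIFO policy: grant in arrival order, stop at the first request that does not fit."""
    granted = []
    for req in requests:
        if req.memory_mb > free_mb:
            break
        granted.append(req)
        free_mb -= req.memory_mb
    return granted


def smallest_first_scheduler(requests: List[Request], free_mb: int) -> List[Request]:
    """Toy policy: grant the smallest requests first, skipping any that do not fit."""
    granted = []
    for req in sorted(requests, key=lambda r: r.memory_mb):
        if req.memory_mb <= free_mb:
            granted.append(req)
            free_mb -= req.memory_mb
    return granted


class ResourceManager:
    """Arbitrates cluster memory; the allocation policy is supplied by the caller."""

    def __init__(self, total_mb: int, scheduler: Scheduler):
        self.free_mb = total_mb
        self.scheduler = scheduler

    def allocate(self, requests: List[Request]) -> List[Request]:
        granted = self.scheduler(requests, self.free_mb)
        self.free_mb -= sum(r.memory_mb for r in granted)
        return granted


pending = [Request("a", 3000), Request("b", 6000), Request("c", 1000)]
fifo_grants = ResourceManager(8192, fifo_scheduler).allocate(pending)
small_grants = ResourceManager(8192, smallest_first_scheduler).allocate(pending)
```

Swapping the function passed to `ResourceManager` changes which applications receive containers without touching the rest of the code, which is the point of keeping the scheduler pluggable.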


YARN - Execution Sequence

1) A client program submits the application

2) ResourceManager allocates a specified container to start the ApplicationMaster

3) ApplicationMaster, on boot-up, registers with ResourceManager

4) ApplicationMaster negotiates with ResourceManager for appropriate resource containers

5) On successful container allocations, ApplicationMaster contacts NodeManager to launch the container

6) Application code is executed within the container, and the execution status is then reported back to the ApplicationMaster

7) During execution, the client communicates directly with the ApplicationMaster or the ResourceManager to get status, progress updates, etc.

8) Once the application is complete, the ApplicationMaster unregisters with the ResourceManager and shuts down, allowing its own container to be reclaimed
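The sequence above can be walked through with a toy in-process simulation. This is a sketch only: the class and method names are illustrative rather than the Hadoop API, containers are handed out round-robin, and step 7 (client status polling) is left out.

```python
class NodeManager:
    """Per-machine agent that launches containers."""

    def __init__(self, name):
        self.name = name

    def launch(self, container_id, task):
        # Step 6: application code runs in the container; status goes back to the caller.
        return {"container": container_id, "node": self.name, "result": task()}


class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.registered = set()
        self.events = []
        self._next_id = 0

    def _allocate(self):
        # Toy placement: round-robin over the known NodeManagers.
        node = self.node_managers[self._next_id % len(self.node_managers)]
        container_id = f"container_{self._next_id}"
        self._next_id += 1
        return container_id, node

    def submit(self, app_name, tasks):
        self.events.append("submit")            # Step 1: client submits the application
        self.am_container = self._allocate()    # Step 2: container for the ApplicationMaster
        self.events.append("allocate_am")
        return ApplicationMaster(app_name, tasks, self).run()

    def register(self, app_name):
        self.registered.add(app_name)
        self.events.append("register")

    def request_containers(self, count):
        self.events.append("negotiate")
        return [self._allocate() for _ in range(count)]

    def unregister(self, app_name):
        self.registered.discard(app_name)
        self.events.append("unregister")


class ApplicationMaster:
    def __init__(self, app_name, tasks, resource_manager):
        self.app_name = app_name
        self.tasks = tasks
        self.rm = resource_manager

    def run(self):
        self.rm.register(self.app_name)                          # Step 3
        containers = self.rm.request_containers(len(self.tasks))  # Step 4
        statuses = [node.launch(cid, task)                       # Steps 5 and 6
                    for (cid, node), task in zip(containers, self.tasks)]
        self.rm.unregister(self.app_name)                        # Step 8
        return statuses


rm = ResourceManager([NodeManager("node1"), NodeManager("node2")])
statuses = rm.submit("wordcount", [lambda: 1 + 1, lambda: 2 * 3])
```

Inspecting `rm.events` afterwards shows the order in which the ResourceManager saw each step.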


Operational vs. Analytical Databases


A New Technology


No Means Yes!


Use Cases


Brewer's CAP Theorem



NoSQL Technology Spectrum


BigTable Data Model

Denormalized data:

Name  Site             Counter
Dick  Ebay             507,018
Dick  Google           690,414
Jane  Google           716,426
Dick  Facebook         723,649
Jane  Facebook         643,261
Jane  ILoveLarry.com   856,767
Dick  MadBillFans.com  675,230

Normalized relational tables:

NameId  Name
1       Dick
2       Jane

SiteId  SiteName
1       Ebay
2       Google
3       Facebook
4       ILoveLarry.com
5       MadBillFans.com

NameId  SiteId  Counter
1       1       507,018
1       2       690,414
2       2       716,426
1       3       723,649
2       3       643,261
2       4       856,767
1       5       675,230

BigTable rows (each row holds only the columns it has values for):

Id  Name  Ebay     Google   Facebook  (other columns)  MadBillFans.com
1   Dick  507,018  690,414  723,649   ...              675,230

Id  Name  Google   Facebook  (other columns)  ILoveLarry.com
2   Jane  716,426  643,261   ...              856,767
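The move from normalized relational tables to wide, sparse rows can be sketched with plain dictionaries. The sample data follows the Name/Site/Counter listing above; the function name and the dictionary layout are illustrative assumptions, not the API of any BigTable-style store.

```python
names = {1: "Dick", 2: "Jane"}
sites = {1: "Ebay", 2: "Google", 3: "Facebook",
         4: "ILoveLarry.com", 5: "MadBillFans.com"}

# (NameId, SiteId, Counter) rows, as a relational fact table would store them.
counters = [
    (1, 1, 507018), (1, 2, 690414), (2, 2, 716426), (1, 3, 723649),
    (2, 3, 643261), (2, 4, 856767), (1, 5, 675230),
]


def to_wide_rows(names, sites, counters):
    """Build one sparse row per name, with one column per site that has a value."""
    rows = {}
    for name_id, site_id, counter in counters:
        row = rows.setdefault(name_id, {"Name": names[name_id]})
        row[sites[site_id]] = counter
    return rows


wide = to_wide_rows(names, sites, counters)
for row_id, row in wide.items():
    print(row_id, row)
```

Each row carries only the columns it actually has values for, so Jane's row has no Ebay column at all rather than an empty one.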


Document databases

• Structured documents – XML and JSON (JavaScript Object Notation) become more prevalent within applications

• Web programmers start storing these in BLOBs in MySQL

• Emergence of XML and JSON databases
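The practice of storing JSON documents as opaque blobs in a relational table can be sketched as follows. This is an illustration under stated assumptions: SQLite (from the Python standard library) stands in for MySQL, and the table name, document fields, and helper function are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# The document is stored as one opaque text column; the database knows nothing about its fields.
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

orders = [
    {"customer": "Dick", "items": [{"sku": "A1", "qty": 2}]},
    {"customer": "Jane", "items": [{"sku": "B7", "qty": 1}, {"sku": "A1", "qty": 5}]},
]
conn.executemany("INSERT INTO docs (body) VALUES (?)",
                 [(json.dumps(order),) for order in orders])


def customers_who_bought(conn, sku):
    """Filter on a nested field by parsing every document in the application."""
    found = []
    for (body,) in conn.execute("SELECT body FROM docs ORDER BY id"):
        doc = json.loads(body)
        if any(item["sku"] == sku for item in doc["items"]):
            found.append(doc["customer"])
    return found


print(customers_who_bought(conn, "B7"))
```

Because the filter runs in application code over every row, the database cannot index or optimize it, which is one motivation for databases that understand the document structure natively.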


Graph Database: Neo4J, Infinite Graph, FlockDB

Document, JSON based: MongoDB, CouchDB, RethinkDB

Document, XML based: MarkLogic, BerkeleyDB XML

Key Value: MemcacheDB, Oracle NoSQL, Dynamo, Voldemort, DynamoDB, Riak

Table Based (BigTable): Cassandra, HBase, HyperTable, Accumulo


Multiple Data Stores

Relational: Run the Business
• Scale-out and scale-up
• Collect any data
• SQL
• Transactional and analytic applications for the enterprise
• Secure and highly available

Hadoop: Change the Business
• Scale-out, low cost store
• Collect any data
• Map-reduce, SQL
• Analytic applications

NoSQL: Scale the Business
• Scale-out, low cost store
• Collect key-value data
• Find data by key
• Web applications


Data Analytics Challenge

Separate silos of information to analyze


Data Analytics Challenge

Separate data access interfaces


SQL on Hadoop is Obvious

Stinger


Data Analytics Challenge

No comprehensive SQL interface across Oracle, Hadoop and NoSQL


Oracle Big Data Management System

Rich, comprehensive SQL access to all enterprise data

NoSQL


What Does Unified Query Mean for You?

After

Data Science

???

Anyone

Before

PhD


What Does Unified Query Mean for You?

After

Application Development

Before


A New Hadoop Processing Engine

Processing Layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL

Resource Management (YARN)

Storage Layer: Filesystem (HDFS), NoSQL Databases (Oracle NoSQL DB, HBase)


Big Data SQL

SELECT w.sess_id, c.name
FROM web_logs w, customers c
WHERE w.source_country = 'Brazil'
  AND w.cust_id = c.customer_id;

Relevant SQL runs on BDA nodes

Tens of gigabytes of data

Only columns and rows needed to answer query are returned

Diagram labels: Hadoop Cluster, Big Data SQL, Oracle Database (CUSTOMERS, WEB_LOGS)


SQL Push Down in Big Data SQL

• Hadoop Scans on Unstructured Data

• WHERE Clause Evaluation

• Column Projection

• Bloom Filters for Better Join Performance

• JSON Parsing, Data Mining Model Evaluation
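Three of the push-down techniques listed above (WHERE clause evaluation, column projection, and a Bloom filter on the join key) can be sketched in a few lines. This is a conceptual illustration, not Oracle's implementation: the `BloomFilter` class, the `scan_web_logs` function, and the sample rows are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def scan_web_logs(rows, predicate, columns, join_filter, join_key):
    """Runs on the storage side: filter rows, drop unlikely join keys, project columns."""
    for row in rows:
        if not predicate(row):                            # WHERE clause evaluation
            continue
        if not join_filter.might_contain(row[join_key]):  # Bloom filter on the join key
            continue
        yield {c: row[c] for c in columns}                # column projection


web_logs = [
    {"sess_id": 1, "cust_id": 10, "source_country": "Brazil", "url": "/a"},
    {"sess_id": 2, "cust_id": 11, "source_country": "India", "url": "/b"},
    {"sess_id": 3, "cust_id": 99, "source_country": "Brazil", "url": "/c"},
]
customers = {10: "Alice", 11: "Bob"}  # customer_id -> name, held on the database side

bloom = BloomFilter()
for customer_id in customers:
    bloom.add(customer_id)

shipped = list(scan_web_logs(web_logs,
                             predicate=lambda r: r["source_country"] == "Brazil",
                             columns=["sess_id", "cust_id"],
                             join_filter=bloom,
                             join_key="cust_id"))

# The database still performs the exact join, since a Bloom filter can pass false positives.
result = [(r["sess_id"], customers[r["cust_id"]])
          for r in shipped if r["cust_id"] in customers]
```

Only the filtered, projected rows in `shipped` cross from the storage side to the database side; the exact join then runs on that much smaller set.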


Query All Data without Application Change or Data Conversion

Oracle Big Data SQL


High Level Architecture

Diagram labels: INGEST, PROCESS, VISUALIZE, ANALYZE, STORE


BDD Value Proposition

Note: company logos and images are for illustration purposes only. Not a real use case for the company.


Oracle BDD - Technical Innovation on Hadoop

Oracle Big Data Discovery Workloads

Hadoop Cluster (BDA or Commodity Hardware): BDD node, data nodes, name node

Data Processing, Workflow & Monitoring

• Profiling: catalog entry creation, data type & language detection, schema configuration

• Sampling: dgraph (index) file creation

• Transforms: >100 functions

• Enrichments: location (geo), text (cleanup, sentiment, entity, key-phrase, whitelist tagging)

Self-Service Provisioning & Data Transfer

• Personal Data: Upload CSV and XLS to HDFS

In-Memory Discovery Indexes

• DGraph: Search, Guided Navigation, Analytics

Studio

• Web UI: Find, Explore, Transform, Discover, Share

Hadoop 2.x: Filesystem (HDFS), Workload Mgmt (YARN), Metadata (HCatalog)

Other Hadoop Workloads: MapReduce, Spark, Hive, Pig, Oracle Big Data SQL (BDA only)


Sample Enterprise Big Data Architecture

Operational RDBMS (Oracle, SQL Server, …)

In-memory Analytics (HANA, Exalytics …)

In-memory processing (Spark)

Hadoop

Web DBMS (MySQL, Mongo, Cassandra)

ERP & in-house CRM

Analytic/BI software (SAS, Tableau)

Web Server

Data Warehouse RDBMS (Oracle, Teradata …)


Thank You! [email protected]

@pasalapudi

https://community.oracle.com/groups/aioug-social-group


www.ora-search.com