![Page 1: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/1.jpg)
© Copyright 2016. Apps Associates LLC. 1
Big Data Overview & Hadoop for DBA’s
Satyendra Pasalapudi Associate Practice Director Apps Associates LLC
![Page 2: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/2.jpg)
About Me
Satyendra Kumar Pasalapudi
Associate Practice Director – Infrastructure/Cloud Practice at Apps Associates
Co-Founder & President of All India Oracle Users Group(AIOUG)
@pasalapudi
![Page 3: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/3.jpg)
www.ora-search.com
![Page 4: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/4.jpg)
History of Data Management Systems
Magnetic tape
“flat” (sequential) files
Pre-computer technologies:
Printing press Dewey decimal system Punched cards
Magnetic Disk
IMS
Relational Model defined
Indexed-Sequential Access Mechanism (ISAM)
Network Model
IDMS
ADABAS System R
Oracle V2
Ingres
dBase
DB2
Informix
Sybase
SQL Server
Access
Postgres
MySQL
Cassandra
Hadoop
Vertica
Riak
HBase
Dynamo
MongoDB
Redis
VoltDB
Hana
Neo4J
Aerospike
Hierarchical model
1940-50 1950-60 1960-70 1970-80 1980-90 1990-2000 2000-2010
![Page 5: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/5.jpg)
The Role of Data is Changing
![Page 6: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/6.jpg)
Until now, the questions you asked drove the data model
The new model is to collect as much data as possible – “Data-First Philosophy”
![Page 7: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/7.jpg)
Data is the new raw material for any business on par with capital, people, labor
![Page 8: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/8.jpg)
Characteristics of Big Data
![Page 9: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/9.jpg)
Cost effectively manage and analyze all available data in its native form: unstructured, structured, streaming
ERP CRM
RFID
Website
Network Switches
Social Media
Billing
Big data Challenge
![Page 10: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/10.jpg)
Hybrid Cloud Framework
HR FIN
SCOM SALES
PROCUREMENT
PLANNING
DW / BI
![Page 11: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/11.jpg)
Big data Eco System
![Page 12: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/12.jpg)
Not Easy to Get Analytic Value at Fast Enough Pace
1 Tool Complexity
• Early Hadoop tools only for experts
• Existing BI tools not designed for Hadoop
• Emerging solutions lack broad capabilities
Overly dependent on scarce and highly skilled resources
2 Data Uncertainty
• Not familiar and overwhelming
• Potential value not obvious
• Requires significant manipulation
80% effort typically spent on evaluating and preparing data
Source : Oracle
![Page 13: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/13.jpg)
Informatica Study May 2013
Addressed by Oracle Big Data Discovery
Key Challenges in Managing Big Data
![Page 14: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/14.jpg)
Sample of Big Data Use Cases Today
MEDIA/ ENTERTAINMENT
Viewers / advertising effectiveness Cross Sell
COMMUNICATIONS
Location-based advertising
EDUCATION & RESEARCH
Experiment sensor analysis
Retail / CPG
Sentiment analysis Hot products
Optimized Marketing
HEALTH CARE
Patient sensors, monitoring, EHRs Quality of care
LIFE SCIENCES
Clinical trials Genomics
HIGH TECHNOLOGY / INDUSTRIAL MFG.
Mfg quality Warranty analysis
OIL & GAS
Drilling exploration sensor analysis
FINANCIAL SERVICES
Risk & portfolio analysis New products
AUTOMOTIVE
Auto sensors reporting location, problems
Games
Adjust to player behavior In-Game Ads
LAW ENFORCEMENT & DEFENSE
Threat analysis - social media monitoring, photo analysis
TRAVEL & TRANSPORTATION
Sensor analysis for optimal traffic flows Customer sentiment
UTILITIES
Smart Meter analysis for network capacity,
ON-LINE SERVICES / SOCIAL MEDIA
People & career matching Web-site optimization
What is the main difference in this data?
Volume, Velocity, Variety
These Characteristics Challenge Your Existing Architecture
![Page 15: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/15.jpg)
Big Data Verticals
Media/Advertising
Targeted Advertising
Image and Video Processing
Oil & Gas
Seismic Analysis
Retail
Recommend
Transactions
Analysis
Life Sciences
Genome Analysis
Financial Services
Monte Carlo Simulations
Risk Analysis
Security
Anti-virus
Fraud Detection
Image Recognition
Social Network/Gaming
User Demographics
Usage analysis
In-game metrics
![Page 16: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/16.jpg)
Sample Enterprise Big Data Architecture
Operational RDBMS (Oracle, SQL Server, …)
In-memory Analytics (HANA, Exalytics …)
In-memory processing (Spark)
Hadoop
Web DBMS (MySQL, Mongo, Cassandra)
ERP & in-house CRM
Analytic/BI software (SAS, Tableau)
Web Server
Data Warehouse RDBMS (Oracle, Teradata …)
![Page 17: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/17.jpg)
Enterprise Data Hub / Data Lake / Data Reservoir
![Page 18: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/18.jpg)
Hadoop Data Reservoir Momentum
Hadoop Revenue and Forecast 49% CAGR, 2013-2018
Big Data Infrastructure Market $20.7b in 2018
Big Data Software Market $9b in 2018
Data Warehouse
Existing Sources Emerging Sources
Data Reservoir Data Warehouse
Source : Oracle
![Page 19: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/19.jpg)
Traditional Systems Under Pressure
APPLICATIONS
DATA SYSTEM
Business Analytics
Custom Applications
Packaged Applications
• Silos of Data • Costly to Scale • Constrained Schemas
Clickstream
Geolocation
Sentiment, Web Data
Sensor, Machine Data (IoT)
Unstructured docs, emails
Server logs
SOURCES
Existing Sources (CRM, ERP,…)
RDBMS EDW MPP
New Data Types
…and difficult to manage new data
![Page 20: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/20.jpg)
Hadoop Enabled the Modern Data Architecture
Common Data Set, Multiple Applications
• Optionally land all data in a single cluster
• Batch, interactive & real-time use cases
• Support multi-tenant access, processing & segmentation of data
YARN: Architectural Center of Hadoop
• Consistent security, governance & operations
• Ecosystem applications run natively in Hadoop
SOURCES
EXISTING Systems
Clickstream Web &Social
Geolocation Sensor & Machine
Server Logs
Unstructured
APPLICATIONS
DATA SYSTEM
Business Analytics
Custom Applications
Packaged Applications
RDBMS EDW MPP YARN: Data Operating System
Nodes 1 … N
HDFS (Hadoop Distributed File System)
Interactive Real-Time Batch
![Page 21: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/21.jpg)
BIG DATA APPLICATIONS: BY INDUSTRY & LINE OF BUSINESS
BUSINESS ANALYTICS: DISCOVERY, BUSINESS ANALYTICS
BIG DATA MANAGEMENT: DATA RESERVOIR, DATA WAREHOUSE
SOURCES
Big Data Footprint & Scope of Architecture
![Page 22: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/22.jpg)
Architecture Vision Common Emerging Platform Pattern
Data Warehouse / Data Marts
Business Intelligence Tools
ERP, CRM & Other Transactional Apps
Historic Source of Truth
Reporting, Query and Analysis Tools
Information Discovery Engine
Advanced Analytics
Website Logs & Data NoSQL DB
Sensors
Hadoop High Volume Distributed File System
Structured Data
Semi-structured Data
Real-Time Analytics and Recommendations
Recommend Location & User Profile
R, SAS
Discoveries
![Page 23: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/23.jpg)
Potential Oracle Products in the Footprint
Endeca Information Discovery on Exalytics
Cloudera HDFS on Big Data Appliance
Reliable, Available, Secure Source of Truth
Fast, Intuitive Data Discovery
Website Logs & Data
Oracle NoSQL DB
Real-time Recommendations
Analyst Friendly Reporting Query & Analysis Tools
Unstructured Data Analysis
Sensors
Oracle Database DW on Exadata
Oracle BI Foundation Suite, Hyperion on Exalytics
Oracle ERP & CRM Solutions on Exadata
Oracle Real-Time Decisions
Structured Data Analysis
Big Data Connectors
ODI
OEP
Advanced Analytics, In-Memory, Big Data SQL
R, SAS
![Page 24: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/24.jpg)
Oracle’s Unified Information Management
![Page 25: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/25.jpg)
Oracle Big Data Management System
SOURCES
Oracle Database
Oracle Industry Models
Oracle Advanced Analytics
Oracle Spatial & Graph
Big Data Appliance
Cloudera Hadoop
Oracle NoSQL Database
Oracle R Advanced Analytics for Hadoop
Oracle R Distribution
Oracle Database
Oracle Advanced Security
Oracle Advanced Analytics
Oracle Spatial & Graph
Oracle Exadata
Oracle Big Data Connectors
Oracle Data Integrator
Oracle Big Data SQL
![Page 26: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/26.jpg)
Oracle Big Data Management System
![Page 27: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/27.jpg)
We Need Tools Built Specifically for Big Data
![Page 28: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/28.jpg)
Hadoop and its Ecosystem
• Scale out Easily
• Parallel Computing
• Commodity Hardware
• Solves some Problems
• Complex to Run
• Special Skills to Maintain
Cassandra
![Page 29: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/29.jpg)
ETL for Unstructured Data
![Page 30: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/30.jpg)
ETL for Structured Data
![Page 31: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/31.jpg)
Hadoop Design Principles
• System shall manage and heal itself
– Automatically and transparently route around failure
– Speculatively execute redundant tasks if certain nodes are detected to be slow
• Performance shall scale linearly
– Proportional change in capacity with resource change
• Compute should move to data
– Lower latency, lower bandwidth
• Simple core, modular and extensible
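The "compute should move to data" principle can be illustrated with a small scheduling sketch. This is a toy under stated assumptions, not Hadoop's scheduler: the function name `pick_worker`, the block IDs and the node names are all hypothetical, and the only rule shown is "prefer a free worker that already stores the block".

```python
from typing import Dict, List, Optional, Set


def pick_worker(block_id: str,
                block_locations: Dict[str, Set[str]],
                free_workers: List[str]) -> Optional[str]:
    """Pick a worker for a task that reads `block_id`.

    Prefers a free worker that already holds the block (no network copy);
    otherwise falls back to the first free worker, or None if none is free.
    """
    holders = block_locations.get(block_id, set())
    for worker in free_workers:
        if worker in holders:
            return worker
    return free_workers[0] if free_workers else None


# Hypothetical cluster state: which nodes store which block.
locations = {"blk_1": {"node-a", "node-b"}, "blk_2": {"node-c"}}

print(pick_worker("blk_1", locations, ["node-c", "node-b"]))  # node-b holds blk_1
print(pick_worker("blk_2", locations, ["node-a"]))            # no local copy free
```

In the second call no free worker holds the block, so the data would have to travel to the task, which is the case the principle tries to avoid.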
![Page 32: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/32.jpg)
Hadoop History
• Dec 2004 – Google MapReduce paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Hadoop starts as a Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• May 2009 – Hadoop sorts Petabyte in 17 hours
![Page 33: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/33.jpg)
Google File System (GFS)
Map Reduce BigTable
Google Applications
Google Software Architecture (circa 2005)
![Page 34: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/34.jpg)
[Figure: MapReduce job flow, from Start through many parallel Map tasks to Reduce tasks]
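The map and reduce stages in the diagram can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for the example, and word counting is simply the customary demonstration job.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: List[str]) -> Dict[str, int]:
    # In a real cluster each line (or block) would be mapped on a different node.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data", "Big Hadoop data"])
print(counts)
```

Each `map_phase` call is independent of the others, which is what lets a framework run many Map tasks in parallel before the Reduce step.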
![Page 35: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/35.jpg)
Hadoop Ecosystem
HDFS (Hadoop Distributed File System)
HBase (key-value store)
MapReduce (Job Scheduling/Execution System)
Data Access
Sqoop
Flume
Client Access
Hue
Hive (SQL)
Pig (PL/SQL)
ZooKeeper (Coordination)
(Streaming/Pipes APIs)
Chukwa (Monitoring)
Data Mining
Mahout
OS – Red Hat, SUSE, Ubuntu, Windows
Commodity Hardware
Java Virtual Machine
Networking
Orchestration
Oozie
![Page 36: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/36.jpg)
Hadoop – Simplified View
• MPP (Massively Parallel) hardware running database-like software
• “Data” is stored in parts, across multiple worker nodes
• “Work” operates in parallel, on the different parts of the table
Controller Worker Nodes
![Page 37: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/37.jpg)
HDFS Architecture
![Page 38: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/38.jpg)
HDFS Architecture
Namenode
B replication
Rack1 Rack2
Client
Blocks
Datanodes Datanodes
Client
Write
Read
Metadata ops Metadata(Name, replicas..) (/home/foo/data,6. ..
Block ops
![Page 39: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/39.jpg)
Head Node Data 1 Data 2 Data 3 Data 4
MYFILE.TXT
..block1 -> block1
..block2 -> block2
..block3 -> block3
HDFS – Highly Available
![Page 40: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/40.jpg)
Namenode and Datanodes
Master/slave architecture
An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.
There are a number of DataNodes, usually one per node in the cluster.
The DataNodes manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
A file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.
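The division of labour described here (the Namenode keeps the namespace and block locations, the DataNodes keep the bytes) can be pictured with a toy metadata table. The class `ToyNameNode` and its methods are hypothetical names for this sketch and do not correspond to the HDFS API; the file, block and node names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyNameNode:
    """Holds metadata only: file -> ordered block IDs, block ID -> DataNodes."""
    files: Dict[str, List[str]] = field(default_factory=dict)
    locations: Dict[str, List[str]] = field(default_factory=dict)

    def add_file(self, path: str, blocks: Dict[str, List[str]]) -> None:
        """Register a file given a mapping of its block IDs to DataNode names."""
        self.files[path] = list(blocks)   # block order = insertion order
        self.locations.update(blocks)

    def locate(self, path: str) -> List[Tuple[str, List[str]]]:
        """Answer a client's metadata request: where is each block of `path`?"""
        return [(block, self.locations[block]) for block in self.files[path]]


namenode = ToyNameNode()
namenode.add_file("MYFILE.TXT", {
    "block1": ["Data 1", "Data 2", "Data 3"],
    "block2": ["Data 2", "Data 3", "Data 4"],
    "block3": ["Data 1", "Data 3", "Data 4"],
})
for block, nodes in namenode.locate("MYFILE.TXT"):
    print(block, "->", nodes)
```

After the lookup, a client would read the block contents directly from one of the listed DataNodes; the toy Namenode never handles file data itself.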
![Page 41: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/41.jpg)
Hadoop 1 – Job & Task Trackers
Master Node - The majority of Hadoop deployments consist of several master node
instances. Having more than one master node helps eliminate the risk of a single
point of failure.
NameNode - These processes are charged with storing a directory tree of all files
in the Hadoop Distributed File System (HDFS). They also keep track of where the
file data is kept within the cluster. Client applications contact NameNodes when
they need to locate a file, or add, copy, or delete a file.
DataNodes - The DataNode stores data in HDFS and is responsible for
replicating data across the cluster. DataNodes interact with client applications when
the NameNode has supplied the DataNode's address.
WorkerNode - Unlike master nodes, whose numbers we can count on one hand, a
representative Hadoop deployment consists of dozens or hundreds of worker
nodes, which provide enough processing power to analyze a
few hundred terabytes all the way up to one petabyte. Each worker node includes
a DataNode as well as a Task Tracker.
![Page 42: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/42.jpg)
Map Reduce
Job Tracker / MapReduce Workload Management Layer - This
process is assigned to interact with client applications. It is
responsible for distributing MapReduce tasks to particular nodes
within a cluster. This engine coordinates all aspects of Hadoop,
such as scheduling and launching jobs.
Task Tracker - This is a process in the cluster that is capable of
receiving tasks (including Map, Reduce, and Shuffle) from a Job
Tracker.
![Page 43: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/43.jpg)
Data Replication Similar to that of ASM
HDFS is designed to store very large files across machines in a large cluster.
Each file is a sequence of blocks.
All blocks in the file except the last are of the same size.
Blocks are replicated for fault tolerance.
Block size and replicas are configurable per file.
The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
BlockReport contains all the blocks on a Datanode.
![Page 44: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/44.jpg)
Replica Placement & Rack Aware
The placement of the replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from other distributed file systems. Rack-aware replica placement:
Goal: improve reliability, availability and network bandwidth utilization
A cluster has many racks, and communication between racks goes through switches. Network bandwidth between machines on the same rack is greater than between machines on different racks. The Namenode determines the rack id for each DataNode.
Replicas are typically placed on unique racks
• Simple but non-optimal
• Writes are expensive
Replication factor is 3
Replicas are placed: one on a node in a local rack, one on a different node in the local rack and one on a node in a different rack.
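A minimal sketch of the three-replica layout stated on this slide follows. It is an illustration only: the function `place_replicas` and the rack and node names are hypothetical, the node choices are deterministic rather than load-aware, and the default policy in a given Hadoop release may differ from the layout the slide describes.

```python
from typing import Dict, List


def place_replicas(writer: str, racks: Dict[str, List[str]]) -> List[str]:
    """Return three DataNodes for one block, following the slide's layout:
    the writer's own node, another node in the same rack, and a node in a
    different rack. Assumes the writer is a DataNode listed in `racks`, its
    rack has at least two nodes, and at least one other rack is non-empty.
    """
    local_rack = next(rack for rack, nodes in racks.items() if writer in nodes)
    same_rack_peer = next(node for node in racks[local_rack] if node != writer)
    remote_node = next(nodes[0] for rack, nodes in racks.items()
                       if rack != local_rack and nodes)
    return [writer, same_rack_peer, remote_node]


# Hypothetical two-rack cluster.
racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", racks))
print(place_replicas("dn4", racks))
```

Keeping two copies in one rack limits cross-rack write traffic, while the third copy on another rack protects against the loss of a whole rack.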
![Page 45: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/45.jpg)
Replica Selection
• Replica selection for READ operation: HDFS tries to minimize the bandwidth consumption and latency.
• If there is a replica on the Reader node then that is preferred.
• HDFS cluster may span multiple data centers: replica in the local data center is preferred over the remote one.
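The preference order for reads can be sketched as a distance function. The names `Location`, `distance` and `choose_replica` are hypothetical, and the "same rack" tier is an added assumption; the slide itself only mentions the reader's own node and the local data center.

```python
from typing import List, NamedTuple


class Location(NamedTuple):
    node: str
    rack: str
    datacenter: str


def distance(reader: Location, replica: Location) -> int:
    """Smaller is closer: same node, then same rack, then same data center."""
    if reader.node == replica.node:
        return 0
    if reader.datacenter == replica.datacenter and reader.rack == replica.rack:
        return 1  # assumption: rack locality ranks between node and data center
    if reader.datacenter == replica.datacenter:
        return 2
    return 3


def choose_replica(reader: Location, replicas: List[Location]) -> Location:
    """Pick the replica that minimizes the distance to the reader."""
    return min(replicas, key=lambda replica: distance(reader, replica))


reader = Location("dn1", "rack1", "dc1")
replicas = [
    Location("dn7", "rack9", "dc2"),  # remote data center
    Location("dn5", "rack2", "dc1"),  # local data center, other rack
    Location("dn2", "rack1", "dc1"),  # same rack as the reader
]
print(choose_replica(reader, replicas).node)
```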
![Page 46: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/46.jpg)
Hadoop Components
• Hadoop is bundled with two independent components
– HDFS (Hadoop Distributed File System)
• Designed for scaling in terms of storage and IO bandwidth
– MR framework (MapReduce)
• Designed for scaling in terms of performance
![Page 47: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/47.jpg)
Understanding file structure
1 GB file
File is split into blocks
Each block is typically 64 MB (the Hadoop 1 default; Hadoop 2 defaults to 128 MB)
Each block is stored as two files – one holding data and a second for metadata and checksum
Block
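The arithmetic behind this slide can be worked through in a short sketch. The helper `split_into_blocks` is a hypothetical name; it only computes block sizes and does not touch HDFS. The 64 MB block size comes from the slide and the replication factor of 3 from the replica placement slide.

```python
from typing import List

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 64 * MB) -> List[int]:
    """Return the size in bytes of each block of a file.

    Every block is `block_size` bytes except possibly the last one,
    which holds only the remaining bytes.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


one_gb = split_into_blocks(1024 * MB)
print(len(one_gb))                    # 1 GB / 64 MB = 16 blocks

small = split_into_blocks(200 * MB)
print([size // MB for size in small])  # three full blocks plus an 8 MB tail

replication_factor = 3
print(len(one_gb) * replication_factor)               # block replicas in the cluster
print(sum(one_gb) * replication_factor // (1024 * MB))  # raw storage in GB
```

So the 1 GB file occupies 16 blocks, 48 block replicas and 3 GB of raw disk at replication factor 3, and the 8 MB tail block takes only 8 MB of space rather than a full 64 MB.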
![Page 48: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/48.jpg)
Hadoop Processes
• Processes running on Hadoop
– NameNode
– DataNode
– Secondary NameNode
– Task Tracker
– Job Tracker
![Page 49: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/49.jpg)
NameNode
• Single point of contact
• HDFS master
• Holds meta information
– List of files and directories
– Location of blocks
• Single node per cluster
– Cluster can have thousands of DataNodes and tens of thousands of HDFS clients.
NameNode
![Page 50: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/50.jpg)
DataNode
• Can execute multiple tasks concurrently
• Holds actual data blocks, checksum and generation stamp
• If block is half full, needs only half of the space of full block
• At start-up, connects to NameNode and performs a handshake
• No binding to IP address or port, uses Storage ID
• Sends heartbeat to NameNode
DataNode Storage ID: XYZ001
![Page 51: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/51.jpg)
Communication
Heartbeat (DataNode to NameNode): every 3 seconds, “I AM ALIVE”
• Total storage capacity
• Fraction of storage in use
• Number of data transfers currently in progress
Reply (NameNode to DataNode): instructs DataNode
• Replicate block to other node
• Remove local block replica
• Send immediate block report
• Shut down the node
No heartbeat for 10 minutes
NameNode
DataNode Storage ID: XYZ001
DataNode Storage ID: XYZ002
DataNode Storage ID: XYZ003
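The heartbeat bookkeeping on this slide can be sketched as a small liveness table. The class `HeartbeatMonitor` is a hypothetical name and this is not NameNode code: it uses the slide's 3-second interval and 10-minute silence window, takes timestamps as plain numbers so the example is deterministic, and assumes a node is treated as dead once the window has elapsed.

```python
from typing import Dict, List

HEARTBEAT_INTERVAL_S = 3   # "Every 3 seconds" on the slide
DEAD_AFTER_S = 10 * 60     # "No heartbeat for 10 minutes" on the slide


class HeartbeatMonitor:
    """Tracks the last heartbeat time per DataNode storage ID."""

    def __init__(self) -> None:
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, storage_id: str, now: float) -> None:
        """Record an "I AM ALIVE" message received at time `now` (seconds)."""
        self.last_seen[storage_id] = now

    def dead_nodes(self, now: float) -> List[str]:
        """Storage IDs that have been silent for the whole dead window."""
        return sorted(storage_id for storage_id, seen in self.last_seen.items()
                      if now - seen >= DEAD_AFTER_S)


monitor = HeartbeatMonitor()
monitor.heartbeat("XYZ001", now=0)                 # reports once, then goes silent
for t in range(0, 601, HEARTBEAT_INTERVAL_S):      # reports every 3 seconds
    monitor.heartbeat("XYZ002", now=t)

print(monitor.dead_nodes(now=600))                 # ['XYZ001']
```

Once a node lands on the dead list, its blocks are under-replicated, which is when an instruction such as "replicate block to other node" would be sent to the surviving DataNodes.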
![Page 52: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/52.jpg)
![Page 53: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/53.jpg)
Coordination in a distributed system
• Coordination: An act that multiple nodes must perform together.
• Examples:
– Group membership
– Locking
– Publisher/Subscriber
– Leader Election
– Synchronization
• Getting node coordination correct is very hard!
![Page 54: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/54.jpg)
![Page 55: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/55.jpg)
Introducing ZooKeeper
“ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers.”
- ZooKeeper Wiki
ZooKeeper is much more than a distributed lock server!
![Page 56: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/56.jpg)
What is ZooKeeper?
• An open source, high-performance coordination service for distributed applications.
• Exposes common services in a simple interface:
– naming
– configuration management
– locks & synchronization
– group services
… so developers don't have to write them from scratch
• Build your own on it for specific needs.
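To make "build your own on it" concrete, here is a toy, in-memory imitation of a hierarchical namespace with session-bound sequential nodes, used for leader election. It is not the ZooKeeper client API: `ToyNamespace`, `leader`, and all paths are hypothetical names chosen for illustration.

```python
class ToyNamespace:
    """In-memory stand-in for a hierarchical namespace of data registers."""

    def __init__(self):
        self.nodes = {"/": b""}   # path -> data
        self.owner = {}           # session-bound path -> session id
        self.counters = {}        # parent path -> next sequence number

    def create(self, path, data=b"", session=None, sequential=False):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if sequential:
            seq = self.counters.get(parent, 0)
            self.counters[parent] = seq + 1
            path = f"{path}{seq:010d}"     # zero-padded so names sort by age
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = data
        if session is not None:
            self.owner[path] = session
        return path

    def children(self, parent):
        prefix = parent.rstrip("/") + "/"
        return sorted(p for p in self.nodes
                      if p.startswith(prefix) and p != prefix
                      and "/" not in p[len(prefix):])

    def close_session(self, session):
        """Drop every node created by a session, as if its client had died."""
        for path in [p for p, s in self.owner.items() if s == session]:
            del self.nodes[path]
            del self.owner[path]


def leader(ns, election_path):
    """The candidate with the lowest sequence number is the leader."""
    members = ns.children(election_path)
    return ns.nodes[members[0]].decode() if members else None


ns = ToyNamespace()
ns.create("/election")
ns.create("/election/candidate-", b"node-a", session="s1", sequential=True)
ns.create("/election/candidate-", b"node-b", session="s2", sequential=True)
first = leader(ns, "/election")
ns.close_session("s1")            # node-a fails; its node disappears
second = leader(ns, "/election")
print(first, second)
```

Group membership falls out of the same structure: the children of `/election` are exactly the live members.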
![Page 57: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/57.jpg)
HDFS Distributions
![Page 58: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/58.jpg)
Real Time BI
• Speed, agility, and intelligence are competitive advantages that nearly all organizations seek.
• Existing Traditional Reporting Systems provide information after 24 – 36 hours.
• To support Operational Users and influence what should happen next, the data should be available in real time to know what is happening now.
![Page 59: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/59.jpg)
Hadoop 2.0
![Page 60: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/60.jpg)
Timeline (figure):
• 2006: Hadoop w/ MapReduce. MapReduce on top of HDFS (Hadoop Distributed File System) across nodes 1…N. Largely batch processing; silo’d clusters; difficult to integrate.
• 2009: MR-279: YARN
• October 23, 2013: Hadoop 2 & YARN. Hadoop2 & YARN based architecture: YARN as the data operating system on top of HDFS across nodes 1…N, supporting interactive, real-time and batch workloads. Enabled the Modern Data Architecture.
![Page 61: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/61.jpg)
Hadoop 2.0
Multi Use Data Platform: Batch, Interactive, Realtime, Online, Streaming, …
HADOOP 2 stack (top to bottom):
• Processing engines: Standard Query Processing (Hive), Batch (MapReduce), Interactive (Tez), Online Data Processing, Real Time Stream Processing, Others
• Efficient Cluster Resource Management & Shared Services (YARN)
• Redundant, Reliable Storage (HDFS)
![Page 62: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/62.jpg)
Hadoop 2.0 with YARN
![Page 63: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/63.jpg)
Resource Manager/Node Manager Components
![Page 64: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/64.jpg)
Problems with this approach in Hadoop 1.0
Limited scalability: the JobTracker runs on a single machine and performs several tasks:
1) Resource management
2) Job and task scheduling
3) Monitoring
However many machines (DataNodes) the cluster has, all of this work stays on that one machine, which limits scalability.
Availability issue: in Hadoop 1.0 the JobTracker is a single point of failure. If the JobTracker fails, all jobs must restart.
Distinct map slots and reduce slots: a slot of one type cannot run the other kind of task
Limited ability to run non-MapReduce applications
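The cost of distinct map and reduce slots can be shown with simple arithmetic. The sketch below assumes equal-sized tasks and made-up slot counts; the function names are hypothetical and are not part of Hadoop.

```python
def runnable_with_slots(pending_map, pending_reduce, map_slots, reduce_slots):
    """MRv1-style: each slot type can only run its own kind of task."""
    return min(pending_map, map_slots) + min(pending_reduce, reduce_slots)


def runnable_with_containers(pending_map, pending_reduce, containers):
    """YARN-style: any free container can host any task."""
    return min(pending_map + pending_reduce, containers)


# A node with 4 map slots and 2 reduce slots, during a map-only phase.
slots = runnable_with_slots(pending_map=6, pending_reduce=0,
                            map_slots=4, reduce_slots=2)
# The same capacity treated as 6 interchangeable containers.
pooled = runnable_with_containers(pending_map=6, pending_reduce=0,
                                  containers=6)
print(slots, pooled)
```

With slots, the two reduce slots sit idle while map tasks wait; pooled capacity runs all six tasks at once.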
![Page 65: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/65.jpg)
YARN Architecture
Resource Manager:
Arbitrates the division of resources among all the applications in the system. The Resource Manager has a pluggable scheduler component, which is responsible for allocating resources to the various running applications.
Node Manager:
A per-machine agent running on each slave node. It launches the applications’ containers, monitors their resource usage (CPU, memory, disk, network), and reports it to the Resource Manager.
Application Master:
Negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors progress.
Container:
Unit of allocation that bundles resources such as memory, CPU, disk and network to execute a specific task of the application (similar to map/reduce slots in MRv1).
![Page 66: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/66.jpg)
YARN - Execution Sequence
1) A client program submits the application
2) ResourceManager allocates a container in which to start the ApplicationMaster
3) ApplicationMaster, on boot-up, registers with ResourceManager
4) ApplicationMaster negotiates with ResourceManager for appropriate resource containers
5) On successful container allocation, ApplicationMaster contacts NodeManager to launch the container
6) Application code is executed within the container, and the execution status is reported back to ApplicationMaster
7) During execution, the client communicates directly with ApplicationMaster or ResourceManager to get status, progress updates etc.
8) Once the application is complete, ApplicationMaster unregisters with ResourceManager and shuts down, releasing its own container
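The sequence above can be traced with a toy simulation. This is a sketch under stated assumptions, not the YARN API: the classes, the first-fit allocation loop, the memory sizes, and the trace labels are all hypothetical, and step 7 (client status polling) is left out.

```python
class NodeManager:
    """Per-machine agent that launches containers (toy model)."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

    def launch(self, container_id, task):
        # Steps 5-6: run the task in the container and report its status.
        return (container_id, task(), self.name)


class ResourceManager:
    """Arbitrates resources; a first-fit loop stands in for the scheduler."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.apps = set()
        self.next_id = 0

    def allocate(self, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                self.next_id += 1
                return self.next_id, node
        return None

    def release(self, node, memory_mb):
        node.free_mb += memory_mb


def run_application(rm, app_id, tasks, memory_mb=1024):
    trace = ["submitted"]                                  # step 1
    am = rm.allocate(memory_mb)                            # step 2
    if am is None:
        raise RuntimeError("no capacity for the ApplicationMaster")
    trace.append("am-started")
    rm.apps.add(app_id)                                    # step 3
    trace.append("am-registered")
    results = []
    for task in tasks:
        grant = rm.allocate(memory_mb)                     # step 4
        if grant is None:
            raise RuntimeError("no capacity for a task container")
        container_id, node = grant
        results.append(node.launch(container_id, task))    # steps 5-6
        rm.release(node, memory_mb)
    rm.apps.discard(app_id)                                # step 8
    rm.release(am[1], memory_mb)
    trace.append("am-unregistered")
    return trace, results


nodes = [NodeManager("nm1", 2048), NodeManager("nm2", 2048)]
rm = ResourceManager(nodes)
trace, results = run_application(
    rm, "app-1", [lambda: "map-ok", lambda: "reduce-ok"])
print(trace)
print(results)
```

Note that the ApplicationMaster itself occupies a container, so its memory is returned to the pool only after it unregisters.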
![Page 67: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/67.jpg)
Operational vs. Analytical Databases
![Page 68: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/68.jpg)
A New Technology
![Page 69: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/69.jpg)
No Means Yes!
![Page 70: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/70.jpg)
Use Cases
![Page 71: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/71.jpg)
Brewer's CAP Theorem
![Page 72: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/72.jpg)
Brewer's CAP Theorem
![Page 73: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/73.jpg)
NoSQL Technology Spectrum
![Page 74: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/74.jpg)
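BigTable Data Model

Source data:

| Name | Site | Counter |
|---|---|---|
| Dick | Ebay | 507,018 |
| Dick | Google | 690,414 |
| Jane | Google | 716,426 |
| Dick | Facebook | 723,649 |
| Jane | Facebook | 643,261 |
| Jane | ILoveLarry.com | 856,767 |
| Dick | MadBillFans.com | 675,230 |

Relational (normalized) form:

| NameId | Name |
|---|---|
| 1 | Dick |
| 2 | Jane |

| SiteId | SiteName |
|---|---|
| 1 | Ebay |
| 2 | Google |
| 3 | Facebook |
| 4 | ILoveLarry.com |
| 5 | MadBillFans.com |

| NameId | SiteId | Counter |
|---|---|---|
| 1 | 1 | 507,018 |
| 1 | 2 | 690,414 |
| 2 | 2 | 716,426 |
| 1 | 3 | 723,649 |
| 2 | 3 | 643,261 |
| 2 | 4 | 856,767 |
| 1 | 5 | 675,230 |

BigTable form (one sparse row per name, one column per site):

| Id | Name | Ebay | Google | Facebook | (other columns) | MadBillFans.com |
|---|---|---|---|---|---|---|
| 1 | Dick | 507,018 | 690,414 | 723,649 | … | 675,230 |

| Id | Name | Google | Facebook | (other columns) | ILoveLarry.com |
|---|---|---|---|---|---|
| 2 | Jane | 716,426 | 643,261 | … | 856,767 |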
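The wide-row layout on this slide can be mimicked with a dictionary of dictionaries. This is only an illustration of the data model using the slide's own figures; the variable names are hypothetical and no real BigTable or HBase client is involved.

```python
from collections import defaultdict

# (name, site, counter) triples from the slide's source data.
visits = [
    ("Dick", "Ebay", 507018),
    ("Dick", "Google", 690414),
    ("Jane", "Google", 716426),
    ("Dick", "Facebook", 723649),
    ("Jane", "Facebook", 643261),
    ("Jane", "ILoveLarry.com", 856767),
    ("Dick", "MadBillFans.com", 675230),
]

# One row per name; each row holds only the columns (sites) it actually has.
wide_rows = defaultdict(dict)
for name, site, counter in visits:
    wide_rows[name][site] = counter

print(sorted(wide_rows["Jane"]))                  # Jane's column names
print(wide_rows["Dick"].get("ILoveLarry.com"))    # absent column, not a NULL cell
```

Each row carries its own set of columns, so no join is needed to read everything about one name, and a site a person never visited takes no space.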
![Page 75: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/75.jpg)
Document databases
• Structured documents – XML and JSON (JavaScript Object Notation) – become more prevalent within applications
• Web programmers start storing these in BLOBs in MySQL
• Emergence of XML and JSON databases
![Page 76: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/76.jpg)
• Graph Database: Neo4J, Infinite Graph, FlockDB
• Document
– JSON based: MongoDB, CouchDB, RethinkDB
– XML based: MarkLogic, BerkeleyDB XML
• Key Value: MemcacheDB, Oracle NoSQL, Dynamo, Voldemort, DynamoDB, Riak
• Table Based: BigTable, Cassandra, HBase, HyperTable, Accumulo
![Page 77: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/77.jpg)
Multiple Data Stores
Relational: Run the Business
• Scale-out and scale-up
• Collect any data
• SQL
• Transactional and analytic applications for the enterprise
• Secure and highly available
Hadoop: Change the Business
• Scale-out, low cost store
• Collect any data
• Map-reduce, SQL
• Analytic applications
NoSQL: Scale the Business
• Scale-out, low cost store
• Collect key-value data
• Find data by key
• Web applications
![Page 78: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/78.jpg)
Data Analytics Challenge
Separate silos of information to analyze
![Page 79: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/79.jpg)
Data Analytics Challenge
Separate data access interfaces
![Page 80: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/80.jpg)
SQL on Hadoop is Obvious
Stinger
![Page 81: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/81.jpg)
Data Analytics Challenge
No comprehensive SQL interface across Oracle, Hadoop and NoSQL
![Page 82: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/82.jpg)
Oracle Big Data Management System
Rich, comprehensive SQL access to all enterprise data
NoSQL
![Page 83: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/83.jpg)
What Does Unified Query Mean for You?
After
Data Science
???
Anyone
Before
PhD
![Page 84: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/84.jpg)
What Does Unified Query Mean for You?
After
Application Development
Before
![Page 85: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/85.jpg)
A New Hadoop Processing Engine
Processing Layer
• MapReduce and Hive
• Spark
• Impala
• Search
• Big Data SQL
Resource Management (YARN)
Storage Layer
• Filesystem (HDFS)
• NoSQL Databases (Oracle NoSQL DB, HBase)
![Page 86: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/86.jpg)
Big Data SQL
SELECT w.sess_id, c.name
FROM web_logs w, customers c
WHERE w.source_country = 'Brazil'
AND w.cust_id = c.customer_id;
• Relevant SQL runs on BDA nodes
• Tens of gigabytes of data
• Only columns and rows needed to answer the query are returned
Diagram: Oracle Database (CUSTOMERS) connected through Big Data SQL to the Hadoop Cluster (WEB_LOGS)
![Page 87: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/87.jpg)
SQL Push Down in Big Data SQL
• Hadoop Scans on Unstructured Data
• WHERE Clause Evaluation
• Column Projection
• Bloom Filters for Better Join Performance
• JSON Parsing, Data Mining Model Evaluation
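The push-down ideas listed here (WHERE evaluation, column projection, Bloom-filter join pruning) can be sketched in a few lines. This is a generic illustration, not Oracle Big Data SQL internals: the `BloomFilter` class, `scan_on_storage_node`, the sample rows, and the customer names are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def scan_on_storage_node(rows, predicate, columns, join_key, bloom):
    """Filter and project next to the data so only needed values travel."""
    for row in rows:
        if not predicate(row):
            continue                            # WHERE clause evaluation
        if not bloom.might_contain(row[join_key]):
            continue                            # row cannot match the join
        yield {c: row[c] for c in columns}      # column projection


customers = [{"customer_id": 1, "name": "Ana"},
             {"customer_id": 2, "name": "Bo"}]
web_logs = [
    {"sess_id": "s1", "cust_id": 1, "source_country": "Brazil", "agent": "x"},
    {"sess_id": "s2", "cust_id": 2, "source_country": "India", "agent": "y"},
    {"sess_id": "s3", "cust_id": 9, "source_country": "Brazil", "agent": "z"},
]

# The database side builds the filter from its join keys and ships it out.
bloom = BloomFilter()
for c in customers:
    bloom.add(c["customer_id"])

shipped = list(scan_on_storage_node(
    web_logs,
    predicate=lambda r: r["source_country"] == "Brazil",
    columns=("sess_id", "cust_id"),
    join_key="cust_id",
    bloom=bloom,
))

# The final join still checks exact keys, so a false positive is harmless.
names = {c["customer_id"]: c["name"] for c in customers}
result = [(r["sess_id"], names[r["cust_id"]])
          for r in shipped if r["cust_id"] in names]
print(result)
```

The Bloom filter only reduces how many rows are shipped; correctness comes from the exact key comparison in the final join.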
![Page 88: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/88.jpg)
Query All Data without Application Change or Data Conversion
Oracle Big Data SQL
![Page 89: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/89.jpg)
High Level Architecture
Stages shown in the figure: INGEST, PROCESS, VISUALIZE, ANALYZE, STORE
![Page 90: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/90.jpg)
Fast Pace Innovation
Dec 18th 2015
http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
![Page 91: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/91.jpg)
BDD Value Proposition
Note: company logos and images are for illustration purposes only. Not a real use case for the company.
![Page 92: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/92.jpg)
Oracle BDD - Technical Innovation on Hadoop
Hadoop Cluster (BDA or Commodity Hardware): BDD node, data nodes, name node
Oracle Big Data Discovery Workloads
Studio
• Web UI: Find, Explore, Transform, Discover, Share
In-Memory Discovery Indexes
• DGraph: Search, Guided Navigation, Analytics
Self-Service Provisioning & Data Transfer
• Personal Data: Upload CSV and XLS to HDFS
Data Processing, Workflow & Monitoring
• Profiling: catalog entry creation, data type & language detection, schema configuration
• Sampling: dgraph (index) file creation
• Transforms: >100 functions
• Enrichments: location (geo), text (cleanup, sentiment, entity, key-phrase, whitelist tagging)
Hadoop 2.x
• Filesystem (HDFS)
• Workload Mgmt (YARN)
• Metadata (HCatalog)
Other Hadoop Workloads
• MapReduce
• Spark
• Hive
• Pig
• Oracle Big Data SQL (BDA only)
![Page 93: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/93.jpg)
Sample Enterprise Big Data Architecture
Components shown in the figure:
• Operational RDBMS (Oracle, SQL Server, …)
• ERP & in-house CRM
• Web Server
• Web DBMS (MySQL, Mongo, Cassandra)
• Data Warehouse RDBMS (Oracle, Teradata …)
• Hadoop
• In-memory processing (Spark)
• In-memory Analytics (HANA, Exalytics …)
• Analytic/BI software (SAS, Tableau)
![Page 94: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/94.jpg)
![Page 96: Big Data Overview & Hadoop for DBA’s Big Data Day- Bi… · What is the main difference in this data? Volume, Velocity, ... Usage analysis In-game ... Structured Data Analysis Big](https://reader033.vdocuments.us/reader033/viewer/2022052608/5a9482c27f8b9ab6188bdab5/html5/thumbnails/96.jpg)
www.ora-search.com