Next Generation Hadoop Introduction
TRANSCRIPT
We do Hadoop
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive Innovation in the platform – We lead the roadmap
100% Open Source – Democratized Access to Data
[Slide graphic: the word "Volume" repeated many times]
The solution?
[Slide graphic: data held in separate systems: EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW]
Ummm…you dropped something
[Slide graphic: data outside the systems, alongside EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW]
Data Silos. Your data silos are lonely places.
[Slide graphic: separate data stores: EDW, Accounts, Customers, Web Properties]
… Data likes to be together.
[Slide graphic: EDW, Accounts, Customers, Web Properties]
Data likes to socialize too.
[Slide graphic: EDW, Accounts, Customers, Web Properties, Machine Data, Twitter, an unlabeled source, CDR, Weather Data]
New types of data don’t quite fit into your pristine view of the world.
[Slide graphic: My Little Data Empire; Logs; Machine Data; ? ? ? ?]
…and create One-Schema-To-Rule-Them-All…
[Slide graphic: EDW with a single Schema]
…but that has its problems too.
[Slide graphic: EDW, Schema, and data with ETL steps]
What if the data was processed and stored centrally? What if you didn’t need to force it into a single schema?
We call it a Modern Data Architecture*
*AKA Data Lake
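The idea of storing data centrally without forcing it into one schema is commonly called schema-on-read. A minimal sketch in plain Python, assuming raw records are kept as JSON lines; the sample records, field names and the `read_with_schema` helper are hypothetical and only illustrate the idea, not any Hadoop API:

```python
import json

# Raw records from different sources, stored as-is (hypothetical sample data).
raw_store = [
    '{"source": "web", "user": "u1", "page": "/tea-cozies"}',
    '{"source": "crm", "customer_id": "u1", "plan": "basic"}',
    '{"source": "sensor", "device": "d7", "temp_c": 21.5}',
]


def read_with_schema(store, source, fields):
    """Project records from one source onto the fields a given reader cares about."""
    for line in store:
        record = json.loads(line)
        if record.get("source") == source:
            yield {name: record.get(name) for name in fields}


# Each reader applies its own schema at read time; nothing was reshaped on write.
web_views = list(read_with_schema(raw_store, "web", ["user", "page"]))
crm_rows = list(read_with_schema(raw_store, "crm", ["customer_id", "plan"]))
print(web_views)
print(crm_rows)
```

The sensor record sits in the same store even though no reader has defined a schema for it yet.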
A Modern Data Architecture
• Consolidate siloed data sets, structured and unstructured
• Central data set on a single cluster
• Multiple workloads across batch, interactive and real time
• Central services for security, governance and operation
• Preserve existing investment in current tools and platforms
• Single view of the customer, product, supply chain
[Diagram: Modern Data Architecture]
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP; YARN: Data Operating System (Interactive, Real-Time, Batch) on HDFS (Hadoop Distributed File System), nodes 1 to N
SOURCES: Existing Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
HDFS stores data in blocks and replicates those blocks
[Diagram: nodes that each provide Storage and Processing; block1, block2 and block3 are each stored three times across the nodes]
If a block fails then HDFS always has the other copies and heals itself
[Diagram: the same nodes and blocks, with one failure marked X; block1, block2 and block3 are each still shown three times]
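The replicate-then-heal behaviour described on these two slides can be sketched as a toy simulation. This is a simplified illustration, not HDFS's actual placement policy: the replication factor of 3 is taken from the three copies per block in the diagram, and the node names, random placement and the `place_blocks` and `heal` helpers are assumptions made for the example.

```python
import random

REPLICATION = 3  # assumed: three copies per block, as drawn on the slide


def place_blocks(blocks, nodes, replication=REPLICATION, seed=0):
    """Assign each block to `replication` distinct nodes (toy random placement)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in blocks}


def heal(placement, live_nodes, replication=REPLICATION, seed=1):
    """Drop replicas on dead nodes, then re-copy under-replicated blocks to live nodes."""
    rng = random.Random(seed)
    healed = {}
    for block, holders in placement.items():
        survivors = holders & set(live_nodes)
        if not survivors:
            raise RuntimeError(f"{block} lost: no surviving replica to copy from")
        candidates = [n for n in live_nodes if n not in survivors]
        missing = min(replication - len(survivors), len(candidates))
        healed[block] = survivors | set(rng.sample(candidates, missing))
    return healed


nodes = [f"node{i}" for i in range(1, 10)]   # hypothetical nine-node cluster
placement = place_blocks(["block1", "block2", "block3"], nodes)
failed = sorted(placement["block1"])[0]      # fail one node that holds block1
live = [n for n in nodes if n != failed]
healed = heal(placement, live)
for block in sorted(healed):
    print(block, sorted(healed[block]))
```

After `heal` runs, every block is back to three replicas and none of them lives on the failed node, because a surviving copy is always available to copy from.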
Resource Manager + Node Managers = YARN
[Diagram: a Resource Manager with its Scheduler, and Node Managers hosting AppMasters and Containers for MapReduce, Storm and Pig]
YARN abstracts resource management so you can run more than just MapReduce
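A toy model of that abstraction, assuming a memory-only, first-fit scheduler: the Resource Manager hands out containers on Node Managers without caring which framework asked for them. The class names echo the slide, but the code is a hypothetical sketch, not the YARN API; it leaves out AppMasters and YARN's real schedulers.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Toy scheduler: the first node with enough free memory gets the container."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                node.containers.append((app, memory_mb))
                return node.name
        return None  # request cannot be satisfied right now


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 4096)])
# Different frameworks ask for containers through the same interface.
requests = [("MapReduce", 2048), ("Storm", 3072), ("Pig", 2048), ("MapReduce", 2048)]
placements = [(app, rm.allocate(app, mb)) for app, mb in requests]
for app, node in placements:
    print(f"{app:10s} -> {node}")
```

The last MapReduce request gets `None` because neither node has 2048 MB left; in this sketch the caller would have to retry once capacity frees up.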
[Diagram: HDFS2 at the base, YARN above it, and on YARN: MapReduce V2, Slider, Storm, MPI, Giraph, Tez, Spark … and more]
YARN is like the OS, and the processing frameworks are Apps. Slider, Tez, and Spark are all significant distributed processing frameworks that host a multitude of Data Tools.
Yet Another Resource Negotiator
[Diagram: Slider hosts HBase and Storm; Tez hosts Cascading, Hive and Pig; Spark hosts SQL, Streaming and MLlib]
Your segmentation today.
Male
Female
Age: 25-30
Town/City
Middle Income Band
Product Category Preferences
Your segmentation with better data.
Male
Female
Age: 27 but feels old
GPS coordinates
$65-68k per year
Product recommendations per time of day and per weather
Tea Party Hippie
Looking to start a business
Walking into Starbucks right now…
A depressed Toronto Maple Leafs fan
Products left in basket indicate drunk Amazon shopper
Purchase history indicates a risk taker
Thinking about a new house
Unhappy with his cell phone plan
Pregnant
Spent 25 minutes looking at tea cozies
Don’t wait for your data. Batch is often too late to influence the person who is in your store or on your website right now.
Streaming Processing, Search, and Storage
Stream data into Hadoop and process it in near real-time
[Diagram: Hortonworks Data Platform 2.2. Real-time data feeds; Streaming Ingest (Apache Kafka); Real Time Stream Processing (Storm); Online Data Processing (HBase); Search (Solr); SQL (Hive); Slider; YARN; HDFS]
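To make the contrast with batch concrete, here is a minimal sketch of near real-time processing: a sliding-window count that is updated on every incoming event instead of waiting for a nightly job. It is plain Python and does not use the Storm or Kafka APIs; the clickstream feed, the 60-second window and the `SlidingWindowCounter` class are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Count events per key over the last `window_s` seconds, updated on every event."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.events = deque()   # (timestamp, key) in arrival order
        self.counts = Counter()

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self.counts[key] += 1
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_s:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]
        return dict(self.counts)


# Hypothetical clickstream: (seconds since start, page viewed)
feed = [(0, "tea-cozies"), (20, "tea-cozies"), (45, "phones"),
        (70, "tea-cozies"), (130, "phones")]
window = SlidingWindowCounter(window_s=60)
snapshots = [window.add(ts, page) for ts, page in feed]
print(snapshots[-1])
```

Each call to `add` returns an up-to-date view of the last minute, so a decision can be made while the visitor is still on the site.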
Hortonworks Data Platform: A comprehensive data management platform
[Diagram: Hortonworks Data Platform 2.2]
• Governance: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
• Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive) and Java/Scala (Cascading) on Tez; NoSQL (HBase, Accumulo) and Stream (Storm) on Slider; Search (Solr); In-Memory (Spark); Others (ISV Engines)
• YARN: Data Operating System (Cluster Resource Management)
• HDFS (Hadoop Distributed File System)
• Security: Authentication, Authorization, Accounting, Data Protection. Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger
• Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
• Deployment Choice: Linux, Windows, On-Premises, Cloud
HDP = Apache Hadoop
Hortonworks Data Platform 2.2
[Chart: Apache component versions by release: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014)]
Categories: Data Management, Data Access, Governance & Integration, Security, Operations
Components: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr
Version labels, in transcript order: 2.2.0, 0.12.0, 0.12.0, 2.4.0, 0.12.1, 0.13.0, 0.96.1, 0.98.0, 0.9.1, 1.4.4, 1.3.1, 1.4.0, 1.4.4, 1.5.1, 3.3.2, 4.0.0, 3.4.5, 0.4.0, 4.0.0, 1.5.1, 0.5.0, 0.14.0, 0.14.0, 0.98.4, 1.6.1, 4.2, 0.9.3, 1.2.0, 0.6.0, 0.8.1, 1.4.5, 1.5.0, 1.7.0, 4.1.0, 0.5.0, 0.4.0, 2.6.0, 3.4.5, 0.4.0, 0.60, 4.7.2, 4.10.0, 0.5.1