Tame that Beast
© Copyright 2016 EMC Corporation. All rights reserved.
TAME THAT BEAST
Stefan Radtke
CTO, EMEA
EMC Emerging Technology Division
EMC CONFIDENTIAL—INTERNAL USE ONLY
Welcome!
Dr. Stefan Radtke
CTO Isilon, EMEA
EMC Emerging Technology Division
- 1995-2011: 17 years at IBM in various technical roles
- 2011: Joined EMC
- 2012-today: CTO, EMEA for EMC Isilon
Phone: +49-176-34434460
E-Mail: [email protected]
LinkedIn: http://de.linkedin.com/in/drstefanradtke
Blog: http://stefanradtke.blogspot.com
System Availability

Uptime                   Downtime (per year)
99.999% (AKA 5 nines)    5.26 minutes
99.99% (AKA 4 nines)     52.6 minutes
99.5%                    1.83 days
99% (AKA 2 nines)        3.65 days
95%                      18.25 days
What is your Data Warehouse’s uptime SLA?
What is your Hadoop uptime SLA?
Why are they different?
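The downtime column of an availability table is plain arithmetic: the unavailable fraction multiplied by the length of a year. A minimal sketch, assuming a 365-day year; the function names are illustrative and not taken from any tool:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget (in minutes) for an uptime percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


def downtime_days_per_year(availability_percent: float) -> float:
    """Same budget, expressed in days."""
    return downtime_minutes_per_year(availability_percent) / (24 * 60)


def print_table() -> None:
    """Print the downtime budget for a few common uptime targets."""
    for pct in (99.999, 99.99, 99.5, 99.0, 95.0):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime allows {minutes:.2f} minutes "
              f"({downtime_days_per_year(pct):.2f} days) of downtime per year")
```

Calling `print_table()` makes it easy to compare what a Data Warehouse SLA and a Hadoop SLA actually permit in hours of outage.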
We have good Hadoop Outcomes
• Smart Grid: Fraud / Broken Devices & Grid Traffic Projections
• Fraud
• Healthcare research
• Genomes and Healthcare – BRCA
• Connected Car - Tesla
Hadoop takes on DB-like Features
• Newly Added Features in Hadoop 3.0
  – Erasure Coding (HDFS-EC / HDFS-7285) is being introduced to Hadoop
  – Additional Standby NameNodes for increased resiliency (HDFS-6440)
• Future Features
  – Random read support from Indexed Name Node (HDFS-8555)
  – Disaster Recovery (HDFS-5442)
So...
• IF Hadoop is the Modern Database
AND
• IF Hadoop is taking on more Modern Database Features
AND
• Successful Outcomes are becoming more prolific...
Why do Hadoop operations and uptime / SLAs seem like such an afterthought on most clusters?
KPIs
• Why do companies that have VERY successful Data Warehouses, ETL processes, and KPI Dashboards have so little of THOSE for their Hadoop instance, which now generates all their Machine Learning and Data & Analytics?
What can go wrong?
• Forbes: “..haven’t taken into account some long-term or ongoing cost associated with the project…”
• Information Week: “…Unanticipated problems beyond the big data technology…”
• Computerworld: “…there are enterprises that underestimated the paradigm shift…”
An Intervention
• Why does the concept of 99.99% seem bad for a production Hadoop system?
• Why do solid KPIs around data collection and capture sound absurd?
• Since when did a backup copy of your primary analytics data become unnecessary?
• Is this just because Hadoop is about standing up cheap hardware?
• Why do companies need a catalyst before these things seem common again?
Why wouldn’t you want:
• Two clusters fully addressable with data replication located in separate geographies
• Data Re-silvering when additional capacity is added
• Complete fault tolerance in the environment and not just Data / Node redundancy to allow 4 Nines availability
• Operational scale that allows 24 x 7 support
[Figure: storage nodes labeled EMPTY, FULL, and BALANCED, illustrating how data is re-silvered (rebalanced) when capacity is added.]
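Why two fully addressable clusters help reach four nines can be shown with a short calculation. This is a sketch under a strong assumption that is not in the slides: the clusters fail independently, and service is available as long as at least one of them is up. The function names are illustrative.

```python
def parallel_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` independent replicas when any one of them suffices.

    `single` is the availability of one replica as a fraction (0.99 = 99%).
    """
    if not 0.0 <= single <= 1.0 or copies < 1:
        raise ValueError("single must be in [0, 1] and copies must be >= 1")
    return 1.0 - (1.0 - single) ** copies


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for component in components:
        result *= component
    return result
```

Under these assumptions, `parallel_availability(0.99)` is about 0.9999, while `serial_availability(0.9999, 0.99)` shows how a single non-redundant component (network, power, operations) in front of the clusters pulls the total back down. That is the reason the list asks for complete fault tolerance and not just data or node redundancy.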
What is my Idea - 1
• Separation of compute and storage.
  – Why do you think cloud Hadoop is able to offer better SLAs than on-premises Hadoop? It isn’t because of a ton of single-point-of-failure compute boxes. They separate compute and storage.
• Look at Infrastructure / Big Data as a Service centralization
  – Instead of trying to staff 25 Hadoop clusters for 24 x 7, centralize the team and provide QoS back to the applications
Data Gravity
• Data sets get bigger over time, and moving them becomes increasingly difficult
  – This leads to switching costs & lock-in
• Data is a strategic asset to enterprises with digital strategies
• Data becomes central – build around it
  – Applications tend to migrate toward the data
  – Apply advanced analytics to the data “in-place”
Traditional IT
[Figure: each application (Finance, Marketing, Operations, Sales; ERP, SCM, CRM) runs on its own servers and storage. Data is copied to separate analytics servers and storage, which creates storage silos and multiple Hadoop silos.]
THE PROBLEM OF DATA MOVEMENT
• To get statistically relevant results, a typical minimal required data set is about 100 TB.
• That’s also the recommended minimal Hadoop cluster size
• To copy 100TB over a dedicated 10 GBE link takes about 24 hours.
You need a Data Lake that understands POSIX/Windows and HDFS to avoid data movement (= in-place analytics)
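The copy-time figure can be checked with a few lines of arithmetic. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 Gbit/s = 10^9 bit/s); the `efficiency` parameter is an assumption added here to model protocol overhead and is not a number from the slides:

```python
def transfer_hours(data_terabytes: float,
                   link_gbit_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Estimate the hours needed to copy `data_terabytes` over one network link.

    `efficiency` is the assumed fraction of the nominal line rate that is
    actually usable for payload (1.0 = ideal line rate).
    """
    if data_terabytes < 0 or link_gbit_per_s <= 0 or not 0.0 < efficiency <= 1.0:
        raise ValueError("invalid input")
    data_bits = data_terabytes * 1e12 * 8
    usable_bits_per_second = link_gbit_per_s * 1e9 * efficiency
    return data_bits / usable_bits_per_second / 3600.0
```

`transfer_hours(100, 10)` comes out a little above 22 hours at ideal line rate, and an assumed 90% efficiency pushes it past 24 hours, which is consistent with the "about 24 hours" on the slide. Every additional copy of the data set repeats that cost.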
EMC DATA LAKE
[Figure: the applications (Finance, Marketing, Operations, Sales; ERP, SCM, CRM) and the analytics and mobile applications keep their own servers but share a single Isilon Data Lake instead of separate storage silos.]
WHAT IS A DATA LAKE?
A Data Lake is scale-out storage for data consolidation. It allows for Big Data accessibility via traditional and next-generation access methods to enable in-place analytics.
Isilon Data Lake Architecture
• Scale-out Data Lake
• OneFS integrates RAID, Volume Manager and Filesystem.
• Uses internal disks and spans a single file system across them
• Development started in the 2000s
• Extremely mature, based on FreeBSD
• Supports many access protocols
[Figure: clients connect over the LAN (GB/10GB Ethernet) to Isilon nodes with internal SAS disks. The nodes are interconnected via InfiniBand, and the cluster scales out by adding nodes.]
HDFS Implementation as a Protocol
• Multi-threaded daemon runs on all nodes
  – Services both NN and DN protocols
  – Translates HDFS RPCs to POSIX system calls
  – Stateless, underlying FS handles coherency
[Figure: inside a OneFS node, a request reaches a thread of isi_hdfs_d, passes through the VFS as a OneFS syscall, and the response is returned to the client.]
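The idea of a stateless daemon that turns HDFS requests into file system calls can be illustrated with a toy dispatcher. This is not how isi_hdfs_d is implemented; it is a sketch in which `handle_rpc` and its arguments are hypothetical, the operation names merely echo common HDFS client operations, and a local directory stands in for OneFS.

```python
import os
import tempfile


def handle_rpc(root: str, method: str, path: str, **args):
    """Translate one HDFS-style request into POSIX calls below `root`.

    Nothing is remembered between calls: every request is answered purely
    from the current state of the underlying file system.
    """
    local = os.path.join(root, path.lstrip("/"))
    if method == "mkdirs":
        os.makedirs(local, exist_ok=True)
        return True
    if method == "getFileInfo":
        info = os.stat(local)
        return {"length": info.st_size, "is_dir": os.path.isdir(local)}
    if method == "rename":
        # "dst" is a hypothetical argument name for the target path
        os.rename(local, os.path.join(root, args["dst"].lstrip("/")))
        return True
    if method == "delete":
        os.remove(local)
        return True
    raise ValueError(f"unsupported method: {method}")


def demo() -> dict:
    """Create a directory and a file, then look the file up by its HDFS-style path."""
    with tempfile.TemporaryDirectory() as root:
        handle_rpc(root, "mkdirs", "/data/in")
        with open(os.path.join(root, "data", "in", "a.txt"), "w") as handle:
            handle.write("hello")
        return handle_rpc(root, "getFileInfo", "/data/in/a.txt")
```

Because the handler keeps no state of its own, any node running it can answer any request, and a file written through another protocol (here, a plain `open`) is immediately visible to the HDFS-style call.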
HDFS IMPLEMENTED LIKE A NAS PROTOCOL
OneFS runs a daemon that speaks NameNode and DataNode natively
[Figure: the DFSClient on a Hadoop node talks to the OneFS clustered file system, in which every OneFS node provides both NameNode and DataNode services. The exchange is: 1) Request(“/file”), 2) Response (block locations), 3) GetBlock(block).]
ISILON - FOR ALL TYPES OF UNSTRUCTURED DATA
Isilon Data Lake:
• Archive & Backup Target
• File shares
• Home Directories
• BLOBS
• Design, Test & Manufacture
• Retail & Monetization
• Transaction
• Hadoop & Analytics
• Sync ‘n Share
• Application Test
• Content
• Social & Next-Gen
• Surveillance
SUPPORT FOR MULTIPLE ANALYTICS APPLICATIONS
[Figure: one Isilon cluster serves SMB, NFS, HTTP, FTP, HDFS 1.x and HDFS 2.x at the same time. Several MapReduce instances reach it through the name node and data node interfaces, while other clients use NFS and SMB.]
EXPAND DATA LAKE TO THE CLOUD
SmartPools Policy Example (CloudPools)

Data age             Tier      Location
<30 days             S210      Data center
30 days - 1 year     NL410     Data center
1 year - 2 years     HD400     Data center
>2 years             Cloud     Cloud provider

Other labels on the slide: >30 days, >1 year.
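An age-based tiering policy of this kind boils down to an ordered list of thresholds. The sketch below is not SmartPools or CloudPools configuration syntax; it is an illustration in which the tier boundaries are one reading of the slide labels (S210 below 30 days, NL410 up to 1 year, HD400 up to 2 years, cloud beyond that) and all names are hypothetical.

```python
from datetime import timedelta

# (upper age bound, tier) pairs in ascending order; boundaries are an
# assumption based on one reading of the slide labels.
TIERS = [
    (timedelta(days=30), "S210"),
    (timedelta(days=365), "NL410"),
    (timedelta(days=730), "HD400"),
]
CLOUD_TIER = "Cloud"


def tier_for_age(age: timedelta) -> str:
    """Return the tier for data of the given age (time since last access)."""
    if age < timedelta(0):
        raise ValueError("age must not be negative")
    for upper_bound, tier in TIERS:
        if age < upper_bound:
            return tier
    return CLOUD_TIER
```

For example, `tier_for_age(timedelta(days=100))` selects the NL410 tier and `tier_for_age(timedelta(days=1000))` selects the cloud tier. Keeping the thresholds in a data structure rather than in code mirrors the policy-driven approach the slide describes.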
CLOUD ENABLED DATA LAKE
[Figure: apps and users access the data lake in the data center; CloudPools extends it to a cloud provider. The diagram is annotated with “Access time”.]
• Parallel Replication
• Designed ground-up for scale-out storage
• Aggregate throughput scales with capacity
• Maintain consistent RPO over growing data sets
• Underlying FS knowledge
  – Snapshot integration
  – Block-level deltas
  – Rich metadata transfer
Automated Data Failover/Failback
Storage Considerations

STANDARD HADOOP CLUSTER                      HADOOP USING EMC ISILON DATA LAKE
100 Nodes Compute + DAS, 24 TB per Node      20 Nodes Compute + 800TB Isilon
/3 for Hadoop Copies                         Single Copy with Erasure Coding
800TB Usable, but rarely achieved            800TB Usable
5+ Cabinets                                  1 Cabinet
Spill space for ingestion and extraction     It is NAS
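The capacity figures in this comparison follow from two small formulas. A minimal sketch: the replication case uses the numbers on the slide (100 nodes, 24 TB each, 3 copies), while the 0.8 protection efficiency in the second function is an assumption taken from the "over 80% storage utilization" claim elsewhere in this deck, not a figure from this slide.

```python
def usable_tb_with_replication(nodes: int,
                               raw_tb_per_node: float,
                               copies: int = 3) -> float:
    """Usable capacity when every block is stored `copies` times."""
    if nodes < 0 or raw_tb_per_node < 0 or copies < 1:
        raise ValueError("invalid input")
    return nodes * raw_tb_per_node / copies


def raw_tb_for_usable(usable_tb: float, efficiency: float = 0.8) -> float:
    """Raw capacity needed for `usable_tb` at a given protection efficiency.

    `efficiency` is the usable fraction of raw capacity (assumed value).
    """
    if usable_tb < 0 or not 0.0 < efficiency <= 1.0:
        raise ValueError("invalid input")
    return usable_tb / efficiency
```

`usable_tb_with_replication(100, 24)` gives the 800 TB from the left-hand column, out of 2,400 TB raw. Under the assumed 80% efficiency, `raw_tb_for_usable(800)` needs roughly 1,000 TB raw for the same usable capacity, before any spill space is reserved.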
What is my Idea - 2
• Build a fully functioning cost model that includes all items you think are “free” but whose costs stop when you change the architecture.
  – Project-based funding is great until you want to centralize. Centralization models (BDaaS) work when you consider all the sundry costs typically excluded by project-based funding (i.e., 24 x 7 support for each cluster, all-in costs that appear free but are sunk)
What is my Idea - 3
• Think about “build all yourself” vs. “buy”
• Focus on analytics rather than infrastructure implementation, software dependencies, testing, etc.
• That has all been done already with EMC Big Data Systems and Big Data Solutions
• Using pre-validated, installed and tested solutions reduces complexity and increases reliability.
EMC BIG DATA PORTFOLIO

• Data Lake
• Data Lake Extensions
• Cloud Enabled

• Vblock
• VxRack
• VxRail

• Federation Business Data Lake
HIGH PERFORMANCE: PREDICTABLE, LOW LATENCY
[Figure: three I/O paths compared across the Hadoop, kernel and motherboard layers.
  Traditional HDD: HDFS → Filesystem → Buffer Cache → Device Driver → SATA Controller → Disk, about 10 ms
  Traditional PCIe SSD: HDFS → Filesystem → Buffer Cache → Device Driver → PCIe SSD, 1000-2000 µs
  DSSD: HDFS → PCIe, < 100 µs]
• DSSD Hadoop Plugin accesses flash directly
• 10X Throughput
• 1/13th Latency
• No Application Changes Required
EMC Business Data Lake
[Figure: layered stack.
  Management components: Platform Manager, Data Governor, Data Manager, Ingest Manager, Analytics Manager
  Open Analytics Toolbox; Data and Analytics Catalog
  Pivotal Big Data Suite: Advanced Analytics, Applications at Scale, Data Processing, Hadoop
  Products shown: Greenplum Database, HAWQ, Spring XD, Pivotal HD, Spark, Redis, RabbitMQ, GemFire; BDS on Pivotal Cloud Foundry
  VMware vCloud Suite
  EMC Data Lake Foundation: Isilon + ECS; VCE Vblock | XtremIO | Data Domain]
See the demos at http://www.fbdldemo.com/
Watch out for:
Thursday, April 14th, 15:00 UTC
• Hadoop Everywhere: Geo-Distributed Storage for Big Data
Presenters:
• Nikhil Joshi, EMC
• Vishrut Shah, EMC
A Remark on data locality
• U. C. Berkeley’s AMPLab declared data locality dead in 2011
• Cloudera has declared data locality dead in Hadoop 3.0 with HDFS-EC.
• Gartner has declared Hadoop dead due to its limits
• Hadoop will only grow, and dependency on it will increase going forward.
• A catalyst may be the next time I see you and uptime for Hadoop is your main concern.
Isilon Scale-Out NAS
• Simple to manage: Single file system, single volume, global namespace
• Massively scalable: Scales from 16 TB to over 50 PB in a single cluster; 200GB/s throughput, 3.75M IOPS
• Unmatched efficiency: Over 80% storage utilization, automated tiering and SmartDedupe
• Enterprise data protection: Efficient backup and disaster recovery, and N+1 thru N+4 redundancy
• Robust security and compliance options: RBAC, Access Zones, WORM data security, File System Auditing; Data At Rest Encryption with SEDs, STIG hardening; CAC/PIV Smartcard authentication, FIPS OpenSSL support
• Operational flexibility: Multi-protocol support including NFS, SMB, HTTP, FTP and HDFS; Object and Cloud computing including OpenStack Swift
Elastic Cloud Storage (ECS)
• Geo-Scale: Geo-replicated and distributed to multiple locations
• Massively scalable: Scales to billions of objects in a single namespace
• Support for all file sizes: Support for individual files of any size.
• Multi-Tenant: Efficient backup and disaster recovery, and N+1 thru N+4 redundancy
• HDFS Compatible: Hortonworks Certified HDFS Compatible File System
• Swift Compatible: Natively supports OpenStack storage
• Native Cloud Interface: Natively works with existing cloud protocols like S3 and Azure.