Exploring the Wider World of Big Data - Vasalis Kapsalis
DESCRIPTION
Every second of every day you hear about electronic systems creating ever-increasing quantities of data. Systems in markets such as finance, media, healthcare, government and scientific research feature strongly in the Big Data processing conversation, and extracting business value from Big Data is forecast to bring customer and competitive advantage. In this session, hear Vas Kapsalis, NetApp Big Data Business Development Manager, discuss his views and experience on the wider world of Big Data.
TRANSCRIPT
The Big Data Landscape
Entering a New Era of Scale
Convergence of Technology Disrupters Creates Opportunity
NetApp Confidential - Internal Use Only
Cloud
Social
Mobile
Internet of Things
Big Data
Traditional Structured and Replicated Data mix shift is driven by:
− Efficiency (deduplication, compression, thin provisioning, SATA)
− Growth in a new category of storage consumers using cloud / content depots
Unstructured data (files and objects in traditional storage, plus content depots / cloud) will be the largest storage category by 2014
− Content depots / Cloud expected to be 95% unstructured data
Revenue Share by Segment
Traditional structured
Traditional replicated
Traditional unstructured
Content depots / public cloud
Unstructured Data Growth Dominates
Not Even to The “Peak”
Estimated size of the digital universe in 2020: 40 zettabytes
5 billion smartphones
30 billion pieces of new content added to Facebook per month
Hype cycle chart (visibility over time): Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity
80% unstructured data
Big Data Is All Data From Everywhere
Transactional Data
Machine Data
Social Data
Enterprise Content
Fundamentally changes your business
The Jet way
The Call Center
Big Data Vendor Landscape: A Lot of Hype and Buzz – Everyone Is Jumping In
Market is expected to grow from $3.2 billion in 2010 to $16.9 billion in 2015
NoSQL: $2 billion per annum by 2015
Most firms are taking a pragmatic approach
Big data is in the very early stages of maturity
Best practices are not mature
IDC Big Data Survey
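As a rough sanity check on the forecast above, the two market figures imply a compound annual growth rate. The sketch below assumes smooth year-over-year growth between 2010 and 2015, which the slide does not state; the function name `cagr` is illustrative only.

```python
# Compound annual growth rate implied by the slide's two market figures.
start_value = 3.2    # USD billions in 2010 (from the slide)
end_value = 16.9     # USD billions in 2015 (from the slide)
years = 2015 - 2010


def cagr(start, end, periods):
    """Return the constant yearly growth rate that turns start into end."""
    return (end / start) ** (1 / periods) - 1


rate = cagr(start_value, end_value, years)
print(f"Implied CAGR: {rate:.1%}")
```

Under that assumption the implied rate comes out just under 40% a year.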
Funding for Hadoop and NoSQL (chart: Jan-08 to Nov-11, vertical axis 0 to 400)
Funding rounds shown:
Cloudera series B
MapR series A
Cloudera series C
10gen series D
MapR series B
DataStax series B
Neo Technology series A
Opera Solutions series A
Platfora series A
Couchbase series C
Cloudera series D
"The Big Data market is expanding rapidly …
For technology buyers, opportunities exist to
use Big Data technology to improve
operational efficiency and to drive innovation.
Use cases are already present across
industries and geographic regions."
Dan Vesset, Vice President, IDC
451 Research
Data Growth Impact on Business
Complexity
Volume
Speed
Business Velocity
Inflection Point
Information Becomes a Propellant to Business
Data Becomes a Burden to IT Infrastructure
Timeline: 2010 to 2020
“Big Data” refers to datasets whose size is beyond the ability of typical tools to capture, store, manage and analyze
Why Should You Care? It’s the Value of Your Data
Top line revenue
– Leverage their data assets into business advantage
Bottom line savings
– Lower the cost of compliance
– Manage ever-growing data efficiently
Over 1 PB of data
Growth of 175% YOY
90 days of data within 24 hours of a failure
5 billion records
Anywhere, anytime
Faster time to market
50% increase in revenue
NetApp Big Data
Why NetApp? Practical solutions that solve today’s problems
Get Control
NetApp helps you turn your exploding data from threat to opportunity. Manage your data effectively and affordably.
Break Through
Break through the limits. With NetApp, you can take on even the most massive and complex data projects.
Gain Insight
Turn insight to action. NetApp helps you get to clarity and insight faster and more reliably.
Experience Managing Data at Scale
100 Customers
50 Customers
10 Customers
4 Customers
100 PB
50 PB
20 PB
10 PB
NetApp’s Largest Customer
NetApp Big Data Strategy
Best-of-breed storage for Big Data applications
Create deep integration and value add
Build on open standards with best-in-class partnerships
Validate with ecosystem leaders
– Complete server, network and storage “racks”
– Delivered via trusted high-value partners
Open
Best-of-Breed
Choice
Industry-Leading Storage Innovation
Flash Arrays for ultra-high performance
E-Series for price-performance at scale
StorageGRID for web scale object storage
Clustered Data ONTAP for Shared Infrastructure
Corporate Data Centers
Cloud Data Centers
Big Content: Retain forever, multi-site distribution
Big Bandwidth: Ingest, Process, Stream
Big Analytics: Reduce, Analyze, Report
Cloud Private/Public
Retain, Distribute
Big Data Building Blocks
Applications
Extract
Retain, Distribute
Store
Retrieve
Analytics Oriented Business Processing
RDBMS (General Purpose DB)
– Data organized to align with schemas
– Fixed consistency model
– Complex queries supported
– Volume-based data management
Columnar DB (Analytics Oriented)
– Data organized in column files
– Tabular interface without rigid schemas
– Fast column scans
– Multiple consistency models
– Transaction-granular data management
Document Store (Transaction Oriented)
– Data organized in data structures in memory
– Schemaless transaction store for structured data
– High transactional performance
K-V Store (Metadata Service Oriented)
– Data organized in key-value pairs
– Suitable for metadata services with CMSs
– Associated with object services
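To make the difference between these data organizations concrete, here is a minimal in-memory sketch of the same records held row-wise, column-wise and as key-value pairs. The sample records, field names and the helper `to_columns` are hypothetical and do not describe any particular database product.

```python
# Hypothetical sample records; field names are illustrative only.
rows = [
    {"id": 1, "region": "EMEA", "revenue": 120.0},
    {"id": 2, "region": "APAC", "revenue": 80.0},
    {"id": 3, "region": "EMEA", "revenue": 200.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: the aggregate walks every record to pick out one field.
total_from_rows = sum(record["revenue"] for record in rows)

# Column layout: the same aggregate reads a single list of values.
total_from_columns = sum(columns["revenue"])

# Key-value layout: each record is addressed only by its key.
kv_store = {record["id"]: record for record in rows}

print(total_from_rows, total_from_columns, kv_store[2]["region"])
```

The column layout is what makes "fast column scans" possible: an analytic query over one attribute never has to touch the other attributes.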
Transaction Processing
Realtime Analytics
Business Applications
Memory Ingest
Disk/Flash Tier
Query-based Retrieval
Commit
Federated Database Store (Build/Buy/Partner)
Persisted Commit
Transaction-granular data resilience, recoverability & protection at line speeds
Data organization optimized by query interface
Performance-optimized query service
Analytics Technologies to look out for!
Columnar DBs (Analytics Oriented)
Document Stores (Transaction Oriented)
Key-Value Stores (Content/Object Service)
Graph DBs (Niche)
Relational DBs (row-oriented RDBMSs)
Datacenter / Multi-Datacenter
• ACID constrained
• Complete query set
• Limited availability
• High consistency
• Rich query set
• Good availability
• Tuneable consistency
• Limited query set
• Highest/WAN availability
Old World / New World
Analytics & Enterprise Apps Environment
Sensors
Applications
Logs
Location/GPS
Mobile Devices
Storage (All other storage, i.e. internal DAS)
Content
Repositories Shared Storage
Infrastructure
Storage File Systems
Data Management
Analytics
Applications
Reporting/Dashboard/Visualization
ETL
OLAP
OLTP
Other Data Sources
OLAP ETL
Storage Data Management
NFS/sNFS/pNFS
NetApp Confidential – Limited Use
Some problems require an Enterprise Class Hadoop solution
Enterprise Class Hadoop
Packaged ready-to-deploy modular Hadoop cluster
– The data has intrinsic value ($$$)
– Capacity and compute requirements expanding very fast
– Higher storage performance
– Real human consequences if the system fails (threats, treatments, financial losses)
– System has to allow for asymmetric growth
Commodity, Off the Shelf Hadoop
Values associated with early adopters of Hadoop
– Social media space
– Contributors to Apache
– Strong bias to JBOD
– Skeptical of ALL vendors
Enterprise Class Hadoop
Packaged ready-to-deploy modular compute intensive Hadoop cluster
– Compute intensive applications
– Video, imaging analysis
– Extremely tight Service Level expectations
– Severe financial consequences if the data analytic application or service runs late
Enterprise Class Hadoop
Packaged ready-to-deploy modular storage intensive Hadoop cluster
– Storage intensive applications
– Additional CPUs do not help run time
– Financial ticker data analysis
– Extremely tight Service Level expectations
– Need deeper storage per datanode
Axes: Compute Power vs. Storage Capacity
NetApp Open Solution for Hadoop
Easy to Deploy, Manage and Scale
Uses High Performance storage
– Resilient and Compact
– RAID Protection of Data
– Less Network Congestion
Raw Capacity and density
– 120TB or 180TB in 4U
– Fully serviceable storage system
Reliability
– Hardware RAID and hot swap prevent job restarts caused by a node going offline after a media failure
– Reliable metadata (Name Node)
Enterprise Class Hadoop
Map Reduce
NameNode
DataNodes / TaskTracker
DataNodes / TaskTracker
:
HDFS
Secondary NameNode
4 separate shared nothing partitions
JobTracker
FAS2040
E2660
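For readers new to the Map Reduce layer named in the diagram above, the following is a minimal single-process sketch of the map, shuffle and reduce phases using a word count. It is plain Python, not the Hadoop API; the function names and sample input are illustrative assumptions, and a real cluster would distribute these phases across DataNodes / TaskTrackers.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


sample = ["big data is all data", "data from everywhere"]
counts = word_count(sample)
print(counts)
```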
NetApp Open Solution for Hadoop Validated Benefits for the Enterprise
Improved cluster performance by 62%
Completed jobs 200% faster under drive failure
Delivered linear performance scalability as nodes and data grew
Per-server capacity increase of 1.5x
The NetApp Open Solution for Hadoop improves capacity and performance efficiency and recoverability compared to a server-based DAS deployment.
- ESG, 2012
Optimizing Performance and Staying Healthy
Source: Garrett, Brian and Lockner, Julie, “NetApp Open Solution for Hadoop”, ESG Report,
May 2012, http://bit.ly/LyYG0t
Network Overhead
Useful Work
Availability and Resiliency
Burst Handling and Queuing
Oversubscription Ratio
Data Node Network Speed
Network Latency
Source: Cisco: http://bit.ly/yL54Ts
DAS vs. NetApp footprint
DAS option: 2RU, CPU: 2x8 cores, RAM: 48 GB, disk: 24 TB
1 rack (42RU): 20 servers (320 cores, 960 GB RAM, 480 TB)
6 racks: 120 servers (1920 cores, 5.7 TB RAM, 2.8 PB storage)
NetApp option: 1RU, CPU: 2x8 cores, RAM: 48 GB, disk: 2 TB (8 TB max; optional PXE boot, diskless)
1 rack (42RU)
– CPU and memory: 24 servers (6:1), 384 cores, 1.152 TB RAM
– Storage: 4 E2660, 720 TB
4 racks: 96 servers (1536 cores, 4.6 TB RAM, 2.8 PB storage)
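The rack totals above can be reproduced from the per-server figures. The sketch below is a back-of-the-envelope check, assuming each E2660 contributes 180 TB (inferred from the slide's 720 TB across 4 arrays) and decimal units (1 TB = 1000 GB); the helper `cluster_totals` is illustrative only.

```python
# Per-server figures quoted on the slide for both options.
CORES_PER_SERVER = 2 * 8    # 2 CPUs x 8 cores
RAM_GB_PER_SERVER = 48


def cluster_totals(racks, servers_per_rack, storage_tb_per_rack):
    """Aggregate servers, cores, RAM and storage across identical racks."""
    servers = racks * servers_per_rack
    return {
        "servers": servers,
        "cores": servers * CORES_PER_SERVER,
        "ram_tb": servers * RAM_GB_PER_SERVER / 1000,
        "storage_pb": racks * storage_tb_per_rack / 1000,
    }


# DAS option: 20 servers per rack with 24 TB of internal disk each, 6 racks.
das = cluster_totals(racks=6, servers_per_rack=20, storage_tb_per_rack=20 * 24)

# NetApp option: 24 servers per rack plus 4 E2660 arrays, 4 racks.
# Assumption: 180 TB per E2660, which matches the slide's 720 TB per rack.
netapp = cluster_totals(racks=4, servers_per_rack=24, storage_tb_per_rack=4 * 180)

for name, totals in (("DAS", das), ("NetApp", netapp)):
    print(name, totals)
```

Both options land on the same 2.88 PB of storage; the slide's 5.7 TB, 4.6 TB and 2.8 PB are these totals truncated to one decimal place.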
Case Study: ASUP NetApp Analytics
Gateways
• 800K ASUPs every week
• 40% coming over the weekend
Extract, Transform, Load
Data Warehouse
Data Mart
Data Mart
ETL
• Data needs to be parsed and loaded in 15 minutes
Data Warehouse
• Only 5% of data goes into the data warehouse (the rest is unstructured), yet it is growing by 7-10 TB per month
• No easy way to access this unstructured content
Reporting
• Numerous mining requests are not satisfied currently
• Huge untapped potential of valuable insight
Finally, the incoming load doubles every 16 months!
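To put the doubling claim in perspective, the sketch below projects the load forward. It assumes smooth exponential growth and that "incoming load" tracks the weekly ASUP count quoted earlier in this case study; neither assumption is confirmed by the slide, and `load_multiplier` is a hypothetical helper.

```python
def load_multiplier(months, doubling_period_months=16):
    """Growth factor after the given number of months of steady doubling."""
    return 2 ** (months / doubling_period_months)


# Assumption: the 800K weekly ASUPs from the slide is the baseline load.
weekly_asups_now = 800_000

for months in (16, 32, 48):
    projected = weekly_asups_now * load_multiplier(months)
    print(f"after {months} months: about {projected:,.0f} ASUPs per week")
```

On those assumptions the load grows eightfold in four years, which is why the 15-minute parse-and-load target becomes hard to hold with a fixed ETL pipeline.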
NetApp Proprietary - Limited Use Only
Case Study: NetApp Large-Scale Analytics
Challenge: 4 weeks to run a query on 24 billion unstructured records
NetApp solution: 10-node Hadoop cluster
Benefit: time reduced from 4 weeks to 10.5 hours
Challenge: impossible to run a query on 240 billion unstructured records
Benefit: previously impossible, now achievable in just 18 hours
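The case-study figures translate into a speedup and a throughput as follows. This is simple arithmetic on the numbers quoted above, assuming a 4-week run means 4 x 7 x 24 hours of wall-clock time; the variable names are illustrative.

```python
HOURS_PER_WEEK = 7 * 24

# Smaller query: 24 billion records, 4 weeks before vs. 10.5 hours after.
before_hours = 4 * HOURS_PER_WEEK
after_hours = 10.5
speedup = before_hours / after_hours

# Throughput in records per hour for the two Hadoop runs on the slide.
throughput_24b = 24e9 / after_hours
throughput_240b = 240e9 / 18

print(f"speedup: {speedup:.0f}x")
print(f"24B-record run:  {throughput_24b:.2e} records/hour")
print(f"240B-record run: {throughput_240b:.2e} records/hour")
```

The larger run processes ten times the records in well under twice the time, so its per-hour throughput is several times higher than the smaller run's.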
Big Data System Integrators Solutions Built on NetApp®
Integrated Big Data Solutions and Expertise
Planning and implementation expertise for Big Data
Turn-key solution stacks and Big Data services
Next Steps - Team with the Experts
Strategic Assessment
– Business goals
– Data growth needs
– Use case discovery (partner delivery)
Consult
– Solution architecture and design (NetApp delivery)
Deploy
– Installation and implementation (NetApp delivery)
– Solution implementation (partner delivery)
Support options:
Global support available from NetApp and partners