2013 SNIA Analytics and Big Data Summit. © HP Storage Division. All Rights Reserved.
Big Data: A Storage Systems Perspective
Muthukumar Murugan Ph.D. HP Storage Division
In this talk …
Big data storage: Current trends
Issues with current storage options
Evolution of storage to support big data applications
Hadoop is not a solution to a “data” problem!
Big Data: The Storage Concerns!
Volume: Petascale / Exascale data
Velocity: Frequency of generation
Variety: Largely unstructured/semi-structured
Value: Frequency of analysis
Computation Model: Parallel tasks, scale-out architecture
How much are you worth to Zuckerberg?
A typical big data ecosystem
Storage Framework (e.g. HDFS, Cassandra)
High Level Language (e.g. Pig Latin, Hive QL)
Structured databases (e.g. HBase, Hive)
Data Mining and Analytics Applications
Storage (DAS/Networked)
Big Data Storage Model – 1
Centralized metadata node
Datanodes store data in local disks
Clients talk to the metadata node and then to the datanodes
e.g. Hadoop
[Diagram: one Client, one Name Node, three Data Nodes]
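The read path of this model (metadata lookup first, then data fetch) can be sketched roughly as follows. This is an illustrative sketch only, not Hadoop's API: the classes NameNode, DataNode and Client, their methods, and the round-robin placement are all hypothetical.

```python
class DataNode:
    """Holds block contents; a dict stands in for the node's local disks."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Centralized metadata: maps a file path to an ordered list of (block_id, datanode)."""

    def __init__(self):
        self.block_map = {}

    def add_block(self, path, block_id, datanode):
        self.block_map.setdefault(path, []).append((block_id, datanode))

    def locate(self, path):
        return list(self.block_map[path])


class Client:
    def __init__(self, namenode):
        self.namenode = namenode

    def write(self, path, chunks, datanodes):
        # Round-robin placement is an assumption made for this sketch.
        for i, chunk in enumerate(chunks):
            node = datanodes[i % len(datanodes)]
            block_id = f"{path}#{i}"
            node.write_block(block_id, chunk)
            self.namenode.add_block(path, block_id, node)

    def read(self, path):
        # Step 1: ask the metadata node where the blocks live.
        locations = self.namenode.locate(path)
        # Step 2: fetch each block directly from its datanode.
        return b"".join(node.read_block(block_id) for block_id, node in locations)


nodes = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode()
client = Client(nn)
client.write("/logs/a", [b"alpha-", b"beta-", b"gamma"], nodes)
result = client.read("/logs/a")
```

Every read passes through the single metadata node before touching any datanode, which is the defining property of this model.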
Big Data Storage Model – 2
No centralized metadata node
Datanodes store data in local disks
Clients routed to the appropriate node based on hash prefix
e.g. Cassandra
[Diagram: one Client, four Data Nodes, hash-prefix-based routing]
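Hash-prefix routing without a metadata node can be illustrated with a few lines of Python. This is a simplified sketch, not Cassandra's partitioner: the function name route, the use of MD5, and the equal-sized prefix ranges per node are assumptions chosen for brevity.

```python
import hashlib


def route(key, nodes, prefix_bits=16):
    """Pick the node that owns the hash prefix of `key` (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Keep only the leading `prefix_bits` bits of the hash.
    prefix = int.from_bytes(digest[:4], "big") >> (32 - prefix_bits)
    # Assumption: each node owns an equal, contiguous slice of the prefix space.
    index = prefix * len(nodes) // (1 << prefix_bits)
    return nodes[index]


nodes = ["node-a", "node-b", "node-c", "node-d"]
owner = route("user:42", nodes)
```

Any client can compute the owner locally from the key alone, so no lookup against a central node is needed.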
Computation Model
[Diagram: four Data + Compute Nodes, each running tasks (e.g. Map/Reduce), with shuffle steps between the nodes]
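As a rough illustration of tasks running next to the data with a shuffle in between, here is a single-process word count. It is a sketch under stated assumptions: the four lists stand in for the data held by four nodes, and map_task, shuffle and reduce_task are hypothetical names, not a framework API.

```python
from collections import defaultdict


def map_task(lines):
    """Runs on the node that holds `lines`; emits (word, 1) pairs."""
    return [(word, 1) for line in lines for word in line.split()]


def shuffle(map_outputs, num_reducers):
    """Groups values by key and assigns each key to one reducer partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            # Deterministic toy partitioner: sum of code points modulo reducers.
            partitions[sum(map(ord, key)) % num_reducers][key].append(value)
    return partitions


def reduce_task(partition):
    """Sums the values collected for each key in one partition."""
    return {key: sum(values) for key, values in partition.items()}


# Each inner list stands in for the local data of one data + compute node.
node_data = [["big data storage"], ["big data"], ["storage systems"], ["big"]]
map_outputs = [map_task(lines) for lines in node_data]

counts = {}
for partition in shuffle(map_outputs, num_reducers=2):
    counts.update(reduce_task(partition))
```

The shuffle is the only step that moves data between nodes; the map and reduce steps work on node-local input.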
Big Data Storage Access Patterns
Typically write-once, read-many workloads
Metadata lookups, object reads
Large blocks/objects ≈ 64 MB to 128 MB (e.g. Hadoop MR)
Small accesses, e.g. HBase, Cassandra
[Diagram: Objects accessed through Get(), Put(); Objects map to Files in the Local File System, stored on Local Disks]
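The layering of Get()/Put() over files in a local file system can be sketched as below. The class LocalObjectStore and its one-file-per-object layout are assumptions for illustration; real frameworks pack and index objects differently.

```python
import hashlib
import os
import tempfile


class LocalObjectStore:
    """Hypothetical object layer: each object becomes one file on the local file system."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Hash the key so that any key string yields a valid file name.
        name = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.root, name)

    def put(self, key, data):
        with open(self._path(key), "wb") as f:
            f.write(data)

    def get(self, key):
        with open(self._path(key), "rb") as f:
            return f.read()


with tempfile.TemporaryDirectory() as root:
    store = LocalObjectStore(root)
    store.put("sensor/2013-04-01", b"reading=42")
    value = store.get("sensor/2013-04-01")
    stored_files = os.listdir(root)
```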
Issues with Existing Storage Architecture
DAS: Not so smart!
Distributing data all over the cluster makes data management difficult
Replicated data: wastage of storage space
Tightly coupled computation and storage: inflexible infrastructure
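To put a number on the replication point, the short calculation below assumes three-way replication (the HDFS default) and a 100 TB data set. Both figures are illustrative assumptions, not numbers from the talk.

```python
def raw_capacity_tb(logical_tb, replication_factor=3):
    """Raw disk capacity needed to hold `logical_tb` of user data."""
    return logical_tb * replication_factor


def overhead_fraction(replication_factor=3):
    """Share of raw capacity that holds extra copies rather than unique data."""
    return (replication_factor - 1) / replication_factor


raw = raw_capacity_tb(100)      # raw TB needed for 100 TB of data
wasted = overhead_fraction()    # fraction of raw capacity spent on replicas
```

Under these assumptions, 100 TB of data occupies 300 TB of disk, and two thirds of the raw capacity holds replicas.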
Networks vs Disks: The blame game
Over the last decade …
Datacenter network speeds have dramatically improved
10 Gb/s Ethernet, optical networks
Flat network topologies
Soon: 40 Gb/s, 100 Gb/s Ethernet will be common
Disks are barely keeping up …
Takeaway: Data locality will no longer be an issue!
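A back-of-the-envelope comparison shows why the network stops being the bottleneck. The 10 Gb/s link speed is from the slide; the 128 MB block size matches the block sizes mentioned earlier; the 100 MB/s disk throughput is an assumed round figure for a single hard disk, not a number from the talk.

```python
def transfer_seconds(size_mb, rate_mb_per_s):
    """Time to move `size_mb` megabytes at a sustained rate."""
    return size_mb / rate_mb_per_s


BLOCK_MB = 128
NETWORK_MB_PER_S = 10_000 / 8   # 10 Gb/s = 1250 MB/s, ignoring protocol overhead
DISK_MB_PER_S = 100             # ASSUMPTION: sequential throughput of one hard disk

network_s = transfer_seconds(BLOCK_MB, NETWORK_MB_PER_S)
disk_s = transfer_seconds(BLOCK_MB, DISK_MB_PER_S)
```

With these inputs, reading the block from disk takes about 1.28 s while sending it over the network takes about 0.10 s, so a remote read is dominated by the disk and not by the network.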
Changing times, changing values
Value of data is constantly changing Not all data is equally popular
Recent analysis of large scale datacenters [1]
Only 10-30% of data is most popular
Differentiated storage for big data Impossible with DAS Needs sophisticated storage
12
Least valuable data
Most valuable data (frequency of analysis, time
of generation etc.) [1] Ananthanarayan et al. , HotOS 2011.
New applications, new requirements
Traditionally
Sequential access, large blocks
Task-local data access, batch jobs
Aging data, replication
Remote accesses dominate
Real time queries and online jobs
Row/record accesses in indexed NoSQL databases, e.g. Accumulo, Hypertable etc.
Revisiting Big Data Storage
Rethinking storage for big data
Shared-nothing DAS vs shared storage
Management vs scalability
Storage bandwidth and latency capacities
“Converging” multiple storage silos
[Diagram labels: Primary Cluster, Analytics Cluster, Storage Management Layer, Datacenter]
Sharing is a virtue!
Shared nothing is extreme, inefficient but scalable
Shared storage resources
Spindles, caches, network bandwidth
Scale-out storage systems
Scale-out object/block/file storage systems
[Diagram labels: Shared Nothing, Traditional Enterprise, Shared Storage, Big Data Storage]
HA with performance guarantees
Performance guarantees
Latency, BW
Data reliability and failure resilience guarantees
Big data archival with relaxed performance numbers
Compression/deduplication
[Diagram labels: Storage Manager, Low Perf., Archival]
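For the archival tier, compression and deduplication can be combined as in the following sketch. It assumes fixed-size chunks identified by their SHA-256 digest and compressed with zlib; the class DedupArchive and the tiny chunk size in the usage lines are hypothetical choices for illustration, and production systems typically use variable-size chunking and persistent indexes.

```python
import hashlib
import zlib


class DedupArchive:
    """Hypothetical archive: stores each distinct chunk once, compressed."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # SHA-256 hex digest -> compressed chunk
        self.files = {}    # file name -> ordered list of chunk digests

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if an identical one is not already present.
            if digest not in self.chunks:
                self.chunks[digest] = zlib.compress(chunk)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        return b"".join(zlib.decompress(self.chunks[d]) for d in self.files[name])


archive = DedupArchive(chunk_size=4)
archive.put("a.log", b"AAAABBBBAAAA")
archive.put("b.log", b"BBBBCCCC")
restored = archive.get("a.log")
```

The two files contain five chunks in total, but only three distinct chunks are stored.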
Storage Federation
Federated storage management
Integrate multiple storage “islands” into an “archipelago”
Varying performance/cost characteristics
Seamless data migration
Dynamic workload characteristics
Cost/value model
[Diagram label: Storage Manager Software]
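One way to picture a cost/value model driving migration across federated tiers is the sketch below. The tier names, the access-rate thresholds and the functions choose_tier and plan_migrations are all assumptions made for illustration; the talk does not specify a particular policy.

```python
# Tiers ordered from fastest/most expensive to slowest/cheapest.
# Each entry: (tier name, minimum accesses per day that justify the tier).
TIERS = [
    ("flash", 100),
    ("disk", 1),
    ("archive", 0),
]


def choose_tier(accesses_per_day):
    """Return the most expensive tier whose threshold the object still meets."""
    for name, threshold in TIERS:
        if accesses_per_day >= threshold:
            return name
    return TIERS[-1][0]


def plan_migrations(placement, access_rates):
    """Compare current placement with the target tier and list the moves needed."""
    moves = {}
    for obj, current in placement.items():
        target = choose_tier(access_rates.get(obj, 0))
        if target != current:
            moves[obj] = (current, target)
    return moves


placement = {"clicks-today": "disk", "clicks-2011": "disk", "index": "flash"}
access_rates = {"clicks-today": 500, "clicks-2011": 0, "index": 250}
moves = plan_migrations(placement, access_rates)
```

Re-running the planner as access rates change gives the dynamic behaviour the slide refers to: hot data moves up, cold data moves down, and data already in the right tier stays put.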
Heterogeneous storage clients
Primary workloads
Offline batch processing analytics jobs
Real time online analytics queries
[Diagram labels: Primary workloads, Real time Analytics, Offline Analytics, Converged Storage System]
Data Management options
Storage-aware big data infrastructure
Storage managing big data blocks
Storage tracks blocks
Dynamically migrates blocks
Big data application aided storage
[Diagram labels: Analytics and computation, Storage System]
Storage technology trends
Flash: Flashcache, all-flash arrays etc.
Interleaved accesses
Non-volatile Memory
Low latency, persistent tier
Fast SAN
Fibre Channel, 40 Gb/s iSCSI etc.
Low level changes
Revisit block device access semantics
Objects – files – blocks interactions
NVM / flash
Access protocols, application modifications
Shared caches, proportional caching
Better I/O schedulers
Summary and Conclusion
Needed: A change in big data storage perspective
Converged storage solutions
Changing big data application characteristics
Emerging technologies and performance improvements
Overhaul traditional disk access semantics and protocols