
2013 SNIA Analytics and Big Data Summit. © HP Storage Division. All Rights Reserved.

Big Data: A Storage Systems Perspective

Muthukumar Murugan, Ph.D., HP Storage Division


In this talk …

Big data storage: Current trends

Issues with current storage options

Evolution of storage to support big data applications


Hadoop is not a solution to a “data” problem!


Big Data: The Storage Concerns!

Volume: Petascale / Exascale data

Velocity: Frequency of generation

Variety: Largely unstructured/semi-structured

Value: Frequency of analysis

Computation Model: Parallel tasks, scale-out architecture

How much are you worth to Zuckerberg?

A typical big data ecosystem


Storage Framework (e.g. HDFS, Cassandra)

High Level Language (e.g. Pig Latin, Hive QL)

Structured databases (e.g. HBase, Hive)

Data Mining and Analytics Applications

Storage (DAS/Networked)


Big Data Storage Model – 1

Centralized metadata node

Datanodes store data in local disks

Clients talk to metadata node and then datanodes

e.g. Hadoop

[Figure: a Client, a Name Node, and three Data Nodes]
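The two-step access path in this model (metadata node first, then datanodes) can be sketched in a few lines. This is a minimal in-memory illustration, not Hadoop's API: `DataNode`, `MetadataNode`, `write_file`, `read_file` and the round-robin placement are all hypothetical names and choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical datanode; a dict stands in for its local disks."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class MetadataNode:
    """Hypothetical centralized metadata node: path -> ordered block locations."""
    block_map: dict = field(default_factory=dict)

    def locate(self, path):
        return self.block_map[path]


def write_file(meta, nodes, path, chunks):
    """Place chunk i on datanode i % len(nodes) (assumed policy) and record it."""
    locations = []
    for i, chunk in enumerate(chunks):
        node = nodes[i % len(nodes)]
        block_id = f"{path}#{i}"
        node.blocks[block_id] = chunk
        locations.append((block_id, node))
    meta.block_map[path] = locations


def read_file(meta, path):
    """Step 1: ask the metadata node. Step 2: fetch each block from its datanode."""
    return b"".join(node.blocks[block_id] for block_id, node in meta.locate(path))
```

Every read goes through `MetadataNode.locate` first, which is what makes the metadata node a central point in this model.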

Big Data Storage Model – 2

No centralized metadata node

Datanodes store data in local disks

Clients routed to appropriate node based on hash prefix

e.g. Cassandra

[Figure: a Client and four Data Nodes, with hash prefix based routing]
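Routing by hash prefix without a metadata node can be sketched as follows. This is a simplified illustration and not Cassandra's actual partitioner: the one-byte prefix, the use of SHA-256, the equal-range split, and the function names are all assumptions made for the example.

```python
import hashlib

PREFIX_SPACE = 256  # assumed one-byte hash prefix: values 0..255


def hash_prefix(key):
    """First byte of the key's SHA-256 digest, used as the routing prefix."""
    return hashlib.sha256(key.encode("utf-8")).digest()[0]


def owner_of_prefix(prefix, nodes):
    """Split the prefix space into equal contiguous ranges, one per node."""
    return nodes[prefix * len(nodes) // PREFIX_SPACE]


def route(key, nodes):
    """Any client can compute the owner locally; no metadata node is consulted."""
    return owner_of_prefix(hash_prefix(key), nodes)
```

Because the mapping is a pure function of the key and the node list, every client computes the same owner for the same key.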

Computation Model

[Figure: four Data + Compute Nodes, each running tasks (e.g. Map/Reduce), with Shuffle steps between nodes]
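The task/shuffle structure of this computation model can be illustrated with a single-process word count. It is a sketch under stated assumptions, not Hadoop code: each inner list stands in for the data local to one node, and `map_task`, `shuffle`, `reduce_task`, `run_job` and the CRC32 partitioning rule are hypothetical.

```python
import zlib
from collections import defaultdict


def map_task(local_lines):
    """Runs where the data lives: emit (word, 1) for every word in local lines."""
    return [(word, 1) for line in local_lines for word in line.split()]


def partition_of(key, num_reducers):
    """Deterministic partitioning rule (assumed): CRC32 of the key mod reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapped_per_node, num_reducers):
    """Move every (key, value) pair to the reducer partition that owns the key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_per_node:
        for key, value in pairs:
            partitions[partition_of(key, num_reducers)][key].append(value)
    return partitions


def reduce_task(partition):
    """Combine all values that were shuffled to this reducer for each key."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(node_data, num_reducers=2):
    mapped = [map_task(lines) for lines in node_data]  # one map task per node
    result = {}
    for partition in shuffle(mapped, num_reducers):
        result.update(reduce_task(partition))
    return result
```

The shuffle is the only step in which data crosses node boundaries; map and reduce tasks each work on data already placed with them.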

Big Data Storage Access Patterns

Typically write once, read many times workloads

Metadata lookups, object reads

Large sized blocks/objects ≈ 64 MB to 128 MB (e.g. Hadoop MR)

Small sized accesses e.g. HBase, Cassandra

Objects map to files

[Figure: Objects accessed via Get(), Put(); Files in the Local File System; Local Disks]
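The layering of objects over files in a local file system, accessed through Get() and Put(), can be sketched as below. This is a minimal illustration with assumed details: the class name `LocalObjectStore`, the one-file-per-object layout, and the SHA-256 file naming are hypothetical and do not describe any particular system.

```python
import hashlib
import os
import tempfile


class LocalObjectStore:
    """Hypothetical object layer: each object is one file under a root directory."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Hash the key so that any string is a safe file name (assumed scheme).
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.root, name)

    def put(self, key, data):
        # Write to a temporary file, then rename, so readers never see a partial object.
        tmp = self._path(key) + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, self._path(key))

    def get(self, key):
        with open(self._path(key), "rb") as f:
            return f.read()
```

A short usage example: `store = LocalObjectStore(tempfile.mkdtemp())`, then `store.put("row:1", b"hello")` followed by `store.get("row:1")`.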

Issues with Existing Storage Architecture


DAS: Not so smart!

Distributing data all over the cluster makes data management difficult

Replicated data wastes storage space

Tightly coupled computation and storage make the infrastructure inflexible
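The cost of replication is simple arithmetic. The sketch below assumes whole-copy replication with a factor of 3, which is the HDFS default; the function names are hypothetical and the 1 PB figure is only an example.

```python
def raw_capacity_needed(logical_bytes, replication_factor):
    """Raw disk space consumed when every byte is stored replication_factor times."""
    return logical_bytes * replication_factor


def storage_efficiency(replication_factor):
    """Fraction of raw capacity that holds unique data."""
    return 1 / replication_factor


PETABYTE = 10**15
# Example: 1 PB of data at replication factor 3 occupies 3 PB of raw disk,
# so only one third of the purchased capacity holds unique data.
raw = raw_capacity_needed(1 * PETABYTE, 3)
```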


Networks vs Disks: The blame game

Over the last decade …

Datacenter network speeds have dramatically improved

10 Gb/s Ethernet, optical networks

Flat network topologies

Soon … 40 Gb/s, 100 Gb/s Ethernet will be common

Disks are barely keeping up …

Takeaway: Data locality will no longer be an issue!
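A back-of-the-envelope comparison makes the takeaway concrete. The 10 Gb/s link rate comes from the slide; the 128 MB block size matches the block sizes quoted earlier in the talk; the 100 MB/s sequential disk rate is an assumption introduced here for illustration and is not a figure from the talk. Protocol overhead and contention are ignored.

```python
def transfer_seconds(size_bytes, rate_bytes_per_s):
    """Time to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_s


BLOCK = 128 * 10**6          # one 128 MB block (decimal megabytes)
NET_10GBE = 10 * 10**9 / 8   # 10 Gb/s line rate in bytes/s, overhead ignored
DISK_ASSUMED = 100 * 10**6   # ASSUMPTION: 100 MB/s sequential read, not from the talk

network_time = transfer_seconds(BLOCK, NET_10GBE)   # about 0.10 s
disk_time = transfer_seconds(BLOCK, DISK_ASSUMED)   # about 1.28 s
```

Under these assumptions the disk read takes roughly twelve times longer than the network transfer, so reading a block from a remote disk costs little more than reading it from a local one.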


Changing times, changing values

Value of data is constantly changing

Not all data is equally popular

Recent analysis of large scale datacenters [1]: only 10-30% of data is most popular

Differentiated storage for big data: impossible with DAS, needs sophisticated storage

[Figure: data ranked from least valuable to most valuable (frequency of analysis, time of generation etc.)]

[1] Ananthanarayanan et al., HotOS 2011.
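Differentiated storage driven by popularity can be sketched as a simple ranking. This is an illustrative policy, not one described in the talk: the function name `assign_tiers`, the tier labels, and the default `hot_fraction` of 0.2 (picked from within the 10-30% range above) are assumptions.

```python
def assign_tiers(access_counts, hot_fraction=0.2):
    """Map each object to "fast" or "capacity" by its rank in access count.

    access_counts: {object_id: number of accesses in some window}
    hot_fraction:  assumed share of objects that go to the fast tier
    """
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    hot_n = max(1, round(len(ranked) * hot_fraction)) if ranked else 0
    return {
        obj: ("fast" if rank < hot_n else "capacity")
        for rank, obj in enumerate(ranked)
    }
```

Re-running the assignment as access counts change is what turns this into migration between tiers, which a pure DAS layout has no place to express.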

New applications, new requirements

Traditionally: sequential access, large blocks; task-local data access, batch jobs

Aging data, replication: remote accesses dominate

Real time queries and online jobs: row/record accesses in indexed NoSQL databases, e.g. Accumulo, Hypertable etc.


Revisiting Big Data Storage


Rethinking storage for big data

Shared nothing DAS vs shared storage

Management vs scalability

Storage bandwidth and latency capacities

“Converging” multiple storage silos

[Figure: Datacenter with a Primary Cluster, an Analytics Cluster, and a Storage Management Layer]

Sharing is a virtue!

Shared nothing is extreme: inefficient but scalable

Shared storage resources: spindles, caches, network bandwidth

Scale out storage systems: scale out object/block/file storage systems

[Figure labels: Shared Nothing; Traditional Enterprise; Shared Storage; Big Data Storage]

HA with performance guarantees

Performance guarantees: latency, bandwidth

Data reliability and failure resilience guarantees

Big data archival with relaxed performance numbers: compression/deduplication

[Figure: Storage Manager with Low Perf. and Archival storage]

Storage Federation

Federated storage management

Integrate multiple storage “islands” into an “archipelago”

Varying performance/cost characteristics

Seamless data migration

Dynamic workload characteristics

Cost/value model

[Figure: Storage Manager Software]

Heterogeneous storage clients

Primary workloads

Offline batch processing analytics jobs

Real time online analytics queries

[Figure: Primary workloads, Real time Analytics, and Offline Analytics on a Converged Storage System]

Data Management options

Storage aware big data infrastructure

Storage managing big data blocks: storage tracks blocks, dynamically migrates blocks

Big data application aided storage

[Figure: Analytics and computation; Storage System]

Storage technology trends

Flash: Flashcache, all-flash arrays etc.; interleaved accesses

Non-volatile memory: low latency, persistent tier

Fast SAN: Fibre Channel, 40 Gb/s iSCSI etc.


Low level changes

Revisit block device access semantics: objects – files – blocks interactions

NVM / flash: access protocols, application modifications; shared caches, proportional caching

Better I/O schedulers


Summary and Conclusion

Needed: A change in big data storage perspective

Converged storage solutions

Changing big data application characteristics

Emerging technologies and performance improvements

Overhaul traditional disk access semantics and protocols
