Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division


2013 SNIA Analytics and Big Data Summit. © HP Storage Division. All Rights Reserved.



In this talk …

Big data storage: Current trends

Issues with current storage options

Evolution of storage to support big data applications

Hadoop is not a solution to a “data” problem!


Big Data: The Storage Concerns!

Volume: Petascale / Exascale data
Velocity: Frequency of generation
Variety: Largely unstructured/semi-structured
Value: Frequency of analysis
Computation Model: Parallel tasks, scale-out architecture

How much are you worth to Zuckerberg?


A typical big data ecosystem

Storage Framework (e.g. HDFS, Cassandra)

High Level Language (e.g. Pig Latin, Hive QL)

Structured databases e.g. HBase, Hive etc.

Data Mining and Analytics Applications

Storage (DAS/Networked)


Big Data Storage Model – 1

Centralized metadata node
Datanodes store data in local disks
Clients talk to the metadata node and then to the datanodes
e.g. Hadoop

[Figure: Client, Name Node, and three Data Nodes]
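
The read path described on this slide (one lookup at the metadata node, then direct reads from the datanodes) can be sketched as a toy in-memory model. This is only an illustration: the class and function names (NameNode, DataNode, read_file) and the sample file are hypothetical, and this is not the Hadoop client API.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Holds block contents on its 'local disk' (here just a dict)."""
    name: str
    blocks: dict = field(default_factory=dict)

    def read_block(self, block_id):
        return self.blocks[block_id]


@dataclass
class NameNode:
    """Centralized metadata: file path -> ordered list of (block id, datanode name)."""
    block_map: dict = field(default_factory=dict)

    def locate(self, path):
        return self.block_map[path]


def read_file(path, name_node, data_nodes):
    """Client read: one metadata lookup, then direct reads from the datanodes."""
    data = b""
    for block_id, node_name in name_node.locate(path):
        data += data_nodes[node_name].read_block(block_id)
    return data


# Hypothetical two-block file spread over two datanodes.
nodes = {
    "dn1": DataNode("dn1", {"blk_0": b"hello "}),
    "dn2": DataNode("dn2", {"blk_1": b"world"}),
}
nn = NameNode({"/logs/a.txt": [("blk_0", "dn1"), ("blk_1", "dn2")]})
result = read_file("/logs/a.txt", nn, nodes)
```

Every read starts at the single metadata node, which is what makes that node the central point of this model.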


Big Data Storage Model – 2

No centralized metadata node
Datanodes store data in local disks
Clients routed to the appropriate node based on hash prefix
e.g. Cassandra

[Figure: Client reaching four Data Nodes through hash prefix based routing]
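
Hash prefix based routing can be sketched in a few lines: hash the key, take a prefix of the hash, and map each prefix range to one node. This is a simplified illustration under stated assumptions, not Cassandra's actual partitioner; the node names and the route function are hypothetical, and the one-byte prefix is an arbitrary choice.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names


def route(key, nodes=NODES):
    """Pick a node from the first byte (the 'prefix') of the key's hash."""
    prefix = hashlib.sha256(key.encode("utf-8")).digest()[0]  # value in 0..255
    # Split the 256 possible prefixes into equal contiguous ranges, one per node.
    return nodes[prefix * len(nodes) // 256]


owner = route("user:42")
```

Any client can compute the owner of a key locally, so no central metadata lookup is needed.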


Computation Model

[Figure: four Data + Compute Nodes, each running Tasks (e.g. Map/Reduce), with Shuffle stages between them]
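
The tasks-plus-shuffle model in the figure can be illustrated with a word count run in a single process. This is a minimal sketch: the four input lists stand in for the data local to each node, and the function names and the partitioning rule are assumptions made for the example.

```python
from collections import defaultdict


def map_task(lines):
    """Map: emit (word, 1) for every word in this node's local data."""
    return [(word, 1) for line in lines for word in line.split()]


def shuffle(mapped_outputs, num_reducers):
    """Shuffle: send each key to one reducer partition, grouping values by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            # Summing code points keeps the partitioning deterministic across runs.
            partitions[sum(map(ord, key)) % num_reducers][key].append(value)
    return partitions


def reduce_task(partition):
    """Reduce: sum the counts for every key in one partition."""
    return {key: sum(values) for key, values in partition.items()}


# One list of lines per (hypothetical) data + compute node.
node_data = [["big data storage"], ["big data"], ["storage systems"], ["big"]]
mapped = [map_task(lines) for lines in node_data]
counts = {}
for partition in shuffle(mapped, num_reducers=2):
    counts.update(reduce_task(partition))
```

Map tasks touch only node-local data; the shuffle is the step that moves data between nodes.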


Big Data Storage Access Patterns

Typically write once, read many times workloads
  Metadata lookups, object reads
Large sized blocks/objects ≈ 64 MB to 128 MB (e.g. Hadoop-MR)
Small sized accesses, e.g. HBase, Cassandra
Objects, Files

[Figure: Objects, Get()/Put(), Files, Local File System, Local Disks]
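
The Get()/Put() layering over files on a local file system can be sketched as follows. This is an illustration only: LocalObjectStore is a hypothetical class, and storing one object per file named by a hash of the key is an assumption of the sketch, not something the slide specifies.

```python
import hashlib
import os
import tempfile


class LocalObjectStore:
    """Stores each object as one file under a root directory."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Hash the key so that any string becomes a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.root, digest)

    def put(self, key, data):
        with open(self._path(key), "wb") as f:
            f.write(data)

    def get(self, key):
        with open(self._path(key), "rb") as f:
            return f.read()


# Write once, read many times.
store = LocalObjectStore(tempfile.mkdtemp())
store.put("logs/part-0", b"write once")
first_read = store.get("logs/part-0")
second_read = store.get("logs/part-0")
```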


Issues with Existing Storage Architecture


DAS: Not so smart!

Distributing data all over the cluster makes data management difficult
Replicated data: wastage of storage space
Tightly coupled computation and storage: inflexible infrastructure
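
The space cost of replication is simple arithmetic. A minimal sketch, assuming full-copy replication with a factor of 3 (the HDFS default; the slide does not name a factor). The function names and the 100 TB figure are illustrative.

```python
def raw_capacity_tb(logical_tb, replication_factor=3):
    """Raw disk needed to hold logical_tb of data when every block is fully copied."""
    return logical_tb * replication_factor


def storage_efficiency(replication_factor=3):
    """Fraction of raw capacity that holds unique data."""
    return 1 / replication_factor


raw = raw_capacity_tb(100)          # 100 TB of data under the assumed factor of 3
efficiency = storage_efficiency()
```

Under this assumption, 100 TB of data occupies 300 TB of disk, so about one third of the raw capacity holds unique data.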


Networks vs Disks: The blame game

Over the last decade …
  Datacenter network speeds have dramatically improved
    10 Gb/s Ethernet, optical networks
    Flat network topologies
Soon: 40 Gb/s and 100 Gb/s Ethernet will be common
Disks are barely keeping up …
Takeaway: Data locality will no longer be an issue!
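
A back-of-the-envelope comparison makes the takeaway concrete. This is a sketch under stated assumptions: the 100 MB/s sequential disk throughput is an assumed round number that does not come from the talk, and the network times ignore protocol overhead and contention.

```python
BLOCK_MB = 128              # upper end of the block sizes quoted earlier in the talk
DISK_MB_PER_S = 100         # ASSUMED sequential disk throughput, not from the talk
LINK_GBITS = [10, 40, 100]  # Ethernet speeds named on the slide


def transfer_seconds(size_mb, gbit_per_s):
    """Ideal time to move size_mb megabytes over a link, ignoring overhead."""
    mb_per_s = gbit_per_s * 1000 / 8  # 1 Gb/s = 125 MB/s in decimal units
    return size_mb / mb_per_s


disk_seconds = BLOCK_MB / DISK_MB_PER_S
network_seconds = {g: transfer_seconds(BLOCK_MB, g) for g in LINK_GBITS}
```

With these numbers, reading the block from disk takes 1.28 s, while moving it over a 10 Gb/s link takes about 0.10 s, so the disk, not the network, bounds a remote read.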


Changing times, changing values

Value of data is constantly changing
  Not all data is equally popular
Recent analysis of large scale datacenters [1]
  Only 10-30% of data is most popular
Differentiated storage for big data
  Impossible with DAS
  Needs sophisticated storage

[Figure: spectrum from least valuable data to most valuable data (frequency of analysis, time of generation etc.)]

[1] Ananthanarayanan et al., HotOS 2011.
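
Differentiated storage needs some way to tell popular data from the rest. One possible sketch ranks blocks by access count and places the top fraction in a fast tier. The function name, the sample counts, and the 20% cut-off (chosen from within the 10-30% range quoted above) are assumptions for illustration.

```python
def split_by_popularity(access_counts, hot_fraction=0.2):
    """Return (hot, cold) sets of block ids; hot holds the most-accessed fraction."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    hot_size = max(1, round(len(ranked) * hot_fraction))
    return set(ranked[:hot_size]), set(ranked[hot_size:])


# Hypothetical access counts for ten blocks.
samples = [900, 5, 3, 700, 1, 0, 2, 4, 1, 6]
access_counts = {f"blk_{i}": c for i, c in enumerate(samples)}
hot, cold = split_by_popularity(access_counts, hot_fraction=0.2)
```

Because popularity changes over time, a real system would recompute this split periodically and migrate blocks between tiers.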


New applications, new requirements

Traditionally
  Sequential access, large blocks
  Task-local data access, batch jobs
Aging data, replication
  Remote accesses dominate
Real time queries and online jobs
  Row/record accesses in indexed NoSQL databases, e.g. Accumulo, Hypertable etc.


Revisiting Big Data Storage


Rethinking storage for big data

Shared nothing DAS vs shared storage
  Management vs scalability
  Storage bandwidth and latency capacities
“Converging” multiple storage silos

[Figure: Datacenter with a Primary Cluster, an Analytics Cluster, and a Storage Management Layer]


Sharing is a virtue!

Shared nothing is extreme, inefficient but scalable
Shared storage resources
  Spindles, caches, network bandwidth
Scale out storage systems
  Scale out object/block/file storage systems

[Figure labels: Shared Nothing, Traditional Enterprise, Shared Storage, Big Data Storage]


HA with performance guarantees

Performance guarantees
  Latency, BW
Data reliability and failure resilience guarantees
Big data archival with relaxed performance numbers
  Compression/deduplication

[Figure: Storage Manager with tiers labeled Low Perf. and Archival]


Storage Federation

Federated storage management
  Integrate multiple storage “islands” into an “archipelago”
  Varying performance/cost characteristics
Seamless data migration
  Dynamic workload characteristics
  Cost/value model

[Figure: Storage Manager Software]


Heterogeneous storage clients

Primary workloads
Offline batch processing analytics jobs
Real time online analytics queries

[Figure: Primary workloads, Real time Analytics, and Offline Analytics on a Converged Storage System]


Data Management options

Storage aware big data infrastructure

Storage managing big data blocks
  Storage tracks blocks
  Dynamically migrates blocks
Big data application aided storage

[Figure: Analytics and computation, Storage System]


Storage technology trends

Flash: Flashcache, all flash arrays etc.
  Interleaved accesses
Non-volatile Memory
  Low latency, persistent tier
Fast SAN
  Fibre Channel, 40 Gb/s iSCSI etc.


Low level changes

Revisit block device access semantics
  Objects – files – blocks interactions
NVM / flash
  Access protocols, application modifications
  Shared caches, proportional caching
Better I/O schedulers
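
The slide does not define proportional caching. One possible reading is that a shared cache is divided among clients in proportion to a weight assigned to each. The sketch below illustrates that reading only; the function name, the client names, and the integer weights are all assumptions.

```python
def proportional_shares(cache_blocks, weights):
    """Split cache_blocks among clients in proportion to their integer weights."""
    total = sum(weights.values())
    shares = {c: cache_blocks * w // total for c, w in weights.items()}
    # Rounding down can leave a few blocks over; give them to the heaviest clients.
    leftover = cache_blocks - sum(shares.values())
    for client in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        shares[client] += 1
    return shares


# Hypothetical weights for the three workload classes sharing one cache.
shares = proportional_shares(1000, {"primary": 5, "realtime": 3, "batch": 2})
```

The weights could come from a cost/value model, so that more valuable workloads receive a larger slice of the shared cache.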


Summary and Conclusion

Needed: A change in big data storage perspective
  Converged storage solutions
  Changing big data application characteristics
  Emerging technologies and performance improvements
  Overhaul traditional disk access semantics and protocols