TRANSCRIPT
GPFS-SNC: A Cluster File System for Analytics and Clouds
Prasenjit Sarkar, IBM Research
Analytics and Clouds: Key growth area
GPFS: Parallel File System
Extensions for Shared Nothing Architectures: Locality, Write Affinity, Metablocks, Pipelined Replication, Distributed Recovery
GPFS Architecture:
Cluster: thousands of nodes, fast reliable communication, common admin domain.
Shared disk: all data and metadata on disk accessible from any node, coordinated by distributed lock service.
Parallel: data and metadata flow to/from all nodes from/to all disks in parallel; files striped across all disks.
[Diagram: nodes with local disks joined into a GPFS common file system namespace]
GPFS-SNC: Motivation
Motivation: SANs are built for latency, not bandwidth.
Advantages of using GPFS: high scale (thousands of nodes, petabytes of storage), high performance, high availability, data integrity, POSIX semantics, workload isolation, enterprise features (security, snapshots, backup/restore, archive, asynchronous caching and replication).
Challenges: adapt GPFS to shared-nothing clusters; maximize application performance relative to cost; failure is common and the network is a bottleneck.
[Diagram: traditional architecture with a SAN I/O bottleneck vs. a shared-nothing environment with scale-out performance: 1 server -> 1 GB/s, 100 servers -> 100 GB/s]
[Chart: sequential write/read, random read/write, backward read, record rewrite, and stride read throughput, gpfs-san vs. gpfs-snc]
GPFS-SNC: Architecture
High scale: 1000 nodes, PBs of data
Replicated metadata: no single point of failure
Multiple block sizes
Smart caching
Data layout control
Multi-tiered storage with automated policy-based migration
[Diagram: InfoSphere BigInsights accessing GPFS-SNC through POSIX, MapReduce, and disaster-recovery interfaces; nodes 1..N each hold data, replicated metadata, and a cache on local or rack-attached disks, with solid-state disks and a SAN as additional tiers]
Locality Awareness
The MapReduce scheduler assigns tasks to the nodes where data chunks reside, so it needs to know the location of each chunk.
A GPFS API maps each chunk to its node location; the map is compacted to optimize its size.
GPFS-Hadoop Connector: adapts Hadoop to transparently use the GPFS APIs, with no changes to Hadoop.
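As a rough illustration of how a locality-aware scheduler consumes this information, the sketch below uses the standard Hadoop FileSystem API (which the connector plugs GPFS in behind) to turn per-chunk block locations into a compact host map; the class name and file path are illustrative, not the actual connector code.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalityMapExample {
    public static void main(String[] args) throws IOException {
        // The connector sits behind the standard FileSystem API, so the
        // scheduler-side code below is unchanged Hadoop usage.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/bigdata/input/part-00000");   // illustrative path
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation maps a chunk (offset, length) to the hosts holding it.
        BlockLocation[] chunks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Build a compact host -> resident-bytes map the scheduler can use
        // to run F(D1) on the node that already stores D1.
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (BlockLocation chunk : chunks) {
            for (String host : chunk.getHosts()) {
                bytesPerHost.merge(host, chunk.getLength(), Long::sum);
            }
        }
        bytesPerHost.forEach((host, bytes) ->
                System.out.println(host + " holds " + bytes + " bytes of " + file));
    }
}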
[Diagram: the task engine on N1 exposes a location map (N1:D1); the scheduler reads the location map and schedules F(D1) on N1; MapReduce reaches GPFS through the connector for locality]
[Chart: Grep runtime (min) with vs. without locality]
Metablocks
MapReduce workloads have two types of access: small blocks (e.g., 512 KB) for index-file lookups and large blocks (e.g., 64 MB) for file scans.
Solution: multiple block sizes in the same file system, via a new allocation scheme.
A metablock factor is defined per block size; effective block size = block size * metablock factor. File blocks are laid out based on the effective block size, so an application "meta-block" spans multiple file-system blocks.
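A small worked example of the effective-block-size rule: each (metablock factor, block size) pair from the Grep experiment below works out to the same 128 MB effective block size, which is why the configurations are directly comparable.

public class EffectiveBlockSize {
    // Effective block size = file-system block size * metablock factor.
    static long effective(long blockSizeBytes, int metablockFactor) {
        return blockSizeBytes * metablockFactor;
    }

    public static void main(String[] args) {
        long KB = 1024L, MB = 1024L * KB;
        // (metablock factor, block size) pairs from the chart.
        int[]  factors = {512, 128, 32, 8, 2};
        long[] blocks  = {256 * KB, 1 * MB, 4 * MB, 16 * MB, 64 * MB};
        for (int i = 0; i < factors.length; i++) {
            System.out.printf("factor=%d, block=%d KB -> effective=%d MB%n",
                    factors[i], blocks[i] / KB, effective(blocks[i], factors[i]) / MB);
        }
        // All five combinations yield a 128 MB effective block size.
    }
}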
[Chart: Grep runtime (min) by (metablock factor, block size): (512, 256 KB), (128, 1 MB), (32, 4 MB), (8, 16 MB), (2, 64 MB)]
Optimum block size: caching and pre-fetching effects at larger block sizes; effective block size of 128 MB.
Write Affinity
Wide striping with multiple replicas gives no locality of data on a particular node, which is bad for commodity architectures due to excessive network traffic.
Write affinity places a set of files (or file-set) on a given node: all relevant files are allocated on that node.
Configurability of replicas: each replica can be configured for either wide striping or write affinity.
Topological support (DataCenter.Cluster.Rack.Node) allows flexibility with networks and failover.
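A minimal sketch of per-replica placement along these lines, assuming hypothetical topology strings of the form DataCenter.Cluster.Rack.Node: the first replica honors write affinity (the writing node) and the remaining replicas are wide-striped across the other nodes. This is illustrative, not GPFS's actual allocation code.

import java.util.ArrayList;
import java.util.List;

public class ReplicaPlacementSketch {
    // Pick target nodes for one block: replica 0 goes to the local node
    // (write affinity); remaining replicas are spread round-robin over the
    // other nodes (wide striping), keyed by block index.
    static List<String> placeBlock(String localNode, List<String> allNodes,
                                   int replicas, long blockIndex) {
        List<String> targets = new ArrayList<>();
        targets.add(localNode);                       // write-affinity replica
        List<String> others = new ArrayList<>(allNodes);
        others.remove(localNode);
        for (int r = 1; r < replicas && !others.isEmpty(); r++) {
            targets.add(others.get((int) ((blockIndex + r) % others.size())));
        }
        return targets;
    }

    public static void main(String[] args) {
        // Hypothetical topology names: DataCenter.Cluster.Rack.Node
        List<String> nodes = List.of("dc1.c1.r1.n1", "dc1.c1.r1.n2",
                                     "dc1.c1.r2.n3", "dc1.c1.r2.n4");
        for (long b = 0; b < 3; b++) {
            System.out.println("block " + b + " -> "
                    + placeBlock("dc1.c1.r1.n1", nodes, 3, b));
        }
    }
}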
[Diagram: three layouts for reading File 1. Wide-striped layout: read over local disk + network. Write-affinity layout: read over only the local disk. Write affinity plus wide striping with replication: read over only the local disk, and recover in parallel from the wide-striped replica]
[Chart: TeraGen runtime (min), write affinity vs. wide striping]
Pipelined Replication
[Diagram: write pipeline; the client sends block B0 to node N1, which forwards replicas B1 and B2 down the chain to N2 and N3, each node writing to its local disk]
[Chart: TeraGen runtime (min), single-source vs. pipelined replication]
A higher degree of replication benefits from pipelining for lower latency.
Single-source replication does not utilize the bandwidth effectively.
Pipelined replication is able to saturate a 1 Gbps Ethernet link.
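A rough sketch of the difference for one block: in the single-source case the client's link carries the block once per replica, whereas in the pipelined case each node writes its copy and forwards the block to the next node in the chain, so every link carries the block only once. Node and method names here are hypothetical.

import java.util.List;

public class ReplicationPipelineSketch {

    interface Node {
        void writeLocal(byte[] block);            // write to local disk
        void send(Node target, byte[] block);     // ship block over the network
    }

    // Single-source: the client pushes the block to every replica itself,
    // so its outbound link carries the block 'replicas.size()' times.
    static void singleSource(Node client, List<Node> replicas, byte[] block) {
        for (Node n : replicas) {
            client.send(n, block);
            n.writeLocal(block);
        }
    }

    // Pipelined: the client sends to the first replica only; each replica
    // writes locally and forwards to the next one, overlapping disk and
    // network work and saturating each 1 Gbps link just once.
    static void pipelined(Node client, List<Node> chain, byte[] block) {
        Node prev = client;
        for (Node n : chain) {
            prev.send(n, block);
            n.writeLocal(block);
            prev = n;
        }
    }
}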
Fast Recovery
Handling failures is policy driven: node-failure and disk-failure triggers are incorporated.
Policy-based recovery: restripe on a node or disk failure, or alternatively rebuild the disk when it is replaced.
Fast recovery through incremental recovery: keep track of changes during the failure and recover only what is needed.
Distributed restripe: the restripe load is spread out over all the nodes, which quickly figure out which blocks need to be recovered when a node fails.
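A minimal sketch of the incremental-recovery bookkeeping described above, under the assumption that surviving replicas simply record which blocks changed while a node was down and copy only those deltas back when it returns; the class and method names are illustrative.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IncrementalRecoverySketch {
    // For each failed node, the set of block IDs written while it was down.
    private final Map<String, Set<Long>> missedBlocks = new ConcurrentHashMap<>();

    // Called when a write completes but one replica node is unavailable.
    void recordMissedWrite(String downNode, long blockId) {
        missedBlocks.computeIfAbsent(downNode, n -> ConcurrentHashMap.newKeySet())
                    .add(blockId);
    }

    // Called when the node comes back: copy only the recorded deltas,
    // pulling from the surviving replicas in parallel.
    void catchUp(String recoveredNode) {
        Set<Long> deltas = missedBlocks.remove(recoveredNode);
        if (deltas == null) return;                 // nothing was missed
        deltas.parallelStream().forEach(blockId ->
                copyBlockFromSurvivingReplica(blockId, recoveredNode));
    }

    private void copyBlockFromSurvivingReplica(long blockId, String target) {
        // Placeholder for the actual replica-to-replica copy.
        System.out.println("copy block " + blockId + " -> " + target);
    }
}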
[Diagram: a well-replicated file on N1, N2, N3. When N2 reboots, new writes go to N1 and N3; when N2 comes back, it catches up by copying the missed deltas from N1 and N3 in parallel]
[Chart: Grep runtime (min) with 0, 1, and 2 failed nodes]
MapReduce
[Chart: execution time for Postmark, TeraGen, Grep, and a cache test, HDFS vs. GPFS-SNC]
Query languages like Pig and JAQL need good random I/O performance; sort requires better sequential throughput. GPFS-SNC is twice HDFS for both of the above.
For document index lookups, client-side caching is a big win: 17x throughput speedup.
[Diagram: with HDFS, Hadoop indexing runs on HDFS and the database upload runs on ext3, with a copy/fetch into the web service layer: extra copy overhead and network fetch, and separate clusters for analytics and the database. With GPFS, Hadoop indexing and the database upload share a single cluster: no copying required, and caching serves the web service layer]
Workload isolation, proven data integrity, replicated metadata services:
Yahoo keeps 3 copies of 3 versions of HDFS because of unknown data integrity [1].
Quantcast deletes files once HDFS is 50% full [2].
[1] Care and Feeding of Hadoop Clusters, Marc Nicosia, USENIX 2009.
[2] The Kosmos Distributed File System, Sriram Rao, Quantcast Inc.
In-memory Analytic Engines
Why do we need a file system? Three reasons:
Permanent data store: the point of ingestion for data; a continual process with high bandwidth requirements.
Logging: in-memory transactions are logged to disk or flash, a high-IOPS requirement.
Failure recovery: data updates must be replicated for recovery, with efficient use of network bandwidth and workload isolation.
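A small sketch of the logging path mentioned above: in-memory updates are appended to a log file and forced to stable storage (disk or flash), which is where the high-IOPS requirement comes from. The file name, record format, and class are illustrative only.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransactionLogSketch implements AutoCloseable {
    private final FileChannel log;

    TransactionLogSketch(Path logFile) throws IOException {
        // Append-only log on the cluster file system (disk or flash tier).
        log = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // Each in-memory transaction is durable only after force() returns;
    // one force per commit is what drives the IOPS requirement.
    void commit(String record) throws IOException {
        log.write(ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8)));
        log.force(false);   // flush data (not metadata) to stable storage
    }

    @Override
    public void close() throws IOException {
        log.close();
    }
}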
[Diagram: nodes 1..N, each an SMP with 16-64 cores and direct/interconnect I/O channels, joined by a 10 Gb Ethernet node interconnect]
Scalable Storage for Data Warehousing
[Chart: execution time (s) of a 1-node TPC-H A2-2 query on ext3, GPFS-SNC, and GPFS]
Goal: provide scale-out storage capability for warehousing.
Motivation: exploit file system features to accelerate the database layer.
Performance: linearly scales out with the number of nodes; replication can saturate InfiniBand and 10 GbE networks.
Benefits: client-side caching, disaster recovery, tiering, MapReduce techniques.
Cost: lower than a SAN configuration.
[Diagram: BCUs 1..N, each an SMP with 4-16 cores and direct/interconnect I/O channels, scaled out over an Ethernet/InfiniBand node interconnect]
[Chart: sequential write/read, random read/write, backward read, record rewrite, and stride read throughput, gpfs-san vs. gpfs-snc]
Storage Clouds: Wide Variety
[Table: storage cloud classes (production compute, general-purpose NAS extension, video/docs, archive index, archive content) compared on latency/IOPS (<10 ms and high IOPS down to >1 s and low IOPS), I/O characteristics (random vs. sequential), protocols (block, file, web), logical protection (RPO=0, periodic backup, versioning of objects, no DR), physical protection (RPO=hours, none), encryption/secure erase (yes/no), guaranteed immutability (yes/no), and cost tolerance (low, lower, lowest)]
Tier-1 Storage for Compute Cloud
Based on a generic compute cloud.
Data is ingested into the compute cloud onto a GPFS-SNC cluster for staging, referred to as the "repository".
A parallel copy moves the data into the analytics GPFS-SNC cluster for processing.
Analytics works on the data in parallel using compute cloud resources.
The results of the analytics are put back into the staging area for customer verification.
[Diagram: the customer feeds the compute cloud; the repository (data and results) is an 8-VM, 4 TB GPFS-SNC cluster, and the analytics cloud is another 8-VM, 4 TB GPFS-SNC cluster]
Image-deployment with GPFS-SNC
Advantages of GPFS-SNC: no distinct storage and compute nodes, so better cloud utilization.
Automated replication for failure recovery.
Automatic scalability and intelligent VM placement.
Instance deployment time reduced from ~6 min to 8 sec with GPFS-SNC and semantic caching.
[Diagram: current architecture: IDE nodes accessing shared storage over iSCSI; proposed architecture: IDE+GPFS on every node. IDE = Image Deployment Engine]
Scaling VMware with local disks
[Diagram: multiple VMware ESXi hosts, each running virtual machines (OS + app) on local HDD/SSD pooled by GPFS-SNC; POSIX, MapReduce, and disaster-recovery interfaces serve InfoSphere Streams and InfoSphere BigInsights; multi-tiered storage with automated policy-based migration spans the local disks and SAN storage]
FOCUS: Fixed content archive cloud
[Diagram: GPFS clusters H1/H2, A1/A2, and Z1 in data centers DC-H, DC-A, and DC-Z; each node runs WSS / IA middleware on GPFS-SNC with a local catalog; a global view service tracks which clusters hold each object; a WSS client issues puts and gets, and objects are duplicated across clusters]
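A rough sketch of the lookup path implied by the diagram: a get first checks the local catalog and, on a miss, asks the global view service which clusters hold the object before fetching it from one of them. All class and method names here are hypothetical, not the FOCUS API.

import java.util.List;
import java.util.Optional;

public class ArchiveCloudLookupSketch {

    interface LocalCatalog {                 // per-cluster catalog on GPFS-SNC
        Optional<byte[]> lookup(String objectId);
    }

    interface GlobalViewService {            // maps object -> owning clusters
        List<String> clustersHolding(String objectId);
    }

    interface RemoteCluster {                // WSS / IA middleware endpoint
        byte[] fetch(String clusterId, String objectId);
    }

    static byte[] get(String objectId, LocalCatalog catalog,
                      GlobalViewService globalView, RemoteCluster remote) {
        // Fast path: the object is already in this cluster's catalog.
        Optional<byte[]> local = catalog.lookup(objectId);
        if (local.isPresent()) {
            return local.get();
        }
        // Slow path: ask the global view service, then pull from any
        // cluster that holds a (possibly duplicated) copy.
        for (String cluster : globalView.clustersHolding(objectId)) {
            byte[] data = remote.fetch(cluster, objectId);
            if (data != null) {
                return data;
            }
        }
        throw new IllegalStateException("object not found: " + objectId);
    }
}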
Futures
Scale and performance: 10,000+ nodes.
Recovery: scale with the number of nodes.
Dynamic caching and tiering: data where it's needed.
Space efficiency: coding, compression, and de-duplication.
Advanced workload isolation.