TRANSCRIPT
GPFS-SNC: A Cluster File System for Analytics and Clouds
Prasenjit Sarkar, IBM Research
Analytics and Clouds: Key growth area
GPFS: Parallel File System
Extensions for Shared Nothing Architectures: Locality, Write Affinity, Metablocks, Pipelined Replication, Distributed Recovery
GPFS Architecture:
Cluster: thousands of nodes, fast reliable communication, common admin domain.
Shared disk: all data and metadata on disk accessible from any node, coordinated by distributed lock service.
Parallel: data and metadata flow to/from all nodes from/to all disks in parallel; files striped across all disks.
[Diagram: nodes with local disks joined into a GPFS common file system namespace]
GPFS-SNC: Motivation
Motivation: SANs are built for latency, not bandwidth.
Advantages of using GPFS: high scale (thousands of nodes, petabytes of storage), high performance, high availability, data integrity, POSIX semantics, workload isolation, enterprise features (security, snapshots, backup/restore, archive, asynchronous caching and replication).
Challenges: adapt GPFS to shared-nothing clusters; maximize application performance relative to cost; failure is common and the network is a bottleneck.
[Diagram: traditional architecture with a SAN I/O bottleneck vs. a shared-nothing environment with scale-out performance: 1 server -> 1 GB/s, 100 servers -> 100 GB/s]
[Chart: sequential write/read, random read/write, backward read, record rewrite, and stride read throughput, gpfs-san vs. gpfs-snc]
GPFS-SNC: Architecture
High scale: 1000 nodes, PBs of data
Replicated metadata: no single point of failure
Multiple block sizes
Smart caching
Data layout control
Multi-tiered storage with automated policy-based migration
[Diagram: InfoSphere BigInsights accessing GPFS-SNC through POSIX, MapReduce, and disaster-recovery interfaces; nodes 1..N each hold data, replicated metadata, and a cache on local or rack-attached disks, with solid-state disks and a SAN as additional tiers]
Locality Awareness
The MapReduce scheduler assigns tasks to the nodes where data chunks reside, so it needs to know the location of each chunk.
A GPFS API maps each chunk to its node location; the map is compacted to optimize its size.
GPFS-Hadoop Connector: adapts Hadoop to transparently use the GPFS APIs, with no changes to Hadoop.
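As a rough illustration of how a locality-aware scheduler consumes this information, the sketch below uses the standard Hadoop FileSystem API (which the connector plugs GPFS in behind) to turn per-chunk block locations into a compact host map; the class name and file path are illustrative, not the actual connector code.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalityMapExample {
    public static void main(String[] args) throws IOException {
        // The connector sits behind the standard FileSystem API, so the
        // scheduler-side code below is unchanged Hadoop usage.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/bigdata/input/part-00000");   // illustrative path
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation maps a chunk (offset, length) to the hosts holding it.
        BlockLocation[] chunks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Build a compact host -> resident-bytes map the scheduler can use
        // to run F(D1) on the node that already stores D1.
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (BlockLocation chunk : chunks) {
            for (String host : chunk.getHosts()) {
                bytesPerHost.merge(host, chunk.getLength(), Long::sum);
            }
        }
        bytesPerHost.forEach((host, bytes) ->
                System.out.println(host + " holds " + bytes + " bytes of " + file));
    }
}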
[Diagram: the task engine on N1 exposes a location map (N1:D1); the scheduler reads the location map and schedules F(D1) on N1; MapReduce reaches GPFS through the connector for locality]
[Chart: Grep runtime (min) with vs. without locality]
Metablocks
MapReduce workloads have two types of access: small blocks (e.g., 512 KB) for index-file lookups and large blocks (e.g., 64 MB) for file scans.
Solution: multiple block sizes in the same file system, via a new allocation scheme.
A metablock factor is defined per block size; effective block size = block size * metablock factor. File blocks are laid out based on the effective block size, so an application "meta-block" spans multiple file-system blocks.
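A small worked example of the effective-block-size rule: each (metablock factor, block size) pair from the Grep experiment below works out to the same 128 MB effective block size, which is why the configurations are directly comparable.

public class EffectiveBlockSize {
    // Effective block size = file-system block size * metablock factor.
    static long effective(long blockSizeBytes, int metablockFactor) {
        return blockSizeBytes * metablockFactor;
    }

    public static void main(String[] args) {
        long KB = 1024L, MB = 1024L * KB;
        // (metablock factor, block size) pairs from the chart.
        int[]  factors = {512, 128, 32, 8, 2};
        long[] blocks  = {256 * KB, 1 * MB, 4 * MB, 16 * MB, 64 * MB};
        for (int i = 0; i < factors.length; i++) {
            System.out.printf("factor=%d, block=%d KB -> effective=%d MB%n",
                    factors[i], blocks[i] / KB, effective(blocks[i], factors[i]) / MB);
        }
        // All five combinations yield a 128 MB effective block size.
    }
}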
[Chart: Grep runtime (min) by (metablock factor, block size): (512, 256 KB), (128, 1 MB), (32, 4 MB), (8, 16 MB), (2, 64 MB)]
Optimum block size: caching and pre-fetching effects at larger block sizes; effective block size of 128 MB.
Write Affinity
Wide striping with multiple replicas gives no locality of data on a particular node, which is bad for commodity architectures due to excessive network traffic.
Write affinity places a set of files (or file-set) on a given node: all relevant files are allocated on that node.
Configurability of replicas: each replica can be configured for either wide striping or write affinity.
Topological support (DataCenter.Cluster.Rack.Node) allows flexibility with networks and failover.
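A minimal sketch of per-replica placement along these lines, assuming hypothetical topology strings of the form DataCenter.Cluster.Rack.Node: the first replica honors write affinity (the writing node) and the remaining replicas are wide-striped across the other nodes. This is illustrative, not GPFS's actual allocation code.

import java.util.ArrayList;
import java.util.List;

public class ReplicaPlacementSketch {
    // Pick target nodes for one block: replica 0 goes to the local node
    // (write affinity); remaining replicas are spread round-robin over the
    // other nodes (wide striping), keyed by block index.
    static List<String> placeBlock(String localNode, List<String> allNodes,
                                   int replicas, long blockIndex) {
        List<String> targets = new ArrayList<>();
        targets.add(localNode);                       // write-affinity replica
        List<String> others = new ArrayList<>(allNodes);
        others.remove(localNode);
        for (int r = 1; r < replicas && !others.isEmpty(); r++) {
            targets.add(others.get((int) ((blockIndex + r) % others.size())));
        }
        return targets;
    }

    public static void main(String[] args) {
        // Hypothetical topology names: DataCenter.Cluster.Rack.Node
        List<String> nodes = List.of("dc1.c1.r1.n1", "dc1.c1.r1.n2",
                                     "dc1.c1.r2.n3", "dc1.c1.r2.n4");
        for (long b = 0; b < 3; b++) {
            System.out.println("block " + b + " -> "
                    + placeBlock("dc1.c1.r1.n1", nodes, 3, b));
        }
    }
}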
[Diagram: three layouts for reading File 1. Wide-striped layout: read over local disk + network. Write-affinity layout: read over only the local disk. Write affinity plus wide striping with replication: read over only the local disk, and recover in parallel from the wide-striped replica]
[Chart: TeraGen runtime (min), write affinity vs. wide striping]
Pipelined Replication
[Diagram: write pipeline; the client sends block B0 to node N1, which forwards replicas B1 and B2 down the chain to N2 and N3, each node writing to its local disk]
[Chart: TeraGen runtime (min), single-source vs. pipelined replication]
A higher degree of replication benefits from pipelining for lower latency.
Single-source replication does not utilize the bandwidth effectively.
Pipelined replication is able to saturate a 1 Gbps Ethernet link.
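A rough sketch of the difference for one block: in the single-source case the client's link carries the block once per replica, whereas in the pipelined case each node writes its copy and forwards the block to the next node in the chain, so every link carries the block only once. Node and method names here are hypothetical.

import java.util.List;

public class ReplicationPipelineSketch {

    interface Node {
        void writeLocal(byte[] block);            // write to local disk
        void send(Node target, byte[] block);     // ship block over the network
    }

    // Single-source: the client pushes the block to every replica itself,
    // so its outbound link carries the block 'replicas.size()' times.
    static void singleSource(Node client, List<Node> replicas, byte[] block) {
        for (Node n : replicas) {
            client.send(n, block);
            n.writeLocal(block);
        }
    }

    // Pipelined: the client sends to the first replica only; each replica
    // writes locally and forwards to the next one, overlapping disk and
    // network work and saturating each 1 Gbps link just once.
    static void pipelined(Node client, List<Node> chain, byte[] block) {
        Node prev = client;
        for (Node n : chain) {
            prev.send(n, block);
            n.writeLocal(block);
            prev = n;
        }
    }
}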
Fast Recovery
Handling failures is policy driven: node-failure and disk-failure triggers are incorporated.
Policy-based recovery: restripe on a node or disk failure, or alternatively rebuild the disk when it is replaced.
Fast recovery through incremental recovery: keep track of changes during the failure and recover only what is needed.
Distributed restripe: the restripe load is spread out over all the nodes, which quickly figure out which blocks need to be recovered when a node fails.
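A minimal sketch of the incremental-recovery bookkeeping described above, under the assumption that surviving replicas simply record which blocks changed while a node was down and copy only those deltas back when it returns; the class and method names are illustrative.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IncrementalRecoverySketch {
    // For each failed node, the set of block IDs written while it was down.
    private final Map<String, Set<Long>> missedBlocks = new ConcurrentHashMap<>();

    // Called when a write completes but one replica node is unavailable.
    void recordMissedWrite(String downNode, long blockId) {
        missedBlocks.computeIfAbsent(downNode, n -> ConcurrentHashMap.newKeySet())
                    .add(blockId);
    }

    // Called when the node comes back: copy only the recorded deltas,
    // pulling from the surviving replicas in parallel.
    void catchUp(String recoveredNode) {
        Set<Long> deltas = missedBlocks.remove(recoveredNode);
        if (deltas == null) return;                 // nothing was missed
        deltas.parallelStream().forEach(blockId ->
                copyBlockFromSurvivingReplica(blockId, recoveredNode));
    }

    private void copyBlockFromSurvivingReplica(long blockId, String target) {
        // Placeholder for the actual replica-to-replica copy.
        System.out.println("copy block " + blockId + " -> " + target);
    }
}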
[Diagram: a well-replicated file on N1, N2, N3. When N2 reboots, new writes go to N1 and N3; when N2 comes back, it catches up by copying the missed deltas from N1 and N3 in parallel]
[Chart: Grep runtime (min) with 0, 1, and 2 failed nodes]
MapReduce
[Chart: execution time for Postmark, TeraGen, Grep, and a cache test, HDFS vs. GPFS-SNC]
Query languages like Pig and JAQL need good random I/O performance; sort requires better sequential throughput. GPFS-SNC is twice HDFS for both of the above.
For document index lookups, client-side caching is a big win: 17x throughput speedup.
[Diagram: with HDFS, Hadoop indexing runs on HDFS and the database upload runs on ext3, with a copy/fetch into the web service layer: extra copy overhead and network fetch, and separate clusters for analytics and the database. With GPFS, Hadoop indexing and the database upload share a single cluster: no copying required, and caching serves the web service layer]
Workload isolation, proven data integrity, replicated metadata services:
Yahoo keeps 3 copies of 3 versions of HDFS because of unknown data integrity [1].
Quantcast deletes files once HDFS is 50% full [2].
[1] Care and Feeding of Hadoop Clusters, Marc Nicosia, USENIX 2009.
[2] The Kosmos Distributed File System, Sriram Rao, Quantcast Inc.
In-memory Analytic Engines
Why do we need a file system? Three reasons:
Permanent data store: the point of ingestion for data; a continual process with high bandwidth requirements.
Logging: in-memory transactions are logged to disk or flash, a high-IOPS requirement.
Failure recovery: data updates must be replicated for recovery, with efficient use of network bandwidth and workload isolation.
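A small sketch of the logging path mentioned above: in-memory updates are appended to a log file and forced to stable storage (disk or flash), which is where the high-IOPS requirement comes from. The file name, record format, and class are illustrative only.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransactionLogSketch implements AutoCloseable {
    private final FileChannel log;

    TransactionLogSketch(Path logFile) throws IOException {
        // Append-only log on the cluster file system (disk or flash tier).
        log = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // Each in-memory transaction is durable only after force() returns;
    // one force per commit is what drives the IOPS requirement.
    void commit(String record) throws IOException {
        log.write(ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8)));
        log.force(false);   // flush data (not metadata) to stable storage
    }

    @Override
    public void close() throws IOException {
        log.close();
    }
}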
[Diagram: nodes 1..N, each an SMP with 16-64 cores and direct/interconnect I/O channels, joined by a 10 Gb Ethernet node interconnect]
Scalable Storage for Data Warehousing
[Chart: execution time (s) of a 1-node TPC-H A2-2 query on ext3, GPFS-SNC, and GPFS]
Goal: provide scale-out storage capability for warehousing.
Motivation: exploit file system features to accelerate the database layer.
Performance: linearly scales out with the number of nodes; replication can saturate InfiniBand and 10 GbE networks.
Benefits: client-side caching, disaster recovery, tiering, MapReduce techniques.
Cost: lower than a SAN configuration.
[Diagram: BCUs 1..N, each an SMP with 4-16 cores and direct/interconnect I/O channels, scaled out over an Ethernet/InfiniBand node interconnect]
[Chart: sequential write/read, random read/write, backward read, record rewrite, and stride read throughput, gpfs-san vs. gpfs-snc]
Storage Clouds: Wide Variety
[Table: storage cloud classes (production compute, general-purpose NAS extension, video/docs, archive index, archive content) compared on latency/IOPS (<10 ms and high IOPS down to >1 s and low IOPS), I/O characteristics (random vs. sequential), protocols (block, file, web), logical protection (RPO=0, periodic backup, versioning of objects, no DR), physical protection (RPO=hours, none), encryption/secure erase (yes/no), guaranteed immutability (yes/no), and cost tolerance (low, lower, lowest)]
Tier-1 Storage for Compute Cloud
Based on a generic compute cloud.
Data is ingested into the compute cloud onto a GPFS-SNC cluster for staging, referred to as the "repository".
A parallel copy moves the data into the analytics GPFS-SNC cluster for processing.
Analytics works on the data in parallel using compute cloud resources.
The results of the analytics are put back into the staging area for customer verification.
[Diagram: the customer feeds the compute cloud; the repository (data and results) is an 8-VM, 4 TB GPFS-SNC cluster, and the analytics cloud is another 8-VM, 4 TB GPFS-SNC cluster]
Image-deployment with GPFS-SNC
Advantages of GPFS-SNC: no distinct storage and compute nodes, so better cloud utilization.
Automated replication for failure recovery.
Automatic scalability and intelligent VM placement.
Instance deployment time reduced from ~6 min to 8 sec with GPFS-SNC and semantic caching.
[Diagram: current architecture: IDE nodes accessing shared storage over iSCSI; proposed architecture: IDE+GPFS on every node. IDE = Image Deployment Engine]
Scaling VMware with local disks
[Diagram: multiple VMware ESXi hosts, each running virtual machines (OS + app) on local HDD/SSD pooled by GPFS-SNC; POSIX, MapReduce, and disaster-recovery interfaces serve InfoSphere Streams and InfoSphere BigInsights; multi-tiered storage with automated policy-based migration spans the local disks and SAN storage]
FOCUS: Fixed content archive cloud
[Diagram: GPFS clusters H1/H2, A1/A2, and Z1 in data centers DC-H, DC-A, and DC-Z; each node runs WSS / IA middleware on GPFS-SNC with a local catalog; a global view service tracks which clusters hold each object; a WSS client issues puts and gets, and objects are duplicated across clusters]
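A rough sketch of the lookup path implied by the diagram: a get first checks the local catalog and, on a miss, asks the global view service which clusters hold the object before fetching it from one of them. All class and method names here are hypothetical, not the FOCUS API.

import java.util.List;
import java.util.Optional;

public class ArchiveCloudLookupSketch {

    interface LocalCatalog {                 // per-cluster catalog on GPFS-SNC
        Optional<byte[]> lookup(String objectId);
    }

    interface GlobalViewService {            // maps object -> owning clusters
        List<String> clustersHolding(String objectId);
    }

    interface RemoteCluster {                // WSS / IA middleware endpoint
        byte[] fetch(String clusterId, String objectId);
    }

    static byte[] get(String objectId, LocalCatalog catalog,
                      GlobalViewService globalView, RemoteCluster remote) {
        // Fast path: the object is already in this cluster's catalog.
        Optional<byte[]> local = catalog.lookup(objectId);
        if (local.isPresent()) {
            return local.get();
        }
        // Slow path: ask the global view service, then pull from any
        // cluster that holds a (possibly duplicated) copy.
        for (String cluster : globalView.clustersHolding(objectId)) {
            byte[] data = remote.fetch(cluster, objectId);
            if (data != null) {
                return data;
            }
        }
        throw new IllegalStateException("object not found: " + objectId);
    }
}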
Futures
Scale and performance: 10,000+ nodes.
Recovery: scale with the number of nodes.
Dynamic caching and tiering: data where it's needed.
Space efficiency: coding, compression, and de-duplication.
Advanced workload isolation.