GlusterFS Internals and
Directions
Jeff Darcy, Principal Engineer, Red Hat
13 June 2013
GlusterFS is not
a filesystem
Wait . . . what?
● GlusterFS is a scalable general purpose storage platform
● We handle common storage tasks
● cluster management and configuration
● data distribution and replication
● common control and data structures
● That platform can be used many different ways
Interface Possibilities
[Diagram: interfaces (qemu, NFS, SMB, Hadoop, FUSE, Cinder, Swift (UFO), libgfapi, whatever comes next) serving files, blocks, and objects; transports: IP, RDMA; back ends: files, BD (block device), DB]
OpenStack and GlusterFS – Current Integration
[Diagram (logical view / physical view): Glance images, Nova nodes, and Swift objects on the logical side; KVM compute nodes and GlusterFS storage servers holding Cinder, Glance, and Swift data (Swift via its API) on the physical side]
● Separate Compute and Storage Pools
● GlusterFS directly provides Swift object service
● Integration with Keystone
● GeoReplication for multi-site support
● Swift data also available via other protocols
● Supports non-OpenStack use in addition to OpenStack use
OpenStack and GlusterFS - Future Direction
[Diagram: Nova compute nodes, each host running a Gluster guest alongside Hadoop and other guests]
OpenStack and GlusterFS - Future Direction (continued)
● POC based on the proposed OpenStack FaaS (File as a Service) API
● Cinder-like virtual NAS service
● Tenant-specific file shares
● Hypervisor mediated for security
● Avoid exposing servers to Quantum tenant network
● Optional multi-site or multi-zone GeoReplication
● FaaS data optionally available to non-OpenStack nodes
● Initial focus on Linux guests
● Windows (SMB) and NFS shares also under consideration
Making Hard Stuff Easier
● Distributed filesystems are notoriously hard to set up
● multiple experts for multiple weeks is “normal”
● How about four CLI commands? (sketched below)
● probe peer, create volume, start volume, mount
● We handle cluster membership, process management, port mapping, dynamic configuration changes, etc.
● add/remove nodes on the fly
● add/remove features on the fly
● rolling upgrade
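For example, the whole sequence might look like this on a two-node pool (host, volume, and brick names are invented for illustration):

    # on server1: form the pool, then create and start a replicated volume
    gluster peer probe server2
    gluster volume create demo replica 2 server1:/bricks/demo server2:/bricks/demo
    gluster volume start demo

    # on any client: mount it
    mount -t glusterfs server1:/demo /mnt/demo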
Q: How Do We Do It?
[Diagram: the stack is one of the access layers (FUSE, libgfapi, ...), plus all of the cluster translators (Distribution, Replication, ..., RPC Client), plus all of the server stack (RPC Server, ..., Local Storage over a local FS)]
A: Modularity!
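That modularity shows up directly in volfiles, the configuration files that describe a volume's translator graph. A hand-simplified sketch of a distributed-replicated stack (names invented; real volfiles are generated by glusterd and carry more options):

    volume client-0
        type protocol/client
        option remote-host server1
        option remote-subvolume /bricks/demo
    end-volume

    volume client-1
        type protocol/client
        option remote-host server2
        option remote-subvolume /bricks/demo
    end-volume

    volume replicate-0
        type cluster/replicate
        subvolumes client-0 client-1
    end-volume

    volume distribute-0
        type cluster/distribute
        subvolumes replicate-0
    end-volume

Adding or removing a feature is a matter of splicing a translator into or out of this graph.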
Deep Dive: Distribution
[Translator stack diagram repeated, highlighting the Distribution layer]
Elastic Hashing
[Diagram: files X and Y hash-mapped across servers A, B, and C]
● Deterministic mapping: object hash → server
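A toy version of that mapping in C (FNV-1a hash and fixed equal ranges, purely for illustration; the real DHT translator uses a Davies-Meyer hash and stores per-directory layout ranges in xattrs):

    #include <stdint.h>
    #include <stdio.h>

    /* stand-in name hash (FNV-1a); any stable hash works for the sketch */
    static uint32_t name_hash(const char *name)
    {
        uint32_t h = 2166136261u;
        for (; *name; name++) {
            h ^= (uint8_t)*name;
            h *= 16777619u;
        }
        return h;
    }

    /* carve the 32-bit hash space into equal slices, one per server */
    static int pick_server(const char *name, int nservers)
    {
        uint32_t width = UINT32_MAX / (uint32_t)nservers;
        uint32_t idx = name_hash(name) / width;
        return idx >= (uint32_t)nservers ? nservers - 1 : (int)idx;
    }

    int main(void)
    {
        const char *files[] = { "fileX", "fileY", "fileZ" };
        for (int i = 0; i < 3; i++)
            printf("%s -> server %c\n", files[i],
                   'A' + pick_server(files[i], 3));
        return 0;
    }

Because placement is pure computation, any client can locate a file without consulting a metadata server; when the server set changes, only files whose hash falls into a reassigned range have to move.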
Adding a Node
[Diagram: server D joins servers A, B, and C; files X and Y largely keep their assignments]
● Minimize reassignment when server set changes
Rebalancing
● Goal: optimal layout with minimal data movement
● Greatly improved algorithms in 3.4
Future: Tiering and Topology Awareness
● General deterministic matching function: file attributes to storage attributes
● Currently both attributes are hashes, but...
● file attribute could be account ID, age, ...
● storage attribute could be disk type (SSD), replication level, ...
● either could be an arbitrary tag
● Rebalance etc. “just work” regardless
● Algorithms can be stacked on top of one another
Tiering Example
[Diagram: the volume selects by path between Development and Production; Production then selects by age between SSD and Replicated tiers; leaves place files randomly]
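A toy C version of such a matching function, mirroring the example above (the fields, paths, and thresholds are hypothetical, not GlusterFS code):

    #include <stdio.h>
    #include <string.h>

    struct file_attr {
        const char *path;      /* file attribute: location in the tree */
        int         age_days;  /* file attribute: time since last write */
    };

    /* map file attributes to a storage tier instead of a pure name hash */
    static const char *pick_tier(const struct file_attr *f)
    {
        if (strncmp(f->path, "/development/", 13) == 0)  /* select by path */
            return "Development (random)";
        if (f->age_days < 30)                            /* select by age */
            return "Production/SSDs (random)";
        return "Production/Replicated (random)";
    }

    int main(void)
    {
        struct file_attr samples[] = {
            { "/development/scratch.dat", 2 },
            { "/production/hot.db", 5 },
            { "/production/archive.tar", 400 },
        };
        for (int i = 0; i < 3; i++)
            printf("%s -> %s\n", samples[i].path, pick_tier(&samples[i]));
        return 0;
    }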
Deep Dive: Replication
[Translator stack diagram repeated, highlighting the Replication layer]
Replicated Writes
[Diagram: the client brackets each write to servers A and B with lock, xattr+, write, xattr-, unlock; see the sketch after this slide's bullets]
● Many optimizations avoid the lock/xattr ops
● especially for sequential writes
● Still synchronous
● don't try this on a high-latency network
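A schematic C sketch of the transaction shown above (the rpc() helper is a hypothetical stand-in for the per-replica RPCs; this shows the shape of the protocol, not GlusterFS's actual replication code):

    #include <stdio.h>

    /* hypothetical stand-in for sending one operation to one replica */
    static void rpc(const char *server, const char *op)
    {
        printf("%s <- %s\n", server, op);
    }

    static void replicated_write(const char *data)
    {
        const char *replicas[] = { "serverA", "serverB" };
        const int n = 2;

        for (int i = 0; i < n; i++) rpc(replicas[i], "lock");
        /* xattr+ : mark the write pending on every replica, so a failure
         * mid-write leaves evidence of which copies may be stale */
        for (int i = 0; i < n; i++) rpc(replicas[i], "xattr+ (mark pending)");
        for (int i = 0; i < n; i++) rpc(replicas[i], data);
        /* xattr- : clear the marks once all replicas have acknowledged */
        for (int i = 0; i < n; i++) rpc(replicas[i], "xattr- (clear pending)");
        for (int i = 0; i < n; i++) rpc(replicas[i], "unlock");
    }

    int main(void) { replicated_write("write \"foo\""); return 0; }

The pending xattrs left behind by a failure are exactly what self-heal (next slide) uses to decide which copy is good.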
Self Heal
● Generation 1: on demand
● Generation 2: full manual scan
● Generation 3: parallel, automatic repair
● index based
● GlusterFS 3.3, RHS 2.0
● Future: journal based
● even more precise (i.e. faster)
● lower overhead
Split Brain
[Diagram: a network partition separates the replicas; client 1 writes "foo" via server A while client 2 writes "bar" via server B]
Split Brain (continued)
● In 3.3: basic quorum enforcement (see the sketch below)
● client side, replica-set level
● poor approach for N=2
● In 3.4: advanced quorum enforcement
● server side, cluster level
● In 3.5: hyper-advanced (?) quorum enforcement
● volume level
● arbiters (best approach for N=2)
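The core quorum test is tiny; a minimal sketch of the majority rule (not GlusterFS code), which also shows why N=2 is awkward: losing either replica kills quorum, hence arbiters:

    #include <stdio.h>

    /* allow writes only when more than half the replicas are reachable */
    static int have_quorum(int reachable, int replicas)
    {
        return 2 * reachable > replicas;
    }

    int main(void)
    {
        printf("1 of 2 reachable: %s\n", have_quorum(1, 2) ? "write" : "reject");
        printf("2 of 3 reachable: %s\n", have_quorum(2, 3) ? "write" : "reject");
        return 0;
    }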
Access Methods (past)
[Diagram: NFS, Samba, Swift, and Hadoop all reach the translator stack (Distribution, Replication, ..., RPC Client) through FUSE]
Access Methods (present)
[Diagram: as above, plus qemu talking directly to the stack through libgfapi]
Access Methods (future)
[Diagram: as above, plus your own applications ("Your API") through libgfapi]
What is libgfapi?
● User-space library for accessing data in GlusterFS
● Filesystem-like API
● Runs in application process
● no FUSE, no copies, no context switches
● ...but same volfiles, translators, etc.
● Could be used for Apache/nginx modules, MPI I/O (maybe), Ganesha, etc. ad infinitum
● BTW it's usable from Python too :)
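A minimal C sketch of the API (volume, host, and file names invented; error handling trimmed; link with -lgfapi):

    #include <stdio.h>
    #include <fcntl.h>
    #include <glusterfs/api/glfs.h>

    int main(void)
    {
        /* connect to the "demo" volume via its management server */
        glfs_t *fs = glfs_new("demo");
        glfs_set_volfile_server(fs, "tcp", "server1", 24007);
        if (glfs_init(fs) != 0) {
            perror("glfs_init");
            return 1;
        }

        /* ordinary filesystem-style calls, with no FUSE in the path */
        glfs_fd_t *fd = glfs_creat(fs, "/hello.txt", O_RDWR, 0644);
        glfs_write(fd, "hello gluster\n", 14, 0);
        glfs_close(fd);

        glfs_fini(fs);
        return 0;
    }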
Translator API
● If libgfapi isn't enough, you can write your own translators (including glupy for Python)
● Most of what we already do is in translators
● It's a public (though not well documented) API
● “Translator 101” series, forge.gluster.org
● Translators are right in the I/O path
● Current examples: encryption, erasure coding
● Other possibilities: dedup/compression, format translation, indexing
http://www.gluster.org
● Modularity makes it all possible
● Expect: OpenStack, Hadoop, OpenStack, Hadoop, ...
● marketing made me say that
● more front-end protocols
● more back-end storage options
● more functionality within the I/O path
● more performance enhancements
● Make the storage system you want