Reliable Replicated File Systems with GlusterFS
John Sellens
@jsellens
USENIX LISA 28, 2014
November 14, 2014
Notes PDF at http://www.syonex.com/notes/
Contents
Preamble and Introduction
Setting Up GlusterFS Servers
Mounting on Clients
Managing, Monitoring, Fixing
Wrap Up
©2014 John Sellens, USENIX LISA 28, 2014
Preamble and Introduction
Overview
• Network Attached Storage is handy to have in many cases
– And sometimes we have limited budgets
• GlusterFS provides a scalable NAS system
– On “normal” systems and hardware
• An introduction to GlusterFS and its uses
• And how to implement and maintain a GlusterFS file service
Notes:
• http://www.gluster.org/
• We’re not going to cover everything in this Mini Tutorial session
– But it should get you started
– In time for mid-afternoon break!
• Both USENIX and I will very much appreciate your feedback — please fill
out the evaluation form
Solving a Problem
• Needed to replace a small but reliable network file service
– Expanding the existing service wasn’t going to work
• Wanted something comprehensive but comprehensible
• Needed Posix filesystem semantics, and NFS
• Wanted something that would let me sleep at night
• GlusterFS seemed a good fit
– Supported by Red Hat; provides NFS, CIFS, . . .
– User space, on top of regular filesystem
Notes:
• I have a small hosting infrastructure that I like to implement reliably
• Red Hat Storage Server is a supported GlusterFS implementation
Alternatives I Was Less Enthused About
• Block replication – DRBD, HAST
– Not transparent – hard to look and confirm consistency
– Hard to expand, Limited to two server nodes
• Object stores – Ceph, Hadoop, etc.
– I had no need for shared block devices (for KVM guests, etc.)
– Not always Posix and NFS
• Others – MooseFS, Lustre, etc.
– Some needed separate meta-data server(s)
– Some had single master servers
Notes:
• I was running HAST on FreeBSD, and tried (and failed) to expand it
– Partly due to old hardware I was using
Why I Like GlusterFS
• Can run on just two servers – all functions on both
• Sits on top of a standard filesystem (ext3, xfs)
– Files in GlusterFS volumes are visible as normal files
– So if everything fails very badly, I can likely copy the files out
– Easy to compare replicated copies of files for consistency
• Fits nicely with CentOS which I tend to use
– NFS server support means that my existing FreeBSD boxes
would work “just fine”
Notes:
• I like to be both simple-minded and paranoid
– So being able to check and copy if need be was appealing
Hardware – Don’t Use Your Old Junk
• I have some old 32-bit machines
– Bad, bad idea
• These days, code doesn’t seem to be tested well on 32 bit
• GlusterFS inodes (or equivalent) are 64 bits
– Which doesn’t sit well with 32 bit NFS clients
• In theory 32 bit should work, in practice it’s at least annoying
• 2⁶ Yes! but 2⁵ No! (64-bit yes, 32-bit no)
Notes:
• This is not just GlusterFS related
• My old 32 bit FreeBSD HAST systems started misbehaving when I tried
to update and expand
Setting Up GlusterFS Servers
Set Up Some Servers
• Ordinary servers with ordinary storage
– All the “normal” speed/reliability questions
– I’ll suggest CentOS 7 (or 6)
• Leave unallocated space to use for GlusterFS
• Separate storage network?
– Traffic and security
• Dedicated servers for storage?
– Likely want storage servers to be static and dedicated
Notes:
• Since Red Hat does the development, it’s pretty likely that GlusterFS will
work well on CentOS
– Should work on Fedora and Debian as well, if you’re that way inclined
• GlusterFS 3.6 likely to have FreeBSD and MacOS support (I hope)
https://forums.freebsd.org/viewtopic.php?t=46923
• And of course, it should go without saying, but make sure NTP and DNS
and networking are working properly.
RAID on the Servers?
• GlusterFS hardware failures “should be” non-disruptive
• RAID should provide better I/O performance
– Especially hardware RAID with cache
• Re-building/resilvering an entire server for a disk failure is boring
– Overall storage performance will suffer in the meantime
– A second failure might be a big problem
• Small general purpose deployment?
– Use good servers and suitable RAID
• Other situations may suit non-RAID
– Lots of servers, more than 2 replicas, etc.
Notes:
• Configuration management should mean that a server rebuild is “easy”
– Your mileage may vary
• Remember that a failed disk means lots of I/O and time to repair, and
you’re vulnerable to other failures while rebuilding
Networks and Security
• GlusterFS has limited security and access controls
– Assumption: all servers and networks are friendly
• A separate storage network may be prudent
– glusterfs mounts need to reach gluster peer addresses
– NFS mounts by default are available on all interfaces
• Generally you want to isolate GlusterFS traffic if you can
– Firewalls, subnets, iptables, . . .
Notes:
• I have very limited experience trying to contain GlusterFS
• If you’re using only glusterfs mounts, an isolated network would be useful
– For performance and “containment”
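One way to follow the “firewalls, subnets, iptables” advice above is a rule sketch like this. The storage subnet and the port ranges are assumptions: GlusterFS 3.4 and later use TCP 24007 for management and one port from 49152 up per brick (older releases used 24009 and up), so verify the ports for your version first.

```shell
# Sketch only: accept gluster traffic from an assumed storage subnet
# (10.0.0.0/24) and drop it from everywhere else.
iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 24007:24008 -j ACCEPT
iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 49152:49251 -j ACCEPT
iptables -A INPUT -p tcp --dport 24007:24008 -j DROP
iptables -A INPUT -p tcp --dport 49152:49251 -j DROP
# Leave NFS (TCP 2049) and portmapper (111) reachable on the
# client-facing network if clients mount over NFS.
```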
IPs and Addressing
• Generally you will want fixed and floating addresses
• GlusterFS peers need to talk to each other
• glusterfs mounts need to find one peer then talk to the others
– First peer provides details of the volumes and peers
• NFS and CIFS mounts want floating service addresses
– Active/passive mounts need just one
– Active/active mounts need more
• CTDB is recommended for IP address manipulation
Notes:
• With two servers, I have 6 addresses total
– Management addresses
– Storage network peer addresses
– Floating addresses that are normally one per server
• More on CTDB later, on the IP Addresses and CTDB slide
Installing GlusterFS
• Use the standard gluster.org repositories
– See notes
• Install with:
yum install glusterfs-server
service glusterd start
chkconfig glusterd on
• or: apt-get install glusterfs-server
• Current version is 3.6.1
Notes:
• Versions – use 3.5.x
– I seemed to have less reliable/stable behaviour with 3.4
• Everything is under the download link at
http://download.gluster.org/pub/gluster/glusterfs/LATEST/
• CentOS:
wget -P /etc/yum.repos.d \
  http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo
• Debian – see
http://download.gluster.org/pub/gluster/glusterfs/3.5/LATEST/Debian/wheezy/README
A Little Terminology
• A set of GlusterFS servers is a Trusted Storage Pool
– Members of a pool are peers of each other
• A GlusterFS filesystem is a Volume
• Volumes are composed of storage Bricks
• Volumes can be three types, and most combinations
– Distributed – different files are on different bricks
– Striped – (very large) files are split across bricks
– Replicated – two or more copies on different bricks
• Distributed Replicated – more servers than replicas
• A Sub-Volume is a replica set within a Volume
Notes:
• Distributed provides no redundancy
– Though you might have RAID disks on servers
– But you’re still in trouble if a server goes down
Set Up the Peers
• All servers in a pool need to know each other:
node1# gluster peer probe node2
• Doesn’t hurt to do this (I think it’s optional):
node2# gluster peer probe node1
• And make sure they are talking:
node1# gluster peer status
– That only lists the other peer(s)
• List the servers in a pool:
node1# gluster pool list
Set Us Up the Brick
• A brick is just a directory in an OS filesystem
• One brick per filesystem
– Disk storage dedicated to a volume
– /data/gluster/volname/brickN/brick
• Could have multiple bricks in a filesystem
– Disk storage shared between volumes
– /data/gluster/disk1/volname/brickN
• Don’t want a brick to be a filesystem mount point
– Big problems if underlying storage not mounted
• Multiple volumes? Use the latter for better utilization
Notes:
• XFS is the suggested filesystem to use
• A suggested naming convention for bricks:
http://www.gluster.org/community/documentation/
index.php/HowTos:Brick_naming_conventions
• With disk mount points, and multiple bricks per OS filesystem, one GlusterFS volume can use up space and “fill up” other volumes
• With multiple bricks per OS filesystem, it’s harder to know which gluster
volume is using up space – df shows the same for all volumes
• Depends on your use case
– One big volume or multiple volumes for different purposes
– Will volumes shrink, or only grow?
– Is it convenient to have multiple OS disk partitions?
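A minimal sketch of preparing one brick, following the XFS suggestion and the “brick below the mount point” advice above; the device name, inode-size option, and paths are example assumptions, not part of the tutorial:

```shell
# Sketch: one XFS filesystem per brick, example device and paths
mkfs.xfs -i size=512 /dev/vg0/gluster_disk1   # larger inodes leave room for gluster xattrs
mkdir -p /data/glusterfs/disk1
mount /dev/vg0/gluster_disk1 /data/glusterfs/disk1
mkdir -p /data/glusterfs/disk1/vol1/brick1    # this subdirectory is the brick
# If the mount is missing, brick1 is absent and gluster fails loudly,
# rather than silently writing into the root filesystem.
```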
Sizing Up a Brick
• How big should a brick (partition) be?
• One brick using all space on a server is easy to create
– But harder to move or replace if needed
• Consider using bricks of manageable size e.g. 500GB, 1TB
– Will likely be easier to migrate/replace if needed
– Of course, if you have a lot of storage, a zillion bricks might
be difficult
• Keep more space free than is on any one server?
Notes:
• I think there are some subtleties here that aren’t quite so obvious
• And might be worth a thought or two before you commit yourself to a
storage layout that will be hard to change
Create a Volume
• Volume creation is straightforward:
node1# gluster volume create vol1 replica 2 \
    node1:/data/glusterfs/disk1/vol1/brick1 \
    node2:/data/glusterfs/disk1/vol1/brick1 \
    node1:/data/glusterfs/disk2/vol1/brick2 \
    node2:/data/glusterfs/disk2/vol1/brick2
node1# gluster volume start vol1
node1# gluster volume info vol1
node1# mount -t glusterfs localhost:/vol1 /mnt
node1# showmount -e node2
• Replicas are across the first two bricks, and next two
• Name things sensibly now, save your brain later
Notes:
• Each brick will now have a .glusterfs directory
• Adding files or directories to the volume causes them to show up in the
bricks of one of the replicated pairs
• You can look, but do not touch
– Only change a volume through a mount
– Never by modifying a brick directly
• Likely best to stick with the built-in NFS server
• You can set options on a volume with
gluster volume set volname option value
• If you’re silly (like me) and have 32 bit NFS clients:
gluster volume set volname nfs.enable-ino32 on
IP Addresses and CTDB
• CTDB is a clustered TDB database built for Samba
• Includes IP address failover
• Set up CTDB on each node – /etc/ctdb/nodes
• Manage public IPs – /etc/ctdb/public_addresses
• Needs a shared private directory for locks, etc.
• Starts/stops Samba
• Active/active with DNS round robin
Notes:
• Setup is fairly easy – follow these pages
http://www.gluster.org/community/documentation/index.php/CTDB
http://wiki.samba.org/index.php/CTDB_Setup
http://ctdb.samba.org/
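The two CTDB files named on the slide might look like the sketch below; every address and the interface name are invented examples:

```
# /etc/ctdb/nodes -- one private (storage network) address per node
10.0.0.1
10.0.0.2

# /etc/ctdb/public_addresses -- floating addresses CTDB moves between
# live nodes, with prefix length and interface
192.168.1.51/24 eth0
192.168.1.52/24 eth0
```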
Mounting on Clients
Native Mount or NFS?
• Many small files, mostly read?
– e.g. a web server?
– Use NFS client
• Write heavy load?
– Use native gluster client
• Client not Linux?
– Use NFS client
– Or CIFS if Windows client
Notes:
• http://www.gluster.org/documentation/Technical_FAQ/
Gluster Native Mount
• Install glusterfs-fuse or glusterfs-client:
client# mount -t glusterfs ghost:/vol1 /mnt
• Use a public/floating IP/hostname for the mount
• Gluster client gets volume info
• Then uses the peer names used when adding bricks
– So a gluster client must have access to the storage network
• Client handles if nodes disappear
Notes:
• mount.glusterfs(8) does not mention all the mount options
• In particular, the option backupvolfile-server=node2 might be useful, if you don’t use public/floating IPs
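For boot-time native mounts, the same options can go in /etc/fstab; the hostnames and mount point here are examples, and backupvolfile-server is the fallback option mentioned in the note above:

```
# /etc/fstab sketch (example names)
ghost:/vol1  /mnt/vol1  glusterfs  defaults,_netdev,backupvolfile-server=node2  0 0
```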
NFS Mount
• Like any other NFS mount:
client# mount glusterhost:/vol1 /mnt
• Use a public/floating IP/hostname for the mount
• NFS talks to that IP/hostname
– So an NFS client need not have access to the storage
network
• NFS must use TCP, not UDP
• Failover should be handled by the CTDB IP switch
– But for a planned outage you may want to adjust client mounts ahead of time
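A matching /etc/fstab sketch for an NFS client, pinning TCP and NFSv3 as required above (hostname and mount point are examples):

```
# /etc/fstab sketch -- gluster's built-in NFS server speaks NFSv3 over TCP
glusterhost:/vol1  /mnt/vol1  nfs  proto=tcp,vers=3,_netdev  0 0
```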
CIFS Mounts
• Similar to NFS mounts
– Use public/floating IP’s name
• Need to configure Samba as appropriate on the servers:
clustering = yes
idmap backend = tdb2
private dir = /gluster/shared/lock
• CTDB will start/stop Samba
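Put together, the server-side Samba configuration might look like this sketch; the share name and paths are examples, and the [global] lines restate the settings above:

```
# smb.conf sketch (example share name and paths)
[global]
    clustering = yes
    idmap backend = tdb2
    private dir = /gluster/shared/lock

[vol1]
    path = /mnt/vol1        # gluster volume mounted locally on the server
    read only = no
```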
Managing, Monitoring, Fixing
Ongoing Management
• When all is going well, there’s not much to do
• Monitor filespace usage and other normal things
• Gluster monitoring – check for
– Processes running
– All bricks connected
– Free space
– Volume heal info
• Lots of logs in /var/log/glusterfs
• Note well: GlusterFS, like RAID, is not a backup
Notes:
• I use check_glusterfs by Mark Ruys, [email protected]
http://exchange.nagios.org/directory/Plugins/System-Metrics/File-System/GlusterFS-checks/details
• I run it as root via SNMP
• Unsynced entries (from heal info) are normally 0, but when busy there
can be transitory unsynced entries
– My gluster volumes are not heavy write
– You may see more unsynced
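If you roll your own check instead of check_glusterfs, a small helper can total the unsynced entries reported by heal info. This is a sketch that assumes the 3.5-era “Number of entries:” output lines; adjust the pattern if your version prints something different.

```shell
#!/bin/sh
# Sketch: sum the unsynced-entry counts from
# `gluster volume heal VOLNAME info` output (output format is an assumption).
heal_unsynced() {
    # read heal-info output on stdin, print the summed entry count
    awk -F: '/^Number of entries/ { total += $2 } END { print total + 0 }'
}
```

Usage: `gluster volume heal vol1 info | heal_unsynced`, and alert when the total stays non-zero across a few samples (transitory entries under write load are normal).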
Command Line Stuff
• The gluster command is the primary tool:
node1# gluster volume info vol1
node1# gluster volume log rotate vol1
node1# gluster volume status vol1
node1# gluster volume heal vol1 info
node1# gluster help
• The volume heal subcommands provide info on consistency
– And can trigger a heal action
Adding More Space
• Expanding the underlying filesystem provides more space
– But likely want to keep things consistent across servers
• And of course you can add bricks:
node1# gluster volume add-brick vol1 \
    node1:/path/brick2 node2:/path/brick2
node1# gluster volume rebalance vol1 start
• Note that you must add bricks in multiples of the replica count
– Each new pair is a replica pair, just like for create
• Increase replica count by setting new count and adding enough
bricks
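The last point, increasing the replica count, can be sketched as below; node3 and the brick paths are examples, and `replica 3` on add-brick is what declares the new count (one new brick per existing sub-volume):

```shell
# Sketch: grow a 2-way replicated volume (two sub-volumes) to 3-way
node1# gluster volume add-brick vol1 replica 3 \
    node3:/path/brick1 node3:/path/brick2
node1# gluster volume heal vol1 full   # populate the new bricks
```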
Notes:
• If you have a replica with bricks of different sizes, you may be wasting
space
• You don’t have to add-brick on a particular node, any server that
knows about the volume should likely work fine
– I’m just a creature of habit
• But you can’t reduce the replica count . . .
– At least, I don’t think you can reduce the replica count
• A rebalance could be useful if file deletions have left bricks (sub-volumes)
unbalanced
Removing Space
• Remove bricks with start, status, commit:
node1# gluster volume remove-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 start
• Replace start with status for progress
• When complete, run commit
• For replicated volumes, you have to remove all the bricks of a
sub-volume at the same time
Notes:
• This of course is never needed, because space needs never decrease
Replacing or Moving a Brick
• Move a brick with replace-brick
node1# gluster volume replace-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 start
• Start, status, commit like remove-brick
• If you’re adding a third server to a pool with replicas
– Should be able to shuffle bricks to the desired result
– Or, if there’s extra space, add and remove bricks
• If a brick is dead, you may need commit force
– With RAID, this is less of a problem . . .
Notes:
• The Red Hat manual suggests that this is much more complicated
• This is a nice description of adding a third server
http://joejulian.name/blog/how-to-expand-glusterfs-replicated-clusters-by-one-server/
Taking a Node Out of Service
• In theory it should be simple:
node1# ctdb disable
node1# service glusterd stop
• In practice, you might want to manually move NFS clients first
• Clients with native gluster mounts should be “just fine”
• On restart, volumes should “self-heal”
Notes:
• I’m paranoid about time for an NFS client to notice a new server
Split Brain Problems
• With multiple servers (more than 2), it is useful to set:
node1# gluster volume set all \
    cluster.server-quorum-ratio 51%
node1# gluster volume set VOLNAME \
    cluster.server-quorum-type server
• With two nodes, could add a 3rd “dummy” node with no storage
• If heal info reports unsync’d entries:
node1# gluster volume heal VOLNAME
• Sometimes a client-side “stat” of affected file can fix things
– Or a copy and move back
Notes:
• Default quorum ratio is greater than 50%
– Or so the docs seem to say
• The Red Hat Storage Administration Guide has a nice discussion
– And lots of details on recovery
• Fixing split brain:
https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md
• Remember: do not modify bricks directly!
Wrap Up
We Haven’t Talked About
• GlusterFS has many features and options
• Snapshots
• Geo-Replication
• Object storage – OpenStack Storage (Swift)
• Quotas
Notes:
• We’ve tried to hit the key areas to get started with Gluster
• We didn’t cover everything
• Hopefully you’ve learned some of the more interesting aspects
• And can apply them in your own implementations
Where to Get Gluster Help
• gluster.org web site has a lot of links
– Mailing lists, IRC, . . .
• Quick Start Guide
• Red Hat Storage documentation is pretty good
• HowTo page
• GlusterFS Administrator Guide
Notes:
• GlusterFS documentation is currently a bit disjointed
• http://www.gluster.org/
• http://www.gluster.org/documentation/quickstart/index.html
• Administrator Guide is currently a link to a github repository of markdown
files
• https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/
• http://www.gluster.org/documentation/howto/HowTo/
And Finally!
• Please take the time to fill out the tutorial evaluations
– The tutorial evaluations help USENIX offer the best possible
tutorial programs
– Comments, suggestions, criticisms gratefully accepted
– All evaluations are carefully reviewed, by USENIX and by the
presenter (me!)
• Feel free to contact me directly if you have any unanswered
questions, either now, or later: [email protected]
• Questions? Comments?
• Thank you for attending!
Notes:
• Thank you for taking this tutorial, and I hope that it was (and will be)
informative and useful for you.
• I would be very interested in your feedback, positive or negative, and sug-
gestions for additional things to include in future versions of this tutorial,
on the comment form, here at the conference, or later by email.