
Reliable Replicated File Systems with GlusterFS

John Sellens

jsellens@syonex.com

@jsellens

USENIX LISA 28, 2014

November 14, 2014

Notes PDF at http://www.syonex.com/notes/


Contents

Preamble and Introduction

Setting Up GlusterFS Servers

Mounting on Clients

Managing, Monitoring, Fixing

Wrap Up

© 2014 John Sellens, USENIX LISA 28, 2014

Preamble and Introduction


Overview

• Network Attached Storage is handy to have in many cases

– And sometimes we have limited budgets

• GlusterFS provides a scalable NAS system

– On “normal” systems and hardware

• An introduction to GlusterFS and its uses

• And how to implement and maintain a GlusterFS file service


Notes:

• http://www.gluster.org/

• We’re not going to cover everything in this Mini Tutorial session

– But it should get you started

– In time for mid-afternoon break!

• Both USENIX and I will very much appreciate your feedback — please fill out the evaluation form


Solving a Problem

• Needed to replace a small but reliable network file service

– Expanding the existing service wasn’t going to work

• Wanted something comprehensive but comprehensible

• Needed Posix filesystem semantics, and NFS

• Wanted something that would let me sleep at night

• GlusterFS seemed a good fit

– Supported by Red Hat; NFS, CIFS, . . .

– User space, on top of regular filesystem


Notes:

• I have a small hosting infrastructure that I like to implement reliably

• Red Hat Storage Server is a supported GlusterFS implementation


Alternatives I Was Less Enthused About

• Block replication – DRBD, HAST

– Not transparent – hard to look and confirm consistency

– Hard to expand, Limited to two server nodes

• Object stores – Ceph, Hadoop, etc.

– No need for shared block devices for KVMs, etc

– Not always Posix and NFS

• Others – MooseFS, Lustre, etc.

– Some needed separate meta-data server(s)

– Some had single master servers


Notes:

• I was running HAST on FreeBSD, and tried (and failed) to expand it

– Partly due to old hardware I was using


Why I Like GlusterFS

• Can run on just two servers – all functions on both

• Sits on top of a standard filesystem (ext3, xfs)

– Files in GlusterFS volumes are visible as normal files

– So if everything fails very badly, I can likely copy the files out

– Easy to compare replicated copies of files for consistency

• Fits nicely with CentOS which I tend to use

– NFS server support means that my existing FreeBSD boxes would work “just fine”


Notes:

• I like to be both simple-minded and paranoid

– So being able to check and copy if need be was appealing


Hardware – Don’t Use Your Old Junk

• I have some old 32-bit machines

– Bad, bad idea

• These days, code doesn’t seem to be tested well on 32 bit

• GlusterFS inodes (or equivalent) are 64 bits

– Which doesn’t sit well with 32 bit NFS clients

• In theory 32 bit should work, in practice it’s at least annoying

• 2⁶ Yes! but 2⁵ No!


Notes:

• This is not just GlusterFS related

• My old 32 bit FreeBSD HAST systems started misbehaving when I tried to update and expand


Setting Up GlusterFS Servers


Set Up Some Servers

• Ordinary servers with ordinary storage

– All the “normal” speed/reliability questions

– I’ll suggest CentOS 7 (or 6)

• Leave unallocated space to use for GlusterFS

• Separate storage network?

– Traffic and security

• Dedicated servers for storage?

– Likely want storage servers to be static and dedicated


Notes:

• Since Red Hat does the development, it’s pretty likely that GlusterFS will work well on CentOS

– Should work on Fedora and Debian as well, if you’re that way inclined

• GlusterFS 3.6 likely to have FreeBSD and MacOS support (I hope)

https://forums.freebsd.org/viewtopic.php?t=46923

• And of course, it should go without saying, but make sure NTP and DNS and networking are working properly.


RAID on the Servers?

• GlusterFS hardware failures “should be” non-disruptive

• RAID should provide better I/O performance

– Especially hardware RAID with cache

• Re-building/resilvering an entire server for a disk failure is boring

– Overall storage performance will suffer in the meantime

– A second failure might be a big problem

• Small general purpose deployment?

– Use good servers and suitable RAID

• Other situations may suit non-RAID

– Lots of servers, more than 2 replicas, etc.


Notes:

• Configuration management should mean that a server rebuild is “easy”

– Your mileage may vary

• Remember that a failed disk means lots of I/O and time to repair, and you’re vulnerable to other failures while rebuilding


Networks and Security

• GlusterFS has limited security and access controls

– Assumption: all servers and networks are friendly

• A separate storage network may be prudent

– glusterfs mounts need to reach gluster peer addresses

– NFS mounts by default are available on all interfaces

• Generally you want to isolate GlusterFS traffic if you can

– Firewalls, subnets, iptables, . . .
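A minimal sketch of isolating gluster traffic with iptables. The port numbers reflect common GlusterFS 3.4+ defaults (24007/24008 for glusterd, 49152 and up for brick processes) and the storage subnet 192.168.10.0/24 is a placeholder — check both against your own installation:

  # accept gluster management and brick ports only from the storage subnet
  iptables -A INPUT -s 192.168.10.0/24 -p tcp --dport 24007:24008 -j ACCEPT
  iptables -A INPUT -s 192.168.10.0/24 -p tcp --dport 49152:49200 -j ACCEPT
  iptables -A INPUT -p tcp --dport 24007:24008 -j DROP
  iptables -A INPUT -p tcp --dport 49152:49200 -j DROP

Gluster can also restrict which client addresses may mount a volume:

  node1# gluster volume set vol1 auth.allow 192.168.10.*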


Notes:

• I have very limited experience trying to contain GlusterFS

• If you’re using only glusterfs mounts an isolated network would be useful

– For performance and “containment”


IPs and Addressing

• Generally you will want fixed and floating addresses

• GlusterFS peers need to talk to each other

• glusterfs mounts need to find one peer then talk to the others

– First peer provides details of the volumes and peers

• NFS and CIFS mounts want floating service addresses

– Active/passive mounts need just one

– Active/active mounts need more

• CTDB is recommended for IP address manipulation


Notes:

• With two servers, I have 6 addresses total

– Management addresses

– Storage network peer addresses

– Floating addresses that are normally one per server

• More on CTDB later, on the “IP Addresses and CTDB” slide below
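For illustration, a hypothetical /etc/hosts layout for the six-address, two-server setup described above (all names and addresses here are made up):

  # management addresses
  192.0.2.11     node1-mgmt
  192.0.2.12     node2-mgmt
  # storage network peer addresses (used for peer probe and bricks)
  192.168.10.11  node1
  192.168.10.12  node2
  # floating service addresses, managed by CTDB, normally one per server
  192.0.2.21     ghost1
  192.0.2.22     ghost2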


Installing GlusterFS

• Use the standard gluster.org repositories

– See notes

• Install with
  yum install glusterfs-server
  service glusterd start
  chkconfig glusterd on

• or apt-get install glusterfs-server

• Current version is 3.6.1
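The service/chkconfig commands above are SysV-style; on CentOS 7 with systemd, the equivalents should be roughly:

  systemctl enable glusterd
  systemctl start glusterd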


Notes:

• Versions – use 3.5.x

– I seemed to have less reliable/stable behaviour with 3.4

• Everything is under the download link at

http://download.gluster.org/pub/gluster/glusterfs/LATEST/

• CentOS:

wget -P /etc/yum.repos.d \
  http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo

• Debian – see

http://download.gluster.org/pub/gluster/glusterfs/3.5/LATEST/Debian/wheezy/README


A Little Terminology

• A set of GlusterFS servers is a Trusted Storage Pool

– Members of a pool are peers of each other

• A GlusterFS filesystem is a Volume

• Volumes are composed of storage Bricks

• Volumes can be three types, and most combinations

– Distributed – different files are on different bricks

– Striped – (very large) files are split across bricks

– Replicated – two or more copies on different bricks

• Distributed Replicated – more servers than replicas

• A Sub-Volume is a replica set within a Volume


Notes:

• Distributed provides no redundancy

– Though you might have RAID disks on servers

– But you’re still in trouble if a server goes down


Set Up the Peers

• All servers in a pool need to know each other
  node1# gluster peer probe node2

• Doesn’t hurt to do this (I think it’s optional)
  node2# gluster peer probe node1

• And make sure they are talking:
  node1# gluster peer status

– That only lists the other peer(s)

• List the servers in a pool
  node1# gluster pool list


Set Us Up the Brick

• A brick is just a directory in an OS filesystem

• One brick per filesystem

– Disk storage dedicated to a volume

– /data/gluster/volname/brickN/brick

• Could have multiple bricks in a filesystem

– Disk storage shared between volumes

– /data/gluster/disk1/volname/brickN

• Don’t want a brick to be a filesystem mount point

– Big problems if underlying storage not mounted

• Multiple volumes? Use the latter for better utilization
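A sketch of preparing a single brick along the lines above — the device name /dev/sdb1 and the volume name vol1 are placeholders, and the XFS options are just the commonly suggested ones:

  node1# mkfs.xfs -i size=512 /dev/sdb1
  node1# mkdir -p /data/gluster/vol1/brick1
  node1# echo '/dev/sdb1 /data/gluster/vol1/brick1 xfs defaults 0 0' >> /etc/fstab
  node1# mount /data/gluster/vol1/brick1
  node1# mkdir /data/gluster/vol1/brick1/brick

The extra brick directory below the mount point is what you hand to gluster, so that an unmounted filesystem shows up as a missing brick path rather than an empty brick on the root disk.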


Notes:

• XFS is the suggested filesystem to use

• A suggested naming convention for bricks:

http://www.gluster.org/community/documentation/index.php/HowTos:Brick_naming_conventions

• With disk mount points, and multiple bricks per OS filesystem, one GlusterFS volume can use up space and “fill up” other volumes

• With multiple bricks per OS filesystem, it’s harder to know which gluster volume is using up space – df shows the same for all volumes

• Depends on your use case

– One big volume or multiple volumes for different purposes

– Will volumes shrink, or only grow?

– Is it convenient to have multiple OS disk partitions?


Sizing Up a Brick

• How big should a brick (partition) be?

• One brick using all space on a server is easy to create

– But harder to move or replace if needed

• Consider using bricks of manageable size e.g. 500GB, 1TB

– Will likely be easier to migrate/replace if needed

– Of course, if you have a lot of storage, a zillion bricks might be difficult

• Keep more space free than is on any one server?


Notes:

• I think there are some subtleties here that aren’t quite so obvious

• And might be worth a thought or two before you commit yourself to a storage layout that will be hard to change


Create a Volume

• Volume creation is straightforward
  node1# gluster volume create vol1 replica 2 \
    node1:/data/glusterfs/disk1/vol1/brick1 \
    node2:/data/glusterfs/disk1/vol1/brick1 \
    node1:/data/glusterfs/disk2/vol1/brick2 \
    node2:/data/glusterfs/disk2/vol1/brick2

  node1# gluster volume start vol1
  node1# gluster volume info vol1
  node1# mount -t glusterfs localhost:/vol1 /mnt
  node1# showmount -e node2

• Replicas are across the first two bricks, and next two

• Name things sensibly now, save your brain later


Notes:

• Each brick will now have a .glusterfs directory

• Adding files or directories to the volume causes them to show up in the bricks of one of the replicated pairs

• You can look, but do not touch

– Only change a volume through a mount

– Never by modifying a brick directly

• Likely best to stick with the built-in NFS server

• You can set options on a volume with

gluster volume set volname option value

• If you’re silly (like me) and have 32 bit NFS clients:

gluster volume set volname nfs.enable-ino32 on


IP Addresses and CTDB

• CTDB is a clustered TDB database built for Samba

• Includes IP address failover

• Set up CTDB on each node – /etc/ctdb/nodes

• Manage public IPs – /etc/ctdb/public_addresses

• Needs a shared private directory for locks, etc.

• Starts/stops Samba

• Active/active with DNS round robin
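For illustration, the two CTDB files might look like this (addresses and interface name are placeholders):

  # /etc/ctdb/nodes — one storage-network peer address per line
  192.168.10.11
  192.168.10.12

  # /etc/ctdb/public_addresses — floating IPs that CTDB moves between nodes
  192.0.2.21/24 eth0
  192.0.2.22/24 eth0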


Notes:

• Setup is fairly easy – follow these pages

http://www.gluster.org/community/documentation/index.php/CTDB
http://wiki.samba.org/index.php/CTDB_Setup
http://ctdb.samba.org/


Mounting on Clients


Native Mount or NFS?

• Many small files, mostly read?

– e.g. a web server?

– Use NFS client

• Write heavy load?

– Use native gluster client

• Client not Linux?

– Use NFS client

– Or CIFS if Windows client


Notes:

• http://www.gluster.org/documentation/Technical_FAQ/


Gluster Native Mount

• Install glusterfs-fuse or glusterfs-client
  client# mount -t glusterfs ghost:/vol1 /mnt

• Use a public/floating IP/hostname for the mount

• Gluster client gets volume info

• Then uses the peer names used when adding bricks

– So a gluster client must have access to the storage network

• Client handles if nodes disappear


Notes:

• mount.glusterfs(8) does not mention all the mount options

• In particular, the option backupvolfile-server=node2 might be useful, if you don’t use public/floating IPs
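A hedged /etc/fstab example for a native mount, using the backupvolfile-server option mentioned above (hostnames are the ones used elsewhere in these notes):

  ghost:/vol1  /mnt  glusterfs  defaults,_netdev,backupvolfile-server=node2  0 0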


NFS Mount

• Like any other NFS mount
  client# mount glusterhost:/vol1 /mnt

• Use a public/floating IP/hostname for the mount

• NFS talks to that IP/hostname

– So an NFS client need not have access to the storage network

• NFS must use TCP, not UDP

• Failover should be handled by CTDB IP switch

– But for a planned outage, you might pre-plan and adjust the mounts in advance
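Since the built-in NFS server speaks NFSv3 over TCP, an explicit mount or fstab entry along these lines is a reasonable sketch (hostname and mount point are placeholders):

  client# mount -t nfs -o vers=3,tcp glusterhost:/vol1 /mnt

  glusterhost:/vol1  /mnt  nfs  defaults,_netdev,vers=3,tcp  0 0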


CIFS Mounts

• Similar to NFS mounts

– Use public/floating IP’s name

• Need to configure Samba as appropriate on the servers
  clustering = yes
  idmap backend = tdb2
  private dir = /gluster/shared/lock

• CTDB will start/stop Samba
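Beyond the clustering settings above, each server still needs a share definition; a minimal smb.conf sketch, assuming the gluster volume is mounted locally at /export/vol1 on every server:

  [vol1]
      path = /export/vol1
      read only = no
      browseable = yes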


Managing, Monitoring, Fixing


Ongoing Management

• When all is going well, there’s not much to do

• Monitor filespace usage and other normal things

• Gluster monitoring – check for

– Processes running

– All bricks connected

– Free space

– Volume heal info

• Lots of logs in /var/log/glusterfs

• Note well: GlusterFS, like RAID, is not a backup
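If you don’t want a full Nagios plugin, a rough cron-able sketch of the checks above might look like this — the grep patterns are assumptions about the command output, which varies between GlusterFS versions, so verify them against your own systems:

  #!/bin/sh
  VOL=vol1
  # is the management daemon running?
  pidof glusterd > /dev/null || echo "glusterd not running"
  # any bricks not online? (volume status shows Y/N in its Online column)
  gluster volume status $VOL | grep -q ' N ' && echo "$VOL: brick offline?"
  # any files waiting to be healed?
  gluster volume heal $VOL info | grep 'Number of entries' | grep -qv ': 0' \
    && echo "$VOL: unsynced entries"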


Notes:

• I use check_glusterfs by Mark Ruys, mark.ruys@peercode.nl

http://exchange.nagios.org/directory/Plugins/System-Metrics/File-System/GlusterFS-checks/details

• I run it as root via SNMP

• Unsynced entries (from heal info) are normally 0, but when busy there can be transitory unsynced entries

– My gluster volumes are not heavy write

– You may see more unsynced


Command Line Stuff

• The gluster command is the primary tool
  node1# gluster volume info vol1
  node1# gluster volume log rotate vol1
  node1# gluster volume status vol1
  node1# gluster volume heal vol1 info
  node1# gluster help

• The volume heal subcommands provide info on consistency

– And can trigger a heal action


Adding More Space

• Expanding the underlying filesystem provides more space

– But likely want to keep things consistent across servers

• And of course you can add bricks
  node1# gluster volume add-brick vol1 \
    node1:/path/brick2 node2:/path/brick2

  node1# gluster volume rebalance vol1 start

• Note that you must add bricks in multiples of the replica count

– Each new pair is a replica pair, just like for create

• Increase replica count by setting the new count and adding enough bricks
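For example, going from replica 2 to replica 3 on the vol1 volume created earlier would mean one new brick per sub-volume; a hedged sketch, with node3 and its paths as placeholders:

  node1# gluster volume add-brick vol1 replica 3 \
    node3:/data/glusterfs/disk1/vol1/brick1 \
    node3:/data/glusterfs/disk2/vol1/brick2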


Notes:

• If you have a replica with bricks of different sizes, you may be wasting space

• You don’t have to add-brick on a particular node, any server that knows about the volume should likely work fine

– I’m just a creature of habit

• But you can’t reduce the replica count . . .

– At least, I don’t think you can reduce the replica count

• A rebalance could be useful if file deletions have left bricks (sub-volumes) unbalanced


Removing Space

• Remove bricks with start, status, commit
  node1# gluster volume remove-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 start

• Replace start with status for progress

• When complete, run commit

• For replicated volumes, you have to remove all the bricks of a sub-volume at the same time
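The follow-up steps use the same brick list as the start command:

  node1# gluster volume remove-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 status
  node1# gluster volume remove-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 commit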


Notes:

• This of course is never needed, because space needs never decrease


Replacing or Moving a Brick

• Move a brick with replace-brick

node1# gluster volume replace-brick vol1 \
  node1:/path/brick1 node2:/path/brick1 start

• Start, status, commit like remove-brick

• If you’re adding a third server to a pool with replicas

– Should be able to shuffle bricks to the desired result

– Or, if there’s extra space, add and remove bricks

• If a brick is dead, you may need commit force

– With RAID, this is less of a problem . . .


Notes:

• The Red Hat manual suggests that this is much more complicated

• This is a nice description of adding a third server

http://joejulian.name/blog/how-to-expand-glusterfs-replicated-clusters-by-one-server/


Taking a Node Out of Service

• In theory it should be simple
  node1# ctdb disable
  node1# service glusterd stop

• In practice, you might want to manually move NFS clients first

• Clients with native gluster mounts should be “just fine”

• On restart, volumes should “self-heal”


Notes:

• I’m paranoid about the time it takes for an NFS client to notice a new server


Split Brain Problems

• With multiple servers (more than 2), useful to set
  node1# gluster volume set all \
    cluster.server-quorum-ratio 51%

  node1# gluster volume set VOLNAME \
    cluster.server-quorum-type server

• With two nodes, could add a 3rd “dummy” node with no storage

• If heal info reports unsync’d entries
  node1# gluster volume heal VOLNAME

• Sometimes a client-side “stat” of affected file can fix things

– Or a copy and move back


Notes:

• Default quorum ratio is more than 50%

– Or so the docs seem to say

• The Red Hat Storage Administration Guide has a nice discussion

– And lots of details on recovery

• Fixing split brain:

https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md

• Remember: do not modify bricks directly!


Wrap Up


We Haven’t Talked About

• GlusterFS has many features and options

• Snapshots

• Geo-Replication

• Object storage – OpenStack Storage (Swift)

• Quotas


Notes:

• We’ve tried to hit the key areas to get started with Gluster

• We didn’t cover everything

• Hopefully you’ve learned some of the more interesting aspects

• And can apply them in your own implementations


Where to Get Gluster Help

• gluster.org web site has a lot of links

– Mailing lists, IRC, . . .

• Quick Start Guide

• Red Hat Storage documentation is pretty good

• HowTo page

• GlusterFS Administrator Guide


Notes:

• GlusterFS documentation is currently a bit disjointed

• http://www.gluster.org/

• http://www.gluster.org/documentation/quickstart/index.html

• Administrator Guide is currently a link to a github repository of markdown files

• https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/

• http://www.gluster.org/documentation/howto/HowTo/


And Finally!

• Please take the time to fill out the tutorial evaluations

– The tutorial evaluations help USENIX offer the best possible tutorial programs

– Comments, suggestions, criticisms gratefully accepted

– All evaluations are carefully reviewed, by USENIX and by the presenter (me!)

• Feel free to contact me directly if you have any unanswered questions, either now, or later: jsellens@syonex.com

• Questions? Comments?

• Thank you for attending!


Notes:

• Thank you for taking this tutorial, and I hope that it was (and will be) informative and useful for you.

• I would be very interested in your feedback, positive or negative, and suggestions for additional things to include in future versions of this tutorial, on the comment form, here at the conference, or later by email.
