
Reliable Replicated File Systems with GlusterFS

John Sellens

jsellens@syonex.com

@jsellens

USENIX LISA 28, 2014

November 14, 2014

Notes PDF at http://www.syonex.com/notes/


Contents

Preamble and Introduction

Setting Up GlusterFS Servers

Mounting on Clients

Managing, Monitoring, Fixing

Wrap Up

© 2014 John Sellens, USENIX LISA 28, 2014

Preamble and Introduction


Overview

• Network Attached Storage is handy to have in many cases

– And sometimes we have limited budgets

• GlusterFS provides a scalable NAS system

– On “normal” systems and hardware

• An introduction to GlusterFS and its uses

• And how to implement and maintain a GlusterFS file service


Notes:

• http://www.gluster.org/

• We’re not going to cover everything in this Mini Tutorial session

– But it should get you started

– In time for mid-afternoon break!

• Both USENIX and I will very much appreciate your feedback — please fill out the evaluation form


Solving a Problem

• Needed to replace a small but reliable network file service

– Expanding the existing service wasn’t going to work

• Wanted something comprehensive but comprehensible

• Needed Posix filesystem semantics, and NFS

• Wanted something that would let me sleep at night

• GlusterFS seemed a good fit

– Supported by Red Hat; NFS, CIFS, . . .

– User space, on top of regular filesystem


Notes:

• I have a small hosting infrastructure that I like to implement reliably

• Red Hat Storage Server is a supported GlusterFS implementation


Alternatives I Was Less Enthused About

• Block replication – DRBD, HAST

– Not transparent – hard to look and confirm consistency

– Hard to expand, Limited to two server nodes

• Object stores – Ceph, Hadoop, etc.

– No need for shared block devices for KVMs, etc

– Not always Posix and NFS

• Others – MooseFS, Lustre, etc.

– Some needed separate meta-data server(s)

– Some had single master servers


Notes:

• I was running HAST on FreeBSD, and tried (and failed) to expand it

– Partly due to old hardware I was using


Why I Like GlusterFS

• Can run on just two servers – all functions on both

• Sits on top of a standard filesystem (ext3, xfs)

– Files in GlusterFS volumes are visible as normal files

– So if everything fails very badly, I can likely copy the files out

– Easy to compare replicated copies of files for consistency

• Fits nicely with CentOS which I tend to use

– NFS server support means that my existing FreeBSD boxes would work “just fine”


Notes:

• I like to be both simple-minded and paranoid

– So being able to check and copy if need be was appealing


Hardware – Don’t Use Your Old Junk

• I have some old 32-bit machines

– Bad, bad idea

• These days, code doesn’t seem to be tested well on 32 bit

• GlusterFS inodes (or equivalent) are 64 bits

– Which doesn’t sit well with 32 bit NFS clients

• In theory 32 bit should work, in practice it’s at least annoying

• 2⁶ Yes! but 2⁵ No!


Notes:

• This is not just GlusterFS related

• My old 32 bit FreeBSD HAST systems started misbehaving when I tried to update and expand


Setting Up GlusterFS Servers


Set Up Some Servers

• Ordinary servers with ordinary storage

– All the “normal” speed/reliability questions

– I’ll suggest CentOS 7 (or 6)

• Leave unallocated space to use for GlusterFS

• Separate storage network?

– Traffic and security

• Dedicated servers for storage?

– Likely want storage servers to be static and dedicated


Notes:

• Since Red Hat does the development, it’s pretty likely that GlusterFS will work well on CentOS

– Should work on Fedora and Debian as well, if you’re that way inclined

• GlusterFS 3.6 likely to have FreeBSD and MacOS support (I hope)

https://forums.freebsd.org/viewtopic.php?t=46923

• And of course, it should go without saying, but make sure NTP and DNS and networking are working properly.


RAID on the Servers?

• GlusterFS hardware failures “should be” non-disruptive

• RAID should provide better I/O performance

– Especially hardware RAID with cache

• Re-building/resilvering an entire server for a disk failure is boring

– Overall storage performance will suffer in the meantime

– A second failure might be a big problem

• Small general purpose deployment?

– Use good servers and suitable RAID

• Other situations may suit non-RAID

– Lots of servers, more than 2 replicas, etc.


Notes:

• Configuration management should mean that a server rebuild is “easy”

– Your mileage may vary

• Remember that a failed disk means lots of I/O and time to repair, and you’re vulnerable to other failures while rebuilding


Networks and Security

• GlusterFS has limited security and access controls

– Assumption: all servers and networks are friendly

• A separate storage network may be prudent

– glusterfs mounts need to reach gluster peer addresses

– NFS mounts by default are available on all interfaces

• Generally you want to isolate GlusterFS traffic if you can

– Firewalls, subnets, iptables, . . .
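A minimal sketch of isolating gluster traffic with iptables. The port numbers reflect common GlusterFS 3.4+ defaults (24007/24008 for glusterd, 49152 and up for brick processes) and the storage subnet 192.168.10.0/24 is a placeholder — check both against your own installation:

  # accept gluster management and brick ports only from the storage subnet
  iptables -A INPUT -s 192.168.10.0/24 -p tcp --dport 24007:24008 -j ACCEPT
  iptables -A INPUT -s 192.168.10.0/24 -p tcp --dport 49152:49200 -j ACCEPT
  iptables -A INPUT -p tcp --dport 24007:24008 -j DROP
  iptables -A INPUT -p tcp --dport 49152:49200 -j DROP

Gluster can also restrict which client addresses may mount a volume:

  node1# gluster volume set vol1 auth.allow 192.168.10.*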


Notes:

• I have very limited experience trying to contain GlusterFS

• If you’re using only glusterfs mounts an isolated network would be useful

– For performance and “containment”


IPs and Addressing

• Generally you will want fixed and floating addresses

• GlusterFS peers need to talk to each other

• glusterfs mounts need to find one peer then talk to the others

– First peer provides details of the volumes and peers

• NFS and CIFS mounts want floating service addresses

– Active/passive mounts need just one

– Active/active mounts need more

• CTDB is recommended for IP address manipulation


Notes:

• With two servers, I have 6 addresses total

– Management addresses

– Storage network peer addresses

– Floating addresses that are normally one per server

• More on CTDB later, on the “IP Addresses and CTDB” slide below
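For illustration, a hypothetical /etc/hosts layout for the six-address, two-server setup described above (all names and addresses here are made up):

  # management addresses
  192.0.2.11     node1-mgmt
  192.0.2.12     node2-mgmt
  # storage network peer addresses (used for peer probe and bricks)
  192.168.10.11  node1
  192.168.10.12  node2
  # floating service addresses, managed by CTDB, normally one per server
  192.0.2.21     ghost1
  192.0.2.22     ghost2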


Installing GlusterFS

• Use the standard gluster.org repositories

– See notes

• Install with
  yum install glusterfs-server
  service glusterd start
  chkconfig glusterd on

• or apt-get install glusterfs-server

• Current version is 3.6.1
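The service/chkconfig commands above are SysV-style; on CentOS 7 with systemd, the equivalents should be roughly:

  systemctl enable glusterd
  systemctl start glusterd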


Notes:

• Versions – use 3.5.x

– I seemed to have less reliable/stable behaviour with 3.4

• Everything is under the download link at

http://download.gluster.org/pub/gluster/glusterfs/LATEST/

• CentOS:

wget -P /etc/yum.repos.d \
  http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo

• Debian – see

http://download.gluster.org/pub/gluster/glusterfs/3.5/LATEST/Debian/wheezy/README


A Little Terminology

• A set of GlusterFS servers is a Trusted Storage Pool

– Members of a pool are peers of each other

• A GlusterFS filesystem is a Volume

• Volumes are composed of storage Bricks

• Volumes can be three types, and most combinations

– Distributed – different files are on different bricks

– Striped – (very large) files are split across bricks

– Replicated – two or more copies on different bricks

• Distributed Replicated – more servers than replicas

• A Sub-Volume is a replica set within a Volume


Notes:

• Distributed provides no redundancy

– Though you might have RAID disks on servers

– But you’re still in trouble if a server goes down


Set Up the Peers

• All servers in a pool need to know each other
  node1# gluster peer probe node2

• Doesn’t hurt to do this (I think it’s optional)
  node2# gluster peer probe node1

• And make sure they are talking:
  node1# gluster peer status

– That only lists the other peer(s)

• List the servers in a pool
  node1# gluster pool list


Set Us Up the Brick

• A brick is just a directory in an OS filesystem

• One brick per filesystem

– Disk storage dedicated to a volume

– /data/gluster/volname/brickN/brick

• Could have multiple bricks in a filesystem

– Disk storage shared between volumes

– /data/gluster/disk1/volname/brickN

• Don’t want a brick to be a filesystem mount point

– Big problems if underlying storage not mounted

• Multiple volumes? Use the latter for better utilization
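A sketch of preparing a single brick along the lines above — the device name /dev/sdb1 and the volume name vol1 are placeholders, and the XFS options are just the commonly suggested ones:

  node1# mkfs.xfs -i size=512 /dev/sdb1
  node1# mkdir -p /data/gluster/vol1/brick1
  node1# echo '/dev/sdb1 /data/gluster/vol1/brick1 xfs defaults 0 0' >> /etc/fstab
  node1# mount /data/gluster/vol1/brick1
  node1# mkdir /data/gluster/vol1/brick1/brick

The extra brick directory below the mount point is what you hand to gluster, so that an unmounted filesystem shows up as a missing brick path rather than an empty brick on the root disk.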


Notes:

• XFS is the suggested filesystem to use

• A suggested naming convention for bricks:

http://www.gluster.org/community/documentation/index.php/HowTos:Brick_naming_conventions

• With disk mount points, and multiple bricks per OS filesystem, one GlusterFS volume can use up space and “fill up” other volumes

• With multiple bricks per OS filesystem, it’s harder to know which gluster volume is using up space – df shows the same for all volumes

• Depends on your use case

– One big volume or multiple volumes for different purposes

– Will volumes shrink, or only grow?

– Is it convenient to have multiple OS disk partitions?


Sizing Up a Brick

• How big should a brick (partition) be?

• One brick using all space on a server is easy to create

– But harder to move or replace if needed

• Consider using bricks of manageable size e.g. 500GB, 1TB

– Will likely be easier to migrate/replace if needed

– Of course, if you have a lot of storage, a zillion bricks might be difficult

• Keep more space free than is on any one server?


Notes:

• I think there are some subtleties here that aren’t quite so obvious

• And might be worth a thought or two before you commit yourself to a storage layout that will be hard to change


Create a Volume

• Volume creation is straightforward
  node1# gluster volume create vol1 replica 2 \
    node1:/data/glusterfs/disk1/vol1/brick1 \
    node2:/data/glusterfs/disk1/vol1/brick1 \
    node1:/data/glusterfs/disk2/vol1/brick2 \
    node2:/data/glusterfs/disk2/vol1/brick2

  node1# gluster volume start vol1
  node1# gluster volume info vol1
  node1# mount -t glusterfs localhost:/vol1 /mnt
  node1# showmount -e node2

• Replicas are across the first two bricks, and next two

• Name things sensibly now, save your brain later


Notes:

• Each brick will now have a .glusterfs directory

• Adding files or directories to the volume causes them to show up in the bricks of one of the replicated pairs

• You can look, but do not touch

– Only change a volume through a mount

– Never by modifying a brick directly

• Likely best to stick with the built-in NFS server

• You can set options on a volume with

gluster volume set volname option value

• If you’re silly (like me) and have 32 bit NFS clients:

gluster volume set volname nfs.enable-ino32 on


IP Addresses and CTDB

• CTDB is a clustered TDB database built for Samba

• Includes IP address failover

• Set up CTDB on each node – /etc/ctdb/nodes

• Manage public IPs – /etc/ctdb/public_addresses

• Needs a shared private directory for locks, etc.

• Starts/stops Samba

• Active/active with DNS round robin
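For illustration, the two CTDB files might look like this (addresses and interface name are placeholders):

  # /etc/ctdb/nodes — one storage-network peer address per line
  192.168.10.11
  192.168.10.12

  # /etc/ctdb/public_addresses — floating IPs that CTDB moves between nodes
  192.0.2.21/24 eth0
  192.0.2.22/24 eth0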


Notes:

• Setup is fairly easy – follow these pages

http://www.gluster.org/community/documentation/index.php/CTDB
http://wiki.samba.org/index.php/CTDB_Setup
http://ctdb.samba.org/


Mounting on Clients


Native Mount or NFS?

• Many small files, mostly read?

– e.g. a web server?

– Use NFS client

• Write heavy load?

– Use native gluster client

• Client not Linux?

– Use NFS client

– Or CIFS if Windows client


Notes:

• http://www.gluster.org/documentation/Technical_FAQ/


Gluster Native Mount

• Install glusterfs-fuse or glusterfs-client
  client# mount -t glusterfs ghost:/vol1 /mnt

• Use a public/floating IP/hostname for the mount

• Gluster client gets volume info

• Then uses the peer names used when adding bricks

– So a gluster client must have access to the storage network

• Client handles if nodes disappear


Notes:

• mount.glusterfs(8) does not mention all the mount options

• In particular, the option backupvolfile-server=node2 might be useful, if you don’t use public/floating IPs
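A hedged /etc/fstab example for a native mount, using the backupvolfile-server option mentioned above (hostnames are the ones used elsewhere in these notes):

  ghost:/vol1  /mnt  glusterfs  defaults,_netdev,backupvolfile-server=node2  0 0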


NFS Mount

• Like any other NFS mount
  client# mount glusterhost:/vol1 /mnt

• Use a public/floating IP/hostname for the mount

• NFS talks to that IP/hostname

– So an NFS client need not have access to the storage network

• NFS must use TCP, not UDP

• Failover should be handled by CTDB IP switch

– But for a planned outage, you might pre-plan and adjust the mounts in advance
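Since the built-in NFS server speaks NFSv3 over TCP, an explicit mount or fstab entry along these lines is a reasonable sketch (hostname and mount point are placeholders):

  client# mount -t nfs -o vers=3,tcp glusterhost:/vol1 /mnt

  glusterhost:/vol1  /mnt  nfs  defaults,_netdev,vers=3,tcp  0 0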


CIFS Mounts

• Similar to NFS mounts

– Use public/floating IP’s name

• Need to configure Samba as appropriate on the servers
  clustering = yes
  idmap backend = tdb2
  private dir = /gluster/shared/lock

• CTDB will start/stop Samba
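Beyond the clustering settings above, each server still needs a share definition; a minimal smb.conf sketch, assuming the gluster volume is mounted locally at /export/vol1 on every server:

  [vol1]
      path = /export/vol1
      read only = no
      browseable = yes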


Managing, Monitoring, Fixing


Ongoing Management

• When all is going well, there’s not much to do

• Monitor filespace usage and other normal things

• Gluster monitoring – check for

– Processes running

– All bricks connected

– Free space

– Volume heal info

• Lots of logs in /var/log/glusterfs

• Note well: GlusterFS, like RAID, is not a backup
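If you don’t want a full Nagios plugin, a rough cron-able sketch of the checks above might look like this — the grep patterns are assumptions about the command output, which varies between GlusterFS versions, so verify them against your own systems:

  #!/bin/sh
  VOL=vol1
  # is the management daemon running?
  pidof glusterd > /dev/null || echo "glusterd not running"
  # any bricks not online? (volume status shows Y/N in its Online column)
  gluster volume status $VOL | grep -q ' N ' && echo "$VOL: brick offline?"
  # any files waiting to be healed?
  gluster volume heal $VOL info | grep 'Number of entries' | grep -qv ': 0' \
    && echo "$VOL: unsynced entries"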


Notes:

• I use check_glusterfs by Mark Ruys, mark.ruys@peercode.nl

http://exchange.nagios.org/directory/Plugins/System-Metrics/File-System/GlusterFS-checks/details

• I run it as root via SNMP

• Unsynced entries (from heal info) are normally 0, but when busy there can be transitory unsynced entries

– My gluster volumes are not heavy write

– You may see more unsynced


Command Line Stuff

• The gluster command is the primary tool
  node1# gluster volume info vol1
  node1# gluster volume log rotate vol1
  node1# gluster volume status vol1
  node1# gluster volume heal vol1 info
  node1# gluster help

• The volume heal subcommands provide info on consistency

– And can trigger a heal action


Adding More Space

• Expanding the underlying filesystem provides more space

– But likely want to keep things consistent across servers

• And of course you can add bricks
  node1# gluster volume add-brick vol1 \
    node1:/path/brick2 node2:/path/brick2

  node1# gluster volume rebalance vol1 start

• Note that you must add bricks in multiples of the replica count

– Each new pair is a replica pair, just like for create

• Increase replica count by setting the new count and adding enough bricks
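For example, going from replica 2 to replica 3 on the vol1 volume created earlier would mean one new brick per sub-volume; a hedged sketch, with node3 and its paths as placeholders:

  node1# gluster volume add-brick vol1 replica 3 \
    node3:/data/glusterfs/disk1/vol1/brick1 \
    node3:/data/glusterfs/disk2/vol1/brick2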


Notes:

• If you have a replica with bricks of different sizes, you may be wasting space

• You don’t have to add-brick on a particular node, any server that knows about the volume should likely work fine

– I’m just a creature of habit

• But you can’t reduce the replica count . . .

– At least, I don’t think you can reduce the replica count

• A rebalance could be useful if file deletions have left bricks (sub-volumes) unbalanced


Removing Space

• Remove bricks with start, status, commit
  node1# gluster volume remove-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 start

• Replace start with status for progress

• When complete, run commit

• For replicated volumes, you have to remove all the bricks of a sub-volume at the same time
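The follow-up steps use the same brick list as the start command:

  node1# gluster volume remove-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 status
  node1# gluster volume remove-brick vol1 \
    node1:/path/brick1 node2:/path/brick1 commit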


Notes:

• This of course is never needed, because space needs never decrease


Replacing or Moving a Brick

• Move a brick with replace-brick

node1# gluster volume replace-brick vol1 \
  node1:/path/brick1 node2:/path/brick1 start

• Start, status, commit like remove-brick

• If you’re adding a third server to a pool with replicas

– Should be able to shuffle bricks to the desired result

– Or, if there’s extra space, add and remove bricks

• If a brick is dead, you may need commit force

– With RAID, this is less of a problem . . .


Notes:

• The Red Hat manual suggests that this is much more complicated

• This is a nice description of adding a third server

http://joejulian.name/blog/how-to-expand-glusterfs-replicated-clusters-by-one-server/


Taking a Node Out of Service

• In theory it should be simple
  node1# ctdb disable
  node1# service glusterd stop

• In practice, you might want to manually move NFS clients first

• Clients with native gluster mounts should be “just fine”

• On restart, volumes should “self-heal”


Notes:

• I’m paranoid about the time it takes for an NFS client to notice a new server


Split Brain Problems

• With multiple servers (more than 2), useful to set
  node1# gluster volume set all \
    cluster.server-quorum-ratio 51%

  node1# gluster volume set VOLNAME \
    cluster.server-quorum-type server

• With two nodes, could add a 3rd “dummy” node with no storage

• If heal info reports unsync’d entries
  node1# gluster volume heal VOLNAME

• Sometimes a client-side “stat” of affected file can fix things

– Or a copy and move back


Notes:

• Default quorum ratio is more than 50%

– Or so the docs seem to say

• The Red Hat Storage Administration Guide has a nice discussion

– And lots of details on recovery

• Fixing split brain:

https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md

• Remember: do not modify bricks directly!


Wrap Up


We Haven’t Talked About

• GlusterFS has many features and options

• Snapshots

• Geo-Replication

• Object storage – OpenStack Storage (Swift)

• Quotas


Notes:

• We’ve tried to hit the key areas to get started with Gluster

• We didn’t cover everything

• Hopefully you’ve learned some of the more interesting aspects

• And can apply them in your own implementations


Where to Get Gluster Help

• gluster.org web site has a lot of links

– Mailing lists, IRC, . . .

• Quick Start Guide

• Red Hat Storage documentation is pretty good

• HowTo page

• GlusterFS Administrator Guide


Notes:

• GlusterFS documentation is currently a bit disjointed

• http://www.gluster.org/

• http://www.gluster.org/documentation/quickstart/index.html

• Administrator Guide is currently a link to a github repository of markdown files

• https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/

• http://www.gluster.org/documentation/howto/HowTo/


And Finally!

• Please take the time to fill out the tutorial evaluations

– The tutorial evaluations help USENIX offer the best possible tutorial programs

– Comments, suggestions, criticisms gratefully accepted

– All evaluations are carefully reviewed, by USENIX and by the presenter (me!)

• Feel free to contact me directly if you have any unanswered questions, either now, or later: jsellens@syonex.com

• Questions? Comments?

• Thank you for attending!


Notes:

• Thank you for taking this tutorial, and I hope that it was (and will be) informative and useful for you.

• I would be very interested in your feedback, positive or negative, and suggestions for additional things to include in future versions of this tutorial, on the comment form, here at the conference, or later by email.
