8/10/2019 Ganeti walk-through Ganeti 2.10.0~rc1 documentation.pdf
Ganeti walk-through
Documents Ganeti version 2.10
Contents
Ganeti walk-through
Introduction
Cluster creation
Running a burn-in
Instance operations
Creation
Accessing instances
Removal
Recovering from hardware failures
Recovering from node failure
Re-adding a node to the cluster
Disk failures
Common cluster problems
Instance status
Unallocated DRBD minors
Orphan volumes
N+1 errors
Network issues
Migration problems
In use disks at instance shutdown
LUXI version mismatch
Introduction
This document serves as a more example-oriented guide to Ganeti; while the administration
guide shows a conceptual approach, here you will find a step-by-step example of managing
instances and the cluster.
Our simulated example cluster will have three machines, named node1, node2 and node3. Note
that in real life machines will usually have FQDNs, but here we use short names for brevity.
We will use a secondary network for replication data, 192.0.2.0/24, with each node's last
octet being the same as its index. The cluster name will be example-cluster. All nodes have
the same simulated hardware configuration: two disks of 750GB, 32GB of memory and 4
CPUs.
On this cluster, we will create up to seven instances, named instance1 to instance7.
Cluster creation
Follow the Ganeti installation tutorial document and prepare the nodes. Then it's time to
eti walk-through Ganeti 2.10.0~rc1 documentation http://docs.ganeti.org/ganeti/2.10/html/walkthrough.html
18 21.1.2014. 19:32
initialise the cluster:
$ gnt-cluster init -s 192.0.2.1 --enabled-hypervisors=xen-pvm example-cluster
$
The creation was fine. Let's check that the one node we have is functioning correctly:
$ gnt-node list
Node  DTotal DFree MTotal MNode MFree Pinst Sinst
node1   1.3T  1.3T  32.0G  1.0G 30.5G     0     0
$ gnt-cluster verify
Mon Oct 26 02:08:51 2009 * Verifying global settings
Mon Oct 26 02:08:51 2009 * Gathering data (1 nodes)
Mon Oct 26 02:08:52 2009 * Verifying node status
Mon Oct 26 02:08:52 2009 * Verifying instance status
Mon Oct 26 02:08:52 2009 * Verifying orphan volumes
Mon Oct 26 02:08:52 2009 * Verifying remaining instances
Mon Oct 26 02:08:52 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 02:08:52 2009 * Other Notes
Mon Oct 26 02:08:52 2009 * Hooks Results
$
Since this proceeded correctly, let's add the other two nodes:
$ gnt-node add -s 192.0.2.2 node2
-- WARNING --
Performing this operation is going to replace the ssh daemon keypair
on the target machine (node2) with the ones of the current one
and grant full intra-cluster ssh root access to/from it
Unable to verify hostkey of host xen-devi-5.fra.corp.google.com:
f7:. Do you want to accept it?
y/[n]/?: y
Mon Oct 26 02:11:53 2009 Authentication to node2 via public key failed, trying pas
root password:
Mon Oct 26 02:11:54 2009 - INFO: Node will be a master candidate
$ gnt-node add -s 192.0.2.3 node3
-- WARNING --
Performing this operation is going to replace the ssh daemon keypair
on the target machine (node3) with the ones of the current one
and grant full intra-cluster ssh root access to/from it
Mon Oct 26 02:12:43 2009 - INFO: Node will be a master candidate
Checking the cluster status again:
$ gnt-node list
Node  DTotal DFree MTotal MNode MFree Pinst Sinst
node1   1.3T  1.3T  32.0G  1.0G 30.5G     0     0
node2   1.3T  1.3T  32.0G  1.0G 30.5G     0     0
node3   1.3T  1.3T  32.0G  1.0G 30.5G     0     0
$ gnt-cluster verify
Mon Oct 26 02:15:14 2009 * Verifying global settings
Mon Oct 26 02:15:14 2009 * Gathering data (3 nodes)
Mon Oct 26 02:15:16 2009 * Verifying node status
Mon Oct 26 02:15:16 2009 * Verifying instance status
Mon Oct 26 02:15:16 2009 * Verifying orphan volumes
Mon Oct 26 02:15:16 2009 * Verifying remaining instances
Mon Oct 26 02:15:16 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 02:15:16 2009 * Other Notes
- Failing over instances
  * instance instance1
  * instance instance5
  * Submitted job ID(s) 179, 180, 181, 182, 183
    waiting for job 179 for instance1
- Migrating instances
  * instance instance1
    migration and migration cleanup
  * instance instance5
    migration and migration cleanup
  * Submitted job ID(s) 184, 185, 186, 187, 188
    waiting for job 184 for instance1
- Exporting and re-importing instances
  * instance instance1
    export to node node3
    remove instance
    import from node3 to node1, node2
    remove export
  * instance instance5
    export to node node1
    remove instance
    import from node1 to node2, node3
    remove export
  * Submitted job ID(s) 196, 197, 198, 199, 200
    waiting for job 196 for instance1
- Reinstalling instances
  * instance instance1
    reinstall without passing the OS
    reinstall specifying the OS
  * instance instance5
    reinstall without passing the OS
    reinstall specifying the OS
  * Submitted job ID(s) 203, 204, 205, 206, 207
    waiting for job 203 for instance1
- Rebooting instances
  * instance instance1
    reboot with type 'hard'
    reboot with type 'soft'
    reboot with type 'full'
  * instance instance5
    reboot with type 'hard'
    reboot with type 'soft'
    reboot with type 'full'
  * Submitted job ID(s) 208, 209, 210, 211, 212
    waiting for job 208 for instance1
- Adding and removing disks
  * instance instance1
    adding a disk
    removing last disk
  * instance instance5
    adding a disk
    removing last disk
  * Submitted job ID(s) 213, 214, 215, 216, 217
    waiting for job 213 for instance1
- Adding and removing NICs
  * instance instance1
    adding a NIC
    removing last NIC
  * instance instance5
    adding a NIC
    removing last NIC
  * Submitted job ID(s) 218, 219, 220, 221, 222
    waiting for job 218 for instance1
- Activating/deactivating disks
  * instance instance1
    activate disks when online
    activate disks when offline
    deactivate disks (when offline)
  * instance instance5
    activate disks when online
    activate disks when offline
    deactivate disks (when offline)
  * Submitted job ID(s) 223, 224, 225, 226, 227
    waiting for job 223 for instance1
- Stopping and starting instances
  * instance instance1
  * instance instance5
  * Submitted job ID(s) 230, 231, 232, 233, 234
    waiting for job 230 for instance1
- Removing instances
  * instance instance1
  * instance instance5
  * Submitted job ID(s) 235, 236, 237, 238, 239
    waiting for job 235 for instance1
$
You can see in the above what operations the burn-in does. Ideally, the burn-in log would
proceed successfully through all the steps and end cleanly, without throwing errors.
Instance operations
Creation
At this point, Ganeti and the hardware seem to be functioning correctly, so we'll follow up
with creating the instances manually:
$ gnt-instance add -t drbd -o debootstrap -s 256m instance1
Mon Oct 26 04:06:52 2009 - INFO: Selected nodes for instance instance1 via ialloca
Mon Oct 26 04:06:53 2009 * creating instance disks...
Mon Oct 26 04:06:57 2009 adding instance instance1 to cluster config
Mon Oct 26 04:06:57 2009 - INFO: Waiting for instance instance1 to sync disks.
Mon Oct 26 04:06:57 2009 - INFO: - device disk/0: 20.00% done, 4 estimated seconds
Mon Oct 26 04:07:01 2009 - INFO: Instance instance1's disks are in sync.
Mon Oct 26 04:07:01 2009 creating os for instance instance1 on node node2
Mon Oct 26 04:07:01 2009 * running the instance OS create scripts...
Mon Oct 26 04:07:14 2009 * starting instance...
$ gnt-instance add -t drbd -o debootstrap -s 256m -n node1:node2 instance2
Mon Oct 26 04:11:37 2009 * creating instance disks...
Mon Oct 26 04:11:40 2009 adding instance instance2 to cluster config
Mon Oct 26 04:11:41 2009 - INFO: Waiting for instance instance2 to sync disks.
Mon Oct 26 04:11:41 2009 - INFO: - device disk/0: 35.40% done, 1 estimated seconds
Mon Oct 26 04:11:42 2009 - INFO: - device disk/0: 58.50% done, 1 estimated seconds
Mon Oct 26 04:11:43 2009 - INFO: - device disk/0: 86.20% done, 0 estimated seconds
Mon Oct 26 04:11:44 2009 - INFO: - device disk/0: 92.40% done, 0 estimated seconds
Mon Oct 26 04:11:44 2009 - INFO: - device disk/0: 97.00% done, 0 estimated seconds
Mon Oct 26 04:11:44 2009 - INFO: Instance instance2's disks are in sync.
Mon Oct 26 04:11:44 2009 creating os for instance instance2 on node node1
Mon Oct 26 04:11:44 2009 * running the instance OS create scripts...
Mon Oct 26 04:11:57 2009 * starting instance...
$
The above shows one instance created via an iallocator script, and one created with
manual node assignment. The other three instances were also created and now it's time to
check them:
$ gnt-instance list
Instance  Hypervisor OS          Primary_node Status  Memory
instance1 xen-pvm    debootstrap node2        running   128M
instance2 xen-pvm    debootstrap node1        running   128M
instance3 xen-pvm    debootstrap node1        running   128M
instance4 xen-pvm    debootstrap node3        running   128M
instance5 xen-pvm    debootstrap node2        running   128M
Accessing instances
Accessing an instance's console is easy:
$ gnt-instance console instance2
[ 0.000000] Bootdata ok (command line is root=/dev/sda1 ro)
[ 0.000000] Linux version 2.6
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 0000000008800000 (usable)
[13138176.018071] Built 1 zonelists. Total pages: 34816
[13138176.018074] Kernel command line: root=/dev/sda1 ro
[13138176.018694] Initializing CPU#0
Checking file systems...fsck 1.41.3 (12-Oct-2008)
done.
Setting kernel variables (/etc/sysctl.conf)...done.
Mounting local filesystems...done.
Activating swapfile swap...done.
Setting up networking....
Configuring network interfaces...done.
Setting console screen modes and fonts.
INIT: Entering runlevel: 2
Starting enhanced syslogd: rsyslogd.
Starting periodic command scheduler: crond.
Debian GNU/Linux 5.0 instance2 tty1
instance2 login:
At this moment you can log in to the instance and, after configuring the network (and doing
this on all instances), we can check their connectivity:
$ fping instance{1..5}
instance1 is alive
instance2 is alive
instance3 is alive
instance4 is alive
instance5 is alive
$
Removal
Removing unwanted instances is also easy:
$ gnt-instance remove instance5
This will remove the volumes of the instance instance5 (including
mirrors), thus removing all the data of the instance. Continue?
y/[n]/?: y
$
Recovering from hardware failures
Recovering from node failure
We are now left with four instances. Assume that at this point, node3, which has one primary
and one secondary instance, crashes:
$ gnt-node info node3
Node name: node3
  primary ip: 198.51.100.1
  secondary ip: 192.0.2.3
  master candidate: True
  drained: False
  offline: False
  primary for instances:
    - instance4
  secondary for instances:
    - instance1
$ fping node3
node3 is unreachable
At this point, the primary instance of that node (instance4) is down, but the secondary
instance (instance1) is not affected except it has lost disk redundancy:
$ fping instance{1,4}
instance1 is alive
instance4 is unreachable
$
If we try to check the status of instance4 via the instance info command, it fails because it
tries to contact node3 which is down:
$ gnt-instance info instance4
Failure: command execution error:
Error checking node node3: Connection failed (113: No route to host)
$
So we need to mark node3 as being offline, and thus Ganeti won't talk to it anymore:
$ gnt-node modify -O yes -f node3
Mon Oct 26 04:34:12 2009 - WARNING: Not enough master candidates (desired 10, new
Mon Oct 26 04:34:15 2009 - WARNING: Communication failure to node node3: Connectio
Modified node node3
 - offline -> True
 - master_candidate -> auto-demotion due to offline
$
And now we can failover the instance:
$ gnt-instance failover instance4
Failover will happen to image instance4. This requires a shutdown of
the instance. Continue?
y/[n]/?: y
Mon Oct 26 04:35:34 2009 * checking disk consistency between source and target
Failure: command execution error:
Disk disk/0 is degraded on target node, aborting failover.
$ gnt-instance failover --ignore-consistency instance4
Failover will happen to image instance4. This requires a shutdown of
the instance. Continue?
y/[n]/?: y
Mon Oct 26 04:35:47 2009 * checking disk consistency between source and target
Mon Oct 26 04:35:47 2009 * shutting down instance on source node
Mon Oct 26 04:35:47 2009 - WARNING: Could not shutdown instance instance4 on node
Mon Oct 26 04:35:47 2009 * deactivating the instance's disks on source node
Mon Oct 26 04:35:47 2009 - WARNING: Could not shutdown block device disk/0 on node
Mon Oct 26 04:35:47 2009 * activating the instance's disks on target node
Mon Oct 26 04:35:47 2009 - WARNING: Could not prepare block device disk/0 on node
Mon Oct 26 04:35:48 2009 * starting the instance on the target node
$
Note that in our first attempt, Ganeti refused to do the failover since it wasn't sure of the
status of the instance's disks. We pass the --ignore-consistency flag and then we can
fail over:
$ gnt-instance list
Instance  Hypervisor OS          Primary_node Status  Memory
instance1 xen-pvm    debootstrap node2        running   128M
instance2 xen-pvm    debootstrap node1        running   128M
instance3 xen-pvm    debootstrap node1        running   128M
instance4 xen-pvm    debootstrap node1        running   128M
$
But at this point, both instance1 and instance4 are without disk redundancy:
$ gnt-instance info instance1
Instance name: instance1
UUID: 45173e82-d1fa-417c-8758-7d582ab7eef4
Serial number: 2
Creation time: 2009-10-26 04:06:57
Modification time: 2009-10-26 04:07:14
State: configured to be up, actual state is up
  Nodes:
    - primary: node2
    - secondaries: node3
  Operating system: debootstrap
  Allocated network port: None
  Hypervisor: xen-pvm
Disk failures
A disk failure is simpler than a full node failure. First, a single disk failure should not cause
data-loss for any redundant instance; only the performance of some instances might be
reduced due to more network traffic.
Let's take the cluster status in the above listing, and check what volumes are in use:
$ gnt-node volumes -o phys,instance node2
PhysDev   Instance
/dev/sdb1 instance4
/dev/sdb1 instance4
/dev/sdb1 instance1
/dev/sdb1 instance1
/dev/sdb1 instance3
/dev/sdb1 instance3
/dev/sdb1 instance2
/dev/sdb1 instance2
$
You can see that all instances on node2 have logical volumes on /dev/sdb1. Let's simulate a
disk failure on that disk:
$ ssh node2
# on node2
$ echo offline > /sys/block/sdb/device/state
$ vgs
  /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 750153695232: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
  Couldn't find device with uuid '954bJA-mNL0-7ydj-sdpW-nc2C-ZrCi-zFp91c'.
  Couldn't find all physical volumes for volume group xenvg.
  /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
  Couldn't find device with uuid '954bJA-mNL0-7ydj-sdpW-nc2C-ZrCi-zFp91c'.
  Couldn't find all physical volumes for volume group xenvg.
  Volume group xenvg not found
$
At this point, the node is broken and if we are to examine instance2 we get (simplified output
shown):
$ gnt-instance info instance2
Instance name: instance2
State: configured to be up, actual state is up
  Nodes:
    - primary: node1
    - secondaries: node2
  Disks:
    - disk/0: drbd8, size 256M
      on primary: /dev/drbd0 (147:0) in sync, status ok
      on secondary: /dev/drbd1 (147:1) in sync, status *DEGRADED* *MISSING DISK*
This instance has a secondary only on node2. Let's verify a primary instance of node2:
$ gnt-instance info instance1
Instance name: instance1
State: configured to be up, actual state is up
  Nodes:
    - primary: node2
    - secondaries: node1
  Disks:
    - disk/0: drbd8, size 256M
      on primary: /dev/drbd0 (147:0) in sync, status *DEGRADED* *MISSING DISK*
      on secondary: /dev/drbd3 (147:3) in sync, status ok
$ gnt-instance console instance1
Debian GNU/Linux 5.0 instance1 tty1
instance1 login: root
Last login: Tue Oct 27 01:24:09 UTC 2009 on tty1
instance1:~# date > test
instance1:~# sync
instance1:~# cat test
Tue Oct 27 01:25:20 UTC 2009
instance1:~# dmesg|tail
[5439785.235448] NET: Registered protocol family 15
[5439785.235489] 802.1Q VLAN Support v1.8 Ben Greear
[5439785.235495] All bugs added by David S. Miller
[5439785.235517] XENBUS: Device with no driver: device/console/0
[5439785.236576] kjournald starting. Commit interval 5 seconds
[5439785.236588] EXT3-fs: mounted filesystem with ordered data mode.
[5439785.236625] VFS: Mounted root (ext3 filesystem) readonly.
[5439785.236663] Freeing unused kernel memory: 172k freed
[5439787.533779] EXT3 FS on sda1, internal journal
[5440655.065431] eth0: no IPv6 routers present
instance1:~#
As you can see, the instance is running fine and doesn't see any disk issues. It is now time
to fix node2 and re-establish redundancy for the involved instances.
Note: For Ganeti 2.0 we need to manually fix the volume group on node2 by running
vgreduce --removemissing xenvg
$ gnt-node repair-storage node2 lvm-vg xenvg
Mon Oct 26 18:14:03 2009 Repairing storage unit 'xenvg' on node2 ...
$ ssh node2 vgs
VG    #PV #LV #SN Attr   VSize   VFree
xenvg   1   8   0 wz--n- 673.84G 673.84G
$
This has removed the bad disk from the volume group, which is now left with only one PV.
We can now replace the disks for the involved instances:
$ for i in instance{1..4}; do gnt-instance replace-disks -a $i; done
Mon Oct 26 18:15:38 2009 Replacing disk(s) 0 for instance1
Mon Oct 26 18:15:38 2009 STEP 1/6 Check device existence
Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 on node1
Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 on node2
Mon Oct 26 18:15:38 2009 - INFO: Checking volume groups
Mon Oct 26 18:15:38 2009 STEP 2/6 Check peer consistency
Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 consistency on node node1
Mon Oct 26 18:15:39 2009 STEP 3/6 Allocate new storage
Mon Oct 26 18:15:39 2009 - INFO: Adding storage on node2 for disk/0
Mon Oct 26 18:15:39 2009 STEP 4/6 Changing drbd configuration
Mon Oct 26 18:15:39 2009 - INFO: Detaching disk/0 drbd from local storage
Mon Oct 26 18:15:40 2009 - INFO: Renaming the old LVs on the target node
Mon Oct 26 18:15:40 2009 - INFO: Renaming the new LVs on the target node
Mon Oct 26 18:15:40 2009 - INFO: Adding new mirror component on node2
Mon Oct 26 18:15:41 2009 STEP 5/6 Sync devices
Mon Oct 26 18:15:41 2009 - INFO: Waiting for instance instance1 to sync disks.
Mon Oct 26 18:15:41 2009 - INFO: - device disk/0: 12.40% done, 9 estimated seconds
Mon Oct 26 18:15:50 2009 - INFO: Instance instance1's disks are in sync.
Mon Oct 26 18:15:50 2009 STEP 6/6 Removing old storage
Mon Oct 26 18:15:50 2009 - INFO: Remove logical volumes for disk/0
Mon Oct 26 18:15:52 2009 Replacing disk(s) 0 for instance2
Mon Oct 26 18:15:52 2009 STEP 1/6 Check device existence
Mon Oct 26 18:16:01 2009 STEP 6/6 Removing old storage
Mon Oct 26 18:16:01 2009 - INFO: Remove logical volumes for disk/0
Mon Oct 26 18:16:02 2009 Replacing disk(s) 0 for instance3
Mon Oct 26 18:16:02 2009 STEP 1/6 Check device existence
Mon Oct 26 18:16:09 2009 STEP 6/6 Removing old storage
Mon Oct 26 18:16:09 2009 - INFO: Remove logical volumes for disk/0
Mon Oct 26 18:16:10 2009 Replacing disk(s) 0 for instance4
Mon Oct 26 18:16:10 2009 STEP 1/6 Check device existence
Mon Oct 26 18:16:18 2009 STEP 6/6 Removing old storage
Mon Oct 26 18:16:18 2009 - INFO: Remove logical volumes for disk/0
$
At this point, all instances should be healthy again.
Note: Ganeti 2.0 doesn't have the -a option to replace-disks, so for it you have to run the
loop twice, once over primary instances with argument -p and once over secondary instances
with argument -s, but otherwise the operations are similar:
$ gnt-instance replace-disks -p instance1
$ for i in instance{2..4}; do gnt-instance replace-disks -s $i; done
Common cluster problems
There are a number of small issues that might appear on a cluster that can be solved easily
as long as the issue is properly identified. For this exercise we will consider the case of
node3, which was broken previously and re-added to the cluster without reinstallation.
Running cluster verify on the cluster reports:
$ gnt-cluster verify
Mon Oct 26 18:30:08 2009 * Verifying global settings
Mon Oct 26 18:30:08 2009 * Gathering data (3 nodes)
Mon Oct 26 18:30:10 2009 * Verifying node status
Mon Oct 26 18:30:10 2009 - ERROR: node node3: unallocated drbd minor 0 is in use
Mon Oct 26 18:30:10 2009 - ERROR: node node3: unallocated drbd minor 1 is in use
Mon Oct 26 18:30:10 2009 * Verifying instance status
Mon Oct 26 18:30:10 2009 - ERROR: instance instance4: instance should not run on
Mon Oct 26 18:30:10 2009 * Verifying orphan volumes
Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 22459cf8-117d-4bea-a1aa-7916
Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 1aaf4716-e57f-4101-a8d6-03af
Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 1aaf4716-e57f-4101-a8d6-03af
Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 22459cf8-117d-4bea-a1aa-7916
Mon Oct 26 18:30:10 2009 * Verifying remaining instances
Mon Oct 26 18:30:10 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 18:30:10 2009 * Other Notes
Mon Oct 26 18:30:10 2009 * Hooks Results
$
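On a larger cluster the verify output can be long, so it may help to group the ERROR lines by the node or instance they refer to. The following is a minimal sketch and not part of Ganeti: summarize_verify_errors is a hypothetical helper name, and the parsing assumes the "timestamp - ERROR: kind name: message" layout seen in the output above.

```shell
# Hypothetical helper (not a Ganeti command): read "gnt-cluster verify"
# output on standard input and print one "COUNT SUBJECT" line per node or
# instance that has errors.
# Assumption: error lines look like
#   "<timestamp> - ERROR: node node3: <message>"
summarize_verify_errors() {
    awk '
        / - ERROR: / {
            # Drop everything up to and including "ERROR: ", so the line
            # starts with e.g. "node node3:" or "instance instance4:".
            sub(/^.* - ERROR: /, "")
            # The subject is the text before the first remaining colon.
            split($0, part, ":")
            count[part[1]]++
        }
        END { for (subject in count) print count[subject], subject }
    ' | sort -k2
}

# Possible usage on the master node:
#   gnt-cluster verify 2>&1 | summarize_verify_errors
```

For the run shown above this would report six errors for node node3 and one for instance instance4, which matches the three problem classes handled in the following sections.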
Instance status
As you can see, instance4 has a copy running on node3, because we forced the failover
when node3 failed. This case is dangerous as both copies of the instance will have the same
IP and MAC address, wreaking havoc on the network environment and anyone who tries to use it.
Ganeti doesn't directly handle this case. It is recommended to log on to node3 and run:
$ xm destroy instance4
Unallocated DRBD minors
There are still unallocated DRBD minors on node3. Again, these are not handled by Ganeti
directly and need to be cleaned up via DRBD commands:
$ ssh node3
# on node 3
$ drbdsetup /dev/drbd0 down
$ drbdsetup /dev/drbd1 down
$
Orphan volumes
At this point, the only remaining problem should be the so-called orphan volumes. These can
also appear after an aborted disk-replace, or a similar situation where Ganeti was not
able to recover automatically. Here you need to remove them manually via LVM commands:
$ ssh node3
# on node3
$ lvremove xenvg
Do you really want to remove active logical volume "22459cf8-117d-4bea-a1aa-791667d
  Logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_data" successfully rem
Do you really want to remove active logical volume "22459cf8-117d-4bea-a1aa-791667d
  Logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta" successfully rem
Do you really want to remove active logical volume "1aaf4716-e57f-4101-a8d6-03af5da
  Logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_data" successfully rem
Do you really want to remove active logical volume "1aaf4716-e57f-4101-a8d6-03af5da
  Logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_meta" successfully rem
node3#
At this point cluster verify shouldn't complain anymore:
$ gnt-cluster verify
Mon Oct 26 18:37:51 2009 * Verifying global settings
Mon Oct 26 18:37:51 2009 * Gathering data (3 nodes)
Mon Oct 26 18:37:53 2009 * Verifying node status
Mon Oct 26 18:37:53 2009 * Verifying instance status
Mon Oct 26 18:37:53 2009 * Verifying orphan volumes
Mon Oct 26 18:37:53 2009 * Verifying remaining instances
Mon Oct 26 18:37:53 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 18:37:53 2009 * Other Notes
Mon Oct 26 18:37:53 2009 * Hooks Results
$
N+1 errors
Since redundant instances in Ganeti have a primary/secondary model, each node needs to set
aside enough memory so that if one of its peer nodes fails, all the instances that have the
failed node as primary and this node as secondary can be relocated to it. More specifically,
if instance2 has node1 as primary and node2 as secondary (and node1 and node2 do not have any
other instances in this layout), then node2 must have enough free memory so that if
node1 fails, we can fail over instance2 without any other operations (to reduce the
downtime window). Let's increase the memory of the current instances to 4G, and add three
new instances, two on node2:node3 with 8GB of RAM and one on node1:node2 with 12GB
of RAM (numbers chosen so that we run out of memory):
$ gnt-instance modify -B memory=4G instance1
Modified instance instance1
 - be/maxmem -> 4096
 - be/minmem -> 4096
Please don't forget that these parameters take effect only at the next start of the
$ gnt-instance modify
$ gnt-instance add -t drbd -n node2:node3 -s 512m -B memory=8G -o debootstrap insta
$ gnt-instance add -t drbd -n node2:node3 -s 512m -B memory=8G -o debootstrap insta
$ gnt-instance add -t drbd -n node1:node2 -s 512m -B memory=12G -o debootstrap insta
$ gnt-instance reboot --all
The reboot will operate on 7 instances.
Do you want to continue?
Affected instances:
  instance1
  instance2
  instance3
  instance4
  instance5
  instance6
  instance7
y/[n]/?: y
Submitted jobs 677, 678, 679, 680, 681, 682, 683
Waiting for job 677 for instance1...
Waiting for job 678 for instance2...
Waiting for job 679 for instance3...
Waiting for job 680 for instance4...
Waiting for job 681 for instance5...
Waiting for job 682 for instance6...
Waiting for job 683 for instance7...
$
We rebooted the instances for the memory changes to take effect. Now the cluster looks
like:
$ gnt-node list
Node  DTotal DFree MTotal MNode MFree Pinst Sinst
node1   1.3T  1.3T  32.0G  1.0G  6.5G     4     1
node2   1.3T  1.3T  32.0G  1.0G 10.5G     3     4
node3   1.3T  1.3T  32.0G  1.0G 30.5G     0     2
$ gnt-cluster verify
Mon Oct 26 18:59:36 2009 * Verifying global settings
Mon Oct 26 18:59:36 2009 * Gathering data (3 nodes)
Mon Oct 26 18:59:37 2009 * Verifying node status
Mon Oct 26 18:59:37 2009 * Verifying instance status
Mon Oct 26 18:59:37 2009 * Verifying orphan volumes
Mon Oct 26 18:59:37 2009 * Verifying remaining instances
Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 18:59:37 2009 - ERROR: node node2: not enough memory to accommodate in
Mon Oct 26 18:59:37 2009 * Other Notes
Mon Oct 26 18:59:37 2009 * Hooks Results
$
The cluster verify error above shows that if node1 fails, node2 will not have enough memory
to fail over all primary instances on node1 to it. To solve this, you have a number of options:
- try to manually move instances around (but this can become complicated for any
  non-trivial cluster)
- try to reduce the minimum memory of some instances on the source node of the N+1
  failure (in the example above node1): this will allow them to start and be failed
  over/migrated with less than their maximum memory
- try to reduce the runtime/maximum memory of some instances on the destination node
  of the N+1 failure (in the example above node2) to create additional available node
  memory (check the Ganeti administrator's guide for what Ganeti will and won't
  automatically do in regards to instance runtime memory modification)
- if Ganeti has been built with the htools package enabled, you can run the hbal tool,
  which will try to compute an automated cluster solution that complies with the N+1 rule
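The arithmetic behind this N+1 error can be checked by hand. The following is a rough sketch and not how Ganeti itself performs the check: n_plus_one_shortfall is a hypothetical helper name, and the figures are read off the listings above, assuming node1 is primary for three 4G instances and one 12G instance that all have node2 as secondary.

```shell
# Hypothetical helper (not a Ganeti command): report how much memory a
# secondary node lacks to host every instance failing over from one peer.
# Usage: n_plus_one_shortfall FREE_GIB INSTANCE_MEM_GIB...
n_plus_one_shortfall() {
    free_gib=$1
    shift
    # One instance memory size per line; awk sums them up and compares
    # the total with the free memory of the target node.
    printf '%s\n' "$@" | awk -v free="$free_gib" '
        { needed += $1 }
        END {
            shortfall = needed - free
            if (shortfall < 0) shortfall = 0
            printf "needed=%.1fG free=%.1fG shortfall=%.1fG\n", needed, free, shortfall
        }'
}

# node2 reports 10.5G free (MFree); the instances that would fail over from
# node1 are assumed to use 4G, 4G, 4G and 12G.
n_plus_one_shortfall 10.5 4 4 4 12
```

With these assumed figures node2 would need 24G but has only 10.5G free, so any of the options above has to recover roughly 13.5G before the N+1 check can pass.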
Network issues
In case a node has problems with the network (usually the secondary network, as problems
with the primary network will render the node unusable for Ganeti commands), it will show up
in cluster verify as:
$ gnt-cluster verify
Mon Oct 26 19:07:19 2009 * Verifying global settings
Mon Oct 26 19:07:19 2009 * Gathering data (3 nodes)
Mon Oct 26 19:07:23 2009 * Verifying node status
Mon Oct 26 19:07:23 2009 - ERROR: node node1: tcp communication with node 'node3':
Mon Oct 26 19:07:23 2009 - ERROR: node node2: tcp communication with node 'node3':
Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node1':
Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node2':
Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node3':
Mon Oct 26 19:07:23 2009 * Verifying instance status
Mon Oct 26 19:07:23 2009 * Verifying orphan volumes
Mon Oct 26 19:07:23 2009 * Verifying remaining instances
Mon Oct 26 19:07:23 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 19:07:23 2009 * Other Notes
Mon Oct 26 19:07:23 2009 * Hooks Results
$
This shows that both node1 and node2 have problems contacting node3 over the secondary
network, and node3 has problems contacting them. From this output it can be deduced that
since node1 and node2 can communicate between themselves, node3 is the one having
problems, and you need to investigate its network settings/connection.
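The same deduction can be made mechanically by counting how often each node is involved in a failed connection. This is only a sketch under assumptions: tally_tcp_errors is a hypothetical helper, not a Ganeti tool, and it relies on the error line layout shown above, where the word "node" is followed by the node name both for the reporting node and for its peer.

```shell
# Hypothetical helper (not a Ganeti command): read "gnt-cluster verify"
# output on standard input and print "COUNT NODE" lines, most involved
# node first, for all "tcp communication" errors.
tally_tcp_errors() {
    awk '
        /ERROR: node [^:]+: tcp communication with node/ {
            # Every word "node" is followed by a node name: first the
            # reporting node, then the peer it could not reach.
            for (i = 1; i < NF; i++) {
                if ($i == "node") {
                    name = $(i + 1)
                    gsub(/[^A-Za-z0-9._-]/, "", name)  # strip quotes and colons
                    count[name]++
                }
            }
        }
        END { for (n in count) print count[n], n }
    ' | sort -k1,1nr -k2,2
}

# Two of the error lines from the listing above, used as sample input.
tally_tcp_errors <<'EOF'
Mon Oct 26 19:07:23 2009 - ERROR: node node1: tcp communication with node 'node3':
Mon Oct 26 19:07:23 2009 - ERROR: node node2: tcp communication with node 'node3':
EOF
```

The node at the top of the list is the most likely culprit; in the full listing above that is node3, which appears in every error line.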
Migration problems
Since live migration can sometimes fail and leave the instance in an inconsistent state,
Ganeti provides a --cleanup argument to the migrate command that does the following:
- check on which node the instance is actually running (has the command failed before
  or after the actual migration?)
- reconfigure the DRBD disks accordingly
It is always safe to run this command as long as the instance has good data on its primary
node (i.e. not showing as degraded). If so, you can simply run:
$ gnt-instance migrate --cleanup instance1
Instance instance1 will be recovered from a failed migration. Note
that the migration procedure (including cleanup) is **experimental**
in this version. This might impact the instance if anything goes
wrong. Continue?
y/[n]/?: y
Mon Oct 26 19:13:49 2009 Migrating instance instance1
Mon Oct 26 19:13:49 2009 * checking where the instance actually runs (if this hangs,
Mon Oct 26 19:13:49 2009 * instance confirmed to be running on its primary node (no
Mon Oct 26 19:13:49 2009 * switching node node1 to secondary mode
Mon Oct 26 19:13:50 2009 * wait until resync is done
Mon Oct 26 19:13:50 2009 * changing into standalone mode
Mon Oct 26 19:13:50 2009 * changing disks into single-master mode
Mon Oct 26 19:13:50 2009 * wait until resync is done
Mon Oct 26 19:13:51 2009 * done
$
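The precondition stated above (the primary must not show as degraded) can be checked before running the cleanup. The sketch below is an illustration only: both function names are hypothetical, and the check assumes that gnt-instance info prints one "on primary: ..." line per disk and marks problems with the word DEGRADED, as in the listings of the disk failure section.

```shell
# Hypothetical helper: succeed (exit status 0) when "gnt-instance info"
# output read from standard input reports a degraded disk on the primary.
# Assumption: one "on primary: ..." line per disk, as in the listings above.
primary_is_degraded() {
    grep -q 'on primary:.*DEGRADED'
}

# Hypothetical wrapper: only run the migration cleanup for instance $1 when
# its status could be read and its primary disks are not degraded.
safe_migrate_cleanup() {
    info=$(gnt-instance info "$1") || return 1
    if printf '%s\n' "$info" | primary_is_degraded; then
        echo "primary disk of $1 is degraded, not running cleanup" >&2
        return 1
    fi
    gnt-instance migrate --cleanup "$1"
}
```

A call such as safe_migrate_cleanup instance1 would then refuse to proceed when the status cannot be read or the primary is degraded, leaving the decision to the administrator.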
In use disks at instance shutdown
If you see something like the following when trying to shut down or deactivate disks for an
instance:
$ gnt-instance shutdown instance1
Mon Oct 26 19:16:23 2009 - WARNING: Could not shutdown block device disk/0 on node
It most likely means something is holding open the underlying DRBD device. This can be
bad if the instance is not running, as it might mean that there was concurrent access from
both the node and the instance to the disks, but not always (e.g. you could only have had the
partitions activated via kpartx).
To troubleshoot this issue you need to follow standard Linux practices, and pay attention to
the hypervisor being used:
- check if (in the above example) /dev/drbd0 on node2 is being mounted somewhere
  (cat /proc/mounts)
- check if the device is not being used by device mapper itself: run dmsetup ls and look for
  entries of the form drbd0pX, and if there are any, remove them with either kpartx -d or
  dmsetup remove
For Xen, check if it's not using the disks itself: