Int Eng ILT: Cluster-Mode Troubleshooting Exercise Guide
TRANSCRIPT
MODULE 1: KERNEL
Exercise 1: Recovering from a boot loop Time Estimate: 20 minutes
Step Action
1. Log in to the clustershell and execute the following command
cluster1::> cluster show
Node Health Eligibility
--------------------- ------- ------------
cluster1-01 true true
cluster1-02 false true
cluster1-03 true true
cluster1-04 true true
4 entries were displayed.
2. Note that the health of node clusterX-02 is false.
Try to log in to the nodeshell of clusterX-02 to find out the problem.
If you are unable to access the nodeshell of clusterX-02, try to access it through its console.
What do you see?
3. How do you fix this?
MODULE 2: M-HOST
Exercise 1: Fun with mgwd and mroot Time Estimate: 20 minutes
Step Action
1. On a node that does not own epsilon, log in to your cluster as admin via the console and go into the systemshell.
::> set diag
::*> systemshell local
2. Execute the following:
% ps -A|grep mgwd
913 ?? Ss 0:11.76 mgwd -z
2794 p1 DL+ 0:00.00 grep mgwd
The above listing shows that the process ID of the running instance of mgwd on this node is 913.
Kill mgwd as follows:
% sudo kill <pid of mgwd as obtained from above>
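The PID lookup above can be scripted so that the transient `grep mgwd` process is never picked up by mistake. A minimal sketch, simulating the `ps -A` listing shown above (on a live node you would pipe `ps -A` itself instead of the printf):

```shell
# Simulated `ps -A` listing from the step above; on a real node,
# replace the printf with `ps -A` itself.
ps_output='  913  ??  Ss     0:11.76 mgwd -z
 2794  p1  DL+    0:00.00 grep mgwd'

# Keep only lines whose command column is exactly mgwd (this drops the
# `grep mgwd` line), then print the PID column.
pid=$(printf '%s\n' "$ps_output" | awk '$5 == "mgwd" {print $1}')
echo "$pid"   # this is the PID you would pass to: sudo kill "$pid"
```

Filtering on the command field rather than grepping the whole line avoids killing the wrong process when several lines happen to match.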
3. You see the following. Why?
server closed connection unexpectedly: No such file or directory
login:
Log in as admin again, as shown below:
server closed connection unexpectedly: No such file or directory
login: admin
Password:
What happens?
4. You are now in clustershell. Drop to systemshell as follows:
::> set diag
::*> systemshell local
In systemshell execute the following:
% cd /etc
% sudo ./netapp_mroot_unmount
% exit
logout
When would we expect the node to use/need this script?
5. Now you are back in clustershell. Execute the following:
cluster1::> set diag
Warning: These diagnostic commands are for use by NetApp personnel only.
Do you want to continue? {y|n}: y
cluster1::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
cluster1-01 true true true
cluster1-02 true true false
cluster1-03 true true false
cluster1-04 true true false
4 entries were displayed.
cluster1::*> vol modify -vserver studentX -volume studentX_nfs -size 45M
(volume modify)
Error: command failed: Failed to queue job 'Modify studentX_nfs'. IO error in
local job store
cluster1::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
cluster1-01 false true true
cluster1-02 false true false
cluster1-03 false true false
cluster1-04 false true false
4 entries were displayed.
Do we see a difference in cluster show? If so, why? What’s broken?
6. To fix this without rebooting and without manually remounting /mroot, restart mgwd.
7. During which phase of the boot process could we see this behavior occurring?
Exercise 2: Configuration backup and recovery
Time Estimate: 40 minutes
Action
1. Run the following commands:
::> set advanced
::*> man system configuration backup create
::*> man system configuration recovery node
::*> man system configuration recovery cluster
::*> system configuration backup show -node nodename
What do each of the commands show?
2. Where in systemshell can you find the files listed above?
3. Create a new system configuration backup of the node and the cluster as follows:
cluster1::*> system configuration backup create -node cluster1-01 -backup-type
node -backup-name cluster1-01.node
[Job 164] Job is queued: Local backup job.
::*> job private show
::*> job private show -id [Job id given as output of the backup create command above]
::*> job private show -id [id as above] -fields uuid
::*> job store show -id [uuid obtained from the command above]
cluster1::*> system configuration backup create -node cluster1-01 -backup-type
cluster -backup-name cluster1-01.cluster
[Job 495] Job is queued: Cluster Backup OnDemand Job.
::> job show
4. The following KB shows how to scp the backup files you created, as well as one of the system-created backups off to the Linux client:
https://kb.netapp.com/support/index?page=content&id=1012580
Use the following to install p7zip on your Linux client and use it to unzip the backup files.
# yum install p7zip
This is the recommended practice on live nodes; however, for vsims, scp does not work.
So in the current lab setup, drop to the systemshell and cd to /mroot/etc/backups/config.
Unzip the system created backup file by doing the following:
% 7za e [system created backup file name]
What is in this file?
cd into one of the folders created by the unzip. There will be another 7z file. Extract it:
% 7za e [file name]
What’s in this file?
Extract the file:
% 7za e [file name]
What’s inside of it?
Compare it to what is in /mroot/etc of one of the cluster nodes. What are some of the differences?
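Comparing the extracted backup with /mroot/etc by eye is tedious; `diff -rq` can enumerate the differing files for you. A minimal sketch using two throwaway directories as stand-ins for the extracted backup and the node's /mroot/etc (the file names and contents here are made up for illustration):

```shell
# Throwaway stand-ins for the extracted backup and the node's /mroot/etc.
base=$(mktemp -d)
mkdir "$base/backup" "$base/live"
echo 'config v1' > "$base/backup/cluster_config"
echo 'config v2' > "$base/live/cluster_config"
echo 'same'      > "$base/backup/hosts"
echo 'same'      > "$base/live/hosts"

# -r recurses into subdirectories, -q names differing files without dumping them.
result=$(cd "$base" && diff -rq backup live | head -1)
echo "$result"
rm -rf "$base"
```

On the node you would point the two arguments at the extraction directory and /mroot/etc instead of the temporary stand-ins.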
5. cd into “cluster_config” in the backup. What is different from /mroot/etc/cluster_config on the node?
6. cd into “cluster_replicated_records” at the root of the folder you originally extracted the backup to and issue an “ls” command.
What do you see?
7. Unzip the node and cluster backups you created. What do you notice about the contents of these files?
Exercise 3: Moving mroot to a new aggregate
Time Estimate: 30 minutes
Step Action
1. Move a node’s root volume to a new aggregate.
Work with your lab partners and do this on only one node.
For live nodes the following KB contains the steps to do this:
https://kb.netapp.com/support/index?page=content&id=1013350&actp=LIST
However, for vsims, the root volume that is created by default is only 20MB, which is too small to hold the cluster configuration information.
Hence, follow the steps given below:
2. Run the following command to create a new 3-disk aggregate on the desired node:
cluster1::> aggr create -aggregate new_root -diskcount 3 -nodes local
[Job 276] Job succeeded: DONE
cluster1::> aggr show -nodes local
Aggregate Size Available Used% State #Vols Nodes RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
aggr0_cluster1_02_0
900MB 15.45MB 98% online 1 cluster1-02
raid_dp,
normal
student2 900MB 467.4MB 48% online 8 cluster1-02 raid_dp,
normal
2 entries were displayed.
3. Ensure that the node does not own epsilon. If it does, run the following commands to move it to another node in the cluster:
cluster1::> set diag
Warning: These diagnostic commands are for use by NetApp personnel only.
Do you want to continue? {y|n}: y
cluster1::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
cluster1-01 true true false
cluster1-02 true true true
cluster1-03 true true false
cluster1-04 true true false
4 entries were displayed.
Run the following command to set epsilon to 'false' on the owning node:
::*> cluster modify -node cluster1-02 -epsilon false
Then, run the following command to modify it to 'true' on the desired node:
::*> cluster modify -node cluster1-01 -epsilon true
::*> cluster show
Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
cluster1-01 true true true
cluster1-02 true true false
cluster1-03 true true false
cluster1-04 true true false
4 entries were displayed.
4. Run the following command to set the cluster eligibility on the node to 'false':
::*> cluster modify -node cluster1-02 -eligibility false
Note: This command must be run from a node other than the one being marked ineligible.
5. Run the following command to reboot the node into maintenance mode:
cluster1::*> reboot local
(system node reboot)
Warning: Are you sure you want to reboot the node? {y|n}: y
login:
Waiting for PIDS: 718.
Waiting for PIDS: 695.
Terminated
.
Uptime: 2h12m14s
System rebooting...
\
Hit [Enter] to boot immediately, or any other key for command prompt.
Booting...
x86_64/freebsd/image1/kernel data=0x7ded08+0x1376c0 syms=[0x8+0x3b7f0+0x8+0x274a8]
x86_64/freebsd/image1/platform.ko size 0x213b78 at 0xa7a000
NetApp Data ONTAP 8.1.1X34 Cluster-Mode
Copyright (C) 1992-2012 NetApp.
All rights reserved.
md1.uzip: 26368 x 16384 blocks
md2.uzip: 3584 x 16384 blocks
*******************************
* *
* Press Ctrl-C for Boot Menu. *
* *
*******************************
^CBoot Menu will be available.
Generating host.conf.
Please choose one of the following:
(1) Normal Boot.
(2) Boot without /etc/rc.
(3) Change password.
(4) Clean configuration and initialize all disks.
(5) Maintenance mode boot.
(6) Update flash from backup config.
(7) Install new software first.
(8) Reboot node.
Selection (1-8)? 5
….
WARNING: Giving up waiting for mroot
Tue Sep 11 11:23:27 UTC 2012
*> Sep 11 11:23:28 [cluster1-02:kern.syslog.msg:info]: root logged in from SP NONE
*>
6. Run the following command to set the options for the new aggregate to become the new root:
Note: It might be required to set the aggr options to CFO instead of SFO:
*> aggr options new_root root
aggr options: This operation is not allowed on aggregates with sfo HA Policy
*> aggr options new_root ha_policy cfo
Setting ha_policy to cfo will substantially increase the client outage during giveback for cluster volumes on aggregate new_root.
Are you sure you want to proceed? y
*> aggr options new_root root
Aggregate 'new_root' will become root at the next boot.
*>
7. Run the following command to reboot the node: *> halt
Sep 11 11:27:49 [cluster1-02:kern.cli.cmd:debug]: Command line input: the command is 'halt'. The full command line is 'halt'.
.
Uptime: 6m26s
The operating system has halted.
Please press any key to reboot.
System halting...
\
Hit [Enter] to boot immediately, or any other key for command prompt.
Booting in 1 second...
8. Once the node is booted, a new root volume named AUTOROOT will be created. In addition, the node will not be in quorum yet. This is because the new root volume will not be aware of the cluster.
login: admin
Password:
***********************
** SYSTEM MESSAGES **
***********************
A new root volume was detected. This node is not fully operational. Contact
support personnel for the root volume recovery procedures.
cluster1-02::>
9. Increase the size of AUTOROOT on the node by doing the following:
Log in to the systemshell of a node that is in quorum and execute the following D-blade ZAPIs to:
a) Get the UUID of volume AUTOROOT on the node where the root volume was changed
b) Increase the size of that AUTOROOT volume by 500m
c) Check that the size was successfully changed
% zsmcli -H <cluster ip address of the node where new root volume was created> d-volume-list-info-iter-start desired-attrs=name,uuid
<results status="passed">
<next-tag>cookie=0,desired_attrs=name,uuid</next-tag>
</results>
% zsmcli -H <cluster ip address of the node where new root volume was created> d-volume-list-info-iter-next maximum-records=10 tag='cookie=0,desired_attrs=name,uuid'
<results status="passed">
<volume-attrs>
<d-volume-info>
<name>vol0</name>
<uuid>014df353-bbc1-11e1-bb4c-123478563412</uuid>
</d-volume-info>
<d-volume-info>
<name>student2_root</name>
<uuid>044f53fa-e784-11e1-ab6e-123478563412</uuid>
</d-volume-info>
<d-volume-info>
<name>student2_LS_root</name>
<uuid>0ea7ae4c-e790-11e1-ab6e-123478563412</uuid>
</d-volume-info>
<d-volume-info>
<name>AUTOROOT</name>
<uuid>30d8f742-fc04-11e1-bbf5-123478563412</uuid>
</d-volume-info>
<d-volume-info>
<name>student2_cifs</name>
<uuid>b8868843-e788-11e1-ab6e-123478563412</uuid>
</d-volume-info>
<d-volume-info>
<name>student2_cifs_child</name>
<uuid>c07f13ce-e788-11e1-ab6e-123478563412</uuid>
</d-volume-info>
<d-volume-info>
<name>student2_nfs</name>
<uuid>c861f83b-e788-11e1-ab6e-123478563412</uuid>
</d-volume-info>
% zsmcli -H 192.168.71.33 d-volume-set-info desired-attrs=size id=30d8f742-fc04-11e1-bbf5-123478563412 volume-attrs='[d-volume-info=[size=+500m]]'
<results status="passed"/>
% zsmcli -H 192.168.71.33 d-volume-list-info id=30d8f742-fc04-11e1-bbf5-123478563412 desired-attrs=size
<results status="passed">
<volume-attrs>
<d-volume-info>
<size>525m</size>
</d-volume-info>
</volume-attrs>
</results>
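Picking the AUTOROOT UUID out of the iterator output by eye is error-prone; it can be extracted with standard text tools. A sketch using a fragment of the XML shown above (on the node you would capture the zsmcli output into a variable or file instead):

```shell
# Fragment of the d-volume-list-info-iter-next output shown above.
xml='<d-volume-info>
<name>vol0</name>
<uuid>014df353-bbc1-11e1-bb4c-123478563412</uuid>
</d-volume-info>
<d-volume-info>
<name>AUTOROOT</name>
<uuid>30d8f742-fc04-11e1-bbf5-123478563412</uuid>
</d-volume-info>'

# Take the line after the AUTOROOT <name> element, then strip the tags.
uuid=$(printf '%s\n' "$xml" | grep -A1 '<name>AUTOROOT</name>' |
       sed -n 's|.*<uuid>\(.*\)</uuid>.*|\1|p')
echo "$uuid"
```

The extracted value is what you would paste into the `id=` argument of the d-volume-set-info call.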
10. Clear the root recovery flags if required by doing the following:
Log in to the systemshell of the node where the new root volume was created and
check if the bootarg.init.boot_recovery bit is set
% sudo kenv bootarg.init.boot_recovery
If a value is returned (rather than the error "kenv: unable to get bootarg.init.boot_recovery"), clear the bit.
% sudo sysctl kern.bootargs=--bootarg.init.boot_recovery
kern.bootargs: ->
Check that the bit is cleared
% sudo kenv bootarg.init.boot_recovery
kenv: unable to get bootarg.init.boot_recovery
%
11. From a healthy node, with all nodes booted, run the following command:
::*> system configuration recovery cluster rejoin -node <the node where new root volume was created>
Warning: This command will rejoin node "cluster1-02" into the local cluster, potentially overwriting critical cluster
configuration files. This command should only be used to recover from a disaster. Do not perform any other recovery
operations while this operation is in progress. This command will cause node "cluster1-02" to reboot.
Do you want to continue? {y|n}: y
Node "cluster1-02" is rebooting. After it reboots, verify that it joined the new cluster.
12. After the node reboots, check the cluster to ensure that the node is back and eligible:
cluster1::> cluster show
Node Health Eligibility
--------------------- ------- ------------
cluster1-01 true true
cluster1-02 true true
cluster1-03 true true
cluster1-04 true true
4 entries were displayed.
13. If the cluster is still not in quorum, run the following command:
::*> system configuration recovery cluster sync -node <node where new root volume was created>
Warning: This command will synchronize node "cluster1-02" with the cluster configuration, potentially overwriting critical cluster configuration files on the node. This feature should only be used to recover from a disaster. Do not perform any other recovery operations while this operation is in progress. This command will cause all the cluster applications on node "node4" to restart, interrupting administrative CLI and Web interface on that node.
Do you want to continue? {y|n}: y
All cluster applications on node "cluster1-02" will be restarted. Verify that the cluster applications go online.
14. After the node is in quorum, run the following command to add the new root vol to VLDB. This is necessary because it is a 7-Mode volume and will not be displayed until it is added:
cluster1::> set diag
cluster1::*> vol show -vserver cluster1-02
(volume show)
Vserver Volume Aggregate State Type Size Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
cluster1-02
vol0 aggr0_cluster1_02_0
online RW 851.5MB 283.3MB 66%
cluster1::*> vol add-other-volumes -node cluster1-02
(volume add-other-volumes)
cluster1::*> vol show -vserver cluster1-02
(volume show)
Vserver Volume Aggregate State Type Size Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
cluster1-02
AUTOROOT new_root online RW 525MB 379.2MB 27%
cluster1-02
vol0 aggr0_cluster1_02_0
online RW 851.5MB
283.3MB 66%
2 entries were displayed.
15. Run the following command to remove the old root volume from VLDB
cluster1::*> vol remove-other-volume -vserver cluster1-02 -volume vol0
(volume remove-other-volume)
cluster1::*> vol show -vserver cluster1-02
(volume show)
Vserver Volume Aggregate State Type Size Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
cluster1-02
AUTOROOT new_root online RW 525MB 379.2MB 27%
16. Destroy the old root volume by running the following commands from the nodeshell of the node where the new root volume has been created:
cluster1::*> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
cluster1-02> vol status vol0
Volume State Status Options
vol0 online raid_dp, flex nvfail=on
64-bit
Volume UUID: 014df353-bbc1-11e1-bb4c-123478563412
Containing aggregate: 'aggr0_cluster1_02_0'
cluster1-02> vol offline vol0
Volume 'vol0' is now offline.
cluster1-02> vol destroy vol0
Are you sure you want to destroy volume 'vol0'? y
Volume 'vol0' destroyed.
And the old root aggr can be destroyed if desired:
From cluster shell:
cluster1::*> aggr show -node <node where new root vol was created>
Aggregate Size Available Used% State #Vols Nodes RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
aggr0_cluster1_02_0
900MB 899.7MB 0% online 0 cluster1-02 raid_dp,
normal
new_root 900MB 371.9MB 59% online 1 cluster1-02 raid_dp,
normal
student2 900MB 467.2MB 48% online 8 cluster1-02 raid_dp,
normal
3 entries were displayed.
cluster1::*> aggr delete -aggregate <old root aggregate name>
Warning: Are you sure you want to destroy aggregate "aggr0_cluster1_02_0"?
{y|n}: y
[Job 277] Job succeeded: DONE
17. Use the following KB to rename the root volume (AUTOROOT) to vol0: https://kb.netapp.com/support/index?page=content&id=2015985
18. What sort of things regarding the root vol did you observe during this?
Exercise 4: Locate and Repair Aggregate Issues
Time Estimate: 15 minutes
Action
1. Log in to the clustershell of clusterX and execute the following (team member 1 uses X=1 and team member 2 uses X=2):
::> aggr show -aggregate VLDBX
There are no entries matching your query.
One aggregate is showing as missing from the clustershell. Execute the following:
::> aggr show -aggregate WAFLX -instance
Aggregate: WAFLX
Size: -
Used Size: -
Used Percentage: -
Available Size: -
State: unknown
Nodes: cluster1-02
Another aggregate is showing as "unknown". Fix the issue.
2. Issue the following command. Do you see anything wrong?
::*> debug vreport show aggregate
3. What nodes do the aggregates belong to? How do you know?
4. Use the “debug vreport fix” command to resolve the problem.
5. List some of the reasons why customers could have this problem.
6. Was any data lost? If so, which aggregate?
Exercise 5: Replication failures
Time Estimate: 20 minutes
Action
1. Note: Participants working with cluster2 should replace student1 with student3 and student2 with student4 in all the steps of this exercise.
Log in to the systemshell of clusterX-02 (make sure it does not own epsilon).
Unmount mroot and clus, and prevent mgwd from being monitored by spmctl, as follows:
% sudo umount -f /mroot
% sudo umount -f /clus
% spmctl -d -h mgwd
2. Login to ngsh on clusterX-02 and execute the following:
cluster1::*> volume create -vserver student1 -volume test -aggregate
Info: Node cluster1-01 that hosts aggregate aggr0 is offline
Node cluster1-03 that hosts aggregate aggr0_cluster1_03_0 is offline
Node cluster1-04 that hosts aggregate aggr0_cluster1_04_0 is offline
Node cluster1-01 that hosts aggregate student1 is offline
aggr0 aggr0_cluster1_03_0 aggr0_cluster1_04_0
new_root student1 student2
cluster1::*> volume create -vserver student1 -volume test -aggregate student2
Error: command failed: Replication service is offline
cluster1::*> net int create -vserver student1 -lif test -role data -home-node cluster1-02 -home-port e0c -address
10.10.10.10 -netmask 255.255.255.0 -status-admin up
(network interface create)
Info: An error occurred while creating the interface, but a new routing group
d10.10.10.0/24 was created and left in place
Error: command failed: Local unit offline
cluster1::*> vserver create -vserver test -rootvolume test -aggregate student1 -ns-switch file -rootvolume-security-style unix
Info: Node cluster1-01 that hosts aggregate student1 is offline
Error: create_imp: create txn failed
command failed: Local unit offline
3. Login to ngsh on clusterX-01 and execute the following:
cluster1::> volume create test -vserver student2 -aggregate
Info: Node cluster1-02 that hosts aggregate new_root is offline
Node cluster1-02 that hosts aggregate student2 is offline
aggr0 aggr0_cluster1_03_0 aggr0_cluster1_04_0
new_root student1 student2
cluster1::> volume create test -vserver student2 -aggregate student2 -size 20MB
Info: Node cluster1-02 that hosts aggregate student2 is offline
Error: command failed: Failed to create the volume because cannot determine the
state of aggregate student2.
cluster1::> volume create test -vserver student2 -aggregate student1 -size 20MB
[Job 368] Job succeeded: Successful
Note: when a volume is created on an aggregate not hosted on clusterX-02, the volume create succeeds.
cluster1::> net int create -vserver student1 -lif data2 -role data -data-protocol nfs,cifs,fcache -home-node cluster1-02 -home-port e0c -address 10.10.10.10 -netmask 255.255.255.0
(network interface create)
Info: create_imp: Failed to create virtual interface
Error: command failed: Routing group d10.10.10.0/24 not found
cluster1::> net int create -vserver student1 -lif data2 -role data -data-protocol nfs,cifs,fcache -home-node cluster1-01 -home-port e0c -address 10.10.10.10 -netmask 255.255.255.0
(network interface create)
Note: when an interface is created on a port not hosted on clusterX-02, the interface create succeeds.
cluster1::*> vserver create -vserver test -rootvolume test -aggregate student2 -ns-switch file -rootvolume-security-style unix
Info: Node cluster1-02 that hosts aggregate student2 is offline
Error: create_imp: create txn failed
command failed: Local unit offline
cluster1::*> vserver create -vserver test -rootvolume test -aggregate student1 -ns-switch file -rootvolume-security-style unix
[Job 435] Job succeeded: Successful
Note: when a vserver is created and its root volume is created on an aggregate that is not hosted on clusterX-02, the vserver create succeeds.
4. Log in to systemshell of clusterX-02.
Execute the following:
cluster1-02% mount
/dev/md0 on / (ufs, local, read-only)
devfs on /dev (devfs, local)
/dev/ad0s2 on /cfcard (msdosfs, local)
/dev/md1.uzip on / (ufs, local, read-only, union)
/dev/md2.uzip on /platform (ufs, local, read-only)
/dev/ad3 on /sim (ufs, local, noclusterr, noclusterw)
/dev/ad1s1 on /var (ufs, local, synchronous)
procfs on /proc (procfs, local)
/dev/md3 on /tmp (ufs, local, soft-updates)
/mroot/etc/cluster_config/vserver on /mroot/vserver_fs (vserverfs, union)
Note that /mroot and /clus are not mounted
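Rather than scanning the listing by eye, the missing mount can be checked mechanically by testing the mount-point field. A sketch against a trimmed copy of the output above (on the node you would pipe `mount` itself):

```shell
# Trimmed mount listing from above; on the node, pipe `mount` instead.
mounts='/dev/md0 on / (ufs, local, read-only)
/dev/ad1s1 on /var (ufs, local, synchronous)
/mroot/etc/cluster_config/vserver on /mroot/vserver_fs (vserverfs, union)'

# In mount output the mount point is field 3 ("<device> on <dir> ...").
# Exact matching avoids a false hit from /mroot/vserver_fs.
count=$(printf '%s\n' "$mounts" | awk '$3 == "/mroot" {n++} END {print n+0}')
echo "$count"   # 0 here, confirming /mroot is absent from the listing
```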
5. From the systemshell of clusterX-02, run the following commands:
% rdb_dump
What do you see?
% tail -100 /mroot/etc/mlog/mgwd.log | more
What do you see?
Log in to the systemshell of cluster-01 and run the following command:
% tail -100 /mroot/etc/mlog/mgwd.log | more
What do you see?
6. From the systemshell of clusterX-02, run:
% spmctl
What do you see?
7. What happened?
8. Fix these issues:
a) Re-add mgwd to spmctl with:
% ps aux | grep mgwd
root 779 0.0 17.6 303448 133136 ?? Ss 1:53PM 0:44.12 mgwd -z
diag 3619 0.0 0.2 12016 1204 p2 S+ 4:39PM 0:00.00 grep mgwd
% spmctl -a -h mgwd -p 779
b) Then restart mgwd, which will remount /mroot and /clus:
% sudo kill <PID>
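The PID needed in step a) can be pulled out of the `ps aux` listing mechanically; in `ps aux` format the PID is the second column and the command starts at column eleven. A sketch that rebuilds the spmctl command from the sample output above (on the node you would use `ps aux` itself):

```shell
# Sample `ps aux | grep mgwd` output from step a); on the node, use ps itself.
ps_output='root   779  0.0 17.6 303448 133136  ??  Ss    1:53PM   0:44.12 mgwd -z
diag  3619  0.0  0.2  12016   1204  p2  S+    4:39PM   0:00.00 grep mgwd'

# Keep only rows whose command column is mgwd (drops the grep itself),
# then print the PID column and assemble the re-add command.
pid=$(printf '%s\n' "$ps_output" | awk '$11 == "mgwd" {print $2}')
cmd="spmctl -a -h mgwd -p $pid"
echo "$cmd"
```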
Exercise 6: Troubleshooting Autosupport
Time Estimate: 20 minutes
Action
1. From the clustershell of each node, send a test autosupport as follows (y takes the values 1, 2, 3, 4):
::*> system autosupport invoke -node clusterX-0y -type test
You will see an error such as:
Error: command failed: RPC: Remote system error - Connection refused
2. Let's find out why. "Connection refused" means that we couldn't talk to the application for some reason. In this case, notifyd is the application. When we look in the systemshell for the process, it's not there:
cluster1-01% ps aux | grep notifyd
diag 5442 0.0 0.2 12016 1160 p0 S+ 9:20PM 0:00.00 grep notifyd
3. spmctl manages notifyd.
We can check to see why spmctl didn't start notifyd back up:
cluster-1-01% cat spmd.log | grep -i notify
0000002e.00001228 0002ba73 Tue Aug 09 2011 21:26:31 +00:00 [kern_spmd:info:739] 0x800702d30: INFO: spmd::ProcessController: sendShutdownSignal:process_controller.cc:186 sending SIGTERM to 5498:
0000002e.00001229 0002ba73 Tue Aug 09 2011 21:26:31 +00:00 [kern_spmd:info:739] 0x8007023d0: INFO: spmd::ProcessWatcher: _run:process_watcher.cc:152 kevent returned: 1
0000002e.0000122a 0002ba73 Tue Aug 09 2011 21:26:31 +00:00 [kern_spmd:info:739] 0x8007023d0: INFO: spmd::ProcessControlManager: dumpExitConditions:process_control_manager.cc:732 process (notifyd:5498) exited on signal 15
0000002e.0000122b 0002ba7d Tue Aug 09 2011 21:26:32 +00:00 [kern_spmd:info:739] 0x8007023d0: INFO: spmd::ProcessWatcher: _run:process_watcher.cc:148 wait for incoming events.
And then we check spmctl to see if it's still monitoring notifyd:
cluster-1-01% spmctl | grep notify
In this case, it looks like notifyd got removed from spmctl and we need to re-add it:
cluster-1-01% spmctl -e -h notifyd
cluster-1-01% spmctl | grep notify
Exec=/sbin/notifyd -n;Handle=56548532-c334-4633-8cd8-77ef97682d3d;Pid=15678;State=Running
cluster-1-01% ps aux | grep notify
root 15678 0.0 6.7 112244 50568 ?? Ss 4:06PM 0:02.42 /sbin/notifyd –
diag 15792 0.0 0.2 12016 1144 p2 S+ 4:06PM 0:00.00 grep notify
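The spmctl status line above packs its fields into one `key=value;` string. A sketch of pulling out a single field with standard tools, using the sample line from step 3:

```shell
# Sample `spmctl | grep notify` output from above.
line='Exec=/sbin/notifyd -n;Handle=56548532-c334-4633-8cd8-77ef97682d3d;Pid=15678;State=Running'

# Split the record on ';' and print the value of the State key.
state=$(printf '%s\n' "$line" | tr ';' '\n' | sed -n 's/^State=//p')
echo "$state"
```

The same pattern extracts the Pid or Handle fields by changing the sed key.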
4. Try to send a test autosupport.
::*> system autosupport invoke -node clusterX-0y -type test
What happens?
MODULE 3: SCON
Exercise 1: Vifmgr and MGWD interaction
Time Estimate: 30 minutes
Step Action
1. Try to create an interface:
clusterX::*> net int create -vserver studentY -lif test -role data -data-protocol nfs,cifs,fcache -home-node clusterX-02 -home-port
You see the following error:
Warning: Unable to list entries for vifmgr on node clusterX-02. RPC: Remote
system error - Connection refused
{<netport>|<ifgrp>} Home Port
2. Ping the interfaces of clusterX-02, the node whose ports seem inaccessible:
clusterX::*> cluster ping-cluster -node clusterX-02
What do you see?
3. Perform data access:
Attempt cifs access to \\student2\student2(cluster1) or \\student4\student4(cluster2) from the windows machine
What happens?
4. Execute the following:
clusterX::*> net int show
What do you see?
5. Run net port show:
clusterX::*> net port show
What do you see?
6. Check the system logs:
clusterX::*> debug log files modify -incl-files vifmgr,mgwd
clusterX::*> debug log show -node clusterX-02 -timestamp Mon Oct 10*
What do you see?
7. Log in to the systemshell on clusterX-02 and run ps to see if vifmgr is running:
clusterX-02% ps -A | grep vifmgr
8. Run rdb_dump from the systemshell of clusterX-02:
clusterX-02% rdb_dump
What do you see?
9. Run the following from the systemshell of clusterX-02:
clusterX-02% spmctl | grep vifmgr
What do you see?
10. In the clustershell, execute cluster ring show:
clusterX::*> cluster ring show
11. What is the Issue? How do you fix it?
Exercise 2: Duplicate lif IDs
Time Estimate: 30 minutes
Step Action
1. From the clustershell, create a new network interface as follows (Y ∈ {1,2,3,4}):
clusterX::*> net int create -vserver studentY -lif data1 -role data -data-protocol nfs,cifs,fcache -home-node clusterX-0Y -home-port e0c -address 192.168.81.21Y -netmask 255.255.255.0 -status-admin up
(network interface create)
Info: create_imp: Failed to create virtual interface
Error: command failed: Duplicate lif id
2. Execute the following:
clusterX::*> net int show
What do you see?
3. View the mgwd log file on the node where you are issuing the net int create command and determine the lif ID that is being reported as duplicate.
4. Execute the following:
clusterX::*> debug smdb table vifmgr_virtual_interface show -node clusterX-0* -lif-id [lifid/vifid determined from step 3]
What do you see?
5. Execute the following:
clusterX::*> debug smdb table vifmgr_virtual_interface delete -node clusterX-0Y -lif-id <the duplicate id>
clusterX::*> debug smdb table vifmgr_virtual_interface show -node clusterX-0Y -lif-id <the duplicate id>
There are no entries matching your query.
6. Create a new lif:
clusterX::*> net int create -vserver studentY -lif testY -role data -data-protocol
nfs,cifs,fcache -home-node clusterX-0Y -home-port e0c -address 192.168.81.21Y -netmask 255.255.255.0 -status-admin up
(network interface create)
MODULE 4: NFS
Exercise 1: Mount issues Time Estimate: 20 minutes
Step Action
1. From the Linux Host execute the following:
# mkdir /cmodeY
# mount studentY:/studentY_nfs /cmodeY
You see the following:
mount: mount to NFS server 'studentY' failed: RPC Error: Program not registered.
2. Find out the node being mounted:
From the Linux Host execute the following to find the IP address being accessed:
#ping studentY
PING studentY (192.168.81.115) 56(84) bytes of data.
64 bytes from studentY (192.168.81.115): icmp_seq=1 ttl=255 time=1.09 ms
From the clustershell use the following to find out the current node and port on which the above IP address is hosted
clusterX::*> net int show -vserver studentY -address 192.168.81.115 -fields curr-node,curr-port
(network interface show)
vserver lif curr-node curr-port
-------- -------------- ----------- ---------
studentY studentY_data1 clusterX-01 e0d
3. Execute the following to start a packet trace from the nodeshell of the node that was being mounted and attempt the mount once more
clusterX::*> run -node clusterX-01
Type 'exit' or 'Ctrl-D' to return to the CLI
clusterX-01> pktt start e0d
e0d: started packet trace
From the Linux Host attempt the mount once more as shown below:
# mount student1:/student1_nfs /cmode1
Back in the nodeshell of the node that was mounted dump and stop the packet trace
clusterX-01> pktt dump e0d
clusterX-01> pktt stop e0d
e0d: Tracing stopped and packet trace buffers released.
From the systemshell of the node where the packet trace was captured view the packet trace using tcpdump
clusterX-01> exit
logout
clusterX::*> systemshell -node clusterX-01
clusterX-01% cd /mroot
clusterX-01% ls
e0d_20120925_131928.trc home vserver_fs
etc trend
clusterX-01% tcpdump -r e0d_20120925_131928.trc
What do you see? Why?
4. How do you fix the issue?
5. After fixing the issue check that the mount is successful.
Note: If the mount succeeds, please unmount. This step is very important; otherwise the rest of the exercises will be impacted.
Exercise 2: Mount and access issues
Time Estimate: 30 minutes
Step Action
1. From the Linux Host attempt to mount volume studentX_nfs.
# mount studentX:/studentX_nfs /cmode
mount: studentX:/studentX_nfs failed, reason given by server: Permission denied
2. From clustershell execute the following to find the export policy associated with the volume studentX_nfs:
cluster1::*> vol show -vserver studentX -volume studentX_nfs -instance
Next, use "export-policy rule show" to find the properties of the export policy associated with the volume studentX_nfs.
Why did you get an access denied error?
How will you fix the issue?
3. Now once again attempt to mount studentX_nfs from the Linux Host
# mount studentX:/studentX_nfs /cmode
mount: studentX:/studentX_nfs failed, reason given by server: No such file or directory
What issue is occurring here?
4. Now once again attempt to mount studentX_nfs from the Linux Host
# mount studentX:/studentX_nfs /cmode
Is the mount successful?
If yes, cd into the mount point
#cd /cmode
-bash: cd: /cmode: Permission denied
How do you resolve this?
Note: Depending on how you resolved the issue with the export-policy in step 2, you may not see any error here. In that case, move on to step 5.
If you unmount and remount, does it still work?
5.
Try to write a file into the mount
[root@nfshost cmode]# touch f1
What does ls -la show?
[root@nfshost cmode]# ls -la
total 16
drwx------ 2 admin admin 4096 Sep 25 08:06 .
drwxr-xr-x 26 root root 4096 Sep 25 06:03 ..
-rw-r--r-- 1 admin admin 0 Sep 25 08:06 f1
drwxrwxrwx 12 root root 4096 Sep 25 08:05 .snapshot
What do you see the file permissions as?
Why are the permissions and owner set the way they are?
6. From clustershell Execute:
clusterX::> export-policy rule modify -vserver studentY -policyname studentY -ruleindex 1 -rorule any -rwrule any
(vserver export-policy rule modify)
Exercise 3: Stale file handle
Time Estimate: 30 minutes
Step Action
1. From the Linux Host execute:
# cd /nfsX
-bash: cd: /nfsX: Stale NFS file handle
2. Unmount the volume from the client and try to re-mount. What happens?
3. From the Linux Host:
# ping studentX
PING studentX (192.168.81.115) 56(84) bytes of data.
The IP above (192.168.81.115) is the IP of the vserver being mounted.
Find the node in the cluster that is currently hosting this IP
From your clustershell
::*> net int show -address 192.168.81.115 -fields curr-node
(network interface show)
vserver lif curr-node
-------- -------------- -----------
studentX studentX_data1 clusterY-0X
The node shown in the curr-node column above is the node that is currently hosting the IP.
Log in to the systemshell of this node and view the vldb logs
cluster1::*> systemshell -node clusterY-0X
cluster1-01% tail /mroot/etc/mlog/vldb.log
What do you see?
4. Look for volumes with the MSID in the error shown in the vldb log as follows:
From clustershell, execute the following to find the aggregate where the volume being mounted (nfs_studentX) lives and the node on which that aggregate lives:
cluster1::*> vol show -vserver studentX -volume nfs_studentX -fields aggregate (volume show)
vserver volume aggregate
-------- ------------ ---------
studentX nfs_studentX studentX
cluster1::*> aggr show -aggregate studentX -fields nodes
aggregate nodes
--------- -----------
studentx clusterY-0X
Go to the nodeshell of the node shown above (the one that hosts the volume and its aggregate), use the showfh command, and convert the MSID from hex.
::> run -node clusterY-0X
>priv set diag
*>showfh /vol/nfs_studentX
flags=0x00 snapid=0 fileid=0x000040 gen=0x5849a79f fsid=0x16cd2501 dsid=0x0000000000041e msid=0x00000080000420
0x00000080000420 converted to decimal is 2147484704
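The hex-to-decimal conversion can be done right on the Linux host or in systemshell; shell printf accepts the 0x-prefixed MSID from showfh directly:

```shell
# Convert the hex MSID reported by showfh into the decimal form that
# debug vreport and the VLDB logs use.
printf '%d\n' 0x00000080000420
# 2147484704
```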
Exit from the nodeshell back to clustershell and execute "debug vreport show" in diag mode:
cluster1-01*> exit
logout
cluster1::*> debug vreport show
What do you see?
5. What is the issue here?
6. How would you fix this?
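One way to picture the failure: the file handle an NFS client caches embeds the volume's MSID, so if the MSID recorded in the VLDB drifts out of sync with the one stamped in WAFL (which is what debug vreport surfaces), the server can no longer resolve the cached handle and returns ESTALE. The sketch below is a simplification with a hypothetical out-of-sync value, not ONTAP internals:

```python
# MSID embedded in the client's cached file handle (decimal form of the
# showfh output above).
client_handle_msid = 2147484704
# Hypothetical VLDB entry that has drifted out of sync with WAFL:
vldb_msid = 2147484705
handle_resolves = (client_handle_msid == vldb_msid)
print(handle_resolves)  # False -> client sees "Stale NFS file handle"
```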
MODULE 5: CIFS
Instructions to Students: As mentioned in the lab handout the valid windows users in the domain Learn.NetApp.local are:
a) Administrator b) Student1 c) Student2
Exercise 1: Using diag secd
Time Estimate: 20 minutes
Step Action
1. Find the node where the IP(s) for vserver studentX is hosted
From the RDP machine do the following to start a command window
Start->Run->cmd
In the command window type
ping studentX
From the clustershell find the node on which the IP is hosted (Refer to NFS Exercise 3)
Login to the console of that node and execute the steps of this exercise
2. Type the following:
::> diag secd
What do you see and why?
3. Note: for all the steps of this exercise, clusterY-0X refers to the name of the local node.
Type the following to verify the name mapping of Windows user student1.
::diag secd*> name-mapping show -node local -vserver studentX -direction win-unix -name student1
4. From the RDP machine do the following to access a CIFS share
Start -> Run -> \\studentX
Type the following to query for the Windows SID of your windows user name
cluster1::diag secd*> authentication show-creds -node local -vserver studentX -win-name <username that you have used to RDP to the windows machine>
DC Return Code: 0
Windows User: Administrator Domain: LEARN Privs: a7
Primary Grp: S-1-5-21-3281022357-2736815186-1577070138-513
Domain: S-1-5-21-3281022357-2736815186-1577070138 Rids: 500, 572, 519, 518, 512, 520, 513
Domain: S-1-5-32 Rids: 545, 544
Domain: S-1-1 Rids: 0
Domain: S-1-5 Rids: 11, 2
Unix ID: 65534, GID: 65534
Flags: 1
Domain ID: 0
Other GIDs:
cluster1::diag secd*> authentication translate -node local -vserver student1 -win-name <username that you have used to RDP to the windows machine>
S-1-5-21-3281022357-2736815186-1577070138-500
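The credential output is easier to read once you know a SID's shape: a domain prefix followed by a trailing relative identifier (RID); 500 is the well-known RID of the built-in Administrator account and 513 that of Domain Users. A small illustrative helper (not a NetApp tool) to split the two:

```python
def split_sid(sid: str):
    """Split a Windows SID string into its domain prefix and trailing RID."""
    domain, _, rid = sid.rpartition("-")
    return domain, int(rid)

domain, rid = split_sid("S-1-5-21-3281022357-2736815186-1577070138-500")
print(domain)  # S-1-5-21-3281022357-2736815186-1577070138
print(rid)     # 500 -> well-known RID of the built-in Administrator account
```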
5. Type the following to test a Windows login for your Windows user name in diag secd
cluster1::diag secd*> authentication login-cifs -node local -vserver studentX -user <username that you have used to RDP to the windows machine>
Enter the password: <your windows password, i.e., Netapp123>
Windows User: Administrator Domain: LEARN Privs: a7
Primary Grp: S-1-5-21-3281022357-2736815186-1577070138-513
Domain: S-1-5-21-3281022357-2736815186-1577070138 Rids: 500, 513, 520, 512, 518, 519, 572
Domain: S-1-1 Rids: 0
Domain: S-1-5 Rids: 11, 2
Domain: S-1-5-32 Rids: 544
Unix ID: 65534, GID: 65534
Flags: 1
Domain ID: 0
Other GIDs:
Authentication Succeeded.
6. Type the following to view active CIFS connections in secd
cluster1::diag secd*> connections show -node clusterY-0X -vserver studentX
[ Cache: NetLogon/learn.netapp.local ]
Queue> Waiting: 0, Max Waiting: 1, Wait Timeouts: 0, Avg Wait: 0.00ms
Performance> Hits: 0, Misses: 1, Failures: 0, Avg Retrieval: 24505.00ms
(No connections active or currently cached)
[ Cache: LSA/learn.netapp.local ]
Queue> Waiting: 0, Max Waiting: 1, Wait Timeouts: 0, Avg Wait: 0.00ms
Performance> Hits: 1, Misses: 4, Failures: 0, Avg Retrieval: 6795.40ms
(No connections active or currently cached)
[ Cache: LDAP (Active Directory)/learn.netapp.local ]
Queue> Waiting: 0, Max Waiting: 1, Wait Timeouts: 0, Avg Wait: 0.00ms
Performance> Hits: 1, Misses: 3, Failures: 1, Avg Retrieval: 2832.75ms
(No connections active or currently cached)
Type the following to clear active CIFS connections in secd
cluster1::diag secd*> connection clear -node clusterY-0X -vserver studentX
Test connections on vserver student1 marked for removal.
NetLogon connections on vserver student1 marked for removal.
LSA connections on vserver student1 marked for removal.
LDAP (Active Directory) connections on vserver student1 marked for removal.
LDAP (NIS & Name Mapping) connections on vserver student1 marked for removal.
NIS connections on vserver student1 marked for removal.
7. Type the following to view the server discovery information
cluster1::diag secd*> server-discovery show-host -node clusterY-0X
Host Name: win2k8-01
Cifs Domain:
AD Domain:
IP Address: 192.168.81.10
Host Name: win2k8-01
Cifs Domain:
AD Domain:
IP Address: 192.168.81.253
Type the following to achieve the same result as ONTAP 7G’s “cifs resetdc”
cluster1::diag secd*> server-discovery reset -node clusterY-0X -vserver studentX
Discovery Reset succeeded for Vserver: student1
To verify type the following:
cluster1::diag secd*> server-discovery show-host -node clusterY-0X
Discovery Reset succeeded for Vserver: studentX
Type the following to achieve the same result as ONTAP 7G's "cifs testdc":
cluster1::diag secd*> server-discovery test -node clusterY-0X -vserver studentX
Discovery Global succeeded for Vserver: studentX
8. Type the following to view current logging level in secd
cluster1::diag secd*> log show -node clusterY-0X
Log Options
----------------------------------
Log level: Debug
Function enter/exit logging: OFF
Type the following to set and view the current logging level in secd
cluster1::diag secd*> log set -node clusterY-0X -level err
Setting log level to "Error"
cluster1::diag secd*> log show -node clusterY-0X
Log Options
----------------------------------
Log level: Error
Function enter/exit logging: OFF
9. Type the following to enable tracing in secd to capture the logging level specified
cluster1::diag secd*> trace show -node local
Trace Spec
---------------------------------------
Trace spec has not been set.
cluster1::diag secd*> trace set -node cluster1-01 -trace-all yes
Trace spec set successfully for trace-all.
cluster1::diag secd*> trace show -node cluster1-01
Trace Spec
---------------------------------------
TraceAll: Tracing all RPCs
10. Type the following to check the secd configuration for comparison with the ngsh settings:
cluster1::diag secd*> config query -node local -source-name
cifs-server kerberos-realm machine-account
nis-domain vserver vserverid-to-name
unix-group-membership local-unix-user local-unix-group
kerberos-keyblock ldap-config ldap-client-config
ldap-client-schema name-mapping nfs-kerberos
cifs-server-options cifs-server-security dns
cifs-preferred-dc virtual-interface routing-group-routes
secd-cache-config
cluster1::diag secd*> configuration query -node local -source-name machine-account
vserver: 5
cur_pwd: 0100962681ce82e2d6da20df35ce86964fea2c495d9609d395a5199431d3d4531144f845fcfd675e15143fe76932ced271ddcf57c9d8fe59a63b0bc68f717077fc88ca28aa0fdbba4b8d8509bb25ebe2
new_pwd:
installdate: 1345202770
sid: S-1-5-21-3281022357-2736815186-1577070138-1609
vserver: 6
cur_pwd: 01433517c8acbbf66c2e287b4bee56f5d8b707cfb69710737bfb20616ebe61fc31163acde2b5a827f3c2d395b89fef15f28a8f514c147906580cbaa30b4a1361444f76036d2c590222ce1a0feaa56779
new_pwd:
installdate: 1345202787
sid: S-1-5-21-3281022357-2736815186-1577070138-1610
11. Type the following to clear the cache(s) one at a time
cluster1::diag secd*> cache clear -node clusterY-0X -vserver studentX -cache-name
ad-to-netbios-domain netbios-to-ad-domain ems-delivery
ldap-groupid-to-name ldap-groupname-to-id ldap-userid-to-creds
ldap-username-to-creds log-duplicate name-to-sid
sid-to-name nis-groupid-to-name nis-groupname-to-id
nis-userid-to-creds nis-username-to-creds nis-group-membership
netgroup schannel-key lif-bad-route-to-target
cluster1::diag secd*> cache clear -node clusterY-0X -vserver studentX -cache-name ad-to-netbios-domain
Type the following to clear all caches together
cluster1::diag secd*> restart -node clusterY-0X
You are attempting to restart a process in charge of security services. Do not
restart this process unless the system has generated a "secd.config.updateFail"
event or you have been instructed to restart this process by support personnel.
This command can take up to 2 minutes to complete.
Are you sure you want to proceed? {y|n}: y
Restart successful! Security services are operating correctly.
12. From the RDP machine, close the CIFS share \\studentX opened in Windows Explorer.
Exercise 2: Authentication issues Time Estimate: 30 minutes
Step Action
1. From the RDP machine access the cifs share \\studentX
Start->Run->\\studentX
What error message do you see?
2. Refer to step 1 of exercise 1:
Find the node where the IP(s) for vserver studentX is hosted.
Log in to the console of that node and execute the steps of this exercise.
From the clustershell of that node, run the following commands:
::> set diag
::*> diag secd authentication translate -node local -vserver studentX -win-name <your windows username>
::*> diag secd authentication sid-to-uid -node local -vserver studentX -sid <sid from previous command>
::*> diag secd authentication show-creds -node local -vserver studentX -win-name <username>
Does the user seem to be functioning properly? If not, what error do you get?
3. Run the following command:
::> event log show
What message do you see?
4. Run the following command:
::> diag secd name-mapping show -node local -vserver student1 -direction win-unix -name <your windows username>
::> vserver name-mapping show -vserver studentX -direction win-unix -position *
::> cifs options show -vserver studentX
5. Which log in systemshell can we look at to see errors for this problem?
6. What issues did you find?
7. cluster1::*> unix-user create -vserver studentX -user pcuser -id 65534 -primary-gid 65534
(vserver services unix-user create)
cluster1::*> cifs option modify -vserver studentX -default-unix-user pcuser
8. The Windows Explorer window which opens when you navigate to Start->Run->\\studentX shows two shares:
a) studentX
b) studentX_child
Try to access the shares
What happens?
Do the following:
Enable debug logging for secd on the node that owns your data LIFs
cluster1::*> diag secd log set -node local -level debug
Setting log level to "Debug"
cluster1::*> trace set -node local -trace-all yes
(diag secd trace set)
Trace spec set successfully for trace-all.
Close the CIFS session on the Windows host, run "net use * /delete" from cmd to clear cached sessions, and retry the connection
Enter systemshell and cd to /mroot/etc/mlog
Type "tail -f secd.log"
What do you see?
9. Given the results of the previous tests, what could the issue be here?
10. From ngsh (clustershell) run:
cluster1::> vserver show -vserver studentX -fields rootvolume
vserver rootvolume
-------- -------------
studentX studentX_root
The rootvolume field shows the root volume of the vserver you are accessing.
cluster1::> vserver cifs share show -vserver studentX -share-name studentX
Vserver: studentX
Share: studentX
CIFS Server NetBIOS Name: STUDENTX
Path: /studentX_cifs
Share Properties: oplocks
browsable
changenotify
Symlink Properties: -
File Mode Creation Mask: -
Directory Mode Creation Mask: -
Share Comment: -
Share ACL: Everyone / Full Control
File Attribute Cache Lifetime: -
cluster1::*> vserver cifs share show -vserver studentX -share-name studentX_child
Vserver: studentX
Share: studentX_child
CIFS Server NetBIOS Name: STUDENTX
Path: /studentX_cifs_child
Share Properties: oplocks
browsable
changenotify
Symlink Properties: -
File Mode Creation Mask: -
Directory Mode Creation Mask: -
Share Comment: -
Share ACL: Everyone / Full Control
File Attribute Cache Lifetime: -
From the above commands obtain the name of the volumes being accessed via the shares
11. Now that you know the volumes you are trying to access use fsecurity show to view permissions on these.
cluster1::*> vol show -vserver studentX -volume studentX_cifs -instance
Find the node that hosts the aggregate on which studentX_cifs lives.
From the nodeshell of that node, run:
cluster1-01> fsecurity show /vol/studentX_cifs
What do you see?
cluster1::*> vol show -vserver studentX -volume studentX_cifs_child -instance
Find the node that hosts the aggregate on which studentX_cifs_child lives.
From the nodeshell of that node, run:
cluster1-01> fsecurity show /vol/studentX_cifs_child
What do you see?
Find the node that hosts the aggregate on which studentX_root lives.
From the nodeshell of that node, run:
cluster1-01> fsecurity show /vol/studentX_root
What do you see?
12. From ngsh run:
cluster1::*> volume modify -vserver studentX -volume studentX_root -unix-permissions 755
Queued private job: 167
Are you able to access both the shares now?
13. From ngsh run:
cluster1::*> volume modify -vserver studentX -volume studentX_cifs -security-style ntfs
Queued private job: 168
Does this resolve the issue?
Exercise 3: Authorization issues Time Estimate: 20 minutes
Step Action
1. From a client go Start -> Run -> \\studentX\studentX
What do you see?
2. Try to view the permissions on the share. What do you see?
3. From the nodeshell of the node where the volume and its aggregate are hosted, run:
cluster1-01> fsecurity show /vol/student1_cifs
[/vol/student1_cifs - Directory (inum 64)]
Security style: NTFS
Effective style: NTFS
DOS attributes: 0x0010 (----D---)
Unix security:
uid: 0
gid: 0
mode: 0777 (rwxrwxrwx)
NTFS security descriptor:
Owner: S-1-5-32-544
Group: S-1-5-32-544
DACL:
Allow - S-1-5-21-3281022357-2736815186-1577070138-500 - 0x001f01ff (Full Control)
4. From the above command, obtain the SID of the owner of the volume.
From ngsh run:
cluster1::*> diag secd authentication translate -node local -vserver studentX -sid S-1-5-32-544
What do you see?
5. How do you resolve this issue?
Exercise 4: Export Policies Time Estimate: 20 minutes
Step Action
1. Try to access \\studentX\studentX
What do you see?
2. What error do you see?
3. What does the event log show? What about the secd log? (Exercise 2 steps 3 and 8)
4. From nodeshell of the node that hosts the volume and its aggregate run:
“fsecurity show /vol/studentX_cifs”
Do the permissions show that access should be allowed?
5. From clustershell obtain the name of the export-policy associated with the volume as follows:
cluster1::> volume show -vserver studentx -volume student1_cifs -fields policy
Now view details of the export-policy obtained in the previous command
cluster1::> export-policy rule show -vserver studentX -policyname <policy name obtained from the above command>
cluster1::> export-policy rule show -vserver studentX -policyname <policy name obtained from the above command> -ruleindex <rule index applicable>
What do you see?
How do you fix the issue?
MODULE 6: SCALABLE SAN
Exercise 1: Enable SAN features, create a LUN, and connect via iSCSI Time Estimate: 20 minutes
Step Action
1. Review your SAN configuration on the cluster.
- Licenses
- SAN protocol services
- Interfaces
2. Create a LUN in your studentX_san volume.
3. Create an igroup and add the iSCSI IQN of your host to the group.
4. Configure the iSCSI initiator.
5. Map the LUN and access it from the lab host. Format the LUN and write data to it.
6. From clustershell
cluster1::*> iscsi show
What do you see?
cluster1::*> debug seqid show
What do you see?
7. Locate the UUIDs of your iSCSI LIFs:
::> debug smdb table vifmgr_virtual_interface show -lif-name <iscsi_lif>
Display the statistics for these LIFs:
cluster1::statistics*> show -node cluster1-01 -object iscsi_lif -counter iscsi_read_ops -instance <UUID obtained from the above command>
EXERCISE 2
TASK 1: TROUBLESHOOT QUORUM ISSUES
In this task, you experience quorum failure on a node of the cluster.
STEP ACTION
1. Team member 1: log in to the console of clusterY-01 as admin.
From here on this will be referred to as Node 1.
2. Team member 2: log in to the console of clusterY-02 as admin.
From here on this will be referred to as Node 2.
3. Team member 1, on the Node 1 console, in ngsh:
::> set diag
4. Team member 2, on the Node 2 console, in ngsh:
::> set diag
5. Team member 2 on the Node 2 ngsh, verify cluster status:
::*> cluster show
6. Team member 2 on the Node 2 ngsh, view the current LIFs:
::*> net int show
7. Team member 2 on the Node 2 ngsh, view the current cluster kernel status:
::*> cluster kernel-service show -instance
8. Team member 2 on the Node 2 ngsh, bring down the cluster network LIFs on the interface:
::*> net int modify -vserver clusterY-02 -lif clus1,clus2 -status-admin down
9. Team member 2 on the Node 2 ngsh, view the current cluster kernel status:
::*> cluster kernel-service show -instance
10. Team member 1 on the Node 1 ngsh, view the current cluster kernel status:
::*> cluster kernel-service show -instance
11. On the Node 2 PuTTY interface, enable the cluster network LIFs on the interface:
::*> net int modify -vserver cluster1-02 -lif clus1,clus2 -status-admin up
12. Team member 2 on the Node 2 ngsh, view the current cluster kernel status:
::*> cluster kernel-service show -instance
What do you see?
13. Team member 1 on the Node 1 ngsh, view the current cluster kernel status:
::*> cluster kernel-service show -instance
What do you see?
14. Team member 2 on the Node 2 ngsh, view the current bcomd information:
cluster1::*> debug smdb table bcomd_info show
What do you see?
15. Team member 1 on the Node 1 ngsh, view the current bcomd information:
cluster1::*> debug smdb table bcomd_info show
What do you see?
16. Team member 2 reboot Node2 to have it start participating in SAN quorum again:
::*> reboot -node clusterY-02
17. Team member 2: log in on the Node 2 console as admin.
18. Team member 2 on Node2, verify cluster health:
::> cluster show
19. Team member 2 on Node2
::> set diag
20. Verify that both nodes have a cluster kernel status of in quorum (INQ):
::*> cluster kernel-service show -instance
::*> debug smdb table bcomd_info show
TASK 2: TROUBLESHOOT LOGICAL INTERFACE ISSUES
In this task, you bring down the LIFs that are associated with a LUN.
STEP ACTION
1. Console login as admin on clusterY-0X, view the current LIFs:
::*> net int show
2. On your own, disable LIFs that are associated with studentX_iscsi and determine how this action impacts connectivity to your LUN on the Windows host.
END OF EXERCISE
Exercise 3: Diag level SAN debugging Time Estimate: 25 minutes
Step Action
1. What are two ways we can see where the nvfail option is set on a volume?
2. How would we clear an nvfail state if we saw it?
3. How would we show virtual disk object information for a lun?
4. How do you manually dump a rastrace?
MODULE 7: SNAPMIRROR
Exercise 1: Setting up Intercluster SnapMirror Time Estimate: 20 minutes
Step Action
1. From clustershell of cluster1 run:
cluster1::> snapmirror create -source-path cluster1://student1/student1_snapmirror -destination-path cluster2://student3/student3_dest -type DP -tries 8 -throttle unlimited
Error: command failed: Volume "cluster2://student3/student3_dest" not found.
(Failed to contact peer cluster with address 192.168.81.193. No
intercluster LIFs are configured on this node.)
2. From clustershell of cluster1 run:
::>set diag
cluster1::*> cluster peer address stable show
What do you see?
cluster1::*> net int show -role intercluster
What do you see?
cluster1::*> cluster peer show -instance
What do you see?
cluster1::*> cluster peer health show -instance
What do you see?
3. Run the following command:
::*> cluster peer ping -type data
What do you see?
4. Run the following command:
::*> cluster peer ping -type icmp
What do you see now? What addresses, if any, seem to be having issues?
5. Run the following command:
::> job history show -event-type failed
What jobs are failing?
To examine why they are failing:
cluster1::*> event log show -node cluster1-01 -messagename cpeer*
Why are the jobs failing?
6. Try to modify the cluster peer. What happens?
cluster1::*> cluster peer modify -cluster cluster2 -peer-addrs 192.168.81.193,192.168.81.194 -timeout 60
7. How did you resolve the issue?
Exercise 2: Intercluster DP mirrors Time Estimate: 30 minutes
Step Action
1. From clustershell of cluster1 run:
cluster1::*> snapmirror create -source-path cluster1://student1/student1_snapmirror -destination-path cluster2://student3/student3_dest -type DP -tries 8 -throttle unlimited
What error do you see? What might be going wrong?
2. From clustershell of cluster2 run:
cluster2::> snapmirror create -source-path cluster1://student1/student1_snapmirror -destination-path cluster2://student3/student3_dest -type DP -tries 8 -throttle unlimited
What do you see? Why?
3. After correcting the issue, run the following command in clustershell of cluster2:
cluster2::> snapmirror create -source-path cluster1://student1/student1_snapmirror -destination-path cluster2://student3/student3_dest -type DP -tries 8 -throttle unlimited
Does the command complete?
How do you verify the snapmirror exists?
::>snapmirror show
What do you see? Is the snapmirror functioning?
How do you get the mirror working if it’s not?
4. After the snapmirror is confirmed as functional, check to see how long it has been since the last update (snapmirror lag).
Exercise 3: LS Mirrors Time Estimate: 20 minutes
Step Action
1. Create two LS mirrors that point to your studentX_snapmirror volume.
clusterY::*> volume create -vserver studentX -volume studentX_LS_snapmirror -aggregate studentX -size 100MB -state online -type DP
[Job 265] Job succeeded: Successful
clusterY::*> volume create -vserver studentX -volume studentX_LS_snapmirror2 -aggregate studentX -size 100MB -state online -type DP
[Job 266] Job succeeded: Successful
clusterY::*> snapmirror create -source-path clusterY://studentX/studentX_snapmirror -destination-path clusterY://studentX/studentX_LS_snapmirror2 -type LS
[Job 273] Job is queued: snapmirror create the relationship with destination clu [Job 273] Job succeeded: SnapMirror: done
clusterY::*> snapmirror create -source-path clusterY://studentX/studentX_snapmirror -destination-path clusterY://studentX/studentX_LS_snapmirror -type LS
[Job 275] Job is queued: snapmirror create the relationship with destination clu [Job 275] Job succeeded: SnapMirror: done
What steps did you have to consider?
Check the MSIDs and DSIDs for the source and destination volumes. What do you notice?
clusterY::*> volume show -vserver studentX -fields msid,dsid
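If the creates succeeded, the volume show output should reveal the pattern that makes load-sharing mirrors transparent to clients: all destinations carry the source's MSID (the identity NFS clients see in file handles), while each volume keeps its own unique DSID. The values below are hypothetical; only the shape matters:

```python
# Hypothetical MSID/DSID values illustrating the expected pattern:
# LS destinations share the source's MSID, but every volume has its own DSID.
volumes = [
    {"volume": "studentX_snapmirror",     "msid": 2147484704, "dsid": 1054},
    {"volume": "studentX_LS_snapmirror",  "msid": 2147484704, "dsid": 1055},
    {"volume": "studentX_LS_snapmirror2", "msid": 2147484704, "dsid": 1056},
]
shared_msids = {v["msid"] for v in volumes}
unique_dsids = {v["dsid"] for v in volumes}
print(len(shared_msids), len(unique_dsids))  # 1 3
```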
2. Attempt to initialize one of the mirrors using the “snapmirror initialize” command.
cluster1::*> snapmirror initialize -destination-path cluster1://student1/student1_LS_snapmirror
[Job 276] Job is queued: snapmirror initialize of destination cluster1://student1/student1_LS_snapmirror.
cluster1::*> snapmirror initialize -destination-path cluster1://student1/student1_LS_snapmirror2
[Job 277] Job is queued: snapmirror initialize of destination cluster1://student1/student1_LS_snapmirror2.
cluster1::*> job show
What happens? How would you view the status of the job? If it didn’t work, how would you fix it? Why didn’t it work?
cluster1::*> job history show -id 276
What do you see?
How do you fix it?
3. After initializing the LS mirrors, try to update the mirrors using “snapmirror update.”
clusterY::*> snapmirror update -destination-path clusterY://studentX/studentX_LS_snapmirror
[Job 279] Job is queued: snapmirror update of destination clusterY://studentX/studentX_LS_snapmirror.
clusterY::*> job show
What happens? How do you view the status of the job?
What is the issue?
4. Run the following command:
::> vol show -vserver studentX -fields junction-path
What do you see?
Unmount the volume from the cluster shell.
::> vol unmount -vserver studentX -volume studentX_snapmirror
What do you see?
Run the following:
::> vol show -vserver studentX -fields junction-path
What do you see now?
Then remount the volume to a new junction path “studentX_snapmirror.”
::> vol mount -vserver studentX -volume studentX_snapmirror -junction-path /studentX_snapmirror
Now what do you see?
5. clusterY::*> snapmirror update-ls-set -source-path clusterY://studentX/studentX_snapmirror
clusterY::*> snapmirror update-ls-set -source-path clusterY://studentX/studentX_root
clusterY::*> volume modify -vserver studentX -volume studentX_snapmirror -unix-permissions 000
clusterY::*> volume show -vserver studentX -fields unix-permissions
What do you see?
Mount the volume from your Linux host using -o nfsvers=3:
[root@nfshost DATAPROTECTION]# mount -o nfsvers=3 student1:/student1_snapmirror /cmode
[root@nfshost DATAPROTECTION]# cd /cmode
[root@nfshost cmode]# ls
[root@nfshost cmode]# cd
[root@nfshost ~]# ls -latr /cmode
Now execute:
[root@nfshost ~]# umount /cmode
From clustershell run:
clusterY::*> snapmirror update-ls-set -source-path clusterY://studentX/studentX_snapmirror
From Linux Host run:
[root@nfshost ~]# mount -o nfsvers=3 student1:/student1_snapmirror /cmode
[root@nfshost ~]# ls -latd /cmode
What do you see?
Modify the volume back to 777 on the cluster (using vol modify)
clusterY::*> volume modify -vserver studentX -volume studentX_snapmirror -unix-permissions 777
Queued private job: 162
Check permissions on the unix host again.
[root@nfshost ~]# ls -latd /cmode
ls: /cmode: Permission denied
[root@nfshost ~]# cd /cmode
What do you see?
Are you able to cd into the mount now?
Update the LS mirror set.
clusterY::*> snapmirror update-ls-set -source-path clusterY://studentX/studentX_snapmirror
What do you see in ls on the host? Why?
Modify the source volume to 000
clusterY::*> volume modify -vserver studentX -volume studentX_snapmirror -unix-permissions 000
Queued private job: 163
What do you see in ls on the host? Why?
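What you observe in both directions follows from where reads are served: the NFS traffic lands on a load-sharing destination, which reflects source-side changes such as vol modify only after the mirror set transfers. A minimal sketch of that staleness (plain Python, not ONTAP code):

```python
# Sketch of LS-mirror staleness: the client is served from the destination,
# which only picks up source changes when the mirror set is updated.
source = {"unix-permissions": "777"}
ls_mirror = dict(source)                 # state after the last transfer
source["unix-permissions"] = "000"       # volume modify on the source
served_to_client = ls_mirror["unix-permissions"]
print(served_to_client)                  # 777 -> clients still see the old mode
ls_mirror = dict(source)                 # snapmirror update-ls-set
print(ls_mirror["unix-permissions"])     # 000 -> change now visible
```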