Int Eng ILT Cmodetrbl Exercise Guide


MODULE 1: KERNEL

Exercise 1: Recovering from a boot loop Time Estimate: 20 minutes

Step Action

1. Log in to the clustershell and execute the following command

cluster1::> cluster show

Node Health Eligibility

--------------------- ------- ------------

cluster1-01 true true

cluster1-02 false true

cluster1-03 true true

cluster1-04 true true

4 entries were displayed.

2. Note that the health of node clusterX-02 is false.

Try and log in to the nodeshell of clusterX-02 to find out the problem.

If unable to access nodeshell of clusterX-02, try and access it through its console.

What do you see?

3. How do you fix this?


MODULE 2: M-HOST

Exercise 1: Fun with mgwd and mroot Time Estimate: 20 minutes

Step Action

1. On a node that does not own epsilon, log in as admin to your cluster via the console and go into the systemshell.

::> set diag

::*> systemshell local

2. Execute the following:

% ps -A|grep mgwd

913 ?? Ss 0:11.76 mgwd -z

2794 p1 DL+ 0:00.00 grep mgwd

The above listing shows that the process ID of the running instance of mgwd on this node is 913.

Kill mgwd as follows:

% sudo kill <pid of mgwd as obtained from above>

3. You see the following. Why?

server closed connection unexpectedly: No such file or directory

login:

Log in as admin again as shown below:

server closed connection unexpectedly: No such file or directory

login:admin

Password:

What happens ?

4. You are now in clustershell. Drop to systemshell as follows:

::> set diag


::*> systemshell local

In systemshell execute the following:

% cd /etc

% sudo ./netapp_mroot_unmount

% exit

logout

When would we expect the node to use/need this script?

5. Now you are back in clustershell. Execute the following:

cluster1::> set diag

Warning: These diagnostic commands are for use by NetApp personnel only.

Do you want to continue? {y|n}: y

cluster1::*> cluster show

Node Health Eligibility Epsilon

-------------------- ------- ------------ ------------

cluster1-01 true true true

cluster1-02 true true false

cluster1-03 true true false

cluster1-04 true true false

4 entries were displayed.

cluster1::*> vol modify -vserver studentX -volume studentX_nfs -size 45M

(volume modify)

Error: command failed: Failed to queue job 'Modify studentX_nfs'. IO error in

local job store

cluster1::*> cluster show

Node Health Eligibility Epsilon

-------------------- ------- ------------ ------------

cluster1-01 false true true


cluster1-02 false true false

cluster1-03 false true false

cluster1-04 false true false

4 entries were displayed.

Do we see a difference in cluster show? If so, why? What’s broken?

6. To fix this without rebooting and without manually re-mounting /mroot, restart mgwd.
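A minimal sketch, reusing the systemshell commands from step 2 (spmd should respawn mgwd, and the new instance remounts /mroot):

% ps -A | grep mgwd

% sudo kill <pid of mgwd as obtained from above>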

7. In which phase of the boot process could we see this behavior occurring?

Exercise 2: Configuration backup and recovery

Time Estimate: 40 minutes

Action

1. Run the following commands:

::> set advanced

::*> man system configuration backup create

::*> man system configuration recovery node

::*> man system configuration recovery cluster

::*> system configuration backup show -node nodename

What do each of the commands show?


2. Where in systemshell can you find the files listed above?

3. Create a new system configuration backup of the node and the cluster as follows:

cluster1::*> system configuration backup create -node cluster1-01 -backup-type

node -backup-name cluster1-01.node

[Job 164] Job is queued: Local backup job.

::*> job private show

::*> job private show -id [Job id given as output of the backup create command above]

::*> job private show -id [id as above] -fields uuid

::*> job store show -id [uuid obtained from the command above]

cluster1::*> system configuration backup create -node cluster1-01 -backup-type

cluster -backup-name cluster1-01.cluster

[Job 495] Job is queued: Cluster Backup OnDemand Job.

::>job show

4. The following KB shows how to scp the backup files you created, as well as one of the system-created backups off to the Linux client:

https://kb.netapp.com/support/index?page=content&id=1012580

Use the following to install p7zip on your Linux client and use it to unzip the backup files.

# yum install p7zip

This is the recommended practice on live nodes; however, for vsims, scp does not work.

So in the current lab setup, drop to the systemshell and cd to /mroot/etc/backups/config.

Unzip the system created backup file by doing the following:


% 7za e [system created backup file name]

What is in this file?
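If you want to inspect an archive before extracting it, p7zip can also list its contents (an optional convenience, not part of the lab steps):

% 7za l [system created backup file name]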

cd into one of the folders created by the unzip. There will be another 7z file. Extract it:

% 7za e [file name]

What’s in this file?

Extract the file:

% 7za e [file name]

What’s inside of it?

Compare it to what is in /mroot/etc of one of the cluster nodes. What are some of the differences?

5. cd into “cluster_config” in the backup. What is different from /mroot/etc/cluster_config on the node?

6. cd into “cluster_replicated_records” at the root of the folder you originally extracted the backup to and issue an “ls” command.

What do you see?

7. Unzip the node and cluster backups you created. What do you notice about the contents of these files?


Exercise 3: Moving mroot to a new aggregate

Time Estimate: 30 minutes

Step Action

1. Move a node’s root volume to a new aggregate.

Work with your lab partners and do this on only one node.

For live nodes the following KB contains the steps to do this:

https://kb.netapp.com/support/index?page=content&id=1013350&actp=LIST

However for vsims the root volume that is created by default is only 20MB and too small to hold the cluster configuration information.

Hence follow the steps given below:

2. Run the following command to create a new 3-disk aggregate on the desired node:

cluster1::> aggr create -aggregate new_root -diskcount 3 -nodes local

[Job 276] Job succeeded: DONE

cluster1::> aggr show -nodes local

Aggregate Size Available Used% State #Vols Nodes RAID Status

--------- -------- --------- ----- ------- ------ ---------------- ------------

aggr0_cluster1_02_0

900MB 15.45MB 98% online 1 cluster1-02


raid_dp,

normal

student2 900MB 467.4MB 48% online 8 cluster1-02 raid_dp,

normal

2 entries were displayed.

3. Ensure that the node does not own epsilon. If it does, run the following command to move it to another node in the cluster:

cluster1::> set diag

Warning: These diagnostic commands are for use by NetApp personnel only.

Do you want to continue? {y|n}: y

cluster1::*> cluster show

Node Health Eligibility Epsilon

-------------------- ------- ------------ ------------

cluster1-01 true true false

cluster1-02 true true true

cluster1-03 true true false

cluster1-04 true true false

4 entries were displayed.

Run the following command to modify epsilon to 'false' on the owning node:

::*> cluster modify -node cluster1-02 -epsilon false

Then, run the following command to modify it to 'true' on the desired node:

::*> cluster modify -node cluster1-01 -epsilon true

::*> cluster show

Node Health Eligibility Epsilon

-------------------- ------- ------------ ------------

cluster1-01 true true true

cluster1-02 true true false


cluster1-03 true true false

cluster1-04 true true false

4 entries were displayed.

4. Run the following command to set the cluster eligibility on the node to 'false':

::*> cluster modify -node cluster1-02 -eligibility false

Note: This action must be performed on a node that is not to be marked as ineligible.

5. Run the following command to reboot the node into maintenance mode

cluster1::*> reboot local

(system node reboot)

Warning: Are you sure you want to reboot the node? {y|n}: y

login:

Waiting for PIDS: 718.

Waiting for PIDS: 695.

Terminated

.

Uptime: 2h12m14s

System rebooting...

\

Hit [Enter] to boot immediately, or any other key for command prompt.

Booting...

x86_64/freebsd/image1/kernel data=0x7ded08+0x1376c0 syms=[0x8+0x3b7f0+0x8+0x274a 8]

x86_64/freebsd/image1/platform.ko size 0x213b78 at 0xa7a000

NetApp Data ONTAP 8.1.1X34 Cluster-Mode

Copyright (C) 1992-2012 NetApp.

All rights reserved.

md1.uzip: 26368 x 16384 blocks

md2.uzip: 3584 x 16384 blocks

*******************************


* *

* Press Ctrl-C for Boot Menu. *

* *

*******************************

^CBoot Menu will be available.

Generating host.conf.

Please choose one of the following:

(1) Normal Boot.

(2) Boot without /etc/rc.

(3) Change password.

(4) Clean configuration and initialize all disks.

(5) Maintenance mode boot.

(6) Update flash from backup config.

(7) Install new software first.

(8) Reboot node.

Selection (1-8)? 5

….

WARNING: Giving up waiting for mroot

Tue Sep 11 11:23:27 UTC 2012

*> Sep 11 11:23:28 [cluster1-02:kern.syslog.msg:info]: root logged in from SP NONE

*>

6. Run the following command to set the options for the new aggregate to become the new root:

Note: It might be required to set the aggr options to CFO instead of SFO:

*> aggr options new_root root

aggr options: This operation is not allowed on aggregates with sfo HA Policy


*> aggr options new_root ha_policy cfo

Setting ha_policy to cfo will substantially increase the client outage during giveback for cluster volumes on aggregate new_root.

Are you sure you want to proceed? y

*> aggr options new_root root

Aggregate 'new_root' will become root at the next boot.

*>

7. Run the following command to reboot the node: *> halt

Sep 11 11:27:49 [cluster1-02:kern.cli.cmd:debug]: Command line input: the command is 'halt'. The full command line is 'halt'.

.

Uptime: 6m26s

The operating system has halted.

Please press any key to reboot.

System halting...

\

Hit [Enter] to boot immediately, or any other key for command prompt.

Booting in 1 second...

8. Once the node is booted, a new root volume named AUTOROOT will be created. In addition, the node will not be in quorum yet. This is because the new root volume will not be aware of the cluster.

login: admin

Password:

***********************

** SYSTEM MESSAGES **

***********************

A new root volume was detected. This node is not fully operational. Contact support personnel for the root volume recovery procedures.

cluster1-02::>

9. Increase the size of AUTOROOT on the node by doing the following:

Log in to the systemshell of a node that is in quorum and execute the following D-blade ZAPIs to:

a) Get the uuid of volume AUTOROOT of the node where root volume was changed

b) Increase the size of the same AUTOROOT volume by 500m

c) Check if the size is successfully changed
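If you first need to look up the cluster IP address to pass to -H, one way (a sketch; verify the LIF role and field names in your own lab) is:

cluster1::> net int show -role cluster -node <node where new root volume was created> -fields address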

% zsmcli -H <cluster ip address of the node where new root volume was created> d-volume-list-info-iter-start desired-attrs=name,uuid

<results status="passed">

<next-tag>cookie=0,desired_attrs=name,uuid</next-tag>

</results>

% zsmcli -H <cluster ip address of the node where new root volume was created> d-volume-list-info-iter-next maximum-records=10 tag='cookie=0,desired_attrs=name,uuid'

<results status="passed">

<volume-attrs>

<d-volume-info>

<name>vol0</name>

<uuid>014df353-bbc1-11e1-bb4c-123478563412</uuid>

</d-volume-info>

<d-volume-info>

<name>student2_root</name>

<uuid>044f53fa-e784-11e1-ab6e-123478563412</uuid>

</d-volume-info>

<d-volume-info>

<name>student2_LS_root</name>

<uuid>0ea7ae4c-e790-11e1-ab6e-123478563412</uuid>

</d-volume-info>

<d-volume-info>

<name>AUTOROOT</name>

<uuid>30d8f742-fc04-11e1-bbf5-123478563412</uuid>

</d-volume-info>

<d-volume-info>

<name>student2_cifs</name>

<uuid>b8868843-e788-11e1-ab6e-123478563412</uuid>

</d-volume-info>

<d-volume-info>

<name>student2_cifs_child</name>

<uuid>c07f13ce-e788-11e1-ab6e-123478563412</uuid>

</d-volume-info>

<d-volume-info>

<name>student2_nfs</name>

<uuid>c861f83b-e788-11e1-ab6e-123478563412</uuid>

</d-volume-info>

% zsmcli -H 192.168.71.33 d-volume-set-info desired-attrs=size id=30d8f742-fc04-11e1-bbf5-123478563412 volume-attrs='[d-volume-info=[size=+500m]]'

<results status="passed"/>

% zsmcli -H 192.168.71.33 d-volume-list-info id=30d8f742-fc04-11e1-bbf5-123478563412 desired-attrs=size

<results status="passed">

<volume-attrs>

<d-volume-info>

<size>525m</size>

</d-volume-info>

</volume-attrs>

</results>

10. Clear the root recovery flags if required by doing the following:

Log in to the systemshell of the node where the new root volume was created and check if the bootarg.init.boot_recovery bit is set:

% sudo kenv bootarg.init.boot_recovery

If a value is returned (that is, you do not get "kenv: unable to get bootarg.init.boot_recovery"), clear the bit:

% sudo sysctl kern.bootargs=--bootarg.init.boot_recovery

kern.bootargs: ->

Check that the bit is cleared

% sudo kenv bootarg.init.boot_recovery

kenv: unable to get bootarg.init.boot_recovery

%

11. From a healthy node, with all nodes booted, run the following command:

::*> system configuration recovery cluster rejoin -node <the node where new root volume was created>

Warning: This command will rejoin node "cluster1-02" into the local cluster, potentially overwriting critical cluster

configuration files. This command should only be used to recover from a disaster. Do not perform any other recovery

operations while this operation is in progress. This command will cause node "cluster1-02" to reboot.

Do you want to continue? {y|n}: y

Node "cluster1-02" is rebooting. After it reboots, verify that it joined the new cluster.

12. After a boot, check the cluster to ensure that the node is back and eligible:

cluster1::> cluster show

Node Health Eligibility

--------------------- ------- ------------

cluster1-01 true true

cluster1-02 true true

cluster1-03 true true

cluster1-04 true true

4 entries were displayed.

13. If the cluster is still not in quorum, run the following command:

::*> system configuration recovery cluster sync -node <node where new root volume was created>

Warning: This command will synchronize node "cluster1-02" with the cluster configuration, potentially overwriting critical cluster configuration files on the node. This feature should only be used to recover from a disaster. Do not perform any other recovery operations while this operation is in progress. This command will cause all the cluster applications on node "node4" to restart, interrupting administrative CLI and Web interface on that node.

Do you want to continue? {y|n}: y

All cluster applications on node "cluster1-02" will be restarted. Verify that the cluster applications go online.

14. After the node is in quorum, run the following command to add the new root vol to VLDB. This is necessary because it is a 7-Mode volume and will not be displayed until it is added:

cluster1::> set diag

cluster1::*> vol show -vserver cluster1-02

(volume show)

Vserver Volume Aggregate State Type Size Available Used%

--------- ------------ ------------ ---------- ---- ---------- ---------- -----

cluster1-02

vol0 aggr0_cluster1_02_0

online RW 851.5MB 283.3MB 66%

cluster1::*> vol add-other-volumes -node cluster1-02

(volume add-other-volumes)

cluster1::*> vol show -vserver cluster1-02

(volume show)

Vserver Volume Aggregate State Type Size Available Used%

--------- ------------ ------------ ---------- ---- ---------- ---------- -----

cluster1-02

AUTOROOT new_root online RW 525MB 379.2MB 27%

cluster1-02

vol0 aggr0_cluster1_02_0

online RW 851.5MB 283.3MB 66%

2 entries were displayed.

15. Run the following command to remove the old root volume from VLDB

cluster1::*> vol remove-other-volume -vserver cluster1-02 -volume vol0

(volume remove-other-volume)

cluster1::*> vol show -vserver cluster1-02

(volume show)

Vserver Volume Aggregate State Type Size Available Used%

--------- ------------ ------------ ---------- ---- ---------- ---------- -----

cluster1-02

AUTOROOT new_root online RW 525MB 379.2MB 27%

16. Destroy the old root vol by running the following command from the node shell of the node where the new root volume has been created

cluster1::*> node run local

Type 'exit' or 'Ctrl-D' to return to the CLI

cluster1-02> vol status vol0

Volume State Status Options

vol0 online raid_dp, flex nvfail=on

64-bit

Volume UUID: 014df353-bbc1-11e1-bb4c-123478563412

Containing aggregate: 'aggr0_cluster1_02_0'

cluster1-02> vol offline vol0

Volume 'vol0' is now offline.

cluster1-02> vol destroy vol0

Are you sure you want to destroy volume 'vol0'? y

Volume 'vol0' destroyed.

And the old root aggr can be destroyed if desired:

From cluster shell:

cluster1::*> aggr show -node <node where new root vol was created>

Aggregate Size Available Used% State #Vols Nodes RAID Status

--------- -------- --------- ----- ------- ------ ---------------- ------------

aggr0_cluster1_02_0

900MB 899.7MB 0% online 0 cluster1-02 raid_dp,

normal

new_root 900MB 371.9MB 59% online 1 cluster1-02 raid_dp,

normal

student2 900MB 467.2MB 48% online 8 cluster1-02 raid_dp,

normal

3 entries were displayed.

cluster1::*> aggr delete -aggregate <old root aggregate name>

Warning: Are you sure you want to destroy aggregate "aggr0_cluster1_02_0"?

{y|n}: y

[Job 277] Job succeeded: DONE

17. Use the following KB to rename the root volume (AUTOROOT) to vol0: https://kb.netapp.com/support/index?page=content&id=2015985

18. What sort of things regarding the root vol did you observe during this?


Exercise 4: Locate and Repair Aggregate Issues

Time Estimate: 15 minutes

Action

1. Log in to the clustershell of clusterX and execute the following (team member 1 use X=1 and team member 2 use X=2):

::> aggr show -aggregate VLDBX

There are no entries matching your query.

One aggregate is showing as missing from the cluster shell.

Execute the following:

::> aggr show -aggregate WAFLX -instance

Aggregate: WAFLX
Size: -
Used Size: -
Used Percentage: -
Available Size: -
State: unknown
Nodes: cluster1-02

Another aggregate is showing as “unknown”:

Fix the issue.

2. Issue the following command. Do you see anything wrong?

::*> debug vreport show aggregate

3. What nodes do the aggregates belong to? How do you know?

4. Use the “debug vreport fix” command to resolve the problem.

5. List some of the reasons why customers could have this problem.


6. Was any data lost? If so, which aggregate?

Exercise 5: Replication failures

Time Estimate: 20 minutes

Action

1. Note: Participants working with cluster2 should replace student1 with student3 and student2 with student4 in all the steps of this exercise.

Log in to the systemshell of clusterX-02 (make sure it does not own epsilon).

Unmount mroot and clus and prevent mgwd from being monitored by spmctl, as follows:

% sudo umount -f /mroot

% sudo umount -f /clus

% spmctl -d -h mgwd

2. Login to ngsh on clusterX-02 and execute the following:

cluster1::*> volume create -vserver student1 -volume test -aggregate

Info: Node cluster1-01 that hosts aggregate aggr0 is offline

Node cluster1-03 that hosts aggregate aggr0_cluster1_03_0 is offline

Node cluster1-04 that hosts aggregate aggr0_cluster1_04_0 is offline

Node cluster1-01 that hosts aggregate student1 is offline

aggr0 aggr0_cluster1_03_0 aggr0_cluster1_04_0

new_root student1 student2

cluster1::*> volume create -vserver student1 -volume test -aggregate student2

Error: command failed: Replication service is offline

cluster1::*> net int create -vserver student1 -lif test -role data -home-node cluster1-02 -home-port e0c -address 10.10.10.10 -netmask 255.255.255.0 -status-admin up

(network interface create)

Info: An error occurred while creating the interface, but a new routing group

d10.10.10.0/24 was created and left in place

Error: command failed: Local unit offline

cluster1::*> vserver create -vserver test -rootvolume test -aggregate student1 -ns-switch file -rootvolume-security-style unix

Info: Node cluster1-01 that hosts aggregate student1 is offline

Error: create_imp: create txn failed

command failed: Local unit offline

3. Login to ngsh on clusterX-01 and execute the following:

cluster1::> volume create test -vserver student2 -aggregate

Info: Node cluster1-02 that hosts aggregate new_root is offline

Node cluster1-02 that hosts aggregate student2 is offline

aggr0 aggr0_cluster1_03_0 aggr0_cluster1_04_0

new_root student1 student2

cluster1::> volume create test -vserver student2 -aggregate student2 -size 20MB

Info: Node cluster1-02 that hosts aggregate student2 is offline

Error: command failed: Failed to create the volume because cannot determine the

state of aggregate student2.

cluster1::> volume create test -vserver student2 -aggregate student1 -size 20MB

[Job 368] Job succeeded: Successful

Note: when a volume is created on an aggregate not hosted on clusterX-02, the volume create succeeds.

cluster1::> net int create -vserver student1 -lif data2 -role data -data-protocol nfs,cifs,fcache -home-node cluster1-02 -home-port e0c -address 10.10.10.10 -netmask 255.255.255.0


(network interface create)

Info: create_imp: Failed to create virtual interface

Error: command failed: Routing group d10.10.10.0/24 not found

cluster1::> net int create -vserver student1 -lif data2 -role data -data-protocol nfs,cifs,fcache -home-node cluster1-01 -home-port e0c -address 10.10.10.10 -netmask 255.255.255.0

(network interface create)

Note: when an interface is created on a port not hosted on clusterX-02, the interface create succeeds.

cluster1::*> vserver create -vserver test -rootvolume test -aggregate student2 -ns-switch file -rootvolume-security-style unix

Info: Node cluster1-02 that hosts aggregate student2 is offline

Error: create_imp: create txn failed

command failed: Local unit offline

cluster1::*> vserver create -vserver test -rootvolume test -aggregate student1 -ns-switch file -rootvolume-security-style unix

[Job 435] Job succeeded: Successful

Note: when a vserver is created and its root volume is created on an aggregate that is not hosted on clusterX-02, the vserver create succeeds.

4. Log in to systemshell of clusterX-02.

Execute the following:

cluster1-02% mount

/dev/md0 on / (ufs, local, read-only)

devfs on /dev (devfs, local)

/dev/ad0s2 on /cfcard (msdosfs, local)

/dev/md1.uzip on / (ufs, local, read-only, union)

/dev/md2.uzip on /platform (ufs, local, read-only)

/dev/ad3 on /sim (ufs, local, noclusterr, noclusterw)

/dev/ad1s1 on /var (ufs, local, synchronous)

procfs on /proc (procfs, local)


/dev/md3 on /tmp (ufs, local, soft-updates)

/mroot/etc/cluster_config/vserver on /mroot/vserver_fs (vserverfs, union)

Note that /mroot and /clus are not mounted

5. From the systemshell of clusterX-02, run the following commands:

% rdb_dump

What do you see?

%tail -100 /mroot/etc/mlog/mgwd.log |more

What do you see?

Log in to systemshell of cluster-01 and run the following command

%tail -100 /mroot/etc/mlog/mgwd.log |more

What do you see?

6. From systemshell of clusterX-02 run:

%spmctl

What do you see?

7. What happened?

8. Fixing these issues:

a) Re-add mgwd to spmctl with:

% ps aux | grep mgwd

root 779 0.0 17.6 303448 133136 ?? Ss 1:53PM 0:44.12 mgwd -z

diag 3619 0.0 0.2 12016 1204 p2 S+ 4:39PM 0:00.00 grep mgwd

% spmctl -a -h mgwd -p 779

b) Then restart mgwd which will mount /mroot and /clus

% sudo kill <PID>


Exercise 6: Troubleshooting Autosupport

Time Estimate: 20 minutes

Action

1. From clustershell of each node send a test autosupport as follows: (y takes the values 1,2,3,4)

::*> system autosupport invoke -node clusterX-0y -type test

You will see an error such as:

Error: command failed: RPC: Remote system error - Connection refused

2. Let's find out why. "Connection refused" means that we couldn't talk to the application for some reason. In this case, notifyd is the application. When we look in systemshell for the process, it's not there:

cluster1-01% ps aux | grep notifyd

diag 5442 0.0 0.2 12016 1160 p0 S+ 9:20PM 0:00.00 grep notifyd

3. spmctl manages notifyd

We can check to see why spmctl didn't start notifyd back up:

cluster-1-01% cat spmd.log | grep -i notify

0000002e.00001228 0002ba73 Tue Aug 09 2011 21:26:31 +00:00 [kern_spmd:info:739] 0x800702d30: INFO: spmd::ProcessController: sendShutdownSignal:process_controller.cc:186 sending SIGTERM to 5498:
0000002e.00001229 0002ba73 Tue Aug 09 2011 21:26:31 +00:00 [kern_spmd:info:739] 0x8007023d0: INFO: spmd::ProcessWatcher: _run:process_watcher.cc:152 kevent returned: 1
0000002e.0000122a 0002ba73 Tue Aug 09 2011 21:26:31 +00:00 [kern_spmd:info:739] 0x8007023d0: INFO: spmd::ProcessControlManager: dumpExitConditions:process_control_manager.cc:732 process (notifyd:5498) exited on signal 15
0000002e.0000122b 0002ba7d Tue Aug 09 2011 21:26:32 +00:00 [kern_spmd:info:739] 0x8007023d0: INFO: spmd::ProcessWatcher: _run:process_watcher.cc:148 wait for incoming events.

And then we check spmctl to see if it's still monitoring notifyd:

cluster-1-01% spmctl | grep notify

In this case, it looks like notifyd got removed from spmctl and we need to re-add it:

cluster-1-01% spmctl -e -h notifyd

cluster-1-01% spmctl | grep notify

Exec=/sbin/notifyd -n;Handle=56548532-c334-4633-8cd8-77ef97682d3d;Pid=15678;State=Running

cluster-1-01% ps aux | grep notify

root 15678 0.0 6.7 112244 50568 ?? Ss 4:06PM 0:02.42 /sbin/notifyd –

diag 15792 0.0 0.2 12016 1144 p2 S+ 4:06PM 0:00.00 grep notify

4. Try to send a test autosupport.

::*> system autosupport invoke -node clusterX-0y -type test

What happens?


MODULE 3: SCON

Exercise 1: Vifmgr and MGWD interaction

Time Estimate: 30 minutes

Step Action

1. Try to create an interface:

clusterX::*> net int create -vserver studentY -lif test -role data -data-protocol nfs,cifs,fcache -home-node clusterX-02 -home-port

You see the following error:

Warning: Unable to list entries for vifmgr on node clusterX-02. RPC: Remote

system error - Connection refused

{<netport>|<ifgrp>} Home Port

2. Ping the interfaces of clusterX-02, the node whose ports seem inaccessible:

clusterX::*> cluster ping-cluster -node clusterX-02

What do you see?

3. Perform data access:

Attempt cifs access to \\student2\student2(cluster1) or \\student4\student4(cluster2) from the windows machine

What happens?

4. Execute the following:

clusterX::*> net int show

What do you see?

5. Run net port show:

clusterX::*> net port show

What do you see?

6. Check the system logs:

clusterX::*> debug log files modify -incl-files vifmgr,mgwd

clusterX::*> debug log show -node clusterX-02 -timestamp Mon Oct 10*

What do you see?

7. Log in to the systemshell on clusterX-02 and run ps to see if vifmgr is running:

clusterX-02% ps -A | grep vifmgr

8. Run rdb_dump from the systemshell of clusterX-02:

clusterX-02% rdb_dump

What do you see?

9. Run the following from the systemshell of clusterX-02:

clusterX-02% spmctl | grep vifmgr

What do you see?

10. In clustershell, execute cluster ring show:

clusterX::*> cluster ring show

11. What is the Issue? How do you fix it?


Exercise 2: Duplicate lif IDs

Time Estimate: 30 minutes

Step Action

1.

From the clustershell, create a new network interface as follows (Y ∈ {1,2,3,4}):

clusterX::*> net int create -vserver studentY -lif data1 -role data -data-protocol nfs,cifs,fcache -home-node clusterX-0Y -home-port e0c -address 192.168.81.21Y -netmask 255.255.255.0 -status-admin up

(network interface create)

Info: create_imp: Failed to create virtual interface

Error: command failed: Duplicate lif id

2. Execute the following:

clusterX::*> net int show

What do you see?

3. View the mgwd log file on the node where you are issuing the net int create command and determine the lif ID that is being reported as duplicate.

4.

Execute the following:

clusterX::*>debug smdb table vifmgr_virtual_interface show -node clusterX-0* -lif-id [lifid/vifid determined from step 3]

What do you see?


5. Execute the following:

clusterX::*> debug smdb table vifmgr_virtual_interface delete -node clusterX-0Y -lif-id <the duplicate id>

clusterX::*> debug smdb table vifmgr_virtual_interface show -node clusterX-0Y -lif-id <the duplicate id>

There are no entries matching your query.

6. Create a new lif:

clusterX::*> net int create -vserver studentY -lif testY -role data -data-protocol

nfs,cifs,fcache -home-node clusterX-0Y -home-port e0c -address 192.168.81.21Y -netmask 255.255.255.0 -status-admin up

(network interface create)


MODULE 4: NFS

Exercise 1: Mount issues Time Estimate: 20 minutes

Step Action

1. From the Linux Host execute the following:

#mkdir /cmodeY

#mount studentY:/studentY_nfs /cmodeY

You see the following:

mount: mount to NFS server 'studentY' failed: RPC Error: Program not registered.

2. Find out the node being mounted:

From the Linux Host execute the following to find the IP address being accessed:

#ping studentY

PING studentY (192.168.81.115) 56(84) bytes of data.

64 bytes from studentY (192.168.81.115): icmp_seq=1 ttl=255 time=1.09 ms

From the clustershell use the following to find out the current node and port on which the above IP address is hosted

clusterX::*> net int show -vserver studentY -address 192.168.81.115 -fields curr-node,curr-port

(network interface show)

vserver lif curr-node curr-port

-------- -------------- ----------- ---------

studentY studentY_data1 clusterX-01 e0d

3. Execute the following to start a packet trace from the nodeshell of the node that was being mounted and attempt the mount once more

clusterX::*> run -node clusterX-01

Type 'exit' or 'Ctrl-D' to return to the CLI

clusterX-01> pktt start e0d

e0d: started packet trace

From the Linux Host attempt the mount once more as shown below:


# mount student1:/student1_nfs /cmode1

Back in the nodeshell of the node that was mounted dump and stop the packet trace

clusterX-01> pktt dump e0d

clusterX-01> pktt stop e0d

e0d: Tracing stopped and packet trace buffers released.

From the systemshell of the node where the packet trace was captured view the packet trace using tcpdump

clusterX-01> exit

logout

clusterX::*> systemshell -node clusterX-01

clusterX-01% cd /mroot

clusterX-01% ls

e0d_20120925_131928.trc home vserver_fs

etc trend

clusterX-01% tcpdump -r e0d_20120925_131928.trc

What do you see? Why?
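Optionally, you can narrow the trace to portmapper and NFS traffic (assuming the standard ports 111 and 2049; mountd may use a different port on your system):

clusterX-01% tcpdump -r e0d_20120925_131928.trc port 111 or port 2049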

4. How do you fix the issue?

5. After fixing the issue check that the mount is successful.

Note: If the mount succeeds, please unmount. This step is very important, or the rest of the exercises will be impacted.

Exercise 2: Mount and access issues

Time Estimate: 30 minutes

Step Action

1. From the Linux Host attempt to mount volume studentX_nfs.


# mount studentX:/studentX_nfs /cmode

mount: studentX:/studentX_nfs failed, reason given by server: Permission denied

2. From clustershell execute the following to find the export policy associated with the volume studentX_nfs:

cluster1::*> vol show -vserver studentX -volume studentX_nfs -instance

Next, use the "export-policy rule show" command to find the properties of the export policy associated with the volume studentX_nfs.

Why did you get an access denied error?

How will you fix the issue?

3. Now once again attempt to mount studentX_nfs from the Linux Host

# mount studentX:/studentX_nfs /cmode

mount: studentX:/studentX_nfs failed, reason given by server: No such file or directory

What issue is occurring here?

4. Now once again attempt to mount studentX_nfs from the Linux Host

# mount studentX:/studentX_nfs /cmode

Is the mount successful?

If yes, cd into the mount point

#cd /cmode

-bash: cd: /cmode: Permission denied

How do you resolve this?

Note: Depending on how you resolved the issue with the export-policy in step 1, you may not see any error here. In that case, move on to step 4.

If you unmount and remount, does it still work?


5.

Try to write a file into the mount

[root@nfshost cmode]# touch f1

What does ls –la show?

[root@nfshost cmode]# ls -la

total 16

drwx------ 2 admin admin 4096 Sep 25 08:06 .

drwxr-xr-x 26 root root 4096 Sep 25 06:03 ..

-rw-r--r-- 1 admin admin 0 Sep 25 08:06 f1

drwxrwxrwx 12 root root 4096 Sep 25 08:05 .snapshot

What do you see the file permissions as?

Why are the permissions and owner set the way they are?

6. From clustershell Execute:

clusterX::> export-policy rule modify -vserver studentY -policyname studentY -ruleindex 1 -rorule any -rwrule any

(vserver export-policy rule modify)

Exercise 3: Stale file handle

Time Estimate: 30 minutes

Step Action

1. From the Linux Host execute:

# cd /nfsX

-bash: cd: /nfsX: Stale NFS file handle


2. Unmount the volume from the client and try to re-mount. What happens?

3. From the Linux Host:

# ping studentX

PING studentX (192.168.81.115) 56(84) bytes of data.

The IP address shown above (192.168.81.115) is the IP of the vserver being mounted.

Find the node in the cluster that is currently hosting this IP

From your clustershell

::*> net int show -address 192.168.81.115 -fields curr-node

(network interface show)

vserver lif curr-node

-------- -------------- -----------

studentX studentX_data1 clusterY-0X

The curr-node value shown above is the node that is currently hosting the IP.

Log in to the systemshell of this node and view the vldb logs

cluster1::*> systemshell -node clusterY-0X

cluster1-01% tail /mroot/etc/mlog/vldb.log

What do you see?

4. Look for volumes with the MSID in the error shown in the vldb log as follows:

From the clustershell, execute the following to find the aggregate where the volume being mounted (nfs_studentX) lives and on which node that aggregate lives:

cluster1::*> vol show -vserver studentX -volume nfs_studentX -fields aggregate
(volume show)

vserver volume aggregate

-------- ------------ ---------

studentX nfs_studentX studentX

cluster1::*> aggr show -aggregate studentX -fields nodes

aggregate nodes

--------- -----------

studentx clusterY-0X


Go to the nodeshell of the node shown above that hosts the volume and its aggregate, use the showfh command, and convert the MSID from hex.

::> run -node clusterY-0X

> priv set diag

*> showfh /vol/nfs_studentX

flags=0x00 snapid=0 fileid=0x000040 gen=0x5849a79f fsid=0x16cd2501 dsid=0x0000000000041e msid=0x00000080000420

0x00000080000420 converted to decimal is 2147484704
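One quick way to do this hex-to-decimal conversion yourself, from the systemshell or the Linux host, is the shell's printf:

% printf "%d\n" 0x00000080000420

2147484704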

Exit from the nodeshell back to the clustershell and execute debug vreport show in diag mode:

cluster1-01*> exit

logout

cluster1::*> debug vreport show

What do you see?

5. What is the issue here?

6. How would you fix this?


MODULE 5: CIFS

Instructions to Students: As mentioned in the lab handout the valid windows users in the domain Learn.NetApp.local are:

a) Administrator b) Student1 c) Student2

Exercise 1: Using diag secd

Time Estimate: 20 minutes

Step Action

1. Find the node where the IP(s) for vserver studentX is hosted

From the RDP machine do the following to start a command window

Start->Run->cmd

In the command window type

ping studentX

From the clustershell find the node on which the IP is hosted (Refer to NFS Exercise 3)

Login to the console of that node and execute the steps of this exercise

2. Type the following:

::> diag secd

What do you see and why?

3. Note: for all the steps of this exercise clusterY-0X should be the name of the local node

Type the following to verify the name mapping of the Windows user student1:

::diag secd*> name-mapping show -node local -vserver studentX -direction win-unix -name student1


4. From the RDP machine do the following to access a cifs share

Start -> Run -> \\studentX

Type the following to query for the Windows SID of your windows user name

cluster1::diag secd*> authentication show-creds -node local -vserver studentX -win-name <username that you have used to RDP to the windows machine>

DC Return Code: 0

Windows User: Administrator Domain: LEARN Privs: a7

Primary Grp: S-1-5-21-3281022357-2736815186-1577070138-513

Domain: S-1-5-21-3281022357-2736815186-1577070138 Rids: 500, 572, 519, 518, 512, 520, 513

Domain: S-1-5-32 Rids: 545, 544

Domain: S-1-1 Rids: 0

Domain: S-1-5 Rids: 11, 2

Unix ID: 65534, GID: 65534

Flags: 1

Domain ID: 0

Other GIDs:

cluster1::diag secd*> authentication translate -node local -vserver student1 -win-name <username that you have used to RDP to the windows machine>

S-1-5-21-3281022357-2736815186-1577070138-500

5. Type the following to test a Windows login for your user windows name in diag secd

cluster1::diag secd*> authentication login-cifs -node local -vserver studentX -user <username that you have used to RDP to the windows machine>

Enter the password: <your windows password, i.e., Netapp123>

Windows User: Administrator Domain: LEARN Privs: a7

Primary Grp: S-1-5-21-3281022357-2736815186-1577070138-513

Domain: S-1-5-21-3281022357-2736815186-1577070138 Rids: 500, 513, 520, 512, 518, 519, 572

Domain: S-1-1 Rids: 0

Domain: S-1-5 Rids: 11, 2

Domain: S-1-5-32 Rids: 544


Unix ID: 65534, GID: 65534

Flags: 1

Domain ID: 0

Other GIDs:

Authentication Succeeded.

6. Type the following to view active CIFS connections in secd

cluster1::diag secd*> connections show -node clusterY-0X -vserver studentX

[ Cache: NetLogon/learn.netapp.local ]

Queue> Waiting: 0, Max Waiting: 1, Wait Timeouts: 0, Avg Wait: 0.00ms

Performance> Hits: 0, Misses: 1, Failures: 0, Avg Retrieval: 24505.00ms

(No connections active or currently cached)

[ Cache: LSA/learn.netapp.local ]

Queue> Waiting: 0, Max Waiting: 1, Wait Timeouts: 0, Avg Wait: 0.00ms

Performance> Hits: 1, Misses: 4, Failures: 0, Avg Retrieval: 6795.40ms

(No connections active or currently cached)

[ Cache: LDAP (Active Directory)/learn.netapp.local ]

Queue> Waiting: 0, Max Waiting: 1, Wait Timeouts: 0, Avg Wait: 0.00ms

Performance> Hits: 1, Misses: 3, Failures: 1, Avg Retrieval: 2832.75ms

(No connections active or currently cached)

Type the following to clear active CIFS connections in secd

cluster1::diag secd*> connection clear -node clusterY-0X -vserver studentX


Test connections on vserver student1 marked for removal.

NetLogon connections on vserver student1 marked for removal.

LSA connections on vserver student1 marked for removal.

LDAP (Active Directory) connections on vserver student1 marked for removal.

LDAP (NIS & Name Mapping) connections on vserver student1 marked for removal.

NIS connections on vserver student1 marked for removal.

7. Type the following to view the server discovery information

cluster1::diag secd*> server-discovery show-host -node clusterY-0X

Host Name: win2k8-01

Cifs Domain:

AD Domain:

IP Address: 192.168.81.10

Host Name: win2k8-01

Cifs Domain:

AD Domain:

IP Address: 192.168.81.253

Type the following to achieve the same result as ONTAP 7G’s “cifs resetdc”

cluster1::diag secd*> server-discovery reset -node clusterY-0X -vserver studentX

Discovery Reset succeeded for Vserver: student1

To verify type the following:

cluster1::diag secd*> server-discovery show-host -node clusterY-0X

Discovery Reset succeeded for Vserver: studentX

Type the following to achieve the same result as ONTAP 7G’s “cifs testdc”:


cluster1::diag secd*> server-discovery test -node clusterY-0X -vserver studentX

Discovery Global succeeded for Vserver: studentX

8. Type the following to view current logging level in secd

cluster1::diag secd*> log show -node clusterY-0X

Log Options

----------------------------------

Log level: Debug

Function enter/exit logging: OFF

Type the following to set and view the current logging level in secd

cluster1::diag secd*> log set -node clusterY-0X -level err

Setting log level to "Error"

cluster1::diag secd*> log show -node clusterY-0X

Log Options

----------------------------------

Log level: Error

Function enter/exit logging: OFF

9. Type the following to enable tracing in secd to capture the logging level specified

cluster1::diag secd*> trace show -node local

Trace Spec

---------------------------------------

Trace spec has not been set.

cluster1::diag secd*> trace set -node cluster1-01 -trace-all yes

Trace spec set successfully for trace-all.

cluster1::diag secd*> trace show -node cluster1-01

Trace Spec

---------------------------------------


TraceAll: Tracing all RPCs

10. Type the following to check the secd configuration for comparison with the ngsh settings:

cluster1::diag secd*> config query -node local -source-name

cifs-server kerberos-realm machine-account

nis-domain vserver vserverid-to-name

unix-group-membership local-unix-user local-unix-group

kerberos-keyblock ldap-config ldap-client-config

ldap-client-schema name-mapping nfs-kerberos

cifs-server-options cifs-server-security dns

cifs-preferred-dc virtual-interface routing-group-routes

secd-cache-config

cluster1::diag secd*> configuration query -node local -source-name machine-account

vserver: 5

cur_pwd: 0100962681ce82e2d6da20df35ce86964fea2c495d9609d395a5199431d3d4531144f845fcfd675e15143fe76932ced271ddcf57c9d8fe59a63b0bc68f717077fc88ca28aa0fdbba4b8d8509bb25ebe2

new_pwd:

installdate: 1345202770

sid: S-1-5-21-3281022357-2736815186-1577070138-1609

vserver: 6

cur_pwd: 01433517c8acbbf66c2e287b4bee56f5d8b707cfb69710737bfb20616ebe61fc31163acde2b5a827f3c2d395b89fef15f28a8f514c147906580cbaa30b4a1361444f76036d2c590222ce1a0feaa56779

new_pwd:

installdate: 1345202787

sid: S-1-5-21-3281022357-2736815186-1577070138-1610


11. Type the following to clear the cache(s) one at a time

cluster1::diag secd*> cache clear -node clusterY-0X -vserver studentX -cache-name

ad-to-netbios-domain netbios-to-ad-domain ems-delivery

ldap-groupid-to-name ldap-groupname-to-id ldap-userid-to-creds

ldap-username-to-creds log-duplicate name-to-sid

sid-to-name nis-groupid-to-name nis-groupname-to-id

nis-userid-to-creds nis-username-to-creds nis-group-membership

netgroup schannel-key lif-bad-route-to-target

cluster1::diag secd*> cache clear -node clusterY-0X -vserver studentX -cache-name ad-to-netbios-domain

Type the following to clear all caches together

cluster1::diag secd*> restart -node clusterY-0X

You are attempting to restart a process in charge of security services. Do not

restart this process unless the system has generated a "secd.config.updateFail"

event or you have been instructed to restart this process by support personnel.

This command can take up to 2 minutes to complete.

Are you sure you want to proceed? {y|n}: y

Restart successful! Security services are operating correctly.

12. From the RDP machine close the cifs share \\studentX opened in windows explorer


Exercise 2: Authentication issues Time Estimate: 30 minutes

Step Action

1. From the RDP machine access the cifs share \\studentX

Start->Run->\\studentX

What error message do you see?

2. Refer to step 1 of exercise 1 and

Find the node where the IP(s) for vserver studentX is hosted

Login to the console of that node and execute the steps of this exercise

From the clustershell of the node, run the following commands:

::> set diag

::*> diag secd authentication translate -node local -vserver studentX -win-name <your windows username>

::*> diag secd authentication sid-to-uid -node local -vserver studentX -sid <sid from previous command>

::*> diag secd authentication show-creds -node local -vserver studentX -win-name <username>

Does the user seem to be functioning properly? If not, what error do you get?

3. Run the following command:

::> event log show

What message do you see?

4. Run the following command:

::> diag secd name-mapping show -node local -vserver student1 -direction win-unix -name <your windows username>

::> vserver name-mapping show -vserver studentX -direction win-unix -position *

::> cifs options show -vserver studentX

5. Which log in systemshell can we look at to see errors for this problem?

6. What issues did you find?

7. cluster1::*> unix-user create -vserver studentX -user pcuser -id 65534 -primary-gid 65534

(vserver services unix-user create)

cluster1::*> cifs option modify -vserver studentX -default-unix-user pcuser

8. The Windows Explorer window which opens when you navigate to Start->Run->\\studentX shows 2 shares:

a) studentX

b) studentX_child

Try to access the shares

What happens?

Do the following:

Enable debug logging for secd on the node that owns your data lifs

cluster1::*> diag secd log set -node local -level debug

Setting log level to "Debug"

cluster1::*> trace set -node local -trace-all yes

(diag secd trace set)

Trace spec set successfully for trace-all.

Close the CIFS session on the Windows host and run “net use /d *” from cmd to clear cached sessions and retry the connection


Enter systemshell and cd to /mroot/etc/mlog

Type “tail -f secd.log”

What do you see?

9. Given the results of the previous tests, what could the issue be here?

10. From ngsh (clustershell) run:

cluster1::> vserver show -vserver studentX -fields rootvolume

vserver rootvolume

-------- -------------

studentX studentX_root

The rootvolume value shown above (studentX_root) is the root volume of the vserver you are accessing.

cluster1::>vserver cifs share show -vserver studentX -share-name studentX

Vserver: studentX

Share: studentX

CIFS Server NetBIOS Name: STUDENTX

Path: /studentX_cifs

Share Properties: oplocks

browsable

changenotify

Symlink Properties: -

File Mode Creation Mask: -

Directory Mode Creation Mask: -

Share Comment: -

Share ACL: Everyone / Full Control

File Attribute Cache Lifetime: -

cluster1::*> vserver cifs share show -vserver studentX -share-name studentX_child


Vserver: studentX

Share: studentX_child

CIFS Server NetBIOS Name: STUDENTX

Path: /studentX_cifs_child

Share Properties: oplocks

browsable

changenotify

Symlink Properties: -

File Mode Creation Mask: -

Directory Mode Creation Mask: -

Share Comment: -

Share ACL: Everyone / Full Control

File Attribute Cache Lifetime: -

From the above commands obtain the name of the volumes being accessed via the shares

11. Now that you know the volumes you are trying to access, use fsecurity show to view the permissions on them.

cluster1::*> vol show -vserver studentX -volume studentX_cifs -instance

Find the node on which the aggregate where studentX_cifs lives is hosted.

From node shell of that node run:

cluster1-01> fsecurity show /vol/studentX_cifs

What do you see?

cluster1::*> vol show -vserver studentX -volume studentX_cifs_child -instance

Find the node on which the aggregate where studentX_cifs_child lives is hosted.

From node shell of that node run:

cluster1-01> fsecurity show /vol/studentX_cifs_child

What do you see?

Find the node on which the aggregate where studentX_root lives is hosted.

From the node shell of that node, run:

cluster1-01> fsecurity show /vol/studentX_root

What do you see?

12. From ngsh run:

cluster1::*> volume modify -vserver studentX -volume studentX_root -unix-permissions 755

Queued private job: 167

Are you able to access both the shares now?

13. From ngsh run:

cluster1::*> volume modify -vserver studentX -volume studentX_cifs -security-style ntfs

Queued private job: 168

Does this resolve the issue?


Exercise 3: Authorization issues Time Estimate: 20 minutes

Step Action

1. From a client go Start -> Run -> \\studentX\studentX

What do you see?

2. Try to view the permissions on the share. What do you see?

3. From the nodeshell of the node where the volume and its aggregate is hosted run:

cluster1-01> fsecurity show /vol/student1_cifs

[/vol/student1_cifs - Directory (inum 64)]

Security style: NTFS

Effective style: NTFS

DOS attributes: 0x0010 (----D---)

Unix security:

uid: 0

gid: 0

mode: 0777 (rwxrwxrwx)

NTFS security descriptor:

Owner: S-1-5-32-544

Group: S-1-5-32-544

DACL:

Allow - S-1-5-21-3281022357-2736815186-1577070138-500 - 0x001f01ff (Full Control)

4. From the above command, obtain the sid of the owner of the volume.

From ngsh run:


cluster1::*> diag secd authentication translate -node local -vserver studentX -sid S-1-5-32-544

What do you see?

5. How do you resolve this issue?


Exercise 4: Export Policies Time Estimate: 20 minutes

Step Action

1. Try to access \\studentX\studentX

What do you see?

2. What error do you see?

3. What does the event log show? What about the secd log? (Exercise 2 steps 3 and 8)

4. From nodeshell of the node that hosts the volume and its aggregate run:

“fsecurity show /vol/studentX_cifs”

Do the permissions show that access should be allowed?

5. From clustershell obtain the name of the export-policy associated with the volume as follows:

cluster1::> volume show -vserver studentX -volume studentX_cifs -fields policy

Now view details of the export-policy obtained in the previous command

cluster1::> export-policy rule show -vserver studentX -policyname <policy name obtained from the above command>

cluster1::> export-policy rule show -vserver studentX -policyname <policy name obtained from the above command> -ruleindex <rule index applicable>

What do you see?

How do you fix the issue?


MODULE 6: SCALABLE SAN

Exercise 1: Enable SAN features and create a LUN and connect via ISCSI Time Estimate: 20 minutes

Step Action

1. Review your SAN configuration on the cluster.

- Licenses

- SAN protocol services

- Interfaces

2. Create a lun in your studentX_san volume.

3. Create an igroup and add the ISCSI IQN of your host to the group.

4. Configure the ISCSI initiator

5. Map the lun and access from lab host. Format the lun and write data to it.
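One possible command sequence for steps 2, 3, and 5 (a sketch only; the LUN name, igroup name, size, and ostype values are assumptions for this lab, so check each command's man page and substitute the IQN reported by your Windows iSCSI initiator):

cluster1::> lun create -vserver studentX -volume studentX_san -lun lun1 -size 100m -ostype windows_2008

cluster1::> lun igroup create -vserver studentX -igroup studentX_ig -protocol iscsi -ostype windows -initiator <IQN of your Windows host>

cluster1::> lun map -vserver studentX -path /vol/studentX_san/lun1 -igroup studentX_ig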

6. From clustershell

cluster1::*> iscsi show

What do you see?

cluster1::*> debug seqid show

What do you see?

7.

1. Locate the UUIDs of your iSCSI LIFs:

::> debug smdb table vifmgr_virtual_interface show -lif-name <iscsi_lif>

2. Display the statistics for these LIFs:

cluster1::statistics*> show -node cluster1-01 -object iscsi_lif -counter iscsi_read_ops -instance <UUID obtained from the above command>


EXERCISE 2

TASK 1: TROUBLESHOOT QUORUM ISSUES

In this task, you experience quorum failure on a node of the cluster.

STEP ACTION

1. Team member 1 login to console of clusterY-01 as admin

From here on this will be referred to as Node1

2. Team member 2 login to console of clusterY-02 as admin

From here on this will be referred to as Node2

3. Team member 1 on the Node 1 console ngsh

::> set diag

4. Team member 2 on the Node 2 console ngsh

::> set diag

5. Team member 2 on the Node 2 ngsh , verify cluster status

::*> cluster show

6. Team member 2 on the Node 2 ngsh, view the current LIFs:

::*> net int show

7. Team member 2 on the Node 2 ngsh, view the current cluster kernel status:

::*> cluster kernel-service show -instance

8. Team member 2 on the Node 2 ngsh, bring down the cluster network LIFs on the interface:

::*> net int modify -vserver clusterY-02 -lif clus1,clus2 -status-admin down


STEP ACTION

9. Team member 2 on the Node 2 ngsh, view the current cluster kernel status:

::*> cluster kernel-service show -instance

10. Team member 1 on the Node 1 ngsh, view the current cluster kernel status:

::*> cluster kernel-service show -instance

11. On the Node 2 PuTTY interface, enable the cluster network LIFs on the interface:

::*> net int modify -vserver cluster1-02 -lif clus1,clus2 -status-admin up

12. Team member 2 on the Node 2 ngsh, view the current cluster kernel status:

::*> cluster kernel-service show -instance

What do you see?

13. Team member 1 on the Node 1 ngsh, view the current cluster kernel status:

::*> cluster kernel-service show -instance

What do you see?

14. cluster1::*> debug smdb table bcomd_info show

What do you see?


STEP ACTION

15. Team member 1 on the Node 1 ngsh, view the current bcomd information:

cluster1::*> debug smdb table bcomd_info show

What do you see?

16. Team member 2 reboot Node2 to have it start participating in SAN quorum again:

::*> reboot -node clusterY-02

17. Team member 2 console log in on Node2 as admin

18. Team member 2 on Node2, verify cluster health:

::> cluster show

19. Team member 2 on Node2

::> set diag

20. Verify the cluster kernel to verify both nodes have a status of in quorum (INQ):

::*> cluster kernel-service show -instance

::*> debug smdb table bcomd_info show


TASK 2: TROUBLESHOOT LOGICAL INTERFACE ISSUES

In this task, you bring down the LIFs that are associated with a LUN.

STEP ACTION

1. Console login as admin on clusterY-0X, view the current LIFs:

::*> net int show

2. On your own, disable LIFs that are associated with studentX_iscsi and determine how this action impacts connectivity to your LUN on the Windows host.
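A minimal sketch of step 2, assuming a single iSCSI data LIF whose name you take from net int show (the LIF name is a placeholder):

::*> net int show -vserver studentX

::*> net int modify -vserver studentX -lif <your iscsi lif> -status-admin down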

END OF EXERCISE

Exercise 3: Diag level SAN debugging Time Estimate: 25 minutes

Step Action

1. What are two ways we can see where the nvfail option is set on a volume?

2. How would we clear an nvfail state if we saw it?

3. How would we show virtual disk object information for a lun?

4. How do you manually dump a rastrace?


MODULE 7: SNAPMIRROR

Exercise 1: Setting up Intercluster SnapMirror Time Estimate: 20 minutes

Step Action

1. From clustershell of cluster1 run:

cluster1::> snapmirror create -source-path cluster1://student1/student1_snapmirror -destination-path cluster2://student3/student3_dest -type DP -tries 8 -throttle unlimited

Error: command failed: Volume "cluster2://student3/student3_dest" not found.

(Failed to contact peer cluster with address 192.168.81.193. No

intercluster LIFs are configured on this node.)

2. From clustershell of cluster1 run:

::>set diag

cluster1::*> cluster peer address stable show

What do you see?

cluster1::*> net int show -role intercluster

What do you see?

cluster1::*> cluster peer show -instance

What do you see?

cluster1::*> cluster peer show health -instance

What do you see?

3. Run the following command:


::*> cluster peer ping -type data

What do you see?

4. Run the following command:

::*> cluster peer ping -type icmp

What do you see now? What addresses, if any, seem to be having issues?

5. Run the following command:

::> job history show -event-type failed

What jobs are failing?

To examine why they are failing:

cluster1::*> event log show -node cluster1-01 -messagename cpeer*

Why are the jobs failing?

6. Try to modify the cluster peer. What happens?

cluster1::*> cluster peer modify -cluster cluster2 -peer-addrs 192.168.81.193,192.168.81.194 -timeout 60

7. How did you resolve the issue?


Exercise 2: Intercluster DP mirrors Time Estimate: 30 minutes

Step Action

1. From the clustershell of cluster1, run:

cluster1::*> snapmirror create -source-path cluster1://student1/student1_snapmirror -destination-path cluster2://student3/student3_dest -type DP -tries 8 -throttle unlimited

What error do you see? What might be going wrong?

2. From clustershell of cluster2 run:

cluster2::> snapmirror create -source-path cluster1://student1/student1_snapmirror -destination-path cluster2://student3/student3_dest -type DP -tries 8 -throttle unlimited

What do you see? Why?

3. After correcting the issue, run the following command in clustershell of cluster2:

cluster2::> snapmirror create -source-path cluster1://student1/student1_snapmirror -destination-path cluster2://student3/student3_dest -type DP -tries 8 -throttle unlimited

Does the command complete?

How do you verify the snapmirror exists?


::>snapmirror show

What do you see? Is the snapmirror functioning?

How do you get the mirror working if it’s not?


4. After the snapmirror is confirmed as functional, check to see how long it has been since the last update (snapmirror lag).
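One way to check the lag (a sketch; use the destination path you created in this module) is to look at the relationship details reported by snapmirror show:

cluster2::> snapmirror show -destination-path cluster2://student3/student3_dest -instance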


Exercise 3: LS Mirrors Time Estimate: 20 minutes

Step Action

1. Create two LS mirrors that point to your studentX_snapmirror volume.

clusterY::*> volume create -vserver studentX -volume studentX_LS_snapmirror -aggregate studentX -size 100MB -state online -type DP

[Job 265] Job succeeded: Successful

clusterY::*> volume create -vserver studentX -volume studentX_LS_snapmirror2 -aggregate studentX -size 100MB -state online -type DP

[Job 266] Job succeeded: Successful

clusterY::*> snapmirror create -source-path clusterY://studentX/studentX_snapmirror -destination-path clusterY://studentX/studentX_LS_snapmirror2 -type LS

[Job 273] Job is queued: snapmirror create the relationship with destination clu
[Job 273] Job succeeded: SnapMirror: done

clusterY::*> snapmirror create -source-path clusterY://studentX/studentX_snapmirror -destination-path clusterY://studentX/studentX_LS_snapmirror -type LS

[Job 275] Job is queued: snapmirror create the relationship with destination clu
[Job 275] Job succeeded: SnapMirror: done

What steps did you have to consider?

Check the MSIDs and DSIDs for the source and destination volumes. What do you notice?

clusterY::*> volume show -vserver studentX -fields msid,dsid

2. Attempt to initialize one of the mirrors using the “snapmirror initialize” command.

cluster1::*> snapmirror initialize -destination-path cluster1://student1/student1_LS_snapmirror

[Job 276] Job is queued: snapmirror initialize of destination cluster1://student1/student1_LS_snapmirror.


cluster1::*> snapmirror initialize -destination-path cluster1://student1/student1_LS_snapmirror2

[Job 277] Job is queued: snapmirror initialize of destination cluster1://student1/student1_LS_snapmirror2.

cluster1::*> job show

What happens? How would you view the status of the job? If it didn’t work, how would you fix it? Why didn’t it work?

cluster1::*> job history show -id 276

What do you see?

How do you fix it?

3. After initializing the LS mirrors, try to update the mirrors using “snapmirror update.”

clusterY::*> snapmirror update -destination-path clusterY://studentX/studentX_LS_snapmirror

[Job 279] Job is queued: snapmirror update of destination clusterY://studentX/studentX_LS_snapmirror.

clusterY::*> job show

What happens? How do you view the status of the job?

What is the issue?

4. Run the following command:

::> vol show -vserver studentX -fields junction-path

What do you see?


Mount the volume from the cluster shell.

::> vol mount -vserver studentX -volume studentX_snapmirror -junction-path /student1_snapmirror

What do you see?

Run the following:

::> vol show -vserver studentX -fields junction-path

What do you see now?

Then remount the volume to a new junction path “studentX_snapmirror.”

::> vol mount -vserver studentX -volume studentX_snapmirror -junction-path /studentX_snapmirror

Now what do you see?

Unmount the volume from the cluster shell.

::> vol unmount -vserver studentX -volume studentX_snapmirror

Run the following:

::> vol show -vserver studentX -fields junction-path

What do you see now?

Then remount the volume to a new junction path “studentX_snapmirror.”

::> vol mount -vserver studentX -volume studentX_snapmirror -junction-path /studentX_snapmirror

Now what do you see?

5. clusterY::*> snapmirror update-ls-set -source-path clusterY://studentX/studentX_snapmirror

clusterY::*> snapmirror update-ls-set -source-path clusterY://studentX/studentX_root

clusterY::*> volume modify -vserver studentX -volume studentX_snapmirror -unix-permissions 000

clusterY::*> volume show -vserver studentX -fields unix-permissions

What do you see?


Mount the volume from your Linux host using -o nfsvers=3:

[root@nfshost DATAPROTECTION]# mount -o nfsvers=3 student1:/student1_snapmirror /cmode

[root@nfshost DATAPROTECTION]# cd /cmode

[root@nfshost cmode]# ls

[root@nfshost cmode]# cd

[root@nfshost ~]# ls -latr /cmode

Now execute:

[root@nfshost ~]# umount /cmode

From clustershell run:

clusterY::*> snapmirror update-ls-set -source-path clusterY://studentX/studentX_snapmirror

From Linux Host run:

[root@nfshost ~]# mount -o nfsvers=3 student1:/student1_snapmirror /cmode

[root@nfshost ~]# ls -latd /cmode

What do you see?

Modify the volume back to 777 on the cluster (using vol modify)

clusterY::*> volume modify -vserver studentX -volume studentX_snapmirror -unix-permissions 777

Queued private job: 162

Check permissions on the unix host again.

[root@nfshost ~]# ls -latd /cmode

ls: /cmode: Permission denied

[root@nfshost ~]# cd /cmode

What do you see?

Are you able to cd into the mount now?

Update the LS mirror set.

clusterY::*> snapmirror update-ls-set -source-path clusterY://studentX/studentX_snapmirror

What do you see in ls on the host? Why?

Modify the source volume to 000

clusterY::*> volume modify -vserver studentX -volume studentX_snapmirror -unix-permissions 000

Queued private job: 163

What do you see in ls on the host? Why?