cmode explained latest
DESCRIPTION
Cmode Explained Latest TRANSCRIPT
7/17/2019 Cmode Explained Latest
http://slidepdf.com/reader/full/cmode-explained-latest 1/546
© 2011 NetApp. All rights reserved. 1
In 1992, NetApp introduced Data ONTAP and ushered in the network-attached storage industry. Since
then, NetApp has continued to add features and solutions to its product portfolio to meet the needs of its
customers. In 2004, NetApp acquired Spinnaker Networks® in order to fold its scalable Clustered file
system technology into Data ONTAP. That plan came to fruition in 2006 as NetApp released Data ONTAP
GX, the first Clustered product from NetApp. NetApp also continued to enhance and sell Data ONTAP 7G.
Having two products provided a way to meet the needs of the NetApp customers who were happy with the
classic Data ONTAP, while allowing customers with certain application requirements to use Data ONTAP
GX to achieve even higher levels of performance, with the flexibility and transparency afforded by its
scale-out architecture.
Although the goal was always to merge the two products into one, the migration path for Data ONTAP 7G
. . .
goal for Data ONTAP 8.0 was to create one code line that allows Data ONTAP 7G customers to operate a
Data ONTAP 8.0 7-Mode system in the manner in which they’re accustomed, while also providing a first
step in the eventual move to a Clustered environment. Data ONTAP 8.0 Cluster-Mode allows Data ONTAP
GX customers to upgrade and continue to operate their Clusters as they’re accustomed.
The direct link to the “What is a cluster” VOD is available at:
http://netappenged.vportal.net/?auid=1000
Vserver - A vserver is an object that provides network access through unique network addresses, that may
serve data out of a distinct namespace, and that is separately administrable from the rest of the cluster.
There are three types of vservers: cluster, admin, and node.
Cluster Vserver - A cluster vserver is the standard data-serving vserver in Cluster-Mode. It is the
successor to the vserver of GX. It has both data and (optional) admin LIFs, and also owns a namespace.
Its LIFs can live on separate virtual networks from those of other vservers.
Admin Vserver - Previously called the "C-server", the admin vserver is a special vserver that does not
provide data access to clients or hosts. However, it has overall administrative access to all objects in the
cluster, including all objects owned by other vservers.
Node Vserver - A node vserver is restricted to operation in a single node of the cluster at any one time,
and provides administrative and data access to 7-Mode objects owned by that node. The objects owned by
a node vserver will fail over to a partner node when takeover occurs. The node vserver is equivalent to the
pfiler, also known as vfiler0, on a particular node. In 7G systems, it is commonly called the "filer".
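The three vserver types can be seen from the clustershell. The following session is an illustrative sketch (the cluster and node names are hypothetical, and the output layout is approximate, not taken from the original slides):

```
cluster1::> vserver show
Vserver      Type
------------ -------
cluster1     admin
cluster1-01  node
cluster1-02  node
vs1          cluster
```

Note that the admin vserver carries the cluster's name, the node vservers mirror the node names, and only the cluster vservers are created explicitly by the administrator.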
This example shows many of the key resources in a cluster. There are three types of virtual servers,
plus nodes, aggregates, volumes, and namespaces.
Notice the types of vservers. Each node in the Cluster automatically has a node vserver created to
represent it. The administration vserver is automatically created when the Cluster is created. The Cluster
vservers are created by the administrator to build global namespaces.
Physical things can be touched and seen, like nodes, disks, and ports on those nodes.
Logical things cannot be touched, but they do exist and take up space. Aggregates are logical groupings of
disks. Volumes, Snapshot copies, and mirrors are areas of storage carved out of aggregates. Clusters are
groupings of physical nodes. A virtual server is a virtual representation of a resource or group of resources.
A logical interface is an address that is associated with a single network port.
A cluster, which is a physical entity, is made up of other physical and logical pieces. For example, a cluster
is made up of nodes, and each node is made up of a controller, disks, disk shelves, NVRAM, etc. On the
disks are RAID groups and aggregates. Also, each node has a certain number of physical network ports,
each with its own MAC address.
Please refer to your Exercise Guide for more instructions.
Cluster-Mode supports V-Series systems. As such, the setup will be a little different when using V-Series.
Each controller should have a console connection, which is needed to get to the firmware and to get to the boot menu
(for the setup, install, and init options, for example). A Remote LAN Module (RLM) connection, although not required,
is very helpful in the event that you cannot get to the UI or console. It allows for remote rebooting and forcing core
dumps, among other things.
Each node must have at least one connection (ideally, two connections) to the dedicated cluster network. Each node
should have at least one data connection, although these data connections are only necessary for client access.
Because the nodes will be clustered together, it’s possible to have a node that participates in the cluster with its
storage and other resources, but doesn’t actually field client requests. Typically, however, each node will have data
connections.
The management and data connections must be on a network that is distinct from the cluster network.
There is a large amount of cabling to be done with a Data ONTAP 8.0 cluster. Each node has NVRAM
interconnections to its HA partner, and each node has Fibre Channel connections to its disk shelves and to those of its
HA partner.
This is standard cabling, and is the same as Data ONTAP GX and 7-Mode.
For cabling the network connections, the following must be taken into account:
•Each node is connected to at least two distinct networks: one for management (UI) and data access (clients), and one
for intra-cluster communication. Ideally, there would be at least two cluster connections to each node in order to create
redundancy and improve cluster traffic flow.
•The cluster can be created without data network connections but not without a cluster network connection.
•Having more than one data network connection to each node creates redundancy and improves client traffic flow.
To copy flash0a to flash0b, run flash flash0a flash0b. To “flash” (put) a new image onto the primary flash, you
must first configure the management interface. The -auto option of ifconfig can be used if the management
network has a DHCP/BOOTP server. If it doesn’t, you’ll need to run ifconfig <interface> -addr=<ip>
-mask=<netmask> -gw=<gateway>. After the network is configured, make sure you can ping the IP address of the
TFTP server that contains the new flash image. To then flash the new image, run flash
tftp://<tftp_server>/<path_to_image> flash0a.
The environment variables for Cluster-Mode can be set as follows:
•set-defaults
•setenv ONTAP_NG true
•setenv bootarg.init.usebootp false
•setenv bootarg.init.boot_clustered true
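Putting the steps on this page together, a firmware session might look like the following sketch. The interface name and addresses are placeholders, and the image path is left as a placeholder as in the text above:

```
LOADER> ifconfig e0M -addr=192.168.1.10 -mask=255.255.255.0 -gw=192.168.1.1
LOADER> ping 192.168.1.50
LOADER> flash tftp://192.168.1.50/<path_to_image> flash0a
LOADER> set-defaults
LOADER> setenv ONTAP_NG true
LOADER> setenv bootarg.init.usebootp false
LOADER> setenv bootarg.init.boot_clustered true
```

Note that set-defaults clears previously set variables, so it is run first and the Cluster-Mode variables are set afterward.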
ONTAP 8.0 uses an environment variable to determine which mode of operation to boot with. For Cluster-Mode the
correct setting is:
LOADER> setenv bootarg.init.boot_clustered true
If the environment variable is unset, the controller will boot up in 7-Mode.
The time it takes to initialize the disks is based on the size of one of the disks, not on the sum capacity of the disks,
because all disks are initialized in parallel with each other. Once the disks are initialized, the node’s first aggregate and
its vol0 volume will be automatically created.
After the reboot, if the node stops at the firmware prompt by itself (which will happen if the firmware environment
variable AUTOBOOT is set to false), type boot_primary to allow it to continue to the boot menu. If AUTOBOOT is set
to true, the node will go straight to the boot menu.
When using TFTP, beware of older TFTP servers that have limited capabilities and may cause installation failures.
The setup option on the boot menu configures the local information about this node, such as the host name,
management IP address, netmask, default gateway, DNS domain and servers, and so on.
Autoconfig is still somewhat inflexible: it doesn't allow you to choose the host names of the nodes; only two
cluster ports can be configured; the cluster ports are fixed (always the same); and the cluster IPs are out of sequence.
As such, NetApp recommends that cluster joins be done manually.
The first node in the cluster will perform the "cluster create" operation. All other nodes will perform a "cluster join"
operation. Creating the cluster also defines the cluster-management LIF. The cluster-management LIF is an
administrative interface used for UI access and general administration of the cluster. This interface can failover to data-
role ports across all the nodes in the cluster, using pre-defined failover rules (clusterwide).
The cluster network is an isolated, non-routed subnet or VLAN, separate from the data or management networks, so
using non-routable IP address ranges is common and recommended.
Using 9000 MTU on the cluster network is highly recommended for performance and reliability reasons. The cluster
switch or VLAN should be modified to accept 9000-byte payload frames prior to attempting the cluster join/create.
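As an illustration of the MTU recommendation, the cluster ports on each node could be checked and adjusted with something like the following. The node and port names are hypothetical; verify the exact syntax against your Data ONTAP release:

```
cluster1::> network port show -node node1 -role cluster
cluster1::> network port modify -node node1 -port e1a -mtu 9000
```

The switch side must be changed to accept jumbo frames as well, or cluster traffic will suffer.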
After a cluster has been created with one node, the administrator must invoke the cluster join command on each
node that is going to join the cluster. To join a cluster, you need to know a cluster IP address of one of the nodes in the
cluster, and you need some information that is specific to this joining node.
The cluster join operation ensures that the root aggregates are uniquely named. During this process, the first root
aggregate will remain named "aggr0", and subsequent node root aggregates will have the host name appended to the
aggregate name, as in "aggr0_node01". For consistency, the original aggregate should also be renamed to match the
naming convention, or renamed per customer requirements.
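For example, renaming the first root aggregate to follow the same convention might look like this sketch. The names are illustrative; verify the exact syntax for your release:

```
cluster1::> storage aggregate rename -aggregate aggr0 -newname aggr0_node01
```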
When the storage controllers that were unjoined from the cluster are powered back on, they will display information
about the cluster to which they previously belonged.
The base Cluster-Mode license is fixed and cannot be installed as a temporary/expiring license. The base license
determines the cluster serial number and is generated for a specific node count (as are the protocol licenses). The
base license can also be installed on top of an existing base license as additional node counts are purchased. If a
customer purchases a 2-node upgrade for their current 2-node cluster, they will need a 4-node base license for the
given cluster serial number. The licenses are indexed on the NOW site by the *cluster* serial number, not the node
serial number.
By default, there are no feature licenses installed on an 8.0 Cluster-Mode system as shipped from the factory. The
cluster create process installs the base license, and all additional purchased licenses can be found on the NOW site.
The controllers default to the GMT time zone. Modify the date, time, and time zone using the system date command.
While configuring NTP is not a hard requirement for NFS-only environments, it is for a cluster with the CIFS
protocol enabled, and it is a good idea in most environments. If there are time servers available in the customer
environment, the cluster should be configured to sync to them.
Time synchronization can take some time, depending on the skew between the node time and the reference clock
time.
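As a sketch, checking the time and pointing a node at a customer time server might look like the following. The server address is a placeholder, and command availability varies by release:

```
cluster1::> system date show
cluster1::> system services ntp server create -node node1 -server 10.0.0.20
```

Expect the clocks to converge gradually rather than immediately, for the skew reasons noted above.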
Please refer to your Exercise Guide for more instructions.
Although the CLI and GUI interfaces are different, they both provide access to the same information, and both have the
ability to manage the same resources within the cluster. All commands are available in both interfaces. This will always
be the case because both interfaces are generated from the same source code that defines the command hierarchy.
The hierarchical command structure is made up of command directories and commands. A command directory may
contain commands and/or more command directories, similar to a typical file system directory and file structure.
For example, all storage-related things fall somewhere within the storage command directory. Within that directory,
there are directories for disk commands and aggregate commands. The command directories provide the context that
allows similar commands to be used for different objects. For example, all objects/resources are created using a
create command, and removed using a delete command, but the commands are unique because of the context
(command directory) in which they're used. So, storage aggregate create is different from network interface create.
There is a cluster login by way of the cluster management LIF. There is also a login capability for each node by way
of the node management LIF.
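As a quick illustration of context, the same create verb appears in different command directories, and the question mark shows each command's own parameters (output omitted here; the prompt name is hypothetical):

```
cluster1::> storage aggregate create ?
cluster1::> network interface create ?
```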
The preferred way to manage the cluster is to log in to the clustershell by way of the cluster management LIF IP
address, using SSH. If a node is experiencing difficulties and cannot communicate with the rest of the cluster, the
node management LIF of that node can be used. And if the node management LIF cannot be used, then the Remote LAN
Module (RLM) interface can be used.
This diagram shows the software stack making up Data ONTAP 8.0 Cluster-Mode. The most obvious difference
between this stack and the 7-Mode stack is the addition of a networking component called the N-blade, and more
logical interfaces (LIFs). Also, notice that Cluster-Mode does not yet support the SAN protocols (FC and iSCSI).
The N-blade is the network blade. It translates between the client protocols (NFS and CIFS) and the SpinNP protocol
that the D-blade uses. SpinNP is the protocol used within a cluster to communicate between N-blades and D-blades.
In Cluster-Mode, the D-blade does not service NAS or SAN protocol requests.
Data ONTAP GX had one management virtual interface on each node. Cluster-Mode still has that concept, but it’s
called a “node management” LIF. Like the management interfaces of Data ONTAP GX, the node management LIFs do
not fail over to other nodes.
Cluster-Mode introduces a new management LIF, called the “cluster management” LIF, that has failover and migration
capabilities. The reason for this is so that, regardless of the state of each individual node (rebooting after an
upgrade, for example), the cluster can still be administered through a single interface, and the
current node location of that LIF is transparent.
The two “mgmt1” LIFs that are shown here are the node management LIFs, and are each associated with their
respective node virtual servers (vservers).
The one cluster management LIF, named “clusmgmt” in this example, is not associated with any one node vserver, but
rather is associated with the admin vserver, called “hydra,” which represents the entire physical cluster.
The nodeshell is accessible only via run -node from within the clustershell. It has visibility to only those objects
that are attached to the given controller, such as hardware, disks, aggregates, volumes, and things inside volumes
like Snapshot copies and qtrees. Both 7-Mode and Cluster-Mode volumes on that controller are visible.
In these examples, the hostname command was invoked from the UI of one node, but actually executed on the other
node. In the first example, the command was invoked from the clustershell. In the second example, the administrator
entered the nodeshell of the other node, and then ran the command interactively.
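The two ways of reaching the nodeshell described above might look like this sketch: the first form runs a single command and returns to the clustershell, while the second enters an interactive nodeshell session. Node names are hypothetical:

```
cluster1::> run -node node2 hostname
cluster1::> run -node node2
node2> hostname
node2> exit
```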
The FreeBSD shell is only to be used internally for ONTAP development, and in the field for emergency purposes
(e.g., system diagnostics by trained NetApp personnel). All system administration and maintenance commands must
be made available to customers via the cluster shell.
Access to the systemshell is not needed as much as it was in Data ONTAP GX because many of the utilities that only
ran in the BSD shell have now been incorporated into the clustershell.
But there are still some reasons why the systemshell may need to be accessed. You can no longer log in to a node or
the cluster as “root” and be placed directly into the systemshell. Access to the systemshell is limited to a user named
“diag,” and the systemshell can only be entered from within the clustershell.
The FreeBSD shell is accessible via the diag user account.
FreeBSD access has no default password, and the diag account is disabled by default. The account can only be
enabled by the customer by explicitly setting a password from a privileged ONTAP account.
The diag password has two states:
blocked: there is no password and no one can log in as diag
enabled: there is a password and one can log in as diag
By default, diag is blocked. This default applies both on standalone nodes and in clusters.
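Enabling diag access might look like the following sketch. The exact sequence and privilege level vary by release, so treat this as an assumption to verify, not a definitive procedure:

```
cluster1::> security login password -username diag
cluster1::> set -privilege diag
cluster1::*> systemshell -node node1
```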
Element Manager is the web-based user interface for administration of the cluster. All operations that can be
done using the CLI, ZAPI, and so on can be done using this interface.
To use Element Manager, point a web browser to the URL http://<cluster_management_ip>/
SMF and RDB provide the basis for single system image administration of a cluster in the
M-host. SMF provides the basic command framework and the ability to route commands to
different nodes within the cluster. RDB provides the mechanism for maintaining cluster-wide
data.
Please refer to your Exercise Guide for more instructions.
The clustershell has features similar to the tcsh shell that is popular on UNIX® machines, such as the ability to pull
previous commands out of a command history buffer, then optionally edit those commands and reissue them. The
command editing is very similar to tcsh and Emacs editing, with key combinations like Ctrl-a and Ctrl-e to move the
cursor to the beginning and end of a command, respectively. The up and down arrows allow for cycling through the
command history.
Simple online help also is available. The question mark (?) can be used almost anywhere to get help within whatever
context you may find yourself. Also, the Tab key can be used in many of the same contexts to complete a command or
parameter in order to reduce the amount of typing you have to do.
The clustershell uses named parameters for every command.
Every command directory, command, and parameter can be abbreviated to the extent that it remains unambiguous
within that context. For example, from the top level, the storage aggregate show command can be abbreviated
to as short as sto a s. On the other hand, the network interface show command can be abbreviated as n
i s.
Commands can be run out of context. If we’re at the top level of the command hierarchy and type disk show, the
shell will run the storage disk show command, because it was able to resolve the disk command as being
unique within the whole command hierarchy. Likewise, if you simply type disk and hit ENTER, you’ll be put into the
storage disk command directory. This will work even if you’re in an unrelated command directory, say in the
network interface directory.
The clustershell supports queries and UNIX-style patterns and wildcards to enable you to match multiple values of
particular parameters. A simple example: if you have a naming convention for volumes, such that every
volume owned by the Accounting department is named with a prefix of "acct_", you could show only those volumes
using volume show -vserver * -volume acct_*. This will show you all volumes beginning with "acct_" on
all vservers. If you want to further limit your query to volumes that have more than 500 GB of data, you could do
something like: volume show -vserver * -volume acct_* -used >500gb.
These are the command directories and commands available at the top level of the command hierarchy.
This demonstrates how the question mark is used to show the available commands and command directories at any
level.
This demonstrates how the question mark is used to show the required and optional parameters. It can also be used to
show the valid keyword values that are allowed for parameters that accept keywords.
The Tab key can be used to show other directories, commands, and parameters that are available, and can complete
a command (or a portion of a command) for you.
This is the initial page that comes up when logging into the Element Manager. It’s a dashboard view of the
performance statistics of the entire cluster. The left pane of the page contains the command directories and
commands. When there is a “+” beside a word, it can be expanded to show more choices. Not until you click an object
at the lowest level will the main pane switch to show the desired details.
Notice the expansion of the STORAGE directory in the left pane.
This shows the further expansion of the aggregate directory within the STORAGE directory. The main pane continues
to show the Performance Dashboard.
After selecting “manage” on the left pane, all the aggregates are listed. Notice the double arrow to the left of each
aggregate. Clicking that will reveal a list of actions (commands) that can be performed on that aggregate.
This shows what you see when you click the arrow for an aggregate to reveal the storage aggregate commands. The
“modify” command for this particular aggregate is being selected.
The “modify” action for an aggregate brings up this page. You can change the state, the RAID type, the maximum
RAID size, or the high-availability policy. Also, from the “Aggregate” drop-down menu, you can select a different
aggregate to work on without going back to the previous list of all the aggregates.
This shows the set adv command (short for set -privilege advanced) in the clustershell. Notice the options
available in the storage directory before (at the admin privilege level) and after (at the advanced privilege level),
where firmware becomes available.
Note that the presence of an asterisk in the command prompt indicates that you are not currently at the admin
privilege level.
This page, selected by clicking PREFERENCES on the left pane, is how you would change the privilege level from
within the GUI.
The privilege level is changed only for the user and interface in which this change is made, that is, if another admin
user is using the clustershell, that admin user’s privilege level is independent of the level in use here, even if both
interfaces are accessing the same node.
Please refer to your Exercise Guide for more instructions.
Here is an example of a FAS3040 or FAS3070 controller. Use this as a reference, but keep in mind that as new cards
are supported, some of this could change.
Here is an example of a FAS6070 or FAS6080 controller. Use this as a reference, but keep in mind that as new cards
are supported, some of this could change.
What are the characteristics of a cluster?
-- A collection of nodes consisting of one or more HA pairs
-- Each node connected to other nodes via redundant 10GbE cluster network
-- The cluster as a whole offering NetApp unified storage in a single namespace
-- Administered as a single unit, with delegation of virtual servers
This is the back of a typical disk shelf. Here, we’re highlighting the in and out ports of loop A (top) and loop B (bottom).
The following example shows what the storage show disk -port command output looks like for an SFO
configuration that does not use redundant paths:

Primary         Port Secondary       Port Type   Shelf Bay
--------------- ---- --------------- ---- ------ ----- ---
node2a:0a.16    A    -               -    FCAL   1     0
node2a:0a.17    A    -               -    FCAL   1     1
node2a:0a.18    A    -               -    FCAL   1     2
node2a:0a.19    A    -               -    FCAL   1     3
node2a:0a.20    A    -               -    FCAL   1     4
node2a:0a.21    A    -               -    FCAL   1     5
.
.
.
node2a:0b.21    B    -               -    FCAL   1     5
node2a:0b.22    B    -               -    FCAL   1     6
node2a:0b.23    B    -               -    FCAL   1     7
Multipath HA Storage enhances data availability and performance for active/active system configurations. It is highly
recommended for customers who want to avoid unnecessary failovers resulting from storage-related faults. By
providing redundant paths, Multipath HA Storage avoids controller failover due to storage faults from shelf I/O
modules, cables, and disk HBA failures.
Multipathing is supported on ESH2, ESH4, and AT-FCX disk shelves. If the shelf modules are not of these types,
upgrade them before proceeding. If there are no free HBAs on the node, add additional HBAs.
Use the following procedure to dual-path each loop. This can be done while the node is online.
1. Insert optical connectors into the out connection on both the A and B modules on the last shelf in the loop.
2. Determine whether the node head is plugged into the A or the B module of the first shelf, then connect a cable
from a different host adapter on the node to the opposite module on the last shelf. For example, if the node is
attached, via adapter 1, to the in port of module A on the first shelf, it should be attached, via adapter 2, to the
out port of module B on the last shelf, and vice versa.
3. Repeat step 2 for all the loops on the node.
4. Repeat steps 2 and 3 for the other node in the SFO pair.
5. Use the storage disk show -port command to verify that all disks have two paths.
As a best practice, cable shelf loops symmetrically: use the same node FC port for owner and partner to ease
administration.
Consult the appropriate ISI (Installation and Setup Instructions) for graphical cabling instructions.
The types of traffic that flow over the InfiniBand links are:
• Failover: the directives related to performing storage failover (SFO) between the two nodes, regardless of whether
the failover is:
  - negotiated (planned, in response to an administrator request)
  - non-negotiated (unplanned, in response to a dirty system shutdown or reboot)
• Disk firmware: nodes in an HA pair coordinate the update of disk firmware. While one node is updating the firmware,
the other node must not do any I/O to that disk
• Heartbeats: regular messages to demonstrate availability
• Version information: the two nodes in an HA pair must be kept at the same major/minor revision levels for all
Each node of an HA pair designates two disks in the first RAID group in the root aggregate as the mailbox disks. The
first mailbox disk is always the first data disk in RAID group RG0. The second mailbox disk is always the first parity
disk in RG0. The mroot disks are generally the mailbox disks.
Each disk, and hence each aggregate and volume built upon them, can be owned by exactly one of the two nodes in
the pair at any given time. This form of software ownership is made persistent by writing the information onto the
disk itself. The ability to write disk ownership information is protected by the use of persistent reservations. Persistent
reservations can be removed from disks by power-cycling the shelves, or by selecting Maintenance Mode while in
Boot Mode and issuing manual commands there. If the node that owns the disks is running in normal mode, it
reasserts its persistent reservations every 30 seconds. Changes in disk ownership are handled automatically by
normal SFO operations, although there are commands to manipulate them manually if necessary.
Both nodes can see, and read from, all of the disks in the pair.
However, only the node marked as that disk's current owner is allowed to write to it.
A disk's data contents are not destroyed when it is marked as unowned; only its ownership information is erased.
Unowned disks residing on a loop where owned disks exist will have ownership information automatically
applied, to guarantee that all disks on the same loop have the same owner.
To enable SFO within an HA pair, the nodes must have the Data ONTAP 7G "cf" license installed on them, and they
must both be rebooted after the license is installed. Only then can SFO be enabled on them.
Enabling SFO is done within pairs, regardless of how many nodes are in the cluster. For SFO, the HA pairs must be of
the same model; for example, two FAS3050s, two FAS6070s, and so on. The cluster itself can contain a mixture of
models, but each HA pair must be homogeneous. The version of Data ONTAP must be the same on both nodes of the
HA pair, except for the short period of time during which the pair is being upgraded. During that time, one of the nodes
is rebooted with a newer release than its partner, with the partner to follow shortly thereafter. The NVRAM cards
must be installed in the nodes, and two interconnect cables are needed to connect the NVRAM cards to each other.
Remember, this cluster is not simply the pairing of machines for failover; it's the Data ONTAP cluster.
In SFO, interface failover is separate from storage failover. During giveback, the aggregate that contains the mroot
volume of the partner node is returned first, and then the rest of the aggregates are returned one by one.
Multiple controllers are connected together to provide a high level of hardware redundancy and resilience against
single points of failure.
All controllers in an HA array can access the same shared storage back end.
NVRAM contents are mirrored between the partners, preventing data loss in the event of failure.
In the future, the HA array will likely expand to include more than two controllers.
If the node-local licenses are not installed on each node, enabling storage failover will result in an error. Verify and/or
install the appropriate node licenses, then reboot each node.
For clusters of more than two nodes, enable SFO on one node per HA pair (a reboot is required later).
CFO used to stand for “cluster failover,” but the term “cluster” is no longer being used in relation to Data ONTAP 7G or
Data ONTAP 8.0 7-Mode.
This example shows a 2-node cluster, which is also an HA pair. Notice that SFO is enabled on both nodes.
When the aggregates of one node fail over to the SFO partner node, the aggregate that contains the mroot of that node
goes too. Each node needs its mroot to boot, so when the rebooted node begins to boot, the first thing that happens is
that it signals the partner to do a sendhome of that one aggregate, and then it waits for that to happen. If SFO is
working properly, the sendhome will happen quickly, the node will have its mroot and be able to boot, and then when it
gets far enough in its boot process, the rest of the aggregates will be sent home (serially). If there are problems, the
sendhome may not complete, and the aggregates can be stuck in a transition state between the two nodes, owned by
neither node. If this happens, contact NetApp Technical Support.
The EMS log will show why the sendhome was vetoed.
Note: Epsilon can be changed from any node in the cluster.
The steps to move epsilon are as follows:
1. Mark all nodes in the cluster with -epsilon false.
2. Mark the desired node with -epsilon true.
In a 2-node cluster, the choices for RDB are:
-- Both sites required for online service (a 2/2 quorum)
-- Master/slave configuration, where one designated site is required for online operation (a (1+e)/2 quorum)
Without the epsilon node, only 1 out of 2 nodes is available, and the quorum requirement is the bare majority (1+e)/2.
That represents a single point of failure.
Both of these options suffer from some combination of availability issues, potential for data loss, and lack of full
automation. The goal must be availability, complete data integrity, and no need for human intervention, just as for
clusters of all other sizes.
Every node acts as an RDB replication site, and nodes are always sold in SFO pairs, so 2-node configurations
are going to be quite common, and the technical issue represents a practical concern: if the wrong node crashes, all
the RDB applications on the other will stay offline until it recovers.
The problem is to provide a highly available version of the RDB data replication service for the 2-node cluster, staying
online when either one of the two nodes crashes.
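The majority-plus-epsilon arithmetic above can be sketched in a few lines. This is an illustration of the quorum rule only, not RDB code:

```python
# Illustrative sketch of bare-majority quorum with a tie-breaking
# epsilon vote, as discussed for the 2-node RDB case.
def has_quorum(healthy, total, epsilon_healthy):
    """True if the healthy nodes form a strict majority; epsilon adds
    the extra half-vote that breaks ties in even-sized clusters."""
    votes = healthy + (0.5 if epsilon_healthy else 0)
    return votes > total / 2

print(has_quorum(1, 2, epsilon_healthy=True))   # True: the (1+e)/2 quorum
print(has_quorum(1, 2, epsilon_healthy=False))  # False: the epsilon node is down
print(has_quorum(2, 4, epsilon_healthy=False))  # False: an even split is no majority
```

This also shows why losing the epsilon node in a 2-node cluster takes the RDB applications offline: the surviving node alone cannot reach a majority.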
For clusters of only two nodes, the replicated database (RDB) units rely on the disks to help maintain quorum within
the cluster in the case of a node being rebooted or going down. This is enabled by configuring the 2-node HA
mechanism. Because of this reliance on the disks, SFO enablement and auto-giveback are also required by 2-node HA
and are configured automatically when 2-node HA is enabled. For clusters larger than two nodes, quorum can be
maintained without using the disks. Do not enable 2-node HA for clusters that are larger than two nodes.
Note: 2-node HA mode should be disabled on an existing 2-node cluster before joining the third and any subsequent
nodes.
Please refer to your Exercise Guide for more instructions.
The HA policy determines the takeover and giveback behavior and is set to either CFO or SFO.
CFO HA policy: CFO policy aggregates (or CFO aggregates for short) can contain 7-Mode volumes. When these
aggregates are taken over, they are available in partner mode. During giveback, all CFO aggregates are given back in
one step. This is the same as what happens during takeover and giveback on 7G. CFO aggregates can also contain
Cluster-Mode volumes, but this is not recommended, because such Cluster-Mode volumes could experience longer
outages. Cluster-Mode volumes are supported in CFO aggregates because Tricky allowed data volumes in a root
aggregate.
SFO HA policy: SFO policy aggregates (or SFO aggregates for short) can contain only Cluster-Mode volumes. They
cannot contain 7-Mode volumes. When these aggregates are taken over, they are available in local mode. This is the
same as what happens during takeover on GX. During giveback, the CFO aggregates are given back first, the partner
boots, and then the SFO aggregates are given back one aggregate at a time. This SFO aggregate giveback behavior is
the same as the non-root aggregate giveback behavior on GX.
The root aggregate has a policy of CFO in Cluster-Mode. In BR.0 Cluster-Mode, only the root aggregate can have the
CFO policy. All other aggregates will have the SFO policy.
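The giveback ordering described above (all CFO aggregates in one step, then the SFO aggregates serially) can be sketched as follows. This is an illustration of the ordering only, not Data ONTAP code, and the aggregate names are hypothetical:

```python
# Illustrative sketch of the giveback phases: CFO aggregates return
# together first (the partner then boots), then SFO aggregates
# return one at a time.
def giveback_order(aggrs):
    """aggrs: list of (name, ha_policy) tuples.
    Returns the giveback phases as a list of lists."""
    cfo = [n for n, p in aggrs if p == "CFO"]
    sfo = [n for n, p in aggrs if p == "SFO"]
    # Phase 1: all CFO aggregates at once; then one phase per SFO aggregate.
    return [cfo] + [[n] for n in sfo]

aggrs = [("root", "CFO"), ("aggr1", "SFO"), ("aggr2", "SFO")]
print(giveback_order(aggrs))  # [['root'], ['aggr1'], ['aggr2']]
```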
Here we see that each of our nodes contains three aggregates.
Cluster-Mode volumes can be flexible volumes. The flexible volumes are functionally equivalent to flexible volumes in
7-Mode and Data ONTAP 7G. The difference is in how they’re used. Because of the flexibility inherent in Data ONTAP
clusters (specifically, the volume move capability), volumes are deployed as freely as UNIX® directories and
Windows® folders to separate logical groups of data. Volumes are created and deleted, mounted and unmounted, and
moved around as needed. To take advantage of this flexibility, cluster deployments typically use many more volumes
than traditional 7G deployments.
Volumes can be moved around, copied, mirrored, and backed up.
This example shows some volumes. The name for the vserver root volume was chosen by the administrator to indicate
clearly that the volume is a root volume.
You can see that the Type values are all “RW,” which shows that these are read/write volumes, as opposed to load-
sharing (LS) mirrors or data protection (DP) mirrors. We’ll learn more about mirrors later.
Also, the difference between the Size and Available values is the amount of the volume that is used, but also reflects
some administrative space used by the WAFL® (Write Anywhere File Layout) file system, as well as Snapshot reserve
space.
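As a rough arithmetic sketch of why Available is less than Size minus used space: assuming a hypothetical 5% Snapshot reserve and ignoring WAFL metadata overhead, the figures relate as follows.

```python
# Back-of-the-envelope sketch only; the 5% Snapshot reserve is an
# assumed example value, and WAFL administrative space is ignored.
size_gb = 100.0
used_gb = 20.0
snap_reserve_pct = 5  # hypothetical reserve percentage

# The reserve is carved out of the volume before client data counts.
available_gb = size_gb - size_gb * snap_reserve_pct / 100 - used_gb
print(available_gb)  # 75.0, not the 80.0 that Size - Used would suggest
```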
For example, an explicit NFS license is now required (it was not previously with GX). Mirroring requires a
new license.
ONTAP 8.0 Cluster-Mode supports a limited subset of the 7-Mode qtree functionality. In Cluster-Mode, qtrees are
basically quota containers, not a unit of storage management.
Qtrees can be created within flexible volumes and can be configured with a security style and default or specific tree
quotas. User quotas are not supported in the 8.0 release, and backup functionality remains targeted at the volume level.
Cluster virtual servers are an integral part of the cluster architecture and the means for achieving secure multi-tenancy
and delegated administration. Each serves data out of its own namespace and has its own network identities and
administrative domains.
A cluster virtual server (vserver) ties together volumes, logical interfaces, and other objects for a namespace. No
volumes can be created until there is a cluster vserver with which to associate them.
Think of the cluster as a bunch of hardware (nodes, disk shelves, and so on). A vserver is a logical piece of that
cluster, but it is not a subset or partitioning of the nodes. It’s more flexible and dynamic than that. Every vserver can
use all the hardware in the cluster, and all at the same time.
Here is a simple example: a storage provider has one cluster and two customers, ABC Company and XYZ Company.
A vserver can be created for each company. The attributes that are related to specific vservers (volumes, LIFs,
mirrors, and so on) can be managed separately, while the same hardware resources can be used for both. One
company can have its own NFS server, while the other can have its own NFS and CIFS servers, for example.
There is a one-to-many relationship between a vserver and its volumes. The same is true for a vserver and its data
LIFs. Cluster vservers can have many volumes and many data LIFs, but those volumes and LIFs are associated only
with this one cluster vserver.
Please note that this slide is a representation of logical concepts and is not meant to show any physical relationships.
For example, the objects shown as part of a vserver are not necessarily on the same physical node of the cluster.
In fact, that would be very unlikely.
This slide shows four distinct vservers and namespaces. Although the hardware is not shown, these four vservers
could be living within a single cluster. The namespaces are not actually separate entities of the vservers, but are shown
merely to indicate that each vserver has a namespace. The volumes, however, are separate entities. Each volume is
associated with exactly one vserver. Each vserver has one root volume, and some have additional volumes. Although a
vserver may have only one volume (its root volume), in real life a vserver is more likely to be made up of a number
of volumes, possibly thousands. Typically, a new volume is created for every distinct area of storage. For example,
every department and/or employee may have its own volume in a vserver.
A namespace is simply a file system. It is the external (client-facing) representation of a vserver, and is made up of
volumes that are joined together through junctions. Each vserver has exactly one namespace, and the volumes in one
vserver cannot be seen by clients that are accessing the namespace of another vserver. The namespace provides the
logical arrangement of the NAS data available in the vserver.
These nine volumes are mounted together via junctions. All volumes must have a junction path (mount point) to be
accessible within the vserver's namespace.
Volume R is the root volume of a vserver. Volumes A, B, C, and F are mounted to R through junctions. Volumes D and
E are mounted to C through junctions. Likewise, volumes G and H are mounted to F.
Every vserver has its own root volume, and all non-root volumes are created within a vserver. All non-root volumes are
mounted into the namespace, relative to the vserver root.
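Junction-based namespaces behave much like nested mounts: a client path is served by the volume whose junction path is the longest matching prefix. An illustrative Python sketch (not WAFL code), using a hypothetical junction layout:

```python
# Illustrative sketch of resolving a client path to the volume that
# serves it, via the longest-matching junction path.
junctions = {          # junction path -> volume name (hypothetical layout)
    "/":       "R",
    "/acct":   "C",
    "/acct/d": "D",
    "/proj":   "F",
}

def owning_volume(path):
    """Pick the volume whose junction path is the longest prefix of path."""
    best = max((jp for jp in junctions
                if path == jp or path.startswith(jp.rstrip("/") + "/")),
               key=len)
    return junctions[best]

print(owning_volume("/acct/d/file.txt"))  # 'D'
print(owning_volume("/home/smith"))       # 'R' -- falls through to the root volume
```

From an NFS or CIFS client the whole tree still looks like one file system; the junction table is what lets the cluster place each volume on any node.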
This is the volume show command. Typed by itself, it shows a summary view of all volumes. If you do a show of
a specific virtual server and volume, you'll see the instance (detailed) view of that volume rather than the summary list
of volumes.
Junctions are conceptually similar to UNIX mountpoints. In UNIX, a hard disk can be carved up into partitions and then
those partitions can be mounted at various places relative to the root of the local file system, including in a hierarchical
manner. Likewise, the flexible volumes in a Data ONTAP cluster can be mounted at junction points within other
volumes, forming a single namespace that is actually distributed throughout the cluster. Although junctions appear as
directories, they have the basic functionality of symbolic links.
A volume is not visible in its vserver’s namespace until it is mounted within the namespace.
Typically, when volumes are created by way of the volume create command, a junction path is specified at that
time. That is optional; a volume can be created and not mounted into the namespace. When it’s time to put that volumeinto use, the volume mount command is the way to assign the junction path to the volume. The volume also can be
unmounted, which takes it out of the namespace. As such, it is not accessible by NFS or CIFS clients, but it is still
online, and can be mirrored, backed up, moved and so on. It then can be mounted again to the same or different place
in the namespace and in relation to other volumes (for example, it can be unmounted from one parent volume and
mounted to another parent volume).
This is a representation of the volume hierarchy of a namespace. These five volumes are connected by way of
junctions, with the root volume of the namespace at the “top” of the hierarchy. From an NFS or CIFS client, this
namespace will look like a single file system.
It’s very important to know the differences between what the volume hierarchy looks like to the administrator
(internally) as compared to what the namespace looks like from an NFS or CIFS client (externally).
The name of the root volume of a vserver (and hence, the root of this namespace) can be chosen by the administrator,
but the junction path of the root volume is always /. Notice that the junction path for the mount point of a volume is not
tied to the name of the volume. In this example, we’ve named the volume smith_mp3 to associate it with
volume smith, but that’s just a convention to make the relationship between the smith volume and its mp3 volume
more obvious to the cluster administrator.
Here again is the representation of the volumes of this namespace. The volume names are shown inside the circles
and the junction paths are listed outside of them. Notice that there is no volume called “user.” The “user” entity is
simply a directory within the root volume, and the junction for the smith volume is located in that directory. The acct
volume is mounted directly at the /acct junction path in the root volume.
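The distinction between a plain directory and a junction can be sketched as a longest-prefix match: the deepest junction path that covers a client path decides which volume serves it. This is a hypothetical model, not the actual lookup code:

```python
# "/user" is just a directory inside the root volume; only "/user/smith"
# and "/acct" are junctions. The longest matching junction path wins.
junctions = {"/": "root_vol", "/acct": "acct", "/user/smith": "smith"}

def resolve(path):
    matches = (j for j in junctions
               if path == j or path.startswith(j.rstrip("/") + "/"))
    return junctions[max(matches, key=len)]

print(resolve("/user/readme.txt"))   # root_vol (/user is only a directory)
print(resolve("/user/smith/a.mp3"))  # smith
print(resolve("/acct/ledger"))       # acct
```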
Please refer to your Exercise Guide for more instructions.
Kernel modules are loaded into the FreeBSD kernel, which gives them special privileges that are not available to user
space processes. There are great advantages to being in the kernel, but there are downsides too. For one, it's more
difficult to write kernel code, and the penalty for a coding error is severe. User space processes can be swapped out by
the operating system, but on the plus side, they can fail without taking the whole system down and can easily be
restarted on the fly.
This diagram shows the software stack making up Data ONTAP 8.0 Cluster-Mode. The most obvious difference
between this stack and the 7-Mode stack is the addition of a networking component called the N-blade, and more
logical interfaces (LIFs). Also, notice that Cluster-Mode does not yet support the SAN protocols (FC and iSCSI).
The N-blade is the network blade. It translates between the NAS protocols and the SpinNP protocol
that the D-blade uses. SpinNP is the protocol used within a cluster to communicate between N-blades and D-blades.
In Cluster-Mode, the D-blade does not service NAS or SAN protocol requests directly.
All nodes in a cluster have these kernel modules:
•common_kmod.ko: The Common module is the first kernel module to load. It contains common services, which are
shared by modules that load after it.
•nvram5.ko: A low-level hardware driver for NVRAM5.
•nvram_mgr.ko: This segments NVRAM for various users, provides some common access functions that NVRAM5
doesn't provide, and provides centralized power management.
•nvr.ko: A character device driver for interfacing with individual regions of NVRAM (for example, /var).
•maytag.ko: This is the D-blade. It is a stripped-down and modified Data ONTAP 7G, which includes the WAFL® file
system, RAID, and storage components, and the SpinNP translation layers.
•nbladekmod.ko: The N-blade contains the network stack, protocols, and SpinNP translation layers.
•spinvfs.ko: SpinVFS enables the user space components to access volumes in the cluster.
•N-blade (network, protocols)
•CSM (and SpinNP)
•D-blade (WAFL, NVRAM, RAID, storage)
•Management (sometimes called M-host)
The term “blade” refers to separate software state machines, accessed only by well-defined application program
interfaces, or APIs. Every node contains an N-blade, a D-blade, and Management, and any N-blade in the cluster can
talk to any D-blade in the cluster.
The N-blade translates client requests into Spin Network Protocol (SpinNP) requests (and vice versa). The D-blade,
which contains the Write Anywhere File Layout (WAFL) file system, handles SpinNP requests. CSM is the SpinNP
layer between the N-blade and D-blade.
The members of each RDB unit, on every node in the cluster, are in constant communication with each other to remain
in sync. The RDB communication is like the heartbeat of each node. If the heartbeat cannot be detected by the other
members of the unit, the unit corrects itself in a manner to be discussed later. The three RDB units on each node
are: VLDB, VifMgr, and Management. There will be more information about these RDB units later.
This graphic is very simplistic, but each node contains the following: N-blade, CSM, D-blade, M-host, RDB units (3),
and the node’s vol0 volume.
An NFS or CIFS client sends a write request to a data logical interface, or LIF. The N-blade that is currently associated
with that LIF translates the NFS/CIFS request to a SpinNP request. The SpinNP request goes through CSM to the
local D-blade. The D-blade sends the data to nonvolatile RAM (NVRAM) and to the disks. The response works its way
back to the client.
This path is mostly the same as the local write request, except that when the SpinNP request goes through CSM, it
goes to a remote D-blade elsewhere in the cluster, and vice versa.
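The local and remote write paths differ only in where CSM delivers the SpinNP request. A minimal sketch, with hypothetical names standing in for the real components:

```python
# Sketch of the NAS write path: the N-blade owning the data LIF translates
# the client request to SpinNP, and CSM routes it to the D-blade that owns
# the volume -- local or remote makes no difference to the client.
VLDB = {"vol_a": "node1", "vol_b": "node2"}    # volume -> owning D-blade

def nblade_write(receiving_node, volume, data):
    spinnp_request = {"op": "write", "vol": volume, "data": data}
    target = VLDB[volume]                      # find the owning D-blade
    route = "local" if target == receiving_node else "remote"
    return f"{route} D-blade on {target} committed {len(data)} bytes"

print(nblade_write("node1", "vol_a", b"hello"))  # local path
print(nblade_write("node1", "vol_b", b"hello"))  # remote path, via cluster network
```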
The N-blade architecture comprises a variety of functional areas, interfaces, and components. The N-blade itself
resides as a loadable module within the FreeBSD kernel. It relies heavily on services provided by SK (within the D-blade).
The N-blade supports a variety of protocols. Interaction with these protocols is mediated by the PCP layer, which sits
between the protocols and the network protocol stack/device drivers.
•Transports requests from any N-blade to any D-blade and vice versa (even on the same node)
•The protocol is called SpinNP (Spinnaker network protocol) and is the language that the N-blade speaks to the D-
blade
•Uses UDP/IP
SpinNP is the protocol family used within a cluster, or between clusters, to carry high-frequency/high-bandwidth messages between blades or between an M-host and a blade.
Cluster Session Manager (CSM) is the communication layer that manages connections, using the SpinNP protocol,
between two blades. The blades can be both local, or one local and one remote. Clients of CSM use it because it
provides blade-to-blade communication without the client needing to know where the remote blade is located.
The D-blade is basically a wrapper around Data ONTAP 7G that translates SpinNP for WAFL. The Spinnaker D-blade (SpinFS file
system, storage pools, VFS, Fibre Channel driver, N+1 storage failover) was replaced by Data ONTAP (encapsulated
into a FreeBSD kernel module).
•Certain parts of the “old” Data ONTAP aren’t used (UI, network, protocols)
•It speaks SpinNP on the front end
•The current D-blade is mostly made up of WAFL
The D-blade is the disk-facing software kernel module and is derived from Data ONTAP. It contains WAFL, RAID, and storage.
SpinHI is part of the D-blade and sits directly above WAFL. It processes all incoming SpinNP fileop messages, most
of which are translated into WAFL messages.
•Also known as “Management”
•Based on code called “Simple Management Framework” (SMF)
•Cluster, nodes, and virtual servers can be managed by any node in the cluster
The M-host is a user space environment on a node, along with the entire collection of software services:
Command shells and API servers.
Service processes for upcalls from the kernel.
User space implementations of network services, such as DNS, and file access services, such as HTTP and FTP.
Underlying cluster services, such as RDB, cluster membership services, and quorum.
Logging services, such as EMS.
Environmental monitors.
Higher-level cluster services, such as VLDB, job manager, and LIF manager.
Processes that interact with external servers, such as Kerberos and LDAP.
Processes that perform operational functions, such as NDMP control and auditing.
Services that operate on data, such as anti-virus and indexing.
SMF currently supports two types of persistent data storage via table-level attributes: persistent and replicated. The
replicated tables are identical copies of the same set of tables stored on every node in the cluster. Persistent tables are
node-specific and stored locally on each node in the cluster.
Colloquially, these table attributes are referred to as RDB (replicated) and CDB (persistent).
The volume location database (VLDB) is a replicated database. It stores tables used by N-blades and the system
management processes to find the D-blade to which to send requests for a particular volume.
Note that all of these mappings are composed and cached in the N-blade’s memory, so that the results of
all lookups are typically available after a single hash table lookup.
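The caching behavior can be sketched with a memoized dictionary; the names are hypothetical, but the shape matches the description: only the first lookup consults the VLDB tables, and every later one is a single hash-table probe:

```python
# The N-blade composes and caches VLDB mappings in memory, so after the
# first miss a lookup is a single hash-table (dict) probe.
class VldbCache:
    def __init__(self, vldb_tables):
        self.tables = vldb_tables     # replicated VLDB contents
        self.cache = {}               # per-N-blade in-memory cache
        self.misses = 0

    def dblade_for(self, volume):
        if volume not in self.cache:  # first lookup: consult the tables
            self.misses += 1
            self.cache[volume] = self.tables[volume]
        return self.cache[volume]     # later lookups: one dict probe

c = VldbCache({"vol_a": "dblade-node1"})
c.dblade_for("vol_a")
c.dblade_for("vol_a")
print(c.misses)                       # 1 -- only the first lookup missed
```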
SecD relies on external servers:
- DNS servers
Used for name-to-IP address lookups
Used by server discovery to get the domain controllers for a Windows domain. Microsoft stores this info in DNS.
- Windows domain controllers
Used by SecD to create a CIFS server's Windows machine account
Used to perform CIFS authentication
Retrieved from DNS using the CIFS domain name included in the 'cifs create' command
Preferred DCs can be specified in the configuration
- NIS servers
If configured, can be used to obtain credentials for UNIX users
NIS must be included in the vserver's ns-switch option
- LDAP servers
If configured, can be used to obtain credentials for UNIX users, and as a source for UNIX accounts during name mapping
LDAP must be included in the vserver's ns-switch and/or nm-switch options.
In some cases these servers can be automatically detected; in others, the servers must be defined in the configuration.
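The ns-switch option is an ordered list of sources, tried in turn until one answers. A minimal sketch of that ordering, with hypothetical source data (not SecD's actual implementation):

```python
# Hypothetical sketch of ns-switch ordering: a credential lookup tries
# each configured source in order until one has the user.
def lookup_unix_user(name, ns_switch, sources):
    for source in ns_switch:                  # e.g. ["file", "nis", "ldap"]
        users = sources.get(source, {})
        if name in users:
            return source, users[name]        # first source that answers wins
    return None, None                         # no source knows the user

sources = {"nis": {"smith": 1001}, "ldap": {"smith": 9001, "jones": 2002}}
print(lookup_unix_user("smith", ["file", "nis", "ldap"], sources))
print(lookup_unix_user("jones", ["file", "nis", "ldap"], sources))
```

Note that "smith" resolves through NIS even though LDAP also has an entry, because NIS comes first in the ns-switch list.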
Manages all Cluster-Mode network connections on the data, cluster, and management networks.
Uses RDB to store network configuration information.
Uses RDB to know when to migrate a LIF to another node.
The vol0 volume of a node is analogous to the root volume of a Data ONTAP® 7G system. It contains the data needed
for the node to function.
The vol0 volume does not contain any user data, nor is it part of the namespace of a vserver. It lives (permanently) on
the initial aggregate that is created when each node is initialized.
The vol0 volume is not protected by mirrors or tape backups, but that’s OK. Although it is a very important volume (a
node cannot boot without its vol0 volume), the data contained on vol0 is (largely) re-creatable. If it were lost, the log
files would indeed be gone. But because the RDB data is replicated on every node in the cluster, that data can be
automatically re-created onto this node.
Each vserver has one namespace and, therefore, one root volume. This is separate from the vol0 volume of each
node.
The RDB units do not contain user data, but rather they contain data that helps manage the cluster. These databases
are replicated, that is, each node has its own “copy” of the database, and that database is always in sync with the
databases on the other nodes in the cluster. RDB database reads are performed locally on each node, but an RDB
write is performed to one “master” RDB database, and then those changes are replicated to the other databases
throughout the cluster. When reads are done of an RDB database, they can be fulfilled locally, without the need to
send any requests over the cluster networks.
The RDB is transactional in that it guarantees that when something is being written to a database, either it all gets
written successfully or it all gets rolled back. No partial/inconsistent database writes are committed.
There are three RDB units (VLDB, Management, VifMgr) in every cluster, which means that there are three RDB unit
databases on every node in the cluster.
Replicated Database
Currently three RDB units: VLDB, VifMgr, Management
Maintains the data that manages the cluster
Each unit has its own replication unit
Unit is made up of one master (read/write) and other secondaries (read-only)
One node contains the master of an RDB app, others contain the secondaries
Writes go to the master, then get propagated to others in the unit (via the cluster network)
Enables the consistency of the units through voting and quorum
The user space processes for each RDB unit vote to determine which node (process) will be the master
Each unit has a master, which could be a different node for each unit
The master can change as quorum is lost and regained
An RDB unit is considered to be healthy only when it is “in quorum” (i.e., a master is able to be elected)
A simple majority of online nodes are required to have a quorum
One node is designated as “epsilon” (can break a tie) for all RDB units
An RDB replication ring stays “online” as long as a bare majority of the application instances are healthy and in
communication (a quorum). When an instance is online (part of the quorum), it enjoys full read/write capability on up-to-date
data; when it is offline, it has read-only access to its possibly stale local replica. The individual applications all
require online RDB state to provide full service.
Each RDB unit has it own ring. If n is the number of nodes in the cluster, then each unit/ring is made up of n databases
and n processes. At any given time, one of those databases is designated as the master and the others are designated
as secondary databases. Each RDB unit’s ring is independent of the other RDB units. If nodeX has the master
database for the VLDB unit, nodeY may have the master for the VifMgr unit and nodeZ may have the master for the
Management unit.
The master of a given unit can change. For example, when the node that is the master for the Management unit gets
rebooted, a new Management master needs to be elected by the remaining members of the Management unit. It’s
important to note that a secondary can become a master and vice versa. There isn’t anything special about the
database itself, but rather the role of the process that manages it (master versus secondary).
When a change is made to an RDB unit, it is written to the master database, which takes care
of immediately replicating the change to the secondary databases on the other nodes. If a change cannot be
replicated to a certain secondary, then the entire change is rolled back everywhere. This is what we mean by no partial
writes. Either all databases of an RDB unit get the change, or none do.
Quorum requirements are based on a straight majority calculation. To promote easier quorum formation given an
even number of replication sites, one of the sites is assigned an extra partial weight (epsilon). So, for a cluster of 2n
sites, quorum can be formed by the n-site partition that includes the epsilon site.
Let’s define some RDB terminology. A master can be elected only when there is a quorum of members available (and
healthy) for a particular RDB unit. Each member votes for the node that it thinks should be the master for this RDB
unit. One node in the cluster has a special tie-breaking ability called “epsilon.” Unlike the master, which may be
different for each RDB unit, epsilon is a single node that applies to all RDB units.
Quorum means that a simple majority of nodes are healthy enough to elect a master for the unit. The epsilon power is
only used in the case of a voting tie. If a simple majority does not exist, the epsilon node (process) chooses the master
for a given RDB unit.
A unit goes out of quorum when cluster communication is interrupted, for example, due to a reboot, or perhaps a
cluster network hiccup that lasts for a few seconds. It comes back into quorum automatically when the cluster
communication is restored.
In normal operation, cluster-wide quorum is required to elect the master.
For quorum, a simple majority of connected, healthy, active nodes is required:
For N = 2n or 2n+1, quorum >= (n+1)
Alternatively, an artificial majority: half the nodes, including the configuration epsilon:
For N = 2n with epsilon, quorum >= (n+e)
For N = 2n+1 with epsilon, quorum >= (n+1)
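The formulas above can be checked with a small calculator that treats epsilon as a partial extra vote, as the earlier slide describes (a sketch, not the actual RDB voting code):

```python
# Quorum check: a simple majority of nodes, or an artificial majority
# when the partition holds the epsilon node (epsilon = partial weight).
def quorum_met(total_nodes, partition_size, has_epsilon=False):
    votes = partition_size + (0.5 if has_epsilon else 0)
    return votes > total_nodes / 2

print(quorum_met(8, 5))                     # True: 5 of 8 is a majority
print(quorum_met(8, 4))                     # False: even split, no epsilon
print(quorum_met(8, 4, has_epsilon=True))   # True: epsilon breaks the tie
```

This matches the even-cluster case: for a cluster of 2n nodes, the n-node partition that includes epsilon forms quorum.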
A master can be elected only when there is a majority of local RDB units connected (and healthy) for a particular RDB
unit. A master is elected when each local unit agrees on the first reachable healthy node in the RDB site list. A
“healthy” node is one that is connected, able to communicate with the other nodes, has CPU cycles, and has
reasonable I/O.
The master of a given unit can change. For example, when the node that is the master for the Management unit gets
rebooted, a new Management master needs to be elected by the remaining members of the Management unit.
A local unit goes out of quorum when cluster communication is interrupted for a few seconds, for example, due to a
reboot or a cluster network hiccup. It comes back into quorum automatically, as the RDB units are always working to
monitor and maintain a good state. When a local unit goes out of quorum and then comes back into quorum, the RDB
unit is resynchronized. It’s important to note that even if the VLDB process on a node is out of quorum, data access
through that node can continue.
When a unit goes out of quorum, reads from that unit can be done, but writes to that unit cannot. That restriction is
enforced so that no changes to that unit happen during the time that a master is not agreed upon. Besides the VLDB
example above, if the VifMgr goes out of quorum, access to LIFs is not affected, but no LIF failover can occur.
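The election rule stated above, agreeing on the first reachable healthy node in the site list, can be sketched directly; because every voter walks the same fixed list, they all arrive at the same master (illustrative code, not the RDB implementation):

```python
# Election sketch: each member agrees on the first reachable, healthy
# node in the shared RDB site list, so all voters pick the same master.
def elect_master(site_list, healthy):
    for node in site_list:            # fixed ordering shared by all members
        if node in healthy:
            return node
    return None                       # no healthy node -> no master

site_list = ["node1", "node2", "node3", "node4"]
print(elect_master(site_list, {"node2", "node3", "node4"}))  # node2
print(elect_master(site_list, set()))                        # None
```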
Marking a node as ineligible (by way of the cluster modify command) means that it no longer affects RDB quorum or
voting. If the epsilon node is marked as ineligible, epsilon will be automatically given to another node.
The cluster ring show command is available only at the advanced privilege level or higher.
The “DB Epoch” values of the members of a given RDB unit should be the same. For example, as shown, the DB
epoch for the mgmt unit is “8,” and it’s “8” on both node5 and node8. But that is different from the DB epoch for the vldb
unit, which is “6.” This is fine. The DB epoch needs to be consistent across nodes for an individual unit; not all units
have to have the same DB epoch.
Whenever an RDB ring forms a new quorum and elects the RDB master, the master starts a new epoch.
The combination of epoch number and transaction number, <epoch,tnum>, is used to construct the RDB version.
The transaction number is incremented with each read/write transaction.
All RDB copies that have the same <epoch,tnum> combination contain exactly the same information.
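The <epoch,tnum> scheme orders versions the way Python orders tuples: the epoch dominates, and tnum only breaks ties within an epoch. A quick illustration:

```python
# RDB versions as <epoch, tnum> pairs: a new master starts a new epoch,
# and each read/write transaction bumps tnum. Lexicographic tuple
# comparison gives the right ordering: epoch first, then tnum.
v_old = (6, 41)          # epoch 6, transaction 41
v_new = (8, 3)           # a later epoch wins even with a smaller tnum

print(v_new > v_old)     # True
print((8, 3) == (8, 3))  # True: equal versions imply identical replicas
```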
When a majority of the instances in the RDB ring are available, they elect one of those instances the master, with the
others becoming secondaries. The RDB master is responsible for controlling updates to the data within the replication
ring.
When one of the nodes wishes to make an update, it must first obtain a write transaction from the master. Under this
transaction, the node is free to make whatever changes it wants; however, none of these changes are seen externally
until the transaction commits, at which point the updates are propagated to the other instances in the ring.
If a quorum’s worth of nodes is updated, the changes are made permanent; if not, the changes are rolled back.
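That all-or-nothing rule can be sketched as a tiny commit protocol: stage the update on every reachable replica, then keep it only if a quorum's worth of nodes took it (a simplified model, not the RDB wire protocol):

```python
# All-or-nothing commit sketch: stage the update on reachable replicas;
# commit only if a quorum acknowledged, otherwise roll back everywhere.
def replicate(update, replicas, reachable, total):
    staged = [r for r in replicas if r in reachable]
    if len(staged) > total // 2:       # a quorum's worth of nodes updated
        return "committed", staged
    return "rolled back", []           # no quorum: discard the change

replicas = ["node1", "node2", "node3"]
print(replicate("set x=1", replicas, {"node1", "node2"}, total=3))
print(replicate("set x=1", replicas, {"node1"}, total=3))
```

With two of three replicas reachable the change commits; with only one, the whole change is discarded, so no partial writes survive.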
One node in the cluster has a special voting weight called epsilon. Unlike the masters of each RDB unit, which may be
different for each unit, the epsilon node is the same for all RDB units. This epsilon vote is only used in the case of an
even partitioning of a cluster, where, for example, four nodes of an eight-node cluster cannot talk to the other four
nodes. This is very rare, but should it happen, a simple majority would not exist and the epsilon node would sway the
vote for the masters of the RDB units.
From Ron Kownacki, author of the RDB:
“Basically, quorum majority doesn't work well when down to two nodes and there's a failure, so RDB is essentially
locking the fact that quorum is no longer being used, and enabling a single replica to be artificially writable during that
outage.
“The reason we require a quorum (a majority) is so that all committed data is durable - if you successfully write to a
majority, you know that any future majority will contain at least one instance that has seen the change, so the update is
durable. If we didn't always require a majority, we could silently lose committed data. So in two nodes, the node with
epsilon is a majority, and the other is a minority - so you would only have one directional failover (need the majority).
So epsilon gives you a way to get majorities where you normally wouldn't have them, but it only gives unidirectional
failover because it's static.
“In two-node high availability mode, we try to get bidirectional failover. To do this, we remove the configuration
epsilon, and make both nodes equal - and form majorities artificially in the failover cases. So quorum is 2/2 (no epsilon
involved), but if there's a failover, you artificially designate the survivor as the majority (and lock that fact). However,
that means you can't failover the other way until both nodes are available, they sync up, and drop the lock - otherwise
you would be discarding data.”
This diagram shows that each node contains the following: N-blade, CSM, D-blade, M-host, RDB units (3), and vol0.
Please refer to your Exercise Guide for more instructions.
The bundled cluster and management switch infrastructure consists of:
Cluster: Cisco NX5010/NX5020 (20/40 port, 10GbE)
Management: Cisco 2960 (24 port 10/100)
Switch Cabling:
Cable ISL ports (8x Copper ISL ports)
NX5010: Ports 13-20
NX5020: Ports 33-40
Cable mgmt switch ISL and customer uplink
Cable NX50x0 to mgmt switch
Cable controller cluster ports
cluster port1 -> sw A, cluster port2 -> sw B
Cable Management Ports
Odd node #: node-mgmt sw A, RLM sw B
Even node #: node-mgmt sw B , RLM sw A
The key change in Boilermaker from 7G is that we now have a dual-stack architecture. Saying "dual-stack" tends to
imply that both stacks have equal prominence, but in our case, the stack inherited from 7G, referred to as the SK stack,
owns the network interfaces in normal operation and, for the most part, runs the show for 7-Mode and C-Mode apps.
The FreeBSD stack inherited from GX runs as a surrogate to the SK stack and provides the programmatic interface
(BSD sockets) for the M-host apps to communicate with the network. The FreeBSD stack does not directly talk to the
network in normal operation because it does not own any of the physical network interfaces. The FreeBSD stack
maintains the protocol (TCP and UDP) state for all M-host connections and builds the TCP/IP frames over M-host
data. It sends the created TCP/IP frames to the SK stack for delivery to the network. On the ingress side, the SK stack
delivers all packets destined for the M-host to the FreeBSD stack.
Data ONTAP 8.0 makes a distinction between physical network ports and logical interfaces, or LIFs. Each port has a
role associated with it by default, although that can be changed through the UI. The role of each network port should
line up with the network to which it is connected.
Management ports are for administrators to connect to the node/cluster, for example, through SSH or a Web browser.
Cluster ports are strictly for intra-cluster traffic.
Data ports are for NFS and CIFS client access, as well as the cluster management LIF.
Using a FAS30x0 as an example, the e0a and e0b ports are defined as having a role of cluster, while the e0c and e0d
ports are defined for data. The e1a port would be on a network interface card in one of the four horizontal slots at the
top of the controller. The e1a port is, by default, defined with a role of mgmt.
The network port show command shows the summary view of the ports of this 4-node cluster. All the ports are
grouped by node, and you can see the roles assigned to them, as well as their status and Maximum Transmission Unit
(MTU) size. Notice the e1b data ports that are on the nodes, but not connected to anything.
A LIF in Cluster-Mode terminology refers to an IP address and netmask associated with a data port.
Each node can have multiple data LIFs, and multiple data LIFs can reside on a single data port or an optional interface group.
The default LIF creation command will also create default failover rules. If manual/custom failover rule creation is desired, or if multiple data subnets will be used, add the "use-failover-groups disabled" or specific "-failover-group" options to the "network interface create" command.
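As a sketch, creating a data LIF with the default failover rules might look like the following clustershell command. The vserver, LIF, node, port, and address values here are hypothetical, and exact option names should be verified against the release in use:

```
network interface create -vserver vs1 -lif data1 -role data -home-node node1 -home-port e0c -address 192.168.1.50 -netmask 255.255.255.0
```

To suppress or customize the automatically created rules, the -use-failover-groups disabled or -failover-group options mentioned above would be appended to this same command.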
Data ONTAP connects with networks through physical interfaces (or links). The most common interface is an Ethernet port, such as e0a, e0b, e0c, and e0d.
Data ONTAP has supported IEEE 802.3ad link aggregation for some time now. This standard allows multiple network interfaces to be combined into one interface group. After being created, this group is indistinguishable from a physical network interface.
Multiple ports in a single controller can be combined into a trunked port via the interface group feature. An interface group supports three distinct modes (multimode, multimode-lacp, and singlemode), with the load distribution selectable among mac, ip, and sequential. Using interface groups requires a matching configuration on the connected client Ethernet switch, depending on the configuration selected.
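An interface group might be created along these lines (the node, ifgrp, and port names are hypothetical, and the mode/distribution keywords shown should be checked against the installed release):

```
network port ifgrp create -node node1 -ifgrp a0a -mode multimode-lacp -distr-func ip
network port ifgrp add-port -node node1 -ifgrp a0a -port e0c
network port ifgrp add-port -node node1 -ifgrp a0a -port e0d
```

Remember that an LACP interface group requires a matching port-channel/LACP configuration on the connected switch before it will pass traffic.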
Ports are either physical ports (NICs), or virtualized ports such as ifgrps or vlans. Ifgrps treat several physical ports as
a single port, while vlans subdivide a physical port into multiple separate ports. A LIF communicates over the network
through the port it is currently bound to.
Using 9000 MTU on the cluster network is highly recommended, for performance and reliability reasons. The cluster switch or VLAN should be modified to accept 9000-byte payload frames prior to attempting the cluster join/create. Standard 1500 MTU cluster ports should only be used in non-production lab or evaluation situations, where performance is not a consideration.
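Setting jumbo frames on a cluster port could be done with a command of this form (the node and port names are hypothetical):

```
network port modify -node node1 -port e0a -mtu 9000
```

As noted above, the connected cluster switch ports or VLAN must already accept 9000-byte payload frames, or cluster traffic will be disrupted.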
The LIF names need to be unique within their scope. For data LIFs, the scope is a cluster virtual server, or vserver. For
the cluster and management LIFs the scopes are limited to their nodes. Thus, the same name, like mgmt1, can be
used for all the nodes, if desired.
A routing group is automatically created when the first interface on a unique subnet is created. The routing group is role-specific, and allows the use of the same set of static and default routes across many logical interfaces. The default naming convention for a routing group is representative of the interface role and the subnet it is created for.
The first interface created on a subnet will trigger the automatic creation of the appropriate routing group. Subsequent LIFs created on the same subnet will inherit the existing routing group.
Routing groups cannot be renamed. If a naming convention other than the default is required, the routing group can be pre-created with the desired name, then applied to an interface during LIF creation or as a modify operation on the LIF.
Routing groups are created automatically as new LIFs are created, unless an existing routing group already covers that port role/network combination. Apart from the node management LIF routing groups, routing groups have no routes defined by default.
The node management LIFs on each node have static routes automatically set up for them, using the same default gateway.
There is a "metric" value for each static route, which is how the administrator configures which route is preferred over another (the lower the metric, the more preferred the route) when more than one static route is defined for a particular LIF. The metric values for the node management LIFs are 10. When routes are created for data LIFs, if no metric is defined, the default is 20.
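As an illustration, a default route with the data-LIF default metric of 20 might be added to a routing group like this. The vserver, gateway, and subnet are hypothetical; the routing-group name shown simply follows the default role/subnet naming convention described above:

```
network routing-groups route create -vserver vs1 -routing-group d192.168.1.0/24 -destination 0.0.0.0/0 -gateway 192.168.1.1 -metric 20
```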
As with the network interface show output, the node management LIFs have a Server that is the node itself. The data
LIFs are associated with a cluster vserver, so they’re grouped under that.
Why migrate a LIF? It may be needed for troubleshooting a faulty port, or perhaps to offload a node whose data network ports are being saturated with other traffic. A LIF will also fail over if its current node is rebooted.
Unlike storage failover (SFO), LIF failover or migration does not cause a reboot of the node from which the LIF is migrating. Also unlike SFO, LIFs can migrate to any node in the cluster, not just within the high-availability pair. Once a LIF is migrated, it can remain on the new node for as long as the administrator wants it to.
We’ll cover failover policies and rules in more detail later.
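A manual LIF migration might be issued like this (the vserver, LIF, node, and port names are hypothetical, and option names should be verified against the release in use):

```
network interface migrate -vserver vs1 -lif data1 -dest-node node2 -dest-port e0c
```

Because the IP address moves with the LIF, clients continue using the same address after the migration.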
•Data LIFs can migrate or failover from one node and/or port to any other node and/or port within the cluster
•LIF migration is generally for load balancing; LIF failover is for node failure
•Data LIF migration/failover is NOT limited to an HA pair
•Nodes in a cluster are paired as “high-availability” (HA) pairs (these are called “pairs,” not “clusters”)
•Each member of an HA pair is responsible for the storage failover (SFO) of its partner
•Each node of the pair is a fully functioning node in the greater cluster
•Clusters can be heterogeneous (in terms of hardware and Cluster-Mode versions), but an HA pair must be the same
controller model
•First, we show a simple LIF migration
•Next, we show what happens when a node goes down:
•Both data LIFs that reside on that node fail over to other ports in the cluster
•The storage owned by that node fails over to its HA partner
•The failed node is “gone” (i.e., its partner does not assume its identity like in 7G and 7-Mode)
•The data LIF IP addresses remain the same, but are associated with different NICs
Remember that data LIFs aren’t permanently tied to their nodes. However, the port to which a LIF is migrating is tied to a node. This is another example of the line between physical and logical. Also, ports have a node vserver scope, whereas data LIFs have a cluster vserver scope.
All data and cluster-mgmt LIFs can be configured to automatically fail over to other ports/nodes in the event of failure. Failover can also be used for load balancing if an N-blade is overloaded. The TCP state is not carried over during failover to the new node.
The best practice is to fail LIFs from “even” nodes over to other “even” nodes and LIFs from “odd” nodes over to other “odd” nodes.
The default policy that gets set when a LIF is created is nextavail, but priority can be chosen if desired.
In a 2-node cluster, the nextavail failover-group policy creates rules to fail over between interfaces on the 2 nodes. In clusters with 4 or more nodes, the system-defined group will create rules between alternating nodes, to prevent the storage failover partner from receiving the data LIFs as well in the event of a node failure. For example, in a 4-node cluster, the default failover rules are created so that node1 -> node3, node2 -> node4, node3 -> node1, and node4 -> node2.
Priority rules can be set by the administrator. The default rule (priority 0, which is the highest priority) for each LIF is its home port and node. Additional rules that are added will further control the failover, but only if the failover policy for that LIF is set to priority. Otherwise, rules can be created but won’t be used if the failover policy is nextavail. Rules are applied in priority order. Once a rule is applied, the failover is complete.
Manual failover rules can also be created, in instances where explicit control is desired, by using the ‘disabled’ option.
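For explicit control, the policy and rules described above might be managed with commands of roughly this shape. This is a sketch only; the LIF, node, and port names are hypothetical, and the exact failover-rule command syntax should be confirmed against the release documentation:

```
network interface modify -vserver vs1 -lif data1 -failover-policy priority
network interface failover create -vserver vs1 -lif data1 -priority 1 -node node3 -port e0c
```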
As the cluster receives different amounts of traffic, the traffic across the LIFs of a virtual server can become unbalanced. DNS load balancing aims to dynamically choose a LIF based on load, instead of handing out IP addresses in simple round-robin fashion.
With DNS load balancing enabled, a storage administrator can choose to allow the new built-in load balancer to balance client logical interface (LIF) network access based on the load of the cluster. This DNS server resolves names to LIFs based on the weight of a LIF. A vserver can be associated with a DNS load-balancing zone, and LIFs can be either created or modified in order to be associated with a particular DNS zone. A fully qualified domain name can be added to a LIF in order to create a DNS load-balancing zone by specifying the “dns-zone” parameter on the network interface create command.
There are two methods that can be used to specify the weight of a LIF: the storage administrator can specify a LIF weight, or the LIF weight can be generated based on the load of the cluster. Ultimately, this feature helps to balance the overall utilization of the cluster. It does not increase the performance of any one individual node; rather, it makes sure that each node is more evenly used. The result is better performance utilization from the entire cluster.
Instead of deciding manually which LIFs are used when mounting a particular global namespace, the administrator can let the system dynamically decide which LIF is the most appropriate. And once a LIF is chosen, that LIF may be automatically migrated to a different node to ensure that the network load remains balanced throughout the cluster.
The -allow-lb-migrate true option will allow the LIF to be migrated based on failover rules to an underutilized port on
another head. Pay close attention to the failover rules because an incorrect port may cause a problem. A good practice
would be to leave the value false unless you're very certain about your load distribution.
The -lb-weight load option takes the system load into account. CPU, throughput and number of open connections are
measured when determining load. These currently cannot be changed.
The -lb-weight 1..100 value for the LIF is like a priority. If you assign a value of 1 to LIF1, and a value of 10 to LIF2,
LIF1 will be returned 10 times more often than LIF2. An equal numeric value will round robin each LIF to the client.
This would be equivalent to DNS Load Balancing on a traditional DNS Server.
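Putting these options together, a load-balanced LIF might be configured like this (the vserver, LIF, and zone names are hypothetical):

```
network interface modify -vserver vs1 -lif data1 -dns-zone storage.example.com -lb-weight load -allow-lb-migrate false
```

Here -allow-lb-migrate is left false, per the good practice noted above, until the load distribution and failover rules have been validated.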
The weights of the LIFs are calculated on the basis of CPU utilization and throughput (the average of both is taken):
1. LIF_weight_CPU = ((Max CPU on node - used CPU on node) / (number of LIFs on node)) * 100
2. LIF_weight_throughput = ((Max throughput on port - used throughput on port) / (number of LIFs on port)) * 100
The greater the weight, the lower the probability that the associated LIF is returned.
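The arithmetic above can be sketched in a few lines of Python. The capacity and utilization numbers below are made up purely to illustrate the formulas; Data ONTAP computes these values internally:

```python
def lif_weight(max_cpu, used_cpu, lifs_on_node,
               max_tput, used_tput, lifs_on_port):
    """Average of the CPU-based and throughput-based weights above."""
    weight_cpu = (max_cpu - used_cpu) / lifs_on_node * 100
    weight_tput = (max_tput - used_tput) / lifs_on_port * 100
    return (weight_cpu + weight_tput) / 2

# A node with little headroom yields a small weight...
print(lif_weight(100, 90, 4, 100, 80, 4))   # → 375.0
# ...while a mostly idle node yields a large one.
print(lif_weight(100, 10, 4, 100, 20, 4))   # → 2125.0
```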
Please refer to your Exercise Guide for more instructions.
NFS is the standard network file system protocol for UNIX clients, while CIFS is the standard network file system for
Windows clients. Macintosh® clients can use either NFS or CIFS.
The terminology is slightly different between the two protocols. NFS servers are said to “export” their data, and NFS clients “mount” the exports. CIFS servers are said to “share” their data, and CIFS clients are said to “use” or “map” the shares.
•NFS is the de facto standard for UNIX and Linux; CIFS is the standard for Windows
•N-blade does the protocol “translation” between {NFS and CIFS} and SpinNP
•NFS and CIFS have a virtual server scope (so, there can be multiples of each “running” in a cluster)
NFS is a licensed protocol, and is enabled per vserver by creating an NFS server associated with the vserver. Similarly, CIFS is a licensed protocol, and is enabled per vserver by creating a CIFS server associated with the vserver.
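Enabling the protocols is therefore a per-vserver operation, roughly of this form (the vserver, CIFS server, and domain names are hypothetical):

```
vserver nfs create -vserver vs1
vserver cifs create -vserver vs1 -cifs-server VS1CIFS -domain NAU01.LOCAL
```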
The name-service switch is assigned at the virtual server level and, thus, Network Information Service (NIS) and Lightweight Directory Access Protocol (LDAP) domain configurations are likewise associated at the virtual server level.
A note about virtual servers: although a number of virtual servers can be created within a cluster, with each one containing its own set of volumes, vifs, NFS, and CIFS configurations (among other things), most customers use only one virtual server. This provides for the most flexibility, as virtual servers cannot, for example, share volumes.
The Kerberos realm is not created within a Data ONTAP cluster. It must already exist, and then configurations can be
created to associate the realm for use within the cluster.
Multiple configurations can be created. Each of those configurations must use a unique Kerberos realm.
The NIS domain is not created within a Data ONTAP cluster. It must already exist, and then configurations can be created to associate the domain with cluster vservers within Data ONTAP 8.0.
Multiple configurations can be created within a vserver and for multiple vservers. Any or all of those configurations can use the same NIS domain or different ones. Only one NIS domain configuration can be active for a vserver at one time.
Multiple NIS servers can be specified for an NIS domain configuration when it is created, or additional servers can be added to it later.
The LDAP domain is not created within a Data ONTAP cluster. It must already exist, and then configurations can be
created to associate the domain with cluster vservers within Data ONTAP 8.0.
LDAP can be used for netgroup and UID/GID lookups in environments where it is implemented.
Multiple configurations can be created within a vserver and for multiple vservers. Any or all of those configurations can
use the same LDAP domain or different ones. Only one LDAP domain configuration can be active for a vserver at one
time.
Each volume has an export policy associated with it. Each policy can have rules that govern access to the volume based on criteria such as a client’s IP address or network, the protocol used (NFS, NFSv2, NFSv3, CIFS, any), and many other things. By default, there is an export policy called default that contains no rules.
Each export policy is associated with one cluster vserver. An export policy name need only be unique within a vserver. When a vserver is created, the default export policy is created for it.
Changing the export rules within an export policy changes the access for every volume using that export policy. Be careful.
Export policies control which clients can access the NAS data in a vserver, and apply to both CIFS and NFS access. Each export policy consists of a set of export rules that define a mapping of a client, its permissions, and the access protocol (CIFS, NFS). Export policies are associated with volumes; because a volume is joined to the namespace by a junction, its policy controls access to the data in the volume.
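A sketch of creating a policy, adding a rule, and applying the policy to a volume follows. All names and the client subnet are hypothetical:

```
vserver export-policy create -vserver vs1 -policyname engineering
vserver export-policy rule create -vserver vs1 -policyname engineering -clientmatch 192.168.1.0/24 -protocol nfs -rorule any -rwrule any
volume modify -vserver vs1 -volume eng_vol -policy engineering
```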
Export policies serve as access controls for the volumes. During configuration and testing, a permissive export policy
should be implemented, and tightened up prior to production by adding additional export policies and rules to limit
access as desired.
If you’re familiar with NFS in Data ONTAP 7G (or on UNIX NFS servers), then you’ll wonder how things are tagged to be exported. In Data ONTAP clusters, all volumes are exported as long as they’re mounted (through junctions) into the namespace of their cluster vservers. The volume and export information is kept in the Management RDB unit, so there is no /etc/exports file. This data in the RDB is persistent across reboots and, as such, there are no temporary exports.
The vserver root volume is exported and, because all the other volumes for that vserver are mounted within the namespace of the vserver, there is no need to export anything else. After the NFS client does a mount of the namespace, the client has NFS access to every volume in this namespace. NFS mounts can also be done for specific volumes other than the root volume, but then the client is limited to only being able to see this volume and its “descendant” volumes in the namespace hierarchy. If access needs to be restricted to a particular directory, it is recommended that a separate volume be set up for that directory, followed by an NFS mount of that volume.
If a volume is created without being mounted into the namespace, or if it gets unmounted, it is not visible within the namespace.
Please refer to your Exercise Guide for more instructions.
To prevent clock skew errors, ensure that the NTP configuration is working properly prior to the cifs server create operation.
The machine account does not need to be pre-created on the domain for a Windows 2000 or later domain (it will be created during the vserver cifs create), but the userid/password supplied does need to have domain-join permissions/credentials for the specified OU container.
In this slide, we see the user interface of the Windows Active Directory or domain controller where the machine
account has been created for the CIFS configuration of a vserver.
Active Directory uses Kerberos authentication, while NT LAN Manager (NTLM) is provided for backward compatibility
with Windows clients prior to Windows 2000 Server. Prior to Active Directory, Windows domains had primary and
secondary domain controllers (DCs). With Active Directory, there may be one or multiple Windows servers that work in
cooperation with each other for the Windows domain. A domain controller is now a role that is played by an Active
Directory machine.
When configuring CIFS, the domain controller information will be automatically discovered and the account on the
domain controller will be created for you. If the virtual server requires preferred DC ordering, this can be set via the
"vserver cifs domain preferred-dc add" command.
Some typical steps needed to configure CIFS for a cluster vserver are shown here. The first step is creating a CIFS configuration. This is the CIFS server itself for the vserver vs1.
Three CIFS shares are created. The first one, root, represents the normal path to the root of the namespace. Keep in mind that if the root volume or any other volumes have load-sharing (LS) mirrors, this normal path will use the read-only volumes. Therefore, a read/write share called root_rw needs to be created. If a client maps to the read/write share, it will always use the read/write volumes throughout the namespace of that vserver (no LS mirrors will be used). More details about mirrors, read/write, and read-only paths will be provided later.
The third share uses dynamic shares, based on the user name. For example, if user bill in the nau01 domain connects to the CIFS server, and there is a path in the namespace of /user/bill, then the %u will be translated dynamically into bill, such that there is a share called bill that maps to the junction path /user/bill. While bill is on his PC in the nau01 domain, he can go to \\mycifs\bill and be put into whatever volume has a junction path of /user/bill.
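The three shares described above might be created roughly as follows. The vserver name comes from the example; the read/write path shown for root_rw is an assumption (based on the administrative namespace convention) and should be verified against the release documentation:

```
vserver cifs share create -vserver vs1 -share-name root -path /
vserver cifs share create -vserver vs1 -share-name root_rw -path /.admin
vserver cifs share create -vserver vs1 -share-name %u -path /user/%u
```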
Creating a CIFS configuration is the process of enabling a CIFS server for a given cluster vserver. It is, in effect,
creating a CIFS server. But remember that a CIFS server is specific to a vserver. To enable CIFS for another vserver,
you would need to create a vserver-specific CIFS configuration.
A CIFS configuration is limited in scope to a cluster vserver, and a vserver does not have to have a CIFS configuration.
As such, a vserver must exist before a CIFS configuration can be created.
Kerberos is sensitive to time skew among nodes and between the cluster and a Kerberos server. When multiple
machines are working together, as is the case with a Data ONTAP cluster and a Kerberos server, the times on those
machines should be within a few minutes of each other. By default, a five-minute time skew is allowed. A time skew
greater than that will cause problems. Time zone settings take care of machines being in different time zones, so that’s
not a problem. NTP is a good way to keep multiple machines in time sync with each other. You also can widen the
allowable time skew, but it’s best to keep the machine times in sync anyway.
Cluster-Mode allows concurrent access to files by way of NFS and CIFS and with the use of Kerberos. All of these protocols have the concept of a principal (user or group), but they’re incompatible with each other. So, name mappings provide a level of compatibility.
CIFS principals explicitly contain the domain as part of the principal. Likewise, Kerberos principals contain an instance and a realm. NFS uses NIS to store its principal information, and an NFS principal is simply a name (the NIS domain is implied and so is not needed in the principal). Because of these differences, the administrator needs to set up rules (specific or regular expression) to enable these protocols to resolve the differences and correlate these principals with each other.
There are no name mappings configured by default on an 8.0 cluster, so for multiprotocol access and UNIX <--> Windows username matching, generic name mappings will need to be created using regular expressions.
Note the two backslashes between the CIFS domains and the CIFS user. Because these parameters take regular
expressions, and the backslash is a special character in regular expressions, it must be “escaped” with another
backslash. Thus, in a regular expression, two backslashes are needed to represent one backslash.
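The escaping rule can be demonstrated with any regular-expression engine. Here is a small Python illustration; the ENG domain and jdoe user are hypothetical:

```python
import re

# Two backslashes in the pattern match the single literal backslash
# that separates the CIFS domain from the user name.
pattern = r"ENG\\(.+)"        # the regex engine sees: ENG\\(.+)
windows_name = r"ENG\jdoe"    # the literal string ENG\jdoe

match = re.match(pattern, windows_name)
print(match.group(1))         # → jdoe
```

The captured group is what a name-mapping rule would substitute into the UNIX side of the mapping.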
In the second example, a more dynamic mapping is set up. In this example, any user in the MYCIFS domain would be
mapped with a like-named UNIX user. So, user yoda in the MYCIFS domain a.k.a. MYCIFS\yoda would be mapped
with the UNIX user yoda.
Please refer to your Exercise Guide for more instructions.
There are a number of ways to protect your data, and a customer’s data protection plan will likely use all of these
methods.
Snapshot functionality is controlled by Management, which provides the UI for manual Snapshot copies and the Job
Manager policies and schedules for automated Snapshot operations. Each volume can have a Snapshot policy
associated with it. This policy can have multiple schedules in it, so that Snapshot copies can be created using any
combinations of hourly, daily, weekly, and so on. The policy also says how many of each of those to retain before
deleting an old one. For example, you can keep four hourly Snapshot copies, and when the fifth one is taken, the
oldest one is removed, such that a rolling window of the previous four hours of Snapshot copies is retained.
The .snapshot directories are visible and usable by clients, allowing users to restore their own data without administrator intervention. When the entire volume needs to be restored from a Snapshot copy, the administrator uses the volume snapshot promote command, which is essentially a restore using SnapRestore technology. The entire Snapshot copy is promoted, replacing the entire volume. Individual files can be restored only if done through a client.
The Snapshot copies shown here are scheduled Snapshot copies. We have three Snapshot copies that were taken
five minutes apart for the past 15 minutes, two daily Snapshot copies, six hourly Snapshot copies, and two weekly
Snapshot copies.
Note:
We recommend that you manually replicate all mirrors of a volume immediately after you promote its Snapshot copy. Not doing
so can result in unusable mirrors that must be deleted and recreated.
There are two Snapshot policies that are automatically created: default and none. New volumes are associated with the default Snapshot policy and schedule. The defaults provide 6 hourly, 2 daily, and 2 weekly Snapshot copies. A predefined Snapshot policy named "none" is also available for volumes that do not require Snapshot copies.
A volume that has none as its Snapshot policy will have no Snapshot copies taken. A volume that uses the default policy will, after two weeks, have a total of ten Snapshot copies retained (six hourly copies, two daily copies, and two weekly copies).
Volumes are created by default with a 20% snapshot reserve.
New schedules for use with a snapshot policy can be defined via the "job schedule cron create" command.
•Cluster-Mode mirroring uses the new “Paloma” SnapMirror engine
•Cluster-Mode mirroring is only asynchronous
•Two flavors: load-sharing (LS) and data protection (DP)
Mirrors are read-only volumes. Each mirror is created with an association to a read/write (R/W) volume and is labeled as either an LS or a DP mirror. LS and DP mirrors are the same in substance, but the type dictates how the mirror is used and maintained.
Mirrors are copies of their R/W volumes, and are only as synchronized as the administrator keeps them, through manual replication or scheduled (automated) replication. Generally, DP mirrors do not need to be as up-to-date as LS mirrors, due to their different purposes.
Each mirror that is created can have a replication schedule associated with it, which determines when (cron) or how often (interval) the replications are performed on this mirror. All LS mirrors of a volume are treated as a unified group; they use the same schedule, which is enforced by the UI — that is, if you choose a different schedule for one LS mirror, it applies to all of them. A DP mirror is not forced to use the same schedule as any other mirror.
All replication is done directly from the R/W volume to the appropriate mirrors. This is different from the cascading that
occurs within Data ONTAP 7G.
Creating a mirror, associating it with a source volume, and replicating to it are separate steps.
An LS or DP mirror can be promoted (like a restore using SnapRestore technology) to take the place of its R/W
volume.
The purpose of LS mirrors is to offload read activity from volumes (and from a single D-blade). As such, it is very important that all mirrors are in sync with each other (at the same data version level). When a replication is performed of a volume to its LS mirrors, all LS mirrors of the volume are synced together and directly from the volume (no cascading).
The way that an NFS mount is performed on a client, or which CIFS share is mapped, makes a difference in what data is accessed (the read/write volume or one of its mirrors). The normal method of mounting the root of a virtual server (vserver), for example, is mount <ip address>:/ /myvserver. This will cause the LS selection algorithm to be invoked. If, however, the NFS mount is executed using the ".admin" path, as in mount <ip address>:/.admin /myvserver, this mount from the client will always access the R/W volumes when traversing the namespace, even if there are LS mirrors for volumes. For CIFS, the difference is not in how a share is accessed, but in what share is accessed. If a share is created for the ".admin" path, then use of that share will cause the client to always have R/W access. If a share is created without using ".admin," then the LS selection algorithm will be used.
Clients are transparently directed to an LS mirror for read operations, rather than to the read/write volume, unless the special ".admin" path is being used.
•Physically the same as DP mirrors, but managed differently
•Primarily used for load balancing of read requests
•LS mirrors of a single source volume are managed as a single group (the mirrors are always in sync with each other)
•Can be on nodes other than the source volume's node, and also on the same node
•Automatically available in the namespace
•Implicitly accessed by clients (for read access)
•A round-robin algorithm is used to select an LS mirror, unless there is an LS mirror on the node whose N-blade is fielding the request
•Requires the special ".admin" mount (NFS) or share (CIFS) to access the R/W volume once the R/W has been replicated to the mirror(s)
•A request coming in on Yoda always stays on Yoda and goes to its local LS mirror
•A request coming in on Kenobi always stays on Kenobi and goes to its local LS mirror
•A request coming in on Luke will go to the LS mirror on either Yoda or Kenobi
When the / path is used (that is, the “/.admin” path is not used) and a read or write request comes through that path
into the N-blade of a node, the N-blade first determines if there are any LS mirrors of the volume that it needs to
access. If there aren’t any LS mirrors of that volume, the read request will be routed to the R/W volume. If there are LS
mirrors of it, preference is given to an LS mirror on the same node as the N-blade that fielded the request. If there isn’t
an LS mirror on that node, then an up-to-date LS mirror from another node is chosen.
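A minimal sketch of that selection rule (illustrative Python with invented data structures, not the actual N-blade code):

```python
import itertools

class LsSelector:
    """Pick an LS mirror: prefer the local node, else round-robin the rest."""

    def __init__(self, mirror_nodes):
        self._mirrors = list(mirror_nodes)          # nodes hosting LS mirrors
        self._round_robin = itertools.cycle(self._mirrors or ["rw-volume"])

    def pick(self, requesting_node):
        if not self._mirrors:
            return "rw-volume"        # no LS mirrors: route to the R/W volume
        if requesting_node in self._mirrors:
            return requesting_node    # local LS mirror is always preferred
        return next(self._round_robin)  # otherwise round-robin across nodes

selector = LsSelector(["yoda", "kenobi"])
print(selector.pick("yoda"))  # prints: yoda  (stays local)
print(selector.pick("luke"))  # yoda or kenobi, alternating per request
```

This mirrors the behavior in the bullets above: requests landing on a node with a local LS mirror stay there, while requests from other nodes rotate across the available mirrors.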
If a write request goes to an LS mirror, it will return an error to the client, indicating that this is a read-only file system.
To write to a volume that has LS mirrors, the “.admin” path must be used.
For NFS clients, an LS mirror is used for a set period of time (minutes), after which a new LS mirror is chosen. Once a file is opened, different LS mirrors may be used across different NFS operations. The NFS protocol can handle the switch from one LS mirror to another.
For CIFS clients, the same LS mirror will continue to be used for as long as a file is open. Once the file is closed and the period of time expires, a new LS mirror will be selected prior to the next file open operation. This is done because the CIFS protocol cannot handle the switch from one LS mirror to another.
If a load-sharing mirror is lagging behind the most up-to-date load-sharing mirror in the set, the exported-snapshot field will show a dash (-)
When a client accesses a junction, the N-blade detects that there are multiple MSIDs and will direct the packet to one of the LS mirrors. It will prefer an LS mirror on the same node.
Since the LS mirror is the default volume, there is a separate path to access the R/W volume: /.admin
Each vserver has an entry in the root called .admin. It can only be accessed from the root. When passing through this path, all packets will be directed to the R/W volumes.
•Very similar to 7G asynchronous mirroring
•Primarily used as online backups: reliable sync, easy restore
•Consider using inexpensive, high-capacity (and slower) SATA disks for DP mirrors
•Point-in-time read-only copies of volumes, preferably on separate nodes from the source volumes
•Not implicitly accessed by clients, but can be “mounted” into the namespace
•Replicated independently of the LS mirrors and of each other (other DP mirrors of the same source volume)
Replication in BR is through Paloma (the logical replication engine). This will change again to a block replication engine in Rolling Rock.
Volume mirror (SnapMirror) relationships need to be removed and recreated during the upgrade from ONTAP GX 10.0.4 to 8.0.
The original goal was to convert existing volume mirrors to Paloma, but this was not implemented. The same applies to upgrades from 8.0 to 8.1.
The Administrative function in the M-Host is responsible for maintaining the peer-to-peer relationships and scheduling
transfers in the context of a relationship. It also ensures that no more than the maximum permissible transfers can be
initiated at any point in time.
The Data Mover function in the D-blade is responsible for differencing the identified snapshots at the source and transferring the data over the wire in conformance with the relevant protocol. The Data Mover function at the destination then lays out the data at the destination Data Container object appropriately. The limit on the maximum amount of data that can be transferred in the context of a transfer session is also enforced by the Data Mover engine.
Mirrors use schedules directly, whereas Snapshot copies are controlled by Snapshot policies, which in turn contain schedule(s). For mirrors, the schedule is defined as part of the mirror definition. For Snapshot copies, the Snapshot policy is defined as part of the R/W volume definition.
The schedules are maintained under the job command directory. There are some schedules defined by default, as
shown in this example. If a mirror is assigned the 5min schedule, for example, the mirror will be replicated every five
minutes, based on the system clock. If a Snapshot policy uses the hourly schedule, a Snapshot copy will be created
at five minutes after every hour.
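The "five minutes after every hour" behavior of the hourly schedule amounts to the following (a simplified sketch; the real scheduler is cron-driven via job schedule commands):

```python
from datetime import datetime, timedelta

def next_hourly_run(now):
    """Next firing time for a schedule that runs at minute 5 of every hour."""
    candidate = now.replace(minute=5, second=0, microsecond=0)
    if candidate <= now:          # minute 5 already passed this hour
        candidate += timedelta(hours=1)
    return candidate

print(next_hourly_run(datetime(2011, 3, 1, 14, 30)))  # prints: 2011-03-01 15:05:00
```

A 5min interval schedule, by contrast, would simply add five minutes to the last run time, which is why it ticks "based on the system clock" rather than on a fixed minute.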
Here we see that the volume called root has three mirrors—two DP mirrors and one LS mirror.
The instance view of the root_ls2 mirror shows the aggregate on which the mirror lives, when it was last replicated, as
well as other information.
There are no native "backup" or "restore" commands. All tape backups and restores are done through third-party NDMP applications.
Consider the use of a stretched cluster to have a cluster that is geographically distributed, in case a disaster hits one site. The data can be mirrored to, and backed up at, a secondary site.
Please refer to your Exercise Guide for more instructions.
Below are the steps for Volume Copy in the cluster:
Create the destination volume
Create an initial snapshot on the source volume
Read all of the snapshots that exist on the source volume
Begin copying snapshots over to the destination, one by one
- The copy job will put an ownership tag on all of the snapshots
- The snapshots are copied from oldest to newest
Once all snapshots are transferred, delete the initial snapshot
Convert the destination volume into an R/W volume
Note: It is a best practice to avoid deleting snapshots during a volume copy
Create the destination volume
Create an initial reference snapshot on the source volume
The move job will put an ownership tag on this initial snapshot
Begin copying the snapshots over to the destination serially
The snapshots are copied from oldest to newest
Since this is a non-disruptive move, the admin can create new snapshots on the source
volume while the job is running
After the first snapshot completes, only changed blocks are transferred
Once all of the snapshots have been transferred, the move job will check to see how much
data has been written to the active file system since the last snapshot
If the delta is large, then the move job will create a new snapshot and transfer this data
to the destination.
Process is then repeated
After completion, snapshot tags are then transferred
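The shrinking-delta loop in the steps above can be modeled roughly as follows (a toy simulation with invented numbers and names, not the actual move job):

```python
def simulate_move(baseline_bytes, writes_during_pass, small_delta):
    """Copy the baseline, then ever-smaller deltas, until the delta is small."""
    transferred = baseline_bytes        # initial reference snapshot copy
    passes = 1
    delta = writes_during_pass          # data written while the baseline copied
    while delta > small_delta:
        transferred += delta            # incremental snapshot transfer
        passes += 1
        # each pass is shorter, so less new data accumulates during it
        delta = writes_during_pass // passes
    transferred += delta                # final transfer, done under I/O fencing
    return passes + 1, transferred

passes, total_bytes = simulate_move(1000, 100, 10)
print(passes, total_bytes)  # prints: 11 1291
```

The point of the model is the convergence: because each incremental pass takes less time than the previous one, less data changes during it, so the final fenced transfer is small enough to fit inside client protocol timeouts.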
The next step is to move into a "lockdown" state
At this point in time, the job will quiesce the volume, fencing off I/O
The job will then take one final snapshot and transfer the remaining data
Once the final transfer is completed, the job moves into the final commitment phase
The source volume is quiesced on disk
The destination volume MSID is changed to that of the original source
The VLDB content for the source and destination volumes is swapped
The I/O fencing period is now over
Finally, the move job moves into a finish state and the following occurs:
Delete the original source volume
Delete any snapshots from the destination that were created by the move job
Remove any move-job-related ownership from the snapshots on the volume
The data copy to the new volume is achieved by a series of copies of the snapshots – each time copying a diminishing delta from the previous snapshot copy.
Only in the final copy is the volume locked for I/O while the final changed blocks are copied, and the file handles are updated to point to the new volume. This should easily complete within the default NFS timeout (600 seconds) and almost always within the CIFS timeout period of 45 seconds. In some very active environments, sufficient data will have changed that it will take a longer period of time to copy than the timeout period. In this case, the end user sees the same effect as if the drive had become disconnected – they will simply need to reconnect to the share and retry the operation.
Architecting a storage system for a point in time is quite straightforward. The
requirements are known for that point in time and the storage system is selected to
meet those requirements.
However, requirements change. The requirements that were true on day 1 may no
longer be true on day 90. Therefore, you need a storage system that allows
rebalancing without requiring downtime for your applications.
Let’s consider a hypothetical example. The diagram shows a Cluster-Mode
system that was purchased for new projects. The initial usage projections were
incorrect and over time the volumes hosted by the controllers in the middle have
grown much more quickly than the controllers on the outside. This has led to a serious imbalance. Moving volumes is very simple, and with a little bit of preventative oversight that need never have happened, but sometimes there's just not enough time to do everything all the time. With Cluster-Mode, it is easy to rebalance even a severely imbalanced system. The administrator decides where the various volumes ideally belong, and in a few minutes can queue up all the volume move jobs needed to rebalance the system. The physical movement of the FlexVol in this manner has no impact on the namespace and is always 100% transparent to clients and applications. Some time later, all the jobs complete and the system is back in balance.
The same capability can be used to optimize performance for critical projects. In many types of work there are important "crunch times" where the project absolutely must complete by deadline. One option is to buy such a very large system that guarantees that any critical projects complete on time. That might be very expensive. An alternative possible with Cluster-Mode is to reallocate a smaller pool of available resources to prefer the critical project.
Let's assume that Project A is critical and will be the top priority starting next week. Volumes belonging to the other projects can be moved to other nodes to free up resources for the critical project. The other projects may now get less performance, but that's a trade-off you can control. The system is capable of fluidly adjusting with business cycles and critical needs.
Another Cluster-Mode capability is to transparently grow the storage system. The storage administrator
can add new nodes to the system at any time, and transparently rebalance existing volumes to take
advantage of the new resources.
Backups are the one thing in Data ONTAP clusters that is not cluster-aware. As such, the backup administrator needs to be aware of what volumes are on what nodes (determined by volume show queries by node), and the backups of the volumes on each node need to be performed through their respective nodes.
Backups can be done across the cluster using three-way NDMP, provided that the third-party backup application is given access to the cluster network.
It may be tempting to assume that a backup of a volume includes all the volumes that are mounted under it, but that's not the case. NDMP backups do not traverse junctions. Therefore, every volume that is to be backed up needs to be listed explicitly. The exception to that is if the backup vendor software supports auto-discovery of file systems, or supports some sort of wildcarding.
Although backing up through an NFS or CIFS client is possible, doing so would utilize all the cluster resources that are meant to serve data, as well as filling the N-blade caches with data that most clients aren't actually using. The best practice is to send the data through a dedicated Fibre Channel connection to the tape device(s) using NDMP, as this doesn't tax the N-blade, data network, or cluster network. But using NFS or CIFS is the only way (at this time) to back up full-striped volumes. The legacy GX data-striped volumes can be backed up through NDMP.
Please refer to your Exercise Guide for more instructions.
User space core dumps are named according to the process name (for example, mgwd) and also use the process ID (PID) of the instance of the process that generated the core file.
Kernel core dumps include the sysid, which is not the node name but a numerical representation of the node. The date and time in the core dump name indicate when the panic occurred.
When a node panics, a kernel core dump will be generated. There are times, however, when a node is up and running but having issues that cannot be debugged live. NetApp Global Support may request that a system core dump be generated for one or multiple nodes to capture the complete picture of what is happening at that time. If a node is healthy enough to issue UI commands, then a system reboot command can be entered with the -dump true parameter. If a node is not healthy enough for that, then from the RLM session to that node, the system core RLM command can be used to generate a core dump.
RLM is an out-of-band connection to a node that allows for some management of a node even when it is inaccessible from the console and UI. The RLM connection has a separate IP address and has its own shell. Some sample RLM commands are system power off, system power on, system reset, and system console.
Core files are meant to be examined by NetApp Global Support and should be reported and uploaded to NetApp Global Support. The default location to which core dumps should be uploaded (as shown through system coredump config show) is ftp://ftp.netapp.com/to-ntap/.
The cluster is maintained by constant communication over the cluster network. As such, the cluster network must be
reliable. One of the first things to check when there are problems is the health of the cluster.
Each node writes to log files locally on that node. Those log files are only local and do not contain log messages from
the other nodes. Log messages also are written to the Event Management System (EMS), and that enables an
administrator on one node (using the UI) to see the event messages from all nodes in the cluster.
ASUPs are also a great way to get the log files.
While a node is booting, and until the vol0 volume is available, all logging goes to /var/log/. After vol0 is available, the
logging goes to /mroot/etc/log/.
Each process has its own log file, for example, mgwd.log, vldb.log, and vifmgr.log.
Log files are rotated, and the older rotated files are numbered, for example, vldb.log.1, vldb.log.2, and so on.
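That numbered-rotation scheme can be sketched like this (illustrative only; the path and the keep count are invented, and this is not the actual logging code):

```python
import os

def rotate(path, keep=5):
    """Shift path.1 -> path.2, ..., then move the live log file to path.1."""
    for n in range(keep - 1, 0, -1):       # renumber from highest to lowest
        old = "%s.%d" % (path, n)
        if os.path.exists(old):
            os.replace(old, "%s.%d" % (path, n + 1))
    if os.path.exists(path):
        os.replace(path, path + ".1")      # a fresh empty log starts next write
```

After each rotation, vldb.log.1 is always the most recent completed log, which is why the numbered suffixes read newest-to-oldest.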
EMS messages are available to be viewed through the UI. The D-blade, N-blade, and Management event log
messages go to the EMS log. The EMS log is rotated once a week, at the same time that the AutoSupport messages
are sent out.
The tail command will print the last few lines of a file to the screen. The -f flag causes it to continuously refresh that output as new data is written to the file. Using tail -f on a log file is a great way to watch the logging as it happens. For example, if you run a command in the UI and get an error, you could open up another window to that node, run the tail -f command on the log file that you think may provide more information for this error, and then go back to the other window/browser and run the UI command again. This helps to establish the cause-and-effect relationship between a UI command and a log message.
Logs live on the mroot. You may access logs by logging into the FreeBSD shell on the system that is running Clustered ONTAP. Logs are located in /mroot/etc/log. You may copy individual logs to another system using the secure copy command 'scp' from the FreeBSD shell.
Beware that ps and top can be a bit confusing, due to the way the schedulers operate. From FreeBSD, the CPUs will look 100% busy, but that's because they're actually being managed by a scheduler other than the normal FreeBSD scheduler.
If you run the ps command and don’t see processes like vldb, mgwd, or vifmgr, then something is wrong. For
example, if the vldb process is not running, you’ll want to look at the vldb.log* files to see if there is an indication
of what happened.
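That check can be sketched as a loop over the expected process names; here it greps a canned ps-style listing so the sketch is runnable, while on a real node you would pipe `ps ax` instead (the process names are the ones from the notes):

```shell
# Canned ps-style listing for the demo; on a node, replace with: ps ax
ps_output='  101 ??  Ss  0:01.00 mgwd
  102 ??  Ss  0:00.50 vldb
  103 ??  Ss  0:00.40 vifmgr'
for proc in mgwd vldb vifmgr; do
  if echo "$ps_output" | grep -q "$proc"; then
    echo "$proc: running"
  else
    echo "$proc: MISSING - check the ${proc}.log* files"
  fi
done
```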
•A process has to register with the Service Process Manager (SPM) to be managed.
•If the process is not running, SPM will restart it. SPM generates an EMS message when a process dies.
•If a process has reached its threshold number of restarts, SPM shifts to interval-based restarts and generates an
ASUP. Currently the threshold is 10 restarts in an hour.
•Interval-based restarts range from 5 minutes to a maximum of 60 minutes per process. Each interval is twice the
previous one, for example, 5 minutes, then 10 minutes, then 20 minutes, up to 60 minutes. After the first 15 (10 + 5)
retries, any further retries happen once every hour.
•The process manager never gives up managing a process.
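The doubling schedule described above can be sketched as a small loop; the 5-minute start and 60-minute cap are the values stated in the notes:

```shell
# Print SPM's interval-based restart schedule: double each time, cap at 60 minutes.
interval=5
while [ "$interval" -lt 60 ]; do
  echo "next restart attempt in ${interval} minutes"
  interval=$((interval * 2))
done
echo "next restart attempt in 60 minutes (cap reached; hourly thereafter)"
```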
vreport is a tool that scans the vldb and the dblade for differences and reports them.
The output is generated for differences in aggregates, volumes, junctions, and snapmirrors.
Once the differences are generated, vreport provides the option to fix any of them. This tool does not have
the authority to change values in the d-blade; using the fix option on any difference modifies ONLY the vldb to be
consistent with the d-blade.
The cluster commands are a quick check as to the health of the cluster. Remember that a 2-node cluster needs to
have “2-node HA” enabled. If this step is forgotten, problems will arise, especially during storage failover (SFO).
The cluster ping-cluster command is a great way to make sure that all the cluster ports and cluster logical
interfaces (LIFs) are working properly.
For the most part, these commands are self-explanatory. Most show commands give you a good picture of what’s
happening in a particular area of the cluster. Also, most show commands have some powerful query capabilities that, if
you take the time to learn, can greatly pinpoint potential problems.
In the volume show -state !online command, the exclamation point means “not” (negation). So, this command
shows all volumes whose state isn’t online. Besides online and offline,
there are other states that you’ll want to know about.
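The effect of the !online query can be illustrated with a plain-text filter over a canned volume listing; the vserver and volume names here are made up:

```shell
# Keep rows whose third column (state) is anything but "online",
# mimicking the negation in: volume show -state !online
volumes='vs1 vol_a online
vs1 vol_b offline
vs2 vol_c restricted'
echo "$volumes" | awk '$3 != "online" {print $2, $3}'
```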
When the aggregates of one node fail over to the high-availability (HA) partner node, the aggregate that contains the
vol0 volume of that node goes too. Each node needs its vol0 to boot, so when the rebooted node begins to boot, the
first thing that happens is that it signals the partner to do a giveback of that one aggregate and then it waits for that to
happen. If SFO is working properly, giveback will happen quickly, the node will have its vol0 and be able to boot, and
then when it gets far enough in its boot process, the rest of the aggregates will be given back. If there are problems,
you’ll probably see the rebooted node go into a “waiting for giveback” state. If this happens, it’s possible that its
aggregates are stuck in a transition state between the two nodes and may not be owned by either node. If this
happens, contact NetApp® Global Support.
It is a best practice for the vol0 aggregate to contain only the vol0 volume. The reason for this is because during
giveback the vol0 aggregate must be given back to the home node, and must complete that giveback, before any other
aggregates can be given back. The more volumes that are on that aggregate, the longer the vol0 giveback will take,
and thus the longer the delay before all the other aggregates can be given back. The exception to this best practice
would be during an evaluation or proof of concept, where a configuration may only contain one or two disk shelves per
node.
We want to make sure that all the network ports are OK, including the cluster ports. If those are fine, then take a look
at the LIFs. Make sure they’re working properly, and make note of which ones are home and which ones aren’t. Just
because they’re not home doesn’t mean that there is a problem, but it might give you a sense of what’s been
happening.
pcpcon g , pcpcon g ,
not a clustershell command).
WAFLtop allows customers to map the utilization at a higher level to their applications based on the volume or the type
of client/internal protocol, and possibly use this information to identify the source of bottlenecks within their systems.
The command is also useful internally to monitor performance and identify bottlenecks.
The most common use case for WAFLtop can be described as follows:
1. Customer sees some degradation in performance, in terms of throughput or response time, on their system.
2. Customer wishes to determine if there is a volume or a particular application or client protocol that is consuming
resources in a way that leads to the degradation.
3. Customer can look at sysstat and other utilities to determine overall system usage of resources.
4. Customer can additionally look at the output of WAFLtop to determine the topmost consumers of various system
resources. Based on this information, the customer may be able to determine the cause of the degradation.
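The “topmost consumers” idea in step 4 amounts to ranking resource users; a toy version of that ranking, with made-up volume names and op counts:

```shell
# Rank volumes by ops (descending) and show the top consumer.
stats='vol_db 5200
vol_home 310
vol_logs 47'
echo "$stats" | sort -k2 -rn | head -n 1
```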
vmstat -w 5
will print what the system is doing every five seconds; this is a good printing interval, since this is how
often some of the statistics are sampled in the system. Others vary every second, and running the output for a
while will make it apparent which are recomputed every second.
node4# vmstat -w 5
procs memory page disk faults cpu
r b w avm fre flt re pi po fr sr ad0 in sy cs us sy id
1 1 0 145720 254560 20 0 0 0 18 0 0 463 433 531 0 100 0
0 1 0 145720 254580 46 0 0 0 33 0 0 192 540 522 0 100 0
6 1 0 145720 254560 38 0 0 0 32 0 0 179 552 515 0 100 0
0 0 0 145164 254804 7 0 0 0 13 0 0 174 297 468 0 100 0
0 0 0 145164 254804 0 0 0 0 0 0 0 182 269 512 0 100 0
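Since the us/sy/id CPU columns are the last three fields of each data row, they can be pulled out with awk; shown here against one of the sample rows above:

```shell
# Extract user/system/idle CPU percentages from a vmstat data row.
row='1 1 0 145720 254560 20 0 0 0 18 0 0 463 433 531 0 100 0'
echo "$row" | awk '{print "us=" $(NF-2), "sy=" $(NF-1), "id=" $NF}'
```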
Please refer to your Exercise Guide for more instruction.
The rdb_dump utility is a tool that is run from the systemshell. It gives us a cluster-wide view of which RDB units are
healthy and which aren’t. If any are not healthy, rdb_dump might give a decent picture of which ring member (node) is
not healthy. But if the node on which this command is being invoked is the unhealthy one, then it’ll just look like
everything is bad, which is misleading.
Running rdb_dump repeatedly is also useful,
allowing you to see if something is going in and out of quorum.
This section of output from rdb_dump shows three RDB units (Management, VifMgr, and VLDB) of a 2-node cluster.
From this partial output, we can see that the first two units are healthy. If one or more of the nodes were in an offline
state, it would indicate some issues that are affecting the RDB units, most likely cluster networking issues.
Notice that this concise view does not show you the names of the nodes, although you can tell that the ID 1001 is this
local node.
This rdb_dump -f output is not quite as concise as rdb_dump, but it shows useful information, including the
correlations between the IDs and the host names.
To see more rdb_dump options, run rdb_dump -help.
Given the correct configuration, the health information summarizes the status of the replication group.
Health information obtained from the master is always the most accurate. There is a slight delay in the propagation of
secondary information to other secondaries, but they will come into agreement.
Please refer to your Exercise Guide for more instruction.
There are specific category values that can be used. The object parameter can be used to specify a particular
instance, for example, a volume name or aggregate name. There are a number of counter values that can be used.
Notice that some use a hyphen within the string while others use an underscore. Running stat show with no other
parameters will show all possible categories and counters.
You can narrow down your output by being more specific with your query. You cannot narrow things down by virtual
server, as the statistics command is either cluster-wide or node-specific. Because volume names only have to be
unique within a virtual server, if you query on a volume name (using the object parameter), you may see multiple
volumes with the same name, perhaps even on the same node.
The category parameter has a finite set of keywords, as shown here.
This output shows statistics specifically for all NFS “categories” and only for this node.
A typical CIFS log message shows the error code (316), as well as the file and line number in the Data ONTAP® 8.0
source code that issued the error message. From the systemshell, there is a tool called printsteamerr that will
translate the error code into a more useful error string. In this example, code 316 gets translated into “cifs: access
denied.”
Although tcpdump is no longer used to capture traces, it is still used to look at the pktt traces.
We generally want to see the same network bandwidth coming in and going out. A good rule to follow is to match the
cluster and data network bandwidth; that is, if using four cluster ports, then use four data ports. We have some
guidelines on the number of ports to use for maximum performance, and these take cached workloads into account as well.
Noncached workloads can probably decrease the port count by one data port per node (as compared to the number of
cluster ports).
Using LS mirrors of vserver root volumes is very important for high-availability access to the other volumes in the
namespace. As such, LS mirroring does not require a separate license, but rather is included in the Base license. A
best practice is to create LS mirrors of the vserver root volume and to situate one on each node, including the node
that contains the vserver root volume itself.
“Splitting” a volume is a manual process. For example, if a volume has two directories to which many writes are being
sent, such that the volume has become a hot spot, then that volume can be divided into two volumes. With a new
volume on another node, the contents of one of those directories can be moved (by way of NFS or CIFS commands)
into the new volume, and then that new volume can be mounted into the namespace at the same point as the original
directory. The clients would use the same path to write the data, but the writes would go to two separate volumes
rather than one.
As part of benchmarking, it’s important to understand the capabilities and limitations of a single client within the context
of the benchmark. This data will allow for a better understanding of the results when a group of clients are running the
benchmark.
The number of nodes in a Data ONTAP cluster has everything to do with scaling performance higher, while having
very little effect in terms of overhead. NetApp refers to this as “near-linear performance scalability.” This means that a
2-node cluster has about twice the performance of a single node, and a 4-node cluster has about twice the
performance of a 2-node cluster, and so on.
How the data is distributed in the cluster is a big deal. Variables like how the namespace is distributed across nodes
and how the striped member volumes are distributed have a major impact on performance. Some customers do not
take advantage of spreading out volumes (and work) across nodes, and instead configure one large volume per node
(7-mode style). Doing something like this will negate many of the performance benefits of the cluster-mode
architecture.
SIO and IOzone are multithreaded benchmarking tools, while dd and mkfile are not. The tools that are not
multithreaded may not be able to accurately simulate a needed environment.
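For a sense of what a single-threaded tool like dd does, here is a small local sequential-write example; the file path and size are arbitrary for the demo, and real benchmarking would target a mount of the cluster with far larger sizes:

```shell
# Single-threaded sequential write: one stream, one file.
dd if=/dev/zero of=/tmp/ddtest.bin bs=1048576 count=4 2>/dev/null
wc -c < /tmp/ddtest.bin   # 4 MB written
rm -f /tmp/ddtest.bin
```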
The dashboard commands provide quick views of the nodes and the cluster.
The following example shows detailed performance-dashboard information for a node named node13:
node::> dashboard performance show -node node13
Node: node13
Average Latency (usec): 624us
CPU Busy: 84%
Total Ops/s: 27275
NFS Ops/s: 27275
CIFS Ops/s: 0
Data Network Utilization: 0%
Data Network Received (MB/s): 0
Data Network Sent (MB/s): 0
Cluster Network Utilization: 0%
Cluster Network Received (MB/s): 0
Storage Read (MB/s): 0
Storage Write (MB/s): 0
CIFS Average Latency: 0us
NFS Average Latency: 624us
By default, the performance dashboard displays the following information about system and cluster performance:
•Node name or cluster summary
•Average operation latency, in microseconds
•Total number of operations
•Percentage of data network utilization
•Data received on the data network, in MB per second
•Data sent on the data network, in MB per second
•Percentage of cluster network utilization
•Data received on the cluster network, in MB per second
•Data sent on the cluster network, in MB per second
•Data read from storage, in MB per second
•Data written to storage, in MB per second
The command can display a wide range of performance information; see the reference page for the command for
further details.
This performance view can be used in conjunction with statistics show –node <node> –category
<category> to get more detailed statistics.
The command can display a wide range of information about storage utilization and trend; see the reference page for
the command for further details.
The following example shows storage utilization trend information for all aggregates during the past seven days:
node::> dashboard storage show -week
                                ~1 day      ~2 days     ~3 days     ~7 days
Aggregate Size     Used    Vols Used  Vols Used   Vols Used   Vols Used   Vols
--------- -------- ------- ---- ----- ---- ------ ---- ------ ---- ------ ----
node1_aggr0
          113.5GB  99.91GB 1    620KB 0    1.18MB 0    1.77MB 0    4.36MB 0
node1_aggr2
          908.3GB  50.00GB 1    4KB   0    12KB   0    16KB   0    40KB   0
node2_aggr0
          113.5GB  99.91GB 1    612KB 0    1.13MB 0    1.68MB 0    4.02MB 0
node3_aggr0
          229.1GB  109.9GB 2    648KB 0    1.23MB 0    1.84MB 0    4.34MB 0
node3_aggr1
          . .
node4_aggr0
          229.1GB  99.92GB 1    624KB 0    1.18MB 0    1.74MB 0    4.06MB 0
node4_aggr1
          687.3GB  90.08GB 8    56KB  0    108KB  0    164KB  0    436KB  0
7 entries were displayed.
By default, the storage dashboard displays:
•Aggregate name
•Aggregate size, in GB
•Aggregate available space, in GB
•Aggregate used space, in GB
•Percentage of space used
•Number of volumes
•4-hour change in used size
•4-hour change in number of volumes
•8-hour change in used size
•8-hour change in number of volumes
•Operational status
Perfstat executes a number of different commands and collects all the data.
There are a number of commands that are familiar to those who are accustomed to Data ONTAP 7G, available at the
nodeshell.
The statistics show command displays performance statistics on a per-node basis.
The output of the command includes the following information:
• Node name
• Statistic category name
• Statistic instance name
• Statistic counter name
• Current statistic value
• Delta from last checkpoint
node3::> statistics show
Node: node3
Category.Object.Counter             Value         Delta
----------------------------------- ------------- ------------
node.node.cifs-ops                  0             -
node.node.cluster-busy              0%            -
node.node.cluster-recv              4.62GB        -
node.node.cluster-sent              9.66GB        -
latency.latency.mount-latency       67822us
...

node3::> statistics show -node node3 -category latency -counter cifs-ops
Node: node3
Category.Object.Counter           Value         Delta
--------------------------------- ------------- ----------
latency.latency.cifs-ops          0             -
Under the “admin” privilege, the statistics show –category processor command shows a basic view of the
utilization of each processor of a node.
Under the “diag” privilege, the statistics show –node mowho-05 –category processor –object
processor1 command shows a detailed view of the utilization of processor1 of node mowho-05.
The statistics periodic command runs until Ctrl-C is pressed, with each line of output reporting the stats since
the previous line of output (interval). The default interval is one second. When Ctrl-C is pressed, some summary data
is presented.
This output can tell you a lot. If the “cluster busy” values are near zero, it’s a good indication that the user data isn’t
being sent over the cluster links. The same is true if “cluster recv” and “cluster sent” values are in the KB range. So, if
there are ops going on with no data being sent over the cluster network, it shows that data is being served locally, like
when a lot of reads are being done to LS mirrors that are on the same nodes as the data LIFs being accessed by the
clients. When cluster traffic is happening, the “cluster recv” and “cluster sent” values will be in the MB range.
Some other good options to use with this same command are:
statistics periodic –category latency –node <node>
statistics periodic –category volume –node <node> –interval 1
statistics show –node <node> –category volume –object sio –counter *latency (This wildcard shows all the different
latency counters.)
The following example will print what the system is doing every five seconds. Five seconds is a good interval since this
is how often some of the statistics are sampled in the system. Others vary every second and running the output for a
while will make it apparent which are recomputed every second. Check the Web for vmstat usage information.
node4# vmstat -w 5
procs memory page disk faults cpu
r b w avm fre flt re pi po fr sr ad0 in sy cs us sy id
1 1 0 145720 254560 20 0 0 0 18 0 0 463 433 531 0 100 0
0 1 0 145720 254580 46 0 0 0 33 0 0 192 540 522 0 100 0
To analyze the statit results, refer to the 7G man pages. The output is the same in 8.0 as it is in 7G.
The CIFS server is essentially divided into two major pieces: the CIFS N-blade protocol stack and the Security
Services (secd) module.
Please refer to your Exercise Guide for more instructions.
For example, an explicit NFS license is required (it was not previously required with GX). Mirroring requires a
new license.
From any node in the cluster, you can see the images from all other nodes. Of the two images on each node, the one
that has an Is Current value of true is the one that is currently booted. The other image can be booted at any time, provided that the
release of Data ONTAP® 8.0 on that image is compatible with that of its high-availability (HA) partner and the rest of the cluster.
In the process of managing a cluster, it will probably be necessary to scale out at some point. Clusters provide a
number of ways to do this, many of which you’ve seen and performed already. This is a recap of some of the ways that
a cluster can scale.
If it’s determined that an aggregate needs to be re-created, or if the disks are needed for another purpose (for example, to grow a
different aggregate), you may need to delete an aggregate. The volumes need to be removed from the aggregate first, which can be accomplished by volume move if you don’t want to delete them.
Event messages beginning with callhome.* are a good collection to configure for initial monitoring. As the customer
becomes more familiar with the system, individual messages can be added or removed as required. The callhome.*
event names are the same events that trigger an AutoSupport message.
Events that begin with callhome.* are configured with the internal "asup" destination, which cannot be removed. Use the
"- - "
The definitions of the event severity levels are:
•EMERGENCY: The system is unusable.
•ALERT: Action must be taken immediately to prevent system failure.
•CRITICAL: A critical condition has occurred.
•ERROR: An error condition has occurred.
•WARNING: A warning condition has occurred.
•NOTICE: A normal but significant condition has occurred.
•INFORMATIONAL: An informational message.
•DEBUG: A debugging message.
There is only one event configuration. The named event destinations must be created or modified appropriately,
for example, to indicate which e-mail address certain event notifications should be sent to. Event routes are
associations between predefined event messages and event destinations. You enable notification of a message to a
destination by modifying the message's destination value. This can also be done for many events at once by using a regular
expression when specifying the event name in the event route modify command.
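As a sketch of that workflow (the destination name and e-mail address are hypothetical, exact parameters vary by release, and the # lines are annotations, not commands):

```
# Create a named destination that e-mails the storage administrator
cluster1::> event destination create -name admin-mail -mail admin@example.com

# Route every callhome.* event to that destination in one command
cluster1::> event route modify -messagename callhome.* -destinations admin-mail
```

Some releases also provide an event route add-destinations command, which appends a destination instead of replacing the existing list.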
Event destinations can also be created for SNMP and syslog hosts. The SNMP-capable events can be listed with the
"event route show -snmp true" command.
Event routes have nothing to do with network routes but are merely associations between event messages and
destinations.
AutoSupport is NetApp's "phone home" mechanism that allows NetApp products to perform automated configuration, status,
and error reporting. This data is then used in a variety of critical ways:
It provides a wealth of data that can be mined for real-world issues and usage. This is especially valuable to
product management and engineering for product planning and for resolving case escalations. We also use this
.
It provides current status and configuration information for NGS (and customers), who use this information for case
resolution, system health-check and audit reporting, system upgrade planning, customer system inventory
reporting, and many other creative uses.
AutoSupport is now an M-host (user-space) process called notifyd.
It collects information from the D-blade, from the management gateway (mgwd), from BSD commands, and from files.
To fire a user-triggered AutoSupport message with TEST in the subject line:
cluster-mode: system node autosupport invoke <nodename>
Additional options are available in cluster mode, and the node name can be wildcarded.
The support transport of HTTP should not be modified if at all possible. Using the HTTP POST method for ASUP ensures
the highest level of reliability for AutoSupport messages, which can contain large log attachments.
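A sketch of the invocation (the -type and -message options are assumptions from later command syntax and may differ by release; node1 is a hypothetical node name):

```
# Trigger a test AutoSupport from a single node
cluster1::> system node autosupport invoke -node node1 -type test -message "TEST"

# The node name can be wildcarded to trigger it on every node at once
cluster1::> system node autosupport invoke -node * -type test -message "TEST"
```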
Users can be created to provide different access methods (SSH, HTTP, console), authentication mechanisms (password,
public key), and capabilities via profiles (admin, readonly, none, or user-defined).
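A hypothetical sketch (usernames are invented; depending on the release, the capability parameter may be spelled -role rather than -profile, and -role is used below):

```
# SSH user with password authentication and the admin capability
cluster1::> security login create -username storageadmin -application ssh -authmethod password -role admin

# Read-only user for HTTP access
cluster1::> security login create -username auditor -application http -authmethod password -role readonly
```

One entry is needed per application/authmethod combination, so the same username can appear in several rows of security login show.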
The statistics periodic command gives a good summary of operations within the cluster. It prints a line at
every given interval (the default is once a second) so that you can see real-time statistics.
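A sketch of the invocation (the -interval parameter name is an assumption and may vary by release):

```
# Print one summary line every 5 seconds instead of the 1-second default;
# press Ctrl-C to stop and print the summary row
cluster1::> statistics periodic -interval 5
```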
The dashboard commands are meant to give summaries of what’s going on in the cluster. In particular, the
dashboard performance show command gives a quick view of the nodes in this cluster.
ClusterView is a Web-based tool that graphically displays performance, usage, and health information from a Data
ONTAP cluster. ClusterView is implemented as a set of Adobe® Flash® Web pages that are served up from any node
in the cluster. The user points a Web browser to one particular node, which is referred to as the "serving node."
Dynamic content is constructed using performance, health, and resource utilization data that ClusterView periodically
fetches from the serving node. The serving node constructs this data by querying other nodes in the cluster as
appropriate.
This is a “dashboard” view.
This is a graphical representation of the space utilization: aggregates on the left and volumes on the right.
Complete manageability of Data ONTAP cluster systems will be provided by a combination of products. The scope of
each product is as follows:
•Operations Manager: discovery, monitoring, reporting, alerting, File SRM, Quota management
•Provisioning Manager: policy based provisioning of storage on cluster systems
•Protection Manager: policy based data protection and disaster recovery of cluster systems
•Performance Advisor: performance monitoring and alerting for cluster systems
Please refer to your Exercise Guide for more instructions.
Please refer to Appendix A in your Exercise Guide for answers.
Some notes about the root aggregates (one per node):
• The mroot volume of the node will reside (permanently) on that aggregate
• Only mroot volumes should be placed on these aggregates
• Improves resiliency
• Speeds up takeover and giveback
• Can use 2-disk RAID 4 if short on available disks
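A hypothetical sketch of creating such a root aggregate (aggregate and node names are invented, and the exact parameter names vary by release):

```
# Small dedicated root aggregate: 2 disks, RAID 4, on node1 only
cluster1::> storage aggregate create -aggregate aggr0_node1 -nodes node1 -diskcount 2 -raidtype raid4
```

RAID 4 with two disks (one data, one parity) keeps the disk cost of the dedicated root aggregate to a minimum.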
Upgrades can be staged by leaving the old image as the default, so that a reboot will not bring up the upgraded image.
Rolling upgrades of an HA pair are faster than parallel reboots.
Multiple nodes (only one per pair) can be rebooted in parallel, but be aware of quorum rules, which demand that
fewer than half of the nodes in a cluster be down or rebooting at any given time. Also, be aware of the LIF failover rules to
ensure that the data LIFs are not all failing over to nodes that are also being rebooted.
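A sketch of the staging step (node and image names are hypothetical, and the exact command spelling varies by release):

```
# See which image each node will boot next
cluster1::> system node image show

# Make the new image the default on a node only when you are ready to reboot into it
cluster1::> system node image modify -node node1 -image image2 -isdefault true
```

Until the -isdefault flag is flipped, an unplanned reboot brings the node back up on the old, known-good image.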
Certain management operations (for example, NIS lookup) happen over the management network port, which can be a
single point of failure. The cluster management LIF can take advantage of LIF failover functionality.
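A hypothetical sketch of enabling that failover (the LIF and failover-group names are invented, and the parameter may differ by release):

```
# Allow the cluster management LIF to fail over to management ports on other nodes
cluster1::> network interface modify -vserver cluster1 -lif cluster_mgmt -failover-group mgmt_failover
```

With a failover group spanning every node's management port, losing one port or node does not take down cluster management access.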
Be aware of the maximum number of volumes allowed per controller.
Keep the number of volumes per controller balanced across the cluster: distribute them evenly as they’re created, and
as things get unbalanced (due to volume deletions and volume size changes), use the volume move capability to
redistribute volumes accordingly.
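A sketch of that rebalancing step (volume and aggregate names are hypothetical):

```
# Inspect where volumes currently live
cluster1::> volume show -fields aggregate

# Move a volume, nondisruptively, to an aggregate on a less-loaded node
cluster1::> volume move start -vserver vs1 -volume projvol -destination-aggregate aggr_node2
```

Because the move is transparent to clients, rebalancing can be done during production hours.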
Snapshot copies should be turned on for striped volumes, even if it’s to keep only one Snapshot per volume. Snapshot
copies are used as part of the overall consistency story for striped volumes.
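A hypothetical sketch of keeping that minimum of one Snapshot copy (names are invented):

```
# Create a Snapshot copy on the striped volume
cluster1::> volume snapshot create -vserver vs1 -volume stripedvol -snapshot consistency0
```

A scheduled Snapshot policy on the volume achieves the same thing without manual intervention.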
Please refer to your Exercise Guide for more instructions.