
STORAGE BEST PRACTICES FOR ELECTRONIC DESIGN AUTOMATION
Best Practices for managing Dell EMC Isilon storage for EDA workflows

ABSTRACT
This paper describes best practices for setting up and managing an EMC Isilon cluster to store data for electronic design automation.

February 2017


The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable software license.

Copyright © 2017 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be the property of their respective owners. Published in the USA 02/17 White Paper H11909.1

Dell EMC believes the information in this document is accurate as of its publication date. The information is subject to change without notice.


TABLE OF CONTENTS

ABSTRACT
EXECUTIVE SUMMARY
EDA WORKFLOW CHARACTERISTICS
EDA WORKLOAD ANALYSIS
    Infrastructure
    Workload
DATA LAYOUT RECOMMENDATIONS
    OneFS Storage Efficiency
    Directory Structure and Layout
    OneFS Data Protection
    Node Hardware Recommendations
    Cluster Pool Size and Limits
    Small File Considerations
CLUSTER HARDWARE AND OS RECOMMENDATIONS
    Recommended OneFS Release
    Disk and Node Firmware Recommendations
    SSD to Hard Drive Ratios
    Infiniband Backend Redundancy
DATA TIERING, LAYOUT AND CACHING RECOMMENDATIONS
    Data Tiering
    Data Access and On-disk Layout
    Endurant Cache
    Attribute Optimization of Files and Directories
OPTIMAL USAGE OF SSD SPACE FOR EDA WORKLOADS
    SSD Strategies
    Global Namespace Acceleration
NETWORK RECOMMENDATIONS
    Connectivity Considerations
    Optimal Network Settings
    Subnet Segregation
    Connection-balancing and Failover Policies
    Dynamic Failover
    Multiple IP Addresses Per Interface
    Use One Hot Standby Interface Per Pool
    SmartConnect Pool Sizing
PROTOCOL RECOMMENDATIONS
    NFS Considerations
    32-Bit File IDs
    Client NFS Mount Settings
    Optimal Thread Count
    NFS Connection Count
    NFSv3 vs. NFSv4
    Synchronous and Asynchronous Export and Mount Options
DATA AVAILABILITY AND PROTECTION RECOMMENDATIONS
    Availability and recovery objectives
    Snapshot Considerations
    Replication Considerations
    NDMP Backup Considerations
    Direct NDMP
    Remote NDMP
DATA MANAGEMENT RECOMMENDATIONS
    Quota Considerations
PERMISSIONS, AUTH AND ACCESS CONTROL RECOMMENDATIONS
    NIS and Access Zones Best Practices
JOB ENGINE CONSIDERATIONS
CLUSTER MANAGEMENT RECOMMENDATIONS
PERFORMANCE TUNING
    Understanding Consolidated EDA Workflow
    Measuring Cluster Performance
    InsightIQ Implications
CLUSTER SIZING GUIDELINES
    Data Lake vs. Pod Architecture
CASE STUDIES
    Customer A
    Customer B
    EDA Performance Best Practices Checklist
SUMMARY


EXECUTIVE SUMMARY
The EMC Isilon platform combines a scalable hardware platform with a parallel software architecture to optimize data storage for electronic design automation (EDA). Isilon scale-out network-attached storage is a fully distributed, symmetrical system that comprises clustered storage nodes. The Isilon OneFS® operating system unifies the memory, I/O, CPUs, and disks of the nodes to present a single, linearly scaling file system.

Adding nodes adds capacity, performance, and resiliency to the cluster, and each node can process requests from EDA compute grid clients while taking advantage of the entire cluster's performance. The Isilon architecture contains no single master for the data, no concept of a controller head, and no RAID groups. The result is a highly efficient, scalable architecture that overcomes the problems inherent in traditional EDA storage.

One problem with traditional EDA storage architectures is that they create performance bottlenecks that worsen at scale. The controller is the primary bottleneck: Attaching too much capacity to the controller can saturate it. Storage system bottlenecks can negatively affect wall-clock performance for concurrent jobs, which in turn can lengthen the time to market for a new chip.

The distributed Isilon architecture eliminates the single-head CPU saturation point of a controller. A large number of concurrent jobs, which often generate substantial metadata operation loads, can run without saturating the storage system, shortening the time to market.

Traditional storage systems also use disk space inefficiently. The uneven utilization of capacity across islands of storage requires manual intervention to rebalance volumes across aggregates and to migrate data to an even level—work that increases operating expenses. The inefficient utilization also negatively affects capital expenditures because extra capacity must be set aside as storage overhead.

By way of contrast, OneFS evenly distributes data among a cluster's nodes, thereby maximizing storage efficiency. An Isilon cluster continuously balances data across all the constituent nodes, conserving space and eliminating much of the capacity overhead that traditional EDA storage systems require. Isilon’s efficient utilization of disk space and ease of use reduce both capital and operational expenses.

Traditional storage systems have multiple points of management. Each filer must be individually managed, and this management overhead increases the total cost of ownership and OpEx. The lack of centralized management also puts EDA companies at a strategic disadvantage because it undermines their ability to expand storage to adapt to ever-expanding datasets and fluctuating business needs, which can hamper efforts to reduce time to market. With its single volume, the Isilon architecture delivers a high return on investment by centralizing data management.

By scaling multi-dimensionally to handle the growth of EDA data, an Isilon cluster lets you adapt to fluid storage requirements, non-disruptively add capacity and performance in cost-effective increments, and improve run times for concurrent jobs. This paper describes the best practices for configuring and managing an EMC Isilon cluster in an electronic design automation environment.

EDA WORKFLOW CHARACTERISTICS
The workflows, workloads, and infrastructure for chip design—combined with exponential data growth and the time-to-market sensitivity of the industry—constitute a clear requirement to optimize the system that stores EDA data. Summarizing the workflow, infrastructure, and workload typical of EDA sets the stage for identifying problems that undermine the performance, efficiency, and elasticity of most EDA storage systems.

Most EDA workflows include the following phases:

• Frontend design phases (logical design)
  o Design Specification
  o Functional Verification
  o Synthesis
  o Logic Verification

• Backend design phases (physical design)
  o Place and Route
  o Static Timing Analysis
  o Physical Verification
  o Tape-out

These phases interact to form a digital design flow for EDA:


Figure 1. A simplified digital design flow for EDA.

During the frontend design phases, engineers architect a chip design by compiling source files into a chip model. Engineers then verify the chip design by scheduling and running jobs in a large compute grid. A scheduler distributes the build and simulation jobs to the available slots on the compute resources. The workflow is iterative as the engineers refine the design, and it is the frontend design phase that generates the majority of the simulation jobs. Efficiency in creating, scheduling, and executing build and simulation jobs can reduce the time it takes to bring a chip to market.

The frontend phase generates an input/output (I/O)-intensive workload when a large number of jobs run in parallel: EDA applications read and compile millions of small source files to build and simulate a chip design.

A storage system manages the various design projects and files so that different users, scripts, and applications can access the data. During the frontend verification stages, the data access pattern tends to be random, with a large number of small files. The frontend workload requires high levels of concurrency because of the large number of jobs that need to run in parallel, generating a random I/O pattern.

During the backend design and verification phases, the data access pattern becomes more sequential. The backend workload tends to have a smaller number of jobs with a sequential I/O pattern that run for a longer period of time.

The output of all the jobs involved in a chip's design phases can produce terabytes of data. Even though the output is often considered scratch space, the data still requires the highest tier of storage for performance.

Within the storage system, EDA workflows tend to store a large number of files in a single directory, typically per design phase, amid a deep directory structure on a large storage system. The performance-sensitive project directories, both scratch and non-scratch, dominate the file system.

The directories contain source code trees, frontend RTL files that define logic in Hardware Description Language (HDL), binary compiled files after synthesis against foundry libraries, and the output of functional verifications and other simulations. Typically, the frontend RTL project directories that contain source code are minimal in size, while the project directories used for simulations dominate the overall capacity utilization of the storage system.

For example, an EDA company might have the following type of hierarchy for chip design under a top-level directory named projects:

FUNCTION                        EXAMPLE DIRECTORY
Frontend Design Data            /projects/chip1_FE or /projects/chip1_RTL
Scratch Space for Simulations   /scratch/chip1_scratch, /scratch/chip1_verif
Physical Layout                 /projects/chip1_layout, /projects/chip1_pnr
Physical Verification           /projects/chip1_physical_verif

Figure 2. Typical Project Directory Layout for Chip Design Data


EDA WORKLOAD ANALYSIS
With an inadequate storage system, these kinds of directory structures can have implications for latency. If EDA engineers are running 500 jobs against the /scratch/chip1_verif directory while another user tries to interact with the directory, the user’s response time can be noticeably delayed.

The workflow for the projects requires backing up data and taking snapshots of the directories nightly, sometimes retaining more than a month's worth of snapshots of project directories. The storage system may also include home directories for users and the safe but cost-effective archiving of design blocks for reuse in future projects.

After the frontend phase produces a design that works logically, the physical design phase converts the logical design into an image, verifies it, and prepares it for delivery to a foundry, which manufactures the chip. During the physical design and verification stage, files increase in size and the data access pattern becomes more sequential. The result is a single top-level GDS-II file that can be tens of gigabytes.

The GDS file then undergoes a process of top-level verification to ensure that the design is clean. The last phase uses software to perform multiple iterations of a design rule check (DRC) and a layout versus schematic test (LVS). The DRC makes sure that the chip's layout conforms to the rules required for faultless fabrication, and the LVS guarantees that the layout represents the circuit that you want to fabricate. For nanometer technologies, a design for manufacturing (DFM) solution also helps increase the yield.

If an EDA company lacks a storage system that can handle the multiple iterations of DRC, LVS, and DFM checks on a large file while on a tight deadline, the company might not have enough time to perform due diligence on the layout, with the result being costly delays during fabrication or a decrease in the yield.

Given the large number of files and directories as well as the growing size of design files over time and the verification tests that they require, the storage system must make file management easy while providing seamless, high-performance access.

Infrastructure
A typical EDA infrastructure combines a large compute farm with storage servers, application servers, license servers, backup systems, infrastructure servers, and other components, all connected with an Ethernet network. EDA uses a large compute grid, or build farm, for high-performance computing, and the grid often consists of thousands of cores.

Compute farms are increasing in density. In the past, compute farms were non-commodity RISC-based UNIX servers with limited compute capacity, such as four CPUs in a 6U chassis—a system that was neither dense nor fast. In contrast, today's compute grid comprises hundreds of commodity Linux servers.

Storage demands are also growing exponentially. In the past, storage was not the bottleneck, but with denser commodity compute hardware and more cores—as many as 10,000 to 25,000 at large EDA shops—the compute grid has become dense and extensive, leaving storage as the bottleneck. For many EDA companies, the storage system still has a legacy dual-controller-head architecture, such as a traditional scale-up NAS system.

Workload
The clients and applications in the compute grid determine the characteristics of the storage system's workload:

Most EDA workloads access the storage system over Network File System (NFS). Across the EDA industry, I/O profiles vary by tool and design stage, from small random workloads to large sequential workloads. In general, the workload tends to be a mix of random and sequential input-output operations.

The frontend workflow requires high levels of concurrency rather than throughput. As hundreds of functional and logic verification jobs are run concurrently against a deep and wide directory structure, the workload induces tremendous amounts of metadata overhead, generating high CPU usage on the storage system. The data access pattern is random. In many cases, most file system operations retrieve attributes, perform lookups, or access other metadata. It is the metadata intensive nature of the workload that tends to saturate the storage subsystem, making the controller the bottleneck in a scale-up architecture.

In contrast to the frontend workflow, the backend design flow consists of fewer jobs, but it requires greater throughput because of the increase in file sizes. The data access pattern for physical verification is more sequential than for functional verification, and the workload exhibits increased read-write operations with a sequential data access pattern.

DATA LAYOUT RECOMMENDATIONS

OneFS Storage Efficiency
A typical EDA data set usually consists of about 90 percent or more small files and 10 percent or less large files stored in a file system comprising thousands of project directories. About 60 percent of the data is active; 40 percent is inactive. Snapshots usually back up the data for short-term retention, combined with a long-term DR strategy that typically includes replication to a secondary cluster and disk-to-disk or disk-to-tape NDMP backups.

In this document, large files are those 128KB or greater, and small files are those smaller than 128KB. This distinction is significant because at 128KB and above, OneFS uses erasure coding (FEC) to parity-protect a file, which results in high levels of storage efficiency. Conversely, files smaller than 128KB are mirrored, so they have a larger on-disk footprint.

For example, storing a 100TB EDA data set that comprises 90TB of large files and 10TB of small files on an Isilon cluster requires about 142.5TB of raw space—an overall efficiency of 70%—as illustrated in the following table:

Figure 3. OneFS storage efficiency table
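As a rough cross-check of the 142.5TB figure, the following Python sketch recomputes the raw-capacity requirement. The protection assumptions (3x mirroring for small files and an 8+2 FEC stripe for large files) are illustrative choices consistent with the 70 percent result, not values taken from the table itself.

# Rough raw-capacity estimate for a mixed EDA data set on OneFS.
# Assumptions (illustrative): small files (<128KB) are 3x mirrored;
# large files (>=128KB) use an 8+2 FEC stripe (20% parity overhead).
large_tb = 90.0    # logical TB of large files
small_tb = 10.0    # logical TB of small files
mirror_copies = 3  # assumed mirror count for small files
fec_data, fec_parity = 8, 2  # assumed FEC stripe geometry

raw_small = small_tb * mirror_copies                       # 30.0 TB
raw_large = large_tb * (fec_data + fec_parity) / fec_data  # 112.5 TB
raw_total = raw_small + raw_large                          # 142.5 TB
efficiency = (large_tb + small_tb) / raw_total             # ~0.70
print(f"raw: {raw_total:.1f} TB, efficiency: {efficiency:.0%}")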

In a production representative file count vs. capacity distribution scenario illustrated in the following charts, the overall level of storage efficiency is 79 percent:

Figure 4. Large files vs. small files by file count and by capacity.

Even though small files account for 96 percent of the total file count of this data set, they consume less than 7TB total, or 1 percent of the overall disk space.

Conversely, large files, which comprise just 4% of the total file count, occupy 99% of the overall space. This clearly demonstrates how large file efficiency via erasure coding offsets the penalty of mirroring of small files.

Note: The best practice for storage efficiency is to house datasets with a mix of file sizes and types on the cluster. This occurs naturally for EDA datasets.

OneFS also provides additional storage efficiency via its native, post-process deduplication engine, SmartDedupe. Consider running deduplication on any archive or secondary DR clusters. If system resources allow, deduplication can also be run during off-hours against other cold, inactive datasets on archive storage tiers or nodepools.

Directory Structure and Layout
In general, it is more efficient to create a deep directory hierarchy that consolidates files in balanced subdirectories than it is to spread files out over a shallow subdirectory structure. The recommendation is to limit the number of files in any one directory to one hundred thousand.


Storing large numbers of files in a directory may affect enumeration and performance, but whether performance is affected depends on workload, workflow, applications, tolerance for latency, and other factors. To better handle storing a large number of files in a directory, use nodes that contain solid state drives (SSDs).

As a best practice, the number of directories in a directory should not exceed one hundred thousand. Directory depth is limited to 509 directories, and is determined by a maximum path length of 1,023 characters. However, depths greater than 275 directories may affect system performance.
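To make these limits concrete, here is a minimal Python sketch that checks a proposed path against them. The function and thresholds simply restate the limits quoted above; it is an illustrative helper, not a OneFS utility.

import os

MAX_PATH_CHARS = 1023  # maximum path length
MAX_DEPTH = 509        # hard directory-depth limit
PERF_DEPTH = 275       # depths beyond this may affect performance

def check_path(path):
    """Return a list of warnings for a proposed directory path."""
    warnings = []
    if len(path) > MAX_PATH_CHARS:
        warnings.append(f"path exceeds {MAX_PATH_CHARS} characters")
    depth = len([p for p in path.split(os.sep) if p])
    if depth > MAX_DEPTH:
        warnings.append(f"depth {depth} exceeds hard limit {MAX_DEPTH}")
    elif depth > PERF_DEPTH:
        warnings.append(f"depth {depth} may affect performance")
    return warnings

print(check_path("/ifs/projects/chip1_FE/rtl"))  # [] -- well within limits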

The maximum number of hard links is 65,535 per cluster. However, setting the number of per-file hard links higher than 1,000 can slow down snapshot operations and file deletions. This per-file value can be configured via the efs.ifm.max_links sysctl.

The maximum number of open files is 315,000 per node, and is driven by the number of vnodes available on the node. The quantity of available vnodes depends on how much RAM the node contains. The maximum number of open files per node is 90% of the maximum number of vnodes on that node, as expressed in the following formula:

kern.maxfiles = kern.maxvnodes * .9

Note: The OneFS protocol daemons, such as the input-output daemon (lwio), may impose additional constraints on the number of files that a node can have open. The protocol daemons typically impose such constraints because the kernel places limits on per-process memory consumption.

4TB is currently the largest supported file size. OneFS dynamically allocates new inodes from free file system blocks. The maximum number of possible inodes therefore depends on the number and density of nodes in the cluster, as expressed by the following formulas:

• 4Kn drives: ((number of nodes in the cluster) * (node raw TB) * 1000^4 * 0.99) / (8192 * (number of inode mirrors))
• 512n drives: ((number of nodes in the cluster) * (node raw TB) * 1000^4 * 0.73) / (512 * (number of inode mirrors))
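A minimal Python sketch of these formulas; the node count, raw capacity per node, and inode mirror count below are made-up inputs for illustration:

def max_inodes(nodes, node_raw_tb, inode_mirrors, drive_format="4Kn"):
    """Upper bound on inode count, per the formulas above."""
    raw_bytes = nodes * node_raw_tb * 1000**4
    if drive_format == "4Kn":
        return int(raw_bytes * 0.99 / (8192 * inode_mirrors))
    if drive_format == "512n":
        return int(raw_bytes * 0.73 / (512 * inode_mirrors))
    raise ValueError("drive_format must be '4Kn' or '512n'")

# Hypothetical example: 10 nodes of 100TB raw each, 2 inode mirrors.
print(f"{max_inodes(10, 100, 2):,}")  # ~60.4 billion possible inodes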

OneFS Data Protection
An Isilon cluster eliminates much of the overhead that traditional EDA storage systems consume. Because it has no RAID groups, OneFS evenly distributes, or stripes, data among a cluster's nodes with layout algorithms that maximize storage efficiency and performance. The system continuously reallocates data across the cluster, further maximizing space efficiency. At the same time, OneFS protects data with forward error correction, or FEC—a highly efficient method of reliably protecting data.

The capacity overhead for the various levels of FEC protection with a growing node count are shown in the table below:

Figure 5. OneFS Protection overhead table

Rows give the number of nodes; each cell shows the FEC layout and its protection overhead. The first header row uses OneFS 7.2.0+ protection names; the second row gives the equivalent names for OneFS 7.1.1 and older.

Nodes  N+1n       N+2n       N+2d:1n    N+3n       N+3d:1n    N+3d:1n1d   N+4n       N+4d:1n     N+4d:2n      (OneFS 7.2.0+)
       N+1        N+2        N+2:1      N+3        N+3:1      n/a         N+4        N+4:1       N+4:2        (OneFS 7.1.1 & older)
3      2+1 (33%)  3x (67%)   4+2 (33%)  3x (67%)   6+3 (33%)  3+3 (50%)   3x (67%)   8+4 (33%)   3x (67%)
4      3+1 (25%)  2+2 (50%)  6+2 (25%)  4x (75%)   9+3 (25%)  5+3 (38%)   4x (75%)   12+4 (25%)  4+4 (50%)
5      4+1 (20%)  3+2 (40%)  8+2 (20%)  4x (75%)   12+3 (20%) 7+3 (30%)   5x (80%)   16+4 (20%)  6+4 (40%)
6      5+1 (17%)  4+2 (33%)  10+2 (17%) 3+3 (50%)  15+3 (17%) 9+3 (25%)   5x (80%)   16+4 (20%)  8+4 (33%)
7      6+1 (14%)  5+2 (29%)  12+2 (14%) 4+3 (43%)  16+3 (16%) 11+3 (21%)  5x (80%)   16+4 (20%)  10+4 (29%)
8      7+1 (13%)  6+2 (25%)  14+2 (13%) 5+3 (38%)  16+3 (16%) 13+3 (19%)  4+4 (50%)  16+4 (20%)  12+4 (25%)
9      8+1 (11%)  7+2 (22%)  16+2 (11%) 6+3 (33%)  16+3 (16%) 15+3 (17%)  5+4 (44%)  16+4 (20%)  14+4 (22%)
10     9+1 (10%)  8+2 (20%)  16+2 (11%) 7+3 (30%)  16+3 (16%) 15+3 (17%)  6+4 (40%)  16+4 (20%)  16+4 (20%)
11     10+1 (9%)  9+2 (18%)  16+2 (11%) 8+3 (27%)  16+3 (16%) 15+3 (17%)  7+4 (36%)  16+4 (20%)  16+4 (20%)
12     11+1 (8%)  10+2 (17%) 16+2 (11%) 9+3 (25%)  16+3 (16%) 15+3 (17%)  8+4 (33%)  16+4 (20%)  16+4 (20%)
13     12+1 (8%)  11+2 (15%) 16+2 (11%) 10+3 (23%) 16+3 (16%) 15+3 (17%)  9+4 (31%)  16+4 (20%)  16+4 (20%)
14     13+1 (7%)  12+2 (14%) 16+2 (11%) 11+3 (21%) 16+3 (16%) 15+3 (17%)  10+4 (29%) 16+4 (20%)  16+4 (20%)
15     14+1 (7%)  13+2 (13%) 16+2 (11%) 12+3 (20%) 16+3 (16%) 15+3 (17%)  11+4 (27%) 16+4 (20%)  16+4 (20%)
16     15+1 (6%)  14+2 (13%) 16+2 (11%) 13+3 (19%) 16+3 (16%) 15+3 (17%)  12+4 (25%) 16+4 (20%)  16+4 (20%)
17     16+1 (6%)  15+2 (12%) 16+2 (11%) 14+3 (18%) 16+3 (16%) 15+3 (17%)  13+4 (24%) 16+4 (20%)  16+4 (20%)
18     16+1 (6%)  16+2 (11%) 16+2 (11%) 15+3 (17%) 16+3 (16%) 15+3 (17%)  14+4 (22%) 16+4 (20%)  16+4 (20%)
19     16+1 (6%)  16+2 (11%) 16+2 (11%) 16+3 (16%) 16+3 (16%) 15+3 (17%)  15+4 (21%) 16+4 (20%)  16+4 (20%)
20     16+1 (6%)  16+2 (11%) 16+2 (11%) 16+3 (16%) 16+3 (16%) 15+3 (17%)  16+4 (20%) 16+4 (20%)  16+4 (20%)
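The overhead percentages in the table follow directly from the stripe geometry: for an N+M layout, the overhead is M / (N + M). A minimal sketch:

def fec_overhead(data_units, parity_units):
    """Protection overhead of an N+M FEC layout, as a fraction of raw space."""
    return parity_units / (data_units + parity_units)

# A few cells from the table above:
print(f"{fec_overhead(16, 2):.0%}")  # 16+2 -> 11%
print(f"{fec_overhead(8, 2):.0%}")   # 8+2  -> 20%
print(f"{fec_overhead(13, 3):.0%}")  # 13+3 -> 19%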


Node Hardware Recommendations
Another key decision for cluster performance in an EDA environment is the type of nodes deployed. For active data, HPC and scratch purposes, the recommendation is to utilize either S210 or X410 nodes with at least 128GB RAM, 10Gb Ethernet interfaces, and enough SSD capacity to accommodate the desired metadata mirrors. That said, there are tradeoffs to be made:

Isilon S-nodes, which are designed primarily for transactional workloads, provide the highest SSD to HDD ratios and CPU core to capacity ratios. The S-nodes also allow for lower rates of data protection, which translates to higher levels of storage efficiency due to the lower protection overhead.

However, since the S-nodes have a smaller drive count than the X410s, a considerably higher node quantity (i.e., a larger cluster) is required for a similar capacity footprint. For example, a 1PB node pool would require around 45 S210 nodes with 1.2TB hard drives, versus 19 X410 nodes with 2TB hard drives.

Note: The preferred practice for an EDA primary cluster is a maximum of thirty-two nodes per cluster. These can be deployed in a combination of S-nodes and X-nodes, depending on the hot and warm dataset ratios. A thirty-two-node maximum allows a pair of thirty-six-port unmanaged Infiniband switches to be used, rather than larger modular director-class IB switches. Also, by using at most thirty-two of the available Infiniband switch ports, one port is always free to cover any switch port hardware failure, and three spare ports are available for a new node pool, if necessary.

Isilon NL nodes are recommended for archive and cold data tiers, and also for secondary DR storage clusters.

The following table illustrates the node counts (plus storage capacities and percentage of usable space) for the protection level boundaries of S, X, and NL nodes.

Figure 6. Table of node counts versus protection level changes

For example, an X410 node pool with 2TB drives is protected against two drive failures or one node failure from three to six nodes. Because of the X410's large HDD quantity (36 drive bays), a higher level of protection is required once the seventh node is joined, moving to resilience against three concurrent drive failures or one node and one drive failure. By the time the eighteenth node is added, an even higher level of protection is needed to guard against three drive or three node failures.

On the other hand, an S210 cluster with 1.2TB drives is protected at two drive failures or one node failure up to seventeen nodes, and then requires protection against two drive or two node failures from eighteen nodes all the way upwards. This allows it to deliver a significantly higher level of storage efficiency in the eighty percent range.

Cluster Pool Size and Limits
For optimal cluster performance, Isilon recommends observing the following OneFS SmartPools best practices:

• Ensure that cluster capacity utilization (HDD and SSD) remains below 90% on each pool.
• If the cluster consists of both S and X nodes, direct the default file pool policy to ingest to the S node pool.
• A file pool policy can have three OR disjunctions, and each term joined by the ORs can contain at most five ANDs. For example: (A and B and C and D and E) or (F and G and H and I and J) or (K and L and M and N and O). (A checker sketch follows this list.)
• The number of file pool policies should not exceed thirty. More than 30 policies may affect system performance.
• Define a performance and protection profile for each tier and configure it accordingly.
• Enable SmartPools Virtual Hot Spare with a minimum of 10% space allocation. This ensures that space is available for data reconstruction and re-protection in the event of a drive or node failure, and generally helps guard against file-system-full issues.
• Avoid creating hard links to files that will cause the file to match different file pool policies.
• If node pools are combined into tiers, the file pool rules should target the tiers rather than specific node pools within the tiers.
• Avoid creating tiers that combine node pools both with and without SSDs.
• The number of SmartPools tiers should not exceed five. Although you can exceed this guideline, doing so is not recommended because it might affect system performance.
• Where possible, ensure that all nodes in a cluster have at least one SSD, including nearline and high-density nodes.
• Node pools with L3 cache enabled are effectively invisible for GNA purposes. All GNA ratio calculations are done exclusively for node pools without L3 cache enabled.
• Ensure that SSDs comprise a minimum of 2% of the total cluster usable capacity, spread across at least 20% of the nodes, before enabling GNA.
• For EDA workloads, metadata read/write acceleration is recommended. Metadata read acceleration helps with getattr, access, and lookup operations, while write acceleration helps reduce latencies on create, delete, setattr, and mkdir operations. Ensure that sufficient SSD capacity (6-10%) is available before turning on metadata read/write acceleration.
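To make the OR/AND limit concrete, the following sketch validates a file pool policy modeled as OR-joined groups of AND-joined terms. This is an illustrative checker only, not a OneFS API.

MAX_OR_GROUPS = 3       # at most three OR disjunctions
MAX_ANDS_PER_GROUP = 5  # at most five AND-joined terms per group

def policy_within_limits(or_groups):
    """True if the policy stays within the 3-OR / 5-AND guideline."""
    return (len(or_groups) <= MAX_OR_GROUPS and
            all(len(group) <= MAX_ANDS_PER_GROUP for group in or_groups))

policy = [["A", "B", "C", "D", "E"],
          ["F", "G", "H", "I", "J"],
          ["K", "L", "M", "N", "O"]]
print(policy_within_limits(policy))  # True -- at the maximum allowed size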

Small File Considerations
In practice, an Isilon cluster typically delivers between 70 and 85 percent space efficiency for an EDA dataset. Because an EDA dataset has an extremely wide range of file sizes, it is the large files that dominate utilization, saving as much as 20 to 30 percent of capacity over traditional storage systems. As illustrated above, even when small files make up more than 90 percent of an EDA dataset by file count, they consume only 10 percent or less of the capacity. As such, any inefficiencies in storing small files are overshadowed by the efficiencies in storing large files. And as an EDA data set increases in size, an Isilon cluster moves closer to 80 percent efficiency.

CLUSTER HARDWARE AND OS RECOMMENDATIONS

Recommended OneFS Release
Isilon recommends running OneFS 7.2.1 or later for EDA customers, particularly in environments with large clusters. Isilon defines a large cluster as greater than 40 nodes, since this is the boundary at which SmartPools divides a cluster into a second node pool.

The recommendation is to upgrade OneFS to a newer major release once a year at a minimum.

OneFS 7.2.1 benefits for EDA include:

• Dynamic NFS maxthreads allocation.
• Faster and less disruptive group change handling.
• Improved NFS LKF advisory lock failover management, persisting lock state information across node failures.
• Support for twenty thousand NFS exports (this limit has been raised to forty thousand in OneFS 8.0).

Disk and Node Firmware Recommendations
The recommended minimum OneFS firmware levels for EDA workloads are:

Disk Firmware: DSP 1.9 / DFP 1.12

Node Firmware: 9.3.2

SSD to Hard Drive Ratios
A 6% to 10% SSD (solid state drive) to HDD (hard disk drive) capacity ratio is recommended on any nodepool where metadata write acceleration is enabled. This capacity allows all the OneFS metadata mirrors for that nodepool to live on SSD, providing low-latency access for metadata read and write operations.

A 2% SSD to HDD capacity ratio is recommended on any nodepools configured for metadata read acceleration.

If a cluster contains one or more nodepools without SSD, Global Namespace Acceleration (GNA) is recommended for metadata read acceleration across the cluster. The hard requirements for enabling GNA are that at least 20% of the nodes in the entire cluster contain SSDs, and that SSDs comprise a minimum of 1.5% of the total cluster usable capacity.

Infiniband Backend Redundancy
An Infiniband fabric is utilized for the cluster’s private backend node interconnects, which can be thought of as a low-latency distributed systems bus. Using a redundant switch and configuring the second IB interface in each node on a separate IP subnet is vital for the cluster’s backend network resilience. Also, ensure that each Isilon cluster has its own dedicated IB switch pair, with no other devices connected.

Given the demanding levels of I/O in portions of the EDA workflow, ensuring that all nodes and IB switches support QDR (quad data rate) Infiniband is a strong recommendation. For example, this can be achieved by pairing X410 nodes with NL410 nodes in the same cluster, rather than mixing X410 nodes with previous-generation NL400 DDR (double data rate) Infiniband nodes.


Note: Nodes with QDR Infiniband support IB cable lengths up to 100 meters. However, DDR Infiniband nodes are limited to cable runs up to 10 meters.

Additional tuning steps are not usually required for the OneFS Infiniband stack, OpenSM subnet manager, or OEM Infiniband switches.

However, in certain cases with a mixed-node cluster, configuring a particular nodepool to be the OpenSM master can improve traffic flow across the Infiniband network.

Note: OpenSM master configuration should only be implemented at the recommendation of and under the guidance of Isilon Support.

DATA TIERING, LAYOUT AND CACHING RECOMMENDATIONS

Data Tiering
Logical and physical design and verification are complex stages in the electronic design automation (EDA) process for semiconductor production. This design data is critical, and time to market is paramount. Multiple engineers from different geographical sites will often be collaborating on the project, and all need timely access to the data.

Previous design data and other intellectual property is also heavily leveraged and reused in subsequent projects, so archive and retention requirements spanning decades are commonplace.

Figure 7. EDA Infrastructure

EDA workflows often employ an additional tier of storage, or ‘scratch space’, for the transient processing requirements of HPC compute farms. Typically, this tier has high transactional performance requirements and often employs SSD-based storage nodes. But since scratch data is temporary, its protection and retention requirements are very low.

Using SmartPools, this can be achieved with a multi-tier architecture that uses high-performance nodes with SSD for both the performance and scratch tiers and high-capacity SATA-only nodes for the archive tier. One file pool policy would restrict historical design data to the high-capacity tier, protecting it at a high level, while another would restrict current design data to the fastest tier at the default protection level.


The following screenshot shows the creation of an ‘archive’ file pool policy for historical design data, which moves files that have been unmodified for more than 30 days to a lower storage tier.

Figure 8. Creating a file pool policy

Data Access and On-disk Layout
At the file pool (or even the single file) level, data access settings can be configured to optimize data access for the type of application accessing it. Data can be optimized for Concurrent, Streaming, or Random access. Each of these settings changes how data is laid out on disk and how it is cached.

Figure 9. OneFS data access settings

As the settings indicate, the Random access setting performs little to no read-cache prefetching to avoid wasted disk access. This works best for workloads with only small files (< 128KB) and for large files with random small-block accesses.


Streaming access works best for sequentially read medium to large files. This access pattern uses aggressive prefetching to improve overall read throughput, and on disk layout spreads the file across a large number of disks to optimize access.

Concurrency access (the default setting for all file data) is the middle ground, with moderate prefetching. Concurrency is the preferred access setting for mixed EDA workloads, as the compute grid serves both EDA frontend and backend verification workloads against a given cluster.

Endurant Cache
The recommendation for EDA environments is to disable the endurant cache (EC), or stable write cache. The endurant cache can become backlogged by many of these workflows, and disabling EC typically increases performance for EDA jobs by around 5-10%. Running the following command from the OneFS CLI disables EC across all the nodes in a cluster:

# isi_sysctl_cluster efs.bam.ec.mode=0
Value set successfully

To verify that this configuration change is persistent, run:

# cat /etc/mcp/override/sysctl.conf | grep -i ec
efs.bam.ec.mode=0 #added by script

EC can also be disabled for a particular directory, if necessary:

# isi set -c [on|off|endurant_all|coal_only] <dirname>

To enable the write coalescer but not endurant cache:

# isi set -c coal_only <dirname>

Attribute Optimization of Files and Directories
The attributes of a particular directory or file can be viewed by running the following command, replacing ‘data’ in the example with the name of a directory or file. The command’s output below, which shows the properties of a directory named ‘data’, has been truncated to aid readability:

# isi get -D data
POLICY   W   LEVEL  PERFORMANCE  COAL  ENCODING  FILE  IADDRS
default  4x/2       concurrency  on    N/A       ./    <1,36,268734976:512>, <1,37,67406848:512>, <2,37,269256704:512>, <3,37,336369152:512> ct: 1459203780 rt: 0
*************************************************
* IFS inode: [ 1,36,268734976:512, 1,37,67406848:512, 2,37,269256704:512, 3,37,336369152:512 ]
*************************************************
* Inode Version:      6
* Dir Version:        2
* Inode Revision:     6
* Inode Mirror Count: 4
* Recovered Flag:     0
* Restripe State:     0
* Link Count:         3
* Size:               54
* Mode:               040777
* Flags:              0xe0
* Stubbed:            False
* Physical Blocks:    0
* LIN:                1:0000:0004
* Logical Size:       None
* Shadow refs:        0
* Do not dedupe:      0
* Last Modified:      1461091982.785802190
* Last Inode Change:  1461091982.785802190
* Create Time:        1459203780.720209076
* Rename Time:        0
* Write Caching:      Enabled
* Parent Lin:         2
* Parent Hash:        763857
* Snapshot IDs:       None
* Last Paint ID:      47
* Domain IDs:         None
* LIN needs repair:   False
* Manually Manage:
*     Access          False
*     Protection      True
* Protection Policy:  default
* Target Protection:  4x
* Disk pools:         policy any pool group ID -> data target x410_136tb_1.6tb-ssd_256gb:32(32), metadata target x410_136tb_1.6tb-ssd_256gb:32(32)
* SSD Strategy:       metadata-write
* SSD Status:         complete
* Layout drive count: 0
* Access pattern:     0
* Data Width Device List:
* Meta Width Device List:
*
* File Data (78 bytes):
* Metatree Depth: 1
* Dynamic Attributes (40 bytes):
    ATTRIBUTE                 OFFSET  SIZE
    New file attribute        0       23
    Isilon flags v2           23      3
    Disk pool policy ID       26      5
    Last snapshot paint time  31      9
*************************************************
* NEW FILE ATTRIBUTES
* Access attributes:  active
* Write Cache:  on
* Access Pattern:  concurrency
* At_r: 0
* Protection attributes:  active
* Protection Policy:  default
* Disk pools:  policy any pool group ID
* SSD Strategy:  metadata-write
*
*************************************************

Here is what some of these lines mean:

• isi get -D is the OneFS command to display the file system properties of a directory or file.
• The directory’s data access pattern is set to concurrency.
• Write caching (SmartCache) is turned on.
• The SSD strategy is set to metadata-write.

Files that are added to the directory are governed by these settings, most of which can be changed by applying a file pool policy to the directory.

OPTIMAL USAGE OF SSD SPACE FOR EDA WORKLOADS

SSD Strategies
In addition to traditional hard disk drives (HDDs), OneFS nodes can also contain a smaller quantity of flash-memory-based solid state drives (SSDs). There are a number of ways that SSDs can be utilized within a cluster.

OneFS SSD Strategies are configured on a per file pool basis. These strategies include:

• Metadata read acceleration: Creates a preferred mirror of file metadata on SSD and writes the rest of the metadata, plus all the actual file data, to HDDs.
• Metadata read & write acceleration: Creates all the mirrors of a file’s metadata on SSD. Actual file data goes to HDDs.
• Avoid SSDs: Never uses SSDs; writes all associated file data and metadata to HDDs only. This strategy is used when there is insufficient SSD storage and you wish to prioritize its utilization.
• Data on SSDs: All of a node pool’s data and metadata resides on SSD.

As mentioned previously, the recommendation for EDA is to enable the metadata write strategy on all primary storage node pools with sufficient SSD capacity.


Global Namespace Acceleration
The goal of Global Namespace Acceleration (GNA) is to help accelerate metadata read operations (filename lookup, access, etc.) by keeping a copy of the metadata for the entire cluster on high-performance, low-latency solid state media. To achieve this, an additional mirror of the metadata from storage pools that do not contain SSDs is stored on any SSDs available. As such, metadata read operations are accelerated even for data on node pools that have no SSDs.

To prevent capacity or performance oversubscription of a cluster’s SSD resources, there are several requirements that need to be satisfied in order to activate GNA on a cluster:

• At least 20% of the nodes in the entire cluster must contain SSDs.
• A recommended minimum of 2% of the total cluster usable capacity must be on SSD in order to enable GNA (in addition to any SSD requirements to enable metadata write acceleration for EDA datasets). This equates to approximately 200GB of SSD for every 10TB of HDD capacity in the cluster. (A checker sketch follows this list.)
  o The SSD capacity ratio is calculated as SSD / (SSD + HDD).
  o This is in addition to any other considerations for metadata write acceleration.
• GNA will be disabled at any SSD ratio of 1.5% or below.
• General SmartPools configuration requirements apply, including:
  o A minimum of three nodes per node pool
  o All nodes within a node pool must have the same configuration
• GNA is a cluster-wide feature, and SmartPools licensing is required in order to enable heterogeneous cluster tiering strategies.
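A minimal Python sketch of the GNA eligibility checks described above; the function name and inputs are illustrative, not a OneFS API:

def gna_eligible(nodes_with_ssd, total_nodes, ssd_tb, hdd_tb):
    """Check the two GNA activation requirements listed above."""
    node_ratio = nodes_with_ssd / total_nodes  # >= 20% of nodes need SSDs
    ssd_ratio = ssd_tb / (ssd_tb + hdd_tb)     # SSD/(SSD+HDD) capacity ratio
    return node_ratio >= 0.20 and ssd_ratio >= 0.02

# Example: 4 of 10 nodes carry SSD; 10TB SSD against 490TB HDD is 2.0%.
print(gna_eligible(4, 10, 10.0, 490.0))  # True
print(gna_eligible(4, 10, 7.0, 493.0))   # False -- 1.4% falls below the
                                         # 1.5% floor at which GNA disables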

Assuming the above requirements are met, GNA can be enabled via the OneFS SmartPools WebUI by navigating to File System Management -> SmartPools -> Settings.

Figure 10. OneFS Metadata Strategy SSD Requirements

The following SSD strategy decision tree explains the options available:


Figure 11. SSD usage decision tree

In all these cases, ensure that SSD capacity utilization remains below 90%.

If snapshots are enabled on a cluster, use the SSD strategy “Use SSDs for metadata read/write acceleration” to enable faster snapshot deletes. The SSD metadata write strategy requires 6-10% of a pool’s capacity on SSD to accommodate all the metadata mirrors.

NETWORK RECOMMENDATIONS

Connectivity Considerations
For EDA workflows, the recommendation is to configure at least one 10 Gigabit Ethernet connection to each node to support the high levels of network utilization that take place, particularly during the simulation and verification phases.

A best practice is to bind multiple IP addresses to each node interface in a SmartConnect subnet pool. Generally, optimal balancing and failover are achieved when the number of addresses allocated to the subnet pool equals N * (N – 1), where N equals the number of node interfaces in the pool. For example, if a pool is configured with a total of five node interfaces, the optimal IP address allocation would total 20 IP addresses (5 * (5 – 1) = 20), allocating four IP addresses to each node interface in the pool.

Note: For a largely scaled cluster, there is a practical number of IP addresses that is a good compromise between N * (N -1) approach and a single IP per node approach. Example: for a 35 node cluster, 34 IPs per node may not be necessary, depending on workflow.
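A minimal sketch of the N * (N – 1) sizing rule (the pool sizes below are illustrative):

def pool_ip_count(interfaces):
    """Optimal IP address count for a dynamic pool of N node interfaces."""
    return interfaces * (interfaces - 1)

for n in (5, 10, 35):
    total = pool_ip_count(n)
    print(f"{n} interfaces -> {total} IPs ({total // n} per interface)")
# 5 interfaces -> 20 IPs (4 per interface)
# 10 interfaces -> 90 IPs (9 per interface)
# 35 interfaces -> 1190 IPs (34 per interface); per the note above, a large
# cluster may not need the full allocation in practice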

Assigning each workload or data store to a unique IP address enables Isilon SmartConnect to move each workload to one of the other interfaces, minimizing the additional workload that a remaining node in the SmartConnect pool must absorb and ensuring that the workload is evenly distributed across all the other nodes in the pool.


Optimal Network Settings
Jumbo frames, where the maximum transmission unit (MTU) is set to 9000 bytes, yield slightly better throughput performance with slightly less CPU usage than standard frames, where the MTU is set to 1500 bytes. For 10Gb Ethernet connections, jumbo frames provide about 5 percent better throughput and about 1 percent less CPU usage.

Jumbo frames can be configured from the OneFS WebUI by navigating to Cluster Management -> External Network and clicking ‘Edit’ on the appropriate subnet(s):

Figure 12. Network configuration and Jumbo Frames

Subnet Segregation
OneFS provides the ability to optimize storage performance by designating zones to support specific workloads or subsets of clients. Isilon recommends segregating different network traffic types on separate subnets using SmartConnect pools.

For large clusters, partitioning the cluster’s networking resources and allocating bandwidth to each workload minimizes the likelihood that heavy traffic from one workload will affect network throughput for another. This is particularly true for SyncIQ replication and NDMP backup traffic, which benefit from their own set of interfaces, separate from user and client I/O load.

As a best practice, many EDA customers create separate SmartConnect subnets to segregate the following traffic types:

• HPC server farm
• User workstation access
• Replication
• NDMP backup on the target cluster
• Service subnet for cluster administration and management traffic
• SMB traffic

Connection-balancing and Failover Policies
By default, the OneFS SmartConnect module balances connections among nodes by using a round-robin policy and a separate IP pool for each subnet. A SmartConnect license adds advanced balancing policies to evenly distribute CPU usage, client connections, or throughput. It also lets you define IP address pools to support multiple DNS zones in a subnet. The licensed version supports IP failover to provide continuous access to data when hardware or a network path fails.

The licensed version of SmartConnect provides four policies for distributing traffic across the nodes in a network pool. Because you can set policies for each pool, static and dynamic SmartConnect address pools can coexist.


Figure 13. Example usage scenarios and recommended balancing options

Because of the demands that EDA places on CPU and other cluster resources, ‘round robin’ is the recommendation for both client connection balancing and IP failover. It allows incoming compute-client connections to be balanced naturally across the cluster over time.

These policies can be configured from the WebUI by navigating to Cluster Management -> External Network and clicking ‘Edit’ on the appropriate subnet pool(s) and setting ‘Client Connection Balancing Policy’ and ‘IP Failover Policy’ to ‘Round-robin’:

Figure 14. Load-balancing and IP failover configuration

Dynamic Failover
Dynamic failover is recommended for EDA workloads on all SmartConnect subnets that handle NFS traffic, and also for SMB traffic from any Windows 2K12 or Windows 10 clients that support the SMB3 protocol.

Dynamic NFS failover can be configured from the WebUI by navigating to Cluster Management -> External Network, clicking ‘Edit’ on the appropriate subnet pool(s), and setting ‘Allocation Method’ to ‘Dynamic’.

For optimal network performance, observe the following SmartConnect best practices:

• Do not mix interface types (1G & 10G) in the same SmartConnect pool.
• Minimize disruption by suspending nodes in preparation for planned maintenance and resuming them after maintenance is complete.
• If running OneFS 8.0, leverage the groupnet feature to enhance multi-tenancy and DNS delegation, where desirable.
• Ensure traffic flows through the right interface by tracing routes. Leverage the OneFS Source-Based Routing (SBR) feature to keep traffic on desired paths.
• Use the ‘round-robin’ SmartConnect client connection balancing and IP failover policies for EDA workloads.


Multiple IP Addresses Per Interface
This method consists of binding multiple IP addresses to each node interface in a SmartConnect pool. The ideal number of IP addresses per interface depends on the size of the pool. The following recommendations apply to dynamic pools only. Because static pools include no failover capabilities, a static pool requires only one IP address per interface.

Generally, you can achieve optimal balancing and failover when the number of IP addresses allocated to the pool equals N * (N – 1), where N equals the number of node interfaces in the pool.

If, for example, a pool contains five node interfaces, the optimal IP address allocation for the pool is 5 * (5 – 1) = 20 IP addresses, which allocates four IP addresses to each of the five node interfaces in the pool.

Consider a SmartConnect pool with four node interfaces that follows the N * (N – 1) model, giving three unique IP addresses to each node. A failure on one node interface results in each of that interface’s three IP addresses failing over to a different node in the pool, ensuring that each of the three active interfaces remaining in the pool receives one IP address from the failed node interface. If client connections to that node were evenly balanced across its three IP addresses, SmartConnect distributes the workloads to the remaining pool members evenly.

Assigning each workload a unique IP address allows SmartConnect to move a workload to another interface, minimizing the additional workload that a remaining node in the pool must absorb and ensuring that SmartConnect evenly distributes the workload across the surviving nodes in the pool.

If you cannot use multiple IP addresses per interface—because, for example, you hold a limited number of IP addresses or because you must ensure that an interface failure within the pool does not affect performance for other workloads using the same pool—you should consider using a hot standby interface per pool.

Use One Hot Standby Interface Per Pool
Another approach is to configure each SmartConnect dynamic pool with (N – 1) IP addresses, where N is again the number of nodes in the pool. With this approach, a node failure leads SmartConnect to fail over the node’s workload to the unused interface in the pool, minimizing the performance impact on the rest of the cluster. Although this approach protects workloads when a failure occurs, it requires leaving one or more node interfaces unused.

SmartConnect Pool Sizing

To evenly distribute connections and optimize performance, Isilon recommends sizing SmartConnect for the expected number of connections and for the anticipated overall throughput likely to be generated. The sizing factors for a pool include:

• The total number of active client connections expected to use the pool's bandwidth at any time.
• The expected aggregate throughput that the pool needs to deliver.
• The minimum performance and throughput requirements in case an interface fails.

Since OneFS is a single volume, fully distributed file system, a client can access all the files and associated metadata that are stored on the cluster, regardless of the type of node a client connects to or the node pool on which the data resides. For example, data stored for performance reasons on a pool of S-Series nodes can be mounted and accessed by connecting to an NL-Series node in the same cluster. The different types of Isilon nodes, however, deliver different levels of performance.

To avoid unnecessary network latency under most circumstances, the recommendation is to configure SmartConnect subnets such that client connections are to the same physical pool of nodes on which the data resides. In other words, if a workload’s data lives on a pool of S-Series nodes for performance reasons, the clients that work with that data should mount the cluster through a pool that includes the same S-Series nodes that host the data.

PROTOCOL RECOMMENDATIONS

NFS Considerations

NFSv3 is the ubiquitous protocol for EDA clients accessing storage. This is due to the maturity of the protocol version, ease of implementation, and wide availability of client and server stacks.

There are some important configuration settings to keep in mind when using Isilon with NFS clients in an EDA environment:

32-Bit File IDs

Many EDA tools still require the server to return a 32-bit file ID. Given this, configuring OneFS to explicitly return 32-bit file IDs is an imperative tuning step. To set this from the WebUI, navigate to:

Protocols -> UNIX Sharing (NFS) -> Export Settings -> Advanced Default Export Settings -> Client Compatibility Settings

And configure ‘Return 32 bit File IDs’ to ‘Use Custom’ with value ‘Yes’:


Figure 15. Configuring 32-bit file IDs

Client NFS Mount Settings

For NFSv3 and NFSv4, the maximum read and write sizes (rsize and wsize) are 1 MB. When you mount NFS exports from a cluster, a larger read and write size for remote procedure calls can improve throughput. The default read size in OneFS is 128 KB. An NFS client uses the largest size supported by both client and server by default; setting the value too small on a client overrides this and can undermine performance.

For EDA workloads, the recommendation is to avoid explicitly setting NFS rsize or wsize parameters on NFS clients when mounting Isilon NFS exports, either directly or via the automounter. Instead, for NFSv3 clients, use the following mount parameters, either directly or in the NIS automounter 'auto.master' map:

mount -vers=3,rw,tcp,hard,intr,retry=2,retrans=5,timeo=600
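As an illustration, the same options can be carried in an automounter master map entry so that every project mount inherits them; the map, zone, and path names below are hypothetical:

# Entry in auto.master (or the equivalent NIS map): keys under /proj resolve via auto.proj with the recommended options
/proj    auto.proj    -vers=3,rw,tcp,hard,intr,retry=2,retrans=5,timeo=600

# Entry in the auto.proj indirect map: 'chipA' resolves to a path behind a SmartConnect zone
chipA    isilon-zone.example.com:/ifs/site/projects/chipA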

With NFS clients that support READDIRPLUS, this call can improve performance by 'prefetching' file handles, attribute information, and directory entries, plus information that allows the client to request additional directory entries in a subsequent readdirplus transaction. This relieves the client from having to query the server separately for that information for each entry.

For an environment with a high file count, try setting the readdirplus prefetch to a value higher than the default value of 10. For a low file count environment, you can experiment with setting it lower than the default. In a workload that runs concurrent jobs, consider testing your changes until you find the value that works best for the environment.

More information about readdirplus can be found in the EMC knowledge base article number emc14001899 “Directory listing is slow with a large amount of files using Linux clients”.

Another recommendation for EDA is to use asynchronous (async) mounts from the client. Conversely, using sync as a client mount option makes all write operations synchronous, usually resulting in poor write performance. Sync mounts should be used only when a client program relies on synchronous writes without specifying them.

Optimal Thread Count

Prior to OneFS 7.2, the number of threads used by the NFS server running on the Isilon cluster could be manually adjusted to help enhance performance. This is especially useful in large, distributed compute environments with many NFS client connections.

To configure this, adjust the sysctl variables vfs.nfsrv.rpc.maxthreads and vfs.nfsrv.rpc.minthreads from the CLI of any node as follows:

# isi_sysctl_cluster vfs.nfsrv.rpc.maxthreads=96
# isi_sysctl_cluster vfs.nfsrv.rpc.minthreads=96

The number of threads available to the NFS server can be increased, but is dependent on the amount of available RAM. This is recommended when operating in a dedicated NFS environment, or when the primary load on the cluster is strongly NFS biased. On a heavily loaded cluster, the side effect of any increase in the number of NFS server threads can potentially be an increase in response times to new client requests.

Note: For OneFS 7.2 and beyond, the NFS server thread count is dynamically allocated and auto-tuning. As such, the maxthreads and minthreads sysctls above are no longer present.

NFS Connection Count

As a conservative best practice, active NFS connections should be kept under 1,000, where possible. Although no maximum limit for NFS connections has been established, the number of available TCP sockets can limit the number of NFS connections. The number of connections that a node can process depends on the ratio of active-to-idle connections as well as the resources available to process the sessions. Monitoring the number of NFS connections to each node helps prevent overloading a node with connections.
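Since OneFS is FreeBSD-based, a rough spot-check of the NFS connection count on a node can be taken by counting established TCP sessions on the NFS port. This is a hedged sketch; the isi statistics client command (shown later in this paper) provides richer per-client detail:

# netstat -an | grep "\.2049 " | grep -c ESTABLISHED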

NFSv3 vs. NFSv4

NFSv4 can provide some benefits over NFSv3 due to a slight performance advantage when working with large clustered environments accessing a common resource.

NFSv4 provides several new features as well as improvements on the NFSv3 architecture. When working with a large distributed data management infrastructure, NFSv4 provides the following major advantages over v3:

• A more 'stateful' implementation
• The ability to bundle metadata operations
• An integrated, more functional lock manager
• Conditional file delegation

NFSv4 now requires that all network traffic management (congestion, retransmits, timeouts) be handled by the underlying transport protocol, as opposed to the application layer as in v3. In high-volume, high-throughput workflows, this helps free up the client for additional application-specific work on the data.

NFSv4 also has the ability to bundle metadata operations using compound RPCs (Remote Procedure Calls), which reduces the overall number of metadata operations and significantly decreases the overhead required when accessing multiple files. This can be a significant factor for data management frameworks, which often require access to hundreds or even thousands of files to satisfy client requests.

An integrated lock manager provides lock leasing and lock timeouts – a considerable improvement over the previously used NLM in NFSv3, which only provided a limited implementation of these features out of band. This makes for cleaner recovery semantics and processes for failure handling.

File delegation is another new feature in NFSv4 in which the server provides a conditional ‘exclusive’ lock to the client for file operations.

Note: NFSv4 is still nascent in EDA environments. None of the major EDA application vendors (Synopsys, Cadence, and Mentor) officially endorse it, nor do they use it in-house yet.

Synchronous and Asynchronous Export and Mount Options

Configure NFS exports for asynchronous commit to maximize performance.
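From the CLI, asynchronous commit can be set per export. The following is a sketch assuming the OneFS 8.0 export syntax, with an illustrative path; verify the exact option name against your release:

# isi nfs exports create /ifs/site/projects --commit-asynchronous=yes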

DATA AVAILABILITY AND PROTECTION RECOMMENDATIONS

Availability and Recovery Objectives

At the core of every effective data protection strategy lies a solid business continuance plan. EDA shops need an explicitly defined and routinely tested plan to minimize the potential impact to the workflow when a failure occurs or in the event of a natural disaster.

Among the primary approaches to data protection are fault tolerance, redundancy, snapshots, replication (local and/or geographically separate), and backups to nearline storage, VTL, or tape.

Some of these methods are biased towards cost efficiency but have a higher risk associated with them, and others represent a higher cost but also offer an increased level of protection. Two ways to measure cost versus risk from a data protection point of view are:

• Recovery Time Objective (RTO): RTO is the allotted amount of time within a Service Level Agreement (SLA) to recover data. For example, an RTO of four hours means data must be restored and made available within four hours of an outage.

• Recovery Point Objective (RPO): RPO is the acceptable amount of data loss that can be tolerated per an SLA. With an RPO of 30 minutes, no more than 30 minutes may elapse since the last backup or snapshot was taken.

The availability and protection of data can be usefully illustrated in terms of a continuum:


Figure 16. Isilon Data Protection technology alignment with protection continuum

At the beginning of the continuum sits high availability. This requirement is usually satisfied by redundancy and fault tolerant designs. The goal here is continuous availability and the avoidance of downtime by the use of redundant components and services.

Further along the continuum lie the data recovery approaches in order of decreasing timeliness. These solutions typically include a form of point-in-time snapshots for fast recovery, followed by synchronous and asynchronous replication. Finally, backup to tape or a virtual tape library sits at the end of the continuum, providing insurance against large scale data loss, natural disasters and other catastrophic events.

EDA customers often use snapshots to back up the data for short-term retention and to satisfy low recovery objective SLAs. Replication of data from the primary cluster to a target DR cluster, ideally located at a geographically separate location, is strongly recommended. NDMP backup to tape or VTL (virtual tape library) typically satisfies longer term high recovery objective SLAs and any regulatory compliance requirements.

Snapshot Considerations

Snapshots always carry a trade-off between cluster resource consumption (CPU, memory, disk), the potential for data fragmentation, and the benefit of increased data availability, protection, and recovery.

SnapshotIQ creates snapshots at the directory level instead of the volume level, thereby providing improved granularity. There is no requirement for reserved snapshot space in OneFS; snapshots can use as much or as little of the available file system space as desired.

• Quotas can be used to calculate a file and directory count that includes snapshot revisions, provided the quota is configured to include snapshots in its accounting via the "--snaps=true" configuration option (see the example following this list).
• SnapshotDelete will only run if the cluster is in a fully available state, i.e., no drives or nodes are down.
• A snapshot schedule cannot span multiple days. To generate snapshots from 5:00 PM Monday to 5:00 AM Tuesday, create one schedule that generates snapshots from 5:00 PM to 11:59 PM on Monday, and another schedule that generates snapshots from 12:00 AM to 5:00 AM on Tuesday.
• If a directory is moved, you cannot revert any snapshots of that directory that were taken prior to its move.
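As a hedged example of the first point above, a directory quota that includes snapshot revisions in its accounting might be created as follows; the path is illustrative, and the option spelling should be verified against your OneFS release:

# isi quota quotas create /ifs/site/projects/chipA directory --snaps=true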

The following table provides a suggested snapshot schedule for both ordered and unordered deletion configurations:


Figure 17. Snapshot Schedule Recommendations

For optimal cluster performance, Isilon recommends observing the following SnapshotIQ best practices.

• Use an ordered snapshot deletion strategy where viable.
• Configure the cluster to take fewer snapshots, and for the snapshots to expire more quickly, so that less space is consumed by old snapshots. Take only as many snapshots as you need, and keep them active for only as long as you need them.
• Using SmartPools, snapshots can physically reside on a different disk tier than the original data. The recommendation, however, is to keep snapshots on the same tier on which they were taken.
• The default snapshot limit is 20,000 per cluster; the recommendation is to limit snapshot creation to 1,024 per directory.
• Avoid creating snapshots of directories that are already referenced by other snapshots.
• It is recommended that you do not create more than 1,000 hard links per file in a snapshot, to avoid performance degradation.
• Creating snapshots of directories higher on a directory tree will increase the amount of time it takes to modify the data referenced by the snapshot, and will require more cluster resources to manage the snapshot and the directory.
• Avoid taking snapshots at the /ifs level. For EDA, taking snapshots at the per-dataset level is recommended. Snapshots taken at the parent dataset directory level (for example, /ifs/site/projects) enable faster snapshot deletions and avoid management complexities.
• The recommendation is not to disable the snapshot delete job, since this prevents unused disk space from being freed and can also cause performance degradation.
• If you need to delete snapshots and there are down or smartfailed components, or the cluster is in an otherwise degraded state, contact Isilon Technical Support for assistance.
• If you intend to revert snapshots for a directory, it is recommended that you create SnapRevert domains for those directories while the directories are empty. Creating a domain for a directory that contains less data takes less time.
• Delete snapshots in order, beginning with the oldest. Where possible, avoid deleting snapshots from the middle of a time range.
• Configure the SSD strategy to "Use SSDs for metadata read/write acceleration" for faster snapshot deletes.
• Do not delete SyncIQ snapshots (snapshots with names that start with SIQ), unless the only remaining snapshots on the cluster are SyncIQ snapshots, and the only way to free up space is to delete those SyncIQ snapshots.

Replication Considerations

Isilon SyncIQ delivers high-performance, asynchronous replication of unstructured data to address a broad range of recovery point objectives (RPO) and recovery time objectives (RTO). This enables customers to make an optimal trade-off between infrastructure cost and potential for data loss if a disaster occurs. SyncIQ does not impose a hard limit on the size of a replicated file system, and so scales linearly with an organization's data growth up into the multiple-petabyte range.

SyncIQ is easily optimized for either LAN or WAN connectivity in order to replicate over short or long distances, thereby providing protection from both site-specific and regional disasters. Additionally, SyncIQ utilizes a highly parallel, policy-based replication architecture designed to leverage the performance and efficiency of clustered storage. As such, aggregate throughput scales with capacity and allows a consistent RPO over expanding datasets.

A secondary cluster synchronized with the primary production cluster can afford a substantially better RTO and RPO than tape backup, and both implementations have their distinct advantages. SyncIQ performance is easily tuned to optimize either for network bandwidth efficiency across a WAN or for LAN-speed synchronization. Synchronization policies may be configured at the file, directory, or entire file system level, and can either be scheduled to run at regular intervals or executed manually.

By default, a SyncIQ source cluster can run up to fifty concurrent replication jobs under OneFS 8.0. For versions prior to 8.0, this limit is five concurrent jobs. OneFS queues additional jobs until a job execution slot becomes available. You can cancel jobs that are in the queue.

SyncIQ policies also have a priority setting to allow favored policies to preempt others. In addition to chronological scheduling, replication policies can also be configured to start whenever the source is modified (change-based replication). If preferred, a delay period can be added to defer the start of a change-based policy.
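As a sketch, a scheduled synchronization policy might be created from the CLI as follows, assuming the OneFS 8.0 syntax; the policy name, paths, and target host are hypothetical:

# isi sync policies create chipA_dr sync /ifs/site/projects/chipA \
    dr-cluster.example.com /ifs/site/projects/chipA \
    --schedule "every day at 22:00"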

Bear in mind the following SyncIQ best practices:

• The recommended limit of running SyncIQ policies is 50 policies for a cluster with 4 or more nodes.
• While the maximum number of workers per node per policy is eight, the default and recommended number of workers per node is three.
• The recommended limit of workers per replication policy is 40.
• Ensure that the target cluster is running the same or a later version of OneFS as the source cluster.
• Establish reference network performance using common tools such as scp or an NFS copy from cluster to cluster. This provides a baseline for a single-threaded data transfer over the network.
• After creating a policy and before running the policy for the first time, use the policy assessment option to see how long it takes to scan the source cluster dataset with default settings.
• Increase workers per node in cases where network utilization is low. This can help overcome network latency by having more workers generate I/O on the wire. If adding more workers per node does not improve network utilization, avoid adding more workers because of diminishing returns and worker scheduling overhead.
• Increase workers per node in datasets with many small files to push more files in parallel. Be aware that as more workers are employed, more CPU is consumed.
• Use SyncIQ file throttling to roughly control how much CPU and disk I/O replication consumes while jobs are running.
• Consider using SmartConnect pools to constrain replication to a dedicated set of cluster network interfaces.
• Use SyncIQ network throttling to control how much network bandwidth SyncIQ can consume.

NDMP Backup Considerations

At the trailing end of the protection continuum lies traditional backup and restore, whether to tape or disk. With high RPOs and RTOs, this is the bastion of any data protection strategy and usually forms the crux of a 'data insurance policy'.

In environments using SyncIQ to replicate data from a primary source cluster to a DR target cluster, performing NDMP backups on the target cluster is the preferred practice. In this way, all the resources required to run the backups (CPU, memory, I/O, network) fall to the target, freeing the primary cluster to service client I/O. Use a Backup Accelerator with FC tape libraries where possible.

Direct NDMP

Direct NDMP (2-way NDMP) is the most efficient model and results in the fastest transfer rates. Here, the data management application (DMA) uses NDMP over the Ethernet front-end network to communicate with the Backup Accelerator. On instruction, the Backup Accelerator, which is also the NDMP tape server, begins backing up data to one or more tape devices which are attached to it via Fibre Channel.

The Backup Accelerator is an integral part of the Isilon cluster and communicates with the other nodes in the cluster via the internal Infiniband network. The DMA, a separate server, controls the tape library’s media management. File History, the information about files and directories, is transferred from the Backup Accelerator via NDMP to the DMA, where it is maintained in a catalog.

Direct NDMP is the fastest and most efficient model for backups with OneFS, and requires one or more Backup Accelerator nodes to be present within a cluster.


Figure 18. Recommended Two-way NDMP with Backup Accelerator

Remote NDMP

With remote, or 3-way, NDMP there is no Backup Accelerator present. In this case, the DMA uses NDMP over the LAN to instruct the cluster to start backing up data to the tape server, which is either connected via Ethernet or directly attached to the DMA host. In this model, the DMA also acts as the backup/media server.

During the backup, file history is transferred from the cluster via NDMP over the LAN to the backup server, where it is maintained in a catalog. In some cases, the backup application and the tape server software both reside on the same physical machine.

Figure 19. Remote Three-way NDMP Backup

For optimal performance, please consider the following best practices:

• The number of NDMP connections per node should not exceed 64.
• NDMP to tape or virtual tape should ideally be performed on the secondary DR cluster, freeing up resources on the primary cluster to satisfy client I/O.
• Where possible, use an Isilon Backup Accelerator node to connect to a Fibre Channel tape library or VTL. The ideal ratio is one Backup Accelerator per three nodes.
• If running 3-way NDMP on a primary cluster, constrain it to its own dedicated interfaces on a separate SmartConnect zone, ideally on a lower priority storage tier.
• Enable parallelism for the DMA if the DMA supports this option. This allows OneFS to back up data to multiple tape devices at the same time.
• Run a maximum of eight concurrent NDMP sessions per A100 Backup Accelerator node and four concurrent NDMP sessions per Isilon IQ Backup Accelerator node to obtain optimal throughput per session.
• NDMP backups result in very high Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs). You can reduce your RPO and RTO by attaching one or more Backup Accelerator nodes to the cluster and then running two-way NDMP backups.
• The throughput for an Isilon cluster during backup and recovery operations is dependent on the dataset and is considerably reduced for small files.
• If you are backing up large numbers of small files, set up a separate schedule for each directory or high-level directory.
• If you are performing NDMP three-way backups, run multiple NDMP sessions on multiple nodes in your Isilon cluster.
• Recover files through Direct Access Restore (DAR), especially if you recover files frequently. However, it is recommended that you do not use DAR to recover a full backup or a large number of files, as DAR is better suited to restoring smaller numbers of files.
• Recover files through Directory DAR (DDAR) if you recover large numbers of files frequently.
• If possible, do not include or exclude files from backup. Including or excluding files can affect backup performance, due to filtering overhead.

DATA MANAGEMENT RECOMMENDATIONS

Quota Considerations

The OneFS SmartQuotas module tracks disk usage with reports and enforces storage limits with alerts.

SmartQuotas best practices include:

• Avoid creating quotas on the root directory of the default OneFS share (/ifs).
• Governing a single directory with overlapping quotas can degrade performance.
• Directory quotas can also be used to alert on and constrain runaway jobs, preventing them from consuming massive amounts of storage space (see the example following this list).
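For instance, a hard-threshold directory quota can cap a scratch area so that a runaway job fails at the quota rather than filling the cluster. The path and limit below are illustrative, and the option names are a sketch to be verified against your OneFS release:

# isi quota quotas create /ifs/site/scratch/chipA directory --hard-threshold=10T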

Within OneFS, quota data is maintained in Quota Accounting Blocks (QABs). Each QAB contains a large number of Quota Accounting records, which need to be updated whenever a particular user adds or removes data from a filesystem on which quotas are enabled. If a large number of clients are accessing the filesystem simultaneously, these blocks can become highly contended and a potential bottleneck.

To address this, quota accounting has a mechanism to avoid hot spots on the nodes storing QABs. Quota Account Constituents (QACs) help parallelize the quota accounting by including additional QAB mirrors on other nodes.

The following sysctl increases the number of quota accounting constituents, which allows for better scalability and reduces latencies on create/delete flurries when quotas are used.

Using this parameter, the internally calculated QAC count for each quota is multiplied by the specified value. If a workflow experiences write performance issues, and it has many writes to files or directories governed by a single quota, then increasing the QAC ratio may improve write performance.

Change the sysctl efs.quota.reorganize.qac_ratio from its default value of 1 to 8:

# isi_sysctl_cluster efs.quota.reorganize.qac_ratio=8

To verify persistent change:

# cat /etc/mcp/override/sysctl.conf | grep qac_ratio
efs.quota.reorganize.qac_ratio=8 #added by script

PERMISSIONS, AUTH AND ACCESS CONTROL RECOMMENDATIONS

NIS and Access Zones Best Practices

A minimum of two NIS servers provides redundancy and helps avoid access control lookups becoming a bottleneck. For larger environments, scaling the number of NIS servers may be required.

The maximum number of supported NIS domains is 50.


Although you can specify multiple NIS domains in an access zone, NFS users benefit only from the NIS configuration defined in the system access zone.

As a best practice, the number of access zones should not exceed 50. The number of local users and groups per cluster should not exceed 25,000 for each. While possible, creating a larger number of local groups and/or users may affect system performance.

Group Owner Inheritance

Many EDA customers use group IDs to protect their various projects’ data. In order to configure group IDs to behave the way EDA prefers, the following ACL configuration change on the cluster is recommended.

To configure from the WebUI, navigate to Access -> ACL Policy Settings -> Group Owner Inheritance and select ‘Linux and Windows semantics – Inherit group owner from the creator’s primary group’:

Figure 20. Group owner inheritance configuration

This allows newly created child entities to inherit their group owner from the creator's effective GID, rather than always inheriting the parent directory's group owner.

JOB ENGINE CONSIDERATIONS

The default, global limit of three concurrent jobs does not include jobs for striping or marking; one job from each of those categories can also run concurrently.

When using metadata read or metadata write acceleration, always run a job with the *LIN suffix where possible. For example, favor the FlexProtectLIN job rather than the regular FlexProtect job.

Avoid data movement as much as possible during EDA daily operations. SmartPools data placement at this scale entails a costly overhead which contends with client I/O. With a mixed-node cluster where data tiering is required, the recommendation is to schedule the SmartPools job to run during off-hours (nights and/or weekends) when client activity is at its lowest.
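As a hedged example, the SmartPools job can be rescheduled for off-hours from the CLI; this assumes the OneFS 8.0 job engine syntax, so verify the exact form for your release:

# isi job types modify SmartPools --schedule "every Saturday at 23:00"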

For optimal cluster performance, Isilon recommends observing the following Job Engine best practices.

• Schedule jobs to run during the cluster's low-usage hours: overnight, weekends, etc.
• Where possible, use the default priority, impact, and scheduling settings for each job.
• To complement the four default impact profiles, create additional profiles such as "daytime_medium", "after_hours_medium", "weekend_medium", etc., to fit specific environment needs.
• Ensure the cluster, including any individual node pools, is less than 90% full, so performance is not impacted and there is always space to re-protect data in the event of drive failures. Also enable virtual hot spare (VHS) to reserve space in case you need to smartfail devices.
• If SmartPools is licensed, ensure that spillover is enabled (the default setting).
• Configure and pay attention to alerts. Set up event notification rules so that you will be notified when the cluster begins to reach capacity thresholds, etc. Make sure to enter a current email address to ensure you receive the notifications.
• It is recommended not to disable the snapshot delete job. In addition to preventing unused disk space from being freed, disabling the snapshot delete job can cause performance degradation.
• Delete snapshots in order, beginning with the oldest. Do not delete snapshots from the middle of a time range. Newer snapshots are mostly pointers to older snapshots, and they look larger than they really are.
• If you need to delete snapshots and there are down or smartfailed devices on the cluster, or the cluster is in an otherwise "degraded protection" state, contact Isilon Technical Support for assistance.
• Only run the FSAnalyze job if you are using InsightIQ and require filesystem analytics. FSAnalyze creates data for InsightIQ's file system analytics tools, providing details about data properties and space usage within /ifs. Unlike SmartQuotas, FSAnalyze only updates its views when the FSAnalyze job runs. Since FSAnalyze is a fairly low-priority job, it can sometimes be preempted by higher-priority jobs and therefore take a long time to gather all of the data.
• Schedule deduplication jobs to run every 10 days or so, depending on the size of the dataset.
• In a heterogeneous cluster, tune job priorities and impact policies to the level of the lowest performance tier.
• For OneFS versions prior to OneFS 7.1, pause any active or scheduled SnapshotDelete jobs before adding new nodes to the cluster, to allow MultiScan (default lower priority) to complete without interruption.


• Before running a major (non-rolling) OneFS upgrade, allow active jobs to complete where possible, and cancel any outstanding running jobs.
• Before running TreeDelete, ensure there are no quota policies set on any directories under the root level of the data slated for deletion. TreeDelete cannot delete a directory if a quota has been applied to it.
• If FlexProtect is running, allow it to finish completely before powering down any node(s) or the entire cluster. While shutting down the cluster during a restripe won't hurt anything directly, it does increase the risk of a second device failure before FlexProtect finishes re-protecting data.

CLUSTER MANAGEMENT RECOMMENDATIONS

There are three access methods for configuring and administering an Isilon cluster:

• OneFS command line interface (CLI), either via SSH or serial console
• OneFS web interface (WebUI)
• OneFS RESTful platform API (PAPI)

While the WebUI is the most intuitive, menu-driven, and simple-to-use cluster administration method, it is also the most limited in terms of scope. The CLI has a more comprehensive set of administrative commands than the WebUI, making it a popular choice for OneFS power users.

However, where possible, the recommendation is to use scripts to automate management of the cluster via the platform API. This also avoids challenges with the CLI and WebUI in parsing large numbers of configuration policies, for example, tens of thousands of NFS exports.
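For example, NFS exports can be enumerated over the platform API with any HTTP scripting tool. The sketch below uses curl with placeholder credentials and hostname; note that the numbered API version in the URI varies by OneFS release:

# curl -k -u admin:password "https://cluster.example.com:8080/platform/2/protocols/nfs/exports"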

PERFORMANCE TUNING

Understanding Consolidated EDA Workflow

From the storage point of view, an EDA workload can be categorized into three primary classes: reads, writes, and metadata operations.

An examination of the workflows of EMC Isilon customers in the EDA industry found that the workload is attribute-intensive because of the large number of files stored in a deep and wide directory structure. To serve I/O requests with the kind of performance that EDA tools require, the storage system should ideally hold the metadata for the working data set in cache (memory), or at least in the highest performance tier. The overall I/O profile for an EDA workload is typically metadata- and write-biased, and often has the following breakdown:

Figure 21. EDA workload categorization

As such, an EDA workload can be characterized as having a typical mix of 65 percent metadata, 20 percent writes, and 15 percent reads.

Measuring Cluster Performance

Before performing any tuning on an Isilon cluster or its NFS clients, you should analyze how the various EDA workloads in your environment interact with and consume storage resources. This is done by gathering statistics about your common file sizes and I/O operations, including CPU and memory load, network traffic utilization, and latency. To obtain key metrics and wall-clock timing data for delete, renew lease, create, remove, set userdata, get entry, and other file system operations, connect to a node via SSH and run the following command as root to enable the vopstat system control:

# sysctl efs.util.vopstats.record_timings=1

After enabling vopstats, you can view them by running the following command as root:

# sysctl efs.util.vopstats


Here is an example of the command’s output:

efs.util.vopstats.ifs_snap_set_userdata.initiated: 26
efs.util.vopstats.ifs_snap_set_userdata.fast_path: 0
efs.util.vopstats.ifs_snap_set_userdata.read_bytes: 0
efs.util.vopstats.ifs_snap_set_userdata.read_ops: 0
efs.util.vopstats.ifs_snap_set_userdata.raw_read_bytes: 0
efs.util.vopstats.ifs_snap_set_userdata.raw_read_ops: 0
efs.util.vopstats.ifs_snap_set_userdata.raw_write_bytes: 0
efs.util.vopstats.ifs_snap_set_userdata.raw_write_ops: 0
efs.util.vopstats.ifs_snap_set_userdata.timed: 0
efs.util.vopstats.ifs_snap_set_userdata.total_time: 0
efs.util.vopstats.ifs_snap_set_userdata.total_sqr_time: 0
efs.util.vopstats.ifs_snap_set_userdata.fast_path_timed: 0
efs.util.vopstats.ifs_snap_set_userdata.fast_path_total_time: 0
efs.util.vopstats.ifs_snap_set_userdata.fast_path_total_sqr_time: 0

The time data captures the number of operations that cross the OneFS clock tick, which is 10 milliseconds. Because of this granularity, the total_sqr_time counter provides no actionable information, independent of the number of events. To analyze the operations, use the total_time value instead. The following example shows only the total time records in the vopstats:

# sysctl efs.util.vopstats | grep -e "total_time: [^0]"
efs.util.vopstats.access_rights.total_time: 40000
efs.util.vopstats.lookup.total_time: 30001
efs.util.vopstats.unlocked_write_mbuf.total_time: 340006
efs.util.vopstats.unlocked_write_mbuf.fast_path_total_time: 340006
efs.util.vopstats.commit.total_time: 3940137
efs.util.vopstats.unlocked_getattr.total_time: 280006
efs.util.vopstats.unlocked_getattr.fast_path_total_time: 50001
efs.util.vopstats.inactive.total_time: 100004
efs.util.vopstats.islocked.total_time: 30001
efs.util.vopstats.lock1.total_time: 280005
efs.util.vopstats.unlocked_read_mbuf.total_time: 11720146
efs.util.vopstats.readdir.total_time: 20000
efs.util.vopstats.setattr.total_time: 220010
efs.util.vopstats.unlock.total_time: 20001
efs.util.vopstats.ifs_snap_delete_resume.timed: 77350
efs.util.vopstats.ifs_snap_delete_resume.total_time: 720014
efs.util.vopstats.ifs_snap_delete_resume.total_sqr_time: 7200280042

The following command provides NFSv3 statistics per protocol operation, client connections, and the file system:

# isi statistics pstat --protocol=nfs3

You can also run the isi statistics command with the client argument to view I/O information and timing data by client:

# isi statistics client

And you can run the isi statistics command with the system argument to view CPU utilization by protocol. For example:

# isi statistics system
Node  CPU    SMB  FTP  HTTP  iSCSI  NFS   HDFS  Total  NetIn  NetOut  DiskIn  DiskOut
LNN   %Used  B/s  B/s  B/s   B/s    B/s   B/s   B/s    B/s    B/s     B/s     B/s
All   18.8   41K  0.0  16K   26K    134M  0.0   135M   37M    101M    178M    289M

InsightIQ Implications

InsightIQ is an off-cluster trending and reporting tool that runs on Linux, either as a virtual machine or on dedicated x86 physical hardware. Version 3.0 and later of InsightIQ can monitor up to 8 clusters, with a maximum of 80 nodes for any one cluster and up to 150 nodes total across multiple clusters.

InsightIQ comprises two main areas of monitoring:

• Performance reporting
• File system reporting


The performance section focuses on monitoring and reporting current and historical network, protocol, and cluster resource utilization. This is incredibly useful data for troubleshooting and capacity planning complex storage infrastructures. It is also a very low overhead facility, since it consumes little in the way of cluster resources to sample and extract data points.

The file system reporting section provides detailed capacity, deduplication, and quota reporting, plus a file system analytics component. The latter provides a wealth of information about the data, including "top 1000 by…" file and directory reports covering data age, size, and so on. While this file system introspection can be very valuable, it comes at a performance price. In order to gather the statistics, the FSAnalyze job must run on the cluster, and this job can be resource-intensive, both in terms of disk I/O and CPU overhead. As such, the best practice for EDA workloads is to disable this job on the cluster and use InsightIQ for performance reporting only.

Other performance statistics are also available on the cluster, both from the OneFS command line interface (CLI) and in the web user interface (WebUI). For the various methods of obtaining statistics about the storage system, see the "OneFS Administration Guide" and the "OneFS Command Reference."

PERFORMANCE TUNING

System performance tuning is as much art as science. Key performance questions include: which types of operations consume the most CPU or memory, or result in the most access latency? Create and remove operations, in particular, can be CPU-heavy. But whether these operations are affecting performance depends on your workflow and working set, and can only be determined by diligent performance analysis.

While there are recommendations that we provide, the measure of success is demonstrable benefit in your environment via quantifiable, repeatable measuring and benchmarking.

Here are some recommendations and tunable parameters that can help with this process:

• Ensure that all nodes are equal in memory. This governs how many vnodes and LK locks are available.
• Adjust the listen queue length: raise the kern.ipc.somaxconn sysctl from its default of 128 to 512, and the NFS driver's TcpListenQueueLength from 32 to 512.

# isi_sysctl_cluster kern.ipc.somaxconn=512
# isi_gconfig registry.Services.lwio.Parameters.Drivers.nfs.TcpListenQueueLength=512
# isi services nfs disable
# isi services nfs enable

To verify the new queue length settings have been applied, run:

# netstat -aL | egrep -i "nfs|mount|lockd|statd" | grep "512"

These changes increase the number of connections allowed to a cluster’s nodes and can guard against resource starvation in environments with large numbers of client connections.

For example, if you have one thousand client connections to a particular node and that node goes down, the protocol listeners will be hit hard as clients attempt to connect to other nodes. If a cluster cannot accept connections quickly enough, some clients' connection attempts will succeed while others fail. In this case, increasing the listen queue length helps mitigate this behavior.

• Configure the metadata R/W on SSD policy on node pools containing enough SSD (6-10% of HDD ratio).
• The 6-10% SSD ratio is needed to adequately house all the metadata mirrors on a node pool that is configured for metadata read/write acceleration. The implications of running out of SSD capacity here are significant, since OneFS will be forced to spill metadata writes over to hard drives instead. This can lead to unpredictable performance when accessing metadata objects, since most will be on low-latency SSD while some will need to be read from and written to much higher-latency HDDs.

CLUSTER SIZING GUIDELINES

Data Lake vs. Pod Architecture

There are two main paths to pursue when designing an Isilon clustered storage architecture for an EDA environment:

• Data lake
• Pod architecture

The data lake approach, where one cluster supports all the workflows and performance profiles, initially has favorable efficiencies of scale and ease of administration. However, there are trade-offs as the cluster size increases.

The larger the cluster, the more components are involved, which means the likelihood of component failure increases. At some point, the infrastructure becomes complex enough that, more often than not, there is a drive smartfailing or a node in a degraded state. At this point, cluster group changes and drive rebuilds/data re-protects will occur frequently enough that they can start to become a workflow interruption.

There are no two ways about it: higher levels of data protection are required for large clusters, and this has an impact on capacity utilization.

The definition of a large cluster here is forty nodes or more. At forty nodes and above, SmartPools automatically partitions a node pool into two smaller twenty-node pools. Specifically, every node pool of equivalent hardware in a cluster is broken up into disk pools of:

• Ideally 20 nodes
• No more than 6 drives per node

The ideal practice for demanding EDA workloads is to create manual node pools limited to 20 nodes when you plan on having 40 or more nodes of the same type in a large cluster. This allows the level of data protection to be lowered (for example, from +3n to +2n), which affords significant savings in capacity overhead, I/O, and OPS, as discussed earlier.
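Manual node pools are defined by assigning specific nodes by logical node number. The following is a sketch assuming the OneFS 8.0 isi storagepool syntax, with hypothetical pool names; verify the exact flags for your release:

# isi storagepool nodepools create s_pool_a --lnns 1-20
# isi storagepool nodepools create s_pool_b --lnns 21-40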

CASE STUDIES

Customer A

A large, fabless design house in California first began deploying Isilon over five years ago. The idea was to supplement an existing scale-up NAS infrastructure with scale-out architecture for a mixed set of business and technical reasons. The business justification was to maintain a split strategy between vendors, and the technical justification was that the scale-up architecture was no longer sufficient for performance and scalability reasons. From a technical angle, the compute grid had become denser, cheaper, and faster, and the bottleneck had shifted towards the storage. Engineers running a large number of jobs in parallel against a given scratch or project directory often found themselves I/O constrained in the traditional scale-up architecture. Specifically, the challenge was saturation of the NAS controller head responsible for serving protocol operations to a large number of concurrent clients. To resolve this bottleneck, the company examined over a dozen tier-one NAS solutions. After a cursory evaluation via vendor-run benchmarks, high-level feature set assessment, and generally defined soft criteria such as global presence, they brought the top three solutions in-house for a POC. After a thorough evaluation of all three solutions, they ultimately decided to qualify Isilon as the next-generation EDA research and development storage platform.

The evaluation process consisted of running read-write throughput tests with traditional throughput assessment tools. It was concluded that throughput-based assessment alone was not sufficient, as the gains seen in these tests were not translating into similar gains for EDA production workload test cases. Furthermore, running select EDA production test cases revealed that most had a huge variance in the types and mix of operations, and most of the EDA-specific tests did not reveal large differences in runtime results. A home-grown benchmark was therefore created which generated a large set of files and directories representative of an EDA-like directory structure. The benchmark then exercised a random set of operations generating an I/O profile similar to a mix of multiple EDA workloads running concurrently, as is typically the case in EDA production. Isilon demonstrated increased scalability with this benchmark: a greater number of concurrent threads could be employed to carry out a given task and set of operations against a single parent directory without saturating the NAS, whereas the scale-up architecture would show a cliff and a drop-off after a certain level of concurrency was induced. Additionally, the benchmark showed that latency contributed to the results. While SATA vs. SAS did not provide a noticeable difference in throughput tests, for the home-grown benchmark with increased concurrency, lower latencies resulted in improved runtimes.

The roll-out was done in a manner that would seed net-new organic growth of chip design data requirements on Isilon: the idea was to host new projects and scratch areas on Isilon, and a business strategy was put in place to grow the footprint over time while allowing the solution time to bake. The customer's original plan was to grow the cluster well beyond a hundred nodes. During the POC, however, the customer recognized that it would be practical to design multiple failure domains around the solution to further increase resiliency. A technical foundation was already in place to combine a federated namespace across multiple NAS entities using the automounter and present it as a single virtual namespace on the client side. The customer decided to continue leveraging that approach while establishing best practices around an optimally sized Isilon cluster, one that would allow a certain consolidation ratio compared to the existing scale-up architecture while still providing a sufficient amount of resiliency. In this manner, they run a split NAS strategy, spreading their project and scratch datasets across a small number of storage pods.

The company was subsequently acquired by another company that had established Isilon as its NAS standard for all datasets, and the acquiring entity enforced its Isilon standards on the acquired company. As a result, all EDA datasets, including home, tools, and IP, are being transitioned to Isilon. Architecturally, home directories will be spread across multiple clusters, ensuring that any single cluster event impacts only the subset of users constrained to that cluster rather than the entire user community. Tools and IP will be centralized on a single cluster with a mirror copy on another cluster, as both of those datasets require minimal capacity but a great amount of resiliency. Since the tools and IP datasets are primarily read-only, multiple copies are further 'round-robined' for access via the automounter to increase resiliency per entry. Despite the pod architecture, a complete refresh at large centers will result in a 10x consolidation ratio compared to the previous number of NAS end-points.

Customer B

A large IP vendor based in the UK has been an Isilon customer for many years, and was one of the first in the EDA industry to embrace Isilon, as early as the OneFS 5.x release. When they first looked at the Isilon solution, some of the verification flows demonstrated increased throughput, allowing the runtimes to shrink for that category of jobs. They were attracted to the large single namespace and ultimately grew the cluster to over eighty nodes. Testing showed that even though synthetic tools demonstrated similar performance between S and X nodes, for EDA workloads the S-nodes with SAS drives exhibited lower latencies in production under stress. As such, the large cluster consists of seventy-plus S-nodes and fewer than ten X-nodes. All EDA production workloads run against the S-nodes, and the X-nodes are used only as a down-tier archive platform for projects that need to stay online for reference but are no longer active. Furthermore, to ensure further resiliency for the large set of S-nodes, multiple node pools of only five nodes each were created to minimize failure domains.

Since EDA datasets typically consist of 95+% small files, a shortened stripe width with only five nodes per node pool was adequate, and allowed them to maintain the default data protection per node pool. To date, they continue this practice on newly deployed clusters. This has allowed them to avoid going to higher data protection levels on larger node pools, which would have required an additional metadata mirror plus an additional copy of data blocks for small files. From their experience with one large cluster of eighty-plus nodes, it was inferred that further failure domain creation via a pod-like architecture would help minimize impact.

As such, the next-generation cluster deployment architecture standardized on S210 and X410 nodes confined into a pod with a maximum of forty nodes. Initial deployment of a pod could start with fewer than twenty S210 nodes, while the down-tiered data would reside on X410 nodes with a node count starting at as few as four. The pods aim to take maximum advantage of the S-platform to ensure low-latency throughput for EDA jobs, while the X-nodes serve as a platform for down-tiered, aged, and inactive data.

EDA Performance Best Practices Checklist

For optimal cluster performance, Isilon recommends observing the following EDA best practices. Please note that this information may be covered elsewhere in this paper.

• Run OneFS 7.2.1 or later, with the following recommended cluster firmware versions: Disk Firmware DSP 1.9 / DFP 1.12; Node Firmware 9.3.2.
• Add the cluster to an InsightIQ monitoring instance.
• Disable the FSAnalyze (FSA) job unless filesystem reporting is explicitly required.
• Set up SmartConnect for load balancing of EDA jobs and use round-robin as the balancing policy.
• Configure the metadata write on SSD strategy on node pools containing enough SSD (6-10% of HDD ratio).
• Disable the SmartPools job on a single-tier system. For a mixed node cluster, ensure the SmartPools job only runs during off-hours.
• If you are using group IDs to protect project data, configure group owner inheritance to 'Inherit group owner from the creator's primary group'.
• Configure NFS exports for asynchronous commit to maximize performance.
• Configure NFS client compatibility settings to 'Return 32 bit File IDs'.
• Disable the Endurant Cache:
  # isi_sysctl_cluster efs.bam.ec.mode=0
• If you have licensed and configured SmartQuotas, increase the quota accounting block scaling ratio:
  # isi_sysctl_cluster efs.quota.reorganize.qac_ratio=8
• If using SmartPools, reconfigure the Storage Target field from "anywhere" to a specific tier or node pool (e.g., "s210_13tb_1.6tb-ssd_128gb") to direct ingest to the most performant tier.
• If you are using SnapshotIQ with OneFS 7.1.1 or earlier, configure the SSD strategy to "Use SSDs for metadata read/write acceleration" for faster snapshot deletes.
• Do not explicitly configure rsize/wsize on clients' NFS mounts. Instead, use the following mount options in your auto.master, per map or per entry within a map:
  mount -vers=3,rw,tcp,hard,intr,retry=2,retrans=5,timeo=600
• Create manual node pools limited to 20 nodes when you plan on having 40 or more nodes of the same type in a large cluster. This allows data protection to be lowered, which affords significant savings in capacity overhead, I/O, and operations per second (OPS).
• Ensure that cluster capacity utilization (HDD and SSD) remains below 90%.

SUMMARY

EMC Isilon overcomes the problems that undermine traditional NAS systems by combining the three traditional layers of storage architecture (file system, volume manager, and data protection) into a scale-out NAS cluster with a distributed file system. For electronic design automation, the Isilon scale-out architecture eliminates controller bottlenecks to reduce wall-clock runtimes for concurrent jobs, accelerates metadata operations, improves storage efficiency to lower capital expenditures, centralizes management to reduce operating expenses, and delivers strategic advantages to reduce a chip's time to market.