Performance and Reliability Issues – Network, Storage & Services
Shawn McKee / University of Michigan
OSG All-hands Meeting, March 8th 2010, FNAL
Outline
I want to present a mix of topics related to performance and reliability for our sites. This is not composed of "the answers" but rather a set of what I consider important topics and examples, followed by discussion.
I will cover Network, Storage and Services:
◦ Configuration
◦ Tuning
◦ Monitoring
◦ Management
General Goals for our Sites
Goal: Build a Robust Infrastructure
◦ Consider physical and logical topologies
◦ Provide alternate paths when feasible
◦ Tune, test, monitor and manage
Meta-Goal: Protect Services while Maintaining Performance
◦ Services should be configured in such a way that they "fail gracefully" rather than crashing. Potentially many ways to do this.
◦ Tune, test, monitor and manage (as always)
Common Problems
Power issues
Site (mis)configurations
Service failures
◦ Load related, bugs, configuration, updates
Hardware failures
◦ Disks, memory, CPU, etc.
Cooling failures
Network failures
Robust solutions are needed to minimize these impacts.
Site Infrastructures
There are a number of areas to examine where we can add robustness (usually at the cost of $ or complexity!):
◦ Networking: physical and logical connectivity
◦ Storage: physical and logical connectivity; filesystems, OS, software, services
◦ Servers and Services: grid and VO software and middleware
Example Site-to-Site Diagram
Power Issues
Power issues are frequently the cause of service loss in our infrastructure.
Redundant power supplies connected to independent circuits can minimize loss due to circuit or supply failure (verify that one circuit can support the required load!).
UPS systems can bridge brown-outs or short-duration losses and protect equipment from power fluctuations.
Generators can provide longer-term bridging.
Robust Network Connectivity
Redundant network connectivity can help provide robust networking.
◦ WAN resiliency is part of almost all WAN providers' infrastructure.
◦ Sites need to determine how best to provide both LAN and connector-level resiliency.
Basically, allow multiple paths for network traffic to flow in case of switch/router failure, cabling mishaps, NIC failure, etc.
Virtual Circuits in LHC (WAN)
ESnet and Internet2 have helped the LHC sites in the US set up end-to-end circuits.
USATLAS has persistent circuits from BNL to 4 of the 5 Tier-2s.
◦ The circuits are guaranteed 1 Gbps but may overflow to utilize the available bandwidth.
This simplifies traffic management and is transparent to the sites.
Future possibilities for dynamic management...
Failover is back to default routing.
LAN Options to Consider
Utilize equipment of reasonable quality. Managed switches are typically more robust, as well as configurable, and they support monitoring.
Within your LAN, have redundant switches with paths managed by spanning-tree to increase uptime.
Anticipate likely failure modes...
At the host level you can utilize multiple NICs (bonding).
Example: Network Bonding
You can configure multiple network interfaces on a host to cooperate as a single virtual interface via "bonding".
Linux allows multiple "modes" for the bonding configuration (see the next page, followed by an example configuration).
There are trade-offs based upon resiliency vs. performance, as well as those related to hardware capabilities and topology.
NIC Bonding Modes
Mode 0 – balance-rr (round-robin): the only mode allowing a single flow to balance over more than one NIC, BUT it reorders packets. Requires 'etherchannel' or 'trunking' on the switch.
Mode 1 – active-backup: allows connecting to different switches at different speeds. No throughput benefit, but redundant.
Mode 2 – balance-xor: selects the NIC per destination based upon an XOR of MAC addresses. Needs 'etherchannel' or 'trunk'.
Mode 3 – broadcast: transmits on all slaves. Needs distinct nets.
Mode 4 – 802.3ad: active-active; specific flows select a NIC based upon the chosen algorithm. Needs switch support for 802.3ad.
Mode 5 – balance-tlb: adaptive transmit load balancing. Output is balanced based upon current slave loads. No special switch support required; the NIC must support 'ethtool'.
Mode 6 – balance-alb: adaptive load balancing. Similar to mode 5 but also allows receive balancing via ARP manipulation.
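As a concrete sketch (not from the slides) of what mode 4 looks like on a RHEL5-style system; the interface names and IP address are illustrative:

  # /etc/modprobe.conf -- load the bonding driver for bond0
  # mode=4 selects 802.3ad; miimon=100 checks link state every 100 ms
  alias bond0 bonding
  options bond0 mode=4 miimon=100 lacp_rate=fast

  # /etc/sysconfig/network-scripts/ifcfg-bond0 (address is illustrative)
  DEVICE=bond0
  IPADDR=192.168.10.5
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

Remember that for 802.3ad the corresponding switch ports must also be configured as an LACP aggregation group.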
Network Tuning (1/2)
Typical "default" OS tunings for networking are not optimal for WAN data transmission.
Depending upon the OS, you can find particular tuning advice at: http://fasterdata.es.net/TCP-tuning/background.html
Buffers are the primary tuning target: buffer size = bandwidth * RTT (a worked example follows).
Good news: most OSes support autotuning now => no need to set default buffer sizes.
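To make the buffer formula concrete (illustrative numbers): a 1 Gbps path with a 50 ms round-trip time needs roughly 10^9 bits/s * 0.05 s = 5 * 10^7 bits, or about 6.25 MB of buffer, far above historical Linux defaults of a few hundred KB.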
Network Tuning (2/2)
To get maximal throughput it is critical to use optimal TCP buffer sizes.
◦ If the buffers are too small, the TCP congestion window will never fully open up.
◦ If the receiver buffers are too large, TCP flow control breaks; the sender can overrun the receiver, which will cause the TCP window to shut down. This is likely to happen if the sending host is faster than the receiving host.
Linux TCP Tuning (1/2)
Like all operating systems, the default maximum Linux TCP buffer sizes are way too small.

  # increase TCP max buffer size settable using setsockopt()
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  # increase Linux autotuning TCP buffer limits
  # min, default, and max number of bytes to use
  # set max to at least 4MB, higher if you use very high BDP paths
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216

You should also verify that the following are all set to the default value of 1:
  sysctl net.ipv4.tcp_window_scaling
  sysctl net.ipv4.tcp_timestamps
  sysctl net.ipv4.tcp_sack
Of course, TEST after changes. SACK may need to be off for large BDP paths (> 16MB) or timeouts may result.
Linux TCP Tuning (2/2)
Tuning can be more complex for 10GE.
You can explore different congestion control algorithms: BIC, CUBIC, HTCP, etc.
A large MTU can improve throughput.
There are a couple of additional sysctl settings for 2.6 kernels (a sketch of applying them persistently follows):

  # don't cache ssthresh from previous connection
  net.ipv4.tcp_no_metrics_save = 1
  net.ipv4.tcp_moderate_rcvbuf = 1
  # recommended to increase this for 1000 BT or higher
  net.core.netdev_max_backlog = 2500   # for 10 GigE, use 30000
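One common way (a minimal sketch) to make all of the sysctl settings above persistent is to append them to /etc/sysctl.conf and reload:

  sysctl -p                    # reload /etc/sysctl.conf
  sysctl net.ipv4.tcp_rmem     # spot-check that a value took effect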
Storage Connectivity
Increase robustness for storage by providing resiliency at various levels:
◦ Network: bonding (e.g. 802.3ad)
◦ RAID/SCSI redundant cabling, multipathing (hardware specific)
◦ iSCSI (with redundant connections)
◦ Single-host resiliency: redundant power, mirrored memory, RAID OS disks, multipath controllers
◦ Clustered/failover storage servers
◦ Multiple copies, multiple write locations
Example: Redundant Cabling Using Dell MD1000s
New firmware for Dell RAID controllers supports redundant cabling of MD1000s.
Each MD1000 can have two EMMs, each capable of accessing all disks.
A PERC 6/E has two SAS channels.
You can now cable each channel to an EMM on a shelf; the connection shows one logical link (similar to a "bond" in networking).
Up to 3 MD1000s can be daisy-chained.
Redundant Path With Static Load Balancing Support
From the Dell documentation:
"The PERC 6/E adapter can detect and use redundant paths to drives contained in enclosures. This provides the ability to connect two SAS cables between a controller and an enclosure for path redundancy. The controller is able to tolerate the failure of a cable or Enclosure Management Module (EMM) by utilizing the remaining path.
When redundant paths exist, the controller automatically balances I/O load through both paths to each disk drive. This load balancing feature increases throughput to each drive and is automatically turned on when redundant paths are detected. To set up your hardware to support redundant paths, see Setting up Redundant Path Support on the PERC 6/E Adapter.
NOTE: This support for redundant paths refers to path-redundancy only and not to controller-redundancy."
http://support.dell.com/support/edocs/storage/RAID/PERC6/en/UG/HTML/chapterd.htm#wp1068896
Storage Tuning
Have good hardware underneath the storage system!
Pick an underlying filesystem that performs well. XFS is a common choice which supports a large number of directory entries and online defragmentation.
The following settings require the target to be mounted:
◦ Set "readahead" to improve read speed (4096-16384):  blockdev --setra 10240 $dev
◦ Set up request queuing (allows optimizing):  echo 512 > /sys/block/${sd}/queue/nr_requests
◦ Pick an I/O scheduler suitable for your task:  echo deadline > /sys/block/${sd}/queue/scheduler
There are often hardware-specific tunings possible. Remember to test with your expected workload to see if changes help (a sketch of applying these at boot follows).
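One way (a sketch, with hypothetical device names) to make these settings survive reboots is a small /etc/rc.local fragment looping over the data disks:

  #!/bin/bash
  # reapply storage tunings at boot; the device list is illustrative
  for sd in sdb sdc sdd; do
      blockdev --setra 10240 /dev/$sd                   # readahead
      echo 512      > /sys/block/$sd/queue/nr_requests  # request queue depth
      echo deadline > /sys/block/$sd/queue/scheduler    # I/O scheduler
  done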
Robust Grid Services?
Just a topic I wanted to mention: I would like to be able to configure virtual grid services (using multiple hosts, heartbeat, LVS, etc.) to create a robust infrastructure.
Primary targets:
◦ Gatekeepers, job schedulers, GUMS servers, LFC, software servers, dCache admin servers
◦ A possible solution for NFS servers via heartbeat, LVS... others? (See the sketch after this list.)
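For flavor, a minimal LVS sketch (not from the slides; the addresses and the GridFTP-style port are illustrative) that fronts two real servers with a single virtual IP:

  # on the LVS director: create a virtual service with round-robin scheduling
  ipvsadm -A -t 192.168.1.100:2811 -s rr
  # attach two real servers behind the virtual IP (NAT forwarding)
  ipvsadm -a -t 192.168.1.100:2811 -r 192.168.1.11 -m
  ipvsadm -a -t 192.168.1.100:2811 -r 192.168.1.12 -m
  ipvsadm -L -n    # list the resulting configuration

In practice, heartbeat (or a similar tool) would manage the virtual IP and fail the director role over between hosts.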
Virtualization of Service Nodes
Our current grid infrastructure for ATLAS requires a number of services.
Virtualization technologies can be used to provide some of these services.
Depending upon the virtualization system, this can help with:
◦ Backing up critical services
◦ Increasing availability
◦ Easing management
Example: VMware
At AGLT2 we have VMware Enterprise running:
◦ LFC, 3 Squid servers, the OSG gatekeeper, ROCKS headnodes (dev/prod), 2 of 3 Kerberos/AFS/NIS nodes, a central syslog-ng host, the muon splitter, and 2 of 5 AFS file servers
"HA" can ensure services run even if a server fails. Backup is easy as well.
We can "live-migrate" VMs between 3 servers, or migrate VM storage to an alternate back-end storage server.
Example: AGLT2 VMware
(Not shown are the 10GE connections, one per server.)
Example: Details for UMVM02
Backups
"You do have backups, right?..." Scary question, huh?!
Backups provide a form of resiliency against various hardware failures and unintentional acts of stupidity.
They could be anything from a full tape-based backup service to various cron scripts saving needed config info.
Not always easy to get right... test!
System Tuning
Lots of topics could be put here, but I will just mention a few items.
You can install 'ktune' (yum install ktune). It provides some tunings for large-memory systems running disk- and network-intensive applications (a sketch of enabling it follows).
See the related storage/network tunings.
Memory is a likely bottleneck in many cases... have lots!
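On a RHEL5-era system, enabling ktune is just (a minimal sketch):

  yum install ktune      # install the tuning profile package
  chkconfig ktune on     # apply at boot
  service ktune start    # apply the tunings now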
Cluster Monitoring...
This is a huge topic. In general you can't find problems if you don't know about them, and you can't effectively manage systems if you can't monitor them.
I will list a few monitoring programs that I have found useful.
There are many options in this area that I won't cover: Nagios is a prime example, being used very successfully.
Ganglia
Ganglia is a cluster monitoring program available from http://ganglia.sourceforge.net/ and also distributed as part of ROCKS.
It allows a quick view of CPU and memory use cluster-wide.
You can drill down into host-specific details.
It is easy to extend to monitor additional data or aggregate sites (see the sketch below).
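Extending Ganglia is often just a cron-driven call to gmetric; a minimal sketch (the metric name and the probe are illustrative stand-ins):

  #!/bin/bash
  # publish a custom metric into Ganglia; here a simple process count
  NPROC=$(ps ax | grep -c '[d]cache')   # stand-in for a real probe
  gmetric --name dcache_processes --value $NPROC --type uint32 --units count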
Example Ganglia Interface
Cacti Monitoring
Cacti (see http://www.cacti.net/) is a network graphing package using SNMP and RRDtool to record data.
It can be extended with plugins (threshold, monitoring, MAC lookup).
Example Cacti Graphs
Shown: inbound and outbound AGLT2 10GE bytes/sec, aggregate 'ntpd' offset (ms), space-token stats (put/get), Postgres DB stats, and NFS client statistics.
Custom Monitoring
Philippe Laurens (MSU) has developed a summary page for AGLT2 which quickly shows cluster status.
Automated Monitoring/Recovery
Some types of problems can be easily "fixed" if we can just identify them.
The 'monit' software ('yum install monit') provides an easy way to test various system/software components and attempt to remediate problems.
Configure a file per item to watch/test.
It is very configurable, and it can fix problems at 3 AM! Some examples follow:
Monit Example for MySQL
This describes the relevant MySQL info for this host:

  # mysqld monitoring
  check process mysqld with pidfile /var/lib/mysql/dq2.aglt2.org.pid
    group database
    start program = "/etc/init.d/mysql start"
    stop program = "/etc/init.d/mysql stop"
    if failed host 127.0.0.1 port 3306 protocol mysql 3 cycles then restart
    if failed host 127.0.0.1 port 3306 protocol mysql 3 cycles then alert
    if failed unixsocket /var/lib/mysql/mysql.sock protocol mysql 4 cycles then alert
    if 5 restarts within 10 cycles then timeout

Restarting and alerting are triggered based upon these tests.
This resides in /etc/monit.d as mysqld.conf.
Other Monitoring/Management
Lots of sites utilize "simple" scripts run via "cron" (or equivalent) that:
◦ Perform regular maintenance
◦ Check for "known" problems
◦ Back up data or configurations
◦ Extract monitoring data
◦ Remediate commonly occurring failures
These can be very helpful for increasing reliability and performance (a sketch follows).
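A flavor of such a cron script (the threshold and the e-mail address are entirely illustrative):

  #!/bin/bash
  # run from cron (e.g. */30 * * * *): warn when any filesystem is nearly full
  THRESH=90   # percent full
  df -P | awk -v t=$THRESH 'NR>1 { sub("%","",$5); if ($5+0 > t) print $6, $5"%" }' |
  while read fs pct; do
      echo "$(hostname): $fs is at $pct" | mail -s "disk space warning" admin@example.org
  done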
Security Considerations
Security is a whole separate topic... not appropriate to cover it here.
The general issue is that unless security is also addressed, your otherwise high-performing, robust infrastructure may have large downtimes while you try to contain and repair system compromises!
Good security practices are part of building robust infrastructures.
Configuration Management
Not directly related to performance or reliability, but very important.
Common tools:
◦ Code management, versioning (Subversion, CVS)
◦ Provisioning and configuration management (ROCKS, Kickstart, Puppet, Cfengine)
All important for figuring out what was changed and what is currently configured.
Regular Storage "Maintenance"
Start with the bits on disk: run 'smartd' to look for impending failures (see the sketch below).
Use "patrol reads" or background consistency checks to find bad sectors.
Run filesystem checks when things are "suspicious" (xfs_repair, fsck...).
Run higher-level consistency checks (like Charles' ccc.py script) to ensure various views of your storage are consistent.
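A minimal smartd configuration sketch for /etc/smartd.conf (the device names and e-mail address are illustrative):

  # monitor all SMART attributes, e-mail on trouble,
  # and run a short self-test nightly at 02:00
  /dev/sda -a -m admin@example.org -s (S/../.././02)
  /dev/sdb -a -m admin@example.org -s (S/../.././02)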
High Level Storage Consistency
Being run at MWT2 and AGLT2.
Allows finding consistency problems and "dark" data.
dCache Monitoring/Management
AGLT2 has monitoring/management specific to dCache (as an example).
Other storage solutions may have similar types of monitoring.
We have developed some custom pages in addition to the standard dCache services web interface.
These track usage and consistency.
We also have a series of scripts running in 'cron' doing routine maintenance/checks.
dCache Allocation and Use

dCache Consistency Page
WAN Network Monitoring
Within the Throughput group we have been working on network monitoring as complementary to throughput testing.
Two measurement/monitoring areas:
◦ perfSONAR at Tier-1/Tier-2 sites: "network"-specific testing
◦ Automated transfer testing: "end-to-end" using standard ATLAS tools
◦ We may add a "transaction test" next (TBD)
Network Monitoring: perfSONAR
As you are by now well aware, there is a broad-scale effort to standardize network monitoring under the perfSONAR framework.
Since the network is so fundamental to our work, we targeted implementation of a perfSONAR instance at all our primary facilities. We have ~20 sites running.
It has already proven very useful in USATLAS!
perfSONAR Examples: USATLAS
perfSONAR in USATLAS
The typical Tier-1/Tier-2 installation provides two systems (using the same KOI hardware at each site): a latency node and a bandwidth node.
Automated recurring tests are configured for both latency and bandwidth between all Tier-1/Tier-2 sites ("mesh" testing).
We are acquiring a baseline and history of network performance between sites.
On-demand testing is also available.
Production System Testing
While perfSONAR is becoming the tool of choice for monitoring network behavior between sites, we also need to track the "end-to-end" behavior of our complex, distributed systems.
We are utilizing regularly scheduled automated testing, sending specific data between sites to verify proper operation.
This is critical for problem isolation; comparing network and application results can pinpoint problem locations.
Automated Data Transfer Tests
As part of the USATLAS Throughput work, Hiro has developed an automated data transfer system which utilizes the standard ATLAS DDM system.
This allows us to monitor the throughput of the system on a regular basis.
It transfers a set of files once per day from the Tier-1 to each Tier-2, for two different destinations.
Recently it was extended to allow arbitrary source/destination pairs (including Tier-3s).
http://www.usatlas.bnl.gov/dq2/throughput
Web Interface to Throughput Test

Throughput Test Graph #1
Throughput Test Graph #2
Throughput Test Graph #3
Future Throughput Work
With the recent release of an updated perfSONAR, we are in a position to acquire a useful baseline of network performance between our sites.
A number of potential network issues requiring some debugging are starting to appear.
As we acquire data, both from perfSONAR and from throughput testing, we need to start developing higher-level diagnostics and alerting systems. (How best to integrate with "Operations"?)
General Considerations (1/2)
Lots of things can impact both reliability and performance.
At the hardware level:
◦ Check for driver updates
◦ Examine firmware/BIOS versions (newer isn't always better, BTW)
Software versions... are there fixes for known problems?
Test changes: do they do what you thought? What else did they break?
General Considerations (2/2)
Sometimes the additional complexity added for "resiliency" actually decreases availability compared to doing nothing!
Having test equipment to experiment with is critical for trying new options.
Often you need to trade off cost vs. performance vs. reliability (pick 2).
Documentation, issue tracking and version control systems are your friends!
Summary
There are many components and complex interactions possible in our sites.
We need to understand our options (frequently site-specific) to help create robust, high-performing infrastructures.
Monitoring is central to delivering both reliability and performance.
Reminder: test changes to make sure they actually do what you want (and not something you don't want!).
Questions?
Backup Slides