Performance and Reliability Issues – Network, Storage & Services
Shawn McKee / University of Michigan
OSG All-hands Meeting, March 8th 2010, FNAL
Outline
I want to present a mix of topics related to performance and reliability for our sites. This is not composed of "the answers" but rather a set of what I consider important topics and examples, followed by discussion.
I will cover Network, Storage and Services:
◦ Configuration
◦ Tuning
◦ Monitoring
◦ Management
General Goals for our Sites
Goal: Build a Robust Infrastructure
◦ Consider physical and logical topologies
◦ Provide alternate paths when feasible
◦ Tune, test, monitor and manage
Meta-Goal: Protect Services while Maintaining Performance
◦ Services should be configured in such a way that they "fail gracefully" rather than crashing. Potentially many ways to do this.
◦ Tune, test, monitor and manage (as always)
Common Problems
Power issues
Site (mis)configurations
Service failures
◦ Load related, bugs, configuration, updates
Hardware failures
◦ Disks, memory, CPU, etc.
Cooling failures
Network failures
Robust solutions are needed to minimize these impacts.
Site Infrastructures
There are a number of areas to examine where we can add robustness (usually at the cost of $ or complexity!):
◦ Networking: physical and logical connectivity
◦ Storage: physical and logical connectivity; filesystems, OS, software, services
◦ Servers and Services: grid and VO software and middleware
Example Site-to-Site Diagram
Power Issues
Power issues are frequently the cause of service loss in our infrastructure.
Redundant power supplies connected to independent circuits can minimize loss due to circuit or supply failure (verify that one circuit can support the required load!).
UPS systems can bridge brown-outs or short-duration losses and protect equipment from power fluctuations.
Generators can provide longer-term bridging.
Robust Network Connectivity
Redundant network connectivity can help provide robust networking.
◦ WAN resiliency is part of almost all WAN providers' infrastructure.
◦ Sites need to determine how best to provide both LAN and connector-level resiliency.
Basically, allow multiple paths for network traffic to flow in case of switch/router failure, cabling mishaps, NIC failure, etc.
Virtual Circuits in LHC (WAN)
ESnet and Internet2 have helped the LHC sites in the US set up end-to-end circuits.
USATLAS has persistent circuits from BNL to 4 of the 5 Tier-2s.
◦ The circuits are guaranteed 1 Gbps but may overflow to utilize the available bandwidth.
This simplifies traffic management and is transparent to the sites.
Future possibilities for dynamic management...
Failover is back to default routing.
LAN Options to Consider
Utilize equipment of reasonable quality. Managed switches are typically more robust, as well as configurable, and they support monitoring.
Within your LAN, have redundant switches with paths managed by spanning-tree to increase uptime.
Anticipate likely failure modes...
At the host level you can utilize multiple NICs (bonding).
Example: Network Bonding
You can configure multiple network interfaces on a host to cooperate as a single virtual interface via "bonding".
Linux allows multiple "modes" for the bonding configuration (see the next page, followed by an example configuration).
There are trade-offs based upon resiliency vs. performance, as well as those related to hardware capabilities and topology.
NIC Bonding Modes
Mode 0 – balance-rr (round-robin): the only mode allowing a single flow to balance over more than one NIC, BUT it reorders packets. Requires 'etherchannel' or 'trunking' on the switch.
Mode 1 – active-backup: allows connecting to different switches at different speeds. No throughput benefit, but redundant.
Mode 2 – balance-xor: selects the NIC per destination based upon an XOR of MAC addresses. Needs 'etherchannel' or 'trunk'.
Mode 3 – broadcast: transmits on all slaves. Needs distinct nets.
Mode 4 – 802.3ad: active-active; specific flows select a NIC based upon the chosen algorithm. Needs switch support for 802.3ad.
Mode 5 – balance-tlb: adaptive transmit load balancing. Output is balanced based upon current slave loads. No special switch support required; the NIC must support 'ethtool'.
Mode 6 – balance-alb: adaptive load balancing. Similar to mode 5 but also allows receive balancing via ARP manipulation.
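As a concrete sketch (not from the slides) of what mode 4 looks like on a RHEL5-style system; the interface names and IP address are illustrative:

  # /etc/modprobe.conf -- load the bonding driver for bond0
  # mode=4 selects 802.3ad; miimon=100 checks link state every 100 ms
  alias bond0 bonding
  options bond0 mode=4 miimon=100 lacp_rate=fast

  # /etc/sysconfig/network-scripts/ifcfg-bond0 (address is illustrative)
  DEVICE=bond0
  IPADDR=192.168.10.5
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

Remember that for 802.3ad the corresponding switch ports must also be configured as an LACP aggregation group.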
Network Tuning (1/2)
Typical "default" OS tunings for networking are not optimal for WAN data transmission.
Depending upon the OS, you can find particular tuning advice at: http://fasterdata.es.net/TCP-tuning/background.html
Buffers are the primary tuning target: buffer size = bandwidth * RTT (a worked example follows).
Good news: most OSes support autotuning now => no need to set default buffer sizes.
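To make the buffer formula concrete (illustrative numbers): a 1 Gbps path with a 50 ms round-trip time needs roughly 10^9 bits/s * 0.05 s = 5 * 10^7 bits, or about 6.25 MB of buffer, far above historical Linux defaults of a few hundred KB.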
Network Tuning (2/2)
To get maximal throughput it is critical to use optimal TCP buffer sizes.
◦ If the buffers are too small, the TCP congestion window will never fully open up.
◦ If the receiver buffers are too large, TCP flow control breaks; the sender can overrun the receiver, which will cause the TCP window to shut down. This is likely to happen if the sending host is faster than the receiving host.
Linux TCP Tuning (1/2)
Like all operating systems, the default maximum Linux TCP buffer sizes are way too small.

  # increase TCP max buffer size settable using setsockopt()
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  # increase Linux autotuning TCP buffer limits
  # min, default, and max number of bytes to use
  # set max to at least 4MB, higher if you use very high BDP paths
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216

You should also verify that the following are all set to the default value of 1:
  sysctl net.ipv4.tcp_window_scaling
  sysctl net.ipv4.tcp_timestamps
  sysctl net.ipv4.tcp_sack
Of course, TEST after changes. SACK may need to be off for large BDP paths (> 16MB) or timeouts may result.
Linux TCP Tuning (2/2)
Tuning can be more complex for 10GE.
You can explore different congestion control algorithms: BIC, CUBIC, HTCP, etc.
A large MTU can improve throughput.
There are a couple of additional sysctl settings for 2.6 kernels (a sketch of applying them persistently follows):

  # don't cache ssthresh from previous connection
  net.ipv4.tcp_no_metrics_save = 1
  net.ipv4.tcp_moderate_rcvbuf = 1
  # recommended to increase this for 1000 BT or higher
  net.core.netdev_max_backlog = 2500   # for 10 GigE, use 30000
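One common way (a minimal sketch) to make all of the sysctl settings above persistent is to append them to /etc/sysctl.conf and reload:

  sysctl -p                    # reload /etc/sysctl.conf
  sysctl net.ipv4.tcp_rmem     # spot-check that a value took effect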
Storage Connectivity
Increase robustness for storage by providing resiliency at various levels:
◦ Network: bonding (e.g. 802.3ad)
◦ RAID/SCSI redundant cabling, multipathing (hardware specific)
◦ iSCSI (with redundant connections)
◦ Single-host resiliency: redundant power, mirrored memory, RAID OS disks, multipath controllers
◦ Clustered/failover storage servers
◦ Multiple copies, multiple write locations
Example: Redundant Cabling Using Dell MD1000s
New firmware for Dell RAID controllers supports redundant cabling of MD1000s.
Each MD1000 can have two EMMs, each capable of accessing all disks.
A PERC 6/E has two SAS channels.
You can now cable each channel to an EMM on a shelf; the connection shows one logical link (similar to a "bond" in networking).
Up to 3 MD1000s can be daisy-chained.
Redundant Path With Static Load Balancing Support
From the Dell documentation:
"The PERC 6/E adapter can detect and use redundant paths to drives contained in enclosures. This provides the ability to connect two SAS cables between a controller and an enclosure for path redundancy. The controller is able to tolerate the failure of a cable or Enclosure Management Module (EMM) by utilizing the remaining path.
When redundant paths exist, the controller automatically balances I/O load through both paths to each disk drive. This load balancing feature increases throughput to each drive and is automatically turned on when redundant paths are detected. To set up your hardware to support redundant paths, see Setting up Redundant Path Support on the PERC 6/E Adapter.
NOTE: This support for redundant paths refers to path-redundancy only and not to controller-redundancy."
http://support.dell.com/support/edocs/storage/RAID/PERC6/en/UG/HTML/chapterd.htm#wp1068896
Storage Tuning
Have good hardware underneath the storage system!
Pick an underlying filesystem that performs well. XFS is a common choice which supports a large number of directory entries and online defragmentation.
The following settings require the target to be mounted:
◦ Set "readahead" to improve read speed (4096-16384):  blockdev --setra 10240 $dev
◦ Set up request queuing (allows optimizing):  echo 512 > /sys/block/${sd}/queue/nr_requests
◦ Pick an I/O scheduler suitable for your task:  echo deadline > /sys/block/${sd}/queue/scheduler
There are often hardware-specific tunings possible. Remember to test with your expected workload to see if changes help (a sketch of applying these at boot follows).
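One way (a sketch, with hypothetical device names) to make these settings survive reboots is a small /etc/rc.local fragment looping over the data disks:

  #!/bin/bash
  # reapply storage tunings at boot; the device list is illustrative
  for sd in sdb sdc sdd; do
      blockdev --setra 10240 /dev/$sd                   # readahead
      echo 512      > /sys/block/$sd/queue/nr_requests  # request queue depth
      echo deadline > /sys/block/$sd/queue/scheduler    # I/O scheduler
  done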
Robust Grid Services?
Just a topic I wanted to mention: I would like to be able to configure virtual grid services (using multiple hosts, heartbeat, LVS, etc.) to create a robust infrastructure.
Primary targets:
◦ Gatekeepers, job schedulers, GUMS servers, LFC, software servers, dCache admin servers
◦ A possible solution for NFS servers via heartbeat, LVS... others? (See the sketch after this list.)
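For flavor, a minimal LVS sketch (not from the slides; the addresses and the GridFTP-style port are illustrative) that fronts two real servers with a single virtual IP:

  # on the LVS director: create a virtual service with round-robin scheduling
  ipvsadm -A -t 192.168.1.100:2811 -s rr
  # attach two real servers behind the virtual IP (NAT forwarding)
  ipvsadm -a -t 192.168.1.100:2811 -r 192.168.1.11 -m
  ipvsadm -a -t 192.168.1.100:2811 -r 192.168.1.12 -m
  ipvsadm -L -n    # list the resulting configuration

In practice, heartbeat (or a similar tool) would manage the virtual IP and fail the director role over between hosts.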
Virtualization of Service Nodes
Our current grid infrastructure for ATLAS requires a number of services.
Virtualization technologies can be used to provide some of these services.
Depending upon the virtualization system, this can help with:
◦ Backing up critical services
◦ Increasing availability
◦ Easing management
Example: VMware
At AGLT2 we have VMware Enterprise running:
◦ LFC, 3 Squid servers, the OSG gatekeeper, ROCKS headnodes (dev/prod), 2 of 3 Kerberos/AFS/NIS nodes, a central syslog-ng host, the muon splitter, and 2 of 5 AFS file servers
"HA" can ensure services run even if a server fails. Backup is easy as well.
We can "live-migrate" VMs between 3 servers, or migrate VM storage to an alternate back-end storage server.
Example: AGLT2 VMware
(Not shown are the 10GE connections, one per server.)
Example: Details for UMVM02
Backups
"You do have backups, right?..." Scary question, huh?!
Backups provide a form of resiliency against various hardware failures and unintentional acts of stupidity.
They could be anything from a full tape-based backup service to various cron scripts saving needed config info.
Not always easy to get right... test!
System Tuning
Lots of topics could be put here, but I will just mention a few items.
You can install 'ktune' (yum install ktune). It provides some tunings for large-memory systems running disk- and network-intensive applications (a sketch of enabling it follows).
See the related storage/network tunings.
Memory is a likely bottleneck in many cases... have lots!
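On a RHEL5-era system, enabling ktune is just (a minimal sketch):

  yum install ktune      # install the tuning profile package
  chkconfig ktune on     # apply at boot
  service ktune start    # apply the tunings now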
Cluster Monitoring...
This is a huge topic. In general you can't find problems if you don't know about them, and you can't effectively manage systems if you can't monitor them.
I will list a few monitoring programs that I have found useful.
There are many options in this area that I won't cover: Nagios is a prime example, being used very successfully.
Ganglia
Ganglia is a cluster monitoring program available from http://ganglia.sourceforge.net/ and also distributed as part of ROCKS.
It allows a quick view of CPU and memory use cluster-wide.
You can drill down into host-specific details.
It is easy to extend to monitor additional data or aggregate sites (see the sketch below).
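Extending Ganglia is often just a cron-driven call to gmetric; a minimal sketch (the metric name and the probe are illustrative stand-ins):

  #!/bin/bash
  # publish a custom metric into Ganglia; here a simple process count
  NPROC=$(ps ax | grep -c '[d]cache')   # stand-in for a real probe
  gmetric --name dcache_processes --value $NPROC --type uint32 --units count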
Example Ganglia Interface
Cacti Monitoring
Cacti (see http://www.cacti.net/) is a network graphing package using SNMP and RRDtool to record data.
It can be extended with plugins (threshold, monitoring, MAC lookup).
Example Cacti Graphs
Shown: inbound and outbound AGLT2 10GE bytes/sec, aggregate 'ntpd' offset (ms), space-token stats (put/get), Postgres DB stats, and NFS client statistics.
Custom Monitoring
Philippe Laurens (MSU) has developed a summary page for AGLT2 which quickly shows cluster status.
Automated Monitoring/Recovery
Some types of problems can be easily "fixed" if we can just identify them.
The 'monit' software ('yum install monit') provides an easy way to test various system/software components and attempt to remediate problems.
Configure a file per item to watch/test.
It is very configurable, and it can fix problems at 3 AM! Some examples follow:
Monit Example for MySQL
This describes the relevant MySQL info for this host:

  # mysqld monitoring
  check process mysqld with pidfile /var/lib/mysql/dq2.aglt2.org.pid
    group database
    start program = "/etc/init.d/mysql start"
    stop program = "/etc/init.d/mysql stop"
    if failed host 127.0.0.1 port 3306 protocol mysql 3 cycles then restart
    if failed host 127.0.0.1 port 3306 protocol mysql 3 cycles then alert
    if failed unixsocket /var/lib/mysql/mysql.sock protocol mysql 4 cycles then alert
    if 5 restarts within 10 cycles then timeout

Restarting and alerting are triggered based upon these tests.
This resides in /etc/monit.d as mysqld.conf.
Other Monitoring/Management
Lots of sites utilize "simple" scripts run via "cron" (or equivalent) that:
◦ Perform regular maintenance
◦ Check for "known" problems
◦ Back up data or configurations
◦ Extract monitoring data
◦ Remediate commonly occurring failures
These can be very helpful for increasing reliability and performance (a sketch follows).
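A flavor of such a cron script (the threshold and the e-mail address are entirely illustrative):

  #!/bin/bash
  # run from cron (e.g. */30 * * * *): warn when any filesystem is nearly full
  THRESH=90   # percent full
  df -P | awk -v t=$THRESH 'NR>1 { sub("%","",$5); if ($5+0 > t) print $6, $5"%" }' |
  while read fs pct; do
      echo "$(hostname): $fs is at $pct" | mail -s "disk space warning" admin@example.org
  done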
Security Considerations
Security is a whole separate topic... not appropriate to cover it here.
The general issue is that unless security is also addressed, your otherwise high-performing, robust infrastructure may have large downtimes while you try to contain and repair system compromises!
Good security practices are part of building robust infrastructures.
Configuration Management
Not directly related to performance or reliability, but very important.
Common tools:
◦ Code management, versioning (Subversion, CVS)
◦ Provisioning and configuration management (ROCKS, Kickstart, Puppet, Cfengine)
All important for figuring out what was changed and what is currently configured.
Regular Storage "Maintenance"
Start with the bits on disk: run 'smartd' to look for impending failures (see the sketch below).
Use "patrol reads" or background consistency checks to find bad sectors.
Run filesystem checks when things are "suspicious" (xfs_repair, fsck...).
Run higher-level consistency checks (like Charles' ccc.py script) to ensure various views of your storage are consistent.
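A minimal smartd configuration sketch for /etc/smartd.conf (the device names and e-mail address are illustrative):

  # monitor all SMART attributes, e-mail on trouble,
  # and run a short self-test nightly at 02:00
  /dev/sda -a -m admin@example.org -s (S/../.././02)
  /dev/sdb -a -m admin@example.org -s (S/../.././02)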
High Level Storage Consistency
Being run at MWT2 and AGLT2.
Allows finding consistency problems and "dark" data.
dCache Monitoring/Management
AGLT2 has monitoring/management specific to dCache (as an example).
Other storage solutions may have similar types of monitoring.
We have developed some custom pages in addition to the standard dCache services web interface.
These track usage and consistency.
We also have a series of scripts running in 'cron' doing routine maintenance/checks.
dCache Allocation and Use

dCache Consistency Page
WAN Network Monitoring
Within the Throughput group we have been working on network monitoring as complementary to throughput testing.
Two measurement/monitoring areas:
◦ perfSONAR at Tier-1/Tier-2 sites: "network"-specific testing
◦ Automated transfer testing: "end-to-end" using standard ATLAS tools
◦ We may add a "transaction test" next (TBD)
Network Monitoring: perfSONAR
As you are by now well aware, there is a broad-scale effort to standardize network monitoring under the perfSONAR framework.
Since the network is so fundamental to our work, we targeted implementation of a perfSONAR instance at all our primary facilities. We have ~20 sites running.
It has already proven very useful in USATLAS!
perfSONAR Examples: USATLAS
perfSONAR in USATLAS
The typical Tier-1/Tier-2 installation provides two systems (using the same KOI hardware at each site): a latency node and a bandwidth node.
Automated recurring tests are configured for both latency and bandwidth between all Tier-1/Tier-2 sites ("mesh" testing).
We are acquiring a baseline and history of network performance between sites.
On-demand testing is also available.
Production System Testing
While perfSONAR is becoming the tool of choice for monitoring network behavior between sites, we also need to track the "end-to-end" behavior of our complex, distributed systems.
We are utilizing regularly scheduled automated testing, sending specific data between sites to verify proper operation.
This is critical for problem isolation; comparing network and application results can pinpoint problem locations.
Automated Data Transfer Tests
As part of the USATLAS Throughput work, Hiro has developed an automated data transfer system which utilizes the standard ATLAS DDM system.
This allows us to monitor the throughput of the system on a regular basis.
It transfers a set of files once per day from the Tier-1 to each Tier-2, for two different destinations.
Recently it was extended to allow arbitrary source/destination pairs (including Tier-3s).
http://www.usatlas.bnl.gov/dq2/throughput
Web Interface to Throughput Test

Throughput Test Graph #1
Throughput Test Graph #2
Throughput Test Graph #3
Future Throughput Work
With the recent release of an updated perfSONAR, we are in a position to acquire a useful baseline of network performance between our sites.
A number of potential network issues requiring some debugging are starting to appear.
As we acquire data, both from perfSONAR and from throughput testing, we need to start developing higher-level diagnostics and alerting systems. (How best to integrate with "Operations"?)
General Considerations (1/2)
Lots of things can impact both reliability and performance.
At the hardware level:
◦ Check for driver updates
◦ Examine firmware/BIOS versions (newer isn't always better, BTW)
Software versions... are there fixes for known problems?
Test changes: do they do what you thought? What else did they break?
General Considerations (2/2)
Sometimes the additional complexity added for "resiliency" actually decreases availability compared to doing nothing!
Having test equipment to experiment with is critical for trying new options.
Often you need to trade off cost vs. performance vs. reliability (pick 2).
Documentation, issue tracking and version control systems are your friends!
Summary
There are many components and complex interactions possible in our sites.
We need to understand our options (frequently site-specific) to help create robust, high-performing infrastructures.
Monitoring is central to delivering both reliability and performance.
Reminder: test changes to make sure they actually do what you want (and not something you don't want!).
Questions?
Backup Slides