Monitoring Performance
Finisar Corporation
December 2006
complexity
heterogeneity
virtualization
change
fabric blindness
The growing SAN challenge
you don’t know what to do when things go wrong
you don’t know the source of SAN issues
you don’t know what you can’t see
Fabric blindness leads to…
Application brownouts or blackouts occur - and have significant business impact
Frantic fire-fighting
Internal finger-pointing: application vs. network vs. storage
External finger-pointing: vendors
Unacceptably long resolution times
Information Highway
Network performance shares many similarities with that of your daily commute
Storage Area Networks are no exception
Just like a large sprawling city, as SANs grow, performance becomes more difficult to ensure
Let's take a look at planning for a faster commute
Is Performance Important?
Bugatti Veyron – the fastest production car: 0–60 mph in 3.2 seconds, top speed well over 200 mph, price more than $1,000,000
Chevy Matiz – one of the slowest cars: 0–60 mph in 21.9 seconds, top speed about 85 mph, price about $10,000
The difference: 6.8 times the acceleration, 3–4 times as fast, more than 10 times the price
The Real Difference
In this environment they both go the same speed.
In fact, in most environments they would take roughly the same time to get from A to B.
So maybe the right question is: when is performance important, and how is it measured?
Rush Hour
In LA, which has the worst rush-hour commute time, there is an 81% average delay during rush hour
Often certain routes are congested while others have limited traffic that is not affected
On many SANs there is a 500% average delay during peak times
No problem is reported (via a timeout) until response time reaches 6000% of the normal maximum and 75,000% of the low-load average
Queues (just like on-ramps) can fill even under low-bandwidth conditions
Often certain routes are congested while others have limited traffic that is not affected
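The percentages above imply a wide gap between normal response times and the point at which a timeout is finally reported. A minimal sketch of that arithmetic follows, assuming a hypothetical 30-second timeout; the timeout value and the function name are illustrative and do not come from the presentation.

```python
# Hypothetical timeout value; the slides give only the percentages.
TIMEOUT_S = 30.0

# From the slide: a timeout is reported at 6000% of the normal maximum
# and 75,000% of the low-load average.
PCT_OF_NORMAL_MAX = 6000
PCT_OF_LOW_LOAD_AVG = 75_000


def implied_baselines(timeout_s):
    """Return (normal_max_s, low_load_avg_s) implied by the timeout."""
    normal_max_s = timeout_s / (PCT_OF_NORMAL_MAX / 100)
    low_load_avg_s = timeout_s / (PCT_OF_LOW_LOAD_AVG / 100)
    return normal_max_s, low_load_avg_s


normal_max, low_load_avg = implied_baselines(TIMEOUT_S)
print(f"normal maximum:   {normal_max * 1000:.0f} ms")
print(f"low-load average: {low_load_avg * 1000:.0f} ms")
```

Under that assumption, response times could degrade from tens of milliseconds to tens of seconds before anything is reported, which is the blind spot the slide describes.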
The impact of accidents
The impact of accidents depends on their severity
Pileups can result in routes that are impassable
Minor accidents can cause delays that far exceed even the impact of rush hour
The impact of errors
The impact of errors depends on the severity of the issue
Physical errors can result in routes that are unusable
Occasional errors can cause delays that far exceed even the impact of rush hour
Patch Work
Often short term solutions to problems become long term hazards
Planning for and monitoring the commute
City planners architect the roadways for what they believe will be the commute demands
In some cases they use simulation to compare various alternatives
Finally they monitor the traffic patterns to prevent and resolve problems and better plan for the future
Planning for and monitoring the SAN
SAN architects plan the fabrics for what they believe will be the storage demands
In some cases they use simulation and tests to compare various alternatives
Finally they monitor the traffic patterns to prevent and resolve problems and better plan for the future
Planning
The roadways are designed for the expected traffic loads
Often one of the biggest mistakes in planning is relying on out-of-date information or incorrect assumptions.
Planning
The fabrics are designed for the expected traffic loads
Often one of the biggest mistakes in planning is relying on out-of-date information or incorrect assumptions.
Simulations are sometimes used to compare changes
Monitoring: I/O’s Per Second
Which route has more cars passing by every second?
In this scenario they could all be the same…
Some with a few cars moving very fast, others with many cars moving slowly
So what, if anything, does that measurement tell us about performance?
Which route has more MBs passing through every second?
In this scenario they could all be the same…
Some with no requests and some with slow requests due to congestion
So what if anything does that measurement tell us about performance?
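The point that a throughput count alone says little about performance can be illustrated with Little's law, which relates throughput, response time, and the number of I/Os in flight. The sketch below uses hypothetical link names and numbers; it is not output from any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    name: str
    iops: float             # I/Os completed per second
    avg_response_ms: float  # average time to complete one I/O


def outstanding_ios(sample):
    """Little's law: average I/Os in flight = throughput x response time."""
    return sample.iops * (sample.avg_response_ms / 1000.0)


# Hypothetical samples: identical IOPS, very different response times.
samples = [
    LinkSample("link_a", iops=2000, avg_response_ms=1.0),
    LinkSample("link_b", iops=2000, avg_response_ms=40.0),
]

for s in samples:
    print(f"{s.name}: {s.iops:.0f} IOPS, {s.avg_response_ms:.1f} ms, "
          f"{outstanding_ios(s):.0f} I/Os in flight")
```

Both links report the same I/Os per second, yet the second keeps far more requests waiting, which is the difference between a few fast cars and many slow ones.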
Modern Monitoring
Looks at the real traffic flows
Can assess performance
Pinpoints the source of slowdowns such as accidents and congestion
Speeds resolution of many problems
In many cases helps to prevent issues from becoming problems
Different methods of network monitoring
Software monitoring: does not interfere with the physical link; requires a software agent; affected by host system performance
Hardware monitoring: isolated from software and host issues; intrusive on the physical link; requires dedicated monitoring hardware
single TAP
Performance Analysis and Tuning
Request size and queue depth are two key contributors to performance tuning
Pre-production run with variable queue depth and request size. A higher queue depth could increase throughput, but it could also cause congestion and reduce throughput
Chart series: Queue = 2, Queue = 4, Queue = 8, Queue = 16
Read size of 8 KB with variable queue depth setting. Response time ranges from 10 ms to 65 ms. The ideal queue depth for this system would be 8 with 8 KB I/O
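One way to pick a queue depth from such a pre-production run is to estimate I/Os per second at each setting and take the highest. The sketch below assumes Little's law and uses made-up response times: only the 10 ms and 65 ms endpoints come from the slide, and assigning them to queue depths 2 and 16 is itself an assumption.

```python
# Hypothetical pre-production measurements for 8 KB reads:
# queue depth -> average response time in milliseconds.
response_ms_by_queue_depth = {2: 10.0, 4: 12.0, 8: 18.0, 16: 65.0}


def iops(queue_depth, response_ms):
    """Estimate I/Os per second from Little's law: in-flight / response time."""
    return queue_depth / (response_ms / 1000.0)


results = {qd: iops(qd, ms) for qd, ms in response_ms_by_queue_depth.items()}
best_queue_depth = max(results, key=results.get)

for qd, value in results.items():
    print(f"queue depth {qd:>2}: {value:7.1f} IOPS")
print("best queue depth:", best_queue_depth)
```

With these illustrative numbers the estimate peaks at a queue depth of 8 and falls at 16, where congestion drives response time up faster than the extra outstanding requests can compensate.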
Queue depth of 4 with variable read size
Throughput is gained at the expense of latency
At 32 KB I/O, the throughput gain no longer keeps up with the increase in latency
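The trade-off between throughput and latency can be checked step by step: for each increase in read size, compare the relative throughput gain with the relative latency increase. The measurements below are hypothetical and chosen only to illustrate the comparison; the function name is likewise illustrative.

```python
QUEUE_DEPTH = 4

# Hypothetical measurements: read size in KB -> average response time in ms.
response_ms_by_read_kb = {4: 2.0, 8: 2.5, 16: 3.5, 32: 6.5}


def throughput_kb_per_s(read_kb, response_ms):
    """Little's law: in-flight I/Os x request size / response time."""
    return QUEUE_DEPTH * read_kb / (response_ms / 1000.0)


def first_size_where_latency_outpaces(measurements):
    """Return the first read size whose throughput gain over the previous
    size is smaller than its latency increase, or None if there is none."""
    sizes = sorted(measurements)
    for prev, cur in zip(sizes, sizes[1:]):
        tput_gain = (throughput_kb_per_s(cur, measurements[cur])
                     / throughput_kb_per_s(prev, measurements[prev])) - 1
        latency_gain = measurements[cur] / measurements[prev] - 1
        if tput_gain < latency_gain:
            return cur
    return None


knee_kb = first_size_where_latency_outpaces(response_ms_by_read_kb)
print("latency outpaces throughput from", knee_kb, "KB reads")
```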
Good Performance Monitoring
Does not focus on the irrelevant
Alarm for known issues
Unless there is an increasing pattern
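The alarm policy described above (alarm on known issues, ignore the irrelevant unless it shows an increasing pattern) could be sketched as follows. The event names, counts, and the three-interval window are all hypothetical.

```python
def is_increasing(counts, min_windows=3):
    """True if the last `min_windows` counts rise strictly, interval over interval."""
    if len(counts) < min_windows:
        return False
    recent = counts[-min_windows:]
    return all(a < b for a, b in zip(recent, recent[1:]))


def should_alarm(event, counts, known_issues):
    """Alarm on known issues, or on any event whose count keeps increasing."""
    return event in known_issues or is_increasing(counts)


# Hypothetical event names and per-interval counts.
KNOWN_ISSUES = {"known_issue_event"}
history = {
    "minor_event": [1, 0, 1, 1, 0],        # occasional and flat: no alarm
    "rising_event": [0, 1, 3, 7, 15],      # increasing pattern: alarm
    "known_issue_event": [0, 0, 0, 0, 1],  # known issue: alarm
}

for event, counts in history.items():
    state = "ALARM" if should_alarm(event, counts, KNOWN_ISSUES) else "ok"
    print(event, "->", state)
```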
Effects of SAN performance monitoring
Eliminate internal and vendor finger-pointing
Receive advance warning of potential problems
Reduce business risk
Two recent customer case studies
Case 1: A SAN problem was the root cause of an application disruption
Case 2: A SAN problem was suspected as the root cause of an application disruption - but it was not the cause
Case 1: company profile
Large US insurance firm
Broad offering of insurance and financial products
10,000+ agents and employees
Large Microsoft Exchange implementation
Exchange data replicated to a remote site for backup and disaster recovery
Case 1: customer crisis
Exchange application slowed and became essentially unusable
User complaints flooded IT
Business operations adversely impacted
Case 1: resolution efforts
Exchange server event log - no problems
Storage array log files - no problems
Primary and secondary DR links tested - no problems
Switch fabric manager - no problems
Exchange throughput still low - pressure mounting - but no way to diagnose the problem. Elapsed time = 8+ hours
Case 1: Modern Performance monitoring
Probed storage link – unusually high Exchange Completion Times – proved the SAN was the problem
Storage array response – good
Remote replication acknowledgments – too long
Solution – re-routed the DR traffic through the secondary link – Exchange performance restored. Elapsed time = 30 minutes
Cause and fix – the remote switch was busy dealing with an RSCN storm caused by a bad HBA in an unrelated application server at the remote site – replaced the HBA
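The diagnosis above rests on separating completion times by component. A minimal sketch of that breakdown follows; the record layout, component labels, timestamps, and threshold are hypothetical and do not represent the monitoring tool's actual data format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exchange records: (component, command_time_s, status_time_s)
exchanges = [
    ("array_response", 0.000, 0.004),
    ("array_response", 0.010, 0.015),
    ("replication_ack", 0.001, 0.251),
    ("replication_ack", 0.011, 0.311),
]

# Hypothetical threshold for an acceptable average completion time.
MAX_AVG_ECT_S = 0.050


def average_ect(records):
    """Average exchange completion time (status time - command time) per component."""
    by_component = defaultdict(list)
    for component, start, end in records:
        by_component[component].append(end - start)
    return {name: mean(times) for name, times in by_component.items()}


averages = average_ect(exchanges)
slow = [name for name, ect in averages.items() if ect > MAX_AVG_ECT_S]

for name, ect in averages.items():
    print(f"{name}: {ect * 1000:.1f} ms")
print("slow components:", slow)
```

With these illustrative numbers the array responds in a few milliseconds while replication acknowledgments take hundreds, which mirrors the pattern reported in the case.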
Sync replication impact on production (chart: remote replication enabled vs. remote replication disabled)
Case 1: summary
Normal business operations were quickly restored
Conclusive data prevented finger-pointing
Without deep SAN performance monitoring/analysis, it would have taken extraordinary effort to get to the root cause and resolution
If deep SAN monitoring/analysis had been in place, the problem would have been prevented
Case 2: company profile
Large UK financial services firm
Assets of £540 billion
Over 20 million customers
Major UK mortgage and savings provider and credit card issuer
Relies on Oracle databases for transaction processing systems
Case 2: problem statement
Sudden but intermittent slowdown of Oracle-based applications
Widespread user complaints driving high level of internal visibility
Business operations adversely impacted
SAN was assumed to be the problem
Case 2: Modern Performance Monitoring
Deep SAN monitoring/analysis solution already in place
Quickly determined that all SAN parameters were within normal ranges - problem was not within the SAN
Trending report indicated time of problem occurrence - IT tracked back to an application “enhancement”
Elapsed time = <30 mins
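A trending report can point to the time a problem began by locating the first sample that departs from an established baseline. The sketch below uses hypothetical hourly traffic figures and an arbitrary 1.5x threshold; the function name and parameters are illustrative.

```python
from statistics import mean

# Hypothetical hourly link traffic samples: (hour label, MB/s).
trend = [
    ("08:00", 40), ("09:00", 42), ("10:00", 41), ("11:00", 43),
    ("12:00", 95), ("13:00", 98), ("14:00", 97),
]


def first_step_change(samples, baseline_len=4, factor=1.5):
    """Return the label of the first sample exceeding `factor` x the baseline
    average (taken over the first `baseline_len` samples), or None."""
    baseline = mean(value for _, value in samples[:baseline_len])
    for label, value in samples[baseline_len:]:
        if value > factor * baseline:
            return label
    return None


print("traffic changed at:", first_step_change(trend))
```

Knowing when the pattern changed lets IT match it against change records, which is how the case was traced back to an application change.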
Increased link traffic
Case 2: summary
Quickly identified that the SAN was not the root problem
Identified the exact time of problem manifestation – helped identify the root cause: a poorly designed database query
Quickly restored normal business operations
Customer acknowledgement: without a deep SAN monitoring/analysis solution, it would have taken days and many unproductive efforts to resolve
Where do you stand?
Are your networks being planned with appropriate, timely information, or are they just happening?
How are you monitoring performance? Do you know if your response times are degrading? Are your queue depth settings correct? How would you react to a brownout?
What would the impact be to your business of response times that were 6000% longer than you are seeing now due to errors or congestion?
Does your monitoring alert you to conditions that are irrelevant while not informing you of conditions that are likely to impact your business?
Are you flying blind when it comes to the health and performance of your SAN?
Thank you. Questions?
Or contact us to:
Get a Finisar SAN assessment of your availability and performance needs
Walk through detailed SAN diagnostic scenarios
Schedule a web briefing for your organization
Today’s slides: www.finisar.com/webcast/NW1006.php