December 2006

Finisar Corporation

Monitoring Performance

The growing SAN challenge

complexity

heterogeneity

virtualization

change

fabric blindness

you don’t know what to do when things go wrong

you don’t know the source of SAN issues

you don’t know what you can’t see

Fabric blindness leads to…

Application brownouts or blackouts occur and have significant business impact

Frantic fire-fighting

Internal finger-pointing: application vs. network vs. storage

External finger-pointing: vendors

Unacceptably long resolution times

Information Highway

Network performance shares many similarities with that of your daily commute

Storage Area Networks are no exception

Just as in a large, sprawling city, performance becomes more difficult to ensure as SANs grow

Let’s take a look at planning for a faster commute

Is Performance Important?

Bugatti Veyron: the fastest production car
0-60 mph in 3.2 seconds
Top speed well over 200 mph
Price more than $1,000,000

Chevy Matiz: one of the slowest cars
0-60 mph in 21.9 seconds
Top speed about 85 mph
Price about $10,000

The difference
6.8 times the acceleration
3-4 times as fast
More than 100 times the price

The Real Difference

In this environment they both go the same speed.

In fact, in most environments they would take roughly the same time to get from A to B.

So maybe the right question is: when is performance important, and how is it measured?

Rush Hour

In LA, which has the worst rush-hour commute time, there is an 81% average delay during rush hour

Often certain routes are congested while others have limited traffic that is not affected

On many SANs there is a 500% average delay during peak times

There is no notification of a problem (a timeout) until the delay reaches 6000% of the normal maximum and 75,000% of the low-load average

Queues (just like on-ramps) can fill even under low-bandwidth conditions

Often certain routes are congested while others have limited traffic that is not affected
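The two percentages quoted for the timeout imply baseline figures once a timeout value is chosen. A minimal sketch, assuming a hypothetical 30-second timeout (the slides do not state one) and reading "6000% of normal" as 60 times the normal maximum:

```python
# Hypothetical timeout value; the slides do not state one.
TIMEOUT_S = 30.0

# Percentages quoted on the slide.
TIMEOUT_VS_NORMAL_MAX_PCT = 6000     # timeout is 6000% of the normal maximum
TIMEOUT_VS_LOW_LOAD_AVG_PCT = 75000  # timeout is 75,000% of the low-load average


def implied_baselines(timeout_s):
    """Return (normal_max_s, low_load_avg_s) implied by the slide's percentages."""
    normal_max_s = timeout_s / (TIMEOUT_VS_NORMAL_MAX_PCT / 100)
    low_load_avg_s = timeout_s / (TIMEOUT_VS_LOW_LOAD_AVG_PCT / 100)
    return normal_max_s, low_load_avg_s


normal_max_s, low_load_avg_s = implied_baselines(TIMEOUT_S)
print(f"normal maximum:   {normal_max_s * 1000:.0f} ms")
print(f"low-load average: {low_load_avg_s * 1000:.0f} ms")
```

Under that assumption, a response time can grow from tens of milliseconds to tens of seconds before a timeout is the first notification anyone receives.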


The impact of accidents

The impact of accidents depends on their severity

Pileups can result in routes that are impassable

Minor accidents can cause delays that far exceed even the impact of rush hour

The impact of errors

The impact of errors depends on the severity of the issue

Physical errors can result in routes that are unusable

Occasional errors can cause delays that far exceed even the impact of rush hour

Patch Work

Often short-term solutions to problems become long-term hazards

Planning for and monitoring the commute

City planners architect the roadways for what they believe will be the commute demands

In some cases they use simulation to compare various alternatives

Finally they monitor the traffic patterns to prevent and resolve problems and better plan for the future

Planning for and monitoring the SAN

SAN architects plan the fabrics for what they believe will be the storage demands

In some cases they use simulation and tests to compare various alternatives

Finally they monitor the traffic patterns to prevent and resolve problems and better plan for the future

Planning

The roadways are designed for the expected traffic loads

Often one of the biggest mistakes in planning is relying on out-of-date information or incorrect assumptions.

Planning

The fabrics are designed for the expected traffic loads

Often one of the biggest mistakes in planning is relying on out-of-date information or incorrect assumptions.

Simulations are sometimes used to compare changes

Monitoring: I/Os per second

Which route has more cars passing by every second?

In this scenario they could all be the same: some with a few cars moving very fast, others with many cars going slowly

So what, if anything, does that measurement tell us about performance?

Which route has more MB passing through every second?

In this scenario they could all be the same: some with no requests and some with slow requests due to congestion

So what, if anything, does that measurement tell us about performance?
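The point of the two questions above can be shown with numbers. A minimal sketch with invented figures (the link names, window length and counts are all hypothetical): two links report identical I/Os per second and MB per second, yet one is 20 times slower per request. Little's Law (average I/Os in flight = rate x average latency) gives the "cars on the road" count.

```python
from dataclasses import dataclass

WINDOW_S = 10.0  # hypothetical sampling window, in seconds


@dataclass
class LinkSample:
    name: str
    completed_ios: int      # I/Os completed during the window
    io_size_bytes: int      # size of each I/O
    total_latency_s: float  # sum of the completion times of those I/Os

    def iops(self) -> float:
        return self.completed_ios / WINDOW_S

    def mb_per_s(self) -> float:
        return self.completed_ios * self.io_size_bytes / WINDOW_S / 1e6

    def avg_latency_ms(self) -> float:
        return self.total_latency_s / self.completed_ios * 1000

    def avg_outstanding(self) -> float:
        # Little's Law: average I/Os in flight = rate * average latency
        return self.iops() * (self.total_latency_s / self.completed_ios)


# Invented sample data: same I/O count and size, very different latency.
fast = LinkSample("fast link", 10_000, 8192, 20.0)
congested = LinkSample("congested link", 10_000, 8192, 400.0)

for link in (fast, congested):
    print(f"{link.name}: {link.iops():.0f} IOPS, {link.mb_per_s():.1f} MB/s, "
          f"{link.avg_latency_ms():.1f} ms average, "
          f"{link.avg_outstanding():.0f} I/Os in flight")
```

Both links look the same on a throughput graph; only the per-request latency separates the few fast cars from the many slow ones.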

Modern Monitoring

Looks at the real traffic flows

Can assess performance

Pinpoints the source of slowdowns such as accidents and congestion

Speeds resolution of many of the problems

In many cases helps to prevent issues from becoming problems

Different methods of network monitoring

Software monitoring
No interference with the physical link
Software agent needed
Affected by host system performance

Hardware monitoring
Isolated from software and host issues
Intrusive on the physical link
Dedicated monitoring hardware

Modern Monitoring

single TAP


Performance Analysis and Tuning

Request size and queue depth are two key contributors to performance tuning

Pre-production run with variable queue depth and request size. A higher queue depth could increase throughput, but it could also cause congestion and reduce throughput

Queue = 2, 4, 8, 16

Read size 8 KB with variable queue depth setting. Response time ranges from 10 ms to 65 ms. The ideal queue depth for this system would be 8 with 8 KB I/O


Queue depth of 4 with variable read size

Throughput gain at the expense of latency

At 32 KB I/O, the throughput gain no longer keeps up with the increase in latency
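The trade-off between queue depth and response time can be estimated from a pre-production sweep. A minimal sketch, assuming the queue is kept full so that Little's Law applies (throughput = I/Os in flight / response time). Only the 10 ms and 65 ms endpoints come from the slide; the intermediate values and all names are invented for illustration.

```python
# Hypothetical pre-production measurements for 8 KB reads:
# queue depth -> average response time in milliseconds.
# Only the 10 ms and 65 ms endpoints come from the slide; the rest are invented.
RESPONSE_MS_BY_QUEUE_DEPTH = {2: 10.0, 4: 12.0, 8: 16.0, 16: 65.0}


def estimated_iops(queue_depth, response_ms):
    """Little's Law: throughput = I/Os in flight / time each I/O spends in flight."""
    return queue_depth / (response_ms / 1000.0)


def best_queue_depth(response_ms_by_depth):
    """Return the queue depth with the highest estimated throughput."""
    return max(
        response_ms_by_depth,
        key=lambda depth: estimated_iops(depth, response_ms_by_depth[depth]),
    )


for depth, response_ms in RESPONSE_MS_BY_QUEUE_DEPTH.items():
    print(f"queue depth {depth:2d}: {response_ms:5.1f} ms, "
          f"about {estimated_iops(depth, response_ms):.0f} IOPS")
print("best queue depth:", best_queue_depth(RESPONSE_MS_BY_QUEUE_DEPTH))
```

With these invented figures the estimate peaks at a queue depth of 8 and falls at 16, where the extra outstanding requests only add latency.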

Good Performance Monitoring

Does not focus on the irrelevant

Does not alarm for known issues, unless there is an increasing pattern
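One reading of the bullets above is that a known, steady condition should stay quiet and only raise an alarm once it starts to grow. A minimal sketch of that rule; the function names, the window of four samples and the strict-increase test are assumptions, not a description of any Finisar product.

```python
def is_increasing(samples, window=4):
    """True if the last `window` samples rise strictly from one to the next."""
    recent = samples[-window:]
    if len(recent) < window:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def should_alarm(error_counts, known_issue, window=4):
    """Decide whether to raise an alarm for per-interval error counts."""
    if not known_issue:
        # An unknown condition alarms as soon as any error is seen.
        return any(count > 0 for count in error_counts)
    # A known condition alarms only when it shows an increasing pattern.
    return is_increasing(error_counts, window)


print(should_alarm([3, 3, 2, 3, 3], known_issue=True))   # steady known issue
print(should_alarm([3, 3, 4, 6, 9], known_issue=True))   # known issue, growing
print(should_alarm([0, 0, 1], known_issue=False))        # new, unknown errors
```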

Effects of SAN performance monitoring

Eliminate internal and vendor finger-pointing

Receive advance warning of potential problems

Reduce business risk

Two recent customer case studies

Case 1: A SAN problem was the root cause of an application disruption

Case 2: A SAN problem was suspected as the root cause of an application disruption, but it was not the cause

Case 1: company profile

Large US insurance firm

Broad offering of insurance and financial products

10,000+ agents and employees

Large Microsoft Exchange implementation

Exchange data replicated to a remote site for backup and disaster recovery

Case 1: customer crisis

Exchange application slowed and became essentially unusable

User complaints flood IT

Business operations adversely impacted

Case 1: resolution efforts

Exchange server event log - no problems

Storage array log files - no problems

Primary and secondary DR links tested - no problems

Switch fabric manager - no problems

Exchange throughput still low, pressure mounting, but no way to diagnose the problem. Elapsed time = 8+ hours

Case 1: Modern Performance monitoring

Probed storage link: unusually high Exchange Completion Times, which proved the SAN was the problem

Storage array response: good

Remote replication acknowledgments: too long

Solution: re-route the DR traffic through the secondary link. Exchange performance restored. Elapsed time = 30 minutes

Cause and fix: the remote switch was busy dealing with an RSCN storm caused by a bad HBA in an unrelated application server at the remote site. The HBA was replaced
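Exchange Completion Time is the measurement that exposed the problem in this case. A minimal sketch of how it could be derived from captured frames, assuming it means the time from an exchange's command frame to its status frame. The record layout, exchange ids, timestamps and the 100 ms threshold are all hypothetical.

```python
def exchange_completion_times(frames):
    """Map exchange id -> seconds between its command frame and its status frame."""
    started = {}
    completed = {}
    for exchange_id, kind, timestamp_s in frames:
        if kind == "command":
            started[exchange_id] = timestamp_s
        elif kind == "status" and exchange_id in started:
            completed[exchange_id] = timestamp_s - started.pop(exchange_id)
    return completed


# Hypothetical capture: (exchange id, frame kind, timestamp in seconds).
capture = [
    ("0x1a", "command", 0.000),
    ("0x1b", "command", 0.001),
    ("0x1a", "status", 0.004),
    ("0x1b", "status", 0.251),
]

SLOW_THRESHOLD_S = 0.100  # hypothetical threshold for "unusually high"

times = exchange_completion_times(capture)
slow = {xid: ect for xid, ect in times.items() if ect > SLOW_THRESHOLD_S}
print("completion times:", times)
print("slow exchanges:", slow)
```

Measuring on the link itself is what separates a slow array from a slow replication acknowledgment, which host and array logs alone could not do in this case.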

Sync replication impact on production

Remote replication enabled

Remote replication disabled

Case 1: summary

Normal business operations were quickly restored

Conclusive data prevented finger-pointing

Without deep SAN performance monitoring/analysis, it would have taken extraordinary effort to get to the root cause and resolution

If deep SAN monitoring/analysis had been in place, the problem would have been prevented

Case 2: company profile

Large UK financial services firm

Assets of £540 billion

Over 20 million customers

Major UK mortgage and savings provider and credit card issuer

Relies on Oracle databases for transaction processing systems

Case 2: problem statement

Sudden but intermittent slowdown of Oracle-based applications

Widespread user complaints driving high level of internal visibility

Business operations adversely impacted

SAN was assumed to be the problem

Case 2: Modern Performance Monitoring

Deep SAN monitoring/analysis solution already in place

Quickly determined that all SAN parameters were within normal ranges - problem was not within the SAN

Trending report indicated time of problem occurrence - IT tracked back to an application “enhancement”

Elapsed time = less than 30 minutes

Increased link traffic
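A trending report pins down when a metric left its normal range, which is what let IT trace the problem back to a specific change. A minimal sketch of that idea, assuming a simple "baseline mean plus three standard deviations" rule; the rule, the sample values and the timestamps are all invented for illustration.

```python
from statistics import mean, stdev


def first_departure(samples, baseline_len, n_sigmas=3.0):
    """Return the timestamp of the first sample above
    baseline mean + n_sigmas * stdev, or None if there is none."""
    baseline = [value for _, value in samples[:baseline_len]]
    threshold = mean(baseline) + n_sigmas * stdev(baseline)
    for timestamp, value in samples[baseline_len:]:
        if value > threshold:
            return timestamp
    return None


# Hypothetical link traffic in MB/s, sampled every five minutes.
link_traffic = [
    ("09:00", 40), ("09:05", 42), ("09:10", 41), ("09:15", 39),
    ("09:20", 43), ("09:25", 88), ("09:30", 91),
]

print("traffic left its normal range at:", first_departure(link_traffic, baseline_len=4))
```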


Case 2: summary

Quickly identified that the SAN was not the root problem

Identified the exact time the problem appeared, which helped identify the root cause: a poorly designed database query

Quickly restored normal business operations

Customer acknowledgement: without a deep SAN monitoring/analysis solution, it would have taken days and many unproductive efforts to resolve

Where do you stand?

Are your networks being planned with appropriate, timely information, or are they just happening?

How are you monitoring performance? Do you know if your response times are degrading? Are your queue depth settings correct? How would you react to a brownout?

What would the impact be to your business of response times that were 6000% longer than you are seeing now due to errors or congestion?

Does your monitoring alert you to conditions that are irrelevant while not informing you of conditions that are likely to impact your business?

Are you flying blind when it comes to the health and performance of your SAN?

Thank You. Questions? Or contact us to:

Get a Finisar SAN assessment of your availability and performance needs

Walk through detailed SAN diagnostic scenarios

Schedule a web briefing for your organization

Today’s slides: www.finisar.com/webcast/NW1006.php
