Multi-tier Trace Correlation
Paul Offord, CTO, Advance7
1
Agenda
• Context
• Process-to-process communication
• Multi-tier traffic patterns
• Your questions
• Practical 1 – Timeframe and time accounting
• Your questions
• Correlation strategies
• Final questions
• Closing remarks
2
The Enemy
3
Recurring Gray Problem
It keeps happening
The causing technology is unknown
Performance
Error
Incorrect output
See Wikipedia
Recurring gray problems
4
Problem Manager
App Support
Data Networks
Server Support
Database Support
Solution Architects
?
Desk Support
Discovery
5
Software engineering principles
Standard IT diagnostic tools and techniques
RPR method (flowchart boxes and decisions)
• Enter from PM Process
• Gain detailed & accurate understanding of problem symptoms
• Share, gather, explain & sort
• Agreed understanding? (Yes/No)
• Choose one symptom to investigate
• Gain accurate understanding of the symptom environment
• Agree diagnostic objective and plan capture of definitive data
• Execute the diagnostic capture plan
• Review the captured diagnostics
• Is quality acceptable? (Yes/No)
• Analyse the captured diagnostics
• Analyse captured data
• Adequate diagnostics? (Yes/No)
• Root Cause identified? (Yes/No)
• Translate diagnostic data and present to the Support Team owning the RC technology
• Work with the owning Support Team to determine the fix
• Implement the fix and re-activate the diagnostic capture plan
• Fixed? (Yes/No)
• New Root Cause? (Yes/No)
• Exit to PM Process
RPR Principles
• Achieve Root Cause Identification (RCI)
• Focus on a single symptom
• Capture individual instances
• Use Definitive Diagnostic Data
• Capture in production
6
Performance – What happened?
7
[Diagram: PC, WAN Accel, WAN, WAN Accel and three Servers, with trace points (A)]
Time per element: 1.0s, 0.2s, 0.2s, 0.3s, 0.1s, 0.4s, 12.8s
User experiences 15s response time
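The time accounting on this slide can be sketched in a few lines of Python. The timings are the ones shown on the slide; the segment names are hypothetical labels, because the slide does not name the individual elements.

```python
# Per-segment times in seconds, as shown on the slide.
# The keys are hypothetical labels; the slide does not name each segment.
segment_times = {
    "segment_1": 1.0,
    "segment_2": 0.2,
    "segment_3": 0.2,
    "segment_4": 0.3,
    "segment_5": 0.1,
    "segment_6": 0.4,
    "segment_7": 12.8,
}

total = sum(segment_times.values())
slowest = max(segment_times, key=segment_times.get)
share = segment_times[slowest] / total

print(f"Total response time: {total:.1f}s")
print(f"Largest contributor: {slowest} ({share:.0%} of the total)")
```

The point of the account is that one element dominates the 15 seconds, so that is where the next trace should be taken.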
Error – What happened?
8
[Diagram: PC, WAN Accel, WAN, WAN Accel and three Servers, with trace points (A)]
Incorrect interaction
User receives an error message
Process-to-process communication
9
[Diagram, time increasing: Client Process and Server Process communicate over TCP Ports: Connect, Data Transfer, Disconnect]
Request-response Pairs
10
[Diagram, time increasing: Client, Network, Server; Request, Service Time, Response]
Note: Messages not packets
Client-Server Chains
11
Slow Response – Scenario 1
12
[Diagram, time increasing: C (client), S (Web Server), S (Database). Req/Rsp pairs between client and Web Server (10 seconds) and between Web Server and Database (9.5 seconds)]
Slow Response – Scenario 2
13
[Diagram, time increasing: C (client), S (Web Server), S (Database). Req/Rsp pairs between client and Web Server (10 seconds) and between Web Server and Database (9.5 seconds)]
Response Time Elements
• Client time
• Service time
• Request spread
• Response spread
14
Client and Service Time
15
[Diagram: C (client), S (Web), S (Database), with successive Req/Rsp pairs marking Client Time and Service Time]
Spread
16
[Diagram, time increasing: C (client), S (Web), S (Database). A request sent in parts (Req – Part a, b, c, d) shows Request Spread; responses sent in parts (Rsp – Part 1, 2 and Rsp – Part α, β, γ) show Response Spread; Client Time and Service Time are also marked]
Break for…
17
Questions?
Protocol Message vs. Packets
18
Filter expression:
tcp.port==80 && (tcp.len>0 || tcp.flags.syn==1) && !tcp.analysis.retransmission
• tcp.port==80: messages to service
• tcp.len>0: remove ACKs
• tcp.flags.syn==1: detect connect delays
• !tcp.analysis.retransmission: ignore retransmissions
Better filter expression (eliminates TCP keep-alive packets):
tcp.port==80 && (tcp.len>1 || tcp.flags.syn==1) && !tcp.analysis.retransmission
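As a minimal sketch, the filter expressions on this slide can be generated for any service port and handed to tshark. The function names and the capture file name are hypothetical; the tshark options used (-r, -Y, -T fields, -E, -e) and the field names are standard ones.

```python
def message_filter(port: int, skip_keepalives: bool = True) -> str:
    """Build the slide's display filter for one service port.

    skip_keepalives=True uses tcp.len>1, which also drops TCP keep-alive
    packets; False uses tcp.len>0.
    """
    min_len = 1 if skip_keepalives else 0
    return (
        f"tcp.port=={port} && (tcp.len>{min_len} || tcp.flags.syn==1) "
        "&& !tcp.analysis.retransmission"
    )


def tshark_args(capture_file: str, port: int) -> list[str]:
    """Argument list that exports the filtered packets as CSV fields."""
    fields = ["frame.number", "frame.time_epoch", "ip.src", "tcp.srcport",
              "ip.dst", "tcp.dstport", "tcp.len"]
    args = ["tshark", "-r", capture_file, "-Y", message_filter(port),
            "-T", "fields", "-E", "separator=,"]
    for field in fields:
        args += ["-e", field]
    return args


# "tms_web.pcap" is a hypothetical file name.
print(" ".join(tshark_args("tms_web.pcap", 80)))
```

The resulting list can be passed to subprocess.run, and the CSV output loaded into a spreadsheet for the calculations that follow.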
What about interleaved streams?
19
We’ll deal with this later
TMS Problem
• Simple workflow system
• Web browser, web server and database
• List of work items called tickets
• Click on ticket to display detail – Response time < 1 second
• Intermittent response time of 5+ seconds
20
Recurring Gray Problem
TMS Slow Response Time
21
TMS HTTP Trace
22
[Diagram: PC, Network, Linux Web Server (TMS App), Database Server; trace point A on TCP Port 80]
Request to TCP Port 80, then Response from TCP Port 80
Think time and Service Time are marked; Service Time is the time delta from last request pkt to first response pkt
Approx. response time of 6 seconds
HTTP Response Time
23
Request to TCP Port 80: 11:42:36.622843
Response from TCP Port 80: 11:42:42.770757
Service Time of 6.148s (last request pkt to first response pkt)
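The service time on this slide is simply the difference between the two packet timestamps. A small Python check, using the timestamps from the slide (the variable names are illustrative):

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Timestamps taken from the slide.
last_request_pkt = datetime.strptime("11:42:36.622843", FMT)
first_response_pkt = datetime.strptime("11:42:42.770757", FMT)

service_time = (first_response_pkt - last_request_pkt).total_seconds()
print(f"Service time: {service_time:.3f}s")  # prints "Service time: 6.148s"
```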
Time Accounting
24
[Diagram: PC, Network, Linux Web Server (TMS App), Database; trace point A]
Timings shown: 6.148s, < 1s
Break for…
25
Questions?
TMS Database Trace – Scenario 1
26
[Diagram: PC, Network, Linux Web Server (PHP App), Database Server; trace points A on TCP Port 80 and TCP Port 5432]
Request to TCP Port 80 at 11:42:36.622843; Response from TCP Port 80 at 11:42:42.770757
Within that Timeframe: two Req to 5432 / Rsp from 5432 pairs, with Service Time and Client Time marked
Database Response Time
27
Request to TCP Port 80: 11:42:36.622843
Response from TCP Port 80: 11:42:42.770757
Two Req to 5432 / Rsp from 5432 pairs: 0.5 + 28.8 ms
Also marked: 6.001s
Break for…
28
Questions?
Sort by TCP Connection
Use the quadruplet:
ClientIP:ClientPort:ServiceIP:ServicePort
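A minimal sketch of sorting by the quadruplet, assuming each packet has already been exported as a record with time, addresses and ports. The record layout, function names and sample addresses are hypothetical.

```python
from collections import defaultdict


def connection_key(pkt: dict, service_port: int) -> str:
    """Return ClientIP:ClientPort:ServiceIP:ServicePort for either direction."""
    if pkt["dst_port"] == service_port:  # client -> service
        return f'{pkt["src_ip"]}:{pkt["src_port"]}:{pkt["dst_ip"]}:{pkt["dst_port"]}'
    # service -> client: swap the two ends so both directions share one key
    return f'{pkt["dst_ip"]}:{pkt["dst_port"]}:{pkt["src_ip"]}:{pkt["src_port"]}'


def sort_by_connection(packets: list[dict], service_port: int) -> dict:
    """Group packets per TCP connection, each group in time order."""
    groups = defaultdict(list)
    for pkt in sorted(packets, key=lambda p: p["time"]):
        groups[connection_key(pkt, service_port)].append(pkt)
    return dict(groups)


# Hypothetical sample: two connections to a database on port 5432.
packets = [
    {"time": 0.000, "src_ip": "10.0.0.5", "src_port": 40001, "dst_ip": "10.0.0.9", "dst_port": 5432},
    {"time": 0.002, "src_ip": "10.0.0.9", "src_port": 5432, "dst_ip": "10.0.0.5", "dst_port": 40001},
    {"time": 0.001, "src_ip": "10.0.0.5", "src_port": 40002, "dst_ip": "10.0.0.9", "dst_port": 5432},
]
groups = sort_by_connection(packets, 5432)
for key, pkts in groups.items():
    print(key, len(pkts))
```

Grouping this way separates interleaved streams, so request-response pairs can be timed one connection at a time.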
29
Determining the Client Port
30
Calculating Service Time
31
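One way to calculate service time, sketched in Python rather than a spreadsheet: for packets of a single connection in time order, measure from the last request packet to the first response packet. The record layout and sample values are hypothetical.

```python
def service_times(packets: list[dict], service_port: int) -> list[tuple]:
    """Return (request_frame, seconds) per request-response pair.

    `packets` must hold ONE connection's packets in time order.
    """
    results = []
    last_request = None
    for pkt in packets:
        if pkt["dst_port"] == service_port:   # request packet (client -> service)
            last_request = pkt                # keep only the latest one
        elif last_request is not None:        # first response packet after a request
            results.append((last_request["frame"], pkt["time"] - last_request["time"]))
            last_request = None               # ignore the rest of this response
    return results


# Hypothetical connection: a two-packet request, a two-packet response,
# then a second request-response pair about six seconds later.
conn = [
    {"frame": 1, "time": 0.0000, "dst_port": 5432},
    {"frame": 2, "time": 0.0010, "dst_port": 5432},
    {"frame": 3, "time": 0.0300, "dst_port": 40001},
    {"frame": 4, "time": 0.0310, "dst_port": 40001},
    {"frame": 5, "time": 6.1000, "dst_port": 5432},
    {"frame": 6, "time": 6.1005, "dst_port": 40001},
]
for frame, seconds in service_times(conn, 5432):
    print(f"frame {frame}: {seconds * 1000:.1f} ms")
```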
Calculating Client Time
32
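Client time can be sketched the same way. The slides do not spell out the definition, so this assumes client time is the gap from the last response packet to the first packet of the next request on the same connection. The record layout and sample values are hypothetical.

```python
def client_times(packets: list[dict], service_port: int) -> list[tuple]:
    """Return (next_request_frame, seconds) for each gap between a response
    and the following request. `packets` is ONE connection in time order.

    Assumption: client time = last response packet -> next request packet.
    """
    results = []
    last_response = None
    for pkt in packets:
        if pkt["dst_port"] != service_port:   # response packet (service -> client)
            last_response = pkt               # keep only the latest one
        elif last_response is not None:       # first request packet after a response
            results.append((pkt["frame"], pkt["time"] - last_response["time"]))
            last_response = None
    return results


# Hypothetical connection with a long pause before the second request.
conn = [
    {"frame": 1, "time": 0.0000, "dst_port": 5432},
    {"frame": 2, "time": 0.0300, "dst_port": 40001},
    {"frame": 3, "time": 0.0310, "dst_port": 40001},
    {"frame": 4, "time": 6.1000, "dst_port": 5432},
    {"frame": 5, "time": 6.1005, "dst_port": 40001},
]
for frame, seconds in client_times(conn, 5432):
    print(f"before frame {frame}: {seconds:.3f} s")
```

A large client time on the database connection points at the tier acting as the database client, not at the database itself.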
Client Time Scatter Plot
33
[Scatter plot: Client Time for Database Trace. Y axis: Client Time (ms), 0 to 7,000. X axis: 0 to 30]
Mouse over this to get spreadsheet row number and hence trace frame number
Server Time Scatter Plot
34
[Scatter plot: Service Time for Database Trace. Y axis: Service Time (ms), 0 to 30. X axis: 0 to 30]
Updated Time Account
35
[Diagram: PC, Network, Linux Web Server (TMS App), Database; trace points A]
Timings shown: 6.148s, 6.089s, 0.059s, < 1s
Other services?
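The updated time account is a subtraction. This sketch assumes that 0.059s is the total database service time inside the timeframe and that 6.089s is what remains within the web server tier; the slide shows the figures but not their labels.

```python
# Figures from the slide; the assignment to tiers is an assumption.
web_service_time = 6.148   # seconds, measured at TCP port 80
database_time = 0.059      # seconds, assumed total for database requests

remaining_in_web_tier = web_service_time - database_time
share = remaining_in_web_tier / web_service_time

print(f"Not explained by the database: {remaining_in_web_tier:.3f}s ({share:.0%})")
```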
Most work but some don’t
36
Protocol                                        Flip-Flop
Web (HTTP and HTTPS)                            Yes
Web Services (e.g. .NET Remoting, WCF)          Yes
Other RPC (e.g. Java RMI, MSRPC)                Yes
Database (e.g. Microsoft [1], Sybase, Oracle)   Yes
File Server (SMB [2], SMB2 [3], NFS)            Yes
Many proprietary protocols                      Yes
Citrix ICA                                      No
Windows Terminal Server RDP                     No
[1] MARS may have to be considered
[2] Further sort criteria need to be considered
[3] Further sort criteria need to be considered
37
What about clock sync?
See the RPR book
Break for…
38
Questions?
Correlation Strategies
39
• Don’t need to
• Port-to-port mapping
• Based on data content
• Based on characterization
No Need - Scenario
40
Under load one transaction intermittently gave a 60+ second response time
[Diagram: Load Injector, HTTP Server, Customer Presentation Server (WAS), Siteminder Policy Server, HTTP Server, WebSphere Application Server, Oracle Database]
41
No Need - Analysis
[Diagram: HTTP Server, Customer Presentation Server (WAS), Siteminder Policy Server, HTTP Server, WebSphere Application Server, Oracle Database, linked by HTTPS]
Timestamps: 10:25:35.016 and 10:26:36.904
Timings shown: 62.300s, 62.291, 61.887s, 48.613s, 2s, 11s
Total for all response times (hundreds of them) during the 48.613-second timeframe is 0.5 seconds
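The "no need to correlate" test can be sketched as: add up every downstream response time inside the timeframe and compare the total with the timeframe itself. The 10% threshold, the function names and the sample records are hypothetical; the 0.5-second and 48.613-second figures are from the slide.

```python
def downstream_total(responses: list[dict], start: float, end: float) -> float:
    """Sum response times of downstream requests that start inside the timeframe."""
    return sum(r["duration"] for r in responses if start <= r["start"] <= end)


def can_eliminate(total_downstream: float, timeframe: float,
                  threshold: float = 0.1) -> bool:
    """True if downstream time is too small to explain the delay.

    The threshold is an arbitrary illustrative value, not part of the method.
    """
    return total_downstream < threshold * timeframe


# Hypothetical records: two short calls inside the timeframe, one outside it.
responses = [
    {"start": 10.0, "duration": 0.002},
    {"start": 20.0, "duration": 0.003},
    {"start": 100.0, "duration": 5.0},
]
print(downstream_total(responses, 0.0, 48.613))

# Slide figures: 0.5 s of downstream response time in a 48.613 s timeframe.
print(can_eliminate(0.5, 48.613))
```

Because the total is so small, the downstream tier is eliminated without matching any individual request to the slow transaction.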
42
No Need – Further elimination
[Diagram: HTTP Server, Customer Presentation Server (WAS), Siteminder Policy Server, HTTP Server, WebSphere Application Server, Oracle Database, linked by HTTPS and by SP & SQL/TNS]
Timestamps: 10:25:35.016 and 10:26:36.904
Timings shown: 62.300s, 62.291, 61.887s
Total for all WAS response times during the 61.887-second timeframe is 1.162 secs
Total for all database response times in 61.887-second timeframe is 1.181 secs
43
Port-to-port Mapping
[Diagram: PC, WAN, XenApp, File Server with User's Home Directory; Trace A, Trace B, Trace C]
Addresses shown: 192.168.3.22, 192.168.9.67, 192.168.1.38, 192.168.3.8
ICA Session Start and ICA Traffic: 192.168.3.9:8276 to 192.168.1.38:2598
SMB Tree Connect: 192.168.3.22:47006 to 192.168.3.8:445
User fredblogs starts the Citrix client
A short time later the XenApp server connects to \\mainfs\home\fredblogs
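A rough sketch of port-to-port mapping for this example: pair an ICA session start with the SMB Tree Connect that follows shortly afterwards for the same user's home directory. The record layout, the timestamps and the 30-second window are hypothetical, and the user name is assumed to be known to the analyst; the connections and the share path are from the slide.

```python
def map_sessions(ica_starts: list[dict], tree_connects: list[dict],
                 max_gap: float = 30.0) -> dict:
    """Map each ICA connection to the SMB connection opened soon after it
    for the same user's home directory."""
    mapping = {}
    for ica in ica_starts:
        suffix = "\\" + ica["user"].lower()
        for smb in tree_connects:
            gap = smb["time"] - ica["time"]
            if 0 <= gap <= max_gap and smb["path"].lower().endswith(suffix):
                mapping[ica["conn"]] = smb["conn"]
                break
    return mapping


# Times are hypothetical seconds from the start of the traces.
ica_starts = [
    {"time": 100.0, "user": "fredblogs",
     "conn": "192.168.3.9:8276:192.168.1.38:2598"},
]
tree_connects = [
    {"time": 50.0, "path": r"\\mainfs\home\someoneelse",
     "conn": "192.168.3.22:46990:192.168.3.8:445"},   # hypothetical earlier connect
    {"time": 104.0, "path": r"\\mainfs\home\fredblogs",
     "conn": "192.168.3.22:47006:192.168.3.8:445"},
]
print(map_sessions(ica_starts, tree_connects))
```

Once the two connections are paired, traffic on the SMB connection can be attributed to that user's ICA session.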
44
Content Matching
TMS Database Trace – Scenario 2
45
[Diagram: PC, Network, Linux Web Server (PHP App), Database Server; trace points A on TCP Port 80 and TCP Port 5432]
Request to TCP Port 80 at 11:42:36.622843; Response from TCP Port 80 at 11:42:42.770757
Req to 5432 / Rsp from 5432 pairs, with Service Time and Client Time marked
Database Response Time
46
Does this relate to our slow transaction?
6.029s
Request to TCP Port 80
Response from TCP Port 80
11:42:36.622843
11:42:42.770757
11:42:36.733601
11:42:42.762467
Content Matching - Response
47
[Diagram: PC, Network, Linux Web Server (TMS App), Database; trace points A]
PSG Create - Communications appears in both traces
48
Data Content - Response
Content Matching - Request
49
[Diagram: PC, Network, Linux Web Server (TMS App), Database; trace points A]
TicketNo=511129 in the web request; 511129 in the database request
50
Data Content - Request
Therefore: this slow database transaction relates to the web transaction
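Content matching on the request side can be sketched as: take the ticket number from the web request, then look for it in database requests sent inside the web transaction's timeframe. The URL and SQL text are hypothetical; the ticket number and timestamps are from the slides.

```python
import re
from datetime import datetime


def t(stamp: str) -> datetime:
    """Parse an HH:MM:SS.ffffff trace timestamp."""
    return datetime.strptime(stamp, "%H:%M:%S.%f")


# Web transaction timeframe, from the HTTP trace.
start, end = t("11:42:36.622843"), t("11:42:42.770757")

# Hypothetical request line; only the TicketNo value comes from the slide.
web_request = "GET /ticket.php?TicketNo=511129"
token = re.search(r"TicketNo=(\d+)", web_request).group(1)

# Hypothetical database request payloads.
db_requests = [
    {"time": t("11:42:36.700000"), "payload": "SELECT * FROM tickets WHERE ticket_no = 400001"},
    {"time": t("11:42:36.733601"), "payload": "SELECT * FROM tickets WHERE ticket_no = 511129"},
]

matches = [r for r in db_requests
           if start <= r["time"] <= end and token in r["payload"]]
for r in matches:
    print(r["time"].time(), r["payload"])
```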
51
Characterization
[Diagram, time increasing: C (client), S (App Server), S (Database). Request/response pairs labelled Type A, Type B, Type V, Type 1 and Type 2]
Resources
52
Book: RPR: A Problem Diagnosis Method for IT Professionals. Gift today or from Amazon or Lulu
White Paper: Network Trace Analysis Strategies, from www.advance7.com
Video: RPR NA03: Analysing SQL Server performance using Wireshark and Excel, from YouTube
More Resources
53
Forum: RPR Practitioners, from www.linkedin.com
Video: RPR NA02: Analysing SMB2 and fileserver performance, from YouTube
Video: RPR NA01: Analysing fileserver performance using Wireshark and Excel, from YouTube
54
Questions?
55
Cloud
SaaS
PaaS BPO
Operation costs Revenue
IT cap-ex
Recurring Gray Problems
56
The issue will grow
You have the skills & techniques to make the difference
Only evidence-based methods will help
It will slow development of the industry
57
Lead the way
Thank you
58
Paul Offord Chief Technical Officer
Advance7
e: [email protected] p: + 44 1371 876 805 t: @paulofforda7
For book or e-book contact: Rachel D’Cruze e: [email protected] p: + 44 1371 876 805