1
Lessons Learned from Real Life
Jeremy Elson
National Institutes of Health
November 11, 1998
2
quick bio
• Hi, I’m Jeremy. Nice to meet you.
• 1996: BS Johns Hopkins, Comp Sci
• Sep 96 - Sep 98: Worked at NIH full-time
  – Led the software development effort on a small team developing an ATM-based telemedicine system called the Radiology Consultation WorkStation (RCWS)
• Sep 98: Decided to return to school full-time
• Nov 98: Gave a talk to dgroup about interesting lessons learned during development of the RCWS
3
my talk
• Very quick description of the RCWS
– In future dgroups, I can give a talk about the RCWS, or about ATM, if there is interest
• Some pitfalls and fallacies in networking I discovered while developing the RCWS
• Techniques for network problem solving
4
Radiology Consultation Workstation Network
5
RCWS Block Diagram
6
an unintended test
• Initial configuration: 2 Sparc 20s w/ 50 MHz CPUs; Solaris 2.5.1; 155 Mbps Efficient Networks ATM NICs; LattisCell 10114-SM switch
  – ttcp memory-to-memory: 60 Mbps
• Upgrade to 75 MHz CPUs, otherwise identical
  – ttcp now reports 90 Mbps!
• 50% upgrade in CPU speed led to exactly 50% increase in network throughput
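A one-line sanity check, using only the numbers on this slide, makes the linear scaling explicit:

```python
# Sanity check with the slide's numbers: if throughput is CPU-bound,
# it should scale linearly with clock speed.
throughput_50mhz = 60          # Mbps measured by ttcp with 50 MHz CPUs
cpu_ratio = 75 / 50            # the 50% clock upgrade
predicted = throughput_50mhz * cpu_ratio
print(predicted)               # 90.0 -- exactly what ttcp reported
```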
7
pitfall: infinite CPU
• In many systems, the network is the bottleneck; we have “infinite” CPU in comparison. We try to use CPU to save network bandwidth:
  – Compression
  – Multicast
  – Caching (sort of)
  – Micronet design
• Pitfall: assuming this is always true. In our ATM app, compression might slow it down!
8
a surprising outcome
• There are various ways of doing IP over ATM:
  – “Classical IP”: MTU ~9K
  – “LANE”: MTU 1500 bytes (for Ethernet bridging)
• Which would you expect to have better bulk TCP performance, and by how much?
• Classical IP did better -- by a factor of ~5! I didn’t believe it at first.
• It turned out that both were sending roughly the same packets/sec; Classical IP simply carried more bytes/packet
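A sketch of the arithmetic under an assumed fixed packet rate (the 2000 packets/sec figure is purely illustrative, not from the talk). Header overhead, which this ignores, is one reason the measured gap was closer to 5x than 6x:

```python
# Illustrative model: if the endpoints can only process a fixed number
# of packets per second, bulk throughput scales with bytes per packet.
pps = 2000                # assumed sustainable packet rate (illustrative)

mtu_clip = 9180           # Classical IP over ATM default MTU
mtu_lane = 1500           # LANE with Ethernet-sized frames

clip_mbps = pps * mtu_clip * 8 / 1e6   # ~147 Mbps at this packet rate
lane_mbps = pps * mtu_lane * 8 / 1e6   # 24 Mbps at the same packet rate
print(round(clip_mbps / lane_mbps, 1)) # ~6.1x, in the ballpark of the ~5x observed
```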
9
pitfall: networks run out of bandwidth first
• The number of bytes per second is only one metric; consider packets per second also. This is sometimes the wall you hit first.
• Fixed packet processing cost appears to far outweigh the incremental cost to transmit more bytes as part of the same packet
• This fits nicely with the previous observation: CPU is only fast enough for n packets/sec
• This is old news to Cisco, backbone ISPs, etc.
10
pathological networks
• We built an on-campus ATM network and bought access to a MAN (ATDnet), but the only WAN available was the ACTS satellite
• Our network was very long and very fat: OC3 (155 Mb/sec) over satellite (500ms RTT).
• We were expecting standard LFN-related problems; the solutions are fairly well-known (window scaling, PAWS, SACK, etc.)
• What surprised me was something else: interactive performance!
11
[Diagram: a request/reply exchange between Earth and the ACTS satellite]
• Request: to perform actions such as screen updates, requests must go through a server. Therefore the user response time will be ~RTT.
• Reply: 1/8 of a second from Earth to a geostationary satellite; RTT ~1/2 second (plus ground switching delay & queuing delay)
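The “very long and very fat” claim can be put in numbers with a bandwidth-delay product sketch (line rate and RTT taken from the slides above):

```python
# Bandwidth-delay product of the satellite link: how much data must be
# in flight to keep the pipe full, vs. TCP's unscaled 64 KB window.
bandwidth_bps = 155e6     # OC3 line rate
rtt_s = 0.5               # geostationary round-trip time
bdp_bytes = bandwidth_bps * rtt_s / 8
print(int(bdp_bytes))              # 9687500 -- nearly 10 MB must be in flight
print(round(bdp_bytes / 65535))    # ~148x what TCP without window scaling allows
```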
12
the best laid plans
• Requests are small messages (<100 bytes) transmitted using TCP over ATM
• Everything seemed to work fine on-campus
• Over the satellite, we were expecting to see delays of 1/2 sec in command execution
• Instead we saw >1 second delays: much more than we were expecting & hard to use. Uh oh.
• My job (with 2 hours of satellite time ticking away…): figure out why this was happening
13
the answer: tcpdump
• ‘tcpdump’ is a packet-sniffer written by Steve McCanne, Craig Leres, and Van Jacobson at LBL
• Monitors a LAN in realtime; prints info about each packet (source/dest, sequence numbers, flags, acknowledgements, options)
• Runs on most UNIX variants
• The most spectacularly fantastically wonderful network debugging tool on planet Earth; my knee-jerk reaction whenever there is any problem is to fire this up first
14
tick, tock, tick, tock...
[Timing diagram: client (left) and server (right), time flowing downward; the RTT bracket marks one round trip. At the application layer, messages are 70 bytes long.]
1. 28 bytes -- client sends data to server
2. ACK 28 -- server’s TCP stack ACKs that data
3. 42 bytes -- client sends more data as soon as the ACK is received
4. ACK 42 -- server ACKs the new 42 bytes
5. 28 bytes -- server application has now received a complete 70-byte message; sends reply
6. ACK 28 -- client TCP stack ACKs
7. 42 bytes -- server sends new data after it receives the client’s ACK
8. ACK 42 -- client TCP stack ACKs; USER SEES RESPONSE HERE
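The delay this trace implies can be tallied directly, using the ~1/2 second satellite RTT from earlier:

```python
# Why users saw >1 second instead of ~0.5 s: the split write plus the
# Nagle algorithm paces each half of each 70-byte message on the previous
# ACK, stretching one request/reply exchange into three round trips.
rtt_s = 0.5                       # geostationary satellite RTT
expected_s = 1 * rtt_s            # one round trip: request out, reply back
observed_s = 3 * rtt_s            # the ACK-paced exchange in the trace
print(expected_s, observed_s)     # 0.5 1.5
```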
15
the nagle finagle
• Each application-layer message is split into 2 segments. Why?
  – Because the app was calling write() twice
• For some reason, the second half isn’t sent until the first half is ACKed! Why?
  – The Nagle algorithm, which says “don’t send a tinygram if there is an outstanding tinygram”
• Users had to wait 3 RTTs instead of 1
• Short-term fix: turn off the Nagle algorithm (setsockopt TCP_NODELAY in Solaris)
• Long-term fix: rewrite the message-passing library to use writev() instead of write()
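Both fixes can be sketched at the sockets level. This is a minimal, self-contained Python illustration over loopback, not the RCWS’s actual message-passing library; the 28/42-byte buffers are made up to mirror the traced exchange, and sendmsg() plays the role writev() plays for file descriptors:

```python
import socket

# Hypothetical 70-byte application message, split the way the app's
# two write() calls split it in the trace.
header, payload = b"H" * 28, b"P" * 42

# Self-contained loopback connection so the sketch runs anywhere.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
client = socket.create_connection(listener.getsockname())
server, _ = listener.accept()

# Short-term fix: disable the Nagle algorithm, so a tinygram may be
# sent even while an earlier tinygram is still unacknowledged.
client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Long-term fix: hand both buffers to the kernel in one gather call
# (sendmsg is the sockets analogue of writev), so the message can leave
# as a single segment and never trips over Nagle at all.
client.sendmsg([header, payload])

message = b""
while len(message) < 70:                  # reassemble the full message
    message += server.recv(70 - len(message))
print(len(message))                       # 70 -- one complete message
for s in (client, server, listener):
    s.close()
```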
16
pitfall: don’t care how TCP and app get along
• It’s easy to think of TCP as a generic way of getting things from Here to There; sometimes, if we look deeper, we find problems
• Good example: HTTP interactions with TCP study by Touch, Heidemann & Obraczka
• Of course, different TCP implementations react differently. (Maybe some TCPs wait before launching and would have hidden this.)
17
the big mystery
• Remember: 90 Mbps Sparc 20 to Sparc 20
• Scenario: two machines doing FTP (to /dev/null)
– Machine A: Sun Ultra-1 running Solaris 2.5.1, 155 Mbps fiber ATM NIC
– Machine B: Fast Pentium-II running Windows NT 4.0, 25 Mbps UTP ATM NIC
– Using LANE, 1500 byte MTU
• Transmitting from A to B: 23 Mbps
• Transmitting from B to A: 8 Mbps!! Why?
18
tcpdump to the rescue
[Timing diagram: sender B (left) and receiver A (right), time flowing downward]
1. win 64K -- window advertisement from receiver
2. data 1460 -- MSS-sized segment from sender
3. data 1460 -- another MSS
4. data 1176 -- smaller segment from sender
5. ... more segments (not shown)
6. data 1460 -- another segment from sender
7. (long quiet time -- no activity)
8. ACK / win 64K -- receiver finally ACKs; cycle starts again
19
observations about our mystery
• Sending A to B (the 23 Mbps case), the machine generated only MSS-sized segments; B to A did not. (Could account for some slowdown.)
• The ACKs from A all came at very regular intervals (~50ms)
• Data came quickly (say, all in about 20ms) followed by long quiet time (say, 30ms)
• What’s going on????
20
deferred ACKs
• When we receive data, we wait a certain interval before sending an ACK
• This attempts to reduce traffic generated by interactive (keystroke) activity by hoping a new window and/or data will be ready, too
• We don’t want to do this with bulk data (defined as 3 MSS’s in a row)
21
keystrokes: the worst case
[Timing diagram: user (left) and server (right); assume both sides are initially advertising win = 100]
1. ‘a’ -- user types a character
2. ACK ‘a’, win 99 -- server TCP stack sends ACK
3. win 100 -- telnet daemon wakes up; reads char
4. ‘a’ (echoed) -- telnet daemon sends echoed char
5. ACK ‘a’, win 99 -- client TCP stack sends ACK
6. win 100 -- telnet client wakes up; reads char
22
keystrokes: what we want
[Timing diagram: user (left) and server (right); assume both sides are initially advertising win = 100]
1. ‘a’ -- user types a character
2. (Deferred ACK interval: don’t send an ACK right away; wait, and hope that we have a new window and an echoed char ready)
3. ACK ‘a’, ‘a’, win 100 -- telnet daemon sends ACK of received char, echoed char, and open window in one packet
4. ACK ‘a’, win 100 -- telnet client wakes up; reads char; ACKs
23
another look at the trace
[Timing diagram: sender B (left) and receiver A (right), time flowing downward]
1. win 64K -- deferred ACK interval expires; receiver re-opens the window
2. data 1460 -- MSS-sized segment from sender
3. data 1176 -- smaller segment from sender, which fools the receiver into thinking that we are not doing bulk data transfer
4. ... more segments (not shown)
5. data 1460 -- another segment from sender; WINDOW IS NOW CLOSED
6. (long quiet time -- no activity)
7. win 64K -- timer expires; receiver sends ACK; cycle starts again
24
the mystery unmasked
• Only observable because all of the following were true (take out one, and the problem vanishes):
  – Receiver using deferred ACKs
  – Sender not sending all MSS-sized data
  – Bandwidth high enough and window small enough that the window can be filled before the deferred ACK interval expires (rare at 10 Mbps)
• When I turned off deferred ACKs on the receiver, bandwidth jumped to 23 Mbps. (Under Solaris this can be done with ndd)
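A back-of-the-envelope check that the deferred-ACK explanation fits the numbers (window size and the ~50 ms ACK spacing come from the trace):

```python
# With deferred ACKs, the sender fills the 64 KB window and then stalls
# until the ~50 ms ACK timer fires, capping throughput at roughly
# window / interval.
window_bytes = 64 * 1024
ack_interval_s = 0.050            # the regular ~50 ms ACK spacing observed
ceiling_mbps = window_bytes * 8 / ack_interval_s / 1e6
print(round(ceiling_mbps, 1))     # 10.5 -- the right order for the 8 Mbps seen
```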
25
tcpdump: our best friend
• Virtually impossible to figure out problems like the previous one by just puzzling it out
• Reading about how protocols work is a good starting point; implementing them gives you even more. But…
• Nothing gave me more intimate knowledge of TCP than seeing it come alive. Not looking at high-level behavior, but actually watching packets fly across the wire
• Different stacks have different personalities
• TCP/IP Illustrated, Vol. 1 is a great way to learn how
26
other uses of tcpdump
• Keeping my ISDN router from dialing
• Widespread teardrop attack on NIH (I patched tcpdump to make this easier)
• Netscape SYN bug
• Samba hitting DNS
• Inoculan directed broadcasts
• Diagnosing dead and/or segmented networks
• Even rough performance measurement
• The network people thought I was a magician!
27
summary: lessons learned
I. Thou shalt not assume that thy CPU is infinite in power, for thy network may indeed be more plentiful.
II. Thou shalt take mind of the number of packets thou sendeth to thy network; for, yea, a multitude thereof may wreak havoc thereupon.
28
summary: lessons learned
III. Thou shalt read the Word of Stevens in TCP/IP Illustrated, and become learned in the ways of tcpdump, so that thy days of network debugging shall be pleasant and brief.
IV. Thou shalt watch carefully the packets that thy applications create, so that TCP may be thy servant and not thy taskmaster.
29
that’s all, folks!