Recent Linux TCP Updates, and How to Tune Your 100G Host

Nate Hanford, Brian Tierney, ESnet
[email protected], http://fasterdata.es.net
Internet2 Technology Exchange, Sept 27, 2016, Miami, FL
Observation #1
• TCP is more stable in CentOS 7 vs. CentOS 6
  – Throughput ramps up much quicker
• More aggressive slow start
  – Less variability over the life of the flow
10/24/16
Berkeley to Amsterdam
New York to Texas
Observation #2
• Turning on FQ helps throughput even more
  – TCP is even more stable
  – Works better with small-buffer devices
• Pacing to match the bottleneck link works better yet
New TCP Option: Fair Queuing Scheduler (FQ)
• Available in Linux kernel 3.11 (released late 2013) or higher
  – Available in Fedora 20, Debian 8, and Ubuntu 13.10
  – Backported to the 3.10.0-327 kernel in CentOS/RHEL 7.2 (Dec 2015)
• To enable fair queuing (which is off by default), do:
  – tc qdisc add dev $ETH root fq
• To both pace and shape the bandwidth:
  – tc qdisc add dev $ETH root fq maxrate Ngbit
• Can reliably pace up to a maxrate of 32 Gbps
http://fasterdata.es.net/host-tuning/linux/fair-queuing-scheduler/
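The two tc commands above can be combined into a quick sketch for enabling, verifying, and removing FQ (a sketch, assuming an interface named eth0 and root privileges; adjust the interface and maxrate for your host):

```shell
ETH=eth0   # assumption: your NIC's interface name

# Enable fair queuing with a 20 Gbit/s pacing cap
tc qdisc add dev $ETH root fq maxrate 20gbit

# Verify that the fq qdisc is now installed
tc qdisc show dev $ETH

# Remove it again to return to the default qdisc
tc qdisc del dev $ETH root
```

Note that the qdisc does not persist across reboots; re-apply it from a boot script if you want it permanently.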
FQ Background
• Lots of discussion around 'bufferbloat' starting in 2011
  – https://www.bufferbloat.net/
• Google wanted to be able to get higher utilization on their network
  – Paper: "B4: Experience with a Globally-Deployed Software Defined WAN", SIGCOMM 2013
• Google hired some very smart TCP people
  – Van Jacobson, Matt Mathis, Eric Dumazet, and others
• Result: lots of improvements to the TCP stack in 2013-14, including most notably the 'fair queuing' pacer
New York to Texas: With Pacing
100G Host Tuning
Test Environment
• Hosts:
  – Supermicro X10DRi DTNs
  – Intel Xeon E5-2643 v3, 2 sockets, 6 cores each
  – CentOS 7.2 running kernel 3.10.0-327.el7.x86_64
  – Mellanox ConnectX-4 EN/VPI 100G NICs with ports in EN mode
  – Mellanox OFED driver 3.3-1.0.4 (03 Jul 2016), firmware 12.16.1020
• Topology:
  – Both systems connected to a Dell Z9100 100 Gbps ON Top-of-Rack switch
  – Uplink to nersc-tb1, an ALU SR7750 router running a 100G loop to StarLight and back
    • 92 ms RTT
  – Using tagged 802.1q to switch between loop and local VLANs
  – LAN had 54 usec RTT
• Configuration:
  – MTU was 9000B
  – irqbalance, tuned, and numad were off
  – Core affinity was set to cores 7 and 8 (on the NUMA node closest to the NIC)
  – All tests are IPv4 unless otherwise stated
Testbed Topology

[Diagram: Two DTNs (nersc-tbn-4 and nersc-tbn-5) in Oakland, CA, each with a Mellanox ConnectX-4 (100G) and a Mellanox ConnectX-3 (40G) NIC, connect to a Dell Z9100 switch. The switch uplinks at 100G to the nersc-tb1 Alcatel 7750 router, which runs a 100G loop to the star-tb1 Alcatel 7750 router at StarLight (Chicago) and back: RTT = 92 ms.]
Our Current Best Single-Flow Results
• TCP:
  – LAN: 79 Gbps
  – WAN (RTT = 92 ms): 36.5 Gbps; 49 Gbps using the 'sendfile' API ('zero-copy')
  – Test commands:
    • LAN: nuttcp -i1 -xc7/7 -w1m -T30 hostname
    • WAN: nuttcp -i1 -xc7/7 -w900M -T30 hostname
• UDP:
  – LAN and WAN: 33 Gbps
  – Test command: nuttcp -l8972 -T30 -u -w4m -Ru -i1 -xc7/7 hostname
• Others have reported up to 85 Gbps LAN performance with similar hardware
CPU Governor
The Linux CPU governor (P-States) setting makes a big difference:
  RHEL:   cpupower frequency-set -g performance
  Debian: cpufreq-set -r -g performance
57 Gbps with default settings (powersave) vs. 79 Gbps in 'performance' mode on the LAN.
To watch the CPU governor in action:
  watch -n 1 grep MHz /proc/cpuinfo
Sample output:
  cpu MHz : 1281.109
  cpu MHz : 1199.960
  cpu MHz : 1299.968
  cpu MHz : 1199.960
  cpu MHz : 1291.601
  cpu MHz : 3700.000
  cpu MHz : 2295.796
  cpu MHz : 1381.250
  cpu MHz : 1778.492
CPU Frequency
• Driver: kernel module or code that makes CPU frequency calls to hardware
• Governor: driver setting that determines how the frequency will be set
• Performance governor: bias towards higher frequencies
• Userspace governor: allows the user to specify exact core and package frequencies
• Only the Intel P-States driver can make use of Turbo Boost
• Check current settings: cpupower frequency-info
       P-States      ACPI-CPUfreq   ACPI-CPUfreq
       Performance   Performance    Userspace
LAN    79G           72G            67G
WAN    36G           36G            27G
TCP Buffers
# add to /etc/sysctl.conf
# allow testing with 2GB buffers
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
# allow auto-tuning up to 2GB buffers
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647

2GB is the max allowable under Linux.
WAN BDP = 12.5 GB/s * 92 ms = 1150 MB (autotuning set this to 1136 MB)
LAN BDP = 12.5 GB/s * 54 us = 675 KB (autotuning set this to 2-9 MB)
Manual buffer tuning made a big difference on the LAN:
  – 50-60 Gbps vs. 79 Gbps
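The BDP arithmetic above can be checked with a small helper (a sketch; `bdp_mb` is a hypothetical function name, taking the line rate in Gbit/s and the RTT in ms):

```shell
# BDP (bytes) = line rate (bytes/sec) * RTT (sec); printed here in MB
bdp_mb() {
  # $1 = rate in Gbit/s, $2 = RTT in ms
  awk -v gbps="$1" -v rtt_ms="$2" \
      'BEGIN { printf "%.2f\n", (gbps * 1e9 / 8) * (rtt_ms / 1e3) / 1e6 }'
}

bdp_mb 100 92      # WAN: prints 1150.00 (MB)
bdp_mb 100 0.054   # LAN: prints 0.68 (MB, i.e. ~675 KB)
```

Since a TCP flow can only buffer up to the smaller of tcp_rmem/tcp_wmem max and the core limits, both must be at least the BDP to fill the path.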
Zero-copy (sendfile) Results
• iperf3 -Z option
• No significant difference on the LAN
• Significant improvement on the WAN
  – 36.5 Gbps vs. 49 Gbps
IPv4 vs. IPv6 Results
• IPv6 is slightly faster on the WAN, slightly slower on the LAN
• LAN:
  – IPv4: 79 Gbps
  – IPv6: 77.2 Gbps
• WAN:
  – IPv4: 36.5 Gbps
  – IPv6: 37.3 Gbps
Don't Forget about NUMA Issues
• Up to 2x performance difference if you use the wrong core.
• If you have a 2-CPU-socket NUMA host, be sure to:
  – Turn off irqbalance
  – Figure out what socket your NIC is connected to:
      cat /sys/class/net/ethN/device/numa_node
  – Run the Mellanox IRQ script:
      /usr/sbin/set_irq_affinity_bynode.sh 1 ethN
  – Bind your program to the same CPU socket as the NIC:
      numactl -N 1 program_name
• Which cores belong to a NUMA socket?
  – cat /sys/devices/system/node/node0/cpulist
  – (note: on some Dell servers, that might be: 0,2,4,6,...)
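Putting the steps above together for a NIC attached to NUMA node 1 (a sketch; the interface name, node number, and test command are assumptions to verify on your own host):

```shell
ETH=eth0   # assumption: your NIC's interface name

# 1. Stop irqbalance so it doesn't undo the affinity settings
systemctl stop irqbalance

# 2. Find the NUMA node the NIC is attached to (prints e.g. 0 or 1)
cat /sys/class/net/$ETH/device/numa_node

# 3. Pin the NIC's IRQs to that node (Mellanox OFED helper script)
/usr/sbin/set_irq_affinity_bynode.sh 1 $ETH

# 4. Run your test tool bound to the same node
numactl -N 1 nuttcp -i1 -T30 remotehost
```

If step 2 prints a different node, use that number in steps 3 and 4 instead.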
Settings to Leave Alone in CentOS 7
We recommend leaving these at their default settings; none of them seem to impact performance much:
• Interrupt coalescence
• Ring buffer size
• LRO (off) and GRO (on)
• net.core.netdev_max_backlog
• txqueuelen
• tcp_timestamps
Tool Selection
• nuttcp and iperf3 have different strengths.
• nuttcp is about 10% faster on LAN tests.
• iperf3's JSON output option is great for producing plots.
• Use both! Both are part of the 'perfsonar-tools' package.
  – Installation instructions: http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/
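For example, iperf3's JSON output can be fed through jq to extract numbers for plotting (a sketch; it assumes an iperf3 server running on `remotehost`, and the field paths reflect iperf3's JSON schema as of version 3.x, so verify them against your build):

```shell
# Run a 30-second test and capture the JSON report
iperf3 -c remotehost -t 30 -J > result.json

# Mean receive-side throughput in Gbps
jq '.end.sum_received.bits_per_second / 1e9' result.json

# Per-interval throughput, one value per line (handy for gnuplot)
jq '.intervals[].sum.bits_per_second / 1e9' result.json
```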
OS Comparisons
• CentOS 7 (3.10 kernel) vs. Ubuntu 14.04 (4.2 kernel) vs. Ubuntu 16.04 (4.4 kernel)
  – Note: 4.2 kernels are about 5-10% slower (sender and receiver)
• Sample results:
  – CentOS 7 to CentOS 7: 79 Gbps
  – CentOS 7 to Ubuntu 14.04 (4.2.0 kernel): 69 Gbps
  – Ubuntu 14.04 (4.2) to CentOS 7: 71 Gbps
  – CentOS 7 to Ubuntu 16.04 (4.4 kernel): 73 Gbps
  – Ubuntu 16.04 (4.4 kernel) to CentOS 7: 75 Gbps
  – CentOS 7 to Debian 8.4 with 4.4.6 kernel: 73.6 Gbps
  – Debian 8.4 with 4.4.6 kernel to CentOS 7: 76 Gbps
BIOS Settings
• DCA/IOAT/DDIO: ON
  – Allows the NIC to directly address the cache in DMA transfers
• PCIe Max Read Request: we turned it up to 4096, but our results suggest it neither hurts nor helps
• Turbo Boost: ON
• Hyperthreading: OFF
  – Added excessive variability in LAN performance (51G to 77G)
• Node/memory interleaving: ??
PCI Bus Commands
Make sure you're installing the NIC in the right slot. Useful commands include:

Find your PCI slot:
  lspci | grep Ethernet
  81:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]

Confirm that this slot is PCIe Gen3 x16:
  lspci -s 81:00.0 -vvv | grep PCIeGen
  [V0] Vendor specific: PCIeGen3 x16

Confirm that PCI MaxReadReq is 4096B:
  lspci -s 81:00.0 -vvv | grep MaxReadReq
  MaxPayload 256 bytes, MaxReadReq 4096 bytes

If not, you can increase it using 'setpci'.
• For more details, see: https://community.mellanox.com/docs/DOC-2496
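A hedged sketch of that setpci recipe, following the Mellanox note linked above (it assumes the PCIe Device Control register sits at offset 0x68, which is common for these NICs but not guaranteed on every device, and the change does not survive a reboot):

```shell
SLOT=81:00.0   # assumption: your NIC's address from 'lspci | grep Ethernet'

# Read the 16-bit Device Control register; bits 14:12 encode MaxReadReq
setpci -s $SLOT 68.w

# Rewrite the value with the leading hex digit replaced by 5 (5 = 4096 bytes).
# For example, if the read above printed 2936 (hypothetical), write:
setpci -s $SLOT 68.w=5936

# Confirm the change took effect
lspci -s $SLOT -vvv | grep MaxReadReq
```

Prefer the BIOS setting where available, since it persists and applies the value at boot.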
Benchmarking vs. Production Host Settings
There are some settings that will give you more consistent results for benchmarking, but that you may not want to run on a production DTN.

Benchmarking:
• Use a specific core for IRQs:
    /usr/sbin/set_irq_affinity_cpulist.sh 8 ethN
• Use a fixed clock speed (set to the max for your processor):
    /bin/cpupower -c all frequency-set -f 3.4GHz

Production DTN:
    /usr/sbin/set_irq_affinity_bynode.sh 1 ethN
    /bin/cpupower frequency-set -g performance
FQ on 100G Hosts
100G Host, Parallel Streams: No Pacing vs. 20G Pacing
We also see consistent loss on the LAN with 4 streams and no pacing. Packet loss due to small buffers in the Dell Z9100 switch?
100G Host to 10G Host
Fast Host to Slow Host
Throttled the receive host using the 'cpupower' command:
  /bin/cpupower -c all frequency-set -f 1.2GHz
Summary of Our 100G Results
• New enhancements to the Linux kernel make tuning easier in general.
• A few of the standard 10G tuning knobs no longer apply.
• TCP buffer autotuning does not work well on the 100G LAN.
• Use the 'performance' CPU governor.
• Use FQ pacing to match the receive host speed if possible.
• Important to be using the latest driver from Mellanox
  – version 3.3-1.0.4 (03 Jul 2016), firmware version 12.16.1020
What's Next in the TCP World?
• TCP BBR (Bottleneck Bandwidth and RTT) from Google
  – https://patchwork.ozlabs.org/patch/671069/
  – Google Group: https://groups.google.com/forum/#!topic/bbr-dev
• A detailed description of BBR will be published in ACM Queue, Vol. 14 No. 5, September-October 2016:
  – "BBR: Congestion-Based Congestion Control"
• Google reports 2-4 orders of magnitude performance improvement on a path with 1% loss and 100 ms RTT.
  – Sample result: cubic: 3.3 Mbps, BBR: 9150 Mbps!!
  – Early testing on ESnet is less conclusive, but BBR seems to help on some paths
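For kernels that ship BBR (4.9 and later, once the patch above landed), switching to it is a module load plus a sysctl change; a sketch, assuming a kernel built with CONFIG_TCP_CONG_BBR and an interface named eth0 (early BBR kernels rely on the fq qdisc for pacing):

```shell
# BBR in 4.9-era kernels needs fq for pacing
tc qdisc add dev eth0 root fq

# Load the module and make BBR the default congestion control
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Confirm which algorithms are available and which is active
sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_congestion_control
```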
Initial BBR TCP Results (bwctl, 3 streams, 40 sec test)

Remote Host                 Throughput               Retransmits
perfsonar.cc.trincoll.edu   htcp: 4.52, bbr: 410     htcp: 266, bbr: 179864
perfsonar.nssl.noaa.gov     htcp: 183, bbr: 803      htcp: 1070, bbr: 240340
kstar-ps.nfri.re.kr         htcp: 4301, bbr: 4430    htcp: 1641, bbr: 98329
ps1.jpl.net                 htcp: 940, bbr: 935      htcp: 1247, bbr: 399110
uhmanoa-tp.ps.uhnet.net     htcp: 5051, bbr: 3095    htcp: 5364, bbr: 412348
More Information
• https://fasterdata.es.net/host-tuning/linux/fair-queuing-scheduler/
• https://fasterdata.es.net/host-tuning/40g-tuning/
• Talk on switch buffer size experiments:
  – http://meetings.internet2.edu/2015-technology-exchange/detail/10003941/
• Mellanox Tuning Guide:
  – https://community.mellanox.com/docs/DOC-1523
• Email: [email protected]
Extra Slides
mlnx_tune command
• See: https://community.mellanox.com/docs/DOC-1523
Coalescing Parameters
• Varies by manufacturer
• usecs: wait this many microseconds after the 1st packet is received/transmitted
• frames: interrupt after this many frames are received or transmitted
• tx-usecs and tx-frames aren't as important as rx-usecs
• Due to the higher line rate, lower is better, until interrupts get in the way (at 100G, we are sending almost 14 frames/usec)
• Default settings seem best for most cases
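If you do want to inspect or experiment with these (the bullets above suggest the defaults are usually fine), ethtool is the standard tool; eth0 is an assumed interface name, and not every driver supports every knob:

```shell
# Show the NIC's current coalescing settings
ethtool -c eth0

# Example: raise rx-usecs so the NIC waits longer before interrupting
ethtool -C eth0 rx-usecs 16

# Where supported, adaptive-rx lets the driver tune this automatically
ethtool -C eth0 adaptive-rx on
```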
A small amount of packet loss makes a huge difference in TCP performance

[Plot: TCP throughput vs. distance (Local/LAN, Metro Area, Regional, Continental, International), comparing Measured (TCP Reno), Measured (HTCP), Theoretical (TCP Reno), and Measured (no loss).]

With loss, high performance beyond metro distances is essentially impossible.
TCP's Congestion Control

[Plot: 50 ms simulated RTT, congestion with 2 Gbps UDP traffic, HTCP / Linux 2.6.32]

Slide from Michael Smitasin, LBLnet
© 2015 Internet2
FQ packets are much more evenly spaced.
tcptrace/xplot output: FQ on the left, standard TCP on the right
Fair Queuing and Small Switch Buffers

[Plot: TCP throughput on a small-buffer switch (congestion with 2 Gbps UDP background traffic)]

Requires CentOS 7.2 or higher.
Enable fair queuing:
  tc qdisc add dev ethN root fq
The pacing side effect of fair queuing yields a ~1.25 Gbps increase in throughput at 10 Gbps on our hosts.
TSO differences are still negligible on our hosts with the Intel X520.
Slide from Michael Smitasin, LBL
ESnet's 100G SDN Testbed Topology

[Diagram: The ESnet 100G SDN testbed. A dedicated 100G path connects SDN POPs and 100G POPs in Oakland and Berkeley, California; Denver, Colorado (denv-cr5); Chicago, Illinois (Argonne); New York, New York; Washington, D.C.; Atlanta, Georgia; Amsterdam, Netherlands; and CERN, Geneva, Switzerland. Sites attach OpenFlow 1.0 SDN switches and exoGENI racks over links ranging from 1 x 10G to 2 x 40G; one SDN POP is reserved for SDX.]
Testbed Access
• Proposal process to gain access is described at:
  – http://www.es.net/RandD/100g-testbed/proposal-process/
• Testbed is available to anyone:
  – DOE researchers
  – Other government agencies
  – Industry
• Must submit a short proposal to ESnet (2 pages)
• Review criteria:
  – Project "readiness"
  – Could the experiment easily be done elsewhere?