TRANSCRIPT
HPCA-10
Architectural Characterization of TCP/IP Processing on the Intel® Pentium® M
Processor
Srihari Makineni & Ravi Iyer
Communications Technology Lab, Intel® Corp.
{srihari.makineni & ravishankar.iyer}@intel.com
Slide 2
Outline
• Motivation
• Overview of TCP/IP
• Setup and Configuration
• TCP/IP Performance Characteristics
– Throughput and CPU Utilization
– Architectural Characterization
• TCP/IP in server workloads
• Ongoing work
Slide 3
Motivation
• Why TCP/IP?
– TCP/IP is the protocol of choice for data communications
• What is the problem?
– So far, system capabilities have allowed TCP/IP to process data at Ethernet speeds
– But Ethernet speeds are jumping rapidly (1 to 10 Gbps)
– Scaling to these speeds requires efficient processing
• Why architectural characterization?
– To analyze performance characteristics and identify the processor architectural features that impact TCP/IP processing
Slide 4
TCP/IP Overview
• Transmit
[Figure: transmit data path through the stack layers (Application, Sockets Interface, TCP/IP Stack, Driver, NIC, Network Hardware), showing the user/kernel boundary, the TCB, the socket buffer, Tx descriptors, Ethernet/IP/TCP header construction, and DMA of Ethernet packets to the NIC in steps Tx 1 to Tx 3.]
Slide 5
TCP/IP Overview
• Receive
[Figure: receive data path through the same stack layers, showing DMA of the incoming Ethernet packet (ETH/IP/TCP) into a kernel buffer via a descriptor, the TCB lookup, and the signal/copy of the payload up to the application in steps Rx 1 to Rx 3.]
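To make the two paths concrete, here is a minimal POSIX-sockets sketch (illustrative only; the measurements in this talk are on the Windows stack): send() hands the application buffer to the kernel TCP/IP stack (the Tx copy into a socket buffer), and recv() copies the DMA'd payload from the kernel buffer up to the application (the Rx copy).

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Minimal POSIX-sockets sketch of the two data paths in the figures.
 * recv() performs the Rx kernel-to-user payload copy; send() performs
 * the Tx user-to-kernel copy into a socket buffer. */
int echo_once(int sock) {
    char buf[1460];                              /* one MSS-sized application buffer */
    ssize_t n = recv(sock, buf, sizeof buf, 0);  /* Rx: signal/copy payload to the app */
    if (n <= 0)
        return -1;                               /* connection closed or error */
    return send(sock, buf, (size_t)n, 0) < 0 ? -1 : 0;  /* Tx: hand payload to the stack */
}
```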
Slide 6
Setup and Configuration
• Test setup
– System Under Test (SUT)
• Intel® Pentium® M processor @ 1600 MHz, 1MB L2 cache (64B lines)
– 2 Clients
• Four-way Itanium® 2 processor @ 1GHz; 3MB L3 cache (128B lines)
[Figure: test topology, with the Pentium® M UP platform as the System Under Test connected via multiple Gigabit NICs to two 4P Itanium® 2 client platforms (Client 1 and Client 2) acting as source or sink test servers.]
– Operating System
• Microsoft Windows* 2003 Enterprise Edition
– Network
• SUT: 4 Gbps total (2 dual-port Gigabit NICs)
• Clients: 2 Gbps per client (1 dual-port Gigabit NIC)
Slide 7
Setup and Configuration
• Tools
– NTttcp
• Microsoft application to measure TCP/IP performance
– Tool to extract CPU performance counters
• Settings
– 16 connections (4 per NIC port)
– Overlapped I/O
– Large Segment Offload (LSO)
– Regular Ethernet frames (1518 bytes)
– Checksum offload to NIC
– Interrupt coalescing
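As a rough illustration of what a ttcp-style throughput test does, here is a hedged POSIX sketch of a sender loop that streams fixed-size buffers and reports Mbps. NTttcp itself uses Windows overlapped I/O across many connections; this single-socket loop only mirrors the measurement idea.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>

/* Hedged sketch of a ttcp-style sender: stream buf_size-byte writes for
 * roughly `seconds`, then report achieved throughput in Mbps. */
static double elapsed_s(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

double send_for_seconds(int sock, size_t buf_size, double seconds) {
    static char buf[65536];              /* covers the 64B to 64KB sizes tested */
    memset(buf, 0xA5, sizeof buf);
    double bytes = 0;
    struct timespec t0, now;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    now = t0;
    while (elapsed_s(t0, now) < seconds) {
        if (send(sock, buf, buf_size, 0) < 0)   /* each send() is one "buffer" */
            break;
        bytes += (double)buf_size;
        clock_gettime(CLOCK_MONOTONIC, &now);
    }
    return bytes * 8.0 / elapsed_s(t0, now) / 1e6;   /* throughput in Mbps */
}
```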
Slide 8
Throughput and CPU Utilization
• Lower Rx performance for buffer sizes above 512 bytes
• Rx and Tx (no LSO) CPU utilization is 100%
• Benefit of LSO is significant (~250% for 64KB buffers)
• Lower throughput for buffers below 1KB is due to buffer locking
TCP/IP processing @ 1Gbps & 1460 bytes requires >1 CPU
[Chart: Baseline Transmit and Receive Performance. Throughput in Mbps (0 to 4000) and CPU utilization in % (0 to 120) vs. buffer size from 64 bytes to 64KB, for TX w/ LSO, TX w/o LSO, and RX.]
Slide 9
Processing Efficiency
• 64-byte buffers: Tx (LSO) 17.13 and Rx 13.7 Hz/bit
• 64KB buffers: Tx (LSO) 0.212, Tx (no LSO) 0.53, and Rx 1.12 Hz/bit
Several cycles are needed to move a bit, especially for Rx
• Metric: Hz/bit (Hertz per bit)
[Chart: Hertz per bit (0 to 20) vs. buffer size from 64 bytes to 64KB, for TX (LSO), TX (no LSO), and RX.]
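A back-of-envelope check on this metric: Hz/bit is the processor frequency spent per bit moved, i.e. (core frequency x utilization) / throughput. The throughput value below is an assumption chosen for illustration; only the 1600 MHz frequency comes from the setup slide.

```c
#include <stdio.h>

/* Worked Hz/bit example. The 1600 MHz frequency is from the setup slide;
 * utilization and throughput here are assumed values for illustration. */
int main(void) {
    double freq_hz = 1600e6;       /* Pentium M core frequency */
    double utilization = 1.0;      /* 100% busy (typical for small buffers) */
    double throughput_bps = 93e6;  /* assumed ~93 Mbps for 64-byte Tx buffers */

    printf("Hz/bit = %.2f\n", freq_hz * utilization / throughput_bps);
    /* prints ~17.2, in line with the 17.13 reported for Tx (LSO) at 64 bytes */
    return 0;
}
```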
Slide 10
Architectural Characterization
• Rx CPI is higher than Tx for buffers above 512 bytes
• Tx (LSO) CPI is higher than Tx (no LSO)!
CPI needs to come down to achieve TCP/IP scaling
• Metric: system-level CPI
[Chart: System-Level CPI vs. buffer size, for RX CPI, TX CPI (LSO), and TX CPI (no LSO); values range roughly from 1 to 6.]
Slide 11
Architectural Characterization
• Rx pathlength increases significantly beyond 1460-byte buffer sizes
• For 64KB buffers, the TCP/IP stack has to receive and process 45 packets
• The lower CPI of Tx (no LSO) relative to Tx (LSO) is due to its higher PL
High PL shows that there is room for stack optimizations
• Metric: pathlength (PL)
[Charts: Pathlength in instructions per buffer (0 to 300,000) and in instructions per bit (0 to 10) vs. buffer size from 64 bytes to 64KB, for TX PL (LSO), TX PL (no LSO), and RX PL.]
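These two metrics multiply into the earlier efficiency number: Hz/bit = CPI x instructions per bit. A quick sanity check, using approximate values read off the Rx charts (treat them as assumptions):

```c
#include <stdio.h>

/* Hz/bit = CPI x pathlength-per-bit. The inputs below are approximate
 * values read off the Rx charts for 64KB buffers, used for illustration. */
int main(void) {
    double pl_per_buffer = 150000.0;        /* ~Rx instructions per 64KB buffer */
    double bits_per_buffer = 65536.0 * 8.0;
    double cpi = 4.0;                       /* ~Rx system-level CPI at 64KB */

    double instr_per_bit = pl_per_buffer / bits_per_buffer;   /* ~0.29 */
    printf("Hz/bit = %.2f\n", cpi * instr_per_bit);
    /* prints ~1.14, close to the 1.12 Hz/bit reported for Rx at 64KB */
    return 0;
}
```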
Slide 12
Architectural Characterization
• Rx has higher misses
– Primary reason for its higher CPI
– Many compulsory misses: source buffer, descriptors, and possibly the destination buffer
• Tx (no LSO) has slightly higher misses per bit
Rx performance does not scale with cache size (many compulsory misses)
[Charts: Last-Level Cache Performance as misses per instruction (MPI, 0 to 0.025) and misses per bit (0 to 0.010) vs. buffer size, for RX, TX (LSO), and TX (no LSO).]
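A back-of-envelope sketch of why compulsory misses cap Rx scaling, assuming each DMA'd payload line is touched exactly once: every 64-byte line the NIC writes to memory misses on first touch, which alone puts a floor under misses per bit regardless of cache size.

```c
#include <stdio.h>

/* If the CPU's first touch of every DMA'd 64-byte line misses, the
 * compulsory misses alone bound Rx misses-per-bit from below. */
int main(void) {
    double line_bytes = 64.0;                    /* Pentium M cache line size */
    double floor_mpb = 1.0 / (line_bytes * 8.0); /* one miss per 512 payload bits */
    printf("compulsory floor = %.4f misses/bit\n", floor_mpb);  /* 0.0020 */
    return 0;
}
```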
Slide 13
Architectural Characterization
• The Pentium® M processor has 32KB of L1 data cache
• As expected, L1 data cache misses are higher for Rx
• For Rx, 68% to 88% of L1 misses hit in the L2
Larger L1 data cache has limited impact on TCP/IP
[Charts: L1 Data Cache Performance as misses per instruction (0 to 0.08) and misses per bit (0 to 0.10) vs. buffer size from 64 bytes to 64KB, for RX, TX (LSO), and TX (no LSO).]
Slide 14
Architectural Characterization
• L1 Instruction Cache Performance
– 32KB instruction cache in the Pentium® M processor
– Tx (no LSO) MPI is lower because of code temporal locality
– The Rx code path generates L1 instruction capacity misses
Larger L1 instruction cache helps RX processing
[Charts: L1 Instruction Cache Performance as misses per instruction (0 to 0.012) and misses per bit (0 to 0.020) vs. buffer size from 64 bytes to 64KB, for TX (LSO), TX (no LSO), and RX.]
Slide 15
Architectural Characterization
• TLB Performance
– Size: 128 instruction and 128 data TLB entries
– iTLB misses increase faster than dTLB misses
[Charts: TLB misses per instruction (0 to 0.0035) and TLB misses per bit (0 to 0.020) vs. buffer size from 64 bytes to 64KB, for TX (LSO), TX (no LSO), and RX.]
Slide 16
Architectural Characterization
• Branch Behavior
– 19-21% of instructions are branches
– The misprediction rate is higher for Tx than Rx at buffer sizes below 512 bytes
>98% accuracy in branch prediction
[Charts: Branch Misprediction Rate as mispredicts per instruction (0 to 0.0040) and mispredicts per bit (0 to 0.015) vs. buffer size from 64 bytes to 64KB, for TX (LSO), TX (no LSO), and RX.]
Slide 17
Architectural Characterization
• CPI Contributors
– RX is more memory intensive than TX
• Frequency Scaling
– Poor frequency scaling due to memory latency overhead
[Chart: Normalized CPI breakdown (TX with LSO and RX) for buffer sizes 64 to 1460 bytes, split into BR mispredicts, TLB misses, Memory, L2, and Rest of CPI.]
Frequency scaling by TCP payload size:

TCP Payload (bytes) | TX   | RX   | TX (no LSO)
64                  | 63%  | 68%  | 62%
128                 | 64%  | 64%  | 63%
256                 | 64%  | 56%  | 63%
512                 | 56%  | 46%  | 58%
1024                | 52%  | 37%  | 54%
1460                | 46%  | 33%  | 48%
Frequency Scaling alone will not deliver 10x gain
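One way to see why, as a purely illustrative model (all constants assumed, not measured): core cycles per instruction shrink as frequency rises, but memory stall time in nanoseconds does not, so speedup saturates.

```c
#include <stdio.h>

/* Illustrative model, not measured data: per-instruction time splits into
 * a core part that scales with frequency and a fixed memory-stall part. */
int main(void) {
    double core_cpi = 1.5;       /* assumed cycles/instr that scale with frequency */
    double mem_ns = 1.0;         /* assumed memory stall ns/instr, frequency-independent */

    for (double ghz = 1.6; ghz <= 12.8; ghz *= 2.0) {
        double ns_per_instr = core_cpi / ghz + mem_ns;
        printf("%5.1f GHz -> %.3f ns/instr (%.2fx vs 1.6 GHz)\n",
               ghz, ns_per_instr, (1.5 / 1.6 + 1.0) / ns_per_instr);
    }
    /* An 8x frequency jump yields well under 2x here: memory stalls dominate. */
    return 0;
}
```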
Slide 18
TCP/IP in Server Workloads
• Web server
– TCP/IP data path overhead is ~28%
• Back-end (database server with iSCSI)
– TCP/IP data path overhead is ~35%
• Front-end (e-commerce server)
– TCP/IP data path overhead is ~29%
TCP/IP Processing is significant in commercial server workloads
[Chart: TCP/IP (data-path) packet processing in server workloads. Normalized instructions/op for Back-End, WebServer, and Front-End, split into "RemPP + App" and "Estimated Data Path".]
Slide 19
Conclusions
• Major Observations
– TCP/IP processing @ 1Gbps & 1460 bytes requires >1 CPU
– CPI needs to come down to achieve TCP/IP scaling
– High PL shows that there is room for stack optimizations
– Rx performance does not scale with cache size (compulsory misses)
– Larger L1 data cache has limited impact on TCP/IP
– Larger L1 instruction cache helps Rx processing
– >98% accuracy in branch prediction
– Frequency scaling alone will not deliver a 10x gain
– TCP/IP processing is significant in commercial server workloads
• Key Issues
– Memory stall time overhead
– Pathlength (O/S overhead, etc.)
Slide 20
Ongoing Work
• Investigating solutions to the memory latency overhead
– Copy acceleration
• Low-cost synchronous/asynchronous copy engine
– DCA
• Incoming data is pushed into the processor's cache instead of memory
– Lightweight threads to hide memory access latency (see the prefetch sketch below)
• Switch-on-event threads + small context & low switching overhead
– Smart caching
• Cache structures and policies for networking
• Partitioning
– Optimized TCP/IP stack running on dedicated processor(s) or core(s)
• Other Studies
– Connection processing, bi-directional data
– Application interference
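As a software-only cousin of these latency-hiding ideas, here is a minimal sketch (assuming the GCC/Clang __builtin_prefetch intrinsic) of prefetching ahead of a payload copy so that compulsory-miss latency overlaps with the copy rather than serializing with it. The prefetch distance is an assumption to be tuned per platform.

```c
#include <stddef.h>
#include <string.h>

/* Sketch: overlap memory latency with the Rx payload copy by prefetching
 * a few cache lines ahead. Distances are assumptions to tune per platform. */
void copy_with_prefetch(char *dst, const char *src, size_t len) {
    const size_t line = 64;         /* cache line size on the Pentium M tested */
    const size_t ahead = 4 * line;  /* assumed prefetch distance */

    for (size_t off = 0; off < len; off += line) {
        if (off + ahead < len)                        /* stay inside the buffer */
            __builtin_prefetch(src + off + ahead, 0, 0);  /* read, low locality */
        size_t chunk = (len - off < line) ? (len - off) : line;
        memcpy(dst + off, src + off, chunk);
    }
}
```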