
HPCA-10

Architectural Characterization of TCP/IP Processing on the Intel® Pentium® M Processor

Srihari Makineni & Ravi Iyer
Communications Technology Lab, Intel® Corp.
{srihari.makineni & ravishankar.iyer}@intel.com

Outline

• Motivation
• Overview of TCP/IP
• Setup and Configuration
• TCP/IP Performance Characteristics
  – Throughput and CPU Utilization
  – Architectural Characterization
• TCP/IP in server workloads
• Ongoing work

Motivation

• Why TCP/IP?
  – TCP/IP is the protocol of choice for data communications
• What is the problem?
  – So far, system capabilities have allowed TCP/IP to process data at Ethernet speeds
  – But Ethernet speeds are jumping rapidly (1 to 10 Gbps)
  – Scaling to these speeds requires much more efficient processing
• Why architectural characterization?
  – To analyze performance characteristics and identify the processor architectural features that impact TCP/IP processing

TCP/IP Overview

• Transmit

[Figure: transmit data path. The application hands a buffer across the sockets interface into the kernel; the TCP/IP stack looks up the TCB, builds the ETH/IP/TCP headers, and posts descriptors (Desc 1, Desc 2); the driver and NIC then DMA the resulting Ethernet packets (Tx 1, Tx 2, Tx 3) onto the network hardware.]
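To ground the transmit flow, here is a minimal, hedged C sketch of the application side of this path (POSIX sockets; the sink address and port are hypothetical placeholders). Everything below the send() call — TCB lookup, header construction, descriptor posting, NIC DMA — happens inside the kernel and driver as drawn above.

    /* Transmit-side sketch: send() is the only step visible to the
     * application; header build, descriptors, and DMA occur below it. */
    #include <stdlib.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        size_t bufsz = 1460;              /* one MSS-sized buffer, as in the study */
        char *buf = calloc(1, bufsz);
        int s = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in dst = { 0 };
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(5001);     /* hypothetical sink port */
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* hypothetical sink */

        if (connect(s, (struct sockaddr *)&dst, sizeof dst) == 0)
            for (int i = 0; i < 100000; i++)
                if (send(s, buf, bufsz, 0) < 0)  /* user buffer crosses into the kernel here */
                    break;

        close(s);
        free(buf);
        return 0;
    }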

TCP/IP Overview

• Receive

[Figure: receive data path. The NIC DMAs an incoming Ethernet packet (Rx 1, Rx 2, Rx 3) into a kernel buffer named by a descriptor; the TCP/IP stack processes the ETH/IP/TCP headers against the TCB, then signals the socket and copies the payload into the application buffer across the user/kernel boundary.]
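A matching hedged sketch of the receive side (again POSIX sockets, hypothetical port). By the time recv() runs, the NIC has already DMA'd the packet into a kernel buffer; recv() performs the kernel-to-user payload copy the figure labels "Copy" — a recurring theme in the cache results later in the talk.

    /* Receive-side sketch: recv() copies the DMA'd payload from the
     * kernel buffer into the application buffer. */
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        size_t bufsz = 1460;
        char *buf = malloc(bufsz);
        int lfd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in addr = { 0 };
        addr.sin_family      = AF_INET;
        addr.sin_port        = htons(5001);       /* hypothetical port */
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 1);

        int s = accept(lfd, NULL, NULL);
        ssize_t n;
        do {
            n = recv(s, buf, bufsz, 0);   /* kernel-to-user payload copy */
        } while (n > 0);

        close(s);
        close(lfd);
        free(buf);
        return 0;
    }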

Setup and Configuration

• Test setup
  – System Under Test (SUT)
    • Intel® Pentium® M processor @ 1600 MHz, 1 MB L2 cache (64-byte lines)
  – 2 Clients
    • Four-way Itanium® 2 processor @ 1 GHz, 3 MB L3 cache (128-byte lines)
  – Operating System
    • Microsoft Windows* 2003 Enterprise Edition
  – Network
    • SUT: 4 Gbps total (2 dual-port Gigabit NICs)
    • Clients: 2 Gbps per client (1 dual-port Gigabit NIC)

[Figure: test topology. The Pentium® M uniprocessor platform (System Under Test) connects over four Gigabit NIC ports to two four-way Itanium® 2 platforms (Client 1 and Client 2), which act as traffic sources or sinks.]

Setup and Configuration

• Tools
  – NTttcp: Microsoft application to measure TCP/IP performance
  – A tool to extract CPU performance counters
• Settings
  – 16 connections (4 per NIC port)
  – Overlapped I/O
  – Large Segment Offload (LSO)
  – Regular Ethernet frames (1518 bytes)
  – Checksum offload to the NIC
  – Interrupt coalescing
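NTttcp itself is a Windows tool built on overlapped I/O; as a hedged, portable analogue, the throughput it reports boils down to timing a stream of fixed-size buffer posts, roughly as in the sketch below (function name and structure are illustrative, not NTttcp's code).

    /* Conceptual throughput measurement: time `iters` posts of a
     * `bufsz`-byte buffer on a connected socket and report Mbps. */
    #include <stddef.h>
    #include <time.h>
    #include <sys/socket.h>

    double measure_mbps(int sock, const char *buf, size_t bufsz, long iters)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            if (send(sock, buf, bufsz, 0) < 0)
                return -1.0;                        /* send failure */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        return (8.0 * bufsz * iters) / secs / 1e6;  /* bits -> megabits per second */
    }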

Throughput and CPU Utilization

• Rx throughput is lower than Tx for buffer sizes above 512 bytes
• Rx and Tx (no LSO) CPU utilization is 100%
• The benefit of LSO is significant (~250% for 64 KB buffers)
• Lower throughput for buffers under 1 KB is due to buffer locking

TCP/IP processing @ 1 Gbps with 1460-byte buffers requires more than one CPU

[Figure: baseline transmit and receive performance. Throughput in Mbps (0–4000, left axis) and CPU utilization in % (0–120, right axis) versus buffer size from 64 bytes to 64 KB, for TX with LSO, TX without LSO, and RX.]

Processing Efficiency

• Metric: Hz/bit (CPU cycles consumed per bit transferred)
• 64-byte buffer: Tx (LSO) 17.13, Rx 13.7
• 64 KB buffer: Tx (LSO) 0.212, Tx (no LSO) 0.53, Rx 1.12

Several cycles are needed to move a bit, especially for Rx
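Hz/bit follows directly from the measured quantities; a small sketch (my formulation of the slide's metric):

    /* Hz/bit = cycles consumed per bit moved, from CPU utilization,
     * core frequency, and achieved throughput. */
    double hz_per_bit(double util_frac, double freq_hz, double throughput_bps)
    {
        return (util_frac * freq_hz) / throughput_bps;
    }

Reading it backwards with the slide's 64 KB Rx figure: 1.12 Hz/bit at 1 Gbps costs 1.12 GHz of compute, i.e. roughly 70% of the 1.6 GHz Pentium M just to receive data.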

[Figure: Hz/bit (0–20) versus buffer size from 64 bytes to 64 KB, for TX (LSO), TX (no LSO), and RX.]

Architectural Characterization

• Metric: system-level CPI
• Rx CPI is higher than Tx CPI for buffers above 512 bytes
• Tx (LSO) CPI is higher than Tx (no LSO)!

CPI needs to come down to achieve TCP/IP scaling
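The three quantities the next slides characterize — frequency, pathlength (PL), and CPI — combine through the standard performance identity (my restatement, not a formula from the slides):

    /* Cycles per buffer = PL x CPI, so buffers/s = f / (PL x CPI) and the
     * achievable throughput in bits/s follows directly. */
    double throughput_bps(double bufsz_bytes, double freq_hz,
                          double pathlength, double cpi)
    {
        return (8.0 * bufsz_bytes * freq_hz) / (pathlength * cpi);
    }

Scaling to 10 Gbps therefore means attacking PL (the stack) and CPI (the memory system), not just f.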

[Figure: system-level CPI (1–6) versus buffer size for RX, TX (LSO), and TX (no LSO).]

Architectural Characterization

• Metric: pathlength (PL), instructions per buffer
• Rx pathlength increases significantly beyond 1460-byte buffer sizes
• For a 64 KB buffer, the TCP/IP stack has to receive and process 45 packets
• The lower CPI for Tx (no LSO) versus Tx (LSO) is due to its higher PL

High PL shows that there is room for stack optimizations
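The per-buffer and per-bit views in the two charts below are linked by simple conversions; a sketch, including the packet count behind the 45-packet bullet:

    #include <math.h>

    /* A 64 KB application buffer is carved into MSS-sized (1460-byte)
     * TCP segments: ceil(65536 / 1460) = 45 packets, as on the slide. */
    int packets_per_buffer(double bufsz_bytes)
    {
        return (int)ceil(bufsz_bytes / 1460.0);
    }

    /* Per-bit pathlength is just per-buffer pathlength over buffer bits. */
    double pl_per_bit(double pl_per_buffer, double bufsz_bytes)
    {
        return pl_per_buffer / (8.0 * bufsz_bytes);
    }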

[Figures: (left) pathlength per buffer, 0–300,000 instructions, and (right) pathlength per bit, 0–10 instructions, versus buffer size from 64 bytes to 64 KB, for TX (LSO), TX (no LSO), and RX.]

Architectural Characterization

• Metric: last-level (L2) cache performance
• Rx has higher misses
  – The primary reason for its higher CPI
  – Many compulsory misses: the source buffer, the descriptors, and possibly the destination buffer
• Tx (no LSO) has slightly higher misses per bit

Rx performance does not scale with cache size (many compulsory misses)
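To see why these miss rates dominate CPI, a first-order stall model helps (an assumption of mine; the slides do not state a miss penalty, so the 200-cycle figure below is purely illustrative):

    /* Memory contribution to CPI ~ misses per instruction x miss penalty. */
    double memory_cpi(double mpi, double penalty_cycles)
    {
        return mpi * penalty_cycles;
    }
    /* e.g. memory_cpi(0.02, 200.0) = 4.0 -- with a hypothetical 200-cycle
     * penalty, already on the order of the Rx CPI shown two slides back. */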

[Figures: last-level cache performance. (Left) misses per instruction (MPI, 0–0.025) and (right) misses per bit (0–0.010) versus buffer size, for RX, TX (LSO), and TX (no LSO).]

Architectural Characterization

• L1 data cache performance
  – The Pentium® M processor has a 32 KB L1 data cache
  – As expected, L1 data cache misses are higher for Rx
  – For Rx, 68% to 88% of L1 misses hit in the L2

A larger L1 data cache has limited impact on TCP/IP

[Figures: L1 data cache performance. Misses per instruction (0–0.08) and misses per bit (0–0.100) versus buffer size from 64 bytes to 64 KB, for RX, TX (LSO), and TX (no LSO).]

Architectural Characterization

• L1 instruction cache performance
  – The Pentium® M processor has a 32 KB L1 instruction cache
  – Tx (no LSO) MPI is lower because of the code's temporal locality
  – The Rx code path generates L1 instruction capacity misses

A larger L1 instruction cache helps Rx processing

[Figures: L1 instruction cache performance. Misses per instruction (0–0.012) and misses per bit (0–0.020) versus buffer size, for TX (LSO), TX (no LSO), and RX.]

Architectural Characterization

• TLB performance
  – Size: 128 instruction and 128 data TLB entries
  – iTLB misses increase faster than dTLB misses

[Figures: TLB performance. TLB misses per instruction (0–0.0035) and TLB misses per bit (0–0.0200) versus buffer size from 64 bytes to 64 KB, for TX (LSO), TX (no LSO), and RX.]

Architectural Characterization

• Branch behavior
  – 19–21% of instructions are branches
  – The misprediction rate is higher for Tx than for Rx at buffer sizes below 512 bytes

Branch prediction accuracy exceeds 98%
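The >98% accuracy claim follows from combining the two numbers on this slide; a sketch of the conversion:

    /* Per-branch misprediction rate from the per-instruction rate and the
     * branch mix; prediction accuracy is its complement. */
    double branch_accuracy(double mispredicts_per_inst, double branch_frac)
    {
        return 1.0 - mispredicts_per_inst / branch_frac;
    }
    /* e.g. branch_accuracy(0.004, 0.20) = 0.98: even a rate of 0.004
     * mispredicts per instruction (the top of the chart's range) with a
     * 20% branch mix implies ~98% correct predictions. */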

[Figures: branch misprediction rate. Mispredicts per instruction (0–0.0040) and mispredicts per bit (0–0.015) versus buffer size from 64 bytes to 64 KB, for TX (LSO), TX (no LSO), and RX.]

Architectural Characterization

• CPI contributors
  – Rx is more memory intensive than Tx
• Frequency scaling
  – Poor frequency scaling due to memory latency overhead (see the table and sketch below)

[Figure: normalized CPI breakdown (TX-LSO and RX) for buffer sizes 64–1460 bytes, stacked into branch mispredicts, TLB misses, memory, L2, and the rest of CPI.]

Frequency scaling by TCP payload size:

TCP Payload (bytes)   TX     RX     TX (no LSO)
64                    63%    68%    62%
128                   64%    64%    63%
256                   64%    56%    63%
512                   56%    46%    58%
1024                  52%    37%    54%
1460                  46%    33%    48%

Frequency Scaling alone will not deliver 10x gain
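The poor scaling has a simple Amdahl-style explanation (my illustrative model, not the authors' formula): memory latency does not shrink when only the core frequency rises, so the memory-bound fraction caps the speedup.

    /* Speedup when core frequency scales by k but a fraction mem_frac of
     * execution time is memory latency that does not scale with it. */
    double freq_speedup(double k, double mem_frac)
    {
        return 1.0 / ((1.0 - mem_frac) / k + mem_frac);
    }
    /* e.g. freq_speedup(2.0, 0.5) = 1.33: doubling frequency on an
     * Rx-like, half-memory-bound workload yields only ~67% scaling
     * efficiency, in the same range as the table above. */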

TCP/IP in Server Workloads

• Web server: TCP/IP data-path overhead is ~28%
• Back-end (database server with iSCSI): TCP/IP data-path overhead is ~35%
• Front-end (e-commerce server): TCP/IP data-path overhead is ~29%

TCP/IP processing is significant in commercial server workloads

[Figure: TCP/IP (data-path) packet processing in server workloads. Normalized instructions/op (0–100%) for back-end, web server, and front-end workloads, split into "RemPP + App" and "Estimated Data Path".]

Conclusions

• Major observations
  – TCP/IP processing @ 1 Gbps with 1460-byte buffers requires more than one CPU
  – CPI needs to come down to achieve TCP/IP scaling
  – High pathlength shows that there is room for stack optimizations
  – Rx performance does not scale with cache size (compulsory misses)
  – A larger L1 data cache has limited impact on TCP/IP
  – A larger L1 instruction cache helps Rx processing
  – Branch prediction accuracy exceeds 98%
  – Frequency scaling alone will not deliver a 10x gain
  – TCP/IP processing is significant in commercial server workloads
• Key issues
  – Memory stall time overhead
  – Pathlength (OS overhead, etc.)

Ongoing Work

• Investigating solutions to the memory latency overhead
  – Copy acceleration: a low-cost synchronous/asynchronous copy engine (see the sketch after this list)
  – Direct Cache Access (DCA): incoming data is pushed into the processor's cache instead of memory
  – Lightweight threads to hide memory access latency: switch-on-event threads with small contexts and low switching overhead
  – Smart caching: cache structures and policies tailored to networking
• Partitioning
  – An optimized TCP/IP stack running on dedicated processor(s) or core(s)
• Other studies
  – Connection processing, bi-directional data
  – Application interference
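As a hedged illustration of the copy-acceleration idea (every name here is hypothetical; the talk defines no API), a copy engine would let the stack post the kernel-to-user payload copy and overlap the remaining protocol work with it:

    #include <stddef.h>
    #include <string.h>

    typedef struct { int done; } copy_req;

    /* Synchronous stand-in so the sketch compiles; a real engine would
     * queue a descriptor, return immediately, and copy in the background. */
    static copy_req copy_submit(void *dst, const void *src, size_t len)
    {
        memcpy(dst, src, len);
        return (copy_req){ .done = 1 };
    }

    static void copy_wait(copy_req *r)
    {
        while (!r->done)
            ;   /* poll the engine's completion status */
    }

    /* Rx delivery with the payload copy overlapped against TCP work. */
    static void rx_deliver(void *user_buf, const void *payload, size_t len)
    {
        copy_req r = copy_submit(user_buf, payload, len);
        /* ... ACK generation and TCB updates would proceed here ... */
        copy_wait(&r);   /* payload in place before signalling the app */
    }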


Q&A