TRANSCRIPT
1
MEMORY PERFORMANCE EVALUATION OF HIGH THROUGHPUT SERVERS
Garba Ya'u Isa
Master's Thesis Oral Defense
Computer Engineering
King Fahd University of Petroleum & Minerals
Saturday, 7th June 2003
2
Outline
- Introduction
- Problem Statement
- Analysis of Memory Accesses
- Measurement-Based Performance Evaluation
- Design and Implementation of Prototype
- Contributions
- Conclusions
- Future Work
3
Introduction
- Processor and memory performance discrepancy
- Growing network bandwidth
  - Data rates in terabits per second possible
  - Gigabit-per-second LANs already deployed
- High throughput servers in the network infrastructure
  - Streaming media servers
  - Web servers
  - Software routers
[Figure: Processor vs. memory performance trend by year (relative performance, 1 to 10,000, log scale)]
4
Dealing with Performance Gap
- Hierarchical memory architecture exploits temporal locality and spatial locality
- Constraints: characteristics of network payload data
  - Large: won't fit into cache
  - Hardly reusable: poor temporal locality
5
Problem Statement
Network servers should:
- Deliver high throughput
- Respond to requests with low latency
- Respond to a large number of clients

Our goal: identify the specific conditions at which server memory becomes a bottleneck. "Memory" here includes cache, main memory, and virtual memory.

Benefits: a better server design that alleviates memory bottlenecks, so that optimal performance can be achieved.

Constraints: a large amount of data flows through the CPU and memory, and writing code that optimizes memory utilization is a challenge.
6
Analysis of Memory Accesses: Data Flow Analysis
Four data transfer paths:
1. Memory-CPU
2. Memory-memory
3. Memory-I/O
4. Memory-network
[Figure: Server data-flow diagram: processor with on-chip and off-chip caches on the internal (CPU-memory) bus, main memory, and a bus/DMA controller bridging to the I/O bus with disk controllers and network interfaces. Annotated paths: disk transfer via DMA, network transfer via DMA, memory-memory transfer via CPU, and cache-memory transfers.]
7
Latency Model and Memory Overhead
Each transaction involves:
- CPU cycles
- Data transfers: one or more of the four identified types

Transaction latency:

  T_trans = T_cpu + n1*T_m-c + n2*T_m-m + n3*T_m-disk + n4*T_m-net

where:
  T_cpu    total CPU time needed for the transaction
  T_m-c    time to transfer an entire PDU from memory to CPU for processing
  T_m-m    latency of a memory-memory copy of a PDU
  T_m-disk latency of a memory-I/O read/write of a block of data
  T_m-net  latency of a memory-network read/write of a PDU
  n_i      number of data movement operations of each type
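As a concrete rendering of the model, here is a minimal C sketch that evaluates T_trans; the struct and the placeholder numbers are illustrative, not taken from the thesis code.

  #include <stdio.h>

  /* Direct transcription of
   * T_trans = T_cpu + n1*T_m-c + n2*T_m-m + n3*T_m-disk + n4*T_m-net.
   * All latencies are in usec. */
  struct latency_model {
      double t_cpu;                        /* CPU time per transaction */
      double t_mc, t_mm, t_mdisk, t_mnet;  /* per-transfer latencies   */
      int n1, n2, n3, n4;                  /* transfer counts          */
  };

  static double t_trans(const struct latency_model *m)
  {
      return m->t_cpu + m->n1 * m->t_mc + m->n2 * m->t_mm
                      + m->n3 * m->t_mdisk + m->n4 * m->t_mnet;
  }

  int main(void)
  {
      /* Placeholder values, only to show the shape of the computation. */
      struct latency_model m = { 1.0, 0.5, 2.0, 50.0, 4.0, 1, 2, 1, 1 };
      printf("T_trans = %.2f usec\n", t_trans(&m));
      return 0;
  }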
8
Memory-CPU Transfers
- PDU processing: checksum computation and header updating
- Typically one-way data flow (memory to CPU via cache)
- Memory stall cycles: number of memory stall cycles = (IC)(AR)(MR)(MP)
- Cache miss rate:
  - Worst case: MR = 1 (not as bad as it sounds; see the spatial-locality discussion in the backup slides)
  - Best case: MR = 0 (trivial)
9
Cache overhead in various cases:
- Worst case: MR = 1, MP = 10, so (MR)(MP) = 10
- Best case: MR = 0 (trivial)
- Average case: MR = 0.1, MP = 10, so (MR)(MP) = 1

Memory-CPU latency depends on the internal bus bandwidth:

  T_m-c = S/(32*B_i) usec

where S is the PDU size and B_i is the internal bus bandwidth in MB/s.
Memory-CPU Transfers cont.
10
Memory-memory transfer:
- Due to memory copies of PDUs between protocol layers
- Transfers pass through the caches and CPU
- Stride = 1 (contiguous)
- Transfer involves memory -> cache -> CPU -> cache -> memory data movement

Latency depends on the internal (system) bus bandwidth:

  T_m-m = 2S/B_i usec
Memory-Memory Transfers
11
Memory-network transfers:
- Pass over the I/O bus
- DMA can be used
- Again, stride = 1 (contiguous)

Latency: the limiting factor is the I/O bus bandwidth B_e:

  T_m-net = S/B_e usec
Memory-I/O and Memory-Network Transfers
12
RTP transaction latency:

  T_RTP  = T_cpu + 4S/B_i + S/(32*B_i) + S/B_e    (1)

HTTP transaction latency:

  T_HTTP = T_cpu + 4S/B_i + S/(32*B_i) + S/B_e    (2)

IP transaction latency:

  T_IP   = T_cpu + 2S/B_e                         (3)
Latency of Reference Applications
13
Assumptions:
- CPU usage latency is negligible compared to data transfer latency and can be ignored
- Bus contention from multiple simultaneously executing transactions does not add overhead

Server throughput = S/T, where S is the size of the transaction data and T is the latency of a transaction given by equations (1), (2), and (3).
Peak Throughputs
14
Peak Throughputs cont.
Throughput of three network applications:

  Processor                     Internal bus         IP forwarding   HTTP          RTP streaming
                                bandwidth (MB/sec)   (Mbits/sec)     (Mbits/sec)   (Mbits/sec)
  Intel Pentium IV 3.06 GHz     3200                 4264            3640          3640
  AMD Athlon XP 3000+           2700                 4264            3291          3291
  MIPS R16000 700 MHz           3200                 4264            3640          3640
  Sun UltraSPARC III 900 MHz    1200                 4264            1862          1862
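The HTTP and RTP columns follow from equations (1) and (2) with T_cpu neglected, per the assumptions above. A small C check is sketched below; the I/O bus bandwidth B_e = 1066 MB/s (64-bit/133-MHz PCI-X) is an assumption inferred from the table's numbers, not a value stated on the slide.

  #include <stdio.h>

  #define BE 1066.0  /* assumed I/O bus bandwidth, MB/s */

  /* Peak throughput implied by equations (1)-(3) when T_cpu is
   * neglected: S cancels, so only the bus bandwidths matter. */
  static double http_rtp_mbps(double bi)  /* equations (1) and (2) */
  {
      double usec_per_byte = 4.0 / bi + 1.0 / (32.0 * bi) + 1.0 / BE;
      return 8.0 / usec_per_byte;         /* MB/s -> Mbits/s */
  }

  static double ip_mbps(void)             /* equation (3) */
  {
      return 8.0 * BE / 2.0;
  }

  int main(void)
  {
      double bi[] = { 3200.0, 2700.0, 1200.0 };
      for (int i = 0; i < 3; i++)
          printf("Bi = %4.0f MB/s: IP %.0f, HTTP/RTP %.0f Mbits/s\n",
                 bi[i], ip_mbps(), http_rtp_mbps(bi[i]));
      return 0;   /* prints 4264 and 3640, 3291, 1862 as in the table */
  }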
15
Measurement-Based Performance Evaluation

Experimental testbed:
- Dual-boot server (Pentium IV 2.0 GHz, 256 MB RAM, 1.0 Gbps NIC)
- Closed LAN (Cisco Catalyst 3550 1.0-Gbps switch)

Tools:
- Intel VTune
- Windows Performance Monitor
- netstat
- Linux tools: vmstat, sar, iostat
16
Platforms and Applications
Platforms:
- Linux (kernel 2.4.7-10)
- Windows 2000

Applications:
- Streaming media servers: Darwin Streaming Server, Windows Media Server
- Web servers: Apache web server, Microsoft Internet Information Server (IIS)
- Software router: Linux kernel IP forwarding
17
Analysis of Operating System Role
Memory throughput test: ECT (extended copy transfer) - memperf

Locality of reference:
- temporal locality: varying working set size (block size)
- spatial locality: varying access pattern (strides)

[Figure: Memory bandwidth (Mbytes/sec) vs. block size (working set); Linux vs. Windows]
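For flavor, here is a crude copy-bandwidth probe in the spirit of the ECT/memperf test, assuming POSIX clock_gettime is available; the real tool also varies the access stride and is far more careful about warm-up, timer resolution, and compiler interference.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  /* Copy a working set of 'block' bytes repeatedly and report the
   * achieved bandwidth. Varying 'block' probes temporal locality
   * (does the working set fit in L1/L2?); a strided variant would
   * probe spatial locality. */
  static double copy_bw_mbps(size_t block, int reps)
  {
      char *src = malloc(block), *dst = malloc(block);
      struct timespec t0, t1;

      memset(src, 1, block);               /* touch pages before timing */
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int r = 0; r < reps; r++)
          memcpy(dst, src, block);
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      free(src);
      free(dst);
      return (double)block * reps / (1024.0 * 1024.0) / secs;
  }

  int main(void)
  {
      for (size_t kb = 4; kb <= 16 * 1024; kb *= 2)   /* 4 KB .. 16 MB */
          printf("%6zu KB: %8.1f Mbytes/sec\n", kb, copy_bw_mbps(kb * 1024, 64));
      return 0;
  }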
18
Context switching overhead
[Figure: Context-switch overhead (usec/context switch, 0-8) vs. number of threads (2-128); Linux vs. Windows]
Analysis of Operating System Role cont.
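The slides do not say how the per-switch cost was obtained; one classic technique (the lmbench approach) bounces a token between two processes blocked on pipes, so every pass forces a context switch, and divides the elapsed time by the number of passes. A minimal sketch:

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/time.h>
  #include <sys/wait.h>

  int main(void)
  {
      enum { N = 100000 };
      int p1[2], p2[2];
      char tok = 'x';
      struct timeval t0, t1;

      pipe(p1);
      pipe(p2);
      if (fork() == 0) {                /* child: echo each token back */
          for (int i = 0; i < N; i++) {
              read(p1[0], &tok, 1);
              write(p2[1], &tok, 1);
          }
          _exit(0);
      }

      gettimeofday(&t0, NULL);
      for (int i = 0; i < N; i++) {     /* parent: send token, await echo */
          write(p1[1], &tok, 1);
          read(p2[0], &tok, 1);
      }
      gettimeofday(&t1, NULL);
      wait(NULL);

      double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
      /* Each round trip costs two switches, plus some pipe overhead. */
      printf("%.2f usec/context switch\n", usec / (2.0 * N));
      return 0;
  }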
19
Streaming Media Servers
Experimental design factors:
- Number of streams (streaming clients)
- Media encoding rate (56 kbps and 300 kbps)
- Stream distribution (unique and multiple media)

Metrics:
- Cache misses (L1 and L2 cache)
- Page fault rate
- Throughput

Benchmarking tools:
- DSS: streaming load tool
- WMS: media load simulator
20
Cache Performance
[Figure: L1 cache misses (millions) vs. number of streams (clients, 1-1000) at 56 kbps; series: dss unique, dss multiple, wms unique, wms multiple]
21
[Figure: L1 cache misses (millions) vs. number of streams (clients, 1-1000) at 300 kbps; series: dss unique, dss multiple, wms unique, wms multiple]
Cache Performance cont.
22
Memory Performance
[Figure: Page faults/sec (0-400) vs. number of streams (clients, 1-1000) at 300 kbps; series: dss unique, dss multiple, wms unique, wms multiple]
23
[Figure: Throughput (kbps) vs. number of streams (clients, 1-1000) at 300 kbps; series: dss unique, dss multiple, wms unique, wms multiple]
24
Summary: Streaming Media Server Memory Performance
- Cache performance (both L1 and L2) degrades most when the number of clients is large and the encoding rate is 300 kbps with multiple media objects.
- When clients demand unique media objects, the page fault rate is constant; when requests are for multiple objects, the page fault rate increases with the number of clients.
- Throughput increases with the number of clients, and the higher 300 kbps encoding rate also yields higher throughput. The Darwin streaming server achieves lower throughput than the Windows media server.
25
Web Servers
Experimental design factors:
- Number of web clients
- Document size

Metrics:
- Cache misses (L1 and L2 cache)
- Page fault rate
- Throughput
- Transactions/sec (connection rate)
- Average latency

Benchmarking tool: WebStone
26
Transactions
[Figure: Transactions/sec (0-9000) vs. file size (5 B to 50 MB); series: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]
27
L1 Cache Miss
[Figure: L1 cache misses (millions, 0-1000) vs. file size (5 B to 5 MB); series: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]
28
Page Fault
[Figure: Page faults/sec (0-1000) vs. file size (5 B to 50 MB); series: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]
29
Throughput
[Figure: Throughput (Mbytes/sec, 0-700) vs. file size (5 B to 50 MB); series: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]
30
Summary: Web Server Memory Performance Evaluation
Comparing Apache and IIS for an average file size of 10 KB:

  Attribute                          Apache   IIS
  Max. transaction rate (conn/sec)   2586     4178 (58% more than Apache)
  Max. throughput (Mbps)             217      349 (62% more than Apache)
  CPU utilization (%)                71       63
  L1 misses (millions)               424      200
  L2 misses (millions)               1673     117
  Page fault rate (pfs/sec)          < 10     < 10
31
Software Router
Experimental design factors:
- Routing configurations
- TCP message size (64 bytes, 10 Kbytes, and 64 Kbytes)

Metrics:
- Throughput
- Context-switch rate
- Number of active pages

Benchmarking tool: Netperf
32
Software Router Throughput
[Figure: Throughput (Mbits/sec) vs. configuration (1-8), one panel per Ethernet interface (0-3), for 64-byte, 10K, and 64K packets]
33
CPU Utilization
[Figure: CPU utilization (%, 0-90) vs. configuration (1-8) for 64-byte, 10K, and 64K packets]
34
Context Switching
[Figure: Context switches/sec (0-6000) vs. configuration (1-8) for 64-byte, 10K, and 64K packets]
35
Active Page
[Figure: Number of active pages (860-1020) vs. configuration (1-8) for 64-byte, 10K, and 64K packets]
36
Summary: Software Router Performance Evaluation
- Maximum throughput of 449 Mbps for configuration 2 (full-duplex one-to-one communication).
- Highest CPU utilization was 84%.
- Highest context-switch rate was 5378/sec.
- The number of active pages is fairly uniform across configurations, indicating low memory activity.
37
Design, Implementation, and Evaluation of Prototype DB-RTP Server

Architecture and implementation:
- Linux platform (C)
- Our own implementation of RTSP/RTP (why?)

[Figure: DB-RTP server architecture: media chunks flow from disk into a memory buffer, through the RTP packetizer and RTP server into the UDP/IP & TCP/IP stack and out the NIC as IP packets to the media client; an RTSP server & scheduler with a parser handles requests arriving from the media client]
38
Double Buffering and Synchronization
[Flowcharts: Buffer read: starting from the Next bit, if Next = 0, check Dirty_bit_A = 1 and read Buffer_A; if Next = 1, check Dirty_bit_B = 1 and read Buffer_B. Buffer write: fetch a media chunk from disk, then by the Next bit, if Dirty_bit_A = 0 write Buffer_A; if Dirty_bit_B = 0 write Buffer_B.]
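A minimal C sketch of the double-buffering scheme in the flowcharts, assuming one producer thread (disk fetch) and one consumer thread (RTP packetizer). The names, chunk size, and the volatile spin-waits are illustrative simplifications, not the thesis code; real code would block on a mutex/condition variable instead of spinning.

  #include <stdint.h>
  #include <string.h>

  #define CHUNK_SIZE 4096

  static uint8_t buffer_A[CHUNK_SIZE], buffer_B[CHUNK_SIZE];
  static volatile int dirty_A = 0, dirty_B = 0;  /* 1 = holds unread data */
  static int next_write = 0, next_read = 0;      /* 0 selects A, 1 selects B */

  /* Producer: fetch the next media chunk into whichever buffer is clean. */
  void buffer_write(const uint8_t *chunk)
  {
      if (next_write == 0) {
          while (dirty_A) ;                /* wait until reader drained A */
          memcpy(buffer_A, chunk, CHUNK_SIZE);
          dirty_A = 1;
      } else {
          while (dirty_B) ;
          memcpy(buffer_B, chunk, CHUNK_SIZE);
          dirty_B = 1;
      }
      next_write ^= 1;                     /* alternate A and B */
  }

  /* Consumer: drain the next dirty buffer into the RTP packetizer. */
  void buffer_read(uint8_t *out)
  {
      if (next_read == 0) {
          while (!dirty_A) ;               /* wait until writer filled A */
          memcpy(out, buffer_A, CHUNK_SIZE);
          dirty_A = 0;
      } else {
          while (!dirty_B) ;
          memcpy(out, buffer_B, CHUNK_SIZE);
          dirty_B = 0;
      }
      next_read ^= 1;
  }

The point of the scheme is that the disk fetch of one chunk overlaps with packetization of the other, hiding disk latency; that overlap is what the throughput and jitter results on the following slides measure.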
39
RTP Server Throughput
[Figure: Bandwidth (Mbps, 0-70) vs. number of streams; series: RTP-unique, RTP-multiple, DB-RTP-unique, DB-RTP-multiple]
40
Jitter
[Figure: Jitter (usec, 0-25000) vs. number of streams (10-110); series: RTP-unique, RTP-multiple, DB-RTP-unique, DB-RTP-multiple]
41
- Throughput: DB-RTP server 63.85 Mbps vs. RTP server 59 Mbps.
- Both servers exhibit steady jitter, but the DB-RTP server's jitter is lower than the RTP server's.
Summary: DB-RTP Server Performance Evaluation
42
Contributions
- Cache overhead analysis
- Memory latency and bandwidth analysis
- Measurement-based performance evaluation
- Design, implementation, and evaluation of a prototype streaming server: the Double-Buffer RTP (DB-RTP) server
43
Conclusions
- High throughput is possible with server design enhancements.
- Server throughput is significantly degraded by excessive cache misses and page faults.
- Latency hiding with pre-fetching and buffering can improve throughput and jitter performance.
44
Future Work
- Server development: hybrid = multiplexing + multithreading
- Special architectures (network processors & ASICs):
  - resource scheduling
  - investigation of the role of I/O
  - use of IRAM (intelligent RAM) architectures
  - integrated network infrastructure server
45
Thank you
46
Array Padding

  Original code:     float rgbFrames[64][64][64][8];
  Transformed code:  float rgbFrames[65][65][65][9];

Array Restructuring

  Original code:     float rgbFrames[64][64][64][8];
  Transformed code:  float rgbFrames[8][64][64][64];

Loop Nest Transformation

  Original code:

    float rgbFrames[8][64][64][64];
    float yuvFrames[8][64][64][64];
    int i, j, k, l;
    for (i = 0; i < 64; i++)
      for (j = 0; j < 64; j++)
        for (k = 0; k < 64; k++)
          for (l = 0; l < 8; l++)      /* innermost index has the largest stride */
            yuvFrames[l][i][j][k] = rgbFrames[l][i][j][k];

  Transformed code:

    float rgbFrames[8][64][64][64];
    float yuvFrames[8][64][64][64];
    int i, j, k, l;
    for (l = 0; l < 8; l++)
      for (i = 0; i < 64; i++)
        for (j = 0; j < 64; j++)
          for (k = 0; k < 64; k++)     /* innermost index is now stride-1 */
            yuvFrames[l][i][j][k] = rgbFrames[l][i][j][k];
47
Testbeds
[Figure: Streaming media/web server testbed: dual-boot server (Windows 2000/Linux) and triple-boot client machines (Windows 2000/Linux) connected through a Catalyst 3550 1-Gbps switch. Software router testbed: Linux router/server with multiple NICs and router clients.]
48
Communication Configurations
[Figure: Host topologies for the eight communication configurations]
- 1-1 communication (1. simplex, 2. duplex)
- Double 1-1 communication (3. simplex, 4. duplex)
- 1-4 communication (5. simplex, 6. duplex)
- Ring communication (7. simplex, 8. duplex)
49
Backup slides
50
[Figure: Page faults/sec (0-400) vs. number of streams (clients, 1-1000), two panels: 56 kbps and 300 kbps; series: dss unique, dss multiple, wms unique, wms multiple]
Memory Performance
51
[Figure: CPU utilization (%, 0-90) vs. number of streams (clients, 1-1000); series: dss unique, dss multiple, wms unique, wms multiple]
Streaming Server: CPU Utilization
52
[Figure: L2 cache misses (millions, 0-2500) vs. number of streams (clients, 1-1000) at 56 kbps; series: dss unique, dss multiple, wms unique, wms multiple]
Cache Performance cont.
53
[Figure: L2 cache misses (millions, 0-2500) vs. number of streams (clients, 1-1000) at 300 kbps; series: dss unique, dss multiple, wms unique, wms multiple]
Cache Performance cont.
54
Web Servers
Cache performance:
[Figure: L1 cache misses (millions, 0-1000) and L2 cache misses (millions, 0-2500) vs. file size (5 B to 5 MB); series: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]

Transactions:
[Figure: Transactions/sec (0-9000) vs. file size (5 B to 50 MB); same four series]
55
Latency:
[Figure: Latency (sec, 0-350) vs. file size (5 B to 50 MB); series: apache 1 client, apache 400 clients, IIS 1 client, IIS 400 clients]

CPU utilization:
[Figure: CPU utilization (%, 0-120) vs. file size (5 B to 5 MB); same four series]
Web Servers
56
DB-RTP Server
Cache performance:
[Figure: L1 cache misses (millions, 0-40) and L2 cache misses (millions, 0-9) vs. number of streams (10-120); series: RTP-unique, RTP-multiple, DB-RTP-unique, DB-RTP-multiple]

CPU utilization:
[Figure: CPU utilization (%, 0-4) vs. number of streams (10-110); same four series]
57
Memory Performance Evaluation Methodologies
- Analytical: requires just paper and pencil, but accuracy?
- Simulation: requires programming, but time and cost?
- Measurement: requires a real system or a prototype; uses on-chip counters and benchmarking tools; more accurate
58
Server Performance Tuning
- Memory performance tuning: array padding, array restructuring, loop nest transformation
- Latency hiding and multithreading: EPIC (IA-64), VIRAM, Impulse
- Multiprocessing and clustering: task parallelization, e.g. the Panama cluster router
- Special architectures: network processors, ASICs, and data-flow architectures
59
Temporal vs. spatial locality:
- A PDU lacks temporal locality
- Observation: PDU processing exhibits excellent spatial locality
  - Suppose a data cache line is 32 bytes (16 words) long
  - Sequential accesses with stride = 1: accessing one word brings in the other 15 words as well
  - Thus the effective MR = 1/16 = 6.2%, better than even scientific apps
- Generally, MR = W/L, where W is the width of each memory access (in bytes) and L is the length of each cache line (in bytes)

Validation of the above observation: similar spatial-locality characteristics have been reported via measurements:
- S. Sohoni et al., "A Study of Memory System Performance of Multimedia Applications," in Proc. of ACM SIGMETRICS 2001
- The MR for a streaming media player is better than for SPEC benchmark apps!
60
Memory-CPU Transfers
- PDU processing: checksum computation and header updating
- Typically one-way data flow (memory to CPU via cache)
- Memory stall cycles: number of memory stall cycles = (IC)(AR)(MR)(MP)
  - IC: instruction count per transaction
  - AR: number of memory accesses per instruction (AR = 1)
  - MR: ratio of cache misses to memory accesses
  - MP: miss penalty in clock cycles
- Cache miss rate, worst case: MR = 1 while typically MP = 10, so stall cycles = 10 x IC
61
Determine cache overhead with respect to execution time:

  (Execution time)_no-cache   = (IC)(CPI)(CC)
  (Execution time)_with-cache = (IC)(CPI)(CC){1 + (MR)(MP)}
  Cache overhead = 1 + (MR)(MP)

Cache overhead in various cases:
- Worst case: MR = 1 and MP = 10: cache misses make each transaction 11 times slower!
- Best case: MR = 0 (trivial)
- Average case: MR = 0.1 and MP = 10, so (MR)(MP) = 1: the latency due to stalls equals the ideal execution time without stalls

Memory-CPU latency depends on the internal bus bandwidth:

  T_m-c = S/(32*B_i) usec

where S is the PDU size and B_i is the internal bus bandwidth in MB/s.
Memory-CPU Transfers cont.
62
Open Questions
- Role of special-purpose architectures in the performance of high throughput servers (e.g., network processors)
- Role of memory compression
- Role of scheduling