An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments
G. Narayanaswamy, P. Balaji and W. Feng
Dept. of Comp. Science, Virginia Tech
Mathematics and Comp. Science, Argonne National Laboratory
High-end Computing Trends
• High-end Computing (HEC) Systems
– Continue to increase in scale and capability
– Multicore architectures
• A significant driving force for this trend
• Quad-core processors from Intel/AMD
• IBM Cell, Sun Niagara, Intel Terascale processor
– High-speed Network Interconnects
• 10-Gigabit Ethernet (10GE), InfiniBand, Myrinet, Quadrics
• Different stacks use different amounts of hardware support
• How do these two components interact with each other?
Multicore Architectures
• Multi-processor vs. multicore systems
– Not all of the processor hardware is replicated in multicore systems
– Hardware units such as caches might be shared between the different cores
– Multiple processing units are embedded on the same processor die, so inter-core communication is faster than inter-processor communication
• On most architectures (Intel, AMD, Sun), all cores are equally powerful, which makes scheduling easier
Interactions of Protocols with Multicores
• Depending on how the stack works, different protocols have different interactions with multicore systems
• Study based on host-based TCP/IP and iWARP
• TCP/IP has significant interaction with multicore systems
– Large impact on application performance
• The iWARP stack itself does not interact directly with multicore systems
– Software libraries built on top of iWARP DO interact (buffering of data, copies)
– Interaction similar to other high-performance protocols (InfiniBand, Myrinet MX, QLogic PSM)
TCP/IP Interaction vs. iWARP Interaction

[Diagram: with host-based TCP/IP, packets arriving from the network are processed by the TCP/IP stack on the host before reaching the application processes; with iWARP, packet processing is offloaded to the network and each application reaches it through its own library]

TCP/IP is in some ways more asynchronous or “centralized” with respect to host-processing than iWARP (or other high-performance software stacks):
• TCP/IP: host-processing is independent of the application process (statically tied to a single core)
• iWARP: host-processing is closely tied to the application process
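One way to observe the “statically tied to a single core” behavior on a live system is to compare per-core interrupt counts for the NIC. A minimal sketch, assuming a Linux host and that the 10GE NIC's interrupt lines contain "eth" in their /proc/interrupts names (driver naming varies, so the match string may need adjusting):

```c
/* Print the /proc/interrupts rows for the NIC so per-core interrupt
 * counts can be compared. Assumes Linux; "eth" is an assumed match
 * string for the NIC's IRQ names. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/interrupts", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[1024];
    if (fgets(line, sizeof line, f))      /* header row: CPU0 CPU1 ... */
        fputs(line, stdout);
    while (fgets(line, sizeof line, f))
        if (strstr(line, "eth"))          /* rows for the NIC's IRQs */
            fputs(line, stdout);

    fclose(f);
    return 0;
}
```

If TCP/IP host-processing is pinned as the slide describes, one core's counters grow far faster than the others' during a bandwidth test.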
Presentation Layout
• Introduction and Motivation
• Treachery of Multicore Architectures
• Application Process to Core Mapping Techniques
• Conclusions and Future Work
MPI Bandwidth over TCP/IP

[Charts: Intel platform and AMD platform; one curve per core (Core 0–Core 3); x-axis: Message Size (bytes), 1 byte to 4 MB; y-axis: Bandwidth (Mbps), scale 0–3500 on Intel and 0–3000 on AMD]
MPI Bandwidth over iWARP

[Charts: Intel platform and AMD platform; one curve per core (Core 0–Core 3); x-axis: Message Size (bytes), 1 byte to 4 MB; y-axis: Bandwidth (Mbps), scale 0–7000 on Intel and 0–8000 on AMD]
TCP/IP Interrupts and Cache Misses

[Charts: hardware interrupts per message (log scale, 0.01–100,000) and percentage difference in L2 cache misses (−50 to 250); one curve per core (Core 0–Core 3); x-axis: Message Size (bytes), 1 byte to 4 MB]
MPI Latency over TCP/IP (Intel Platform)

[Charts: small-message latency (1 byte to 4 KB) and large-message latency (128 KB to 4 MB); one curve per core (Core 0–Core 3); x-axis: Message Size (bytes)]
Presentation Layout
• Introduction and Motivation
• Treachery of Multicore Architectures
• Application Process to Core Mapping Techniques
• Conclusions and Future Work
Application Behavior Pre-analysis
• A four-core system is effectively a 3.5-core system
– A part of a core has to be dedicated to communication
– Interrupts, cache misses
• How do we schedule 4 application processes on 3.5 cores? (a pinning sketch follows below)
• If the application is exactly synchronized, there is not much we can do
• Otherwise, we have an opportunity!
• Study with GROMACS and LAMMPS
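On Linux, the basic primitive for placing a process on a chosen core is sched_setaffinity(2). A minimal sketch; the choice of core 1 is purely illustrative:

```c
/* Pin the calling process to a single core via sched_setaffinity(2).
 * Core 1 is an illustrative choice, not the paper's mapping. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                                  /* run only on core 1 */

    if (sched_setaffinity(0, sizeof set, &set) != 0) { /* pid 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned; now running on core %d\n", sched_getcpu());
    return 0;
}
```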
GROMACS Overview
• Developed by Groningen University
• Simulates the molecular dynamics of biochemical particles
• The root distributes a “topology” file corresponding to the molecular structure
• Simulation time is broken down into a number of steps
– Processes synchronize at each step
• Performance is reported as the number of nanoseconds of molecular interactions that can be simulated each day
Process-to-core mapping combinations:

|               | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7 |
|---------------|--------|--------|--------|--------|--------|--------|--------|--------|
| Combination A | 0      | 4      | 2      | 6      | 7      | 3      | 5      | 1      |
| Combination B | 0      | 2      | 4      | 6      | 5      | 1      | 3      | 7      |
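A hedged sketch of applying a mapping like Combination A at startup: each MPI rank finds its entry in the table and pins itself. Reading the row as "core c hosts rank combination_a[c]", treating cores 0–3 and 4–7 as the two quad-core machines, and assuming the launcher already placed each rank on the right machine are our assumptions, not details from the slide:

```c
/* Each rank pins itself per Combination A. Interpretation of the
 * table (core c hosts rank combination_a[c]; cores 0-3 and 4-7 are
 * two quad-core machines) is an assumption. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

static const int combination_a[8] = { 0, 4, 2, 6, 7, 3, 5, 1 };

int main(int argc, char **argv)
{
    int rank, core = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int c = 0; c < 8; c++)       /* core c hosts rank combination_a[c] */
        if (combination_a[c] == rank)
            core = c;

    if (core >= 0) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core % 4, &set);      /* local core index on this machine */
        sched_setaffinity(0, sizeof set, &set);
        printf("rank %d pinned to local core %d\n", rank, core % 4);
    }

    MPI_Finalize();
    return 0;
}
```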
GROMACS: Random Scheduling

[Charts: GROMACS LZM application performance (ns/day, 0–30) over TCP/IP and iWARP for Combinations A and B; and per-core breakdown (0–100%) of Computation, MPI_Wait, and other MPI calls for cores 0–3 on Machine 1 and Machine 2]
GROMACS: Selective Scheduling

[Charts: GROMACS LZM application performance (ns/day, 0–30) over TCP/IP and iWARP for Combinations A, B, A', and B'; and per-core breakdown (0–100%) of Computation, MPI_Wait, and other MPI calls for cores 0–3 on Machine 1 and Machine 2]
LAMMPS Overview
• Molecular dynamics simulator developed at Sandia
• Uses spatial decomposition techniques to partition the simulation domain into smaller 3-D subdomains
– Each subdomain is allotted to a different process
– Interaction is required only between neighboring subdomains, which improves scalability
• Used the Lennard-Jones liquid simulation within LAMMPS

[Diagram: two quad-core machines (Core 0–Core 3 each) connected by the network]
LAMMPS: Random Scheduling

[Charts: LAMMPS communication time (seconds, 0–12) over TCP/IP and iWARP for Combinations A and B; and per-core breakdown (0–100%) of MPI_Wait, MPI_Send, and other MPI calls for cores 0–3 on Machine 1 and Machine 2]
LAMMPS: Intended Communication Pattern

[Diagram: both processes proceed in lockstep each step: computation, then MPI_Send() and MPI_Irecv() with the neighbor, then MPI_Wait() before the next step's MPI_Send()/MPI_Irecv()]
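A minimal MPI sketch of the exchange in the diagram; the ring neighbors, message size, and step count are illustrative stand-ins for LAMMPS's actual 3-D neighbor exchange:

```c
/* Per-step neighbor exchange: post the receive, send our boundary
 * data, then wait, as in the intended pattern above. */
#include <mpi.h>

#define N 1024          /* illustrative halo size */

int main(int argc, char **argv)
{
    int rank, size;
    double halo_in[N], halo_out[N];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;           /* illustrative ring neighbors */
    int left  = (rank + size - 1) % size;

    for (int i = 0; i < N; i++)
        halo_out[i] = rank;                  /* stand-in boundary data */

    for (int step = 0; step < 10; step++) {
        MPI_Irecv(halo_in, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &req);
        MPI_Send(halo_out, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* ... compute on the subdomain using halo_in ... */
    }

    MPI_Finalize();
    return 0;
}
```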
LAMMPS: Actual Communication Pattern

[Diagram: on the “slower” core, MPI_Send() completes by depositing data into the MPI buffer and socket send buffer; the data flows through the faster core's socket receive buffer into its application receive buffer while that core is already past its MPI_Wait() and into the next computation, producing “out-of-sync” communication between the processes]
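The out-of-sync behavior hinges on small sends completing eagerly: MPI_Send returns once the data is buffered in MPI/socket buffers, so a fast sender can run far ahead of a slow receiver. A small, hedged demonstration (the message count and the sleep are artificial):

```c
/* Rank 0's small sends typically complete eagerly, long before
 * rank 1 (emulating the slower core) posts its receives. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, msg = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 100; i++)
            MPI_Send(&msg, 1, MPI_INT, 1, i, MPI_COMM_WORLD);
        printf("rank 0: all 100 sends returned\n");
    } else if (rank == 1) {
        sleep(2);                            /* emulate the slower core */
        for (int i = 0; i < 100; i++)
            MPI_Recv(&msg, 1, MPI_INT, 0, i, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```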
LAMMPS: Selective Scheduling

[Charts: LAMMPS communication time (seconds, 0–12) over TCP/IP and iWARP for Combinations A, B, A', and B'; and per-core breakdown (0–100%) of MPI_Wait, MPI_Send, and other MPI calls for cores 0–3 on Machine 1 and Machine 2]
Presentation Layout
• Introduction and Motivation
• Treachery of Multicore Architectures
• Application Process to Core Mapping Techniques
• Conclusions and Future Work
Concluding Remarks and Future Work
• Multicore architectures and high-speed networks are becoming prominent in high-end computing systems
– The interaction of these components is important and interesting!
– For TCP/IP, scheduling order drastically impacts performance
– For iWARP, scheduling order adds no overhead
– Scheduling processes in a more intelligent manner allows significantly improved application performance
– This does not impact iWARP and other high-performance stacks, making the approach portable as well as efficient
• Future work: dynamic process-to-core scheduling!
Thank You
Contacts:
Ganesh Narayanaswamy: [email protected]
Pavan Balaji: [email protected]
Wu-chun Feng: [email protected]
For More Information:
http://synergy.cs.vt.edu
http://www.mcs.anl.gov/~balaji
Backup Slides
MPI Latency over TCP/IP (AMD Platform)

[Charts: small-message latency (1 byte to 4 KB) and large-message latency (128 KB to 4 MB); one curve per core (Core 0–Core 3); x-axis: Message Size (bytes)]