Workshop on Distributed Processing, Transfer, Retrieval, Fusion and Display of Images and Signals: High Resolution and Low Resolution in Data and
Information Grids (21-22 February, Granada, Spain)
EFFICIENT PARALLEL PROCESSING, PROGRAM DEVELOPMENT AND
COMMUNICATION IN LOW-COST HIGH PERFORMANCE
PLATFORMS
Anguita, M.; Cañas, A.; Díaz, A.F.; Fernández, F.J.; Ortega, J.; Prieto, A.
Department of Computer Architecture and Technology
University of Granada (Spain)
Workshop on Distributed Processing, Transfer, Retrieval, Fusion and Display of Images.. (21-22 February 2003, Granada, Spain)
Outline of the talk
1. Introduction: Grid computing
2. Communication performance improvement
   1. CLIC on Fast Ethernet
   2. CLIC on Gigabit Ethernet
3. Grid- and cluster-aware program development environments
   1. PVMTB and MPITB performance
   2. Application examples: wavelets, pH control, nanoelectronics
Introduction: Grid Computing
As the available network bandwidth increases, the location of the computing power becomes less relevant.
It becomes possible to use networks of computers as a single computing resource for large-scale applications.
The goal (from the platform research point of view):
To provide transparent access to the available computing resources (including supercomputers, storage systems, ...) and to other geographically distributed devices and scientific instruments via a networked environment.
Share resources. Manipulate devices.
Introduction: our goals in this context
Efficient exploitation of parallelism (at different levels) in low cost platforms based on clusters of computers:
• Improvement of communication bandwidths available to applications
• High-level programming environments for parallel program development
Improving communication in Clusters (II)
CLIC (Communication in Linux Clusters) protocol
• Reliable transport system suited for cluster computing
• Developed on Linux (kernel module)
• Optimizes OS support for communication (scheduler, NIC drivers, kernel functions)
• Upper-layer systems (PVM, MPI, …) can be used efficiently on top of CLIC
CLIC improves communication performance so that user-level applications can take advantage of network features (better latency and bandwidth, broadcast, channel bonding).
Improving communication in Clusters (III)
Figure: protocol stacks across the user/kernel boundary. User processes reach the driver and the Network Interface Card (NIC) either through Socket, TCP and IP, or through CLIC.
CLIC avoids the TCP/IP stack
Figure: LAM-MPI layering. LAM-MPI/TCP: upper MPI layer, RPI functions, lower MPI layer (sockets), TCP, IP, software driver, NIC. LAM-MPI/CLIC: upper MPI layer, RPI functions, CLIC, software driver, NIC.
LAM-MPI has been efficiently implemented on CLIC
Improving communication in Clusters (V)
Figure: speedup w.r.t. PVM-TCP (0 to 4) versus message size (10 to 100000 bytes) for CLIC, LAM-MPI/CLIC and LAM-MPI/TCP. Annotations: CLIC has lower software overhead; TCP/IP works while packets are crossing the network.
• High improvement w.r.t. MPI/TCP and PVM/TCP
• MPI/CLIC provides performance similar to CLIC
Network technology trends: Gigabit networks
Datacenters and networks are moving towards 1-10 Gigabit technologies
Figure: servers, clusters and hard disks interconnected by 10 Gigabit/s links through Ethernet switches, an Infiniband switch (Infiniband array) and a Fibre Channel switch (Fibre Channel array).
CLIC on Gigabit Ethernet (I)
Techniques implemented in CLIC to take advantage of gigabit network bandwidths:
• Jumbo frames: use MTUs larger (up to MTU=9000 bytes) than the Ethernet standard (MTU=1500 bytes)
Reduces the number of interrupts and the overhead of communication protocol processing
• Coalesced interrupts: the NIC interrupts the processor only after a given time interval or a given number of packet arrivals
Reduces the number of generated interrupts (at the cost of a delay in reception)
• 0-copy: data to be sent are copied directly from user memory space to the NIC (to receive data, only one copy is needed)
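The effect of jumbo frames and coalesced interrupts on the interrupt count can be illustrated with a rough back-of-the-envelope sketch. It is not CLIC code: the function names, the 1 MB message and the batch size of 8 frames per interrupt are hypothetical, protocol headers are ignored, and time-based coalescing is not modelled.

```python
import math


def frames_needed(message_bytes: int, mtu: int) -> int:
    """Frames required for one message, ignoring protocol headers (simplification)."""
    if message_bytes <= 0:
        return 1  # assume even an empty message travels in one frame
    return math.ceil(message_bytes / mtu)


def interrupts_needed(frames: int, frames_per_interrupt: int = 1) -> int:
    """Receive interrupts if the NIC raises one interrupt per batch of frames."""
    return math.ceil(frames / frames_per_interrupt)


if __name__ == "__main__":
    message = 1_000_000  # hypothetical 1 MB message
    for mtu in (1500, 9000):
        frames = frames_needed(message, mtu)
        print(
            f"MTU={mtu}: {frames} frames, "
            f"{interrupts_needed(frames)} interrupts without coalescing, "
            f"{interrupts_needed(frames, 8)} with 8 frames per interrupt"
        )
```

Under these assumptions, a larger MTU reduces the number of frames, and coalescing further divides the number of interrupts by the batch size.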
CLIC on Gigabit Ethernet (II)
Figure: data path from user memory through O.S. (driver) memory to the NIC for TCP (2 copies), CLIC on Fast Ethernet (1 copy) and CLIC on Gigabit Ethernet (0-copy).
• TCP (2 copies): data to be sent go from user memory to system memory, then another copy is made to build the packets, and then they go to the NIC
• CLIC on Fast Ethernet (1 copy): data to be sent go from user memory to system memory, in order to build the packets, and then to the NIC
• CLIC on Gigabit Ethernet (0-copy): CLIC takes advantage of the new drivers for Gigabit network cards, so data to be sent can go directly from user memory to the NIC
CLIC on Gigabit Ethernet (III)
Comparison of bandwidths provided by CLIC on Gigabit Ethernet with 0-copy/1-copy and MTU=9000/1500:
Figure: bandwidth (Mbps, 0 to 750) versus message size (10 to 10^7 bytes) for 0-copy MTU 9000, 1-copy MTU 9000, 0-copy MTU 1500 and 1-copy MTU 1500.
• Using MTU=9000 bytes has more impact than using 0-copy
• Latency = 36 μs (messages of 0 bytes)
• 50% of the maximum bandwidth is reached with messages of 4 KBytes
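The half-bandwidth figure can be read through a simple saturation model, sketched below. This is an illustration only: the model shape and the peak value of 700 Mbps are assumptions and are not fitted to the measured curves; only the 4 KByte half-bandwidth message size comes from the slide.

```python
def effective_bandwidth(
    message_bytes: float,
    peak_mbps: float,
    half_size_bytes: float = 4096.0,
) -> float:
    """Bandwidth seen by a message of the given size under a saturation model.

    The model reaches half of peak_mbps when message_bytes equals
    half_size_bytes, and approaches peak_mbps for very large messages.
    """
    return peak_mbps * message_bytes / (message_bytes + half_size_bytes)


if __name__ == "__main__":
    peak = 700.0  # hypothetical peak bandwidth in Mbps
    for size in (64, 1024, 4096, 65536, 1_000_000):
        print(f"{size:>9} bytes: {effective_bandwidth(size, peak):7.1f} Mbps")
```

In this model, short messages are dominated by the fixed per-message cost, which is why reducing latency matters as much as raising the peak bandwidth.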
Further Improvements: Intelligent NIC
The emergence of fast, cheap embedded processors allows the use of Intelligent Network Interface Cards (INICs), which include one or more processors, to assist communication by offloading protocol processing: the entire communication protocol is configured and moved to the INIC.
Consequences:
• The load on the CPU (from the communication process) is reduced
Applications can take advantage of overlapping communication and computation.
• The card becomes protocol-aware and can interact with the network without CPU intervention
The overall protocol latency is reduced, as short messages do not have to cross the peripheral (PCI) bus and the CPU does not have to service an interrupt and perform a context switch for each one.
• The INIC can transfer data more efficiently to/from the CPU (small messages can be reassembled in the INIC and block-transferred to main memory rather than sent as a sequence of short DMA transfers).
• Implementing the communication protocols in the INIC helps reduce the effect of the I/O (PCI) bus bottleneck.
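The benefit of overlapping communication and computation can be sketched with an idealized timing model. It assumes the INIC handles the whole protocol so that both activities proceed fully in parallel; the function names and the timings are hypothetical.

```python
def time_without_overlap(compute_s: float, comm_s: float) -> float:
    """The CPU processes the protocol itself, so the two phases add up."""
    return compute_s + comm_s


def time_with_overlap(compute_s: float, comm_s: float) -> float:
    """Idealized full overlap: the longer of the two phases dominates."""
    return max(compute_s, comm_s)


if __name__ == "__main__":
    compute, comm = 3.0, 1.0  # hypothetical seconds per iteration
    print("no overlap:", time_without_overlap(compute, comm))
    print("overlap:   ", time_with_overlap(compute, comm))
```

Real systems fall between these two bounds, since some CPU work is still needed to start and complete each transfer.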
MATLAB interpreted environment
• M-files
– interpreted
– fast prototyping
– save & run
– interactive trial and error, integrated debugger
• MEX-files
– compiled
– lower-level
– intermediate compile/link step, involved configuration
– normal debugger, involved breakpoints
– used for computing-intensive code, access to libraries, data export/import and hardware control
Figure: M-files are turned into P-code; C sources are built into MEX files.
Parallel Toolboxes
• Toolbox of MEX files: each PVM/MPI routine has its own MEX file
Figure: software layers. A MATLAB application runs on MATLAB and reaches PVM or MPI through PVMTB or MPITB, next to native PVM and MPI applications, all on top of the operating system and the network.
• PVMTB: 93 commands (interfacing 86 PVM calls)
• MPITB: 153 commands (interfacing 135 MPI calls)
• Have been used for:
– signal processing: wavelet transform (UGR, Spain)
– automatic control: real-time pH control (UNED, Spain)
– chemical engineering: chemical manufacturing simulation (Carnegie Mellon, USA)
– nanoelectronics: nanoscale device simulation (CELAB, Purdue, USA)
Performance results
“Performance of Parallel MATLAB Toolboxes”, VecPar’02, Porto, Portugal
• Latency: 1.8x
• Overhead: 20% @ 1500 B
Application: pH control
S. Dormido (UNED, Spain), ASCC’02, Suntec, Singapore
“Dynamic Programming on clusters for solving Control problems”
Figure: pH control loop. A predictive controller running on a PC cluster receives PHm from a pH meter and Flm from a flowmeter attached to an acid/base process, and produces the control signal u. Other labels in the diagram: V, q, cA, cB.
Cluster “smaug”: 16 Athlon K7 500 MHz, 128 MB, 7 GB HD; server with 20 GB HD, 2 NICs, KVM
pH control: results
Figure: execution time tp (seconds, 0 to 750) versus number of slave processors (M), for PD_SIN/E/ME with constraints and without constraints. A timeline of tp marks start, stage 1 and stage 2.
Figure: speedup factor S(M) versus number of slave processors (M, 0 to 15), for PD_SIN/E/ME with constraints and without constraints, compared with linear speedup.
Application: nanoelectronics
S. Goasguen (Purdue, USA), IEEE-NANO’02, Arlington, VA, USA
“Parallelization of nanoMOS2.0 using a 100-nodes Linux cluster”
nanoMOS: cluster “superman”
nanoMOS: results
e ~ 84.2%
e ~ 88.8%
e ~ 95.3%
efficiency = 98.0%
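For reference, speedup and efficiency figures such as these are usually computed from serial and parallel execution times as in the sketch below. It uses the standard definitions, which are assumed (not confirmed by the slides) to be the ones behind the values above; the timings are hypothetical and are not nanoMOS measurements.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """How many times faster the parallel run is than the serial one."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    """Speedup per processor; 1.0 corresponds to linear speedup."""
    return speedup(t_serial, t_parallel) / processors


if __name__ == "__main__":
    t1, tp, p = 100.0, 12.5, 10  # hypothetical timings in seconds
    print(f"speedup    = {speedup(t1, tp):.1f}")
    print(f"efficiency = {efficiency(t1, tp, p):.1%}")
```

Efficiency typically drops as processors are added, because communication and synchronization take a growing share of the run time.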
Conclusions
Parallel Toolboxes PVMTB-MPITB:
• Fast learning / Fast prototyping of parallel applications on clusters
• Useful for research: small overhead, acceptable efficiency even with 120 CPUs
• Foreseen: efficiency improvement by compiling application M-files (MATLAB compiler) and by linking them against MPI/CLIC and PVM/CLIC.
Lightweight protocol CLIC:
• Portable to any Linux machine; benefits from OS resources and Gbps driver features
• Makes it easier to track technology advances, since it is not NIC/CPU dependent
• Exposes latency/bandwidth improvements at application level (MPI-PVM/CLIC)
• Foreseen: INICs with embedded processors to offload protocol load from CPU and avoid I/O bus bottleneck (PCI)
Grids are made of clusters / other resources. Inside them, we want/need:
• Communication performance, promptly tracking technology advances (Gbps, INICs)
• Parallel application development environments, benefiting from those improvements