Anguita, M.; Cañas, A.; Díaz, A.F.; Fernández, F.J.; Ortega, J.; Prieto, A.

Workshop on Distributed Processing, Transfer, Retrieval, Fusion and Display of Images and Signals: High Resolution and Low Resolution in Data and Information Grids (21-22 February 2003, Granada, Spain)

EFFICIENT PARALLEL PROCESSING, PROGRAM DEVELOPMENT AND COMMUNICATION IN LOW-COST HIGH PERFORMANCE PLATFORMS

Anguita, M.; Cañas, A.; Díaz, A.F.; Fernández, F.J.; Ortega, J.; Prieto, A.

Department of Computer Architecture and Technology

University of Granada (Spain)


Workshop on Distributed Processing, Transfer, Retrieval, Fusion and Display of Images.. (21-22 February 2003, Granada, Spain)

Outline of the talk

1. Introduction: Grid computing

2. Communication performance improvement
   1. CLIC on Fast Ethernet
   2. CLIC on Gigabit Ethernet

3. Grid- and cluster-aware program development environments
   1. PVMTB and MPITB performance
   2. Application examples: wavelets, pH control, nanoelectronics

Introduction: Grid Computing

As the available bandwidths of the networks increase, the location of the computing power becomes less relevant.

It would be possible to use networks of computers as a single computing resource for large-scale applications.

The goal (from a platform research point of view):

To provide transparent access to the available computing resources (including supercomputers, storage systems, ...) and to other geographically distributed devices and scientific instruments via a networked environment

Share resources, manipulate devices...



Introduction: our goals in this context

Efficient exploitation of parallelism (at different levels) in low-cost platforms based on clusters of computers:

• Improvement of communication bandwidths available to applications

• High-level programming environments for parallel program development




Improving communication in Clusters (II)

CLIC (Communication in Linux Clusters) protocol

• Reliable transport system suited to cluster computing

• Developed on Linux (kernel module)

• Optimizes OS support for communication (scheduler, NIC drivers, kernel functions)

• Upper-layer systems (PVM, MPI, …) can be used efficiently on top of CLIC

CLIC improves communication performance so that user-level applications can take advantage of network features (better latency and bandwidth, broadcast, channel bonding).



Improving communication in Clusters (III)

[Figure: protocol stacks. User processes run in user space. In the kernel, the conventional path is Socket → TCP → IP → Driver → Network Interface (NIC); the CLIC path is CLIC → Driver → Network Interface (NIC).]

CLIC avoids the TCP/IP stack

[Figure: LAM-MPI layering. LAM-MPI/TCP: Upper MPI layer → RPI functions → Lower MPI layer (sockets) → TCP → IP → Software driver → Network Interface (NIC). LAM-MPI/CLIC: Upper MPI layer → RPI functions → CLIC → Software driver → Network Interface (NIC).]

LAM-MPI has been efficiently implemented on CLIC

[Figure: speedup with respect to PVM-TCP (axis 0 to 4) versus message size (10 to 100000 bytes) for CLIC, LAM-MPI/CLIC and LAM-MPI/TCP. Annotations: CLIC has lower software overhead; TCP/IP works while packets are crossing the network.]



Improving communication in Clusters (V)

• Large improvement with respect to MPI/TCP and PVM/TCP

• MPI/CLIC provides performance similar to that of CLIC



Network technology trends: Gigabit networks

Datacenters and networks are moving towards 1-10 Gigabit technologies

[Figure: servers, clusters and hard disks interconnected through Ethernet switches, an Infiniband switch and array, and a Fibre Channel switch and array, over 10 Gigabit/s links.]



CLIC on Gigabit Ethernet (I)

Techniques implemented on CLIC to take advantage of the gigabit network bandwidths:

• Jumbo frames: use MTUs larger than the Ethernet standard (up to MTU=9000 bytes instead of MTU=1500 bytes)

Reduces the number of interrupts and the overhead associated with communication protocol processing

• Interrupt coalescing: the NIC interrupts the processor only after a given time interval or a given number of packet arrivals

Reduces the number of generated interrupts (at the cost of a delay in reception)

• 0-copy: data to be sent are copied directly from user memory space to the NIC (to receive data, only one copy is needed)
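The effect of jumbo frames and interrupt coalescing on the interrupt count can be estimated with simple arithmetic. The sketch below is only an illustration: the function names, the `header_bytes` parameter (the CLIC header size is not given in the slides, so it defaults to 0) and the choice of 8 frames per interrupt are assumptions, and coalescing by time interval is not modelled.

```python
import math


def frames_needed(message_bytes: int, mtu: int, header_bytes: int = 0) -> int:
    """Frames required to carry a message when each frame holds mtu - header_bytes."""
    payload = mtu - header_bytes
    if payload <= 0:
        raise ValueError("header_bytes must be smaller than the MTU")
    # Even an empty message occupies one frame.
    return max(1, math.ceil(message_bytes / payload))


def interrupts_needed(frames: int, frames_per_interrupt: int = 1) -> int:
    """Receive interrupts when the NIC raises one per group of frames."""
    if frames_per_interrupt < 1:
        raise ValueError("frames_per_interrupt must be at least 1")
    return math.ceil(frames / frames_per_interrupt)


# Hypothetical workload: one 1,000,000-byte message.
MESSAGE_BYTES = 1_000_000
for mtu in (1500, 9000):
    frames = frames_needed(MESSAGE_BYTES, mtu)
    plain = interrupts_needed(frames)                          # one interrupt per frame
    coalesced = interrupts_needed(frames, frames_per_interrupt=8)  # assumed group size
    print(f"MTU {mtu}: {frames} frames, {plain} interrupts, {coalesced} when coalesced")
```

Under these assumptions the two techniques multiply: larger frames cut the number of packets, and coalescing then cuts the number of interrupts per packet.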



CLIC on Gigabit Ethernet (II)

CLIC takes advantage of the new drivers for Gigabit network cards:

• TCP (2 copies): data to be sent go from user memory to system memory, then another copy is done to build the packets, and then to the NIC

• CLIC on Fast Ethernet (1 copy): data to be sent go from user memory to system memory, in order to build the packets, and then to the NIC

• CLIC on Gigabit Ethernet (0 copies): data to be sent can go directly from user memory to the NIC



CLIC on Gigabit Ethernet (III)

Comparison of bandwidths provided by CLIC on Gigabit Ethernet with 0-copy/1-copy and MTU=9000/1500:

[Figure: bandwidth (Mbps, axis 0 to 750) versus message size (10 to 10^7 bytes) for four configurations: 0-copy MTU 9000, 1-copy MTU 9000, 0-copy MTU 1500, 1-copy MTU 1500.]

Using MTU=9000 bytes has more impact than using 0-copy

Latency = 36 μs (messages of 0 bytes)

50% of maximum bandwidth: 4 KBytes
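How a start-up latency and a half-bandwidth message size relate can be sketched with a simple linear cost model, in which transfer time is a fixed latency plus streaming time at a peak rate. This model is an assumption made here for illustration, not the authors' measurement method, and `PEAK_BPS` is a hypothetical value rather than a figure from the talk; only the 36 μs latency comes from the slide.

```python
def transfer_time(n_bytes: float, latency_s: float, peak_bps: float) -> float:
    """Time to deliver n_bytes: fixed start-up latency plus streaming time."""
    return latency_s + 8.0 * n_bytes / peak_bps


def effective_bandwidth(n_bytes: float, latency_s: float, peak_bps: float) -> float:
    """Bandwidth seen by the application, in bits per second."""
    return 8.0 * n_bytes / transfer_time(n_bytes, latency_s, peak_bps)


def half_bandwidth_size(latency_s: float, peak_bps: float) -> float:
    """Message size (bytes) at which half of peak_bps is reached in this model."""
    return latency_s * peak_bps / 8.0


LATENCY_S = 36e-6   # 0-byte latency quoted on the slide
PEAK_BPS = 700e6    # hypothetical asymptotic bandwidth, NOT a value from the talk

n_half = half_bandwidth_size(LATENCY_S, PEAK_BPS)
for size in (64, 1024, 4096, 65536, 1048576):
    mbps = effective_bandwidth(size, LATENCY_S, PEAK_BPS) / 1e6
    print(f"{size:>8} bytes -> {mbps:7.1f} Mbps")
print(f"half of peak reached at about {n_half:.0f} bytes")
```

In this model small messages are dominated by the fixed latency, which is why reducing per-message software overhead matters most for short transfers.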



Further Improvements: Intelligent NIC

The emergence of fast, cheap embedded processors allows the use of Intelligent Network Interface Cards (INICs), which include one or more processors, to assist communication by offloading protocol processing: the entire communication protocol is configured and moved to the INIC.

Consequences:

• The load on the CPU (from the communication process) is reduced

Applications can take advantage of overlapping communication and computation.

• The card becomes protocol-aware and can interact with the network without CPU intervention

The overall protocol latency is reduced, as short messages do not have to cross the peripheral (PCI) bus and the CPU does not have to service an interrupt and perform a context switch for each one.

• The INIC can transfer data more efficiently to/from the CPU (small messages can be reassembled in the INIC and block-transferred to the main memory rather than moved as a sequence of short DMA transfers).

• Implementing the communication protocols in the INIC helps reduce the effect of the I/O (PCI) bus bottleneck.
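The benefit of overlapping communication and computation can be illustrated with a toy timing model. This is an idealised sketch: the function name, the assumption of perfect overlap and the timing figures are all hypothetical and do not come from the talk.

```python
def step_time(compute_s: float, comm_s: float, overlapped: bool) -> float:
    """Duration of one step that both computes and communicates.

    Without overlap the host CPU runs the protocol itself, so the costs add up.
    With idealised full overlap (protocol offloaded), the longer one dominates.
    """
    if compute_s < 0 or comm_s < 0:
        raise ValueError("durations must be non-negative")
    return max(compute_s, comm_s) if overlapped else compute_s + comm_s


# Hypothetical figures, not measurements from the talk.
COMPUTE_S = 0.010
COMM_S = 0.004

serial = step_time(COMPUTE_S, COMM_S, overlapped=False)
hidden = step_time(COMPUTE_S, COMM_S, overlapped=True)
print(f"no overlap: {serial * 1e3:.1f} ms, full overlap: {hidden * 1e3:.1f} ms")
```

Real overlap is rarely perfect, so the two values should be read as bounds on the step time rather than predictions.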







MATLAB interpreted environment

• M-files
– interpreted
– fast prototyping
– save & run
– interactive trial and error, integrated debugger

• MEX-files
– compiled
– lower-level
– intermediate compile/link step, involved configuration
– normal debugger, involved breakpoints
– used for: computing-intensive code, access to libraries, data export/import, hardware control

[Figure labels: M-file, P-code, C source, MEX]



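Parallel Toolboxes

• Toolbox of MEX files: each PVM/MPI routine has its own MEX file

[Figure: layered diagram with the Network and the Operating System at the bottom, PVM and MPI above them (used by PVM and MPI applications), and MATLAB with PVMTB and MPITB supporting the MATLAB application.]

• PVMTB: 93 commands (interfaces 86 PVM calls)

• MPITB: 153 commands (interfaces 135 MPI calls)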

• Have been used for:

– signal processing: wavelet transform (UGR, Spain)

– automatic control: real-time pH control (UNED, Spain)

– chemical engineering: chemical manufacturing simulation (Carnegie Mellon, USA)

– nanoelectronics: nanoscale device simulation (CELAB, Purdue, USA)



Performance results

“Performance of Parallel MATLAB Toolboxes”, VecPar’02, Porto, Portugal

Latency: 1.8x

Overhead: 20% at 1500 B



Application: pH control

S. Dormido (UNED, Spain), ASCC’02, Suntec, Singapore

“Dynamic Programming on clusters for solving Control problems”

[Figure: pH control loop. A predictive controller running on a cluster of PCs drives an acid/base process; a pH meter and a flowmeter supply the measured signals PHm and Flm. Other labels in the diagram: V, q, cA, cB, u.]

Cluster “smaug”: 16 Athlon K7 500 MHz, 128 MB, 7 GB HD; server with 20 GB HD, 2 NICs, KVM



pH control: results

[Figure: execution time tp (seconds, axis 0 to 750) and speedup factor S(M) (axis 0 to 15, with the linear-speedup line as reference) versus the number of slave processors M (1 to 15), for PD_SIN/E/ME with constraints and without constraints. A timeline of tp marks the start, stage 1 and stage 2.]



Application: nanoelectronics

S. Goasguen (Purdue, USA), IEEE-NANO’02, Arlington, VA, USA

“Parallelization of nanoMOS2.0 using a 100-nodes Linux cluster”



nanoMOS: cluster “superman”



nanoMOS: results

e ~ 84.2%

e ~ 88.8%

e ~ 95.3%

efficiency = 98.0%
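Efficiency figures such as these are conventionally obtained as speedup divided by the number of processors. The sketch below shows that standard definition; the timings and processor count are hypothetical, since the slides do not give the run times behind the quoted percentages.

```python
def speedup(t_serial_s: float, t_parallel_s: float) -> float:
    """Ratio of single-processor run time to parallel run time."""
    if t_serial_s <= 0 or t_parallel_s <= 0:
        raise ValueError("run times must be positive")
    return t_serial_s / t_parallel_s


def efficiency(t_serial_s: float, t_parallel_s: float, n_procs: int) -> float:
    """Speedup per processor; 1.0 corresponds to linear speedup."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup(t_serial_s, t_parallel_s) / n_procs


# Hypothetical run times, not measurements from the talk.
T_SERIAL_S = 1000.0
T_PARALLEL_S = 12.5
N_PROCS = 100

s = speedup(T_SERIAL_S, T_PARALLEL_S)
e = efficiency(T_SERIAL_S, T_PARALLEL_S, N_PROCS)
print(f"speedup = {s:.1f}, efficiency = {e:.1%}")
```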



Conclusions

Parallel Toolboxes PVMTB-MPITB:

• Fast learning / Fast prototyping of parallel applications on clusters

• Useful for research: small overhead, acceptable efficiency even with 120 CPUs

• Foreseen: efficiency improvement by compiling application M-files (MATLAB compiler) and by linking them against MPI/CLIC and PVM/CLIC.

Lightweight protocol CLIC:

• Portable to any Linux machine, benefits from OS resources/Gbps driver features

• Easy tracking of technology advances: it is not NIC/CPU dependent

• Exposes latency/bandwidth improvements at application level (MPI-PVM/CLIC)

• Foreseen: INICs with embedded processors to offload protocol load from CPU and avoid I/O bus bottleneck (PCI)

Grids are made of clusters / other resources. Inside them, we want/need:

• Communication performance, promptly tracking technology advances (Gbps, INICs)

• Parallel application development environments, benefiting from those improvements
