Distributed & Parallel Computing Cluster
Patrick McGuigan
[email protected]
2/12/04
DPCC Background
NSF-funded Major Research Instrumentation (MRI) grant
Goals
Personnel
– PI
– Co-PIs
– Senior Personnel
– Systems Administrator
DPCC Goals
Establish a regional Distributed and Parallel Computing Cluster at UTA (DPCC@UTA)
An inter-departmental and inter-institutional facility
Facilitate collaborative research that requires large-scale storage (terabytes to petabytes), high-speed access (gigabit or more), and massive processing (hundreds of processors)
DPCC Research Areas
Data mining / KDD
– Association rules, graph mining, stream processing, etc.
High Energy Physics
– Simulation; moving towards a regional DØ center
Dermatology / skin cancer
– Image database, lesion detection and monitoring
Distributed computing
– Grid computing, PICO
Networking
– Non-intrusive network performance evaluation
Software Engineering
– Formal specification and verification
Multimedia
– Video streaming, scene analysis
Facilitate collaborative efforts that need high-performance computing
DPCC Personnel
PI
– Dr. Chakravarthy
Co-PIs
– Drs. Aslandogan, Das, Holder, Yu
Senior Personnel
– Paul Bergstresser, Kaushik De, Farhad Kamangar, David Kung, Mohan Kumar, David Levine, Jung-Hwan Oh, Gergely Zaruba
Systems Administrator
– Patrick McGuigan
DPCC Components
Establish a distributed memory cluster (150+ processors)
Establish a symmetric (shared-memory) multiprocessor system
Establish large, shareable, high-speed storage (hundreds of terabytes)
DPCC Cluster as of 2/1/2004
Located in 101 GACB
Inauguration 2/23 as part of E-Week
5 racks of equipment + UPS
[Photos of the cluster racks, scaled for presentation]
DPCC Resources
97 machines
– 81 worker nodes
– 2 interactive nodes
– 10 IDE-based RAID servers
– 4 nodes supporting the Fibre Channel SAN
50+ TB storage
– 4.5 TB in each IDE RAID
– 5.2 TB in the FC SAN
DPCC Resources (continued)
1 Gb/s network interconnections
– core switch
– satellite switches
1 Gb/s SAN network
UPS
DPCC Layout
[Diagram: 81 worker nodes (node1–node81) plus supporting servers. Nodes 1–32, the interactive nodes master.dpcc.uta.edu and grid.dpcc.uta.edu, RAID servers raid3, raid6, raid8, raid9, and raid10, the GFS nodes gfsnode1–gfsnode3, and the lockserver connect directly to the Foundry FastIron 800 core switch. The remaining worker nodes attach in groups of nine or ten through Linksys 3512 satellite switches, each group paired with an IDE RAID server (raid1, raid2, raid4, raid5, raid7; 6 TB each). The GFS nodes reach the ArcusII 5.2 TB SAN through a Brocade 3200 FC switch with four open ports. A 100 Mb/s switch provides the uplinks to the campus network.]
DPCC Resource Details
Worker nodes
– Dual Xeon processors
  32 machines @ 2.4 GHz
  49 machines @ 2.6 GHz
– 2 GB RAM
– IDE storage
  32 machines @ 60 GB
  49 machines @ 80 GB
– Red Hat 7.3 Linux (2.4.20 kernel)
DPCC Resource Details (cont.)
RAID server
– Dual Xeon processors (2.4 GHz)
– 2 GB RAM
– 4 RAID controllers
  2-port controller (qty 1): mirrored OS disks
  8-port controllers (qty 3): RAID5 with hot spare
– 24 250 GB disks
– 2 40 GB disks
– NFS used to serve the worker nodes
DPCC Resource Details (cont.)
FC SAN
– RAID5 array
– 42 142 GB FC disks
– FC switch
– 3 GFS nodes
  Dual Xeon (2.4 GHz)
  2 GB RAM
  Global File System (GFS)
  Served to the cluster via NFS
– 1 GFS lock server
Using DPCC
Two nodes available for interactive use
– master.dpcc.uta.edu
– grid.dpcc.uta.edu
More nodes will likely be added to support other services (Web, DB access)
Access through SSH (version 2 client)
– Freeware Windows clients are available (ssh.com)
– File transfers through SCP/SFTP
Using DPCC (continued)
User quotas are not yet implemented on home directories; be sensible in your usage.
Large data sets will be stored on the RAIDs (requires coordination with the systems administrator).
All storage is visible to all nodes.
Getting Accounts
Have your supervisor request an account
Account will be created
Bring your ID to 101 GACB to receive your password
– Keep your password safe
Log in to either interactive machine
– master.dpcc.uta.edu
– grid.dpcc.uta.edu
Use the yppasswd command to change your password
If you forget your password
– See me in my office and I will reset it (bring your ID)
– Call or e-mail me and I will reset it to the original password
User environment
Default shell is bash
– Change it with ypchsh
– Customize the user environment using startup files
  .bash_profile (login sessions)
  .bashrc (non-login)
– Customize with statements like those in the sketch below:
  export <variable>=<value>
  source <shell file>
– Much more information in the bash man page
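A minimal ~/.bash_profile sketch (the MPICH path appears later in these slides; the editor choice is just a placeholder):

# Share settings with non-login (.bashrc) shells
if [ -f ~/.bashrc ]; then
    source ~/.bashrc
fi
# Put the cluster's MPICH tools (mpicc, mpirun) on the search path
export PATH=$PATH:/usr/local/mpich-1.2.5/bin
# Placeholder: preferred editor
export EDITOR=vi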
Program development tools
GCC 2.96
– C
– C++
– Java (gcj)
– Objective-C
– CHILL
Java
– Sun J2SDK 1.4.2
Development tools (cont.)
Python
– python = version 1.5.2
– python2 = version 2.2.2
Perl
– version 5.6.1
Flex, Bison, gdb…
If your favorite tool is not available, we'll consider adding it!
Batch Queue System
OpenPBS
– Server runs on master
– pbs_mom runs on the worker nodes
– Scheduler runs on master
– Jobs can be submitted from any interactive node
– User commands:
  qsub – submit a job for execution
  qstat – determine the status of a job, queue, or the server
  qdel – delete a job from the queue
  qalter – modify attributes of a job
– Single queue (workq)
PBS qsub
qsub is used to submit jobs to PBS
– A job is represented by a shell script
– The shell script can alter the environment and proceed with execution
– The script may contain embedded PBS directives
– The script (not PBS) is responsible for starting parallel jobs
Hello World
[mcguigan@master pbs_examples]$ cat helloworld
echo Hello World from $HOSTNAME
[mcguigan@master pbs_examples]$ qsub helloworld
15795.master.cluster
[mcguigan@master pbs_examples]$ ls
helloworld  helloworld.e15795  helloworld.o15795
[mcguigan@master pbs_examples]$ more helloworld.o15795
Hello World from node1.cluster
Hello World (continued)
Job ID is returned from qsub
Default attributes allow the job to run with:
– 1 node
– 1 CPU
– 36:00:00 CPU time
Standard output and standard error streams are returned as files (the .o and .e files above)
Hello World (continued)
Environment of the job
– Defaults to the login shell (override with #! or the -S switch)
– Login environment variable list with PBS additions:
  PBS_O_HOST, PBS_O_QUEUE, PBS_O_WORKDIR, PBS_ENVIRONMENT, PBS_JOBID, PBS_JOBNAME, PBS_NODEFILE, PBS_QUEUE
– Additional environment variables may be transferred using the -v switch, as in the sketch below
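An illustrative use of -v (the variable name and path are placeholders, not real cluster data):

qsub -v DATADIR=/data31/mydataset helloworld

The job script can then read $DATADIR to locate its input.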
PBS Environment Variables
PBS_ENVIRONMENT  PBS_JOBCOOKIE  PBS_JOBID  PBS_JOBNAME
PBS_MOMPORT  PBS_NODENUM  PBS_O_HOME  PBS_O_HOST
PBS_O_LANG  PBS_O_LOGNAME  PBS_O_MAIL  PBS_O_PATH
PBS_O_QUEUE  PBS_O_SHELL  PBS_O_WORKDIR
PBS_QUEUE=workq  PBS_TASKNUM=1
qsub options
Output streams
– -e (error output path)
– -o (standard output path)
– -j (join error + output as either output or error)
Mail options
– -m [aben] when to mail (abort, begin, end, none)
– -M who to mail
Name of job
– -N (max 15 printable characters; the first must be alphabetic)
Which queue to submit the job to
– -q [name] (unimportant for now)
Environment variables
– -v pass specific variables
– -V pass all environment variables of qsub to the job
Additional attributes
– -W specify dependencies
qsub options (continued)
-l switch used to specify needed resources
– Number of nodes: nodes=x
– Number of processors: ncpus=x
– CPU time: cput=hh:mm:ss
– Walltime: walltime=hh:mm:ss
See the pbs_resources man page
Hello World
qsub -l nodes=1 -l ncpus=1 -l cput=36:00:00 -N helloworld -m a -q workq helloworld
Options can be included in the script:
#PBS -l nodes=1
#PBS -l ncpus=1
#PBS -m a
#PBS -N helloworld2
#PBS -l cput=36:00:00
echo Hello World from $HOSTNAME
qstat
Used to determine the status of jobs, queues, and the server
$ qstat
$ qstat <job id>
Switches
– -u <user> list jobs of a user
– -f provides extra output
– -n shows the nodes given to a job
– -q status of the queue
– -i show idle jobs
qdel & qalter
qdel is used to remove a job from a queue
– qdel <job ID>
qalter is used to alter the attributes of a currently queued job
– qalter takes attributes similar to qsub; see the examples below
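Illustrative commands, reusing the job ID returned by qsub earlier (the new values are examples only):

qdel 15795.master.cluster
qalter -l cput=12:00:00 -N newname 15795.master.cluster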
Processing on a worker node
All RAID storage is visible to all nodes
– /dataxy, where x is the RAID ID and y is the volume (1-3)
– /gfsx, where x is the GFS volume (1-3)
Local storage on each worker node
– /scratch
Data-intensive applications should copy input data (when possible) to /scratch for manipulation and copy results back to RAID storage, as in the sketch below
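A minimal sketch of the copy-in/copy-out pattern (the data path, program name, and file names are hypothetical):

#!/bin/sh
#PBS -N scratch_example
#PBS -l nodes=1
# Stage input from shared RAID storage onto the node's local disk
cp /data31/input.dat /scratch/input.dat
cd /scratch
# Hypothetical analysis program reading the local copy
~/bin/myanalysis input.dat > output.dat
# Copy results back to shared storage and clean up the local disk
cp /scratch/output.dat /data31/
rm /scratch/input.dat /scratch/output.dat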
Parallel Processing
MPI installed on interactive and worker nodes
– MPICH 1.2.5
– Path: /usr/local/mpich-1.2.5
Asking for multiple processors (example below)
– -l nodes=x
– -l ncpus=2x
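For example, to ask for four of the dual-processor nodes (x = 4, so 8 CPUs; "myjob" is a placeholder script name):

qsub -l nodes=4 -l ncpus=8 myjob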
Parallel Processing (continued)
PBS node file is created when the job executes
Available to the job via $PBS_NODEFILE
Used to start processes on remote nodes
– mpirun
– rsh
Using node file (example job)
#!/bin/sh
#PBS -m n
#PBS -l nodes=3:ppn=2
#PBS -l walltime=00:30:00
#PBS -j oe
#PBS -o helloworld.out
#PBS -N helloworld_mpi
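# Count the processors PBS assigned (one line per processor in the node file)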
NN=`cat $PBS_NODEFILE | wc -l`
echo "Processors received = "$NN
echo "script running on host `hostname`"
cd $PBS_O_WORKDIR
echo
echo "PBS NODE FILE"
cat $PBS_NODEFILE
echo
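# Launch the MPI program across the assigned processors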
/usr/local/mpich-1.2.5/bin/mpirun -machinefile $PBS_NODEFILE -np $NN ~/mpi-example/helloworld
MPI
Shared memory vs. message passing
MPI
– C-based library that allows programs to communicate
– Each cooperating process runs the same program image
– Different processes can do different computations based on the notion of rank
– MPI primitives allow construction of more sophisticated synchronization mechanisms (barrier, mutex)
helloworld.c
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size;
    char host[256];
    int val;

    /* Record which cluster node this process is running on */
    val = gethostname(host, 255);
    if (val != 0) {
        strcpy(host, "UNKNOWN");
    }

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    printf("Hello world from node %s: process %d of %d\n", host, rank, size);
    MPI_Finalize();
    return 0;
}
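helloworld.c prints from every rank but never exchanges a message. A minimal point-to-point sketch under the same MPICH setup (illustrative, not from the original deck; assumes the job runs with at least two processes):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size, token;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 sends a token to rank 1 */
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("Process 0 sent token %d\n", token);
    } else if (rank == 1) {
        /* Rank 1 receives the token from rank 0 */
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 received token %d\n", token);
    }

    /* Wait for all processes before exiting */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}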
Using MPI programs
Compiling
– $ /usr/local/mpich-1.2.5/bin/mpicc -o helloworld helloworld.c
Executing
– $ /usr/local/mpich-1.2.5/bin/mpirun <options> helloworld
Common options:
– -np number of processes to create
– -machinefile list of nodes to run on
Resources for MPI
http://www-hep.uta.edu/~mcguigan/dpcc/mpi
– MPI documentation
http://www-unix.mcs.anl.gov/mpi/indexold.html
– Links to various tutorials
Parallel programming course