Title: Cluster Workshop
1. Cluster Workshop
For COMP RPG students, 17 May 2010
High Performance Cluster Computing Centre (HPCCC)
Faculty of Science, Hong Kong Baptist University
2. Outline
- Overview of Cluster Hardware and Software
- Basic Login and Running Programs in a Job Queuing System
- Introduction to Parallelism
  - Why Parallelism
  - Cluster Parallelism
  - OpenMP
  - Message Passing Interface
- Parallel Program Examples
- Policy for using sciblade.sci.hkbu.edu.hk
- http://www.sci.hkbu.edu.hk/hpccc/sciblade
3. Overview of Cluster Hardware and Software
4. Cluster Hardware
- This 256-node PC cluster (sciblade) consists of
  - Master node x 2
  - IO nodes x 3 (storage)
  - Compute nodes x 256
  - Blade chassis x 16
  - Management network
  - Interconnect fabric
  - 1U console and KVM switch
  - Emerson Liebert Nxa 120kVA UPS
5. Sciblade Cluster
- A 256-node cluster supported by funding from the RGC
6. Hardware Configuration
- Master node
  - Dell PE1950, 2x Xeon E5450 3.0GHz (Quad Core)
  - 16GB RAM, 73GB x 2 SAS drives
- IO nodes (storage)
  - Dell PE2950, 2x Xeon E5450 3.0GHz (Quad Core)
  - 16GB RAM, 73GB x 2 SAS drives
  - 3TB storage (Dell PE MD3000)
- Compute nodes x 256, each with
  - Dell PE M600 blade server with InfiniBand network
  - 2x Xeon E5430 2.66GHz (Quad Core)
  - 16GB RAM, 73GB SAS drive
7. Hardware Configuration
- Blade chassis x 16
  - Dell PE M1000e; each hosts 16 blade servers
- Management network
  - Dell PowerConnect 6248 (Gigabit Ethernet) x 6
- Interconnect fabric
  - QLogic SilverStorm 9120 switch
- Console and KVM switch
  - Dell AS-180 KVM
  - Dell 17FP rack console
- Emerson Liebert Nxa 120kVA UPS
8. Software List
- Operating system
  - ROCKS 5.1 cluster OS
  - CentOS 5.3, kernel 2.6.18
- Job management system
  - Portable Batch System (PBS)
  - Maui scheduler
- Compilers and languages
  - Intel Fortran/C/C++ Compiler for Linux v11
  - GNU 4.1.2/4.4.0 Fortran/C/C++ compilers
9. Software List
- Message Passing Interface (MPI) libraries
  - MVAPICH 1.1
  - MVAPICH2 1.2
  - Open MPI 1.3.2
- Mathematical libraries
  - ATLAS 3.8.3
  - FFTW 2.1.5 / 3.2.1
  - SPRNG 2.0a (C/Fortran) / 4.0 (C/Fortran)
10. Software List
- Molecular dynamics and quantum chemistry
  - Gromacs 4.0.7
  - GAMESS 2009R1
  - Gaussian 03
  - NAMD 2.7b1
- Third-party applications
  - FDTD simulation
  - MATLAB 2008b
  - TAU 2.18.2, VisIt 1.11.2
  - Xmgrace 5.1.22
  - etc.
11. Software List
- Queuing system
  - Torque/PBS
  - Maui scheduler
- Editors
  - vi
  - emacs
12. Hostnames
- Master node
  - External: sciblade.sci.hkbu.edu.hk
  - Internal: frontend-0
- IO nodes (storage)
  - pvfs2-io-0-0, pvfs2-io-0-1, pvfs2-io-0-2
- Compute nodes
  - compute-0-0.local, ..., compute-0-255.local
13. Basic Login and Running Programs in a Job Queuing System
14. Basic login
- Remote login to the master node
- Terminal login
  - Using secure shell:
  - ssh -l username sciblade.sci.hkbu.edu.hk
- Graphical login
  - PuTTY + vncviewer, e.g.
  - username@sciblade $ vncserver
  - New 'sciblade.sci.hkbu.edu.hk:3 (username)' desktop is sciblade.sci.hkbu.edu.hk:3
  - This means that your session will run on display 3.
15. Graphical login
- Use PuTTY to set up a secured connection with Host Name: sciblade.sci.hkbu.edu.hk
16. Graphical login (cont.)
17. Graphical login (cont.)
- Port = 5900 + display number (i.e. 3 in this case)
18. Graphical login (cont.)
- Next, click Open and log in to sciblade.
- Finally, run VNC Viewer on your PC and enter "localhost:3" (3 is the display number).
- You should terminate your VNC session after you have finished your work. To terminate the VNC session running on sciblade, run the command:
  - vncserver -kill :3
19. Linux commands
- Both the master and compute nodes run Linux.
- Frequently used Linux commands in the PC cluster: http://www.sci.hkbu.edu.hk/hpccc/sciblade/faq_sciblade.php

  cp       cp f1 f2 dir1             copy files f1 and f2 into directory dir1
  mv       mv f1 dir1                move/rename file f1 into dir1
  tar      tar xzvf abc.tar.gz       uncompress and untar a tar.gz format file
  tar      tar czvf abc.tar.gz abc   create an archive file with gzip compression
  cat      cat f1 f2                 print the contents of files f1 and f2
  diff     diff f1 f2                compare text between two files
  grep     grep student              search all files for the word "student"
  history  history 50                list the last 50 commands stored in the shell
  kill     kill -9 2036              terminate the process with pid 2036
  man      man tar                   display the on-line manual page
  nohup    nohup runmatlab a         run matlab (a.m) without hangup after logout
  ps       ps -ef                    list all processes running on the system
  sort     sort -r -n studno         sort studno in reverse numerical order
20. ROCKS-specific commands
- ROCKS provides the following commands for users to run programs on all compute nodes, e.g.
  - cluster-fork
    - Run a program on all compute nodes
    - cluster-fork ps: check user processes on each compute node
  - cluster-kill
    - Kill a user's processes on all nodes at one time
  - tentakel
    - Similar to cluster-fork but runs faster
21. Ganglia
- Web-based management and monitoring
- http://sciblade.sci.hkbu.edu.hk/ganglia
22. Why Parallelism
23. Why Parallelism: Passively
- Suppose you are using the most efficient algorithm with an optimal implementation, but the program still takes too long or does not even fit onto your machine.
- Parallelization is the last chance.
24. Why Parallelism: Initiative
- Faster
  - Finish the work earlier: the same work in a shorter time
  - Do more work: more work in the same time
- Most importantly, you want to predict the result before the event occurs
25. Examples
- Many scientific and engineering problems require enormous computational power. A few fields to mention:
  - Quantum chemistry, statistical mechanics, and relativistic physics
  - Cosmology and astrophysics
  - Computational fluid dynamics and turbulence
  - Material design and superconductivity
  - Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
  - Medicine, and modeling of human organs and bones
  - Global weather and environmental modeling
  - Machine vision
26. Parallelism
- The upper bound for the computing power that can be obtained from a single processor is limited by the fastest processor available at any given time.
- This upper bound can be raised dramatically by integrating a set of processors together.
- Synchronization and exchange of partial results among processors are therefore unavoidable.
27. Multiprocessing and Clustering
- Parallel computer architecture
  - Shared memory: symmetric multiprocessors (SMP)
  - Distributed memory: cluster
28. Clustering Pros and Cons
- Advantages
  - Memory is scalable with the number of processors: increasing the number of processors increases the memory size and bandwidth as well.
  - Each processor can rapidly access its own memory without interference.
- Disadvantages
  - It is difficult to map existing data structures to this memory organization.
  - The user is responsible for sending and receiving data among processors.
29. TOP500 Supercomputer Sites (www.top500.org)
30. Cluster Parallelism
31. Parallel Programming Paradigms
- Multithreading
  - OpenMP (shared memory only)
- Message passing
  - MPI (Message Passing Interface)
  - PVM (Parallel Virtual Machine)
  - (works on both shared memory and distributed memory)
32. Distributed Memory
- Programmer's view
  - Several CPUs
  - Several blocks of memory
  - Several threads of action
- Parallelization
  - Done by hand
- Example
  - MPI
33. Message Passing Model
- Message passing: the method by which data from one processor's memory is copied to the memory of another processor.
- Process: a set of executable instructions (a program) that runs on a processor.
- Message passing systems generally associate only one process per processor, and the terms "processes" and "processors" are used interchangeably.
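To illustrate this model (an added sketch, not part of the original slides), the following minimal C program copies one integer from the memory of process 0 to the memory of process 1 with MPI_Send/MPI_Recv:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          value = 42;
          /* copy the data in 'value' from rank 0's memory to rank 1 */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("Process 1 received %d from process 0\n", value);
      }
      MPI_Finalize();
      return 0;
  }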
34. OpenMP
35. OpenMP Mission
- The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix and Windows NT platforms.
- It is jointly defined by a group of major computer hardware and software vendors.
- OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.
36. OpenMP compiler choice
- gcc 4.4.0 or above
  - compile with -fopenmp
- Intel 10.1 or above
  - compile with -Qopenmp on Windows
  - compile with -openmp on Linux
- PGI compiler
  - compile with -mp
- Absoft Pro Fortran
  - compile with -openmp
37. Sample OpenMP example
  #include <omp.h>
  #include <stdio.h>
  int main()
  {
      #pragma omp parallel
      printf("Hello from thread %d, nthreads %d\n",
             omp_get_thread_num(), omp_get_num_threads());
      return 0;
  }
38. serial-pi.c
  #include <stdio.h>
  static long num_steps = 10000000;
  double step;
  int main()
  {
      int i; double x, pi, sum = 0.0;
      step = 1.0 / (double) num_steps;
      for (i = 0; i < num_steps; i++) {
          x = (i + 0.5) * step;
          sum = sum + 4.0 / (1.0 + x * x);
      }
      pi = step * sum;
      printf("Est Pi = %f\n", pi);
      return 0;
  }
39. OpenMP version: spmd-pi.c
  #include <omp.h>
  #include <stdio.h>
  static long num_steps = 10000000;
  double step;
  #define NUM_THREADS 8
  int main()
  {
      int i, nthreads; double pi, sum[NUM_THREADS];
      step = 1.0 / (double) num_steps;
      omp_set_num_threads(NUM_THREADS);
      #pragma omp parallel
      {
          int i, id, nthrds;
          double x;
          id = omp_get_thread_num();
          nthrds = omp_get_num_threads();
          if (id == 0) nthreads = nthrds;
          for (i = id, sum[id] = 0.0; i < num_steps; i += nthrds) {
              x = (i + 0.5) * step;
              sum[id] += 4.0 / (1.0 + x * x);
          }
      }
      /* closing reduction (truncated on the slide): combine the per-thread partial sums */
      for (i = 0, pi = 0.0; i < nthreads; i++) pi += sum[i] * step;
      printf("Est Pi = %f\n", pi);
      return 0;
  }
40. Message Passing Interface (MPI)
41. MPI
- MPI is a library, not a language, for parallel programming.
- An MPI implementation consists of
  - a subroutine library with all MPI functions
  - include files for the calling application program
  - some startup script (usually called mpirun, but not standardized)
- Include the header file mpi.h (or however it is called) in the source code.
- Libraries are available for all major imperative languages (C, C++, Fortran, ...).
42. General MPI Program Structure
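The structure diagram from the original slide is not reproduced here. As a sketch of what it conveys, a typical MPI program in C has the following shape (the comments and placeholder bodies are illustrative, not from the slide):

  #include <mpi.h>
  /* serial code: headers, declarations */

  int main(int argc, char *argv[])
  {
      MPI_Init(&argc, &argv);     /* begin the parallel section */

      /* ... parallel work: message passing calls and computation ... */

      MPI_Finalize();             /* end the parallel section */
      /* serial code */
      return 0;
  }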
43. Sample Program: Hello World!
- In this modified version of the "Hello World" program, each processor prints its rank as well as the total number of processors in the communicator MPI_COMM_WORLD.
- Notes:
  - It makes use of the pre-defined communicator MPI_COMM_WORLD.
  - It does not test the error status of the routines!
44. Sample Program: Hello World!
  #include <stdio.h>
  #include "mpi.h"                        /* MPI compiler header file */

  void main(int argc, char **argv)
  {
      int nproc, myrank, ierr;
      ierr = MPI_Init(&argc, &argv);      /* MPI initialization */
      /* Get number of MPI processes */
      MPI_Comm_size(MPI_COMM_WORLD, &nproc);
      /* Get process id for this processor */
      MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
      printf("Hello World!! I'm process %d of %d\n", myrank, nproc);
      ierr = MPI_Finalize();              /* Terminate all MPI processes */
  }
45. Performance
- When we write a parallel program, it is important to identify the fraction of the program that can be parallelized and to maximize it.
- The goals are
  - load balance
  - memory usage balance
  - minimized communication overhead
  - reduced sequential bottlenecks
  - scalability
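A common way to quantify this (not stated on the slide, but the standard rule of thumb) is Amdahl's law: if a fraction p of the runtime can be parallelized and N processors are used, the best achievable speedup is S(N) = 1 / ((1 - p) + p/N). For example, with p = 0.9 and N = 8, S(8) = 1 / (0.1 + 0.9/8) ≈ 4.7, so the remaining 10% of serial work already limits an 8-processor run to well under the ideal 8x speedup.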
46. Compiling and Running MPI Programs
- Using MVAPICH 1.1
- Set the path; at the command prompt, type:
  - export PATH=/u1/local/mvapich1/bin:$PATH
  - (or uncomment this line in .bashrc)
- Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
  - mpicc -o cpi cpi.c
- Prepare a hostfile (e.g. machines) listing the compute nodes:
  - compute-0-0
  - compute-0-1
  - compute-0-2
  - compute-0-3
- Run the program on a number of processor nodes:
  - mpirun -np 4 -machinefile machines ./cpi
47. Compiling and Running MPI Programs
- Using MVAPICH2 1.2
- Prepare .mpd.conf (and .mpd.passwd) and save it in your home directory:
  - MPD_SECRETWORD=gde1234-3
  - (you may set your own secret word)
- Set the environment for MVAPICH2 1.2:
  - export MPD_BIN=/u1/local/mvapich2
  - export PATH=$MPD_BIN:$PATH
  - (or uncomment these lines in .bashrc)
- Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
  - mpicc -o cpi cpi.c
- Prepare a hostfile (e.g. machines) with one hostname per line, as in the previous section.
48. Compiling and Running MPI Programs
- mpdboot with the hostfile:
  - mpdboot -n 4 -f machines
- Run the program on a number of processor nodes:
  - mpiexec -np 4 ./cpi
- Remember to clean up after running jobs by mpdallexit:
  - mpdallexit
49. Compiling and Running MPI Programs
- Using Open MPI 1.2
- Set the environment for Open MPI:
  - export LD_LIBRARY_PATH=/u1/local/openmpi/lib:$LD_LIBRARY_PATH
  - export PATH=/u1/local/openmpi/bin:$PATH
  - (or uncomment these lines in .bashrc)
- Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
  - mpicc -o cpi cpi.c
- Prepare a hostfile (e.g. machines) with one hostname per line, as in the previous section.
- Run the program on a number of processor nodes:
  - mpirun -np 4 -machinefile machines ./cpi
50. Submit parallel jobs into the Torque batch queue
- Prepare a job script, say omp.pbs, like the following:

  #!/bin/sh
  # Job name
  #PBS -N OMP-spmd
  # Declare job non-rerunable
  #PBS -r n
  # Mail to user
  #PBS -m ae
  # Queue name (small, medium, long, verylong)
  # Number of nodes
  #PBS -l nodes=1:ppn=8
  #PBS -l walltime=00:08:00
  cd $PBS_O_WORKDIR
  export OMP_NUM_THREADS=8
  ./omp-test
  ./serial-pi
  ./omp-spmd-pi

- Submit it using qsub:
  - qsub omp.pbs
51. Another example of a PBS script
- Prepare a job script, say script.sh, like the following:

  #!/bin/sh
  # Job name
  #PBS -N Sorting
  # Declare job non-rerunable
  #PBS -r n
  # Number of nodes
  #PBS -l nodes=4
  #PBS -l walltime=08:00:00
  # This job's working directory
  echo Working directory is $PBS_O_WORKDIR
  cd $PBS_O_WORKDIR
  echo Running on host `hostname`
  echo Time is `date`
  echo Directory is `pwd`
  echo This job runs on the following processors:
  echo `cat $PBS_NODEFILE`
  # Define number of processors
  NPROCS=`wc -l < $PBS_NODEFILE`
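The tail of the script is cut off on the slide; it presumably finishes by launching the MPI executable on the allocated nodes. A hypothetical closing line, using the qsort binary from the sorting example and the mpirun syntax shown earlier:

  # hypothetical closing line (not shown on the slide)
  mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./qsort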
52. Parallel Program Examples
53. Example 1: Estimation of Pi (OpenMP)
  #include <omp.h>
  #include <stdio.h>
  static long num_steps = 10000000;
  double step;
  #define NUM_THREADS 8
  int main()
  {
      int i, nthreads; double pi, sum[NUM_THREADS];
      step = 1.0 / (double) num_steps;
      omp_set_num_threads(NUM_THREADS);
      #pragma omp parallel
      {
          int i, id, nthrds;
          double x;
          id = omp_get_thread_num();
          nthrds = omp_get_num_threads();
          if (id == 0) nthreads = nthrds;
          for (i = id, sum[id] = 0.0; i < num_steps; i += nthrds) {
              x = (i + 0.5) * step;
              sum[id] += 4.0 / (1.0 + x * x);
          }
      }
      /* closing reduction (truncated on the slide): combine the per-thread partial sums */
      for (i = 0, pi = 0.0; i < nthreads; i++) pi += sum[i] * step;
      printf("Est Pi = %f\n", pi);
      return 0;
  }
54. Example 2a: Sorting - Quick Sort
- The quick sort is an in-place, divide-and-conquer, massively recursive sort.
- The efficiency of the algorithm is heavily affected by which element is chosen as the pivot point.
- The worst-case efficiency of the quick sort, O(n^2), occurs when the list is already sorted and the left-most element is chosen as the pivot.
- If the data to be sorted are not random, choosing the pivot point randomly is recommended. As long as the pivot point is chosen randomly, the quick sort has an algorithmic complexity of O(n log n).
- Pros: extremely fast.
- Cons: very complex algorithm, massively recursive.
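For reference, a compact serial quick sort in C with a randomly chosen pivot, along the lines described above (an added sketch, not taken from the workshop code):

  #include <stdio.h>
  #include <stdlib.h>

  /* in-place, recursive quick sort with a random pivot */
  static void quicksort(int *a, int lo, int hi)
  {
      if (lo >= hi) return;
      /* pick a random pivot to avoid the sorted-input worst case */
      int p = a[lo + rand() % (hi - lo + 1)];
      int i = lo, j = hi;
      while (i <= j) {              /* partition around the pivot value */
          while (a[i] < p) i++;
          while (a[j] > p) j--;
          if (i <= j) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
      }
      quicksort(a, lo, j);          /* recurse on both halves */
      quicksort(a, i, hi);
  }

  int main(void)
  {
      int a[] = {5, 2, 9, 1, 7, 3};
      int n = sizeof(a) / sizeof(a[0]);
      quicksort(a, 0, n - 1);
      for (int i = 0; i < n; i++) printf("%d ", a[i]);
      printf("\n");
      return 0;
  }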
55. Quick Sort Performance
  Processes   Time
  1           0.410000
  2           0.300000
  4           0.180000
  8           0.180000
  16          0.180000
  32          0.220000
  64          0.680000
  128         1.300000
56. Example 2b: Sorting - Bubble Sort
- The bubble sort is the oldest and simplest sort in use. Unfortunately, it is also the slowest.
- The bubble sort works by comparing each item in the list with the item next to it and swapping them if required.
- The algorithm repeats this process until it makes a pass all the way through the list without swapping any items (in other words, all items are in the correct order).
- This causes larger values to "bubble" to the end of the list while smaller values "sink" towards the beginning of the list.
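Likewise, a simple serial bubble sort in C that keeps making passes until no swaps occur (an added sketch, not taken from the workshop code):

  #include <stdio.h>

  /* bubble sort: repeat passes until a full pass makes no swaps */
  static void bubblesort(int *a, int n)
  {
      int swapped = 1;
      while (swapped) {
          swapped = 0;
          for (int i = 0; i < n - 1; i++) {
              if (a[i] > a[i + 1]) {        /* swap adjacent out-of-order items */
                  int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
                  swapped = 1;
              }
          }
      }
  }

  int main(void)
  {
      int a[] = {5, 2, 9, 1, 7, 3};
      int n = sizeof(a) / sizeof(a[0]);
      bubblesort(a, n);
      for (int i = 0; i < n; i++) printf("%d ", a[i]);
      printf("\n");
      return 0;
  }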
57. Bubble Sort Performance
  Processes   Time
  1           3242.327
  2           806.346
  4           276.4646
  8           78.45156
  16          21.031
  32          4.8478
  64          2.03676
  128         1.240197
58. Monte Carlo Integration
- "Hit and miss" integration
- The integration scheme is to take a large number of random points and count the number that fall under f(x) to get the area.
59. Monte Carlo Integration
- Monte Carlo integration to estimate Pi
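As an added illustration of the hit-and-miss scheme (not from the slides): sample random points in the unit square and count those falling inside the quarter circle x^2 + y^2 <= 1; the hit fraction approaches pi/4, so 4 x (hits/trials) estimates pi.

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      long trials = 10000000, hits = 0;
      for (long i = 0; i < trials; i++) {
          double x = (double) rand() / RAND_MAX;   /* random point in the unit square */
          double y = (double) rand() / RAND_MAX;
          if (x * x + y * y <= 1.0) hits++;        /* "hit": inside the quarter circle */
      }
      printf("Est Pi = %f\n", 4.0 * (double) hits / trials);
      return 0;
  }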
60. Parallel Program Examples
- Example 1: omp
  - Files: omp/test-omp.c, omp/serial-pi.c, omp/spmd-pi.c
  - Compile the programs with the command: make
  - Run the program in parallel: ./omp-spmd-pi
  - Submit the job to PBS: qsub omp.pbs
- Example 2: prime
  - Files: prime/prime.c, prime/prime.f90, prime/primeParallel.c, prime/Makefile, prime/machines
  - Compile with the command: make
  - Run the serial program: ./primeC or ./primeF
  - Run the parallel program: mpirun -np 4 -machinefile machines ./primeMPI
- Example 3: sorting
  - Files: sorting/qsort.c, sorting/bubblesort.c, sorting/script.sh, sorting/qsort, sorting/bubblesort
  - Submit the job to the PBS queuing system: qsub script.sh
- Example 4: pmatlab
  - Files: pmatlab/startup.m, pmatlab/RUN.m, pmatlab/sample-pi.m
  - Submit the job to PBS: qsub Qpmatlab.pbs
61. Policy for using sciblade.sci.hkbu.edu.hk
62. Policy
1. Every user shall apply for his/her own computer user account to log in to the master node of the PC cluster, sciblade.sci.hkbu.edu.hk.
2. The user must not share his/her account and password with other users.
3. Every user must deliver jobs to the PC cluster from the master node via the PBS job queuing system. Automatic dispatching of jobs using scripts or robots is not allowed.
4. Users are not allowed to log in to the compute nodes.
5. Foreground jobs on the PC cluster are restricted to program testing, and their duration should not exceed 1 minute of CPU time per job.
63. Policy (continued)
6. Any background jobs run on the master node or compute nodes are strictly prohibited and will be killed without prior notice.
7. The current restrictions of the job queuing system are as follows:
   - The maximum number of running jobs in the job queue is 8.
   - The maximum total number of CPU cores in use at one time cannot exceed 512.
8. The restrictions in item 7 will be reviewed from time to time as the number of users and the computation needs grow.