Title: Parallel computing on nanco - an introductory course
1 Parallel computing on nanco - an introductory course
- Anne Weill-Zrahia
- Technion, Computer Center
- May 2007
3 Parallel Power for HPC
- A closely coupled, scalable set of interconnected computer systems, sharing a common hardware and software infrastructure and providing a parallel set of resources to applications for improved performance, throughput and availability.
4 Resources needed for applications arising from Nanotechnology
- Large memory: Tbytes
- High floating-point computing speed: Tflops
- High data throughput: state of the art
5 Parallel Programming on Nanco
- 1) Parallelization Concepts
- 2) Nanco Computer Design
- 3) Efficient Scalar Design
- 4) Parallel Programming - MPI
- 5) Queuing System - SGE
- 6) Compilers and Tools
7 Parallel classification
- Parallel architectures: Shared Memory / Distributed Memory
- Programming paradigms: Data parallel / Message passing
8 Shared Memory
- Each processor can access any part of the memory
- Access times are uniform (in principle)
- Easier to program (no explicit message passing)
- Bottleneck when several tasks access the same location
9 SMP architecture
[Diagram: four processors (P) connected to a single shared Memory]
10 Distributed Memory
- Each processor can only access its local memory
- Access times depend on location
- Processors must communicate via explicit message passing
11 Distributed Memory
[Diagram: Processor + Memory pairs connected by an Interconnection network]
12 Message Passing Programming
- Separate program on each processor
- Local memory
- Control over distribution and transfer of data
- Additional complexity of debugging due to communications
13 Why not a cluster?
- A single SMP system is easier to purchase/maintain
- Ease of programming in SMP systems
14 Why a cluster?
- Scalability
- Total available physical RAM
- Reduced cost
- But ...
15 Performance issues
- Concurrency: the ability to perform actions simultaneously
- Scalability: performance is not impaired by an increasing number of processors
- Locality: a high ratio of local to remote memory accesses (or low communication)
16 SP2 Benchmark
- Goal: checking performance of real-world applications on the SP2
- Execution time (seconds): CPU time for applications
- Speedup:
  Speedup = (execution time for 1 processor) / (execution time for p processors)
18 2) Nanco design
19 Nanco architecture
20 Configuration
[Diagram: compute nodes node1 ... node64, each with processors (P) and memory (M), connected by an Infiniband Switch]
21 Configuration
- 64 dual-processor compute nodes, each processor a dual-core Opteron Rev. F
- 8 GB RAM per node
- 2 master nodes for H/A, also Opterons
- Infiniband interconnect: switch + HCAs
- NetApp storage
23 Parallel Programming - MPI
24 AMD Opteron processor
25 Memory bottleneck
26 AMD Hypertransport
28 How does this reflect on performance?
29 Performance
- Access to local memory: 1 hop
- Access to the 2nd processor's memory: 2 hops
- Prefetch can be useful for predictable patterns
- Multithreading can be used at node level
30 WHAT is MPI?
- A message-passing library specification
- An extended message-passing model
- Not specific to an implementation or computer
31 BASICS of MPI PROGRAMMING
- MPI is a message-passing library
- It assumes a distributed-memory architecture
- It includes routines for performing communication (exchange of data and synchronization) among the processors
32 Message Passing
- Data transfer + synchronization
- Synchronization: the act of bringing one or more processes to known points in their execution
- Distributed memory: memory split up into segments, each of which may be accessed by only one process
33 Message Passing
[Diagram: handshake between two processes - "May I send?", "yes", then "Send data"]
34 MPI STANDARD
- A standard by consensus, designed in an open forum
- Introduced by the MPI Forum in May 1994, updated in June 1995
- MPI-2 (1998) provides extensions to the MPI standard
35 Why use MPI?
- Standardization
- Portability
- Performance
- Richness
- Designed to enable libraries
36 Writing an MPI Program
- If there is a serial version, make sure it is debugged
- If not, try to write a serial version first
- When debugging in parallel, start with a few nodes first
37 Format of MPI routines
38 Six useful MPI functions
39 Communication routines
40 End MPI part of program
41 Hello world (Fortran)

      program hello
      include 'mpif.h'
      integer ierror, rank, size, tag, i
      integer status(MPI_STATUS_SIZE)
      character*12 message
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100
      if (rank .eq. 0) then
         message = 'Hello, world'
         do i = 1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
     &                    MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     &                 MPI_COMM_WORLD, status, ierror)
      endif
      print *, 'node', rank, ':', message
      call MPI_FINALIZE(ierror)
      end
42 Hello world (C)

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int tag = 100;
    int rank, size, i;
    MPI_Status status;
    char message[12];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    strcpy(message, "Hello,world");
    if (rank == 0) {
        for (i = 1; i < size; i++)
            MPI_Send(message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
    } else {
        MPI_Recv(message, 12, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
    }
    printf("node %d %s\n", rank, message);
    MPI_Finalize();
    return 0;
}
43 MPI Messages
- DATA: the data to be sent
- ENVELOPE: information to route the data
44 Description of MPI_Send (MPI_Recv)
45 Description of MPI_Send (MPI_Recv)
46 Some useful remarks
- Source: MPI_ANY_SOURCE means that any source is acceptable
- Tags specified by sender and receiver must match; with MPI_ANY_TAG, any tag is acceptable
- Communicator must be the same for send/receive; usually MPI_COMM_WORLD
47 Broadcast
- Sends data on one node to all other nodes in the communicator
- MPI_Bcast(buffer, count, datatype, root, comm, ierr)
48 Broadcast
[Diagram: the root's buffer A0 is copied to all processes P0-P3]
49 Performance evaluation
- Fortran:
      real*8 t1
      t1 = MPI_Wtime()   ! returns elapsed time
- C:
      double t1;
      t1 = MPI_Wtime();
50 MPI References
- The MPI Standard: www-unix.mcs.anl.gov/mpi/index.html
- Parallel Programming with MPI, Peter S. Pacheco, Morgan Kaufmann, 1997.
- Using MPI, W. Gropp, Ewing Lusk, Anthony Skjellum, The MIT Press, 1999.
51 Getting started
- Security
- Logging in
- Shell environment
- Transferring files
52 System access - security
- Secure access
- X-tunnelling (for graphics)
- Can use ssh -X for tunnelling
53 Login Environment
- Paths and environment variables have been set up (change things with care)
- TCSH is the default (you can change to bash if you like)
- User-modifiable environment variables are in .cshrc in the home directory
- Home directory is /u/courseXX
54 Compilers
- Options are gcc, gcc4, suncc for C
- g++, sunCC for C++
- g77 (no F90), gfortran, sunf90 for Fortran77/Fortran90
55 Compilation with MPI
- Most MPI implementations support C, C++, Fortran77 and Fortran90 bindings
- Scripts for compilation of type mpif77, mpif90, mpicc etc.
- You can specify generic compiler options
56 Flags for compilation
- sunf90 -fast -xO5 -xarch=amd64a myprog.f -o myprog
- gcc -O3 -march=opteron myprog.c -o myprog
57 5) Queuing system - Sun Grid Engine
58 Sun Grid Engine
- An open-source batch queuing system similar to PBS or LSF
- Automatically runs jobs on less-loaded nodes
- Queues jobs for later execution to avoid overloading the system
59 SGE properties
- Can schedule serial or MPI jobs
- Serial jobs run in individual host queues
- Parallel jobs must include a parallel environment request
60 Working with SGE jobs
- There are commands for querying or modifying the status of a job running or queued by SGE:
- qsub - submit a job
- qstat - query the status of a job
- qdel - delete a job from SGE
61 Submitting a serial job
- Create a submit script (basic.sh):

  #!/bin/sh
  # scalar example
  echo "This code is running on `hostname`"
  date
  # end of script
62 Submitting a serial job
- The job is submitted to SGE using the qsub command:
- qsub basic.sh
63 2 ways of submitting
- With arguments:
  qsub -o outputfile -j y -cwd basic.sh
- In the submit script
64 Monitoring a job - QSTAT
- To list the status and node properties:
- qstat
65 Monitoring a job - qstat
- qstat output, important fields:
- Job identifier
- Job status:
  - qw - queued and waiting
  - t - job transferring and about to start
  - r - job running on listed hosts
  - d - job has been marked for deletion
66 Deleting a job - QDEL
- Single job: qdel 151
- List of jobs: qdel 151 152 153
- All jobs under a user: qdel -u artemis
67 Output produced by jobs
- By default, we get 2 files:
- <script>.o<jobid> - standard output
- <script>.e<jobid> - error messages
- For parallel jobs, also:
- <script>.po<jobid> - list of processors the job ran on
68 Debugging job failures
69 Script for submitting parallel jobs
- mpisub gets as input the number of processors and the executable
- Ex: mpisub 8 <myapp>
70 Parallel MPI jobs and SGE
- SGE uses the concept of a parallel environment (PE)
- Several PEs can coexist on the machine
- Each host has an associated queue and resource list (time, memory)
- A PE is a list of hosts along with a set number of job slots
71 Queue definitions
- System job execution policy
- Resource allocation
- Resource limits
- Accounting
72 Two ways to run a batch job
(1) Parameters in the command line
(2) Parameters in the script file
73 QSUB options
74 Parix run limits
(1) NQS queues on parix
(2) Interactive: maximum CPU time 15 minutes
75 Output of command qstat -a
76 Exercise 1: login and submit a job