Title: Parallel computing on nanco - an introductory course
1 Parallel computing on nanco - an introductory course
- Anne Weill-Zrahia
- Technion, Computer Center
- May 2007
3 Parallel Power for HPC
- A closely coupled, scalable set of interconnected computer systems, sharing a common hardware and software infrastructure and providing a parallel set of resources to applications for improved performance, throughput and availability.
4 Resources needed for applications arising from Nanotechnology
- Large memory: Tbytes
- High floating-point computing speed: Tflops
- High data throughput: state of the art
5 Parallel Programming on Nanco
- 1) Parallelization Concepts
- 2) Nanco Computer Design
- 3) Efficient Scalar Design
- 4) Parallel Programming - MPI
- 5) Queuing System - SGE
- 6) Compilers and Tools
7 Parallel classification
- Parallel architectures: Shared Memory / Distributed Memory
- Programming paradigms: Data parallel / Message passing
8 Shared Memory
- Each processor can access any part of the memory
- Access times are uniform (in principle)
- Easier to program (no explicit message passing)
- Bottleneck when several tasks access the same location
9 SMP architecture
[Diagram: four processors (P) connected to a single shared Memory]
10 Distributed Memory
- Each processor can only access its local memory
- Access times depend on location
- Processors must communicate via explicit message passing
11 Distributed Memory
[Diagram: Processor + Memory pairs connected by an Interconnection network]
12 Message Passing Programming
- Separate program on each processor
- Local memory
- Control over distribution and transfer of data
- Additional complexity of debugging due to communications
13 Why not a cluster?
- A single SMP system is easier to purchase/maintain
- Ease of programming in SMP systems
14 Why a cluster?
- Scalability
- Total available physical RAM
- Reduced cost
- But ...
15 Performance issues
- Concurrency: the ability to perform actions simultaneously
- Scalability: performance is not impaired by an increasing number of processors
- Locality: a high ratio of local to remote memory accesses (or low communication)
16 SP2 Benchmark
- Goal: checking performance of real-world applications on the SP2
- Execution time (seconds): CPU time for applications
- Speedup:
  Speedup = (execution time for 1 processor) / (execution time for p processors)
18 2) Nanco design
19 Nanco architecture
20 Configuration
[Diagram: compute nodes node1 ... node64, each with processors (P) and memory (M), connected by an Infiniband Switch]
21 Configuration
- 64 dual-processor compute nodes, each processor a dual-core Opteron Rev. F
- 8 GB RAM per node
- 2 master nodes for H/A, also Opterons
- Infiniband interconnect: switch + HCAs
- NetApp storage
23 Parallel Programming - MPI
24 AMD Opteron processor
25 Memory bottleneck
26 AMD Hypertransport
28 How does this reflect on performance?
29 Performance
- Access to local memory: 1 hop
- Access to the 2nd processor's memory: 2 hops
- Prefetch can be useful for predictable patterns
- Multithreading can be used at node level
30 WHAT is MPI?
- A message-passing library specification
- An extended message-passing model
- Not specific to an implementation or computer
31 BASICS of MPI PROGRAMMING
- MPI is a message-passing library
- It assumes a distributed-memory architecture
- It includes routines for performing communication (exchange of data and synchronization) among the processors
32 Message Passing
- Data transfer + synchronization
- Synchronization: the act of bringing one or more processes to known points in their execution
- Distributed memory: memory split up into segments, each of which may be accessed by only one process
33 Message Passing
[Diagram: handshake between two processes - "May I send?", "yes", then "Send data"]
34 MPI STANDARD
- A standard by consensus, designed in an open forum
- Introduced by the MPI Forum in May 1994, updated in June 1995
- MPI-2 (1998) provides extensions to the MPI standard
35 Why use MPI?
- Standardization
- Portability
- Performance
- Richness
- Designed to enable libraries
36 Writing an MPI Program
- If there is a serial version, make sure it is debugged
- If not, try to write a serial version first
- When debugging in parallel, start with a few nodes first
37 Format of MPI routines
38 Six useful MPI functions
39 Communication routines
40 End MPI part of program
41 Hello world (Fortran)

      program hello
      include 'mpif.h'
      integer ierror, rank, size, tag, i
      integer status(MPI_STATUS_SIZE)
      character*12 message
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100
      if (rank .eq. 0) then
         message = 'Hello, world'
         do i = 1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
     &                    MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     &                 MPI_COMM_WORLD, status, ierror)
      endif
      print *, 'node', rank, ':', message
      call MPI_FINALIZE(ierror)
      end
42 Hello world (C)

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int tag = 100;
    int rank, size, i;
    MPI_Status status;
    char message[12];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    strcpy(message, "Hello,world");
    if (rank == 0) {
        for (i = 1; i < size; i++)
            MPI_Send(message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
    } else {
        MPI_Recv(message, 12, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
    }
    printf("node %d %s\n", rank, message);
    MPI_Finalize();
    return 0;
}
43 MPI Messages
- DATA: the data to be sent
- ENVELOPE: information to route the data
44 Description of MPI_Send (MPI_Recv)
45 Description of MPI_Send (MPI_Recv)
46 Some useful remarks
- Source: MPI_ANY_SOURCE means that any source is acceptable
- Tags specified by sender and receiver must match; with MPI_ANY_TAG, any tag is acceptable
- Communicator must be the same for send/receive; usually MPI_COMM_WORLD
47 Broadcast
- Sends data on one node to all other nodes in the communicator
- MPI_Bcast(buffer, count, datatype, root, comm, ierr)
48 Broadcast
[Diagram: the root's buffer A0 is copied to all processes P0-P3]
49 Performance evaluation
- Fortran:
      real*8 t1
      t1 = MPI_Wtime()   ! returns elapsed time
- C:
      double t1;
      t1 = MPI_Wtime();
50 MPI References
- The MPI Standard: www-unix.mcs.anl.gov/mpi/index.html
- Parallel Programming with MPI, Peter S. Pacheco, Morgan Kaufmann, 1997.
- Using MPI, W. Gropp, Ewing Lusk, Anthony Skjellum, The MIT Press, 1999.
51 Getting started
- Security
- Logging in
- Shell environment
- Transferring files
52 System access - security
- Secure access
- X-tunnelling (for graphics)
- Can use ssh -X for tunnelling
53 Login Environment
- Paths and environment variables have been set up (change things with care)
- TCSH is the default (you can change to bash if you like)
- User-modifiable environment variables are in .cshrc in the home directory
- Home directory is /u/courseXX
54 Compilers
- Options are gcc, gcc4, suncc for C
- g++, sunCC for C++
- g77 (no F90), gfortran, sunf90 for Fortran77/Fortran90
55 Compilation with MPI
- Most MPI implementations support C, C++, Fortran77 and Fortran90 bindings
- Scripts for compilation of type mpif77, mpif90, mpicc etc.
- You can specify generic compiler options
56 Flags for compilation
- sunf90 -fast -xO5 -xarch=amd64a myprog.f -o myprog
- gcc -O3 -march=opteron myprog.c -o myprog
57 5) Queuing system - Sun Grid Engine
58 Sun Grid Engine
- An open-source batch queuing system similar to PBS or LSF
- Automatically runs jobs on less-loaded nodes
- Queues jobs for later execution to avoid overloading the system
59 SGE properties
- Can schedule serial or MPI jobs
- Serial jobs run in individual host queues
- Parallel jobs must include a parallel environment request
60 Working with SGE jobs
- There are commands for querying or modifying the status of a job running or queued by SGE:
- qsub - submit a job
- qstat - query the status of a job
- qdel - delete a job from SGE
61 Submitting a serial job
- Create a submit script (basic.sh):

  #!/bin/sh
  # scalar example
  echo "This code is running on `hostname`"
  date
  # end of script
62 Submitting a serial job
- The job is submitted to SGE using the qsub command:
- qsub basic.sh
63 2 ways of submitting
- With arguments:
  qsub -o outputfile -j y -cwd basic.sh
- In the submit script
64 Monitoring a job - QSTAT
- To list the status and node properties:
- qstat
65 Monitoring a job - qstat
- qstat output, important fields:
- Job identifier
- Job status:
  - qw - queued and waiting
  - t - job transferring and about to start
  - r - job running on listed hosts
  - d - job has been marked for deletion
66 Deleting a job - QDEL
- Single job: qdel 151
- List of jobs: qdel 151 152 153
- All jobs under a user: qdel -u artemis
67 Output produced by jobs
- By default, we get 2 files:
- <script>.o<jobid> - standard output
- <script>.e<jobid> - error messages
- For parallel jobs, also:
- <script>.po<jobid> - list of processors the job ran on
68 Debugging job failures
69 Script for submitting parallel jobs
- mpisub gets as input the number of processors and the executable
- Ex: mpisub 8 <myapp>
70 Parallel MPI jobs and SGE
- SGE uses the concept of a parallel environment (PE)
- Several PEs can coexist on the machine
- Each host has an associated queue and resource list (time, memory)
- A PE is a list of hosts along with a set number of job slots
71 Queue definitions
- System job execution policy
- Resource allocation
- Resource limits
- Accounting
72 Two ways to run a batch job
(1) Parameters in the command line
(2) Parameters in the script file
73 QSUB options
74 Parix run limits
(1) NQS queues on parix
(2) Interactive: maximum CPU time 15 minutes
75 Output of command qstat -a
76 Exercise 1: login and submit a job