Parallel computing on nanco - an introductory course
1
Parallel computing on nanco - an introductory course
  • Anne Weill Zrahia
  • Technion, Computer Center
  • May 2007

2
(No Transcript)
3
Parallel Power for HPC
  • A closely coupled, scalable set of
    interconnected computer systems, sharing common
    hardware and software infrastructure, providing a
    parallel set of resources to applications for
    improved performance, throughput and availability.

4
Resources needed for applications arising from
Nanotechnology
  • Large memory - terabytes
  • High floating-point computing speed - teraflops
  • High data throughput - state of the art

5
Parallel Programming on the Nanco
  • 1) Parallelization Concepts
  • 2) Nanco Computer Design
  • 3) Efficient Scalar Design
  • 4) Parallel Programming - MPI
  • 5) Queuing system - SGE

6
3) Compilers and tools
7
Parallel classification
  • Parallel architectures:
    Shared Memory / Distributed Memory
  • Programming paradigms:
    Data parallel / Message passing

8
Shared Memory
  • Each processor can access any part of the memory
  • Access times are uniform (in principle)
  • Easier to program (no explicit message passing)
  • Bottleneck when several tasks access the same
    location

9
SMP architecture
[Diagram: four processors (P) sharing a single Memory]
10
Distributed Memory
  • Each processor can only access its local memory
  • Access times depend on location
  • Processors must communicate via explicit message
    passing

11
Distributed Memory
[Diagram: processor-memory pairs connected by an interconnection network]
12
Message Passing Programming
  • Separate program on each processor
  • Local Memory
  • Control over distribution and transfer of data
  • Additional complexity of debugging due to
    communications

13
Why not a cluster?
  • Single SMP system easier to purchase/maintain
  • Ease of programming in SMP systems

14
Why a cluster?
  • Scalability
  • Total available physical RAM
  • Reduced cost
  • But ...

15
Performance issues
  • Concurrency - ability to perform actions
    simultaneously
  • Scalability - performance is not impaired by
    increasing the number of processors
  • Locality - high ratio of local memory
    accesses to remote memory accesses (or low
    communication)

16
SP2 Benchmark
  • Goal: checking performance of real-world
    applications on the SP2
  • Execution time (seconds) = CPU time for the
    application
  • Speedup =
      Execution time for 1 processor
      -------------------------------
      Execution time for p processors

17
(No Transcript)
18
2) Nanco design
19
Nanco architecture
20
Configuration
[Diagram: compute nodes node1, node2, ... node64, each with processors (P) and memory (M), connected by an Infiniband switch]
21
Configuration
  • 64 dual-processor compute nodes, each processor a
    dual-core Opteron Rev. F
  • 8 GB RAM per node
  • 2 master nodes for H/A (high availability), also Opterons
  • InfiniBand interconnect (switch + HCAs)
  • NetApp storage

22
(No Transcript)
23
4) Parallel Programming - MPI
24
AMD Opteron processor
25
Memory bottleneck
26
AMD HyperTransport
27
(No Transcript)
28
How does this reflect on performance?

29
Performance
  • Access to local memory: 1 hop
  • Access to the 2nd processor's memory: 2 hops
  • Prefetch can be useful for predictable patterns
  • Multithreading can be used at node level

30
WHAT is MPI?
  • A message-passing library specification
  • Extended message-passing model
  • Not specific to an implementation or computer

31
BASICS of MPI PROGRAMMING
  • MPI is a message-passing library
  • Assumes a distributed memory architecture
  • Includes routines for performing communication
    (exchange of data and synchronization) among the
    processors.

32
Message Passing
  • Data transfer + synchronization
  • Synchronization - the act of bringing one or more
    processes to known points in their execution
  • Distributed memory - memory split up into
    segments, each of which may be accessed by only one
    process.

33
Message Passing
[Diagram: handshake between sender and receiver - "May I send?" / "yes" / send data]
34
MPI STANDARD
  • Standard by consensus, designed in an open forum
  • Introduced by the MPI FORUM in May 1994, updated
    in June 1995.
  • MPI-2 (1998) provides extensions to the MPI
    standard

35
Why use MPI ?
  • Standardization
  • Portability
  • Performance
  • Richness
  • Designed to enable libraries

36
Writing an MPI Program
  • If there is a serial version, make sure it is
    debugged
  • If not, try to write a serial version first
  • When debugging in parallel, start with a few
    nodes first.

37
Format of MPI routines
38
Six useful MPI functions
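The body of this slide is an image; the six functions meant here are presumably the minimal standard set used in the example programs later in the deck. Their C prototypes, for reference:

    int MPI_Init(int *argc, char ***argv);          /* start MPI                   */
    int MPI_Comm_size(MPI_Comm comm, int *size);    /* number of processes         */
    int MPI_Comm_rank(MPI_Comm comm, int *rank);    /* rank of the calling process */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm);                    /* blocking send    */
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status *status); /* blocking receive */
    int MPI_Finalize(void);                         /* shut MPI down               */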
39
Communication routines
40
End MPI part of program
41
      program hello
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      integer rank, size, tag, i, ierror
      character*12 message
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100
      if (rank .eq. 0) then
         message = 'Hello, world'
         do i = 1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i, tag, MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag, MPI_COMM_WORLD, status, ierror)
      endif
      print *, 'node', rank, ' ', message
      call MPI_FINALIZE(ierror)
      end

42
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int tag = 100;
    int rank, size, i;
    MPI_Status status;
    char message[12];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    strcpy(message, "Hello,world");
    if (rank == 0) {
        for (i = 1; i < size; i++)
            MPI_Send(message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
    } else {
        MPI_Recv(message, 12, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
    }
    printf("node %d %s\n", rank, message);
    MPI_Finalize();
    return 0;
}
43
MPI Messages
  • DATA - the data to be sent
  • ENVELOPE - information to route the data.

44
Description of MPI_Send (MPI_Recv)
45
Description of MPI_Send (MPI_Recv)
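The argument tables on these two slides are images. As a sketch (not from the slides), the C calls below show how the arguments split into the DATA part and the ENVELOPE part; it assumes rank has already been obtained with MPI_Comm_rank and that at least two processes are running:

    double a[10];
    MPI_Status status;

    if (rank == 0) {
        MPI_Send(a, 10, MPI_DOUBLE,          /* DATA: buffer, count, datatype            */
                 1, 7, MPI_COMM_WORLD);      /* ENVELOPE: destination, tag, communicator */
    } else if (rank == 1) {
        MPI_Recv(a, 10, MPI_DOUBLE,          /* DATA: buffer, count, datatype            */
                 0, 7, MPI_COMM_WORLD,       /* ENVELOPE: source, tag, communicator      */
                 &status);                   /* status: actual source/tag of the message */
    }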
46
Some useful remarks
  • Source = MPI_ANY_SOURCE means that any source is
    acceptable
  • Tags specified by sender and receiver must match,
    or use MPI_ANY_TAG (any tag is acceptable)
  • Communicator must be the same for send/receive.
    Usually MPI_COMM_WORLD

47
Broadcast
  • Sends data on one node to all other nodes in the
    communicator (usage sketch below).
  • MPI_Bcast(buffer, count, datatype, root, comm, ierr)

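A minimal usage sketch in C (not from the slide), assuming rank has already been obtained; the root process (0 here) supplies the value and every process in the communicator ends up with a copy:

    int n = 0;
    if (rank == 0) n = 100;                        /* only the root knows the value    */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* afterwards n == 100 on all ranks */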
48
Broadcast
[Diagram: data item A0, initially on P0, is copied so that P0, P1, P2 and P3 all hold A0]
49
Performance evaluation
  • Fortran:
  •   real*8 t1
  •   t1 = MPI_Wtime()   ! Returns elapsed wall-clock time
  • C:
  •   double t1;
  •   t1 = MPI_Wtime();

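A typical timing pattern built from this call (a C sketch; the difference of two MPI_Wtime values gives the elapsed wall-clock seconds):

    double t1, t2;
    t1 = MPI_Wtime();
    /* ... code region to be measured ... */
    t2 = MPI_Wtime();
    printf("Elapsed time: %f seconds\n", t2 - t1);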
50
MPI References
  • The MPI Standard:
    www-unix.mcs.anl.gov/mpi/index.html
  • Parallel Programming with MPI, Peter S. Pacheco,
    Morgan Kaufmann, 1997
  • Using MPI, W. Gropp, Ewing Lusk, Anthony Skjellum,
    The MIT Press, 1999

51
Getting started
  • Security
  • Logging in
  • Shell environment
  • Transferring files

52
System access-security
  • Secure access
  • X-tunnelling (for graphics)
  • Can use ssh -X for tunnelling

53
Login Environment
  • Paths and environment variables have been set up
    (change things with care)
  • TCSH is the default (you can switch to bash if you
    like)
  • User-modifiable environment variables are in
    .cshrc in the home directory
  • Home directory is /u/courseXX

54
Compilers
  • Options are gcc, gcc4, suncc for C
  • g++, sunCC for C++
  • g77 (no F90), gfortran, sunf90 for
    Fortran77/Fortran90

55
Compilation with MPI
  • Most MPI implementations support C, C++, Fortran77
    and Fortran90 bindings.
  • Compiler wrapper scripts such as mpif77, mpif90,
    mpicc, etc. (examples below)
  • You can specify generic compiler options

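  • For example (the exact wrapper names and options depend
    on the local MPI installation, so treat these as sketches):
  •   mpicc -O3 myprog.c -o myprog
  •   mpif90 -O3 myprog.f90 -o myprog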
56
Flags for compilation
  • sunf90 -fast -xO5 -xarch=amd64a myprog.f -o myprog
  • gcc -O3 -march=opteron myprog.c -o myprog

57
5) Queuing system - Sun Grid Engine
58
Sun Grid Engine
  • Open-source batch queuing system similar to PBS
    or LSF
  • Automatically runs jobs on less loaded nodes
  • Queues jobs for later execution to avoid
    overloading the system

59
SGE properties
  • Can schedule serial or MPI jobs
  • - serial jobs run in individual host queues
  • - parallel jobs must include a parallel
    environment request

60
Working with SGE jobs
  • There are commands for querying or modifying the
    status of a job running or queued by SGE
  • - qsub - submit a job
  • - qstat - query the status of a job
  • - qdel - delete a job from SGE

61
Submitting a serial job
  • Create a submit script (basic.sh):
    #!/bin/sh
    # scalar example
    echo "This code is running on `hostname`"
    date
    # end of script

62
Submitting a serial job
  • The job is submitted to SGE using the qsub
    command
  • qsub basic.sh

63
2 ways of submitting
  • With arguments on the command line:
    qsub -o outputfile -j y -cwd basic.sh
  • With directives in the submit script (see the
    sketch below)

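A sketch of the second form: SGE reads directive lines beginning with #$ from the script itself, so the same options can be embedded in basic.sh (adapt the options to your job):

    #!/bin/sh
    #$ -o outputfile      # standard output file
    #$ -j y               # merge stderr into stdout
    #$ -cwd               # run in the current working directory
    echo "This code is running on `hostname`"
    date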
64
Monitoring a job - QSTAT
  • To list the status and node properties
  • qstat

65
Monitoring a job - qstat
  • qstat output - important fields:
  • Job identifier
  • Job status:
  • - qw - queued and waiting
  • - t - job transferring and about to start
  • - r - job running on listed hosts
  • - d - job has been marked for deletion

66
Deleting a job - QDEL
  • Single job: qdel 151
  • List of jobs: qdel 151 152 153
  • All jobs of a user: qdel -u artemis

67
Output produced by jobs
  • By default, we get 2 files:
  • <script>.o<jobid> - standard output
  • <script>.e<jobid> - error messages
  • For parallel jobs, also:
  • <script>.po<jobid> - list of processors the
    job ran on

68
Debugging job failures
69
Script for submitting parallel jobs
  • mpisub gets as input the number of processors and
    the executable
  • Example: mpisub 8 <myapp>

70
Parallel MPI jobs and SGE
  • SGE uses the concept of a parallel environment
    (PE)
  • Several PEs can coexist on the machine
  • Each host has an associated queue and resource
    list (time, memory)
  • A PE is a list of hosts along with a set number
    of job slots

71
Queues definition
  • System job execution policy
  • Resource allocation
  • Resource limits
  • Accounting

72
Two ways to run a batch job
(1) Parameters in command line
(2) Parameters in script file
73
QSUB options
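The option table on this slide is an image; some commonly used qsub options, for reference (not exhaustive):

    -N <name>      # job name
    -o <file>      # standard output file
    -e <file>      # standard error file
    -j y           # merge standard error into standard output
    -cwd           # run the job in the current working directory
    -q <queue>     # request a specific queue
    -pe <pe> <n>   # request a parallel environment with n slots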
74
Parix run limits
(1) NQS queues on parix
(2) Interactive: maximum CPU time 15 minutes
75
Output of the command qstat -a
76
Exercise 1: login and submit a job