Using the BYU SP-2

Slides: 18
Provided by: quinno

Transcript and Presenter's Notes

1
Using the BYU SP-2
2
Our System
  • Interactive nodes (2)
  • used for login, compilation, and testing
  • marylou10.et.byu.edu
  • I/O and scheduling nodes (7)
  • used for the batch scheduling system and
    the parallel file system
  • Compute nodes (26)
  • 22 with 4 processors each
  • 4 with 16 processors each

3
Compilers
  • xlc: C
  • xlC: C++
  • xlf: Fortran
  • Parallel Compilers
  • mpcc
  • mpCC
  • mpxlf
  • Optimization
  • -O5 -qarch=pwr3 -qtune=pwr3 -qhot
  • Libraries
  • -lblas, -lfftw, -llapack, -lessl
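
As a rough illustration of how the compiler drivers, optimization flags, and libraries on this slide combine into one command line, here is a minimal Python sketch. The helper `compile_command`, the source file name, and the output name are hypothetical; the flag spellings `-qarch=pwr3` and `-qtune=pwr3` assume the usual IBM XL `-qoption=value` form.

```python
# Compiler driver names come from the slide; everything else is illustrative.
COMPILERS = {
    ("c", False): "xlc",
    ("c++", False): "xlC",
    ("fortran", False): "xlf",
    ("c", True): "mpcc",
    ("c++", True): "mpCC",
    ("fortran", True): "mpxlf",
}

# Assumed spelling of the optimization flags listed on the slide.
OPT_FLAGS = ["-O5", "-qarch=pwr3", "-qtune=pwr3", "-qhot"]


def compile_command(source, language, parallel=False, libs=(), output="a.out"):
    """Return the compiler command line as a list of arguments."""
    driver = COMPILERS[(language, parallel)]
    lib_flags = [f"-l{name}" for name in libs]
    return [driver, *OPT_FLAGS, source, "-o", output, *lib_flags]


# Hypothetical MPI program written in C and linked against ESSL.
cmd = compile_command("solver.c", "c", parallel=True, libs=["essl"], output="solver")
print(" ".join(cmd))
```

The sketch only builds the argument list; running it on the login node would still be a separate step.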

4
Other Stuff
  • Documentation
  • http://www-1.ibm.com/servers/eserver/pseries/library/sp_books/
  • http://marylou.byu.edu
  • Launching parallel jobs
  • done through the batch scheduler
  • Your job is a shell script that you hand to the
    batch scheduler for execution
  • xloadl can help you create the script

5
Batch job scheduler
  • Batch Schedulers
  • PBS (Portable Batch System): open source
  • LoadLeveler: descendant of Condor
  • The process
  • users submit jobs to a queue
  • machines register with the scheduler, offering to run
    jobs of certain classes
  • the scheduler allocates jobs to machines and tracks
    them
  • once started, jobs are scheduled by the operating system kernel
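
The process above can be sketched as a toy dispatcher. This is a simplified model written for illustration, not how PBS or LoadLeveler is implemented: the `Machine`, `Job`, and `dispatch` names, the slot counts, and the first-fit FIFO policy are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    classes: set                      # job classes this machine offers to run
    running: list = field(default_factory=list)
    slots: int = 1                    # how many jobs it accepts at once


@dataclass
class Job:
    job_id: str
    job_class: str


def dispatch(queue, machines):
    """Place queued jobs, in FIFO order, on the first machine that offers
    the job's class and has a free slot. Returns the jobs still waiting."""
    waiting = deque()
    while queue:
        job = queue.popleft()
        target = next(
            (m for m in machines
             if job.job_class in m.classes and len(m.running) < m.slots),
            None,
        )
        if target is None:
            waiting.append(job)       # no machine can take it right now
        else:
            target.running.append(job.job_id)   # scheduler tracks the placement
    return waiting


machines = [Machine("nodeA", {"short", "medium"}, slots=2), Machine("nodeB", {"long"})]
queue = deque([Job("j1", "short"), Job("j2", "long"), Job("j3", "long"), Job("j4", "medium")])
waiting = dispatch(queue, machines)
print([m.running for m in machines], [j.job_id for j in waiting])
```

In this run the second `long` job stays queued because the only machine offering that class is already busy.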

6
Scheduling parallel jobs
  • jobs can ask for
  • number of nodes (1 CPU)
  • number of tasks per node (multiple CPUs)
  • non-shared nodes (multiple CPUs)
  • mixing jobs can be bad
  • two I/O-intensive processes on a 2-CPU node can
    ruin performance for both
  • the same goes for two RAM-intensive processes
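
To make the trade-off between shared and non-shared nodes concrete, the following sketch counts how many CPUs a request ties up. It assumes a simple model in which a non-shared node reserves all of its CPUs for the job, while a shared node reserves only one CPU per task; the function name and the numbers are illustrative.

```python
def cpu_request(nodes, tasks_per_node, cpus_per_node, shared):
    """Return task and CPU counts for a request under the assumed model."""
    if tasks_per_node > cpus_per_node:
        raise ValueError("more tasks per node than CPUs per node")
    tasks = nodes * tasks_per_node
    # A non-shared node is reserved whole; a shared node only per task.
    reserved = tasks if shared else nodes * cpus_per_node
    return {"tasks": tasks, "cpus_reserved": reserved, "cpus_idle": reserved - tasks}


# Six 4-CPU nodes, one task per node, nodes not shared:
print(cpu_request(6, 1, 4, shared=False))
# The same request with four tasks per node fills every reserved CPU:
print(cpu_request(6, 4, 4, shared=False))
```

Under this model, asking for non-shared nodes avoids interference from other jobs at the cost of idle CPUs whenever the job runs fewer tasks than the node has processors.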

7
Scheduling parallel jobs (2)
  • All nodes, processors, and resources are
    allocated for the duration of the entire job
  • No dynamic adjustments, except by creating jobs
    with multiple steps
  • each step can have different requirements
  • each step can express dependency on other steps
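
One way to picture a multi-step job is as a small dependency graph in which each step carries its own resource request. The sketch below uses Python's standard `graphlib` to derive a valid run order; the step names, node counts, and the `depends_on` field are hypothetical and do not correspond to LoadLeveler keywords.

```python
from graphlib import TopologicalSorter

# Hypothetical three-step job: each step has its own requirements
# and names the steps it depends on.
steps = {
    "preprocess": {"nodes": 1, "depends_on": []},
    "solve": {"nodes": 8, "depends_on": ["preprocess"]},
    "postprocess": {"nodes": 1, "depends_on": ["solve"]},
}


def run_order(steps):
    """Return step names in an order that respects every dependency."""
    graph = {name: set(spec["depends_on"]) for name, spec in steps.items()}
    return list(TopologicalSorter(graph).static_order())


for name in run_order(steps):
    print(name, "needs", steps[name]["nodes"], "node(s)")
```

Only the `solve` step holds eight nodes here, so the single-node steps before and after it do not keep the larger allocation busy.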

8
Scheduling parallel jobs (3)
  • Management must
  • allow some jobs to use the entire machine
  • allow short jobs to get started quickly; they
    should not have to wait weeks in the queue
  • Some very long jobs may be needed, but they
    should be avoided

9
Backfill scheduling
[Figure: jobs A, B, C, and D placed on a 10-node system, plotted against time]
10
Backfill scheduling
  • Requires a real (wall-clock) time limit to be set
  • A more accurate (shorter) estimate gives a job a
    better chance of running earlier
  • Short jobs can move through the system more quickly
  • Uses the system better by avoiding wasted cycles
    during waits
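
The idea can be sketched with a small simulation. This is an EASY-style backfill written for illustration, not the algorithm any particular scheduler uses: it assumes all jobs are queued at time 0 and run for exactly their time limit, and the job sizes and limits below are invented.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    limit: int   # wall-clock limit; the job is assumed to run exactly this long


def backfill_schedule(jobs, total_nodes):
    """Return {job name: start time} for jobs given in queue order."""
    if any(job.nodes > total_nodes for job in jobs):
        raise ValueError("a job asks for more nodes than the system has")
    queue, running, starts, now = list(jobs), [], {}, 0
    while queue:
        free = total_nodes - sum(n for _, n in running)
        # Start jobs from the head of the queue while they fit.
        while queue and queue[0].nodes <= free:
            job = queue.pop(0)
            starts[job.name] = now
            running.append((now + job.limit, job.nodes))
            free -= job.nodes
        if not queue:
            break
        # The head job is blocked: reserve the earliest time it can start.
        head, avail, shadow = queue[0], free, now
        for end, n in sorted(running):
            avail, shadow = avail + n, end
            if avail >= head.nodes:
                break
        spare = avail - head.nodes   # nodes the head job will not need then
        # Backfill later jobs that cannot delay the head job's reservation.
        remaining = [head]
        for job in queue[1:]:
            ends_in_time = now + job.limit <= shadow
            if job.nodes <= free and (ends_in_time or job.nodes <= spare):
                starts[job.name] = now
                running.append((now + job.limit, job.nodes))
                free -= job.nodes
                if not ends_in_time:
                    spare -= job.nodes
            else:
                remaining.append(job)
        queue = remaining
        # Advance the clock to the next job completion.
        now = min(end for end, _ in running)
        running = [(e, n) for e, n in running if e > now]
    return starts


jobs = [Job("A", 6, 4), Job("B", 8, 2), Job("C", 3, 3), Job("D", 4, 6)]
starts = backfill_schedule(jobs, total_nodes=10)
print(starts)
```

In this run job C starts at time 0 next to job A, because its limit of 3 ends before job B's reserved start at time 4; job D cannot be backfilled, since its limit of 6 would delay B. A shorter limit on D would have let it start earlier, which is the point of giving accurate estimates.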

11
Using LoadLeveler
  • Graphical user interface: xloadl
  • Write a shell script with LoadLeveler keywords as
    shell comments

# @ output = thing.log
# @ error = thing.err
# @ class = short
# @ queue
# @ executable = thingx
# @ node = 6,10
# @ tasks_per_node = 4
# @ requirements = (Adapter == "hps_us")
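
Because the keywords are ordinary shell comments, a job command file is easy to generate from a program. The sketch below assumes the `# @ keyword = value` comment form with a closing `# @ queue` line; the `render_job` helper and the `./your_program` command are hypothetical.

```python
def render_job(keywords, commands, shell="/bin/ksh"):
    """Render a job command file: interpreter line, keyword comments,
    a closing queue statement, then the shell commands to run."""
    lines = [f"#!{shell}"]
    for key, value in keywords.items():
        lines.append(f"# @ {key} = {value}")
    lines.append("# @ queue")
    lines.extend(commands)
    return "\n".join(lines) + "\n"


# Keyword values taken from the slide; the command is a placeholder.
script = render_job(
    {
        "output": "thing.log",
        "error": "thing.err",
        "class": "short",
        "node": "6,10",
        "tasks_per_node": 4,
    },
    ["./your_program"],
)
print(script)
```

Keeping the keywords in a dictionary makes it simple to produce variants of the same job, for example one per job class.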
12
Sample LoadLeveler Script
#!/bin/ksh
# @ job_type = parallel
# @ input = /dev/null
# @ output = $(Executable).$(Cluster).$(Process).out
# @ error = $(Executable).$(Cluster).$(Process).err
# @ initialdir = /gstudent/student_rt_y/directory
# @ notify_user = student_rt_y@byu.edu
# @ class = short
# @ notification = complete
# @ checkpoint = no
# @ restart = no
# @ requirements = (Arch == "power3")
# @ blocking = unlimited
# @ total_tasks = 4
# @ network.MPI = switch,shared,US
# @ queue
./your_exe_and_any_args
13
Sample serial job
#!/bin/ksh
# @ job_type = serial
# @ input = /dev/null
# @ output = $(Executable).$(Cluster).$(Process).out
# @ error = $(Executable).$(Cluster).$(Process).err
# @ initialdir = /gstudent/student_rt_y
# @ notify_user = student_rt_y@byu.edu
# @ class = medium
# @ notification = complete
# @ checkpoint = no
# @ restart = no
# @ queue
paupnew Hlav3ashort.paup
14
LoadLeveler commands
  • llq: shows all jobs
  • can also use showq
  • llq -s JobID: shows why a job is not running
  • llclass: shows classes
  • llstatus: shows machines
  • llcancel JobID: cancels a job
  • llhold JobID: puts a job in the hold state

15
Sample llq output
bash-2.05a$ llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
m1015i.1127.0            mdt36        8/7 12:41 R   50 long         m1009i
m1015i.1128.0            mdt36        8/7 12:41 R   50 long         m1019i
m1015i.1497.0            jl447       8/12 16:25 R   50 long         m1012i
m1015i.1544.0            to5         8/13 08:44 R   50 long         m1045i
m1015i.1545.0            to5         8/13 08:44 R   50 long         m1045i
m1015i.1602.0            taskman     8/14 08:13 R   50 short        m1017i
m1015i.1598.0            taskman     8/14 08:13 R   50 short        m1014i
m1015i.1601.0            taskman     8/14 08:13 R   50 short        m1017i
m1015i.1599.0            taskman     8/14 08:13 R   50 short        m1014i
m1015i.1600.0            taskman     8/14 08:13 R   50 short        m1011i
m1015i.1626.0            mendez      8/14 13:07 I   50 long
m1015i.1625.0            cr66        8/14 12:40 I   50 medium
m1015i.1513.0            jl447       8/13 07:08 I   50 long
m1015i.1572.0            dvd         8/13 10:45 I   50 medium
m1015i.1576.0            dvd         8/13 11:22 I   50 medium
m1015i.1577.0            dvd         8/13 11:25 I   50 medium
m1015i.1566.0            mdt36       8/13 08:51 I   50 long
m1015i.1564.0            mdt36       8/13 08:50 I   50 long
m1015i.1612.0            taskman     8/14 08:27 I   50 short
m1015i.1624.0            taskman     8/14 08:57 I   50 short
m1015i.1623.0            taskman     8/14 08:57 I   50 short

58 job step(s) in queue, 23 waiting, 0 pending, 35 running, 0 held, 0 preempted
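
Output in this shape is straightforward to post-process. The sketch below assumes whitespace-separated columns in the order Id, Owner, Submitted (date and HH:MM time), ST, PRI, Class, and an optional Running On host; the `parse_llq` helper and the shortened sample text are illustrative.

```python
import re
from collections import Counter

# One job row: id, owner, date, time, state, priority, class, optional host.
ROW = re.compile(
    r"(\S+)\s+(\S+)\s+(\d{1,2}/\d{1,2})\s+(\d{1,2}:\d{2})\s+"
    r"([A-Z]{1,2})\s+(\d+)\s+(\S+)(?:\s+(\S+))?"
)

SAMPLE = """\
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
m1015i.1127.0            mdt36        8/7 12:41 R   50 long         m1009i
m1015i.1602.0            taskman     8/14 08:13 R   50 short        m1017i
m1015i.1626.0            mendez      8/14 13:07 I   50 long
m1015i.1625.0            cr66        8/14 12:40 I   50 medium
"""


def parse_llq(text):
    """Return one dict per job row; header, rule, and summary lines are skipped."""
    jobs = []
    for line in text.splitlines():
        match = ROW.fullmatch(line.strip())
        if match is None:
            continue
        job_id, owner, date, time, state, pri, job_class, host = match.groups()
        jobs.append({
            "id": job_id, "owner": owner, "submitted": f"{date} {time}",
            "state": state, "priority": int(pri), "class": job_class, "host": host,
        })
    return jobs


jobs = parse_llq(SAMPLE)
print(Counter(job["state"] for job in jobs))
print([job["id"] for job in jobs if job["host"] is None])
```

Idle jobs have no Running On entry, so their `host` field is `None`.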
16
Sample showq output
bash-2.05a$ showq
ACTIVE JOBS--------------------
JOBNAME         USERNAME   STATE    PROC    REMAINING  STARTTIME

m1015i.1581.0   taskman    Running     1     18:39:00  Wed Aug 14 08:06:24
m1015i.1582.0   taskman    Running     1     18:39:00  Wed Aug 14 08:06:24
m1015i.1580.0   taskman    Running     1     18:39:00  Wed Aug 14 08:06:24
m1015i.1615.0   taskman    Running     1     21:33:42  Wed Aug 14 11:01:06
m1015i.1613.0   taskman    Running     1     23:43:05  Wed Aug 14 13:10:29
m1015i.1575.0   dvd        Running     4   2:15:10:38  Wed Aug 14 04:38:02
m1015i.1127.0   mdt36      Running     8   2:23:14:21  Wed Aug  7 12:41:45
m1015i.1567.0   jar65      Running     4   9:04:07:44  Tue Aug 13 17:35:08
m1015i.1569.0   jar65      Running     4   9:08:28:16  Tue Aug 13 21:55:40
m1015i.1547.0   to5        Running     8   9:21:11:49  Wed Aug 14 10:39:13
m1015i.1546.0   to5        Running     8   9:21:11:49  Wed Aug 14 10:39:13

35 Active Jobs   150 of 184 Processors Active (81.52%)
                  26 of  34 Nodes Active      (76.47%)

IDLE JOBS----------------------
JOBNAME         USERNAME   STATE    PROC      WCLIMIT  QUEUETIME

m1015i.1513.0   jl447      Idle        2   5:00:00:00  Tue Aug 13 07:08:09
m1015i.1572.0   dvd        Idle        8   3:00:00:00  Tue Aug 13 10:45:18

23 Idle Jobs

NON-QUEUED JOBS----------------
JOBNAME         USERNAME   STATE    PROC      WCLIMIT  QUEUETIME

Total Jobs: 58   Active Jobs: 35   Idle Jobs: 23   Non-Queued Jobs: 0
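
The time fields and utilization figures in this output can be checked with a little arithmetic. The sketch below assumes that REMAINING and WCLIMIT are written as `[days:]hours:minutes:seconds`; the helper names are illustrative.

```python
def to_seconds(field):
    """Convert '[D:]HH:MM:SS' into a number of seconds."""
    parts = [int(p) for p in field.split(":")]
    while len(parts) < 4:
        parts.insert(0, 0)          # pad missing leading units (days) with zero
    days, hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def utilization(active, total):
    """Percentage of processors or nodes in use, rounded to two places."""
    return round(100 * active / total, 2)


print(to_seconds("18:39:00"))       # remaining time of a short job
print(to_seconds("2:15:10:38"))     # remaining time of a multi-day job
print(utilization(150, 184), utilization(26, 34))
```

The two utilization values match the percentages reported in the listing, 81.52 for processors and 76.47 for nodes.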
17
LoadLeveler environment
  • Normally the same as your login environment
  • Limits are set; use llclass -l to see the values
  • ulimit -S -a (soft limits)
  • ulimit -H -a (hard limits)
  • Big heap requirements
  • -bmaxdata:0x80000000 allows up to 2 GB of data (heap)
  • -q64 -bmaxdata:0x... allows up to 8 EB
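
The heap figures follow from simple arithmetic, and the soft and hard limits that ulimit reports can also be read from inside a process. The sketch below uses Python's standard `resource` module (Unix only) and looks at the data-segment limit alone, which corresponds to `ulimit -d` rather than the full `-a` listing; the `describe` helper is illustrative.

```python
import resource

GIB = 1024 ** 3


def describe(limit):
    """Format a byte limit, treating RLIM_INFINITY as unlimited."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit / GIB:.2f} GiB"


# 0x80000000 bytes is exactly 2 GiB, the 32-bit heap ceiling on the slide.
print(describe(0x80000000))

# 8 EiB is 2**63 bytes, the figure quoted for 64-bit programs.
print(2 ** 63 == 8 * 1024 ** 6)

# Soft and hard data-segment limits of the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_DATA)
print("soft:", describe(soft), "hard:", describe(hard))
```

Inside a batch job the printed limits reflect the job class rather than the login shell, so running this from a job script is a quick way to see what the scheduler actually granted.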