Distributed - PowerPoint PPT Presentation

Learn more at: http://www-hep.uta.edu

1
Distributed Parallel Computing Cluster
  • Patrick McGuigan
  • mcguigan@cse.uta.edu

2
DPCC Background
  • NSF-funded Major Research Instrumentation (MRI)
    grant
  • Goals
  • Personnel
    • PI
    • Co-PIs
    • Senior Personnel
    • Systems Administrator

3
DPCC Goals
  • Establish a regional Distributed and Parallel
    Computing Cluster at UTA (DPCC@UTA)
  • An inter-departmental and inter-institutional
    facility
  • Facilitate collaborative research that requires
    large-scale storage (terabytes to petabytes),
    high-speed access (gigabit or more) and mega
    processing (100s of processors)

4
DPCC Research Areas
  • Data mining / KDD
    • Association rules, graph mining, stream
      processing, etc.
  • High Energy Physics
    • Simulation, moving towards a regional Dø Center
  • Dermatology/skin cancer
    • Image database, lesion detection and monitoring
  • Distributed computing
    • Grid computing, PICO
  • Networking
    • Non-intrusive network performance evaluation
  • Software Engineering
    • Formal specification and verification
  • Multimedia
    • Video streaming, scene analysis
  • Facilitate collaborative efforts that need
    high-performance computing

5
DPCC Personnel
  • PI
    • Dr. Chakravarthy
  • Co-PIs
    • Drs. Aslandogan, Das, Holder, Yu
  • Senior Personnel
    • Paul Bergstresser, Kaushik De, Farhad Kamangar,
      David Kung, Mohan Kumar, David Levine, Jung-Hwan
      Oh, Gergely Zaruba
  • Systems Administrator
    • Patrick McGuigan

6
DPCC Components
  • Establish a distributed memory cluster (150
    processors)
  • Establish a Symmetric or shared multiprocessor
    system
  • Establish a large shareable high speed storage
    (100s of Terabytes)

7
DPCC Cluster as of 2/1/2004
  • Located in 101 GACB
  • Inauguration 2/23 as part of E-Week
  • 5 racks of equipment + UPS

8-11
  • Photos (scaled for presentation)

12
DPCC Resources
  • 97 machines
    • 81 worker nodes
    • 2 interactive nodes
    • 10 IDE-based RAID servers
    • 4 nodes support Fibre Channel SAN
  • 50 TB storage
    • 4.5 TB in each IDE RAID
    • 5.2 TB in FC SAN
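
The totals on this slide can be cross-checked with a little shell arithmetic. The following is a minimal sketch; it works in tenths of a terabyte so that integer arithmetic suffices, and the variable names are purely illustrative.

```shell
#!/bin/sh
# Cross-check the machine and storage totals quoted on the slide.
workers=81; interactive=2; raid_servers=10; san_nodes=4
machines=$((workers + interactive + raid_servers + san_nodes))
echo "machines: $machines"

# Storage in tenths of a TB: 4.5 TB per IDE RAID server, 5.2 TB in the FC SAN.
raid_tenths=45; san_tenths=52
total_tenths=$((raid_servers * raid_tenths + san_tenths))
echo "storage: $((total_tenths / 10)).$((total_tenths % 10)) TB"
```

The machine count comes to 97 and the storage to 50.2 TB, which the slide rounds to 50 TB.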

13
DPCC Resources (continued)
  • 1 Gb/s network interconnections
    • core switch
    • satellite switches
  • 1 Gb/s SAN network
  • UPS

14
DPCC Layout
15
DPCC Resource Details
  • Worker nodes
    • Dual Xeon processors
      • 32 machines @ 2.4 GHz
      • 49 machines @ 2.6 GHz
    • 2 GB RAM
    • IDE storage
      • 32 machines @ 60 GB
      • 49 machines @ 80 GB
    • Red Hat Linux 7.3 (2.4.20 kernel)

16
DPCC Resource Details (cont.)
  • RAID server
    • Dual Xeon processors (2.4 GHz)
    • 2 GB RAM
    • 4 RAID controllers
      • 2-port controller (qty 1): mirrored OS disks
      • 8-port controller (qty 3): RAID5 with hot spare
    • 24 250 GB disks
    • 2 40 GB disks
    • NFS used to support worker nodes
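
The disk counts above are consistent with the 4.5 TB quoted for each IDE RAID server. The sketch below works this out under one assumption that the slides do not state outright: each 8-port controller drives 8 of the 24 data disks, with one disk held as the hot spare and the other seven forming the RAID5 set.

```shell
#!/bin/sh
# Usable capacity of one RAID server, ASSUMING each 8-port controller
# drives 8 of the 24 data disks: 1 hot spare + a 7-disk RAID5 set.
disk_gb=250
disks_per_controller=8
hot_spares=1
controllers=3

raid5_members=$((disks_per_controller - hot_spares))  # 7 disks in the array
data_disks=$((raid5_members - 1))                     # RAID5 gives up one disk's worth to parity
volume_gb=$((data_disks * disk_gb))                   # capacity behind one controller
server_gb=$((controllers * volume_gb))                # capacity of the whole server
echo "per volume: ${volume_gb} GB, per server: ${server_gb} GB"
```

Under that assumption each controller provides 1500 GB and the server 4500 GB, i.e. 4.5 TB in three volumes.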

17
DPCC Resource Details (cont.)
  • FC SAN
    • RAID5 array
      • 42 142 GB FC disks
    • FC switch
  • 3 GFS nodes
    • Dual Xeon (2.4 GHz)
    • 2 GB RAM
    • Global File System (GFS)
    • Serve to cluster via NFS
  • 1 GFS lock server

18
Using DPCC
  • Two nodes available for interactive use
    • master.dpcc.uta.edu
    • grid.dpcc.uta.edu
  • More nodes are likely to support other services
    (Web, DB access)
  • Access through SSH (version 2 client)
    • Freeware Windows clients are available (ssh.com)
  • File transfers through SCP/SFTP
19
Using DPCC (continued)
  • User quotas are not yet implemented on home
    directories. Be sensible in your usage.
  • Large data sets will be stored on RAIDs
    (requires coordination with sys admin)
  • All storage is visible to all nodes.

20
Getting Accounts
  • Have your supervisor request an account
  • Account will be created
  • Bring ID to 101 GACB to receive password
    • Keep password safe
  • Log in to any interactive machine
    • master.dpcc.uta.edu
    • grid.dpcc.uta.edu
  • Use the yppasswd command to change your password
  • If you forget your password
    • See me in my office; I will reset your password
      (with ID)
    • Call or e-mail me; I will reset your password to
      the original password

21
User environment
  • Default shell is bash
    • Change with ypchsh
  • Customize user environment using startup files
    • .bash_profile (login session)
    • .bashrc (non-login)
  • Customize with statements like
    • export <variable>=<value>
    • source <shell file>
  • Much more information in the bash man page
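
As an illustration of such statements, here is a sketch of lines one might add to ~/.bashrc. The MPICH path is the one given later in this presentation; MY_PROJECT and ~/.my_aliases are hypothetical names chosen only for the example.

```shell
# Lines one might add to ~/.bashrc (names other than the MPICH path are hypothetical).

# Put the cluster's MPICH tools on the search path.
export PATH="$PATH:/usr/local/mpich-1.2.5/bin"

# A variable of your own; MY_PROJECT is an illustrative name.
export MY_PROJECT="$HOME/myproject"

# Pull in extra settings from another shell file, if it exists.
if [ -f "$HOME/.my_aliases" ]; then
    . "$HOME/.my_aliases"    # "." is the portable spelling of "source"
fi
```

Settings needed only at login can go in .bash_profile instead; a common arrangement is for .bash_profile to source .bashrc so that both kinds of session see the same environment.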

22
Program development tools
  • GCC 2.96
    • C
    • C++
    • Java (gcj)
    • Objective C
    • Chill
  • Java
    • JDK: Sun J2SDK 1.4.2

23
Development tools (cont.)
  • Python
    • python: version 1.5.2
    • python2: version 2.2.2
  • Perl
    • Version 5.6.1
  • Flex, Bison, gdb
  • If your favorite tool is not available, we'll
    consider adding it!

24
Batch Queue System
  • OpenPBS
    • Server runs on master
    • pbs_mom runs on worker nodes
    • Scheduler runs on master
  • Jobs can be submitted from any interactive node
  • User commands
    • qsub: submit a job for execution
    • qstat: determine status of job, queue, server
    • qdel: delete a job from the queue
    • qalter: modify attributes of a job
  • Single queue (workq)

25
PBS qsub
  • qsub used to submit jobs to PBS
  • A job is represented by a shell script
  • Shell script can alter environment and proceed
    with execution
  • Script may contain embedded PBS directives
  • Script is responsible for starting parallel jobs
    (not PBS)

26
Hello World
    [mcguigan@master pbs_examples]$ cat helloworld
    echo Hello World from $HOSTNAME
    [mcguigan@master pbs_examples]$ qsub helloworld
    15795.master.cluster
    [mcguigan@master pbs_examples]$ ls
    helloworld  helloworld.e15795  helloworld.o15795
    [mcguigan@master pbs_examples]$ more helloworld.o15795
    Hello World from node1.cluster

27
Hello World (continued)
  • Job ID is returned from qsub
  • Default attributes allow job to run
    • 1 node
    • 1 CPU
    • 36:00:00 CPU time
  • Standard output and standard error streams are
    returned
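
The returned streams land in files named after the job and the numeric part of the job ID, as the helloworld.o15795 and helloworld.e15795 files in the session show. The sketch below derives those names from a job ID; job_files is a hypothetical helper that does string handling only and never contacts PBS.

```shell
#!/bin/sh
# Derive the names of the returned output files from a qsub job ID.
# job_files is a hypothetical helper; it performs string handling only.
job_files() {
    script_name=$1          # job name defaults to the script name
    job_id=$2               # e.g. 15795.master.cluster
    seq=${job_id%%.*}       # keep the sequence number before the first dot
    echo "$script_name.o$seq $script_name.e$seq"
}

job_files helloworld 15795.master.cluster
```

In a real session the job ID would come from qsub itself, for example job_id=$(qsub helloworld), since qsub prints the ID on standard output.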

28
Hello World (continued)
  • Environment of job
    • Defaults to login shell (override with #! or
      the -S switch)
    • Login environment variable list with PBS
      additions
      • PBS_O_HOST
      • PBS_O_QUEUE
      • PBS_O_WORKDIR
      • PBS_ENVIRONMENT
      • PBS_JOBID
      • PBS_JOBNAME
      • PBS_NODEFILE
      • PBS_QUEUE
    • Additional environment variables may be
      transferred using the -v switch
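
A job script can read these variables like any others. The following sketch prints a few of them; describe_job is a hypothetical function name, and each variable falls back to the word "unset" so that the script also runs outside PBS.

```shell
#!/bin/sh
# Report where a job came from, using variables PBS adds to the environment.
# Outside PBS the variables are not defined, so each one falls back to "unset".
describe_job() {
    echo "job id:       ${PBS_JOBID:-unset}"
    echo "job name:     ${PBS_JOBNAME:-unset}"
    echo "queue:        ${PBS_QUEUE:-unset}"
    echo "submitted on: ${PBS_O_HOST:-unset}"
    echo "submit dir:   ${PBS_O_WORKDIR:-unset}"
}

describe_job
```

A variable of your own can be handed to the job at submission time with the -v switch, for instance qsub -v INPUT=run42 helloworld, where INPUT and its value are made-up names.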

29
PBS Environment Variables
  • PBS_ENVIRONMENT
  • PBS_JOBCOOKIE
  • PBS_JOBID
  • PBS_JOBNAME
  • PBS_MOMPORT
  • PBS_NODENUM
  • PBS_O_HOME
  • PBS_O_HOST
  • PBS_O_LANG
  • PBS_O_LOGNAME
  • PBS_O_MAIL
  • PBS_O_PATH
  • PBS_O_QUEUE
  • PBS_O_SHELL
  • PBS_O_WORKDIR
  • PBS_QUEUE=workq
  • PBS_TASKNUM=1

30
qsub options
  • Output streams
    • -e (error output path)
    • -o (standard output path)
    • -j oe or -j eo (join the error and output
      streams into either standard output or
      standard error)
  • Mail options
    • -m a, b, e, n: when to mail (abort, begin, end,
      none)
    • -M: who to mail
  • Name of job
    • -N (15 printable characters max; first must be
      alphabetic)
  • Which queue to submit job to
    • -q name (unimportant for now)
  • Environment variables
    • -v: pass specific variables
    • -V: pass all environment variables of qsub to job
  • Additional attributes
    • -W: specify dependencies
31
Qsub options (continued)
  • -l switch used to specify needed resources
    • Number of nodes
      • nodes=x
    • Number of processors
      • ncpus=x
    • CPU time
      • cput=hh:mm:ss
    • Walltime
      • walltime=hh:mm:ss
  • See man page for pbs_resources
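
Time limits are written as hh:mm:ss. As a small sketch of how such a value might be produced from a number of seconds, the hypothetical helper to_hms below formats it and then prints (without running it) a qsub command line that uses the result.

```shell
#!/bin/sh
# Format a number of seconds as hh:mm:ss for cput= or walltime=.
# to_hms is a hypothetical helper, not part of PBS.
to_hms() {
    secs=$1
    printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

half_hour=$(to_hms 1800)
# Only print the command; submitting it requires a PBS installation.
echo "qsub -l nodes=1 -l walltime=$half_hour helloworld"
```

Here 1800 seconds becomes 00:30:00, and 129600 seconds would become 36:00:00.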

32
Hello World
  • qsub -l nodes=1 -l ncpus=1 -l cput=36:00:00 -N
    helloworld -m a -q workq helloworld
  • Options can be included in script

    #PBS -l nodes=1
    #PBS -l ncpus=1
    #PBS -m a
    #PBS -N helloworld2
    #PBS -l cput=36:00:00
    echo Hello World from $HOSTNAME

33
qstat
  • Used to determine status of jobs, queues, server
    • qstat
    • qstat <job id>
  • Switches
    • -u <user>: list jobs of user
    • -f: provides extra output
    • -n: provides nodes given to job
    • -q: status of the queue
    • -i: show idle jobs

34
qdel & qalter
  • qdel used to remove a job from a queue
    • qdel <job ID>
  • qalter used to alter attributes of currently
    queued job
    • qalter <job id> attributes (similar to qsub)

35
Processing on a worker node
  • All RAID storage visible to all nodes
    • /dataxy where x is RAID ID, y is volume (1-3)
    • /gfsx where x is GFS volume (1-3)
  • Local storage on each worker node
    • /scratch
  • Data-intensive applications should copy input
    data (when possible) to /scratch for manipulation
    and copy results back to RAID storage
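
The copy-in, compute, copy-out pattern can be sketched as a shell function. stage_and_run is a hypothetical helper, the line count stands in for the real computation, and the directories are passed as arguments so that the same code works with /scratch on a worker node or with any test directory.

```shell
#!/bin/sh
# Stage-in / compute / stage-out pattern for a data-intensive job.
# stage_and_run is a hypothetical helper; directories are parameters.
stage_and_run() {
    input=$1        # input file on RAID storage
    scratch=$2      # local working area, e.g. /scratch
    results=$3      # destination directory on RAID storage

    work=$scratch/job.$$            # private subdirectory for this run
    mkdir -p "$work" || return 1
    cp "$input" "$work/" || return 1

    # Placeholder computation: count the lines of the local copy.
    wc -l < "$work/$(basename "$input")" | tr -d ' ' > "$work/result.txt"

    # Copy the result back, then clean up the local working area.
    cp "$work/result.txt" "$results/" && rm -rf "$work"
}
```

Inside a job it might be called as stage_and_run <input file on RAID> /scratch <results directory>, after coordinating the RAID location with the system administrator.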

36
Parallel Processing
  • MPI installed on interactive and worker nodes
    • MPICH 1.2.5
    • Path: /usr/local/mpich-1.2.5
  • Asking for multiple processors
    • -l nodes=x
    • -l ncpus=2x

37
Parallel Processing (continued)
  • PBS node file created when job executes
  • Available to job via $PBS_NODEFILE
  • Used to start processes on remote nodes
    • mpirun
    • rsh

38
Using node file (example job)
    #!/bin/sh
    #PBS -m n
    #PBS -l nodes=3:ppn=2
    #PBS -l walltime=00:30:00
    #PBS -j oe
    #PBS -o helloworld.out
    #PBS -N helloword_mpi
    NN=`cat $PBS_NODEFILE | wc -l`
    echo "Processors received "$NN
    echo "script running on host `hostname`"
    cd $PBS_O_WORKDIR
    echo
    echo "PBS NODE FILE"
    cat $PBS_NODEFILE
    echo
    /usr/local/mpich-1.2.5/bin/mpirun -machinefile \
        $PBS_NODEFILE -np $NN /mpi-example/helloworld
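
The job above takes its process count from the number of lines in the node file. To see how that count relates to the hosts involved, the sketch below summarizes a node file; summarize_nodes is a hypothetical helper, and because no node file exists outside a job, the script builds a sample with made-up host names in which each host appears once per requested processor, as would be expected for nodes=3:ppn=2.

```shell
#!/bin/sh
# Summarize a PBS node file: total process slots and the distinct hosts.
# summarize_nodes is a hypothetical helper.
summarize_nodes() {
    nodefile=$1
    slots=$(wc -l < "$nodefile" | tr -d ' ')            # one line per processor
    hosts=$(sort -u "$nodefile" | wc -l | tr -d ' ')    # distinct host names
    echo "$slots slots on $hosts hosts"
}

# Outside a job there is no $PBS_NODEFILE, so build a sample by hand.
# The host names are made up for the example.
sample=${TMPDIR:-/tmp}/nodefile.$$
printf '%s\n' node1 node1 node2 node2 node3 node3 > "$sample"
summarize_nodes "${PBS_NODEFILE:-$sample}"
rm -f "$sample"
```

The slot count is the value that the job script passes to mpirun with -np.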

39
MPI
  • Shared Memory vs. Message Passing
  • MPI
  • C based library to allow programs to communicate
  • Each cooperating execution is running the same
    program image
  • Different images can do different computations
    based on notion of rank
  • MPI primitives allow for construction of more
    sophisticated synchronization mechanisms
    (barrier, mutex)

40
helloworld.c
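    #include <stdio.h>
    #include <unistd.h>
    #include <string.h>
    #include "mpi.h"

    int main(int argc, char **argv)
    {
        int rank, size;
        char host[256];
        int val;

        /* Look up this node's host name; fall back if the call fails. */
        val = gethostname(host, 255);
        host[255] = '\0';
        if (val != 0)
            strcpy(host, "UNKNOWN");

        MPI_Init(&argc, &argv);
        /* The slide is cut off after MPI_Init. The remaining lines are an
           assumed, conventional completion so the program compiles. */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello world from process %d of %d on %s\n", rank, size, host);
        MPI_Finalize();
        return 0;
    }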

41
Using MPI programs
  • Compiling
    • /usr/local/mpich-1.2.5/bin/mpicc helloworld.c
  • Executing
    • /usr/local/mpich-1.2.5/bin/mpirun <options> \
      helloworld
  • Common options
    • -np: number of processes to create
    • -machinefile: list of nodes to run on

42
Resources for MPI
  • http://www-hep.uta.edu/mcguigan/dpcc/mpi
  • MPI documentation
  • http://www-unix.mcs.anl.gov/mpi/indexold.html
  • Links to various tutorials
  • Parallel programming course