Title: EECE 573: Parallel Programming Using Clusters
1. EECE 573: Parallel Programming Using Clusters
- Prof. Sunggu Lee
- EE Dept., POSTECH
2. Course Introduction
- Objective: to learn about PC clusters and parallel programming using PC clusters
- Grading: 30% projects, 30% midterm, 30% final, 10% class participation
- Home Page: www.postech.ac.kr/class/ee770n
- Contact: slee@postech.ac.kr, Office 2-415, Phone 279-2236
- Ref.: Some slides based on/extracted from www.csse.monash.edu.au/rajkumar/cluster/index.html
3. Course Syllabus I
- Introduction (1-2 weeks)
- Parallel Programming Models and Paradigms
- Parallel Programming Languages and Environments
- Basic Parallel Programming (4-5 weeks)
- Distributed Memory Parallel Programming
- Using UNIX Utilities
- MPI Programming
- PVM Programming
- Shared Memory Parallel Programming
- Active Objects
- Tuple Space Programming
- Debugging Parallelized Code
4. Course Syllabus II
- Parallel Programming Applications (4-5 weeks)
- Parallel Simulation
- Hardware System Simulation
- Parallel Genetic Algorithms
- Advanced Topics (2-3 weeks)
- High Availability (HA) Techniques
- Load Balancing
- Using Middleware
- Grid Computing
- Review
5. Ch. 1: Parallel Programming Models and Paradigms
- Parallel Computing Architectures
- Vector Parallel Computer
- Cray, IBM S/390, NEC SX-6, etc.
- Traditional Parallel Architectures
- SMP (Symmetric Multiprocessor)
- Compaq 4-way server, etc.
- CC-NUMA (Cache-Coherent Non-Uniform Memory Access)
- MPP (Massively Parallel Computer)
- Cray T3E, Intel Paragon, etc.
- Cluster
- Grid
6. Scalable Parallel Computer Architectures I [Buyya 1999]
- MPP
- A large parallel processing system with a shared-nothing architecture
- Consists of several hundred nodes with a high-speed interconnection network/switch
- Each node consists of main memory and one or more processors
- Each node runs a separate copy of the OS
- SMP
- 2-64 processors today
- Shared-everything architecture
- All processors share all the global resources available
- A single copy of the OS runs on these systems
7. Scalable Parallel Computer Architectures II [Buyya 1999]
- CC-NUMA
- A scalable multiprocessor system with a cache-coherent non-uniform memory access architecture
- Every processor has a global view of all of the memory
- Distributed Systems
- Considered conventional networks of independent computers
- Have multiple system images, as each node runs its own OS
- The individual machines could be combinations of MPPs, SMPs, clusters, and individual computers
- Cluster
- A collection of workstations or PCs that are interconnected by a high-speed network
- Works as an integrated collection of resources
- Has a single system image (SSI) spanning all its nodes
8. Key Characteristics of Scalable Parallel Computers [Buyya 1999]
9. Cluster Computer and its Architecture [Buyya 1999]
- A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource
- One node within a cluster
- A single or multiprocessor system with memory, I/O facilities, and an OS
- Generally 2 or more computers (nodes) connected together, either in a single cabinet or physically separated and connected via a LAN
- Appears as a single system to users and applications
- Provides a cost-effective way to gain features and benefits
10. Cluster Computer Architecture [Buyya 1999]
11. Prominent Components of Cluster Computers (I) [Buyya 1999]
- Computing Node
- PC
- Workstation
- SMP (Symmetric Multiprocessor)
- MPP or Cluster → Grid
12. Prominent Components of Cluster Computers (II) [Buyya 1999]
- State-of-the-art Operating Systems
- Linux (Beowulf)
- Microsoft NT (Illinois HPVM)
- SUN Solaris (Berkeley NOW)
- IBM AIX (IBM SP2)
- HP UX (Illinois - PANDA)
- Mach (microkernel-based OS, CMU)
- Cluster Operating Systems: Solaris MC, SCO UnixWare, MOSIX (academic project)
- OS gluing layers (Berkeley GLUnix)
13. Prominent Components of Cluster Computers (III) [Buyya 1999]
- High Performance Networks/Switches
- Ethernet (10Mbps)
- Fast Ethernet (100Mbps)
- Gigabit Ethernet (1Gbps)
- 10G Ethernet (10Gbps)
- SCI (Dolphin; ~12 microsecond MPI latency)
- ATM (mostly used in WANs)
- Myrinet (1.2Gbps), Myrinet2000 (2.0Gbps)
- Digital (DEC, now Compaq) Memory Channel
- FDDI
14. Parallel Programming Environments and Tools [Buyya 1999]
- Threads (PCs, SMPs, NOW, ...)
- POSIX Threads
- Java Threads
- MPI (Message Passing Interface, a de facto message-passing standard)
- Available on Linux, NT, and many supercomputers
- PVM (Parallel Virtual Machine)
- Software DSM (Distributed Shared Memory)
- Can use sockets, shared memory (shmem), and other UNIX utilities
- Compilers
- C/C++/Java
- Parallel programming with C++ (MIT Press book)
- RAD (rapid application development tools)
- GUI-based tools for parallel program modeling
- Debuggers
- Performance Analysis Tools
- Visualization Tools
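As a concrete illustration of the message-passing environment listed above, here is a minimal MPI program in C (a sketch only; it is not taken from the reference):

/* Minimal MPI program: every process reports its rank and the total
 * number of processes.  Compile with mpicc, run with mpirun/mpiexec. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut down MPI */
    return 0;
}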
15. Code Parallelization
- Strategies for Developing Parallel Code
- Automatic Parallelization → difficult
- Depends on the use of a parallelizing compiler
- Parallel Libraries
- Most commonly used method
- Can use normal Fortran, C, or C++
- Use of a Dedicated Parallel Programming Language
- CC++, Concurrent C, etc.
16. Code Granularity
- Refer to Table 1.1 (p. 11)
- Fine grain parallelism
- One instance of a loop or instruction block
- Threads (lightweight processes with shared memory)
- Medium grain parallelism
- A function within a program
- Thread or process
- Large grain parallelism
- Heavyweight process
- A single, complete, separately executable program
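A small POSIX threads sketch to make the grain sizes concrete: the whole function handed to a thread is the medium-grain unit, while the loop iterations inside it are the fine-grain work (the worker function and data are illustrative assumptions):

/* Medium-grain parallelism: a whole function is handed to a thread.
 * Fine-grain parallelism would instead split the loop inside the
 * function across threads. */
#include <pthread.h>
#include <stdio.h>

#define N 8

static double partial[2];

/* the unit of parallel work is this function (medium grain) */
static void *sum_half(void *arg)
{
    int half = *(int *)arg;              /* 0 = first half, 1 = second half */
    double s = 0.0;
    for (int i = half * (N / 2); i < (half + 1) * (N / 2); i++)
        s += i;                          /* fine-grain work inside the function */
    partial[half] = s;
    return NULL;
}

int main(void)
{
    pthread_t t;
    int first = 0, second = 1;
    pthread_create(&t, NULL, sum_half, &second);   /* run one half in a new thread */
    sum_half(&first);                              /* run the other half here */
    pthread_join(t, NULL);
    printf("sum = %g\n", partial[0] + partial[1]); /* expect 28 for N = 8 */
    return 0;
}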
17. Parallel Programming Paradigms (Methods) I
- Task-Farming (or Master/Slave)
- Master process decomposes the problem into small tasks (jobs, threads, processes)
- Master distributes tasks to slave processes (like workers on a farm)
- Slave processes concurrently execute tasks
- Master collects results from slaves
- Repeat above procedure if necessary
- Refer to Fig. 1.4 (p. 20)
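A compact C/MPI sketch of the task-farming (master/slave) layout described above; the work function do_task and the message tags are placeholders, not part of the text:

/* Task farming: rank 0 is the master, all other ranks are slaves.
 * The master sends one task (an integer) to each slave, the slaves
 * compute a result, and the master collects the results. */
#include <mpi.h>
#include <stdio.h>

static int do_task(int task) { return task * task; }  /* placeholder work */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                   /* master */
        for (int w = 1; w < size; w++) {               /* distribute tasks */
            int task = w;
            MPI_Send(&task, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        }
        for (int w = 1; w < size; w++) {               /* collect results */
            int result;
            MPI_Recv(&result, 1, MPI_INT, w, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("result from slave %d: %d\n", w, result);
        }
    } else {                                           /* slave */
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = do_task(task);
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}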
18. Parallel Programming Paradigms (Methods) II
- Data Parallelism
- Large data space is partitioned
- Each partitioned data space is assigned to a separate process
- Each process executes on its assigned data partition
- Inter-process communication occurs in a mostly lock-step, adjacent-node manner
- If necessary, all completed results can be collected into the final data set
- Refer to Fig. 1.5 (p. 21)
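A C/MPI sketch of the data-parallel paradigm above using scatter/gather; the array size and the per-element operation are illustrative assumptions:

/* Data parallelism: rank 0 scatters equal chunks of an array to all
 * processes, each process works on its own chunk, and the chunks are
 * gathered back into the final data set. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int per_proc = 4;                       /* chunk size per process */
    double *full = NULL;
    double local[4];

    if (rank == 0) {                              /* rank 0 owns the full data */
        full = malloc(per_proc * size * sizeof(double));
        for (int i = 0; i < per_proc * size; i++)
            full[i] = i;
    }

    /* partition the data space across all processes */
    MPI_Scatter(full, per_proc, MPI_DOUBLE,
                local, per_proc, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < per_proc; i++)            /* work on the local partition */
        local[i] *= 2.0;

    /* collect the completed partitions into the final data set */
    MPI_Gather(local, per_proc, MPI_DOUBLE,
               full, per_proc, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("full[last] = %g\n", full[per_proc * size - 1]);
        free(full);
    }
    MPI_Finalize();
    return 0;
}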
19. Parallel Programming Paradigms (Methods) III
- Pipelining
- Based on a functional decomposition of the problem to be solved
- Each process executes one of the functions
- Each process corresponds to one stage of a pipeline used to solve the problem
- All stages of the pipeline (processes) can be executed in parallel
- Each stage of the pipeline should require an approximately equal amount of work
- Refer to Fig. 1.6 (p. 22)
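A C/MPI sketch of the pipelining paradigm above: each rank is one pipeline stage, and items flow from rank 0 through to the last rank (the per-stage operation is a placeholder):

/* Pipelining: each MPI process is one pipeline stage.  Items enter at
 * rank 0; each stage applies its own step and forwards the item to the
 * next rank, so different stages work on different items in parallel. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_items = 8;                          /* items flowing through */
    for (int i = 0; i < n_items; i++) {
        int item;
        if (rank == 0)
            item = i;                               /* first stage produces the item */
        else
            MPI_Recv(&item, 1, MPI_INT, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        item += rank;                               /* this stage's work */

        if (rank < size - 1)
            MPI_Send(&item, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("finished item: %d\n", item);    /* last stage consumes it */
    }
    MPI_Finalize();
    return 0;
}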
20. Parallel Programming Paradigms (Methods) IV
- Divide and Conquer (Fig. 1.7, p. 22)
- Main problem is divided into two or more subproblems
- Each subproblem is solved by a separate process (in parallel with others)
- A subproblem can be a smaller instance of the main problem → recursive method
- Can sometimes be solved using the task-farming (master/slave) method
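A minimal divide-and-conquer sketch in C with POSIX threads: each call splits its range, solves one half in a new thread and the other recursively, then combines the partial results (the array size and cutoff are illustrative assumptions):

/* Divide and conquer: split the range, solve the halves in parallel,
 * combine the results.  Below the cutoff the subproblem is solved
 * sequentially (the base case). */
#include <pthread.h>
#include <stdio.h>

#define CUTOFF 1000          /* below this size, solve sequentially */

struct range { const int *a; int lo, hi; long sum; };

static void *sum_range(void *arg)
{
    struct range *r = arg;
    if (r->hi - r->lo <= CUTOFF) {          /* base case: small subproblem */
        r->sum = 0;
        for (int i = r->lo; i < r->hi; i++)
            r->sum += r->a[i];
        return NULL;
    }
    int mid = (r->lo + r->hi) / 2;          /* divide */
    struct range left  = { r->a, r->lo, mid, 0 };
    struct range right = { r->a, mid, r->hi, 0 };
    pthread_t t;
    pthread_create(&t, NULL, sum_range, &left);   /* conquer left half in parallel */
    sum_range(&right);                            /* conquer right half here */
    pthread_join(t, NULL);
    r->sum = left.sum + right.sum;          /* combine */
    return NULL;
}

int main(void)
{
    static int a[100000];
    for (int i = 0; i < 100000; i++) a[i] = 1;
    struct range whole = { a, 0, 100000, 0 };
    sum_range(&whole);
    printf("sum = %ld\n", whole.sum);       /* expect 100000 */
    return 0;
}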
21. Parallel Programming Paradigms (Methods) V
- Speculative Parallelism
- System attempts lookahead execution with several possible execution paths
- If an execution path turns out to be incorrect, its result is thrown away
- Separate processes execute candidate solution paths in parallel (concurrently)
- Useful in simulation or calculation problems
- Hybrid methods also possible
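A small speculative-parallelism sketch in C with POSIX threads: both candidate paths are computed while the (slow) condition is still being evaluated, and the result of the path not taken is discarded (the branch and condition functions are placeholders):

/* Speculative parallelism: run both candidate branches in parallel
 * with the decision; keep the result of the correct branch, throw the
 * other away. */
#include <pthread.h>
#include <stdio.h>

static double result_a, result_b;

static void *branch_a(void *arg) { (void)arg; result_a = 1.0; return NULL; }
static void *branch_b(void *arg) { (void)arg; result_b = 2.0; return NULL; }

static int slow_condition(void)
{
    /* stands in for an expensive test that decides which path is correct */
    return 1;
}

int main(void)
{
    pthread_t ta, tb;
    pthread_create(&ta, NULL, branch_a, NULL);   /* speculatively run path A */
    pthread_create(&tb, NULL, branch_b, NULL);   /* speculatively run path B */

    int take_a = slow_condition();               /* decide while both paths run */

    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    /* keep the result of the correct path, discard the other */
    printf("result = %g\n", take_a ? result_a : result_b);
    return 0;
}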
22. Programming Skeletons or Templates
- Skeleton or Template
- A piece of generic code which can be used for many different problems
- Just change or add a few small sections
- Like a form which can be filled in
- Base code is typically incomplete
- Requires parameters to be specified
- Aids in reusability and portability
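A tiny skeleton example in C with POSIX threads: the generic parallel-map code stays the same across problems, and the user only "fills in the form" by supplying the element function (the thread count and the sample function square are illustrative assumptions):

/* Skeleton/template: par_map is the reusable, incomplete base code;
 * the user-supplied function f is the small section that changes
 * from problem to problem. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

struct slice { double (*f)(double); const double *in; double *out; int lo, hi; };

static void *run_slice(void *arg)            /* generic worker: never changes */
{
    struct slice *s = arg;
    for (int i = s->lo; i < s->hi; i++)
        s->out[i] = s->f(s->in[i]);
    return NULL;
}

/* the reusable skeleton: applies f to every element of in[] in parallel */
static void par_map(double (*f)(double), const double *in, double *out, int n)
{
    pthread_t tid[NTHREADS];
    struct slice sl[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        sl[t].f = f; sl[t].in = in; sl[t].out = out;
        sl[t].lo = t * n / NTHREADS;
        sl[t].hi = (t + 1) * n / NTHREADS;
        pthread_create(&tid[t], NULL, run_slice, &sl[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}

static double square(double x) { return x * x; }   /* the part the user fills in */

int main(void)
{
    double in[8] = {0, 1, 2, 3, 4, 5, 6, 7}, out[8];
    par_map(square, in, out, 8);
    printf("out[7] = %g\n", out[7]);                /* expect 49 */
    return 0;
}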