Title: PowerPoint CDROM
1 (Title slide, Hebrew)
2 Homework Exercise No. 3
- Due: 27/12/2001
3 Lecture Slides
- The slides of lectures 1-10 given so far are available.
- Anyone who wants the original PowerPoint presentations can get them on a CDROM.
4 (Administrative notes, Hebrew)
5 Today's Topics
- Shared Memory
- Cilk, OpenMP
- MPI Derived Data Types
- How to Build a Beowulf
6 Shared Memory
- Go to the PDF presentation
- Chapter 8, "Programming with Shared Memory", from Wilkinson & Allen's book
7 Summary
- Process creation
- The thread concept
- Pthread routines
- How data can be created as shared
- Condition variables
- Dependency analysis: Bernstein's conditions
8 Cilk
- http://supertech.lcs.mit.edu/cilk
9 Cilk
- A language for multithreaded parallel programming based on ANSI C.
- Cilk is a general-purpose parallel programming language.
- Cilk is especially effective for exploiting dynamic, highly asynchronous parallelism.
10 A serial C program to compute the nth Fibonacci number.
11 A parallel Cilk program to compute the nth Fibonacci number.
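The parallel version on this slide is the classic example from the Cilk documentation; a sketch in Cilk-5 syntax (`spawn` forks a call to run in parallel, `sync` waits for all spawned children):

```
cilk int fib(int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);   /* run the child in parallel */
        y = spawn fib(n - 2);
        sync;                   /* wait for both children */
        return x + y;
    }
}
```

This is compiled and run with the commands shown on the next slide.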
12 Cilk - continued
- Compiling:
- cilk -O2 fib.cilk -o fib
- Executing:
- fib --nproc 4 30
13 OpenMP
The next 5 slides are taken from the SC99 tutorial given by Tim Mattson, Intel Corporation, and Rudolf Eigenmann, Purdue University.
14 (No Transcript)
15 (No Transcript)
16 (No Transcript)
17 (No Transcript)
18 (No Transcript)
19 Reading Material
- High-Performance Computing
- Part III
- Shared Memory Parallel Processors
20 Back to MPI
21 Collective Communication
Broadcast
22 Collective Communication
Reduce
23 Collective Communication
Gather
24 Collective Communication
Allgather
25 Collective Communication
Scatter
26 Collective Communication
There are more collective communication commands.
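A minimal sketch of two of these calls, assuming an MPI installation (compiled with mpicc and launched with mpirun; the values are illustrative):

```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, value, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Broadcast: root 0 sends its value to every process */
    value = (rank == 0) ? 42 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduce: sum one contribution per process at root 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("value=%d sum of ranks=%d\n", value, sum);

    MPI_Finalize();
    return 0;
}
```

After the broadcast every process holds value 42; the reduction leaves 0+1+...+(size-1) at rank 0.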
27 Advanced MPI Topics
- MPI Derived Data Types
- MPI-2 Parallel I/O
28 User Defined Types
- In addition to the predefined types, the user can construct new datatypes that describe how the data is laid out in memory.
- Compact pack/unpack.
29 Predefined Types
MPI_DOUBLE: double
MPI_FLOAT: float
MPI_INT: signed int
MPI_LONG: signed long int
MPI_LONG_DOUBLE: long double
MPI_LONG_LONG_INT: signed long long int
MPI_SHORT: signed short int
MPI_UNSIGNED: unsigned int
MPI_UNSIGNED_CHAR: unsigned char
MPI_UNSIGNED_LONG: unsigned long int
MPI_UNSIGNED_SHORT: unsigned short int
MPI_BYTE
30 Motivation
- What if you want to specify
- non-contiguous data of a single type?
- contiguous data of mixed types?
- non-contiguous data of mixed types?
Derived datatypes save memory, are faster, more portable, and more elegant.
31 3 Steps
- Construct the new datatype using the appropriate MPI routines: MPI_Type_contiguous, MPI_Type_vector, MPI_Type_struct, MPI_Type_indexed, MPI_Type_hvector, MPI_Type_hindexed
- Commit the new datatype: MPI_Type_commit
- Use the new datatype in sends/receives, etc.
32
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
  int rank;
  MPI_Status status;
  struct { int x; int y; int z; } point;
  MPI_Datatype ptype;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Type_contiguous(3, MPI_INT, &ptype);
  MPI_Type_commit(&ptype);
  if (rank == 3) {
    point.x = 15; point.y = 23; point.z = 6;
    MPI_Send(&point, 1, ptype, 1, 52, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(&point, 1, ptype, 3, 52, MPI_COMM_WORLD, &status);
    printf("P%d received coords are (%d,%d,%d)\n", rank, point.x, point.y, point.z);
  }
  MPI_Finalize();
  return 0;
}
33 User Defined Types
- MPI_TYPE_STRUCT
- MPI_TYPE_CONTIGUOUS
- MPI_TYPE_VECTOR
- MPI_TYPE_HVECTOR
- MPI_TYPE_INDEXED
- MPI_TYPE_HINDEXED
34 MPI_TYPE_STRUCT
is the most general way to construct an MPI derived type because it allows the length, location, and type of each component to be specified independently.

int MPI_Type_struct(int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype)
35 Struct Datatype Example
count = 2
array_of_blocklengths[0] = 1
array_of_types[0] = MPI_INT
array_of_blocklengths[1] = 3
array_of_types[1] = MPI_DOUBLE
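The example above (one int followed by three doubles) can be turned into code roughly as follows; this is a sketch, using the MPI-1-era MPI_Address call to compute displacements, and the struct and function names are hypothetical:

```
#include <mpi.h>

/* Matches: count = 2, blocklengths {1, 3}, types {MPI_INT, MPI_DOUBLE} */
struct particle { int a; double b[3]; };

void make_particle_type(MPI_Datatype *newtype) {
    int          blocklengths[2] = {1, 3};
    MPI_Datatype types[2]        = {MPI_INT, MPI_DOUBLE};
    MPI_Aint     displacements[2], base;
    struct particle p;

    /* Displacements are byte offsets of each member from the
     * start of the struct. */
    MPI_Address(&p,   &base);
    MPI_Address(&p.a, &displacements[0]);
    MPI_Address(&p.b, &displacements[1]);
    displacements[0] -= base;
    displacements[1] -= base;

    MPI_Type_struct(2, blocklengths, displacements, types, newtype);
    MPI_Type_commit(newtype);
}
```

Computing displacements from actual member addresses, rather than hand-counting bytes, keeps the type correct under compiler padding.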
36 MPI_TYPE_CONTIGUOUS
is the simplest of these, describing a contiguous sequence of values in memory. For example,
MPI_Type_contiguous(2, MPI_DOUBLE, &MPI_2D_POINT)
MPI_Type_contiguous(3, MPI_DOUBLE, &MPI_3D_POINT)

int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
37 MPI_TYPE_CONTIGUOUS
creates new type indicators MPI_2D_POINT and
MPI_3D_POINT. These type indicators allow you to
treat consecutive pairs of doubles as point
coordinates in a 2-dimensional space and
sequences of three doubles as point coordinates
in a 3-dimensional space.
38 MPI_TYPE_VECTOR
describes several such sequences evenly spaced
but not consecutive in memory.
MPI_TYPE_HVECTOR is similar to MPI_TYPE_VECTOR
except that the distance between successive
blocks is specified in bytes rather than elements.
MPI_TYPE_INDEXED describes sequences that may
vary both in length and in spacing.
39 MPI_TYPE_VECTOR
int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

count = 2, blocklength = 3, stride = 5
40 Another Example
#include <mpi.h>
#include <stdio.h>
#include <math.h>
int main(int argc, char *argv[]) {
  int rank, i, j;
  MPI_Status status;
  double x[4][8];
  MPI_Datatype coltype;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Type_vector(4, 1, 8, MPI_DOUBLE, &coltype);
  MPI_Type_commit(&coltype);
41 (continued)
  if (rank == 3) {
    for (i = 0; i < 4; i++)
      for (j = 0; j < 8; j++)
        x[i][j] = pow(10.0, i + 1) + j;
    MPI_Send(&x[0][7], 1, coltype, 1, 52, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(&x[0][2], 1, coltype, 3, 52, MPI_COMM_WORLD, &status);
    for (i = 0; i < 4; i++)
      printf("P%d my x[%d][2]=%lf\n", rank, i, x[i][2]);
  }
  MPI_Finalize();
  return 0;
}
42 Output
P1 my x[0][2]=17.000000
P1 my x[1][2]=107.000000
P1 my x[2][2]=1007.000000
P1 my x[3][2]=10007.000000
43 (No Transcript)
44 Committing a datatype
int MPI_Type_commit(MPI_Datatype *datatype)
45 Obtaining Information About Derived Types
- MPI_TYPE_LB and MPI_TYPE_UB can provide the lower and upper bounds of the type.
- MPI_TYPE_EXTENT can provide the extent of the type. In most cases, this is the amount of memory a value of the type will occupy.
- MPI_TYPE_SIZE can provide the size of the type in a message. If the type is scattered in memory, this may be significantly smaller than the extent of the type.
46 MPI_TYPE_EXTENT
MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent)

Correction: deprecated. Use MPI_Type_get_extent instead!
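A sketch of querying a committed type with the newer call (`show_type_info` is a hypothetical helper; the argument name follows the coltype vector example earlier):

```
#include <mpi.h>
#include <stdio.h>

void show_type_info(MPI_Datatype coltype) {
    MPI_Aint lb, extent;
    int size;

    /* Modern replacement for the deprecated MPI_Type_extent */
    MPI_Type_get_extent(coltype, &lb, &extent);
    /* Number of bytes actually carried in a message */
    MPI_Type_size(coltype, &size);
    printf("lb=%ld extent=%ld size=%d\n", (long)lb, (long)extent, size);
}
```

For a scattered type such as the column vector above, size (4 doubles = 32 bytes) is much smaller than the extent, which spans the strides between the blocks.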
47 Ref: Ian Foster's book DBPP
48 MPI-2
MPI-2 is a set of extensions to the MPI standard. It was finalized by the MPI Forum in June 1997.
49 MPI-2
- New Datatype Manipulation Functions
- Info Object
- New Error Handlers
- Establishing/Releasing Communications
- Extended Collective Operations
- Thread Support
- Fault Tolerance
50 MPI-2 Parallel I/O
- Motivation
- The ability to parallelize I/O can offer significant performance improvements.
- User-level checkpointing is contained within the program itself.
51 Parallel I/O
- MPI-2 supports both blocking and nonblocking I/O
- MPI-2 supports both collective and non-collective I/O
52 Complementary Filetypes
53 Simple File Scatter/Gather - Problem
54 MPI-2 Parallel I/O
- The main steps in using MPI-2 file I/O:
- MPI-2 file structure
- Initializing MPI-2 File I/O
- Defining a View
- Data Access - Reading Data
- Data Access - Writing Data
- Closing MPI-2 file I/O
55 How to Build a Beowulf
56 What is a Beowulf?
- A new strategy in High-Performance Computing (HPC) that exploits mass-market technology to overcome the oppressive costs in time and money of supercomputing.
57 What is a Beowulf?
- A collection of personal computers interconnected by widely available networking technology, running one of several open-source Unix-like operating systems.
58 COTS
- Commodity-off-the-shelf components
- Interconnection networks: LAN/SAN
- Price/Performance
59 How to Run Applications Faster
- There are 3 ways to improve performance:
- 1. Work harder
- 2. Work smarter
- 3. Get help
- Computer analogy:
- 1. Use faster hardware, e.g. reduce the time per instruction (clock cycle).
- 2. Use optimized algorithms and techniques.
- 3. Use multiple computers to solve the problem; that is, increase the number of instructions executed per clock cycle.
60 Motivation for using Clusters
- The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.
- Workstation clusters are easier to integrate into existing networks than special parallel computers.
61 Beowulf-class Systems: A New Paradigm for the Business of Computing
- Brings high-end computing to broad-ranged problems
- new markets
- Order-of-magnitude price/performance advantage
- Commodity enabled
- no long development lead times
- Low vulnerability to vendor-specific decisions
- companies are ephemeral; Beowulfs are forever
- Rapid-response technology tracking
- Just-in-place user-driven configuration
- requirement responsive
- Industry-wide, non-proprietary software environment
62 Beowulf Project - A Brief History
- Started in late 1993
- NASA Goddard Space Flight Center
- NASA JPL, Caltech, academic and industrial collaborators
- Sponsored by the NASA HPCC Program
- Applications: single-user science station
- data intensive
- low cost
- General focus
- single-user (dedicated) science and engineering applications
- system scalability
- Ethernet drivers for Linux
63 Beowulf System at JPL (Hyglac)
- 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory, Fast Ethernet card.
- Connected using a 100Base-T network, through a 16-way crossbar switch.
- Theoretical peak performance: 3.2 GFlop/s.
- Achieved sustained performance: 1.26 GFlop/s.
64 Cluster Computing - Research Projects (partial list)
- Beowulf (CalTech and NASA) - USA
- Condor - University of Wisconsin, USA
- HPVM (High Performance Virtual Machine) - UIUC, now UCSD, USA
- MOSIX - Hebrew University of Jerusalem, Israel
- MPI (MPI Forum; MPICH is one of the popular implementations)
- NOW (Network of Workstations) - Berkeley, USA
- NIMROD - Monash University, Australia
- NetSolve - University of Tennessee, USA
- PBS (Portable Batch System) - NASA Ames and LLNL, USA
- PVM - Oak Ridge National Lab./UTK/Emory, USA
65 Motivation for using Clusters
- Surveys show utilisation of CPU cycles of desktop workstations is typically <10%.
- Performance of workstations and PCs is rapidly improving.
- As performance grows, percent utilisation will decrease even further!
- Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.
66 Motivation for using Clusters
- The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems.
- Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.
- Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!
67 Original Food Chain Picture
68 1984 Computer Food Chain
Mainframe
PC
Workstation
Mini Computer
Vector Supercomputer
69 1994 Computer Food Chain
Mini Computer (hitting wall soon)
PC
Workstation
Mainframe
Vector Supercomputer (future is bleak)
MPP
70 Computer Food Chain (Now and Future)
71 (Diagram: Parallel Computing / Cluster Computing / MetaComputing - spanning Tightly Coupled, Vector, DASHMEM-NUMA systems, Pile of PCs, NOW/COW, WS Farms/cycle harvesting, Beowulf, and NT-PC Clusters)
72 PC Clusters: small, medium, large
73 (No Transcript)
74 Computing Elements
75 Networking
- Topology
- Hardware
- Cost
- Performance
76 Cluster Building Blocks
77 Channel Bonding
78 Myrinet
Myrinet 2000 switch
Myrinet 2000 NIC
79 Example: 320-host Clos topology of 16-port switches
(five groups of 64 hosts - from Myricom)
80 Myrinet
- Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports.
- Flow control, error control, and "heartbeat" continuity monitoring on every link.
- Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications.
- Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts.
- Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets.
81 Myrinet
- Sustained one-way data rate for large messages: 1.92 Gbit/s
- Latency for short messages: 9 µsec
82 Gigabit Ethernet
Cajun 550
Cajun M770
Cajun P882
Switches by 3COM and Avaya
83 (No Transcript)
84 Network Topology
85 Network Topology
86 Network Topology
87 Topology of the Velocity Cluster at CTC
88 Software: all this list for free!
- Compilers: FORTRAN, C/C++
- Java: JDK from Sun, IBM and others
- Scripting: Perl, Python, awk
- Editors: vi, (x)emacs, kedit, gedit
- Scientific writing: LaTeX, Ghostview
- Plotting: gnuplot
- Image processing: xview
- and much more!!!
89 Let's Build a Cluster
- 32 top-of-the-line processors
- at the lowest possible cost
90 Hardware
Dual P4 2GHz
91 How Much Does It Really Cost?
- Dual Pentium-4 node with 2GB RDRAM memory: $3,000
- 1GB memory/CPU
- Operating system cost: $0 (Linux)
92 How Much Does It Really Cost?
- PCI64B @ 133MHz Myrinet2000 NIC with 2M memory: $1,195
- Myrinet-2000 fiber cables, 3m long: $110
- 16-port switch with fiber ports: $5,625
93 How Much Does It Really Cost?
- KVM switch, 16-port: $1,000
- Avocent (Cybex) using cat5, IP over Ethernet
94 How Much Does It Really Cost?
- Nodes: 3,000 x 16 = $48,000
- Myrinet NICs and cables: (1,195 + 110) x 16 = $20,880
- Myrinet switch: $5,625
- KVM: $1,000
- Miscellaneous: $500
- Total: $76,005
95 What Does a GFLOP Cost?
- 2 x 32 = 64 GFLOPS
- 76,000 / 64 = $1,187/GFLOP
- Less than $1.2/MFLOP!!!
96 What Else Is Needed?
- Power!, adequate cooling, a dedicated room (air-conditioned).
- A file system shared by all the nodes (NFS or another file-sharing system).
- User accounts managed across the nodes, e.g. with NIS.
- One node acts as a gateway doing the routing, with an external IP address; internal addresses for the rest.
- Monitoring tools such as bWatch.
97 Installing the Software
- Install the operating system and software on a single node.
- Then replicate the installation to all the other nodes (e.g. with a disk-cloning tool such as Ghost).
98 Installing a Package XXX (e.g. MPI)
- Download xxx.tar.gz
- Uncompress: gzip -d xxx.tar.gz
- Untar: tar xvf xxx.tar
- Prepare the makefile: ./configure
- make
99 Cluster Configuration
- rlogin must be allowed (xinetd: disable=no)
- Create a .rhosts file
- Parallel administration tools: brsh, prsh and self-made scripts.
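A self-made script of the kind mentioned in the last bullet might look like this sketch (the hosts-file path is hypothetical; rsh access is assumed to be enabled as configured above):

```
#!/bin/sh
# Run the given command on every node listed in the hosts file,
# in parallel, via rsh.
HOSTS=/etc/beowulf-hosts   # hypothetical path; one hostname per line
for host in $(cat "$HOSTS"); do
    rsh "$host" "$@" &
done
wait    # block until every node has finished
```

For example, "./allrun uptime" would report the load on every node at once.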
100 References
- Beowulf: http://www.beowulf.org
- Computer Architecture: http://www.cs.wisc.edu/arch/www/
101 What's Next
- Advanced MPI topics
- Grid Computing
- (Hebrew, not recovered)
- Summary

Don't forget to submit the exercises! Good luck!