Title: MultiC and HPF DATA PARALLEL LANGUAGES
The MultiC Language
- References for MultiC
- The multiC Programming Language, Preliminary
Documentation, WaveTracer, PUB-00001-001-00.80,
Jan. 1991. - The multiC Programming Language, User
Documentation, WaveTracer, PUB-00001-001-1.02,
June 1992. - Note: This presentation is based on the 1991
manual unless otherwise noted. (The term manuals
refers to both versions.) - MultiC is the language used on the WaveTracer and
the Zephyr SIMD computers. - The Zephyr is a second-generation WaveTracer, but
was never commercially available. - We were given 10 Zephyrs and several other
incomplete Zephyrs to use for spare parts. - A new version of MultiC was designed for their
third-generation computer, but neither was released. - Both MultiC and a parallel language designed for
the MasPar are fairly similar to an earlier
parallel language called C*. - C* was designed by Guy Steele for the Connection
Machine. - All are data parallel and extensions of the C
language - An assembler was also written for the WaveTracer
(and probably the Zephyr). - It was intended for use only by company
technicians.
- Information about the assembler was released to
WaveTracer customers on a need-to-know basis. - No manual was distributed, but some details were
recorded in a short report. - Professor Potter was given some details needed to
port the ASC language to the WaveTracer. - MultiC is an extension to ANSI C, as documented
by the following book: - The C Programming Language, Second Edition, 1988,
Kernighan and Ritchie. - The WaveTracer computer is called a Data
Transport Computer (DTC) in the manuals - a large amount of data can be moved in parallel
using interprocessor communications. - Primary expected uses for the WaveTracer were
scientific modeling and scientific computation - acoustic waves
- heat flow
- fluid flow
- medical imaging
- molecular modeling
- neural networks
- The 3-D applications are supported by a 3D mesh
on the WaveTracer - Done by sampling a finite set of points (nodes)
in space.
WaveTracer Architecture Background
- Architecture for Zephyr is fairly similar
- Exceptions will be mentioned whenever known
- Each board has 4096 bit-serial processors, which
can be connected in any of the following ways - 16x16x16 cube in 3D space
- 64x64 square in 2D space
- 4096 array in 1D space
- The 3D architecture is native on the WaveTracer;
the other networks are supported primarily by
means of the 3D hardware - The Zephyr probably has a 2D network and only
simulates the more expensive 3D network using
system software. - WaveTracer was available in 1, 2, or 4 boards,
arranged as follows - 2 boards were arranged as a 16x32x16 cube
- one cube stacked on the top of another cube
- 8192 processors overall
WaveTracer Architecture (Cont.)
- Four boards are arranged as a 32x32x16 cube
- 16,384 processors
- Arranged as two columns of stacked cubes
- Computer supports automatic creation of virtual
processors and network connections to connect
these virtual processors. - If each processor supports k nodes, this slows
down execution speed by a factor of k - Each processor performs each operation k times.
- Limited by the amount of memory required for each
virtual node - In practice, slowdown is usually less than k
- The set of virtual processors supported by a
physical processor is called its territory.
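- Illustrative example (not from the manuals): simulating a
64x64x64 grid of nodes on a 16x16x16 machine gives each
physical processor a territory of k = 4x4x4 = 64 virtual
nodes, so each operation is repeated up to 64 times.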
Specifiers for MultiC Variables
- Any datatype in C except pointers can be declared
to be a parallel variable using the declaration
multi - This replicates the data object for each
processor to produce a 1-, 2-, or 3-dimensional data
object - In a parallel execution, all multi objects must
have the same dimension. - The multi declaration follows the same format as
ANSI C, e.g., - multi int imag, buffer;
- The uni declaration is used to declare a scalar
variable - Is the default and need not be shown.
- The following are equivalent
- uni int *ptr;
- int *ptr;
- Bit Length Variables
- can be of type uni or multi
- Allows user to save memory
- All operations can be performed on these
bit-length values - Example: A 2-color image can be declared by
- multi unsigned int image:1;
- and an 8-color image by
- multi unsigned int picture:3;
Some Control Flow Commands
- For uni type data structures, control flow in
MultiC is identical to that in ANSI C. - The parallel IF-ELSE Statement
- As in ASC, both the IF and ELSE portions of the
code are executed. - As with ASC, the IF is a mask-setting operation
rather than a branching command - FORMAT: Same as for C
- WARNING: In contrast to ASC, both sets of
statements are executed. - Even if no responders are active in one part, the
sequential commands in that part are executed. - Example: count = count + 1; (see the sketch below)
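- A minimal sketch of this pitfall, with hypothetical names:
count is a uni variable, so the increment is a sequential
command and runs in both branches.
- multi int x;
- uni int count = 0;
- if (x > 0)
- count = count + 1; /* runs even if no PE has x > 0 */
- else
- count = count + 1; /* runs as well, so count ends at 2 */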
- The parallel WHILE statement
- The format used is
- while(condition)
- The repetition continues as long as condition
is satisfied by one or more responders. - Only those responders (i.e., the ones that satisfy
condition prior to this pass through the
body of the while) are active during the execution
of the body of the while, as sketched below.
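- A minimal sketch of the parallel while, assuming a
hypothetical multi variable A: the loop repeats while at
least one responder satisfies the condition, and only the
responders execute the body.
- multi int A;
- while (A > 100) /* repeats while some component of A exceeds 100 */
- A = A / 2; /* halves only the responders' components */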
Other Commands
- Jump Statements
- goto, return, continue, break
- These commands are in conflict with structured
programming and should be used with restraint. - Parallel Reduction Operators
- *? Accumulative Product
- /? Reciprocal Accumulative Product
- +? Accumulative Sum
- -? Negate then Accumulative Sum
- &? Accumulative bitwise AND
- |? Accumulative bitwise OR
- >? Accumulative Maximum
- <? Accumulative Minimum
- Each of the above reduction operators returns a
uni value and provides a powerful arithmetic
operation. - Each accumulative operation would otherwise
require one or more ANSI C loop constructs. - Example: If A is a multi data type,
- largest_value = >? A;
- smallest_value = <? A;
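- A short illustrative sketch combining reduction operators
(the variable names are hypothetical): summing a multi
variable, counting responders, and averaging.
- multi float A;
- uni float total, avg;
- uni int n;
- total = +? A; /* accumulative sum over active PEs */
- n = +? (multi int)1; /* count of responders */
- avg = total / n; /* average of the active components */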
- Data Replication
- Example
- multi int A = 0;
- ...
- A = 2;
- First statement stores 0 in every A field
(compile time) - Last statement stores 2 in A field of every
active PE. - Interprocessor Communications
- Operators have the form
- [dx][dy][dz]m
- This operator can shift the components of the
multi variable m of all active processors along
one or more coordinate dimensions. - Example: A = [-1][2][1]B;
- Causes each active processor to move the data in
its B field to the A field of the processor at
the following location - one unit in the negative X direction
- two units in the positive Y direction
- one unit in the positive Z direction
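- As an illustrative sketch (not from the manuals; the
variable names are hypothetical), one step of a 2D heat-flow
relaxation can average each node with its four neighbors
using these shift operators:
- multi float T, N, S, E, W;
- N = [0][1][0]T;
- S = [0][-1][0]T;
- E = [1][0][0]T;
- W = [-1][0][0]T;
- T = (N + S + E + W) / 4.0; /* each node becomes its neighbor average */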
- Coordinate Axes (figure omitted)
- Conventions
- If value of dz operator is not specified, it is
assumed to be 0 - If the values of dy and dz operators are not
specified, both are assumed to be 0 - Example: [x][y]V is the same as [x][y][0]V
- Inactive processor actions
- Does not send its data to another processor
- Participates in moving the data from other
processors along. - Transmission of data occurs in lock step (SIMD
fashion) without congestion or buffering. - Coordinate Functions
- Used to return a coordinate for each active
virtual processor. - Format multi_x(), multi_y(), and multi_z()
- Example
- if (multi_x() == 0 && multi_y() == 2 && multi_z()
== 1) - u = +? A;
- Note that all processors except the one at
(0,2,1) are inactive within the body of the IF. - The accumulated sum of the active components of
the multi variable A is just the value of the
component of A at processor (0,2,1) - Effect of this example is to store the value in A
at (0,2,1) in the uni variable u.
- If the second command in the example is changed
to - A = u;
- the effect is to store the contents of the uni
variable u - into multi variable A at location (0,2,1).
- (see manual pp. 11-13 and 11-14 for more details)
- Arrays
- Multi-pointers are not supported.
- Cannot have a parallel variable containing a
pointer in each component. - uni pointers to multi variables are allowed
(see the sketch below).
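- A minimal sketch of the allowed case (hypothetical names):
- multi int A;
- multi int *p = &A; /* a uni pointer to a multi variable: allowed */
- *p = 5; /* stores 5 in A on every active PE */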
- Array Examples
- int array_1[10];
- int array_2[5][5];
- multi int array_3[5];
- array_1 is a 1 dimensional standard C array
- array_2 is a 2 dimensional standard C array
- array_3 is a 1-dimensional array of multi
variables - MULTI_PERFORM Command
- Command gives the size of each dimension of all
multi-values prior to calling for a parallel
execution. - Format: multi_perform(function, xsize, ysize, zsize)
- multi_perform is normally called within the main
program. - Usually calls a subroutine that includes all of
the - parallel work
- parallel I/O
- The main program usually includes
- Opening and closing of files
- Some of the scalar I/O
- #define and #include statements
- When multi_perform is called, it initializes any
extern and static multi objects - In the previous example, multi_perform calls
func. After func returns, the multi space created
for it becomes undefined. - The perror function is extended to print error
messages corresponding to errno numbers resulting
from the execution of MultiC extensions. - Has the following format
- if (multi_perform(func, x, y, z)) perror(argv[0]);
- See usage in the examples in Appendix A
- More information on page 11-2 of manual
- Examples in Manual
- Many examples in the manual
- 17 in appendices alone
- Also stored under exname.mc in the MultiC package
The AnyResponder Function
- Code Segment for Tallying Responders
- uni unsigned int num_short, tall; /* "short" is a C keyword, so that counter is renamed here */
- multi float height;
- load_height(); /* assigns value in inches to
height */ - if (height > 70)
- tall = +? (multi int)1;
- else
- num_short = +? (multi int)1;
- printf("There are %d tall people\n", tall);
- Comments on Code Segment
- Note that the construct
- +? (multi int)1
- counts the active PEs (i.e., responders).
- This technique avoids setting up a bit field to
use to tally active PEs. - Instead sets up a temporary multi variable.
- Can be used to see if there is at least one
responder at any given time. - Check to see if the resulting sum is positive
- Provides a technique to define the AnyResponder
function needed for associative programming
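- For example, an AnyResponder test might be written as the
following sketch (assuming the counting construct above):
- if ((+? (multi int)1) > 0)
- {
- /* at least one PE is currently active */
- }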
Accessing Components from Multi Variables
- Code from page 11-13 or 11-14 of MultiC manuals
- #include <multi.h> /* includes the multi library */
- #include <stdlib.h>
- #include <stdio.h>
- void work (void)
- {
- uni int a, b, c, u;
- multi int n;
- /* Code goes here to assign values to n */
- /* Code goes here to assign values to a, b, c */
- if (multi_x() == a && multi_y() == b &&
- multi_z() == c)
- u = +? n; /* Assigns value of n at PE (a,b,c) */
- }
- int main (int argc, char *argv[])
- {
- if (multi_perform(work, 7, 7, 7))
- perror (argv[0]);
- exit (EXIT_SUCCESS);
- }
The oneof and next Functions
- Function oneof provides a way of selecting one
out of several active processors - Defined in Multi Struct program (A.15) in manual
- Procedure is essential for associative
programming. - Code for oneof
- multi unsigned oneof(void)
- {
- /* Store the coordinate values in multi
variables x and y */ - multi unsigned x = multi_x(),
- y = multi_y(),
- uno = 0;
- /* Next select the processor with the highest
coordinate value */ - if (x == >? x)
- if (y == >? y)
- uno = 1;
- return uno;
- }
- Note that the multi variable uno stores a 1 for
exactly one processor, and all the other
components of uno store a 0. - The function oneof can be used by another
procedure which is called by multi_perform. - An example of oneof being called by another
procedure is given on pages A46-50 of the
manuals. - Should be useable in the form
- if(oneof()) / Check to see if an active
responder exists /
- The preceding procedure assumed a 2D configuration
of processors with z = 1. - If the configuration is 3D, the process of selecting
the coordinates can be continued by also
selecting the highest z-coordinate. - Stepping through the active PEs (i.e., next)
- Provides the MultiC equivalent of the ASC next
command - An additional one-bit multi integer variable
called bi (for busy-idle) is needed. - First set bi to zero
- Activate the PEs you wish to step through.
- Next, have the active PEs write a 1 into bi.
- Use
- if (oneof())
- to restrict the mask to one active PE.
- Perform all desired operations with this PE.
- Have active PE set its bi value to 0 and then
exit the preceding if statement. - Use the +? (accumulative sum) operator to see
if any PEs remain to be processed. - If so, return to the step above calling oneof
- This return can be implemented using a while loop,
as sketched below.
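- A sketch of the stepping loop just described, assuming
oneof() as defined above (the activation mask and the
processing code are hypothetical):
- multi int bi = 0;
- if (height > 70) /* activate the PEs to step through */
- bi = 1;
- while ((+? bi) > 0) /* any PEs left to process? */
- {
- if (bi) /* restrict to the remaining PEs */
- if (oneof()) /* select exactly one of them */
- {
- /* ... perform desired operations with this PE ... */
- bi = 0; /* mark it done */
- }
- }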
Sequential Printing of Multi Variable Values
- Example: Print a block of the image 2D bit array.
- A function select_int is used which will return
the value of image at the specified (x,y,z)
coordinate. - The printing occurs in two nested loops which
- increment the value of x from 0 to some
specified constant. - increment the value of y from 0 to some
specified constant. - This example is from page 8-1 of the manuals and
is used in an example on pgs A16-18 of 1991
manual and pgs A12-14 of 1992 manual. - The select_int function
- int select_int (multi int *mptr, int x, int y, int z)
- {
- /* Here, mptr is a uni pointer to type multi */
- int r;
- if (multi_x() == x &&
- multi_y() == y &&
- multi_z() == z)
- /* Restricts scope to the one PE at (x,y,z) */
- r = |? *mptr;
- /* The OR reduction operator transfers the binary value
of the multi variable at (x,y,z) to the uni variable */
- return r;
- }
- The two loops to print a block of values of the
image multi variable. - for (y = 0; y < ysize; y++) {
- for (x = 0; x < xsize; x++)
- printf("%d", select_int (&image, x, y, z));
- printf("\n");
- }
- The above technique can be adapted to print or read
multi variables or parts of multi variables.
accessed is small. - If I/O operations involve a large amounts of
data, the more efficient data transfer functions
described in manuals (Chapter 8 and Section 11.2
and 11.13) should be used. - The functions multi_fread and multi_fwrite are
analogous to fwrite and fread in C. Information
about them is given on pages 11-1 to 11-4 of the
manuals.
Moving Data between Uni Arrays and Multi Variables
- The following functions allow the user to move
data between uni arrays and multi variables - multi_from_uni type
- multi_to_uni type
- The above type may be replaced with a data type
such as - char
- short
- int
- long
- float
- double
- cfloat
- cdouble
- These functions are illustrated in several of the
examples.
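- A hypothetical sketch of their use, assuming the int
versions are spelled multi_from_uni_int and multi_to_uni_int
and take (destination, source) arguments; both the exact
names and the argument order are assumptions, so see
Chapter 8 of the manuals for the real signatures.
- uni int buf[4096]; /* one value per processor */
- multi int A;
- multi_from_uni_int(&A, buf); /* hypothetical: uni array to multi variable */
- multi_to_uni_int(buf, &A); /* hypothetical: multi variable to uni array */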
Compiling and Executing Programs on the Zephyr
- A 4k Zephyr machine is available for use in the
Parallel and Associative Computing Lab. - It is presently connected to a Windows 2003
Server which supports remote desktop for
interactive use. However, you may use the
computer directly at the console while the lab is
open. - Visual Studio 2002 has been installed on the
server. The MultiC language uses a compiler
wrapper to translate MultiC code into Visual C++
code. - Programming the Zephyr on a Windows 2003 system
is similar to using command-line programming
tools in UNIX. - You can edit your program using Edit or
Notepad - You can compile and create an executable using
nmake - You can execute your program using the Visual
Studio Command Shell - This is a special DOS shell that has extra path,
include, and library environment variables used
by the compiler and linker.
Compiling and Executing Programs on the Zephyr (Cont.)
- Login or use Remote Desktop Connection to
zserver.cs.kent.edu - From Windows XP choose Start > Programs >
Accessories > Communications > Remote Desktop
Connection - Enter your login name and password and click on
OK - Open a command window and run the DTC Monitor
program - Type dtcmonitor at the command prompt.
- This is a daemon program that serializes and
controls executables using the Zephyr. - When this is 100% complete, you can then execute
programs on the Zephyr. - You can minimize this command shell.
- Important: When you are finished, enter CTRL-C to
end the dtcmonitor. - Create a folder on your desktop for programs.
You can copy the example Zephyr MultiC program
from D:\Common\zephyrtest to your local folder
and rename it for your programming assignment.
Compiling and Executing Programs on the Zephyr (Cont.)
- Create or edit your MultiC program using DOS edit
or Windows Notepad. - From the Visual Studio Command Shell type
- edit anyprog.mc
- notepad anyprog.mc
- Make sure that the file extension is .mc
- Save your work before compiling
- Modify the makefile template and change the names
of the MultiC file and object file to those used
in your programming assignment. - Compile and link your program by typing
- nmake /f anyprog.mak
- nmake (for the default Makefile)
- Execute your program by typing the name of your
executable at the command prompt. - When you are finished enter CTRL-C to end the
dtcmonitor.
Fortran 90 and HPF (High Performance Fortran)
- A de facto standard for scientific and
engineering computations
Fortran 90 and HPF
- References
- Primary: Ian Foster, Designing and Building
Parallel Programs, (online copy), chapter 7. - Jordan and Alaghband, Fundamentals of Parallel
Processing, Prentice Hall, 2003, Sec. 3.6 - Recall that data parallelism refers to the concurrency
that occurs when the same operation is
executed on some or all elements in a data set. - A data parallel program is a sequence of such
operations. - Fortran 90 (or F90) is a data-parallel programming
language. - Extensive augmentation of Fortran 77.
- Some job control algorithms cannot be expressed
in a data parallel language. - F90's array assignment statement and array
functions can be used to specify certain types of
data parallel computation. - F90 forms the basis of HPF (High Performance
Fortran) which augments F90 with a small set of
extensions. - In F90 and HPF, the (parallel) data structures
operated on are restricted to arrays. - E.g., data types such as trees, sets, etc. are
not supported. - All array elements must have the same type.
- Fortran arrays can have up to 7 dimensions.
- Parallelism in F90 can be expressed explicitly,
as in the array assignment statement - Integer A(10,10), B(10,10), C(10,10)
- ...
- A = B*C ! A, B, C are arrays
- Compilers may be able to detect implicit
parallelism, as in the following example - do i = 1, m
- do j = 1, n
- A(i,j) = B(i,j)*C(i,j)
- enddo
- enddo
- Parallel execution of the above code depends on the
fact that the various loop iterations are independent - i.e., one iteration does not write/read a variable
that another iteration writes/reads.
operations when the computation mapped to one PE
requires data mapped to another PE. - Communication operations in F90 (and HPF) are
inferred by the compiler and do not need to be
specified by the programmer. - These are derived by the compiler from the data
decomposition specified by the programmer. - F90 allows a variety of scalar operations (i.e.,
defined on a single value) to be applied to an
entire array.
- All of F90's unary and binary operations can be
applied to arrays as well, as illustrated in
the examples below - real A(10,20), B(10,20), c
- logical L(10,20)
- ...
- A = B*c
- A = A + 1.0
- A = sqrt(A)
- L = A .EQ. B
- The function of the mask is handled in F90 by the
where statement, which has two forms. - The first form uses the where to restrict array
elements on which an assignment is performed - For example, the following replaces each nonzero
entry of array with its reciprocal - where(x / 0) x 1.0/X
- The second form of the where is block structured
and has the form - where (mask-expression)
- array_assignment
- elsewhere
- array_assignment
- end where
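- An illustrative use of the block form (the array names are
hypothetical):
- where (A > 0.0)
- B = log(A)
- elsewhere
- B = 0.0
- end where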
Some F90 Array Intrinsic Functions
- Array intrinsic functions below assume a vector
version of an array is formed using column-major
ordering - Some F90 array intrinsic functions
- RESHAPE(A,...) converts array A into a new array
with specified shape using fill if needed - PACK(A, MASK, FILL) forms a vector from masked
elements of A, using fill as needed. - UNPACK(A,MASK, FILL) replaces masked elements
with elements from FILL vector - MERGE(A, B, MASK) returns array of elements from
A where mask is true, else from B. - SPREAD(A, DIM, N) replicate array A using N
copies to form a new array of one larger
dimension - CSHIFT(A, SHIFT, DIM) Column major shift of all
vectors of A along dimension DIM by SHIFT - EOSHIFT(A,...) elements of A are shifted off the
end along specified dimension, with end values
with fill from either a specified scalar or array
of dimension 1 less than A - TRANSPOSE(A) returns transpose of array A.
- Some array intrinsic functions that perform
computation - MAXVAL(A) returns the maximum value of A
- MINVAL(A) returns the minimum value of A
- SUM(A) returns the sum of the elements of A
- PRODUCT(A) product of elements of A
- MAXLOC(A) indices of max value in A
- MINLOC(A) indices of min value in A
- MATMUL(A,B) matrix multiplication AB
- DOT_PRODUCT(A,B) vector dot product
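- A short illustrative fragment using a few of these
intrinsics (the values are hypothetical):
- real A(4,4), v(16)
- integer loc(2)
- A = 1.0 ! scalar broadcast to the whole array
- v = RESHAPE(A, (/ 16 /)) ! flatten A in column-major order
- loc = MAXLOC(A) ! indices of the (first) maximum element
- print *, SUM(A), MAXVAL(A) ! sum and maximum of the elements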
The HPF Data Distribution Extension
- Reference Ian Foster, Designing and Building
Parallel Programs, (online copy), chapter 7. - F90 array expressions specify opportunities for
parallel execution but give no control over how
these are performed so that communication is minimized. - HPF handling of data distribution involves three
directives - The PROCESSORS directive specifies the shape and
size of the array of abstract processors. - The ALIGN directive is used to align elements of
different arrays with each other, indicating that
they should be distributed in the same manner. - The DISTRIBUTE directive is used to distribute an
object (and all objects aligned with it) onto an
abstract processor array. - The data distribution directives can have a major
impact on a program's performance (but not on the
results computed), affecting - Partitioning of data to processors
- Agglomeration: Considering the value of combining
tasks to produce fewer, larger tasks. - Communications required to coordinate task
execution. - Mapping of tasks to processors
HPF Data Distribution (Cont.)
- Data distribution directives are recommendations
to an HPF compiler, not instructions. - The compiler can ignore them if it determines that
this action will improve performance. - PROCESSORS directive
- Creates an arrangement for abstract processors
and gives this arrangement a name. - Example: !HPF$ PROCESSORS P(4,8)
- Normally one abstract processor is created for
each physical processor. - There could be more abstract processors than
physical ones. - However, HPF does not specify a way of mapping
abstract to physical processors. - ALIGN Directive
- Specifies array elements that should, if
possible, be mapped to the same processor. - Operations involving data objects that are
aligned are likely to be more efficient due to
reduced communication costs if on same PE. - EXAMPLE
- real B(50), C(50)
- !HPF$ ALIGN C(:) WITH B(:)
HPF Data Distribution (Cont.)
- ALIGN Directive (cont.)
- A * can be used to collapse dimensions (i.e.,
to match one element with many elements) - Considerable flexibility is allowed in specifying
which array elements are to be aligned. - Dummy variables can be used for dimensions
- Integer formulas to specify offsets.
- An align statement can be used to specify that
elements of an array should be replicated over
certain processors. - Costly if replicated arrays are updated often.
- Increases communication or redundant computation.
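- Illustrative sketches of these two uses (the arrays are
hypothetical; only one of the ALIGN directives would be used
at a time):
- real B(50), D(50,50)
- !HPF$ ALIGN D(i,*) WITH B(i) ! collapse: all of row i of D with B(i)
- !HPF$ ALIGN B(i) WITH D(i,*) ! replicate: B(i) wherever any D(i,j) lives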
- DISTRIBUTE Directive
- Indicates how data are to be distributed among
processor memories. - Specifies for each dimension of an array one of
three ways that the N array elements in an array
dim. will be distributed among the P processors - * No distribution
- BLOCK(n) Block distribution
- (default n = N/P)
- CYCLIC(n) Cyclic distribution
- (default n = 1)
HPF Data Distribution (Cont.)
- DISTRIBUTE Directive (cont.)
- Block distribution divides the N items/indices in
that dimension into equal-sized blocks of size
N/P, where P is the number of abstract PEs. - Cyclic distribution maps every Pth index to the
same processor. - Applies not only to the named array but also to
any array that is aligned to it. - The following directives specify a
mapping for arrays A and B. - !HPF$ PROCESSORS P(20)
- real A(100,100), B(100,100), C(100,100)
- !HPF$ ALIGN B(:,:) WITH A(:,:)
- !HPF$ DISTRIBUTE A(BLOCK,*) ONTO P
HPF Concurrency
- The F90 array assignment statements provide a
convenient way of specifying data parallel
operations. - However, this does not apply to all data parallel
operations, as the array on the right-hand side must
have the same shape as the one on the left-hand
side. - HPF provides two other constructs to exploit data
parallelism, namely the FORALL statement and the
INDEPENDENT directive. - The FORALL Statement
- Allows more general assignments to sections of
an array. - General form is
- FORALL (triplet, ... , triplet, mask) assignment
- Examples
- FORALL (i=1:m, j=1:n) X(i,j) = i*j
- FORALL (i=1:n, j=1:n, i<j) Y(i,j) = 0.0
- The INDEPENDENT Directive and Do-Loops
- The INDEPENDENT directive can be used to assert
that the iterations of a do-loop can be performed
independently, that is - They can be performed in any order
- They can be performed concurrently
- The INDEPENDENT directive must immediately
precede the do-loop that it applies to. - Examples of independent and non-independent
do-loops are given in Foster [19], pp. 258-9.
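- An illustrative independent loop (the arrays are
hypothetical): each iteration touches a distinct element, so
the assertion is safe.
- !HPF$ INDEPENDENT
- do i = 1, n
- A(i) = B(i) + C(i)
- enddo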
Additional HPF Comments
- An HPF program typically consists of a sequence of
calls to subroutines and functions. - The data distribution that is best for a
subroutine may be different than the data
distribution used in the calling program. - Two possible strategies for handling this
situation are - Specify a local distribution using DISTRIBUTE and
ALIGN, even if this requires expensive data
movement on entry - The cost normally occurs on return as well.
- Use whatever data distribution is used in the
calling program, even if not optimal. This
requires use of the INHERIT directive. - Both F90 and HPF intrinsic functions (e.g., SUM,
MAXVAL) combine data from entire arrays and
involve considerable communication. - Some other F90/HPF intrinsic functions such as
DOT_PRODUCT involve communication cost only if
their arguments are not aligned. - Array operations involving the FORALL statement
can result in communication if the computation of
a value for an element A(i) requires data values
that are not on the same processor (e.g., B(j)).