Title: DATA PARALLEL LANGUAGES Chapter 4b
- MultiC
- Fortran 90 and HPF
The MultiC Language
- References
- The multiC Programming Language, Preliminary Documentation, WaveTracer, PUB-00001-001-00.80, Jan. 1991.
- The multiC Programming Language, User Documentation, WaveTracer, PUB-00001-001-1.02, June 1992.
- Note: This presentation is based on the 1991 manual unless otherwise noted. (The term "manuals" refers to both versions.)
- MultiC is the language used on the WaveTracer and the Zephyr SIMD computers.
- The Zephyr is a second-generation WaveTracer, but it was never commercially available.
- We were given 10 Zephyrs, plus several incomplete Zephyrs to use for spare parts.
- A version of MultiC was designed for their third-generation computer, but neither the machine nor the language was released.
- Both MultiC and a parallel language designed for the MasPar are fairly similar to an earlier parallel language called C*.
- C* was designed by Guy Steele for the Connection Machine.
- All are data parallel extensions of the C language.
- An assembler was also written for the WaveTracer (and probably the Zephyr).
- It was intended for use only by company technicians.
- Information about the assembler was released to WaveTracer customers on a need-to-know basis.
- No manual was distributed, but some details were recorded in a short report.
- Professor Potter was given the details needed to put the ASC language on the WaveTracer.
- MultiC is an extension of ANSI C, as documented in the following book:
- The C Programming Language, Second Edition, 1988, Kernighan & Ritchie.
- The WaveTracer computer is called a Data Transport Computer (DTC) in the manuals because a large amount of data can be moved in parallel using interprocessor communications.
- Primary expected uses for the WaveTracer were scientific modeling and scientific computation:
- acoustic waves 
- heat flow 
- fluid flow 
- medical imaging 
- molecular modeling 
- neural networks 
- The 3D applications are supported by a 3D mesh on the WaveTracer.
- This is done by sampling a finite set of points (nodes) in space.
WaveTracer Architecture Background
- The architecture of the Zephyr is fairly similar.
- Exceptions will be mentioned whenever known.
- Each board has 4096 bit-serial processors, which can be connected in any of the following ways:
- 16x16x16 cube in 3D space 
- 64x64 square in 2D space 
- 4096-element array in 1D space
- The 3D network is native on the WaveTracer; the other networks are supported in hardware, built primarily on the 3D hardware.
- The Zephyr probably has a 2D network and only simulates the more expensive 3D network using system software.
- The WaveTracer was available with 1, 2, or 4 boards, arranged as follows:
- 2 boards were arranged as a 16x32x16 cube 
- one cube stacked on top of the other
- 8192 processors overall 
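- As a rough illustration of how the same 4096 processors can be viewed in one, two, or three dimensions, the following C sketch maps a 3D coordinate in a 16x16x16 cube to a flat 1D index. The actual hardware ordering is not given in the manuals, so the row-major layout here is an assumption:

    #define NX 16
    #define NY 16
    #define NZ 16

    /* Map (x,y,z), with 0 <= x,y,z < 16, to an index 0..4095. */
    /* The ordering is assumed; the manuals do not specify it. */
    int flat_index(int x, int y, int z)
    {
        return x + NX * (y + NY * z);
    }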
WaveTracer Architecture (Cont.)
- Four boards are arranged as a 32x32x16 cube 
- 16,384 processors 
- Arranged as two columns of stacked cubes 
- The computer supports automatic creation of virtual processors and the network connections needed to connect these virtual processors.
- If each physical processor supports k virtual nodes, this slows down execution speed by a factor of k.
- Each processor performs each operation k times.
- For example, a 32x32x32 virtual grid on a 16x16x16 machine gives k = 32768/4096 = 8.
- The number of virtual processors is limited by the amount of memory required for each virtual node.
- In practice, the slowdown is usually less than k.
- The set of virtual processors supported by a physical processor is called its territory.
Specifiers for MultiC Variables
- Any data type in C except pointers can be declared to be a parallel variable using the declaration specifier multi.
- This replicates the data object for each processor to produce a 1-, 2-, or 3-dimensional data object.
- In a parallel execution, all multi objects must have the same dimension.
- The multi declaration follows the same format as ANSI C, e.g.,
    multi int imag, buffer;
- The uni declaration is used to declare a scalar variable.
- It is the default and need not be shown.
- The following are equivalent:
    uni int *ptr;
    int *ptr;
- Bit-Length Variables
- Can be of type uni or multi.
- Allow the user to save memory.
- All operations can be performed on these bit-length values.
- Example: A 2-color image can be declared by
    multi unsigned int image:1;
  and an 8-color image by
    multi unsigned int picture:3;
Some Control Flow Commands
- For uni type data structures, control flow in MultiC is identical to that in ANSI C.
- The parallel IF-ELSE statement
- As in ASC, both the IF and ELSE portions of the code are executed.
- As with ASC, the IF is a mask-setting operation rather than a branching command.
- FORMAT: Same as for C.
- WARNING: In contrast to ASC, both sets of statements are executed.
- Even if no responders are active in one part, the sequential commands in that part are executed.
- Example: count = count + 1;
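- A minimal sketch of this mask-setting behavior, with hypothetical variable names:

    multi int a;
    uni int count = 0;

    if (a > 0)
        a = a - 1;          /* executed by the PEs where a > 0        */
    else
        count = count + 1;  /* sequential statement: executed even    */
                            /* if no PE satisfies the ELSE condition  */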
- The parallel WHILE statement
- The format used is
    while (condition)
- The repetition continues as long as condition is satisfied by one or more responders.
- Only those responders (i.e., the PEs that satisfy condition prior to a given pass through the body of the while) are active during the execution of the body.
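- A corresponding sketch of the parallel while, again with hypothetical variables:

    multi int n;
    /* repeats as long as at least one PE satisfies n > 0; only the */
    /* PEs with n > 0 are active during each pass through the body  */
    while (n > 0)
        n = n - 1;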
Other Commands
- Jump Statements 
- goto, return, continue, break 
- These commands are in conflict with structured 
 programming and should be used with restraint.
- Parallel Reduction Operators
- *   Accumulative Product
- /   Reciprocal Accumulative Product
- +   Accumulative Sum
- -   Negate, then Accumulative Sum
- &   Accumulative bitwise AND
- |   Accumulative bitwise OR
- >?  Accumulative Maximum
- <?  Accumulative Minimum
- Each of the above reduction operators returns a uni value and provides a powerful arithmetic operation.
- Each accumulative operation would otherwise require one or more ANSI C loop constructs.
- Example: If A is a multi data type,
    largest_value = >? A;
    smallest_value = <? A;
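- For comparison, a sketch of the ANSI C loop that largest_value = >? A; would otherwise require, assuming the components of the multi variable were instead available sequentially as an array A of N elements:

    #define N 4096          /* assumed number of components */
    float A[N];

    int i;
    float largest_value = A[0];
    for (i = 1; i < N; i++)
        if (A[i] > largest_value)
            largest_value = A[i];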
- Data Replication
- Example:
    multi int A = 0;
    ...
    A = 2;
- The first statement stores 0 in every A field (at compile time).
- The last statement stores 2 in the A field of every active PE.
- Interprocessor Communications
- Operators have the form
    [dx][dy][dz]m
- This operator shifts the components of the multi variable m of all active processors along one or more coordinate dimensions.
- Example: A = [-1][2][1]B;
- This causes each active processor to move the data in its B field to the A field of the processor at the following relative location:
- one unit in the negative X direction
- two units in the positive Y direction
- one unit in the positive Z direction
- Coordinate axes (figure omitted)
- Conventions
- If the value of the dz operand is not specified, it is assumed to be 0.
- If the values of the dy and dz operands are not specified, both are assumed to be 0.
- Example: [x][y]V is the same as [x][y][0]V.
- Inactive processor actions 
- Does not send its data to another processor 
- Participates in moving the data from other 
 processors along.
- Transmission of data occurs in lock step (SIMD 
 fashion) without congestion or buffering.
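- As an illustrative use of these grid-communication operators (not an example from the manuals), one relaxation step of a 2D heat-flow model might be sketched as follows, averaging each PE's value with those of its four grid neighbors:

    multi float T, Tnew;    /* hypothetical temperature fields */
    Tnew = ([1][0][0]T + [-1][0][0]T
          + [0][1][0]T + [0][-1][0]T) / 4.0;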
- Coordinate Functions
- Used to return a coordinate for each active virtual processor.
- Format: multi_x(), multi_y(), and multi_z()
- Example:
    if (multi_x() == 0 && multi_y() == 2 && multi_z() == 1)
        u = + A;
- Note that all processors except the one at (0,2,1) are inactive within the body of the IF.
- The accumulated sum of the active components of the multi variable A is therefore just the value of the component of A at processor (0,2,1).
- The effect of this example is to store the value of A at (0,2,1) in the uni variable u.
- If the second command in the example is changed to
    A = u;
  the effect is to store the contents of the uni variable u into the multi variable A at location (0,2,1).
- (See manual pages 11-13 and 11-14 for more details.)
- Arrays
- Multi pointers are not supported.
- You cannot have a parallel variable containing a pointer in each component of the array.
- uni pointers to multi variables are allowed.
- Array Examples:
    int array_1[10];
    int array_2[5][5];
    multi int array_3[5];
- array_1 is a 1-dimensional standard C array.
- array_2 is a 2-dimensional standard C array.
- array_3 is a 1-dimensional array of multi variables.
- MULTI_PERFORM Command
- The command gives the size of each dimension of all multi values prior to calling for a parallel execution.
- Format:
    multi_perform(func, xsize, ysize, zsize)
- multi_perform is normally called within the main program.
- It usually calls a subroutine that includes all of the
- parallel work
- parallel I/O
- The main program usually includes
- opening and closing of files
- some of the scalar I/O
- #define and #include statements
- When multi_perform is called, it initializes any extern and static multi objects.
- In the previous example, multi_perform calls func. After func returns, the multi space created for it becomes undefined.
- The perror function is extended to print error messages corresponding to errno values resulting from the execution of MultiC extensions.
- It has the following format:
    if (multi_perform(func, x, y, z)) perror(argv[0]);
- See usage in the examples in Appendix A.
- More information is on page 11-2 of the manual.
- Examples in Manual 
- Many examples in the manual 
- 17 in appendices alone 
- Also stored under exname.mc in the MultiC package 
The AnyResponder Function
- Code Segment for Tallying Responders:

    unsigned int short_ct, tall;   /* 'short' is a C keyword, so the    */
                                   /* first counter is renamed short_ct */
    multi float height;
    load_height();   /* assigns value in inches to height */
    if (height > 70)
        tall = + (multi int)1;
    else
        short_ct = + (multi int)1;
    printf("There are %d tall people\n", tall);
- Comments on Code Segment
- Note that the construct
    + (multi int)1
  counts the active PEs (i.e., responders).
- This technique avoids setting up a bit field to use to tally active PEs.
- Instead, it sets up a temporary multi variable.
- It can be used to check whether there is at least one responder at any given time:
- check whether the resulting sum is positive.
- This provides a technique to define the AnyResponder function needed for associative programming.
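- A minimal sketch of such an AnyResponder function, built from the + (accumulative sum) reduction; the name any_responder is ours, not the manuals':

    /* Returns a nonzero uni value if at least one PE is active. */
    uni int any_responder(void)
    {
        return (+(multi int)1) > 0;
    }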
Accessing Components from Multi Variables
- Code from page 11-13 or 11-14 of the MultiC manuals:

    #include <multi.h>   /* includes multi library */
    #include <stdlib.h>
    #include <stdio.h>

    void work(void)
    {
        uni int a, b, c, u;
        multi int n;
        /* Code goes here to assign values to n */
        /* Code goes here to assign values to a, b, c */
        if (multi_x() == a && multi_y() == b
            && multi_z() == c)
            u = + n;   /* Assigns value of n at PE (a,b,c) */
    }

    int main(int argc, char *argv[])
    {
        if (multi_perform(work, 7, 7, 7))
            perror(argv[0]);
        exit(EXIT_SUCCESS);
    }
The oneof and next Functions
- The function oneof provides a way of selecting one out of several active processors.
- It is defined in the Multi Struct program (A.15) in the manual.
- The procedure is essential for associative programming.
- Code for oneof:

    multi unsigned oneof(void)
    {
        /* Store the coordinate values in multi variables x and y */
        multi unsigned x = multi_x(),
                       y = multi_y(),
                       uno = 0;
        /* Next select the processor with the highest coordinate values */
        if (x == >? x)
            if (y == >? y)
                uno = 1;
        return uno;
    }
- Note that the multi variable uno stores a 1 for exactly one processor; all other components of uno store a 0.
- The function oneof can be used by another procedure that is called by multi_perform.
- An example of oneof being called by another procedure is given on pages A46-50 of the manuals.
- It should be usable in the form
    if (oneof())   /* Check to see if an active responder exists */
- The preceding procedure assumed a 2D configuration of processors with z = 1.
- If the configuration is 3D, the selection process can be continued by also selecting the highest z-coordinate.
- Stepping through the active PEs (i.e., next)
- This provides the MultiC equivalent of the ASC next command.
- An additional one-bit multi integer variable called bi (for busy-idle) is needed.
- First set bi to zero.
- Activate the PEs you wish to step through.
- Next, have the active PEs write a 1 into bi.
- Use
    if (oneof())
  to restrict the mask to one of the active PEs.
- Perform all desired operations with the active PE.
- Have the active PE set its bi value to 0, and then exit the preceding if statement.
- Use the + (accumulative sum) operator to see if any PEs remain to be processed.
- If so, return to the step above that calls oneof.
- This return can be implemented using a while loop, as in the sketch below.
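- Assembled into code, the steps above might look like the following sketch; the selection condition, the loop body, and the variable names are hypothetical, not from the manuals:

    multi int value;          /* hypothetical data to process       */
    multi int bi = 0;         /* one-bit busy-idle flag             */
    if (value > 0)            /* activate the PEs to step through   */
        bi = 1;
    while (+ bi)              /* any PEs left to process?           */
    {
        if (bi)               /* restrict mask to unprocessed PEs   */
            if (oneof())      /* narrow the mask to exactly one PE  */
            {
                /* ... perform the desired operations here ... */
                bi = 0;       /* mark this PE as done               */
            }
    }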
Sequential Printing of Multi Variable Values
- Example: Print a block of the image 2D bit array.
- A function select_int is used, which returns the value of image at a specified (x,y,z) coordinate.
- The printing occurs in two loops, which
- increment the value of x from 0 to some specified constant
- increment the value of y from 0 to some specified constant
- This example is from page 8-1 of the manuals and is used in an example on pages A16-18 of the 1991 manual and pages A12-14 of the 1992 manual.
- The select_int function:

    int select_int(multi int *mptr, int x, int y, int z)
    /* Here, mptr is a uni pointer to type multi */
    {
        int r;
        if (multi_x() == x &&
            multi_y() == y &&
            multi_z() == z)
            /* Restricts scope to the one PE at (x,y,z) */
            r = | *mptr;
        /* The OR reduction operator transfers the binary value of */
        /* the multi variable at (x,y,z) to the uni variable r     */
        return r;
    }
- The two loops that print a block of values of the image multi variable:

    for (y = 0; y < ysize; y++)
    {
        for (x = 0; x < xsize; x++)
            printf("%d", select_int(&image, x, y, z));
        printf("\n");
    }
- The above technique can be adapted to print or read multi variables, or parts of multi variables.
- It is efficient as long as the number of locations accessed is small.
- If I/O operations involve a large amount of data, the more efficient data transfer functions described in the manuals (Chapter 8 and Sections 11.2 and 11.13) should be used.
- The functions multi_fread and multi_fwrite are analogous to fread and fwrite in C. Information about them is given on pages 11-1 to 11-4 of the manuals.
Moving Data between Uni Arrays and Multi Variables
- The following functions allow the user to move data between uni arrays and multi variables:
- multi_from_uni ...
- multi_to_uni ...
- The ... above may be replaced with a data type such as:
- char 
- short 
- int 
- long 
- float 
- double 
- cfloat 
- cdouble 
- These functions are illustrated in several of the 
 examples.
Compiling and Executing Programs on the Zephyr
- A 4K Zephyr machine is available for use in the Parallel and Associative Computing Lab.
- It is presently connected to a Windows 2003 Server, which supports remote desktop for interactive use. However, you may use the computer directly at the console while the lab is open.
- Visual Studio 2002 has been installed on the server. The MultiC language uses a compiler wrapper to translate MultiC code into Visual C code.
- Programming the Zephyr on a Windows 2003 system is similar to using command-line programming tools in UNIX.
- You can edit your program using Edit or Notepad.
- You can compile and create an executable using nmake.
- You can execute your program from the Visual Studio Command Shell.
- This is a special DOS shell that has the extra path, include, and library environment variables used by the compiler and linker.
Compiling and Executing Programs on the Zephyr (Cont.)
- Log in or use Remote Desktop Connection to zserver.cs.kent.edu.
- From Windows XP choose Start > Programs > Accessories > Communications > Remote Desktop Connection.
- Enter your login name and password and click OK.
- Open a command window and run the DTC Monitor program:
- Type dtcmonitor at the command prompt.
- This is a daemon program that serializes and controls executables using the Zephyr.
- When this is 100% complete, you can then execute programs on the Zephyr.
- You can minimize this command shell.
- Important: When you are finished, enter CTRL-C to end the dtcmonitor.
- Create a folder on your desktop for programs. You can copy the example Zephyr MultiC program from D:\Common\zephyrtest to your local folder and rename it for your programming assignment.
Compiling and Executing Programs on the Zephyr (Cont.)
- Create or edit your MultiC program using DOS edit or Windows Notepad.
- From the Visual Studio Command Shell, type
    edit anyprog.mc
  or
    notepad anyprog.mc
- Make sure that the file extension is .mc.
- Save your work before compiling.
- Modify the makefile template and change the names of the MultiC file and object file to those used in your programming assignment.
- Compile and link your program by typing
    nmake /f anyprog.mak
  or
    nmake    (for the default Makefile)
- Execute your program by typing the name of your executable at the command prompt.
- When you are finished, enter CTRL-C to end the dtcmonitor.
OMIT FOR PRESENT (MultiC Recursion)
- It is possible to write recursive multi functions in MultiC, but you have to test whether there are active PEs still working.
- Consider the following MultiC function:

    multi int factorial(multi int n)
    {
        multi int r;
        if (n != 1)
            r = factorial(n - 1) * n;
        else
            r = 1;
        return r;
    }

- What happens?
OMIT FOR PRESENT (MultiC Recursion Example)
- Recursion:

    multi int factorial(multi int n)
    {
        multi int r;
        /* stop calculating if every component has been computed */
        if (! + (multi int)1)
            return (multi int)0;
        /* otherwise, continue calculating */
        if (n > 1)
            r = factorial(n - 1) * n;
        else
            r = 1;
        return r;
    }
Fortran 90 and HPF (High Performance Fortran)
- A de facto standard for scientific and 
 engineering computations
Fortran 90 and HPF
- References
- [19] Ian Foster, Designing and Building Parallel Programs (online copy), Chapter 7.
- [8] Jordan and Alaghband, Fundamentals of Parallel Processing, Section 3.6.
- Recall that data parallelism refers to the concurrency that occurs when the same operation is executed on some or all elements of a data set.
- A data parallel program is a sequence of such operations.
- Fortran 90 (or F90) is a data-parallel programming language.
- Some job control algorithms cannot be expressed in a data parallel language.
- F90's array assignment statement and array functions can be used to specify certain types of data parallel computation.
- F90 forms the basis of HPF (High Performance Fortran), which augments F90 with a small set of extensions.
- In F90 and HPF, the (parallel) data structures operated on are restricted to arrays.
- E.g., data types such as trees, sets, etc. are not supported.
- All array elements must be of the same type.
- Fortran arrays can have up to 7 dimensions.
- Parallelism in F90 can be expressed explicitly, as in the array assignment statement
    A = B*C    ! A, B, C are arrays
- Compilers may also be able to detect implicit parallelism, as in the following example:
    do i = 1, m
      do j = 1, n
        A(i,j) = B(i,j)*C(i,j)
      enddo
    enddo
- Parallel execution of the above code depends on the fact that the loop iterations are independent
- i.e., no iteration writes a variable that another iteration reads or writes.
- Compilation can also introduce communication operations when the computation mapped to one PE requires data mapped to another PE.
- Communication operations in F90 (and HPF) are inferred by the compiler and do not need to be specified by the programmer.
- They are derived by the compiler from the data decomposition specified by the programmer.
- F90 allows a variety of scalar operations (i.e., operations defined on a single value) to be applied to an entire array.
- All of F90's unary and binary operations can be applied to arrays as well, as illustrated in the examples below:
    real A(10,20), B(10,20), c
    logical L(10,20)
    A = B + c
    A = A + 1.0
    A = sqrt(A)
    L = A .EQ. B
- The function of the mask is handled in F90 by the where statement, which has two forms.
- The first form uses the where to restrict the array elements on which an assignment is performed.
- For example, the following replaces each nonzero entry of X with its reciprocal:
    where (X /= 0) X = 1.0/X
- The second form of the where is block structured and has the form
    where (mask-expression)
      array_assignment
    elsewhere
      array_assignment
    end where
Some F90 Array Intrinsic Functions
- The array intrinsic functions below assume a vector version of an array is formed using column-major ordering.
- Some F90 array intrinsic functions:
- RESHAPE(A, ...) converts array A into a new array with a specified shape and fill.
- PACK(A, MASK, FILL) forms a vector from the masked elements of A, using FILL as needed.
- UNPACK(A, MASK, FILL) replaces masked elements with elements from the FILL vector.
- MERGE(A, B, MASK) returns an array of the masked entries of A and the unmasked entries of B.
- SPREAD(A, DIM, N) replicates array A, using N copies to form a new array of one larger dimension.
- CSHIFT(A, SHIFT, DIM) circularly (end-around) shifts the elements of A along dimension DIM.
- EOSHIFT(A, ...) shifts elements of A off the end along a specified dimension; vacated positions are filled from either a specified scalar or an array of dimension one less than A.
- TRANSPOSE(A) returns the transpose of array A.
- Some array intrinsic functions that perform computation:
- MAXVAL(A) returns the maximum value in A
- MINVAL(A) returns the minimum value in A
- SUM(A) returns the sum of the elements of A
- PRODUCT(A) returns the product of the elements of A
- MAXLOC(A) returns the indices of the maximum value in A
- MINLOC(A) returns the indices of the minimum value in A
- MATMUL(A, B) returns the matrix product AB
The HPF Data Distribution Extension
- Reference: [19] Ian Foster, Designing and Building Parallel Programs (online copy), Chapter 7.
- F90 array expressions specify opportunities for parallel execution but give no control over how these operations are performed so that communication is minimized.
- HPF's handling of data distribution involves three directives:
- The PROCESSORS directive specifies the shape and size of an array of abstract processors.
- The ALIGN directive is used to align elements of different arrays with each other, indicating that they should be distributed in the same manner.
- The DISTRIBUTE directive is used to distribute an object (and all objects aligned with it) onto an abstract processor array.
- The data distribution directives can have a major impact on a program's performance (but not on the results computed), affecting:
- partitioning of data to processors
- agglomeration: considering the value of combining tasks to produce fewer, larger tasks
- communications required to coordinate task execution
- mapping of tasks to processors
HPF Data Distribution (Cont.)
- Data distribution directives are recommendations to an HPF compiler, not instructions.
- The compiler can ignore them if it determines that doing so will improve performance.
- PROCESSORS directive
- Creates an arrangement of abstract processors and gives this arrangement a name.
- Example: !HPF$ PROCESSORS P(4,8)
- Normally one abstract processor is created for each physical processor.
- There can be more abstract processors than physical ones.
- However, HPF does not specify a way of mapping abstract processors to physical processors.
- ALIGN directive
- Specifies array elements that should, if possible, be mapped to the same processor.
- Operations involving aligned data objects are likely to be more efficient, due to reduced communication costs, when they reside on the same PE.
- EXAMPLE:
    real B(50), C(50)
    !HPF$ ALIGN C(:) WITH B(:)
HPF Data Distribution (Cont.)
- ALIGN directive (cont.)
- A * can be used to collapse dimensions (i.e., to match one element with many elements).
- Considerable flexibility is allowed in specifying which array elements are to be aligned:
- dummy variables can be used for dimensions
- integer formulas can specify offsets
- An ALIGN statement can be used to specify that the elements of an array should be replicated over certain processors.
- This is costly if replicated arrays are updated often.
- It increases communication or redundant computation.
- DISTRIBUTE directive
- Indicates how data are to be distributed among processor memories.
- Specifies, for each dimension of an array, one of three ways that the array elements will be distributed among the processors:
- *          No distribution
- BLOCK(n)   Block distribution (default n = N/P)
- CYCLIC(n)  Cyclic distribution (default n = 1)
HPF Data Distribution (Cont.)
- DISTRIBUTE directive (cont.)
- A block distribution divides the indices in that dimension into equal-sized blocks of size N/P (see the sketch at the end of this slide).
- A cyclic distribution maps every Pth index to the same processor.
- The directive applies not only to the named array but also to any array that is aligned with it.
- The following directives specify a mapping for all three arrays:
    !HPF$ PROCESSORS p(20)
          real A(100,100), B(100,100), C(100,100)
    !HPF$ ALIGN B(:,:) WITH A(:,:)
    !HPF$ ALIGN C(:,:) WITH A(:,:)
    !HPF$ DISTRIBUTE A(BLOCK,*) ONTO p
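- As a language-neutral sketch of the two mappings, the following C fragment computes which of P processors owns index i of an N-element dimension under each distribution (0-based indices; the function names are hypothetical):

    /* Owner of index i (0 <= i < N) among P processors. */
    int block_owner(int i, int N, int P)
    {
        int n = (N + P - 1) / P;    /* default block size, about N/P */
        return i / n;
    }

    int cyclic_owner(int i, int P)
    {
        return i % P;               /* every Pth index, same processor */
    }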
HPF Concurrency
- The F90 array assignment statements provide a convenient way of specifying data parallel operations.
- However, this does not apply to all data parallel operations, as the array on the right-hand side must have the same shape as the one on the left-hand side.
- HPF provides two other constructs to exploit data parallelism, namely the FORALL statement and the INDEPENDENT directive.
- The FORALL statement
- Allows more general assignments to sections of an array.
- The general form is
    FORALL (triplet, ..., triplet, mask) assignment
- Examples:
    FORALL (i=1:m, j=1:n)        X(i,j) = i*j
    FORALL (i=1:n, j=1:n, i<j)   Y(i,j) = 0.0
- The INDEPENDENT directive and do-loops
- The INDEPENDENT directive can be used to assert that the iterations of a do-loop can be performed independently, that is:
- they can be performed in any order
- they can be performed concurrently
- The INDEPENDENT directive must immediately precede the do-loop it applies to.
- Examples of independent and non-independent do-loops are given in [19], Foster, pages 258-9.
Additional HPF Comments
- An HPF program typically consists of a sequence of calls to subroutines and functions.
- The data distribution that is best for a subroutine may be different from the data distribution used in the calling program.
- Two possible strategies for handling this situation are:
- Specify a local distribution using DISTRIBUTE and ALIGN, even if this requires expensive data movement on entry.
- This cost normally occurs on return as well.
- Use whatever data distribution is used in the calling program, even if it is not optimal. This requires use of the INHERIT directive.
- Both F90 and HPF intrinsic functions (e.g., SUM, MAXVAL) combine data from entire arrays and can involve considerable communication.
- Some other F90/HPF intrinsic functions, such as DOT_PRODUCT, involve communication costs only if their arguments are not aligned.
- Array operations involving the FORALL statement can result in communication if the computation of a value for an element A(i) requires data values that are not on the same processor (e.g., B(j)).