aCe C

Language Reference

 

John E. Dorband

NASA Goddard Space Flight Center

Greenbelt, MD 20771

 

 

Introduction

This document is an introduction to the language aCe C.  aCe stands for architecture-adaptive computing environment.  aCe C is a superset of ANS C.  In this document, the term aCe will be used synonymously for aCe C. This description of aCe C assumes that the reader is knowledgeable of ANS C.  The purpose of aCe is to facilitate the writing of parallel programs.  This is done by allowing the programmer to explicitly describe the parallelism of an algorithm. The concepts behind aCe C have been gleaned from such parallel languages as APL3, DAP Fortran7, Parallel Pascal4, Parallel Forth1, C*5, CM Lisp6, FGPC2, and MPL8.

Typically, a C program has one thread of execution.  This is the path that a computer takes through a program while executing it.  More sophisticated compilers and run time environments may be able to infer from the code which portions of the execution thread may be performed concurrently without conflict.  However, it is very difficult to perform this task automatically.  aCe allows the programmer to explicitly express that which can be performed concurrently, i.e. the parallelism, thus eliminating the need for a compiler to second-guess the intents of the programmer.

Basics

The following is a valid aCe program.  Note that it looks no different from a standard C program: 

         int main ( int argc, char **argv ) {

                printf(ÒHello World\nÓ);

         }

The difference is that C has only one thread of execution.  aCe, however, may have many threads of execution.  Each thread may be referenced by name and index.  The 'Hello World' program's thread is implicitly named 'MAIN'.  This is important when it is necessary to communicate between the 'MAIN' thread and other threads executing concurrently with the 'MAIN'. 

Parallelism in aCe is expressed by first defining a set of concurrently executable threads.  A group of parallel threads can be viewed as a bundle of executing threads, a cluster of processes, or an array of processors.  These three views will be treated synonymously.  In aCe, a bundle of threads is defined with the 'threads' statement.  The following statement only declares the intent of the programmer to use 10 concurrent threads of executions named 'A' at some point in his code:

         threads A[10];

These threads must be assigned storage before they can execute any code.  Each thread will have its own private storage. Variables of a thread can not be accessed directly by any other thread.  In aCe, there is no global storage, only storage local to each thread.   Storage is declared for a thread by a standard C declaration preceded by the thread's name.  All threads of a bundle will be allocated space for any given declaration. The following statement allocates an integer 'aval' for each of the 10 threads previously defined as bundle 'A': 

         A int aval;

Once storage has been assigned to a thread, then code may be written that will be executed by the thread.  The following is a simple piece of code that adds the ten values of val2 to the ten values of val3 and stores the ten results in the ten locations of val1: 

         threads A[10];

         A int val1,val2,val3;

         int main () {

                A.{

                       val1 = val2 + val3 ;

                }

         }

Granted, the values of val2 or val3 were never initialized, but that is a different issue.  The important point here is that execution started with the function 'main' by the default thread 'MAIN', which transferred control to (forks) the 10 threads of 'A' to add the values of val2 to the values val3 before returning control to 'MAIN'.  For parallelism to be useful, the storage of each thread must contain different values.  This can be done by different means: 1) read different values into the storage of each thread, 2) copy a built-in value that is unique to each thread into the storage of the thread, or 3) obtain a unique value from another thread.  In the first, case the function 'fread' may be used to read values into the storage of a thread.

         A.{ fread(&val2,sizeof(val2),1,file); }

The above fread in the context of the threads of A will read 10 values from the input file and put them in the 10 locations of val2 of the 10 threads of A.  The second way of putting a unique value into each of the 10 locations of val2 would be to assign a built in value to val2.  

         A.{  val2 = $$i ;  }

The above statement will assign the value of the built-in value '$$i' to val2.  The value of '$$i' is the index of the thread.  In the case of A, each thread will have a unique index from 0 to 9. 

Previously, it was pointed out that a thread only has direct access to storage local to itself.  A thread, however, can access storage of another thread indirectly through communication operations.  There are two basic communication operations.  Under other parallel programming paradigms they are referred to as 'get' and 'put' operations.  The following is an example of an aCe 'get' operation:

         A.{ int a,b; 

                a = A[($$i+1)%$$N].b ;

         }

In this example, the value of 'b' is fetch from one thread of A to another thread of A. Remember that the value of $$i is the index of the executing thread and that $$N is the number of threads in the bundle A.  Thus the value of (($$i+1))%$$N) is the index of a thread other than the thread performing the 'get' operation.  The communication operation uses this value to determine which thread to fetch the value of 'b' from.  The following statement is an example of an aCe 'put' operation:

         A.{ int a,b;   

                A[($$i+1)%$$N].b = a ;

         }

In this example, the value of 'a' of the current thread is stored into 'b' of a different thread of A.

In summary, aCe allows the programmer to declare a bundle of parallel threads of execution, allocate the storage of each thread, define code to execute on threads concurrently, and move values between threads.  These are the four essential concepts of aCe: execution, storage, code, and communication.

Bundles of threads

In the previous section, the bundle A was declared as a one-dimensional array of threads.  A bundle may actually be declared with any number of dimensions.  The following statement declares the bundle B to have three dimensions of sizes 2, 7, and 20 respectively and contain 280 threads: 

         threads B[2][7][20] ;

One may also declare bundles of bundles of threads.  In the statement below, 'e' is a bundle of bundles of threads.  'e' consists of 100 bundles, each containing 2 bundles, one with 10 threads and the other with 20 threads.  One should view each bundle of 'e' as 1 e-thread, 10 c-threads, and 20 d-threads, where the e-thread is the parent thread of the 10 c-threads and 20 d-threads.  Thus, there are a total 3,100 threads defined by the following statement:

         threads { c[10],  d[20]  } e[100] ;

Because the definition of a bundle is recursive, a bundle can contain any number of sub-bundles.  Note, all bundles are 'descendents' of the bundle 'MAIN'.  The primary bundle of a bundle declaration, such as 'e', is an immediate child bundle of 'MAIN'.  The following example is a more complex definition meant only to demonstrate the recursive nature of a bundle declaration: 

         threads { s[11], { { u[34],  v[3]  } w[102]  } t[7] [7]  } z[100] ;

Execution

All operations that can be performed by the thread, 'MAIN' (i.e., any C code), may be performed by any bundle of threads concurrently.  The only operation supported by ANS C, but not by aCe C is 'goto'.

All code that is not labeled with the name of a bundle, by default, will be executed by the lone thread 'MAIN'.  The program entry point routine 'main' is run by the thread 'MAIN'.  To start code running on a bundle of threads other than 'MAIN', a compound statement must be labeled with the name of that bundle.  A compound statement is code enclosed in braces, {}.  A compound statement can contain any valid aCe C code.  The following code is a compound statement labeled with the bundle name B:

              B.{ a=b; }

This code copies the value at location 'b' to the location 'a', assuming 'a' and 'b' are declared storage locations associated with the bundle B.  However, if a conditional statement is executed from within the following labeled compound statement:

         B.{  if(z) { a=b; } }

Some threads of 'B' will copy b to a and some will not, depending on whether 'z' was true or not, respectively.  During the copy operation all threads for which 'z' is true are said to be 'active', and all threads for which 'z' is false are said to be 'inactive'. Within the context of a segment of code for bundle 'B' all active threads of 'B' will execute the code, while all inactive threads will remain idle.  Initially, all threads of all bundles are active, and each bundle will only execute code explicitly designated for that bundle.  Conditional statements are used to make some threads of a bundle inactive for a portion of code. 

A labeled compound statement creates an execution context in which a bundle of threads may execute code.  These execution contexts may be nested.  In the following example, all threads of B will copy 'b' to 'a', then all threads of A will copy 'c' to 'd', and finally all threads of 'B' will copy 'y' to 'x':

         B.{ a=b;  A.{ d=c; }  x=y; }

This would be equivalent to the following statement:

         B.{ a=b; }  A.{ d=c; }  B.{ x=y; }

Then why would it be necessary for contexts to be nestable?  It is only necessary for contexts to be nestable if the execution of one context can implicitly affect the execution of the other.  This can happen if the contained context is contained within a condition statement of the containing context, as in the following statement:

         B.{  if (a) {  A.{ x=y; }  }  }

In this example, if 'a' is true for any thread of 'B', then 'y' is copied to 'x' for all threads of 'A'. Otherwise, if 'a' is false for all threads of 'B', then 'y' will not be copied to 'x' for any thread of 'A'.  Putting it another way, if any thread of B is active within the context of A, then all threads of A will be active.  The next example is a little more complicated.

         B.{  if (a) {  A.{ x=y; }  } else { C.{ s=t; } }  }

As with the previous example, 'x' is copied to 'y' only if 'a' is true for at least one thread of B.  And 't' is copied to 's' if 'a' is false for at least one thread of B.  An interesting side effect of this statement is that 'x' will be copied to 'y' for all threads of A and 't' will be copied to 's' for all threads of C as long as 'a' is true for some threads and false for others. 

Routines in aCe must be declared as to which bundle may call them.  This is done by preceding the routine's declaration with the name of the bundle. 

                       B int aroutine( int c, int b) { ... code ... }

The preceding routine, aroutine, may called within the context of any code executed by 'B'. 

Communication

A conditional expression can also affect communications.  Only active threads can initiate a get or put operation.  In the following statement, only threads of B for which 'a' is true actually fetch a value from the storage location 'x' of a thread of A: 

         B.{  if (a) {  b=A[idx].x;  }  }

Again, only active threads of 'B' will fetch (or get) values from 'A'.  And in the following statement, only active threads of 'B' will store (or put) values into threads of A.

         B.{  if (a) {  A[idx].y = b;  }  }

Threads of 'A' need not be active to be fetched from or have their storage modified.  The following is an example of this:

         A.{ if (z){  B.{  if (a) {  A[idx].y = b;  }  }   } }

Some threads of 'A' are made inactive; yet, it would execute no differently than the previous example. 

More on Concurrent Conditional Execution

As has already been pointed out, one can view conditional execution as the deactivation of threads during conditional code that does not apply, rather than as the action of skipping the execution of code that does not apply.  Therefore, conditional execution on a bundle of threads deactivates all threads for which the condition is false.  The inactive threads remain inactive until either the conditional structure is exited or the threads are reactivated, as in the case of switch structures or loop-continue combinations.  A conditional statement only applies to the threads that were active when the conditional statement was entered. 

The if statement may deactivate some of the active threads until the corresponding else is reached; then the original threads are restored and those that had been active are deactivated.  A loop structure (for, while, and do-while) deactivates more and more of the active threads as the threads fail the loop condition.  A break statement deactivates all active threads until the corresponding loop or switch structure has been exited.  A continue statement deactivates all active threads only until the current iteration of loop has been completed.  A return statement deactivates all active threads until all threads that entered the routine return or complete the routine. 

A continue and break statement in C applies specifically to the control structure that includes it.  This applies also to aCe C; in addition, however,  the control structure must be executing within the same bundle context as the continue or break statement.  Similarly, return statements must belong to the same bundle context as the calling routine. 

Previously it was pointed out that a condition statement executed in one bundle could implicitly affect the threads of another independent bundle.  This can happen between threads with common ancestry as well.  A parent thread controls its children, and children threads affect their parent.  If a thread becomes inactive, all of its children become inactive.  And if all the children threads of a parent thread become inactive, the parent thread also becomes inactive. 

 

Thread Identification

Threads within a bundle are identified by system-defined constants.  There are two categories under which a thread can be identified: 1) globally among all threads of a bundle and 2) within a dimension of a sub-bundle.  There are three variables for each of these categories: 1) the number of threads within the category, 2) the log of the number of threads within the category (if the number is a power of 2), and 3) the sequential identifier of the thread within the category.  The following are the definitions of the system constants:

 

$$Name    - the name of the bundle of threads (char*). 

$$D           - the number of dimensions of the bundle. 

$$N           - the total number of threads in the bundle (over all sub-bundles).  

$$L           - the log base 2 of the total number of threads in the bundle (equals -1 if $$N is not a power of 2). 

$$i           - the thread identifier of the thread with respect to all the threads of the bundle. 

$$ii         - the physical thread identifier (where the thread is executing within an architecture). 

$$Nx[d]  - the number of threads of d-th dimension of the bundle/sub-bundle. 

$$Lx[d]  - the log base 2 of the number of threads of d-th dimension of the sub-bundle (equals -1 if $$Nx[d] is not a power of 2). 

$$ix[d]  - the thread identifier of the thread in d-th dimension of the bundle/sub-bundle. 

 

The following example uses the bundle description:

         threads { B[2][2] } A[2];

A has 2 threads.  In the diagram below, there is a pair of numbers under each A.  The first number is its value for $$i, and the second is its value for $$ix[0].  There is a triplet under each thread of B.  The first number is its value of $$i, the second is its value of $$ix[1], and the third is its value of $$ix[0].  Note that the threads of 'B' form two sub-bundles, one under A[0] and another under A[1].  A sub-bundle consists of all the threads of a bundle that have the same parent thread (contained in the parent bundle.)  For example, 'A' is the parent bundle of 'B' and B[4], B[5], B[6], and B[7] make up a sub-bundle of 'B' whose parent thread is A[1] of bundle 'A' .

The value of the other system constants of A are:

               $$NAME = "A",  $$D = 1,  $$N=2,  $$L=1,

               $$Nx[0]=2, $$Nx[1]=<undefined>,

               $$Lx[0]=1, and $$Lx[1]=<undefined>.

The values of the other system constants of B are:

               $$NAME = "B",  $$D = 2,  $$N=8,  $$L=3,

               $$Nx[0]=2, $$Nx[1]=2,

               $$Lx[0]=1, and $$Lx[1]=1.

 

I/O

I/O is defined as an order sequence of data that is to be placed in a file.  The data can be placed in the file concurrently, but the order within the file will be properly ordered sequentially.  I/O in aCe is in order of thread identifier, as in the following example:

threads CLX3[1000]

CLX3.{ int A; A=$$i; fwrite(&A,sizeof(int),1,FILEptr); }

The thousand values A will be written to the file whose descriptor is located at FILEptr.  If the above code is contained within a conditional statement, only a subset of the values of A, corresponding to the active threads, will be written to the file.

threads CLX3[1000]

CLX3.{ int A; A=$$i;

         if ($$i<4 || $$i>995 )

                fwrite(&A,sizeof(int),1,FILEptr);

}

The above code segment will write eight values into the file, four from the first four threads of CLX3 and four from the last four threads of CLX3 in that order. 

aCe has another form of I/O, it is called fast I/O.  The fast I/O routines ffopen, ffread, ffwrite, and ffclose correspond to the standard I/O routines fopen, fread, fwrite, and fclose, except that their use is very machine dependent.  Files written by fast I/O must be read by fast I/O.  Also, files written with fast I/O must also be read by a bundle with exactly the same geometry as the bundle that wrote it.  The reason for this is that fast I/O is intended to be implemented with the fastest form of I/O available on the architecture, which may differ from architecture to architecture. 

Communications Path Descriptions

Communication between two threads is defined by a communication expression.  A communication expression consists of two parts: the communication path description or router expression, and the remotely evaluated numeric expression.  In the case of a fetch, the remotely evaluated numeric expression must evaluate to a value, while in the case of a send operation, it must evaluate a remote address.  The router expression is used to define many concurrent communication paths from threads of one bundle to the threads of another, or even to the same bundle. 

In the following example,  "B[0]." is the router expression, (s+1) is the remote expression that is executed on the threads of B, and t is the variable in each of the threads of type A to which the values received from the threads of type B are stored. 

threads A[1]; threads B[1];

A.{  int t;  B int s;   t = B[0].(s+1);  }

The threads of B need not be currently active to evaluate the expression (s+1). However, a thread of B does need to be a thread that will be fetched from for the expression to be evaluated.  Though a thread may have mainly threads fetching from it, the expression will only be evaluated once.  In effect, this technique can be used to temporarily reactivate threads. 

The router expression describes the path between the remote execution context and the local execution context.  If the router expression is the source of a value (a get or fetch operation), the remote context computes a value, and that value is fetched by the local context from the remote thread that is described by the router expression.  The previous example demonstrates communications between two single-thread bundles.  When multi-thread bundles communicate, each thread of the local context computes an address of a remote thread.  This address is based on the router expression.  If we modify the above example, we see below that each thread of the local bundle A fetches from the corresponding thread of the remote bundle B.  Note that the system constants $$i, which belong to the threads of A, is be used to cause the threads of A to fetch from the corresponding threads of B.  This can be viewed as explicitly computing relative thread addresses. 

threads A[16]; threads B[16];

A.{  int a;  B int b;   a = B[$$i].(b+1);  }

Values may be fetched from existing, and yet not necessarily active, threads.  If an attempt is made to fetch from a non-existent thread, the result is undefined, possibly due to an incorrectly computed thread identifier.  If values are to be fetched from inactive threads, they are temporarily made active.  In general, a get operation activates all threads that will be fetched from and deactivates threads that will not be fetched from for the duration of the remote context's execution no matter what the active state of the remote context was prior to the communication.  In the above example, (b+1) is the remote context for the get operation.  In the following example, the router expression will generate invalid addresses for some local threads, therefore a conditional is used to make sure that only valid thread addresses are fetched. 

threads E[16]; threads F[16];

E.{  int a;  F int b; 

       if ($$i+2<16) a = F[$$i+2].(b+1);  }

There are four types of path descriptions: absolute addressing, universal addressing, relative addressing, and reduction addressing  Absolute addressing assumes the path description starts at MAIN and proceeds down the tree to the remote thread.  Given:

     threads { { G[2], H[3] } A[1], B[2] } C[3] ;

To locate the value of (b+1) referenced in the statement below we traverse the graph from MAIN to the xth thread of C.  From this node, we then proceed to the yth thread of B.  Note that x and y may have a different value for each thread of H and that the path for each thread of H is to a different thread of B. 

H.{  int a,x,y;  B int b;   a = C[x].B[y].(b+1);  }

The following diagram shows the highlighted path for an thread of H where x=1 and y=0:

Universal addressing is like absolute addressing but need not start at a child of MAIN.  Universal addressing treats all threads of a bundle as if they were contained in a single one-dimensional bundle.  For example, in the following code, threads of G are children of threads of K, which are children of threads of J: 

threads { { G[3][4] } K[5][7] } J[3][6][7] ;

J.{  int a,x;  G int b;   K[x].G[1][2].b = a;  }

Data of threads of J need to be sent to threads of G, which are not children of that thread of J.  However, each thread of type J contains the value of the ID, x, of the thread of K that is the parent of the thread of G where the data is to be sent.  Thus, the value of x is used as the index into the one-dimensional array of threads of K as the starting point of the path.  The rest of the path is treated as if it were absolute addressing. 

Relative addresses are computed with respect to the local thread's and/or its parent's address.  If a non-parent is in the path description, the address calculation with respect to the non-parent will be calculated as an absolute address. 

A relative router expression is prefixed with a period (.) and is assumed to start at the local thread's bundle or one of its parents.  The following is an example of a path from a thread of H to a thread of B. 

H.{  int a,x,y;  B int b;   a = .A.C[x].B[y].(b+1);  }

The following diagram shows the highlighted path to fetch a remote value from a thread of B (C[0].B[1]) to a local thread of H (C[2].A[0].H[1]) using relative addressing where x=1 and y=1:

The address of C[x] is calculated relative to the parent bundle of the H thread modulo the number of threads of C.  Yet, B[y] is treated as an absolute address rather than relative since B is not a parent of H.  The bundle A specification does not have an array index because it is not needed.  Parents of the local thread do not need indices if they are specifying the path up the bundle hierarchy toward MAIN.  Relative path calculations are done modulo the size of the dimensions of the sub-bundle, thus toroidally connecting the sub-bundle's threads.  Actually, it was unnecessary to specify the bundle A in the router expression since C is a grandparent of H.  The expression could have been written, ".C[x].B[y]".

Reduction addressing can be viewed as all-to-all communications.  Every thread of the source bundle sends a value to every thread of the destination bundle.  This mode of addressing is useful as well as efficient for performing global reductions.  It is also efficient if a value is needed from the threads of a bundle by the threads of another if the value is the same for all threads.  Reduction addressing is performed by designating the name of the bundle from which a value is to be reduced.

threads H[100];  threads B[300][400];

H.{  int a;  B int b=z;   a = B.b ;  }

In the above example note the expression "a=B.b".  Although B is a two-dimensional thread array, the path from B is designated by its name alone and no indices.  The expression will get the value of b of the active thread of B with smallest ID value and distribute it to a of all the active threads of H.  The next example is functionally equivalent to the previous, except that the values of b are being sent to a rather that being fetch by H and assigned to a. 

threads H[100];  threads B[300][400];

B.{  H int a;  int b=z;   H.a = b ;  }

Global reductions can be performed by this addressing mode by replacing the = with operations such as +=.  In this case, the values b of all active threads of B are summed and added to a of all threads of H. 

threads H[100];  threads B[300][400];

B.{  H int a;  int b=z;   H.a += b ;  }

Global reductions are limited to the following operations:  addition (+=), subtraction (-=), and (&=), or (|=), exclusive-or (^=), minimum (<?=), and maximum (>?=).

Put and Get

Data is fetched from or sent to the threads of a remote bundle depending on whether the router expression is on the right or left side of an assignment.  If the router expression is on the right side of an assignment, it is a get from the remote threads. If the expression is on the left side of an assignment, it is a put to the remote threads. 

The following is an example of a get operation: 

H.{  int a,x,y;  B int b;   a = .A.C[x].B[y].b ;  }

Reversing the sides of assignment makes it a put operation: 

H.{  int a,x,y;  B int b;   .A.C[x].B[y].b = a;  }

The value of a in H is sent to the specified remote thread b of B.  If a reduce add operation such as,

H.{  int a,x,y;  B int b;   .A.C[x].B[y].b += a;  }

is performed and multiple values are sent to the same thread, they are summed together.  Since more than one thread can send to the same thread, the data will either have to be combined, or some will be lost.  Therefore, when data is sent to another thread, there are several options for combination into the remote location, such as addition (+=), subtraction (-=), multiplication (*=), division (/=), and (&=), or (|=), exclusive-or (^=), minimum (<?=), and maximum (>?=). 

A put operation returns a flag to the expression in which it is contained, rather than a value.  This flag indicates whether or not the value being sent actually was received at the destination thread.

H.{  int a,x,y; B int b; flag=(.A.C[x].B[y].b = a) ;  }

In the above example, the values of a in bundle H are being sent to the variables b in bundle B.  The value of the flag, after the values have been sent, is TRUE if the specific value of a actually reached the requested instance of b and FALSE if it did not.  There are two reasons a value may not reach its destination: 1) the address of the destination thread is not a valid one, or 2) the put operation is '='.  If the put operation is '=', at most one value will be received by any destination thread.  Therefore, only one of the source threads that sent to the same destination thread will have its value received and its receive flag set to TRUE. 

Pre-computed Paths

A fair amount of time spent in a communications operation may be spent computing the identifiers of the remote threads involved in the communications.  The ability to define a path description as a variable that is computed at run time allows a path description to be pre-computed and optimized once, and yet used many times.  This amortizes the cost of computing a path descriptor across multiple uses of the descriptor.  A path is declared as follows:

H.{  path(B) toB; int a0,a1,a2,x,y;  B int b0,b1,b2;

       toB = C[x].B[y]. ;

       @toB.b0 = a0 ; @toB.b1 = a1 ; @toB.b2 = a2 ;  }

The path toB is a path from the local context of H to the remote context of B.  Note in the above example, toB was computed once but was used in three communications operations.  Pointers to or arrays of paths may also be declared. 

H.{ path(B) toB[4];  @toB[1].b1 = a1 ; }

H.{ path(B) *toB;  @(*toB).b1 = a1 ; }

Note, an array element from a path array need not be enclosed in parentheses when used, but a pointer to a path or any address expression that points to a path does. 

A reduction description can not be assigned to a path.  Any other path description can be (absolute, universal, or relative). 

Generic Routines

Most routines are specific to only one bundle.  Some, however, are useful to many bundles.  These are referred to as generic routines.  A generic routine is declared by preceding it with the key word generic in place of a specific bundle name.  Trigonometric functions are examples of such routines.  There is nothing about a sine function, for example, that makes it inherently specific to any bundle.  A sine function can be applied to all threads of a bundle and is trivially parallel (i.e., requires no communication between threads.)  This makes it a simple generic routine, and allows it to be executed within the context of any bundle.  An example of a simple generic routine is:

generic double sin (double x) { ... C code ... }

Generic routines can include inter-processor communication.  These are complex generic routines.  A complex generic routine can have a bundle as an argument.  To pass a bundle into a generic routine other than the bundle of the context, one prefixes the argument with the key word generic.

generic double func (

       generic(other),

       double x,

       other int y )

{ int a,b;  other int c,d;   a = other[b].d;   }

The bundle passed into the routine is other.  Note that variables, x, a, and b, belong to the context bundle and variables y, c, and d, belong to the bundle passed in as other.

To pass a bundle description into a complex generic routine, the bundle name is the argument.

threads A[200]; threads B[400];

A.{ double a,b;  B int c;   a = func(B,b,c);   }

Function func is called by bundle A and is passed the bundle B.  Note that b is a variable of bundle A and c is a variable of bundle B. 

A generic or a "argument passed" bundle can have multiple dimensions.  The dimensions' sizes may either be specified as in case 1 or unspecified as in case 2.  Case 1 will only allow bundles with dimensions [2][500][30] to be passed as an argument, while case 2 will allow any three-dimensional bundle to be passed. A bundle being passed as an argument must have the same number of dimensions as the generically declared bundle.

Case  1:

generic(other[2][500][30])

Case  2:

generic(other[ ][ ][ ])

If the generically declared bundle does not specify any dimensions, then a bundle with any number of dimensions may be passed, but it will be treated as a one-dimensional bundle.  The index value of this array may range from 0 to $$N-1 of the bundle. 

A generically declared bundle may also specify the child bundles it must have to be passed.  All bundles of the passed bundle must have at least as many child bundles and the same number of dimensions as the argument bundle.  Also, the dimensions' sizes must be specified and have the same values.  For case 3, A and C are valid passed bundles for OTHER, bundles B, D, and E are not. 

threads  { a[2], b[4][6] }          A[200];

threads  { d[2], e[5][6] }          B[400];

threads  { f[2], g[4][6], h[5] }    C[300];

threads  { j[2], k[4], m[5] }              D[300];

threads  { n[2], p[5], q[4][6] }    E[300];

Case  3:

generic( { y[2], z[4][6] } OTHER )

A is an exact match.  B is invalid because e does not have the same dimension size.  C is valid even though it has three children.  The third child is simply ignored since f matches y and g matches z.  D is invalid because k has only one dimension.  E is invalid even though children of the right size exist they are not the first two children of E. 

The context bundle of the function may also be passed in the routine by name.  This is done using the same syntax as declaring a bundle to be passed by argument except it replaces the keyword 'generic' preceding the routine declaration as in Case 4.

Case 4:

generic ( { y[2], z[4][6] } OTHER ) double Func (double x) { ... C code ... }

In Case 4 the context bundle in which 'Func' is called must be of the form of OTHER, as if it were a passed bundle.  Without this capability, the context bundle would not be able to be referenced explicitly.  This is not a problem with functions like trigonometric functions that require no inter-thread communications.  But it would be for, say, a generic FFT routine.  Any bundle that has been explicitly declared as a bundle argument or a generic bundle context may be used within the context of the function as any bundle specification can be used.

Realignment of Threads with Processors

If no realignment is specified for a bundle of threads, they will be distributed across the physical processors in a default manner specific to the given architecture.  This often does not represent a very effective arrangement given the communications patterns of a specific algorithm.  The realignment statement allows for the rearrangement of the threads across the physical processors relative to the default arrangement. 

Actual threads are mapped across physical processors in an order dependent on the architecture.  For example, an aCe program running on a PC cluster running MPICH as its message passing protocol may distribute threads in the following manner:  given that there are 'n' threads to be distributed over 'p' processors and 'd=n/p', the first 'd' threads will be allocated to the first processor, the second 'd' threads to the second processor, and so on until all threads have been allocated.  If 'n' is not evenly divisible by 'p', then 'd=n/p+1' and the last processor will have 'n mod p' threads.

The idea of realignment is to re-map the logical thread identifier to a different physical thread identifier.  The constant $$i is the logical thread identifier, and $$ii is the physical thread identifier.  If the threads of a bundle are not realigned, then $$i will be equal to $$ii for all threads in the bundle. 

Rearrangement is the permuting of logical thread relative to a fixed arrangement of physical threads.  In its simplest form, this is done by defining an array of logical threads and permuting the dimension of the array.  For example, if you have a two-dimensional array and you swap the two dimensions, this is equivalent to performing a transpose on the elements of the array.  The following example shows how a transpose array of logical threads is defined: 

threads  X[32][32];

realign X [32][32] --> [32][32] ; [0] --> [1] ; [1] --> [0] ;;

Realignment is described by a realignment descriptor statement.  (Refer to the preceding example of the bundle 'X'.)  The realignment keyword is 'realign'.  It is followed by the name of the bundle to be realigned.  Next is two sets of dimensions separated by an arrow (-->).  The first set of dimensions describes an array that represents how the logical identifiers are to be viewed.  And the second represents how the physical identifies are to be viewed.  In this example, both the logical and the physical identifiers are represented by a 32x32 array of threads.  Following the array definitions are the mappings.  These mappings map one or more logical dimensions to one or more physical dimensions.  In this case, there are two mappings, each consisting of one logical dimension mapped to one physical dimension.  The first mapping maps the least significant logical dimension to the most significant physical dimension. The second mapping maps the most significant logical dimension to the least significant physical dimension.  This realignment effectively transposes the logical arrangement of threads relative to the physical arrangement.  This realign statement is equivalent to:

realign X [32][32] --> [32][32] ; [0][1] --> [1][0] ;;

The following example, uses C-like code to show how logical threads are mapped to physical threads: 

threads  X [27][10];

realign X [6][9][5] --> [9][5][6] ;

[0] --> [1] ; [1] --> [2] ; [2] --> [0];;

The above realignment description refers to a physical array of threads which could be represented by P[9][5][6] and a logical array of thread represented by L[6][9][5].  The above dimension mapping describes how to change the ordering of the dimensions of a multi-dimensional array (a generalized multi-dimensional transpose).  This description is equivalent to the following C-like code describing the mapping of physical threads to logical threads:

int L[6][9][5], P[9][5][6];

for (I2=0; I2<6; I2++) {

       for (I1=0; I1<6; I1++) {

              for (I0=0; I0<6; I0++) {

                      L[I0][I2][I1] maps_to P[I2][I1][I0] ;

}}}

This simple form of the alignment statement allows for any form of multi-dimensional array transposing relative to the default arrangement.  The next form that will be described allows for changing from an array with one number of dimensions to one of a different number of dimensions: 

threads Y[100][100];

realign Y [100][100] --> [10][25][20][2] ;

       [0][1]  -->  [1][3][2][0] ;;

In the above case, the logical array is define as a two-dimensional array, 100x100, and the physical array is a four-dimensional array of dimensions [10][25] [20][2].  The mapping (that follows the array declaration) says map the transpose of the logical array to an array of the physical thread whose dimensions have been reordered to be [20][10][25][2].  This mapping statement appears to request the reordering of the physical threads.  In reality, however, the physical threads will remain in the default order, and only the logical threads will be reordered so that mapping from logical to physical is consistent with the mapping statement.

Note that the original and the final arrays of threads are the same size.  This is not necessary.  If the physical thread array size is larger than the logical, the array is padded with inactive threads that never become active.  If the logical array is larger than the physical array size, then the final array descriptor is given an additional most significant dimension large enough to make the physical array just larger than the logical. 

threads Y[100][100];

realign Y [100][100] --> [32][32] ; [1][0] --> [0][1] ;;

In the above example, the alignment statement is effectively equivalent to:

realign Y [100][100] --> [10][32][32] ;

[1][0] --> [2][0][1] ;;

There is one final form of the alignment state. The realignment description describes not only how an array may be rearranged or reformed into another array, but it can also be used to describe how sub-arrays may be reformed into different sub-arrays.  An alignment statement may consist of multiple mappings that reorder subsets of the dimension, some of which have no explicitly defined size.

threads Z[640][480];

realign Z [640][480] --> [][][64][32] ;

[1] --> [3][1]; [0] --> [2][0] ;;

The compiler will fill in the blank sizes.  Dimensions with blank sizes may only be used as most significant dimensions in a mapping array.  The compiler will translate the above realignment statement into the following:

realign Z [640][480] --> [10][15][64][32] ;

[1] --> [3][1]; [0] --> [2][0] ;;

For realignment to be of significant usefulness, the programmer will need to find out what the default arrangement of threads are for the specific physical architecture he wants to realign a bundle to. 

 

The Virtual/Scalar Distinction

So far, the programmer has been presented with a view of a variable declared over a bundle as having an independent value for every thread of that bundle.  This can lead to extreme inefficiency of both memory and compute cycles if implemented to its logical conclusion.  The aCe language is designed to maintain the totally virtual illusion but is implemented in a way that takes advantage of variables that have the same value in every thread (a scalar value).  This is done by recognizing variables that are assigned only constants or other scalar variables.  These variables are designated as 'scalar'.  Most other variables are designated as 'virtual'.  It would not be necessary to make the programmer aware of this in general except for the fact that aCe interfaces with the native programming environment.  The routines developed under the native programming environment expect parameters and return values of specific types, either scalar or virtual, but not both.  Therefore, when a native code routine is declared within aCe, the arguments and parameters must be explicitly designated as either scalar or virtual.  Appendix A contains examples of the usage of the keywords scalar and virtual. 

 

 

Appendix A:  Current Standard aCe Library Calls

A.1 Assert (see assert.aHr).

            (The libraries for 'assert' are not currently implemented.)

 

A.2 ctype (see ctype.aHr).

 

generic int    isalnum     ( int c );

generic int    isalpha     ( int c );

generic int    iscntrl     ( int c );

generic int    isdigit     ( int c );

generic int    isgraph     ( int c );

generic int    islower     ( int c );

generic int    isprint     ( int c );

generic int    ispunct     ( int c );

generic int    isspace     ( int c );

generic int    isupper     ( int c );

generic int    isxdigit    ( int c );

generic int    tolower     ( int c );

generic int    toupper     ( int c );

 

A.3 Error Number (see errno.aHr).

            (The libraries for 'errno' are not currently implemented.)

 

A.4 Locale Routines (see locale.aHr).

            (The libraries for 'locale' are not currently implemented.)

 

A.5 Math Routines (see math.aHr)

 

generic double       cos        (double x);

generic double       sin        (double x);

generic double       tan        (double x);

generic double       acos       (double x);

generic double       asin       (double x);

generic double       atan       (double x);

generic double       atan2      (double x, double y);

generic double       sinh       (double x);

generic double       cosh       (double x);

generic double       tanh       (double x);

generic double       exp        (double x);

generic double       log        (double x);

generic double       log10      (double x);

generic double       frexp      (double value, int *exp);

generic double       ldexp      (double x, int e);

generic double       modf       (double value, double *iptr);

generic double       pow        (double base, double exp);

generic double       sqrt       (double x);

generic double       floor      (double x);

generic double       ceil       (double x);

generic double       fmod       (double x, double y);

generic double       fabs       (double x);

 

A.6 Scan Functions (see scan.aHr).

 

generic int       scan_and          ( int V, char M ) ;

generic int       scan_or           ( int V, char M ) ;

generic int       scan_xor          ( int V, char M ) ;

 

generic int       scan_add          ( int V,    char M ) ;

generic float     f_scan_add        ( float V,  char M ) ;

generic double    d_scan_add        ( double V, char M ) ;

 

generic int       scan_min          ( int V,    char M ) ;

generic float     f_scan_min        ( float V,  char M ) ;

generic double    d_scan_min        ( double V, char M ) ;

 

generic int       scan_max          ( int V,    char M ) ;

generic float     f_scan_max        ( float V,  char M ) ;

generic double    d_scan_max        ( double V, char M ) ;

 

A.7 Set Jump (see setjmp.aHr).

            (The libraries for 'setjmp' are not currently implemented.)

 

A.8 Signals (see signal.aHr).

            (The libraries for 'signal' are not currently implemented.)

 

A.9 Standard Arguments (see stdarg.aHr).

            (The libraries for 'stdarg' are not currently implemented.)

 

A.10 Standard I/O (see sdtio.h).

 

generic scalar FILE*    fopen

                           ( scalar char *filename, scalar char *mode) ;

generic scalar int      fflush      ( scalar FILE* stream ) ;

generic scalar int      fclose      ( scalar FILE* stream ) ;

 

generic scalar int      fread

                           ( virtual void *virtual ptr,

                           scalar size_t size,  scalar size_t nobj,

                           scalar FILE* stream ) ;

generic scalar int      fwrite     

                           ( virtual void *virtual ptr,

                           scalar size_t size,  scalar size_t nobj,

                           scalar FILE* stream ) ;

 

generic virtual int     printf

                           ( scalar char * scalar format, virtual ... ) ;

generic virtual int     sprintf ( virtual char *virtual buff,

                           scalar char * scalar format, virtual ... ) ;

generic virtual int     fprintf`    ( scalar  FILE *scalar  file,

                           scalar char * scalar format, virtual ... ) ;

 

A.11 Standard Library (see stdlib.aHr).

 

generic double          atof  ( const char*  s );

generic int             atoi  ( const char*  s );

generic long            atol  ( const char*  s );

 

generic double          strtod

                        ( const char* s, char** endp );

generic long            strtol

                        ( const char* s, char** endp, int base );

generic unsigned long   strtoul

                        ( const char* s, char** endp, int base );

 

generic scalar void     abort ( void );

generic scalar void     exit  ( scalar int status );

 

generic int             abs   ( int n );

generic long            labs  ( long n );

generic div_t           div   ( int  num, int  denom );

generic ldiv_t          ldiv  ( long num, long denom );

 

A.12 String Routines (see string.aHr).

 

generic char*     strcpy        ( char*  s, char* ct );

generic char*     stnrcpy       ( char*  s, char* ct, int n );

generic char*     strcat        ( char*  s, char* ct );

generic char*     strncat       ( char*  s, char* ct, int n );

generic int       strcmp        ( char* cs, char* ct );

generic int       strncmp       ( char* cs, char* ct, int n );

generic char*     strchr        ( char* cs, char   c );

generic char*     strrchr       ( char* cs, char   c );

generic size_t    strspn        ( char* cs, char* ct );

generic size_t    strcspn       ( char* cs, char* ct );

generic char*     strpbrk       ( char* cs, char* ct );

generic char*     strstr        ( char* cs, char* ct );

generic size_t    strlen        ( char* cs );

generic char*     strerror      ( int n );

generic char*     strtok        ( char*  s, char* ct );

 

generic void*     memcpy        ( void*  s, void* ct, int n );

generic void*     memmove       ( void*  s, void* ct, int n );

generic int       memcmp        ( void* cs, void* ct, int n );

generic void*     memchr        ( void* cs, int    c, int n );

generic void*     memset        ( void*  s, int    c, int n );

 

A.13 Time Routines (see time.aHr).

            (The libraries for 'time' are not currently implemented.)

 

References

1.      Dorband, J.E., "MPP Parallel Forth", Proceedings of the First Symposium on the Frontiers of Massively Parallel Scientific Computation, pg. 211-215, 1986.

2.      Hamet, L.E., Dorband, J.E., ÒA generic fine-grained parallel CÓ, Proceedings of the Second Symposium on the Frontiers of Massively Parallel Computation, October 1988, Fairfax, VA, pp 625-628.

3.      Iverson, K.E.,  A Programming Language,  Wiley,  New York,  1962.

4.      Reeves A.P., Bruner J.D., The Language Parallel Pascal and other Aspects of the Massively Parallel Processor, School of Electrical Engineering, Cornell University, December 1982.

5.      Rose, J., Steele, G., "C*: An Extended C Language for Data Parallel Programming", Presented at the Second International Conference on Supercomputing, May 1987.

6.      Steele, G.,  Wholey,  S., "Connection Machine Lisp: A Dialect of Common Lisp for Data Parallel Programming," August, 1987.

7.      DAP-FORTRAN Language,  International Computers Ltd.,  TP 6918.

8.      MPL (MasPar Programming Language) Reference Manual, MasPar Computer Corp.,  Pt No. 9300-9034-00 Rev A2.