
MPI

MPI provides a simple, portable, standard software interface for communicating between processes, and a way of assigning processes to processing nodes that is transparent to the programmer and to the running processes. The programs discussed in this thesis all start as several processes started on workstations as specified in an MPI configuration file. The configuration file tells the MPI start utility which processes to run on which node. Once running, the processes pass messages by simply specifying a destination process number, a message type tag, and the data to send.

MPI has functions for several different types of communication transactions. These include passing messages from one process to another, from one process to all processes (broadcast or scatter), from many processes to one (gather), or even from all processes to all processes. Operations which require the participation of all processes in a group are called collective operations. In a collective communication, a process receiving data from all processes truly receives data from all processes in its group: it must even send data to itself, or the operation fails. In point-to-point receives, a special value may be used in place of an address to receive data from anywhere. All-to-all communications, of course, require no address.
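As an illustration of this addressing model, the following is a minimal sketch in C of one process sending an array of doubles to another. The ranks, tag value, and buffer size are arbitrary, and the code is not taken from the programs discussed in this thesis.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, i;
    double data[64];                         /* the message: an array of a simple data type */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process' number */

    if (rank == 0) {
        for (i = 0; i < 64; i++)
            data[i] = i;
        /* all that is specified: destination process, type tag, and the data */
        MPI_Send(data, 64, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive from process 0; MPI_ANY_SOURCE is the special value that
           accepts data from anywhere */
        MPI_Recv(data, 64, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}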
The advantage of using process numbers for addressing is that the programmer is freed from worrying about the details of lower-level networking protocols, such as setting up and tearing down connections, network addresses, or which protocol or interface to use. The MPI libraries dynamically handle routing at the level of processes, so the node on which a process runs may be changed in the configuration file without changing the software. Since processes communicate with each other by process number, they do not care whether a process they are exchanging data with is on the same machine or on some other machine.
The use of the type tag is left entirely to the programmer. One may ignore it and use the same number for all operations, or use it for sequencing or error checking. A process may have several threads and use the tag to specify which thread a message is destined for or came from. The tag can also specify the content of a message, so a recipient can treat the data differently depending on the tag's value. This last option can be especially useful since messages are sent as arrays of simple data types. While it is possible to define complex types which MPI understands, it is often easier to treat the message as raw memory and cast it to the appropriate type.
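The following sketch shows such tag-based dispatch; the tag values and message layouts are hypothetical, and it is assumed that MPI_Init has already been called.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical tag values and message layouts chosen by the programmer */
#define TAG_PARAMS  1
#define TAG_RESULTS 2

struct params  { int iterations; double tolerance; };
struct results { double value; int converged; };

void handle_message(void)
{
    char buffer[4096];                  /* the message is received as raw memory */
    MPI_Status status;

    /* accept any tag from any source, then branch on the tag actually received */
    MPI_Recv(buffer, sizeof(buffer), MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);

    if (status.MPI_TAG == TAG_PARAMS) {
        struct params *p = (struct params *) buffer;    /* cast to the expected type */
        printf("received parameters: %d iterations\n", p->iterations);
    } else if (status.MPI_TAG == TAG_RESULTS) {
        struct results *r = (struct results *) buffer;
        printf("received result: %f\n", r->value);
    }
}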
Using a library of standard functions makes the code easily portable to different operating systems or MPI library implementations. MPI libraries can dynamically use different system interconnects, for instance shared memory between processes on the same computer, or Ethernet between processes on different computers. This is all transparent to the programs calling MPI functions, and frees processes from caring about the physical location of other processes or the type of system interconnect.
The MPI start-up utility allows processes to be started on any machine without changing the source code or recompiling anything. The configuration file is a text file in which each line specifies an executable to use for starting a process, how many instances of that process to start, and on which machine they should be started. This makes it easy to change the physical topology of processes when they are started up: one may try different distributions of processes across the available nodes and find the optimal arrangement for a given algorithm and computer system. Processes are started in the order in which they appear in the configuration file.
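The exact syntax of the configuration file varies between MPI implementations; the following hypothetical file only illustrates the kind of information each line carries (machine, number of instances, executable).

# machine    instances    executable
node0        1            /home/user/bin/dispatcher
node1        2            /home/user/bin/worker
node2        2            /home/user/bin/worker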
New MPI processes can also be spawned by running MPI processes, using the same parameters as the configuration file. This can be useful when an algorithm has enough flexibility that the number of processes needed can only be determined at run time, or when a program is intended to run on different systems and must adapt itself to different numbers of nodes, processors, or other system resources.
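A sketch of run-time spawning using the MPI-2 MPI_Comm_spawn call; the executable name and worker count are illustrative.

#include <mpi.h>

/* Spawn additional worker processes at run time.  The "worker" executable is
   hypothetical; the new processes are reached through the returned
   intercommunicator. */
void spawn_workers(int nworkers)
{
    MPI_Comm children;

    MPI_Comm_spawn("worker", MPI_ARGV_NULL, nworkers, MPI_INFO_NULL,
                   0 /* root */, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
}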
Each process is given a unique ID based on the order in which it is started, with the first process being 0. The programmer can make use of this feature when implementing parallel algorithms. For instance, if an executable serving as a dispatcher is started first, all processes can count on it having ID 0. Or identical executables can configure themselves into a tree or other virtual topology based on their IDs and the total number of processes in their group. For instance, if a process starts up and finds that there are seven processes running, it can take the role of binary tree root with children 1 and 2 if its ID is 0, or the role of a leaf with parent 1 if its ID is 4.
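The following sketch shows how identical processes might derive such a binary tree from their IDs; the arithmetic matches the seven-process example above, and it is assumed that MPI_Init has already been called.

#include <mpi.h>
#include <stdio.h>

void configure_tree(void)
{
    int rank, size, parent, left, right;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process' ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    parent = (rank == 0) ? -1 : (rank - 1) / 2;     /* process 0 is the root */
    left   = 2 * rank + 1;
    right  = 2 * rank + 2;
    if (left  >= size) left  = -1;                  /* -1 marks a missing link */
    if (right >= size) right = -1;

    printf("process %d: parent %d, children %d and %d\n", rank, parent, left, right);
}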
MPI processes which are started together share an MPI communicator. A communicator is a data structure which defines a group of processes that may communicate with each other; it is a general graph with processes as nodes. All processes which are started together are automatically added to a fully connected communicator called MPI_COMM_WORLD. There are functions which can form communicators of any desired topology: MPI_Cart_create forms a Cartesian grid of processes, and MPI_Graph_create can be used to form a graph of arbitrary shape. A new communicator does not need to include all processes in MPI_COMM_WORLD. The MPI communication facilities can use the graph information when setting up communications links. For algorithms which involve processes communicating in specific patterns, creating a graph can enforce the patterns by prohibiting communications between certain processes.
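For example, a 2 x 2 Cartesian grid communicator could be created from MPI_COMM_WORLD as in the following sketch, which assumes at least four processes are running; processes that do not fit in the grid receive MPI_COMM_NULL.

#include <mpi.h>

void make_grid(void)
{
    int dims[2]    = {2, 2};     /* a 2 x 2 grid of processes */
    int periods[2] = {0, 0};     /* no wrap-around in either dimension */
    MPI_Comm grid;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* allow MPI to reorder ranks */, &grid);
}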
It is important to choose carefully which process runs on which node, since one of the biggest concerns when distributing computation across a network is balancing communication latency between processes against the speedup gained from parallel computation. Parallel programs usually have a known process topology, where each process communicates exclusively with a small subset of the total number of processes, and the order and size of the messages passed are, within some limits, generally known. Given a virtual topology of processes, it is therefore possible to optimize the placement of processes on a given physical cluster: one wants to minimize the time spent passing data over system interconnections, e.g. the network, while maximizing the amount of processing done in parallel. There is no performance advantage to distributing processing if the overhead of distribution, e.g. network latency, results in the computation taking longer than it would have on a single node. For example, if a computation takes 10 seconds to complete on one node, and can be divided into two equal parts for parallel computation, then there is an advantage to distribution if the overhead is 4 seconds (total = 4 seconds distribution overhead + 5 seconds parallel computation = 9 seconds) but not if the overhead is 6 seconds (total = 6 seconds distribution overhead + 5 seconds parallel computation = 11 seconds). The lower the latency of interprocess communication, the greater the advantage that can be taken of software parallelism.
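Stated generally, and assuming the work divides into p equal parts, distribution pays off only when T_overhead + T_serial / p < T_serial, i.e. when T_overhead < T_serial (1 - 1/p). In the example above, with 10 seconds of serial work and p = 2, the overhead must stay below 5 seconds.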
By using ServerNet's high speed VIA NICs, MPI can send data directly and efficiently to processes on another machine. The VIA protocol is discussed in more detail in the next section. MPI data movement commands can be turned into VIA data movement descriptors. The user agent tells the hardware to transfer blocks of data from one process' memory to another's, with little intervention from the CPU. The sending and receiving end may both use this simple, low latency method for transferring data. Alternatively, MPI supports APIs for remote direct memory access (RDMA), where one process transfers blocks of memory from another process without the participation of that other process. These two data transfer methods correspond to those supported by VIA. [via97]
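The one-sided model corresponds to the MPI-2 remote memory access interface; whether a particular MPI implementation maps it onto VIA's RDMA support is implementation dependent. A minimal sketch, with illustrative buffer sizes:

#include <mpi.h>

/* Process 0 reads a block of memory exposed by process 1 without process 1
   issuing a matching send.  Assumes MPI_Init has been called. */
void one_sided_read(int rank)
{
    double local[256], exposed[256];
    MPI_Win win;

    /* every process exposes a window of its memory to the group */
    MPI_Win_create(exposed, sizeof(exposed), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Get(local, 256, MPI_DOUBLE, 1 /* target process */, 0, 256, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);              /* completes the transfer */

    MPI_Win_free(&win);
}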

  
Figure 1: VIA Modules

