MPI provides a simple, portable, standard software interface for
communicating between processes, and a way of assigning processes to
processing nodes that is transparent to both the programmer and the
running processes. The programs discussed in this thesis all start as
several processes started on workstations as specified in an MPI
configuration file, which tells the MPI start utility which processes
to run on which node. Once running, the processes pass messages by
simply specifying the following (a short example appears after the
list):
- A process ID number to send to or receive from
- A memory buffer pointer to send from or to put received data in
- The type of elements in the buffer
- The size of the buffer, given as a count of elements (it is treated
as an array)
- A message type tag, which is an integer a programmer can use for
filtering
- The MPI communicator to use
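
The following is a minimal sketch of a point-to-point exchange using
these parameters; the buffer size, element type, tag value, and
partner ranks are arbitrary illustrative choices.

    /* Minimal sketch of MPI point-to-point messaging. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[64];              /* memory buffer (an array of doubles) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 64; i++)
                buf[i] = i;          /* fill the buffer with sample data */
            /* buffer, element count, element type, destination ID, tag, communicator */
            MPI_Send(buf, 64, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            /* buffer, element count, element type, source ID, tag, communicator */
            MPI_Recv(buf, 64, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, &status);
            printf("process %d received a message\n", rank);
        }

        MPI_Finalize();
        return 0;
    }
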
MPI has functions for several different types of communication
transactions, including passing messages from one process to another,
from one process to all processes (broadcast or scatter), from many
processes to one (gather), and from all to all. Operations which
require the participation of all processes in a group are called
collective operations. In a collective communication, a process
receiving data from all processes truly receives data from all
processes in its group: it must even send data to itself, or the
operation fails. For point-to-point receives, a special value
(MPI_ANY_SOURCE) may be used in place of a process ID to accept data
from any sender. All-to-all communications, of course, do not require
a source address.
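
As a sketch of a collective operation, the following gathers one
integer from every process, including the root itself, into an array
at rank 0; the values gathered are purely illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int mine = rank * rank;          /* each process's contribution */
        int *all = NULL;
        if (rank == 0)
            all = malloc(size * sizeof(int));

        /* The root receives one int from every process in the group,
           including one from itself; every process must participate. */
        MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0)
            free(all);
        MPI_Finalize();
        return 0;
    }
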
The advantage of using process numbers for addressing is that the
programmer is freed from worrying about the details of lower-level
networking protocols, such as setting up and tearing down connections,
network addresses, or which protocol or interface to use. The MPI
libraries dynamically handle routing at the level of processes, so the
node on which a process runs may be changed in the configuration file
without changing the software. Since processes communicate with each
other by process number, they do not care whether a process they are
exchanging data with is on the same machine or on some other machine.
The use of the type tag is left entirely to the programmer. One may
ignore it and simply use the same number for all operations, or use it
for sequencing or error checking. A process may have several threads,
and the tag can specify which thread a message is to or from.
Alternatively, the tag can describe the content of a message, so a
recipient can treat the data differently depending on the value of the
type tag. This last option can be especially useful since messages are
sent as arrays of simple data types. While it is possible to define
complex types which MPI understands, it is often easier to treat the
message as raw memory and cast it to the appropriate type.
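
The sketch below shows one way a tag might be used to dispatch on
message content, receiving the message as raw bytes and casting it to
the appropriate type; the tag values and the result structure are
hypothetical.

    #include <mpi.h>
    #include <string.h>

    #define TAG_COMMAND 1            /* hypothetical tag values */
    #define TAG_RESULT  2

    struct result { double value; int iteration; };   /* hypothetical payload */

    void handle_message(void)
    {
        unsigned char raw[256];
        MPI_Status status;

        /* Accept any sender and any tag, then branch on the tag received. */
        MPI_Recv(raw, sizeof raw, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);

        if (status.MPI_TAG == TAG_RESULT) {
            struct result r;
            memcpy(&r, raw, sizeof r);   /* treat the raw memory as a result record */
            /* ... use r.value and r.iteration ... */
        } else if (status.MPI_TAG == TAG_COMMAND) {
            /* ... interpret the buffer as a command ... */
        }
    }
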
Using a library of standard functions makes the code easily portable
to different operating systems or MPI library implementations. MPI
libraries can dynamically use different system interconnects, for
instance shared memory between processes on the same computer, or
Ethernet between processes on different computers. This is all
transparent to the programs calling MPI functions, so processes need
not be aware of the physical location of other processes or the type
of system interconnect.
The MPI start utility allows processes to be started on any machine
without changing the source code or recompiling anything. The
configuration file is a text file in which each line specifies an
executable to use for starting a process, how many instances of that
process to start, and on which machine they should be started. This
makes it easy to change the physical topology of processes when they
are started. One may easily try different distributions of processes
on the available nodes and find the optimal arrangement for a given
algorithm and computer system. Processes are started in the order in
which they appear in the configuration file.
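
A hypothetical configuration file of this kind might look like the
following; the exact syntax and field order vary between MPI
implementations, and the executable names, instance counts, and host
names here are invented purely for illustration.

    # executable    instances    machine
    dispatcher      1            node1
    worker          2            node2
    worker          2            node3
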
New MPI processes can also be spawned by running MPI processes, using
the same parameters as those in the configuration file. This can be
useful when an algorithm has enough flexibility that the number of
processes needed can only be determined at run time. It may also be
useful if a program is intended to run on different systems and can
adapt itself to different numbers of nodes, processors, or other
system resources.
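
A sketch of spawning additional processes at run time using
MPI_Comm_spawn (an MPI-2 feature); the executable name "worker" and
the number of workers are illustrative.

    #include <mpi.h>

    /* Spawn "how_many" copies of a (hypothetical) worker executable and
       return an intercommunicator connecting the parents to them. */
    MPI_Comm spawn_workers(int how_many)
    {
        MPI_Comm workers;

        MPI_Comm_spawn("worker", MPI_ARGV_NULL, how_many, MPI_INFO_NULL,
                       0 /* root */, MPI_COMM_WORLD, &workers,
                       MPI_ERRCODES_IGNORE);
        return workers;
    }
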
Each process is given a unique ID based on the order in which they are
started, with the first process being 0. The programmer can make use
of this feature when implementing parallel algorithms. For instance,
if an executable serving as a dispatcher is started first, all
processes can count on it having ID 0. Or identical executables can
configure themselves into a tree or other virtual topology based on
their IDs and the total number of processes in their group. For
instance, if a process starts up and finds that there are seven
processes running, it can take the role of the root of a binary tree
with children 1 and 2 if its ID is 0, or the role of a leaf with
parent 1 if its ID is 4.
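
A sketch of how identical executables could derive such a binary-tree
role from their IDs, matching the seven-process example above (ID 0
becomes the root with children 1 and 2; ID 4 becomes a leaf with
parent 1).

    #include <mpi.h>

    void determine_tree_role(void)
    {
        int rank, size;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int parent = (rank == 0) ? -1 : (rank - 1) / 2;  /* -1 marks the root */
        int left   = 2 * rank + 1;                       /* left child, if it exists */
        int right  = 2 * rank + 2;                       /* right child, if it exists */
        int is_leaf = (left >= size);

        /* With 7 processes: rank 0 has children 1 and 2; rank 4 has
           parent 1 and no children, so it acts as a leaf. */
        (void)parent; (void)right; (void)is_leaf;
    }
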
MPI processes which are started together share an MPI communicator. A
communicator is a data structure which defines a group of processes
that may communicate with each other. It is a general graph with
processes as nodes. All processes which are started together are
automatically added to a fully connected communicator called
MPI_COMM_WORLD. There are functions which can form communicators of
any desired topology: MPI_Cart_create can form a Cartesian grid of
processes, and MPI_Graph_create can be used to form a graph of any
arbitrary shape. A new communicator does not need to include all
processes in MPI_COMM_WORLD. The MPI communication facilities can use
the graph information when setting up communication links. For
algorithms which involve processes communicating in specific patterns,
creating a graph can enforce the patterns by prohibiting
communications between certain processes.
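
A sketch of forming a new communicator with a Cartesian topology; the
2 x 2 grid dimensions are illustrative and assume at least four
processes were started.

    #include <mpi.h>

    void make_grid_communicator(void)
    {
        int dims[2]    = {2, 2};     /* a 2 x 2 grid of processes */
        int periods[2] = {0, 0};     /* no wraparound in either dimension */
        MPI_Comm grid;

        /* Processes beyond the 2 x 2 grid receive MPI_COMM_NULL. */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                        1 /* allow MPI to reorder ranks */, &grid);
    }
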
It is important to choose carefully which process runs on which node,
since one of the biggest concerns when distributing computation across
a network is balancing the communication latency between processes
against the speedup of parallel computation. Parallel programs usually
have a known process topology, where each process communicates
exclusively with a small subset of the total number of processes, and
the order and size of messages passed, within some limits, are
generally known. Given a virtual topology of processes, it is possible
to optimize the location of processes for a given physical cluster.
One wants to minimize the time spent passing data over system
interconnections, e.g. the network, while maximizing the amount of
processing done in parallel. It is important to remember that there
is no performance advantage to distributing processing if the overhead
of distribution, e.g. network latency, results in computation taking
longer to complete than it would have on a single node. For example,
if a computation takes 10 seconds to complete on one node, and can be
divided in two equal parts for parallel computation, then there would
be an advantage to distribution if the overhead was 4 seconds (total =
4 seconds distribution overhead + 5 seconds parallel computation = 9
seconds) but not if the overhead was 6 seconds (total = 6 seconds
distribution overhead + 5 seconds parallel computation = 11 seconds).
Obviously, the lower the latency of interprocess communication, the
greater the advantage one may take of software parallelism.
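
In general, if a computation takes T seconds on a single node, divides
evenly across p nodes, and incurs a total distribution overhead of T_o
seconds, then distribution pays off only when

    T_o + T/p < T,    that is,    T_o < T(1 - 1/p).

For the example above (T = 10 seconds, p = 2), the break-even overhead
is 5 seconds.
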
By using ServerNet's high-speed VIA NICs, MPI can send data directly
and efficiently to processes on another machine. The VIA protocol is
discussed in more detail in the next section. MPI data movement
commands can be turned into VIA data movement descriptors: the user
agent tells the hardware to transfer blocks of data from one process's
memory to another's, with little intervention from the CPU. Both the
sending and receiving ends may use this simple, low-latency method
for transferring data. Alternatively, MPI supports APIs for remote
direct memory access (RDMA), where one process transfers blocks of
memory from another process without the participation of that other
process. These two data transfer methods
correspond to those supported by VIA. [via97]
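
A sketch of MPI-2's one-sided (RDMA-style) communication, in which one
process reads a block of memory from another without that process
posting a receive; the window size and the target rank are
illustrative.

    #include <mpi.h>

    void one_sided_read(void)
    {
        double local[128];      /* memory this process exposes to others */
        double copy[128];       /* destination for data pulled from rank 1 */
        MPI_Win win;

        /* Every process exposes its local buffer as a window. */
        MPI_Win_create(local, 128 * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        /* Pull 128 doubles from rank 1's window; the target CPU does
           not have to participate in the transfer. */
        MPI_Get(copy, 128, MPI_DOUBLE, 1, 0, 128, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
    }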