MPI is a standard that lets a team of processes solve a problem together. Various implementations of the standard exist; MPICH is perhaps the best known, Microsoft MPI mimics MPICH on Windows, and Intel MPI does the same across various platforms. Typically you would link one of these libraries into a project written in C, C++, or Fortran.
Furthermore, recent releases of Rtools have included support for Microsoft MPI. My plan is a couple of blog posts on how to get this working, some exploration of performance, and an attempt at testing / CI. I’ll include notes on what’s needed on Windows, Linux or Mac to write and build the package, but when running multi-node programs, I’ll be limited to our departmental MS-HPC cluster. Hopefully the changes needed for launching scripts on other cluster platforms shouldn’t be too hard.
First steps
First we need an MPI library. On Windows, installing the latest Rtools and MS-MPI will do it. Test by running mpiexec in a terminal, and on Windows check that it really is Microsoft’s mpiexec. Intel’s has the same name and comes with the Intel C++ compiler, but Microsoft’s must come first in your path, as the mpiexec executable we use needs to match the library in Rtools. On Linux, sudo apt-get install mpich is enough; on Mac, MPICH can be installed with Homebrew.
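As an optional sanity check that the compiler and library your toolchain picks up are consistent, a tiny standalone program (my own sketch, not part of the package below) can report which version of the MPI standard the library implements:
// mpi_check.cpp - a hypothetical standalone check, not part of the package.
// Build with e.g. mpicxx on Linux/Mac, or g++ mpi_check.cpp -lmsmpi with Rtools on Windows,
// then run the result under mpiexec.
#include <cstdio>
#include <mpi.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int major = 0, minor = 0;
  MPI_Get_version(&major, &minor);  // version of the MPI standard the library implements
  std::printf("MPI standard version: %d.%d\n", major, minor);
  MPI_Finalize();
  return 0;
}
If that compiles, links and runs, the rest of the setup below should go smoothly.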
What defines an MPI program
An MPI program involves a number of processes, all running the same single executable code. The processes would traditionally have been on different computers - one process on each - but they can also be stacked on the same compute node, or spread with some number of processes across some number of nodes.
The program must make exactly one call to MPI_Init, during which the processes handshake and agree on an id for each process in the family: an integer starting at zero. The id is known as the rank, and the number of processes is the size. After that initialisation step, each process knows its own rank, and can use it, for example, to decide which subset of the total work to do. Finally, when all the MPI work is finished, each process must make exactly one call to MPI_Finalize if you want a neat and successful exit.
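To make that concrete, here is a minimal sketch of the shape described above, written as a plain C++ program outside R for now; the work-splitting loop is just an illustration of using the rank:
#include <cstdio>
#include <mpi.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);                // exactly one init, on every process

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id, 0 .. size-1
  MPI_Comm_size(MPI_COMM_WORLD, &size);  // how many processes in the family

  // Use the rank to pick a subset of the total work - here, items 0..99
  for (int item = rank; item < 100; item += size) {
    // ... process this item ...
  }
  std::printf("Process %d of %d finished its share\n", rank, size);

  MPI_Finalize();                        // exactly one finalize, on every process
  return 0;
}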
Parallel Hello World
I’m going to make an R package called mpitest. It’s going to use the cpp11 package, so the DESCRIPTION file includes:
LinkingTo: cpp11
and, conventionally, R/zzz.R contains:
##' @useDynLib mpitest, .registration = TRUE
NULL
Here come the bits of C++ code, which are wrappers around some MPI functions so that we can call them from R. First we’ll define a header, src/rmpi.h, declaring the MPI wrappers:
#pragma once
#include <mpi.h>
void start_mpi();
int get_mpi_size();
int get_mpi_rank();
void end_mpi();
The implementations of these, using the cpp11 package’s annotations, go in src/rmpi.cpp. See the MPI library documentation for the functions we are wrapping.
#include "rmpi.h"
[[cpp11::register]]
void start_mpi() {
MPI_Init(NULL, NULL);
}
[[cpp11::register]]
int get_mpi_size() {
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
return size;
}
[[cpp11::register]]
int get_mpi_rank() {
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
return rank;
}
[[cpp11::register]]
void end_mpi() {
MPI_Finalize();
}
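As an aside, the same pattern extends to other MPI calls. For example, if we wanted the node name as MPI itself reports it (rather than via Sys.info() as in the R code below), a wrapper along these lines could be added to src/rmpi.cpp; get_mpi_processor_name is my own name for it and it is not part of the minimal package described here:
#include <string>  // in addition to the includes above

[[cpp11::register]]
std::string get_mpi_processor_name() {
  // MPI_MAX_PROCESSOR_NAME is defined in mpi.h; resultlen receives the actual length
  char name[MPI_MAX_PROCESSOR_NAME];
  int resultlen = 0;
  MPI_Get_processor_name(name, &resultlen);
  return std::string(name, resultlen);
}
cpp11 converts the std::string return value into a length-one character vector on the R side.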
Now in R/source.R, we’ll write a quick test:
hello <- function() {
start_mpi()
rank <- get_mpi_rank()
size <- get_mpi_size()
name <- Sys.info()[["nodename"]]
message(sprintf("Hello! I am %s (%s/%s)", name, rank, size))
end_mpi()
}
Lastly, we need to tell the compiler and linker about MPI. On Windows, src/Makevars.win contains:
PKG_CXXFLAGS = -lmsmpi
PKG_LIBS = -lmsmpi
For Linux or Mac, we need slightly tweaked versions of src/Makevars, so this hack of a configure script at the top level of the package will work for both:
#!/bin/bash
# Write src/Makevars with MPI flags suited to this platform
if [[ $(uname) == "Darwin" ]]; then
  # Mac: assume MPICH was installed with Homebrew, and ask brew where it lives
  MPICH_PREFIX=$(brew --prefix mpich)
  echo "PKG_CXXFLAGS = -I${MPICH_PREFIX}/include" > ./src/Makevars
  echo "PKG_LIBS = -L${MPICH_PREFIX}/lib -lmpich" >> ./src/Makevars
elif [[ $(uname) == "Linux" ]]; then
  echo "PKG_CXXFLAGS = -I/usr/include/x86_64-linux-gnu/mpich" > ./src/Makevars
  echo "PKG_LIBS = -lmpich" >> ./src/Makevars
else
  # Anything else: leave an empty Makevars so the build can still proceed
  touch ./src/Makevars
fi
Now if we document, build and install the package, we can test it from a terminal:
> mpiexec -n 4 Rscript -e "mpitest:::hello()"
Hello! I am WES-COMPUTER (0/4)
Hello! I am WES-COMPUTER (1/4)
Hello! I am WES-COMPUTER (3/4)
Hello! I am WES-COMPUTER (2/4)
Here, we are running the four processes on the same local machine, and the messages arrive in an arbitrary order, as we’d expect.
With Multiple Nodes
If we want to spread the execution across different nodes, then we need an MPI-aware cluster that can launch the job on multiple nodes and tell the processes about each other’s existence. Our MS-HPC cluster will do this for us; here’s the script we’ll ask the cluster nodes to run, which I’ll call mpiwes.bat and save in a network folder called \\homes\wes\test that the cluster nodes can see.
set R_LIBS=\\homes\wes\R
set R_LIBS_USER=\\homes\wes\R
call setr64_4_3_0
Rscript -e "mpitest:::hello()"
The first two lines set environment variables pointing at the library folder where my mpitest package is installed. The third line is a helper on our cluster that adds the R version of our choice to the path. Lastly, in this tiny example, we’re passing the function we want to call inline as an argument to Rscript -e.
To submit the job, with for example 8 processes across 2 nodes, I’d use the MS-HPC job submit tool along these lines:
job submit /scheduler:headnode /jobtemplate:template /numnodes:2 /singlenode:false
/stdout:mpiout.txt /stderr:mpierr.txt /workdir:\\homes\wes\test mpiexec -n 8 mpiwes.bat
The output of Rscript ends up in stderr (mpierr.txt), because message() writes to the standard error stream; after the job has run, that file contains:
Hello! I am HPC-093 (2/8)
Hello! I am HPC-093 (4/8)
Hello! I am HPC-093 (0/8)
Hello! I am HPC-093 (6/8)
Hello! I am HPC-095 (1/8)
Hello! I am HPC-095 (3/8)
Hello! I am HPC-095 (7/8)
Hello! I am HPC-095 (5/8)
So, the 8 processes we asked for were spread evenly over the 2 nodes which the cluster assigned to our job. As it happens, each of those nodes had 32 cores, but the cluster can only give us units of whole nodes. So we made pretty poor use of them, reserving all 32 cores on each node but only using 4. Perhaps we could have utilised them better with mpiexec -n 64, if we knew in advance that each node would have 32 cores, and if our algorithm could actually use them well. But that’s for a later discussion.
End of Part One
Like all MPI work, it feels a little clunky to get started, but it works well enough to be useful. We can call MPI functions from both the C++ and R files pretty simply, which is a good start.
Also note that in the examples above, the processes all dump their output into one file. That works for now because the amount of text is small, but for larger jobs we should write a separate file for each process, otherwise the text will eventually get interleaved in an untidy, unordered way. For now, file buffering on write is saving us.
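One way to do that, sketched here at the C++ level rather than in R, is to build a filename from the rank and redirect each process’s output to it; the filename pattern is just an illustration:
#include <cstdio>
#include <mpi.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Give each process its own log file, named by rank
  char fname[64];
  std::snprintf(fname, sizeof fname, "output_rank_%d.txt", rank);
  std::freopen(fname, "w", stdout);   // redirect this process's stdout

  std::printf("Hello from rank %d\n", rank);

  MPI_Finalize();
  return 0;
}
The same idea works from R, for example by opening a file connection whose name includes get_mpi_rank() and sink()-ing output to it.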
In part two, I’ll explore how data can be shared between processes, local or remote, and some of the performance considerations of that.