Parallelism

b2000++ uses hybrid parallelism: MPI for distributed-memory parallelism and Intel’s Threading Building Blocks (oneTBB) for shared-memory parallelism. TBB automatically tries to use all cores that the system reports as available. The -nb-threads option of b2000++ influences the maximum number of threads used, but it is not a hard limit. TBB support has to be activated when b2000++ is compiled, which is the default; b2000++ -version reports whether your executable was compiled with it. To deactivate TBB, set the option -DUSE_TBB=OFF when invoking cmake.

Most of the computational effort (memory and operations) during the simulation is spent in the linear equation solver. Thus, it is important to consider the parallelism in the linear equation solver as well.

When using MUMPS as the linear equation solver (the default in b2000++), parallelism is provided by OpenMP (shared memory) and MPI (distributed memory). To use the distributed parallelism of MUMPS, both MUMPS and b2000++ have to be compiled with MPI support. Pass -DUSE_MPI=ON to cmake when building b2000++ to switch it on (the default is OFF). As with TBB, b2000++ -version reports whether MPI is available in your executable.

OpenMP support is built into MUMPS only; whether it is available depends on how MUMPS and the BLAS library it is linked against were compiled. Usually it is active. If no OpenMP thread count is set, it also uses all cores reported by the system (which may include “virtual” cores from hyperthreading). In contrast to the TBB parallelization, however, we observe performance degradation when too many OpenMP threads are used. “Too many” typically means more threads than there are cores in a single logical NUMA (non-uniform memory access) region. Hence, you most likely want to limit the number of OpenMP threads, which you can do with the environment variable OMP_NUM_THREADS. b2000++ warns you if this variable is not set, to make you aware of the potential performance degradation.
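One way to apply such a limit without overriding a value that is already set is sketched below. The function name default_omp_threads and the fallback of 4 threads are assumptions made for illustration and are not part of b2000++; substitute the number of cores in one NUMA region of your machine.

```shell
# default_omp_threads FALLBACK
# Prints the current OMP_NUM_THREADS if it is set and non-empty,
# otherwise the given fallback (hypothetical helper, not part of b2000++).
default_omp_threads() {
    echo "${OMP_NUM_THREADS:-$1}"
}

# 4 is a placeholder for the number of cores in one NUMA region.
OMP_NUM_THREADS=$(default_omp_threads 4)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Placing these lines in a job script before the b2000++ call keeps an explicit user setting intact and only fills in a value when none was given.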

The limitation of the shared-memory parallelization in MUMPS can be countered by the distributed parallelism via MPI. Distributed parallelism also lets you use more compute nodes and, thus, more memory for your computations. To run b2000++ with MPI, prefix the command with mpiexec -np <NPROCS>, where the -np option gives the number of processes to use.

For good performance, the product of OMP_NUM_THREADS and the number of MPI processes should not exceed the number of available physical cores. So, for example, if your processor has 64 cores and a logical NUMA domain comprises four cores, you would run b2000++ with:

OMP_NUM_THREADS=4 mpiexec -np 16 b2000++

This makes use of all 64 cores: each of the 16 MPI processes uses 4 OpenMP threads.
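The arithmetic behind this example can be scripted. The following is a minimal sketch and not part of b2000++: mpi_ranks is a hypothetical helper, the core counts are the ones from the example, and the script only prints the launch line instead of running it.

```shell
# mpi_ranks TOTAL_CORES THREADS_PER_RANK
# Prints the largest number of MPI processes for which
# processes * threads does not exceed the number of physical cores.
mpi_ranks() {
    total_cores=$1
    threads_per_rank=$2
    # Integer division rounds down, so the product never exceeds total_cores.
    echo $(( total_cores / threads_per_rank ))
}

threads=4                          # cores per NUMA domain in the example
ranks=$(mpi_ranks 64 "$threads")   # 64 physical cores in the example

# Print the resulting command line rather than executing it.
echo "OMP_NUM_THREADS=$threads mpiexec -np $ranks b2000++"
```

With the values from the example this prints OMP_NUM_THREADS=4 mpiexec -np 16 b2000++. When the core count is not a multiple of the thread count, the rounding leaves the remaining cores unused rather than oversubscribing them.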

For increased performance, it may be beneficial to pin the MPI processes to specific cores, so that the operating system does not move their execution from one core to another. In the example above, we would like to pin the MPI processes to dedicated NUMA domains, so that they cannot be moved out of them. How this pinning is done depends on your system. Open MPI uses the hwloc library for this, and you can use the hwloc tools to query the hardware layout of your machine. See the --bind-to option of Open MPI’s mpiexec command. To pin the MPI processes to NUMA domains in the example above, use:

OMP_NUM_THREADS=4 mpiexec -np 16 --bind-to numa b2000++

Note: the hwloc library is sometimes out of sync with your MPI installation, which causes warnings due to incompatibilities. You can switch off the error messages from hwloc by setting the environment variable HWLOC_HIDE_ERRORS=2.
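To find out how many cores a NUMA domain comprises before choosing OMP_NUM_THREADS and the binding, you can inspect the hardware layout. The sketch below assumes a Linux system; show_numa_layout is a hypothetical helper that tries the hwloc text tool first and falls back to numactl or lscpu, none of which is guaranteed to be installed on your machine.

```shell
# show_numa_layout
# Prints the NUMA layout using the first tool that is installed
# (hypothetical helper, not part of b2000++ or Open MPI).
show_numa_layout() {
    if command -v lstopo-no-graphics >/dev/null 2>&1; then
        # hwloc: text rendering of the topology, without I/O devices
        lstopo-no-graphics --no-io
    elif command -v numactl >/dev/null 2>&1; then
        # NUMA nodes with the CPUs and memory that belong to each
        numactl --hardware
    elif command -v lscpu >/dev/null 2>&1; then
        # NUMA node count and the CPU list of each node
        lscpu | grep -i 'numa'
    else
        echo "no topology tool found (tried lstopo-no-graphics, numactl, lscpu)"
    fi
}
```

Calling show_numa_layout prints the layout; the number of cores per NUMA node it reports is a natural candidate for OMP_NUM_THREADS.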

Partitioned mode

With the option -partitioned you can run b2000++ in partitioned mode, that is, the mesh is distributed across the MPI processes and each process holds only a part of the overall problem. If you do not use the -partitioned option, the root process has to hold the complete problem, which is a bottleneck and limits the overall problem size to what fits on that single node. Thus, if you want to compute very large problems, this option allows you to distribute them across more compute resources.