CLM was implemented to provide performance portability across a wide range of current and future computer architectures. The model runs on hardware ranging from laptop computers to the largest commercial supercomputers available, and it performs well on both vector and scalar architectures. The model can be run serially or in parallel, using distributed memory, shared memory, or a combination of the two. The model uses MPI (the Message Passing Interface) for distributed memory parallelism and OpenMP for shared memory parallelism. On the Cray X1, the model can also stream across processing units (called Single Streaming Processors, or SSPs) within a Multi-Streaming Processor (MSP) using Cray Streaming Directives (CSDs) included in the code.
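As an illustration of this hybrid approach, the following C sketch shows the generic MPI-plus-OpenMP pattern: MPI ranks own disjoint subsets of the domain, and each rank spawns OpenMP threads over its local work. This is a minimal, generic example, not code from CLM (which is written in Fortran); the requested thread-support level is an assumption.

\begin{verbatim}
/* Generic hybrid MPI + OpenMP skeleton (illustrative only;
 * CLM itself is written in Fortran). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;

    /* Request thread support so OpenMP regions may coexist with MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Shared memory parallelism within each MPI process. */
    #pragma omp parallel
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, nprocs,
               omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
\end{verbatim}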
When compiled for distributed memory parallelism (with CPP variable SPMD defined), each MPI process will create an instance of the data structures shown in Figure 3.2 containing only the subset of data assigned to that process. A cache-friendly blocking structure is superimposed on the data structure hierarchy for improved computational efficiency. This blocking structure implicitly controls the vector length of most computations. Gridcells are grouped into blocks (called ``clumps'') of nearly equal computational cost, and these clumps are subsequently assigned to MPI processes.
The computational cost of a gridcell is approximately proportional to the number of PFTs it contains. However, because some PFTs are more expensive to compute than others and because similar PFTs tend to cluster geographically, balancing the workload across MPI processes requires a more sophisticated scheme than simply assigning contiguous blocks of gridcells to clumps. To minimize the potential for load imbalance, gridcells are assigned in cyclic (round robin) fashion to a pre-determined number of clumps, and the clumps are then assigned in cyclic fashion to the available MPI processes. This scheme distributes gridcells of varying cost evenly among MPI processes, yielding very good parallel load balance for most process counts and surface datasets. The clumped decomposition scheme is implemented in initDecomp (Section A.50.1).
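A minimal C sketch of the two-level cyclic assignment described above follows. The array names and problem sizes are illustrative assumptions, and the sketch omits the cost weighting that initDecomp applies; it shows only the round-robin mapping of gridcells to clumps and clumps to processes.

\begin{verbatim}
/* Two-level cyclic (round-robin) decomposition:
 * gridcells -> clumps, clumps -> MPI processes.
 * Names and sizes are hypothetical, not from initDecomp. */
#include <stdio.h>

#define NGRIDCELLS 16
#define NCLUMPS     4   /* pre-determined number of clumps */
#define NPROCS      2   /* number of MPI processes         */

int main(void)
{
    int clump_of_cell[NGRIDCELLS];
    int proc_of_clump[NCLUMPS];

    /* Assign gridcells to clumps cyclically so that expensive
     * cells, which tend to cluster geographically, are spread out. */
    for (int g = 0; g < NGRIDCELLS; g++)
        clump_of_cell[g] = g % NCLUMPS;

    /* Assign clumps to processes, again cyclically. */
    for (int c = 0; c < NCLUMPS; c++)
        proc_of_clump[c] = c % NPROCS;

    for (int g = 0; g < NGRIDCELLS; g++)
        printf("gridcell %2d -> clump %d -> process %d\n",
               g, clump_of_cell[g],
               proc_of_clump[clump_of_cell[g]]);
    return 0;
}
\end{verbatim}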
Clumps not only define the workload for an MPI process; they also serve to block data for shared memory parallelism when OpenMP is used. The number of clumps per MPI process is determined by the parallel configuration of the model at run time in control_init (Section A.49.1), but it may be overridden by setting the clump_pproc namelist variable to the desired number of clumps per process. When the model is run serially or with MPI-only parallelism, one clump per process is used. When OpenMP is used, the number of clumps per process is set to the maximum number of OpenMP threads available, which is normally declared by setting the OMP_NUM_THREADS environment variable in the job script. On the Cray X1 when OpenMP is disabled, the compiler interprets CSDs in place of OpenMP directives, and the number of clumps per process is set to four to take maximum advantage of the four SSPs on an MSP.
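The following C sketch shows how clumps can double as the unit of shared memory parallelism, with an OpenMP loop over clumps inside a driver routine. The clump type, index-range layout, and routine names are hypothetical stand-ins, not CLM identifiers.

\begin{verbatim}
/* OpenMP threading over clumps (hypothetical names; the clump
 * layout is an assumption, not CLM's actual data structure). */
#include <omp.h>

typedef struct {
    int begc, endc;   /* index range of this clump's columns */
} clump_t;

void update_clump(clump_t *c) { (void)c; /* per-clump physics */ }

void drive_timestep(clump_t *clumps, int nclumps)
{
    /* With nclumps equal to OMP_NUM_THREADS, each thread gets
     * exactly one clump; serial runs simply use nclumps == 1. */
    #pragma omp parallel for schedule(static)
    for (int nc = 0; nc < nclumps; nc++)
        update_clump(&clumps[nc]);
}
\end{verbatim}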