|
FlowSieve
3.0.0
Coarse Graining Routines
|
This file will outline some of the computational methodologies.
Suppose that we are computing the filter at longitude
and latitude
(at indices
and
respectively) over length scale
. Let
be the mean radius of the earth (in metres).
The integral is performed as a double for-loop, with the outer loop going through latitude and the inner loop going through longitude. In order to optimize the computation time, we would like to restrict the bounds on each of the for loops as much as possible.
This is implemented in Functions/get_lat_bounds.cpp
Suppose that we have uniform (spherical) grid spacing
.
The spacing between latitude bands, in metres, is then
. The number of points that we'll include in either latitude side is then
The scale factor
is a user-specified scaling to indicate how large of an integration region should be used. This is specified as KernPad in constants.hpp.
The outer loop in the integral is then from
to
.
In this event, we simply use a binary search routine to search the (logical) interval
for the point whose distance from the reference point is nearest to, but not less than, 
This is implemented in Functions/get_lon_bounds.cpp
The currently implementation requires the longitudinal grid to be uniform.
Suppose that we have uniform (spherical) grid spacing
.
Next, at each latitude
within that loop, the same process is applied to compute the bounds for the longitude loop, with the spacing between longitude band, in metres, given by
.
An identical equation is then used to compute
, which then gives the width of the integration region (in logical indexing) at that specific latitude.
Unit-wise,
The following assume incompressibility:
Not implemented.
Unit-wise, this gives
In the case that
, then this reduces to
The spherical differentiation methods (spherical derivatives) use a land-avoiding stencil. Land avoiding is achieved by simply using a non-centred stencil when appropriate. This is done to avoid having to specify field values at land cells, as this may introduce artificially steep velocity gradients that would confound derivatives, particularly with pressure and density, where there's no clear extension to land.
The secondary differentation tools (Cartesian derivatives) simply apply the chain rule on the spherical differentiation methods.
In the case that we're restricing ourselves to a spherical shell, then
, so this reduced down to
This section outlines some of the parallelizations that are used. The coarse graining routine uses hybrid parallelization with OpenMPI (forks) and OpenMP (threads). This is done to minimize communication costs as much as possible.
The OpenMPI-based limit is:
The OpenMP-based limit is then:
If we then assign one OpenMPI processor to each physical compute node, this gives the over-all upper-bound on the number of processors that can be used as **(# Processors per Node) * Ntime * Ndepth**, assuming that the number of processors per node is constant.
Suppose you have a dataset with daily resolution, spanning ten years, but only at the surface.
Then Ntime = 365 * 10 = 3650 (ignoring leap years) and Ndepth = 1.
Further suppose that your computing environment has 24 processors per node.
You could then theoretically use up to 87,600 ( = 3650 * 1 * 24) processors.
Moreover, this is theoretically without encurring onerous communication costs, since the OpenMPI forks only need to communicate / synchronize during I/O, which only happens once per filter scale and once at the very beginning of the routine.
The time and depth loops are split and across processors with OpenMPI. At current, only time is actually split, but it is intended that depth will also be split across processors. Since the filtering routine only uses horizontal integrals, there is no need to communicate between different times and depths. This helps to avoid costly OpenMPI communication.
Horizontal (lat/lon) loops are NOT parallelized with OpenMPI. This is because there is a lot of communication required, especially when using large filtering scales or sharp-spectral kernels. This communication would be prohibitive.
The filter scales are also NOT parallelized with OpenMPI. This is because it would be difficult to load balance, particularly with non-sharp spectral kernels, where different filter scales would require very different amounts of time / work. Accounting for this would be non-trivial and, unless there were a large number of filter scales, not practical or efficient.
If parallelization across filter scales is truly desired, the simplest way would simply be to produce multiple executables and run them separately.
The horizontal (lat/lon) loops are threaded (openmp) in for loops. Both dynamic and guided scheduling routines are used to divide the work over the threads. These are used to give load balancing in light of the land/water distinctions. OpenMP uses shared memory, which avoids the need for (potentially very high) communication costs. There is a price to pay with scheduling, particularly since we can't use static scheduling (for load balancing reasons), but it's far less than would arise from OpenMPI.
The following functions occur within a time/depth for loop, and so do not use OpenMPI reoutines.
The following functions occur outside of time/depth for loops. However, the appropriate values for Ntime and Ndepth are passed through, so there's no need for further OpenMPI routines (outside of using the processor rank for print statements, of which there are none!).
This means that there's almost no modifications required, except for:
If you're running on a SLURM-managed cluster (such as the ComputeCanada clusters, Bluehive, etc), then you can submit using hybridized parallelization using the following submit script.
#!/bin/bash #SBATCH -p standard #SBATCH –output=sim-j.log #SBATCH –error=sim-j.err #SBATCH –time=02-00:00:00 # time (DD-HH:MM:SS) #SBATCH –ntasks=10 # number of OpenMPI processes #SBATCH –cpus-per-task=10 # number of OpenMP threads per OpenMPI process #SBATCH –mem-per-cpu=450M # memory per OpenMP thread #SBATCH –job-name="job-name-here"
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun -n ${SLURM_NTASKS} ./coarse_grain.x
1.8.13