Frequently Asked Questions and support for Parcels
Parcels (Probably A Really Computationally Efficient Lagrangian Simulator) is a set of Python classes and methods to create customisable particle tracking simulations using output from ocean circulation models. Parcels focusses on tracking of passive water parcels, as well as active particulates such as plankton, plastic and fish.
The Parcels code is licensed under an open source MIT license and can be downloaded from https://github.com/OceanParcels/parcels.
There is also extensive documentation of all methods and classes in Parcels.
Parcels design overview
The figure above gives a brief overview of the most important classes and methods in Parcels and how they are related. Classes are in blue, methods are in green. Note that not all classes and methods are shown.
Tips on constructing FieldSet objects
Probably the trickiest bit to get right when starting with Parcels on your own model output is how to construct a FieldSet object. The general method to use is FieldSet.from_netcdf(), which requires filenames, variables and dimensions. Each of these is a dictionary, and variables requires at least a U and V, but any other variable can be added too (e.g. temperature, mixed layer depth, etc). Note also that filenames can contain wildcards.
For example, the GlobCurrent data that is shipped with Parcels can be read with:
from parcels import FieldSet

fname = 'GlobCurrent_example_data/*.nc'
filenames = {'U': fname, 'V': fname}
variables = {'U': 'eastward_eulerian_current_velocity', 'V': 'northward_eulerian_current_velocity'}
dimensions = {'lat': 'lat', 'lon': 'lon', 'depth': 'depth', 'time': 'time'}
fset = FieldSet.from_netcdf(filenames, variables, dimensions)
Note that dimensions can also be a dictionary-of-dictionaries. For example, if you have wind data on a completely different grid (and without a depth dimension), you can do:
fname = 'GlobCurrent_example_data/*.nc'
wname = 'path_to_your_windfiles'
filenames = {'U': fname, 'V': fname, 'wind': wname}
variables = {'U': 'eastward_eulerian_current_velocity', 'V': 'northward_eulerian_current_velocity', 'wind': 'wind'}
dimensions = {}
dimensions['U'] = {'lat': 'lat', 'lon': 'lon', 'depth': 'depth', 'time': 'time'}
dimensions['V'] = {'lat': 'lat', 'lon': 'lon', 'depth': 'depth', 'time': 'time'}
dimensions['wind'] = {'lat': 'latw', 'lon': 'lonw', 'time': 'time'}
fset = FieldSet.from_netcdf(filenames, variables, dimensions)
In a similar way, you can add U and V fields that are on different grids (e.g. Arakawa C-grids). Parcels will take care under the hood that the different grids are dealt with properly.
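For instance, a minimal sketch of a FieldSet with U and V on different grids could look like this (the file patterns, NetCDF variable names 'uo' and 'vo', and dimension names 'lat_u', 'lon_u', 'lat_v', 'lon_v' are illustrative assumptions, not fixed Parcels names):
filenames = {'U': 'model_output_U*.nc', 'V': 'model_output_V*.nc'}
variables = {'U': 'uo', 'V': 'vo'}
dimensions = {}
dimensions['U'] = {'lat': 'lat_u', 'lon': 'lon_u', 'depth': 'depth', 'time': 'time'}
dimensions['V'] = {'lat': 'lat_v', 'lon': 'lon_v', 'depth': 'depth', 'time': 'time'}
fset = FieldSet.from_netcdf(filenames, variables, dimensions)
For NEMO-style C-grid output, Parcels also provides the dedicated FieldSet.from_nemo() constructor.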
Writing Parcels Kernels
One of the most powerful features of Parcels is the ability to write custom Kernels (see e.g. the Adding-a-custom-behaviour-kernel part of the Tutorial). These Kernels are little snippets of code that get executed by Parcels, giving the ability to add 'behaviour' to particles.
However, there are some key limitations to the Kernels that everyone who wants to write their own should be aware of (a short example kernel illustrating several of the allowed constructs is sketched after this list):
- Every Kernel must be a function with the following (and only those) arguments: (particle, fieldset, time) (note that before Parcels v2.0, Kernels also required a dt argument)
- In order to run successfully in JIT mode, Kernel definitions can only contain the following types of commands:
  - Basic arithmetical operators (+, -, *, /, **) and assignments (=).
  - Basic logical operators (<, ==, !=, >, &, |). Note that you can use a statement like particle.lon != particle.lon to check whether particle.lon is NaN (since math.nan != math.nan).
  - if and while loops, as well as break statements. Note that for-loops are not supported in JIT mode.
  - Interpolation of a Field from the FieldSet at a (time, depth, lat, lon) point, using square brackets notation. Note that from version 2.0, the syntax has changed from the old (time, lon, lat, depth) to the new (time, depth, lat, lon) order. For example, to interpolate the zonal velocity (U) field at the particle location, use the following statement:
value = fieldset.U[time, particle.depth, particle.lat, particle.lon]
Also note that it is possible to add the particle as an additional argument to the Field sampling, so
value = fieldset.U[time, depth, lat, lon, particle]
or simply
value = fieldset.U[particle]
Adding the particle to the sampling can dramatically speed up simulations in Scipy mode on curvilinear grids; see also the JIT-vs-Scipy tutorial.
  - Functions from the math standard library.
  - Functions from the custom ParcelsRandom library at parcels.rng. Note that these have to be used as ParcelsRandom.random(), ParcelsRandom.uniform() etc. for the code to compile.
  - Simple print statements, such as:
print("Some print")
print(particle.lon)
print("particle id: %d" % particle.id)
print("lon: %f, lat: %f" % (particle.lon, particle.lat))
- Local variables can be used in Kernels, and these variables will be accessible in all concatenated Kernels. Note that these local variables are not shared between particles, and also not between time steps.
- Note that one has to be careful when writing kernels for vector fields on curvilinear grids. While Parcels automatically rotates the U and V fields when necessary, this is not the case for other vector fields such as wind data. In that case, a custom rotation function will have to be written.
All other functions and methods are not supported yet in Parcels Kernels. If there is a functionality that cannot be programmed with this limited set of commands, please create an Issue ticket.
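For illustration, here is a minimal sketch of a custom Kernel that combines several of the allowed constructs (field sampling, a math function, an if-statement and ParcelsRandom); the kernel name and numerical values are arbitrary, and the exact import location of ParcelsRandom may differ between Parcels versions:
import math
from parcels import ParcelsRandom

def SampleAndJiggle(particle, fieldset, time):
    # Sample the zonal velocity at the particle position
    u = fieldset.U[time, particle.depth, particle.lat, particle.lon]
    # Where the zonal flow is very weak, add a small random zonal displacement
    if math.fabs(u) < 1e-6:
        particle.lon += ParcelsRandom.uniform(-0.01, 0.01)
Such a Kernel can then be combined with the built-in kernels, for example via pset.Kernel(AdvectionRK4) + pset.Kernel(SampleAndJiggle), before passing it to pset.execute().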
The information on particle trajectories is stored in zarr format when using the ParticleFile class. The file/directory contains arrays of at least the time, lon, lat and z (depth) of each particle trajectory, plus any extra custom Variables defined in the pclass.
Each row in these arrays corresponds to a particle, and each column is an 'observation'. Note that, when particles are added during runtime, their first position is still stored in the first column, so that not all particles in the same column necessarily share the same time. The time of each observation is stored in the time array. See the output tutorial notebook for more information.
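For example, a minimal sketch of inspecting such output with xarray could look like this (the name 'output.zarr' is a placeholder for whatever you passed to ParticleFile):
import xarray as xr

ds = xr.open_zarr('output.zarr')
print(ds)                # variables: time, lon, lat, z, plus any custom Variables
print(ds['lon'].values)  # 2D array: one row per particle, one column per observation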
Parcels, with its Lagrangian simulation approach and the commonly large number of particles being simulated, is a computationally demanding application. Global-scale, fine-resolution hydrodynamic input data, such as from CMEMS and NEMO, add to this computational demand as they require large blocks of memory to be available. Despite continuous improvements in performance and efficiency, it is thus possible to encounter performance- and memory-related problems. The following paragraphs are meant as a guideline to resolving such issues.
If you encounter random failures of your particle simulations, the first step is to simplify the simulation so that the error becomes reproducible for others. It is therefore advisable to run the simulation with the simplest type of particles (ideally plain Particle objects) and possibly just a simple AdvectionRK4 kernel; this rules out any errors coming from faulty calculations. If this simplification resolves your issue, then the failure of your simulation is due to the specific particle class or kernel you attempt to execute, and you may want to refine your calculations accordingly.
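A minimal sketch of such a simplified test run could look as follows (the FieldSet dictionaries, particle positions and time steps are placeholders to adapt to your own setup):
from datetime import timedelta
from parcels import AdvectionRK4, FieldSet, JITParticle, ParticleSet

fieldset = FieldSet.from_netcdf(filenames, variables, dimensions)  # as in the FieldSet examples above
pset = ParticleSet(fieldset=fieldset, pclass=JITParticle, lon=[25.0], lat=[-35.0])
pset.execute(AdvectionRK4, runtime=timedelta(days=1), dt=timedelta(minutes=5))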
When calculation-related cancellations are ruled out as the error source, the first step in tracking memory-related issues is a plausibility check: can the issues that you are experiencing actually stem from memory exhaustion? Get an overview of the models and fields that you are using, consider whether you are running the application with or without depth information, and keep in mind that at each time in the simulation you need at least space for 3 timestamps of all numerical field values in addition to the actual particles. Does this really exhaust the memory? As a safeguard check, you can run a subset of your simulation (e.g. the first day or week) locally while tracking the memory consumption with tools such as the Task Manager (Windows), the Activity Monitor (macOS) or the System Monitor (Linux). If you can, run the simplified simulation over the whole timespan locally on your computer (even though this may be slow).
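If you prefer to log memory usage from within the script itself, a minimal sketch using the third-party psutil package (an assumption; it is not a Parcels dependency) could look like this:
import os
import psutil

process = psutil.Process(os.getpid())
# Print the resident set size (RSS) of the current Python process in MB
print("Memory usage: %.1f MB" % (process.memory_info().rss / 1e6))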
Suppose you have determined that the reason your simulation does not finish is, or may be, memory exhaustion. The next step is then to determine whether the memory exhaustion is caused by the particles or by the fields. Fortunately, this is easy to assess:
- Initialize your ParticleSet with only a few particles, e.g.
pset = ParticleSet(fieldset=fieldset, pclass=JITParticle, lon=np.random.rand(16,1), lat=np.random.rand(16,1))
- Do not re-add particles periodically via repeatdt, e.g. change
pset = ParticleSet(..., lat=np.random.rand(16,1), repeatdt=timedelta(hours=1))
to
pset = ParticleSet(..., lat=np.random.rand(16,1))
Run your simulation again and observe the memory with the tools mentioned above. If your simulation now runs through without problems, it is almost certain that your real simulation requires too many particles. In that case, the general advice is to reduce the overall particle density (i.e. create fewer particles overall), or to manually split your simulation into smaller pieces and reduce the particle count accordingly if you want to preserve the particle density. Beyond this advice, there is no generic solution at this point for the problem of exhaustive particle sets, so get in touch with us at the OceanParcels GitHub page via the issue tracker to find an individual solution.
If reducing the number of particles does not solve your issue and your simulation still crashes, then this is likely due to the fields consuming more memory than available. Luckily, there are some solutions to this problem.
Introduce field chunking (i.e. dynamic loading of fields)
So, your situation is as follows: you found out that your Parcels simulation consumes more field data than can fit in memory on any system that you tried. Then, it may be that you are not using field chunking.
Field chunking is introduced in Parcels > 2.1.0. It manages the access and data transfer between your hydrodynamic data (i.e. your NetCDF files), the computer memory and the kernels, so that only those parts of the data that are directly needed by some particles end up in memory. This technique is commonly referred to as dynamic loading. In Python, with its numeric backend of NumPy and SciPy, the Dask module implements this data access technique, where it is called chunking or lazy loading. Chunking of field data is activated by default in Parcels > 2.1.0.
As an example, let us assume we load a FieldSet as follows:
fieldset = FieldSet.from_netcdf(files, variables, dimensions)
then, this is equivalent to
fieldset = FieldSet.from_netcdf(files, variables, dimensions, deferred_load=True, field_chunksize='auto')
Hence, if you are using an older version of Parcels (<= 2.1.0), or if you have accidentally switched off chunking via fieldset = FieldSet.from_netcdf(..., field_chunksize=False), then you may exceed your available memory. Try out the latest version of Parcels and activate field chunking.
Simulations crash on HPC clusters (via job submission systems such as SGE or SLURM), but locally finish without problems
The behaviour of simulations crashing on cluster systems while running as expected on local machines can occur because:
- Python itself does not return memory to the operating system
- Your (field) data are too large for the memory, but fit comfortably in virtual memory (e.g. swap drive, swap files, page files) -> job submission systems do not provide access to the swap memory
- Your (field) data exceed the default memory capacity granted to users of your cluster -> consult tutorials on SGE qsub (e.g. from MIT or the University of Texas at Austin) or SLURM (e.g. from Princeton University or the University of Buffalo) to raise the memory granted to your simulation job. In particular, consider submission option parameters such as -l h_vmem=size (SGE qsub) and --mem=size (SLURM).
Advanced control over chunking and memory - configure Dask
If you have field chunking running and your job submission files are all set correctly, but your fields are still too big, then your fields must be either many (in terms of individual fields), complex (i.e. many diverse hydrological properties), or on dense (i.e. fine-resolution) 4D grids (i.e. using time, depth, latitude and longitude). In this case, you may need to tune your environment a bit more directly to your application or scenario.
- Tune your chunk size configuration manually: instead of leaving the chunking up to the black box that is Dask, you can define the chunks yourself. Say your simulation data have 30 timesteps on a grid with depth=150, lat=2100, lon=1800 (just an example); then you can:
  - use the parameter field_chunksize of the FieldSet class as a tuple: here, the order is (time steps, depth size, latitude size, longitude size), so e.g. fieldset = FieldSet.from_netcdf(files, variables, dimensions, field_chunksize=(1,8,128,128))
  - use the parameter field_chunksize of the FieldSet class as a dictionary: here, you define the chunks by dimension name, e.g. fieldset = FieldSet.from_netcdf(files, variables, dimensions, field_chunksize={'time_counter': 1, 'depth': 8, 'nav_lat': 128, 'nav_lon': 128})
- Override Dask's chunking procedure: Dask's auto-chunking feature attempts to minimize read access from the file, hence it will chunk the data so that an individual chunk fits into memory. If you are working with multiple fields (or multiple cores via MPI), this assumption does not apply to you and, thus, the parameter setting field_chunksize='auto' may not provide you with the expected result. If you want to adapt this behaviour without manually tuning every application and scenario script, then you can configure the auto-chunking function itself. Information on how to adapt the auto-chunking can be found in the Dask documentation. Dask can use a term in the Dask configuration file which defines a new upper limit for a chunked array. Detailed information and guidance on the configuration file can be found in the Dask guidelines.
As a short summary: once Dask is installed, there is a (hidden) configuration folder in your home directory, ${HOME}/.config/dask, that contains two files: dask.yaml and distributed.yaml. The second one is of less interest to you unless you work with distributed file management (if unsure, contact your system administrator). Under normal conditions, Dask will read configuration parameters from dask.yaml. That file could look like this (default after installation):
# temporary-directory: null     # Directory for local disk like /tmp, /scratch, or /local
# array:
#   svg:
#     size: 120  # pixels
You can adapt this so that, for example, the chunks are smaller, which speeds up loading and allows multiple fields to be accessed simultaneously without exceeding memory.
temporary-directory: /tmp
array:
  svg:
    size: 120  # pixels
  chunk-size: 128 MiB
Another very important change is that we removed the # from the start of the lines, which turns them from comments into actual settings. Of course, if you have e.g. 4 fields, then you may want to change 128 MiB into 32 MiB. Power-of-two sizes are advantageous here, although not strictly necessary.
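Alternatively, the same limit can be set programmatically at the top of your Parcels script instead of editing dask.yaml (a minimal sketch, assuming a reasonably recent Dask version):
import dask

# Lower the upper bound that Dask's auto-chunking uses per chunked array
dask.config.set({'array.chunk-size': '32MiB'})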