Dask User Guide

This user guide is also available as a Jupyter Notebook. Download Jupyter Notebook

Table of Contents

Set up Conda environment

If this is your first time using Dask, you need to install some packages. You only need to do it once.

Follow the steps on these pages to get started:

Set up Dask workloads on a local cluster

The following Python code snippet starts a local Dask cluster (on the Notebook server) and connects to it

from dask.distributed import Client, LocalCluster

# By default, the number of workers is automatically set based on the number of CPU cores
cluster = LocalCluster()

# Or, to specify a specific number of workers, set 'n_workers' to the number of workers desired
cluster = LocalCluster(n_workers=6)

client = Client(cluster)

Set up Dask workloads on a remote cluster

Usually, the address to the Dask scheduler is set by the DASK_SCHEDULER_ADDRESS environment variable.

Before continuing, check if this is set

import dask
dask.config.get('scheduler_address')

This should output the address of the Dask scheduler (address:port). If you don’t see the address or an error message is thrown, you should set the Dask scheduler address.

import dask

# Replace 'address:port' with the actual address and port number of the Dask scheduler
dask.config.set({'scheduler_address': 'address:port'})

Now, you can connect to the remote Dask cluster.

from dask.distributed import Client

client = Client()

Dask and Xarray operations

Reading netCDF files

Read one netcdf file as xarray Dataset

import xarray as xr

ds = xr.open_dataset("/path/to/netcdf.nc")

Read multiple netcdf files as xarray Dataset

import glob
import xarray as xr

# Step 1: build file list by matching paths with a given pattern
# Example: this matches all netcdf files with filenames that start 
# with 'month_' in /path/to/dataset
netcdf_list = glob.glob('/path/to/dataset/month_*.nc')

# Step 2: read matching netcdf file as xarray Dataset
ds = xr.open_mfdataset(paths=netcdf_list)

Basic operations

Load data in a dimension into xarray DataArray

da = ds[dim_name]

Calculate mean

# First group DataArray by some coordinate (such as month), then calculate mean for each group
da_mean = da.groupby('time.month').mean()

Resample data (downsample or upsample)

# Resample DataArray to monthly. Specify time coordinate, use [y]ear/[m]onth/[w]eek/[d]ay/[h]our
da_resample = da.resample(time='1m').mean()

Persist DataArray in memory on Dask workers (This speeds up computation)

from dask.distributed import progress

da_mean = da_mean.persist()

# Monitor Dask progress
progress(da_mean)

Plotting with matplotlib

matplotlib works well as a simple, standalone plot in Jupyter Notebook (legacy, not JupyterLab) environments.

However, when used in JupyterLab or with interactive elements (Slider or Player), the plots are not interactive. To create interactive plots in these environments, consider hvPlot.

matplotlib options

# When using matplotlib, set one of these options before generating plots

# In JupyterLab, use inline plots
%matplotlib inline

# In Jupyter Notebook (legacy), use notebook plots, which are interactive
%matplotlib notebook

Create a simple plot from DataArray

# figsize argument is optional
da_mean.plot(figsize=(width, height))

Create an interactive plot with an Integer slider

# Plot itself is not interactive when Player or Slider is added
import panel.widgets as pnw

# Step 1: create the Slider
# Replace 'month' with the variable to be controlled by slider, and set start/end appropriately
slider = pnw.IntSlider(name='month', start=1, end=12)

# Step 2: create plot
# Replace 'month' with the variable to be controlled by slider
da_mean.interactive.sel(month=slider).plot()

Create a looping plot with an Integer slider

# Plot is not interactive when Player or Slider is added
import panel.widgets as pnw

# Step 1: create the Player
# Replace 'month' with the variable to be controlled by player, and set start/end appropriately
# Include loop_policy='loop' to make the plot loop
# Set interval appropriately. Please test and account for computation time.
player = pnw.Player(name='month', start=1, end=12, loop_policy='loop', interval=1000)

# Step 2: create the plot
da_mean.interactive().isel(month=player).plot()

Create an interactive plot with a Discrete slider (for selecting datetime)

# Plot is not interactive when Player or Slider is added
import panel.widgets as pnw

# Create the plot with Discrete Slider
# Replace 'time' with the variable to be controlled by player
da_month.interactive.sel(time=pnw.DiscreteSlider).plot()

Plotting with hvPlot

hvPlot creates nice interactive plots with many features, including pan, zoom, and hover to view raw data.

Create an interactive plot with an Integer slider

import hvplot.xarray
import panel.widgets as pnw

# Step 1: create the Slider
# Replace 'month' with the variable to be controlled by slider, and set start/end appropriately
slider = pnw.IntSlider(name='month', start=1, end=12)

# Step 2: create plot
# Replace 'month' with the variable to be controlled by slider
da_mean.interactive.sel(month=slider).hvplot()

Create a looping plot with an Integer slider

import hvplot.xarray
import panel.widgets as pnw

# Step 1: create the Player
# Replace 'month' with the variable to be controlled by player, and set start/end appropriately
# Include loop_policy='loop' to make the plot loop
# Set interval appropriately. Please test and account for computation time.
player = pnw.Player(name='month', start=1, end=12, loop_policy='loop', interval=1000)

# Step 2: create the plot
# Replace 'month' with the variable to be controlled by player
da_mean.interactive.sel(month=player).hvplot()

Create an interactive plot with a Discrete slider (for selecting datetime)

import hvplot.xarray
import panel.widgets as pnw

# Create the plot with Discrete Slider
# Replace 'time' with the variable to be controlled by player
da_month.interactive.sel(time=pnw.DiscreteSlider).hvplot()