Cluster & Docker deployment

Anduril can be flexibly deployed on a local machine (laptop or desktop) or in a cluster. The default settings are suitable for local deployment: parallel component execution is capped at four threads, and components are invoked locally. Cluster deployment is needed when analyzing large data sets. Docker is useful for encapsulating the external dependencies of components.

The main configuration options for deployment are the following flags to anduril run:

For convenience, the deployment flags can be inserted into the workflow Scala code as special comments in the header of the file:

#!/usr/bin/env anduril
//$OPT --wrapper my-wrapper.sh
//$OPT --threads 16

import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object DeploymentDefaults {
    // Code goes here
}

Deployment configurations

Below are ready-made configurations for common deployments. Ensure that the wrapper scripts are on PATH by adding $ANDURIL_HOME/bin to it. You can customize the scripts for your environment.

Slurm

Command line arguments: --wrapper anduril-wrapper-slurm --threads 99

The Slurm wrapper uses srun. By default, it submits all components to Slurm. Individual components can be customized to bypass Slurm (execute locally) or to request specific CPU and memory resources. For more fine-grained control, you can add more custom attributes to anduril-wrapper-slurm.

Customizing:

val component = MyComponent()
component._custom("cpu") = "4"       // --cpus-per-task
component._custom("memory") = "2048" // --mem=MB

val local = MyComponent()
local._custom("host") = "local"      // Bypasses Slurm and executes locally

SGE

Command line arguments: --wrapper "qrsh -now no" --threads 99

Docker

Command line arguments: --wrapper anduril-wrapper-docker

This wrapper runs components inside a selected Docker container. The image is specified with a _custom("docker") annotation; components without that annotation are executed on the host without Docker.

The container should have an Anduril installation available and have ANDURIL_HOME set. The anduril-wrapper-docker wrapper sets the USER_ID environment variable to the current user ID so that file permissions are handled correctly. This requires that the image has an entry point that switches to a local user with that USER_ID. The anduril/core image (and images derived from it) supports this feature.

Scala workflow:

val component = MyComponent()
component._custom("docker") = "repository/dockerimage"

To use your own or third-party Docker images that do not have Anduril installed, you can map the $ANDURIL_HOME folder from the host into the container and set the ANDURIL_HOME environment variable in a custom wrapper script.
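As a rough sketch of such a custom wrapper, the function below bind-mounts the host's $ANDURIL_HOME at the same path inside the container and passes the variable through with it. The function name docker_run_with_anduril, the DOCKER override, and the convention of passing the image name as the first argument are all assumptions made for illustration; the stock anduril-wrapper-docker may obtain the image and build its command line differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper for a custom Docker wrapper (not part of Anduril).
# Usage: docker_run_with_anduril IMAGE COMMAND [ARGS...]
# ASSUMPTION: the image name is the first argument; how a real wrapper
# learns the image from the _custom("docker") annotation is not shown here.
docker_run_with_anduril() {
    local image=$1
    shift
    # DOCKER is a hypothetical override, handy for dry runs (DOCKER=echo).
    # -v bind-mounts the host Anduril installation at the same path,
    # -e exports ANDURIL_HOME inside the container.
    "${DOCKER:-docker}" run --rm \
        -v "${ANDURIL_HOME}:${ANDURIL_HOME}" \
        -e "ANDURIL_HOME=${ANDURIL_HOME}" \
        "$image" "$@"
}
```

Setting DOCKER=echo prints the command line instead of running it, which is a cheap way to check the mapping. A working wrapper would most likely also need to mount the execution folder, since components read and write their files there.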

Cluster topology

There are two types of executables in Anduril: the core workflow engine, which is invoked using the anduril command, and components, which can be implemented in a variety of languages (R, Python, Java, etc.). The core is only needed on the node where anduril is interactively invoked (e.g., a head node). Wrapper scripts then forward execution to worker nodes.

The interactive head node requires the following software and files:

Worker nodes require the following software:

Since the Anduril installation, component bundles and execution folder are required on both the interactive and worker nodes, it is convenient to share them using a shared or distributed file system (e.g., NFS). Mount points should be the same on all hosts, so that file paths are valid everywhere.

The interactive anduril process needs to remain in memory during the workflow execution, so it should be executed from screen or tmux for long jobs.

Writing custom wrappers

If you want to integrate Anduril with your specific cluster environment, and the scripts above are not suitable, you can write your own wrapper script. The interface of a wrapper is:

The following environment variables are available in the wrapper:

Since a wrapper blocks until the component has finished executing, wrappers can control the concurrency of the workflow in conjunction with --threads N. For example, if the cluster environment controls job queue length, --threads can be set to a large value (such as 99) so that the cluster itself limits concurrency.
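The blocking behaviour can be illustrated with a minimal wrapper sketch. It assumes that Anduril appends the component command line to the wrapper as arguments, which is what the SGE setting --wrapper "qrsh -now no" suggests but which is not confirmed here. LAUNCHER and run_component are hypothetical names used only in this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical my-wrapper.sh (a sketch, not the official wrapper interface).
# ASSUMPTION: the component command line arrives as this script's arguments.

# LAUNCHER is a hypothetical variable, not an Anduril setting.
# Empty: run the component on this host.
# Example: LAUNCHER="qrsh -now no" forwards the command to SGE.
run_component() {
    local status=0
    # $LAUNCHER is deliberately unquoted so that a command plus its
    # options is split into separate words.
    ${LAUNCHER:-} "$@" || status=$?
    # Returning only after the command has finished is what lets
    # --threads N bound the number of concurrently running components.
    return "$status"
}

# Act as a wrapper only when a command was actually passed in.
if [ "$#" -gt 0 ]; then
    run_component "$@"
    exit $?
fi
```

The exit status of the component is passed back unchanged, on the assumption that Anduril uses it to decide whether the component succeeded.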

Synchronizing file systems

Distributed and parallel file systems may implement caching that does not ensure a synchronized view of the file system on the interactive and worker nodes. In this case, you may need to implement explicit I/O synchronization commands in the wrapper script, or modify file system caching parameters.
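One possible form of explicit synchronization, sketched under assumptions, is to flush buffered writes on the worker node before the wrapper reports that the component has finished. The function run_then_sync and the SYNC_CMD override are hypothetical names. Whether sync is sufficient depends on the file system and its caching settings, so treat this as a starting point rather than a known fix.

```shell
#!/usr/bin/env bash
# Hypothetical helper intended to run on the worker node, for example as
# the command that the wrapper hands to the cluster scheduler.
# Usage: run_then_sync COMMAND [ARGS...]
run_then_sync() {
    local status=0
    # Run the component and remember its exit status.
    "$@" || status=$?
    # Flush buffered writes to storage. SYNC_CMD is a hypothetical override;
    # the default is the standard sync(1) utility.
    ${SYNC_CMD:-sync} || true
    # Report the component's own exit status, not that of the sync step.
    return "$status"
}
```

The sync step runs even when the component fails, so that any log files it wrote are flushed as well.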

A symptom of an unsynchronized file system is that anduril running on the interactive node reports missing output files for a component, even though the component finished successfully on the worker node. When you inspect the execution folder manually using ls, the files appear to be there.

This issue can be fixed using the following methods: