2. Linear workflow

The “Hello world” workflow in the previous section was trivial because it consisted of only one component. In this section, we construct a workflow using two components, one of which depends on the other.

Assume we have the following tab-delimited file (data.csv):

Gene    Value   QualityOK
gene02  2.7     1
gene01  1.5     1
gene03  5.8     0
gene99  3.2     1

We wish to sort it by the Gene column so that the gene names are in order. The Scala code is:

#!/usr/bin/env anduril

import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object TwoComponents {
    val input = INPUT(path = "data.csv")
    val sorted = CSVSort(input.out, types = "Gene=string")
}

When executed, the workflow prints the following:

$ ./two-components.scala
[INFO <run-workflow>] Current ready queue: input (READY-QUEUE 1)
[INFO input] Executing input (anduril.builtin.INPUT) (SOURCE two-components.scala:8) (COMPONENT-STARTED) (2016-04-28 10:40:07)
[INFO input] Component finished with success (COMPONENT-FINISHED-OK) (2016-04-28 10:40:07)
[INFO input] Current ready queue: sorted (READY-QUEUE 1)
[INFO sorted] Executing sorted (anduril.tools.CSVSort) (SOURCE two-components.scala:9) (COMPONENT-STARTED) (2016-04-28 10:40:07)
[INFO sorted] Component finished with success (COMPONENT-FINISHED-OK) (2016-04-28 10:40:07)
[INFO sorted] Current ready queue: (empty) (READY-QUEUE 0)
[INFO <run-workflow>] Done. No errors occurred.

The file result_two-components/sorted/out.csv now contains a sorted version of the tab-delimited file:

Gene    Value   QualityOK
gene01  1.5     1
gene02  2.7     1
gene03  5.8     0
gene99  3.2     1

Understanding the workflow

This workflow uses two new components, INPUT and CSVSort. INPUT is a fundamental component that is used to import data files from the file system to the workflow. It is used in nearly every workflow. CSVSort is a more specialized component that takes a tab-delimited file as an input, sorts it, and writes a tab-delimited file as output.

Workflow execution proceeds as follows:

  1. In the beginning, input is ready to execute because it does not depend on other components. sorted cannot yet be executed.
  2. input is executed.
  3. All dependencies of sorted (i.e., input) are now available, so it is enabled for execution.
  4. sorted is executed.

It is often helpful to mentally map the Scala source code to a workflow of component dependencies. In this case, the dependency network is a simple chain: input → sorted.

Placing components on workflow using Scala

At this point, we need to understand how the Anduril Scala interface is syntactically used to construct workflows. The components used in workflows are provided by Anduril bundles, which are collections of components similar to R or Python extension packages. Anduril components provide a uniform interface through which they are used in Scala. The interfaces of components consist of three things: input ports (files or folders read by the component), output ports (files or folders written by the component) and parameters (strings, numbers or Booleans that modify logic). Component interfaces can be browsed online and generated locally using anduril build-doc.

Instantiating the INPUT component

The line val input = INPUT(path = "data.csv") places on the workflow an instance of the INPUT component, which exists in a bundle called builtin. The builtin bundle is imported using import anduril.builtin._. The anduril.builtin package provides a Scala class called INPUT that encapsulates the interface of this component, and a constructor function that is used to place instances of INPUT on the workflow. Conceptually, the code in the builtin bundle looks like the following:

/** Create an instance of the INPUT component, and place it on the workflow. */
def INPUT(path: String, recursive: Boolean = true): INPUT = {
    // 1. Create an instance of INPUT class using given parameters
    // 2. Place the instance on the workflow
    // 3. Return the instance
}

/** Encapsulate the implementation of the INPUT component. Provide access
 * to output ports of the component. */
class INPUT extends org.anduril.runtime.Component {
    /** Handle to the imported file or folder. */
    val out: org.anduril.runtime.Port = ...
    ...
}

INPUT has no input ports, two parameters (path and recursive), and one output port (out). We create an instance by providing a value ("data.csv") for path; we can omit recursive because it has a default value. After creating the instance, we can access the handle of the imported file as input.out. Keep in mind that during workflow construction, input.out does not contain the actual contents of the file, because the workflow has not yet been executed. Rather, it is a handle in the workflow configuration network.
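
To make the distinction between a handle and file contents concrete, here is a minimal self-contained sketch. It does not use the Anduril runtime: Port, Workflow, Input and the input function are toy stand-ins invented for this example, and the explicit instanceName argument is an assumption of the sketch (the real interface does not take one).

```scala
import scala.collection.mutable.ListBuffer

// Toy stand-ins for illustration only; the real classes live in
// org.anduril.runtime and anduril.builtin and are more involved.
object HandleSketch {
  /** A handle naming an output of a component instance; it holds no file data. */
  final case class Port(component: String, name: String)

  /** Records the names of the instances placed on a (toy) workflow. */
  final class Workflow { val placed: ListBuffer[String] = ListBuffer.empty }

  final class Input(instanceName: String, val path: String, val recursive: Boolean) {
    val out: Port = Port(instanceName, "out")
  }

  /** Mirrors the three steps of the constructor function: create, place, return. */
  def input(instanceName: String, path: String, recursive: Boolean = true)
           (implicit wf: Workflow): Input = {
    val instance = new Input(instanceName, path, recursive) // 1. create
    wf.placed += instanceName                               // 2. place on the workflow
    instance                                                // 3. return
  }

  def main(args: Array[String]): Unit = {
    implicit val wf: Workflow = new Workflow
    // Nothing is read during construction, so the file need not exist yet.
    val in = input("input", path = "data.csv")
    println(in.out)    // prints Port(input,out)
    println(wf.placed) // prints ListBuffer(input)
  }
}
```

The sketch never opens data.csv; constructing the instance only records it on the workflow and hands back a handle, which is the behavior described above for input.out.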

Instantiating the CSVSort component

In a similar manner, val sorted = CSVSort(input.out, types = "Gene=string") instantiates the CSVSort component. CSVSort lives in the tools bundle (anduril.tools). Its interface looks like the following:

def CSVSort(in: anduril.builtin.CSV, types: String = ""): CSVSort = { ... }

class CSVSort extends org.anduril.runtime.Component {
    val out: anduril.builtin.CSV = ...
    val status: anduril.builtin.TextFile = ...
    ...
}

type anduril.builtin.CSV = org.anduril.runtime.Port
type anduril.builtin.TextFile = org.anduril.runtime.Port

CSVSort has one input port, named in, whose type is CSV. The type CSV is provided by the builtin bundle. Anduril uses port types to verify the correctness of workflows, which helps prevent errors such as trying to provide a ZIP file to CSVSort. CSVSort also has a String parameter named types (it has other parameters as well, which are omitted here for simplicity). The types parameter specifies the types of the columns for sorting, such as numeric or string. From the documentation of CSVSort, we learn that the default sorting is numeric, so we specify string sorting for the Gene column in our workflow.
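
To illustrate what string sorting on the Gene column means for the example data, the following self-contained sketch sorts the rows of data.csv in plain Scala. It is not how CSVSort is implemented; SortByGeneSketch and sortByStringColumn are hypothetical names used only here.

```scala
// Plain-Scala illustration of sorting a tab-delimited table by a string column.
// This is NOT the CSVSort implementation, only a model of the expected result.
object SortByGeneSketch {
  /** Keeps the header line first and sorts the remaining rows by the named
    * column, comparing the cell values as strings. */
  def sortByStringColumn(lines: List[String], column: String): List[String] = lines match {
    case header :: rows =>
      val idx = header.split("\t").indexOf(column)
      require(idx >= 0, s"no such column: $column")
      header :: rows.sortBy(row => row.split("\t")(idx))
    case Nil => Nil
  }

  def main(args: Array[String]): Unit = {
    // Contents of data.csv from the beginning of this section.
    val data = List(
      "Gene\tValue\tQualityOK",
      "gene02\t2.7\t1",
      "gene01\t1.5\t1",
      "gene03\t5.8\t0",
      "gene99\t3.2\t1")
    sortByStringColumn(data, "Gene").foreach(println)
  }
}
```

The rows come out in the order gene01, gene02, gene03, gene99, the same order as in result_two-components/sorted/out.csv.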

When creating the CSVSort instance, we need to provide a handle to a CSV file. We obtain this from our INPUT instance. INPUT is a generic component that can provide input files to any other component: its out port is a valid CSV file handle.

When the CSVSort instance is placed on the workflow, it is connected to the INPUT instance to mark a dependency. This happens automatically when we specify input.out as the value for the in argument.

CSVSort has two output ports: the sorted CSV file (out), and a simple text file (status) that indicates whether the input was already sorted. The latter has the type TextFile, provided by the builtin bundle. The output corresponding to status is in the file result_two-components/sorted/status.txt.

Syntax variation for accessing output ports

There are a few syntactic alternatives for accessing the output ports of a component (in this case, the out port of input):

  1. input.out accesses using a named field.
  2. input("out") accesses using a string parameter. This is useful when the port name is accessed dynamically, or you don’t know the type of the component that produced it and cannot rely on the presence of the named field.
  3. input without a port specifier works for components that have exactly one output port. The component itself is then interpreted as a port.
  4. input._1 (or input._2, etc.) accesses using position. The order is defined in the component documentation page.
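
The four alternatives listed above can be placed side by side in one workflow. The following is an untested sketch that assumes each form is accepted wherever a port is expected, as the list suggests; the object name PortAccessVariants and the instance names byField, byName, byDefault and byPosition are made up for this example.

```scala
#!/usr/bin/env anduril

import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

// Sketch only: four CSVSort instances that all read the same INPUT port,
// each using a different way of referring to it.
object PortAccessVariants {
    val input = INPUT(path = "data.csv")

    val byField    = CSVSort(input.out,    types = "Gene=string") // 1. named field
    val byName     = CSVSort(input("out"), types = "Gene=string") // 2. string lookup
    val byDefault  = CSVSort(input,        types = "Gene=string") // 3. single output port
    val byPosition = CSVSort(input._1,     types = "Gene=string") // 4. position
}
```

If the assumption holds, all four instances depend on input in the same way and should produce identical sorted files; the named-field form used in the main example is the most readable choice when the component type is known.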