2. Linear workflow
The “Hello world” workflow in the previous section was trivial because it consisted of only one component. In this section, we construct a workflow from two components, one of which depends on the other.
Assume we have the following tab-delimited file (data.csv):
Gene Value QualityOK
gene02 2.7 1
gene01 1.5 1
gene03 5.8 0
gene99 3.2 1
We wish to sort it by the Gene column so that the gene names are in order. The following Scala code does this:
#!/usr/bin/env anduril

import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object TwoComponents {
    val input = INPUT(path = "data.csv")
    val sorted = CSVSort(input.out, types = "Gene=string")
}
When executed, the workflow prints the following:
$ ./two-components.scala
[INFO <run-workflow>] Current ready queue: input (READY-QUEUE 1)
[INFO input] Executing input (anduril.builtin.INPUT) (SOURCE two-components.scala:8) (COMPONENT-STARTED) (2016-04-28 10:40:07)
[INFO input] Component finished with success (COMPONENT-FINISHED-OK) (2016-04-28 10:40:07)
[INFO input] Current ready queue: sorted (READY-QUEUE 1)
[INFO sorted] Executing sorted (anduril.tools.CSVSort) (SOURCE two-components.scala:9) (COMPONENT-STARTED) (2016-04-28 10:40:07)
[INFO sorted] Component finished with success (COMPONENT-FINISHED-OK) (2016-04-28 10:40:07)
[INFO sorted] Current ready queue: (empty) (READY-QUEUE 0)
[INFO <run-workflow>] Done. No errors occurred.
The file result_two-components/sorted/out.csv now contains a sorted version of the tab-delimited file:
Gene Value QualityOK
gene01 1.5 1
gene02 2.7 1
gene03 5.8 0
gene99 3.2 1
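To make the effect of the sort concrete, the following plain Scala sketch reproduces the result above without Anduril. It is only an illustration: the object SortByGene and the function sortByColumn are hypothetical names, and this is not how CSVSort is implemented.

```scala
object SortByGene {
  /** Sort the data rows of a tab-delimited table by the named column,
    * comparing the values as strings. The header row stays first. */
  def sortByColumn(lines: Seq[String], column: String): Seq[String] = {
    val header = lines.head
    val index = header.split("\t").indexOf(column)
    require(index >= 0, s"No such column: $column")
    header +: lines.tail.sortBy(row => row.split("\t")(index))
  }

  def main(args: Array[String]): Unit = {
    // Same rows as data.csv in this section.
    val table = Seq(
      "Gene\tValue\tQualityOK",
      "gene02\t2.7\t1",
      "gene01\t1.5\t1",
      "gene03\t5.8\t0",
      "gene99\t3.2\t1")
    sortByColumn(table, "Gene").foreach(println)
  }
}
```

In the workflow, the same intent is expressed declaratively through the types = "Gene=string" parameter of CSVSort.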
Understanding the workflow
This workflow uses two new components, INPUT and CSVSort. INPUT is a fundamental component that is used to import data files from the file system to the workflow. It is used in nearly every workflow. CSVSort is a more specialized component that takes a tab-delimited file as an input, sorts it, and writes a tab-delimited file as output.
Workflow execution proceeds as follows:
- In the beginning, input is ready to execute because it does not depend on other components. sorted cannot yet be executed.
- input is executed.
- All dependencies of sorted (i.e., input) are now available, so it is enabled for execution.
- sorted is executed.
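The execution steps above can be sketched in plain Scala as a small ready-queue loop. This is a simplified model for illustration, not Anduril's scheduler: the names ReadyQueueSketch and executionOrder are assumptions, components run one at a time, and "executing" a component only prints a message.

```scala
object ReadyQueueSketch {
  /** Return the order in which components would be executed.
    * `deps` maps each component name to the names it depends on. */
  def executionOrder(deps: Map[String, Set[String]]): List[String] = {
    def loop(finished: List[String], pending: Map[String, Set[String]]): List[String] = {
      if (pending.isEmpty) finished.reverse
      else {
        // A component is ready when all of its dependencies have finished.
        val ready = pending.collect {
          case (name, ds) if ds.forall(d => finished.contains(d)) => name
        }.toList.sorted
        require(ready.nonEmpty, "Cyclic dependency: no component is ready")
        println(s"Current ready queue: ${ready.mkString(" ")}")
        val next = ready.head
        println(s"Executing $next")
        loop(next :: finished, pending - next)
      }
    }
    loop(Nil, deps)
  }

  def main(args: Array[String]): Unit = {
    // The two-component workflow: sorted depends on input.
    val deps = Map("input" -> Set.empty[String], "sorted" -> Set("input"))
    println(s"Done: ${executionOrder(deps).mkString(", ")}")
  }
}
```

With the dependencies of this section, the sketch executes input first and sorted second, mirroring the READY-QUEUE messages in the log.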
It is often helpful to mentally map the Scala source code to a workflow of component dependencies. In this case, the dependency network is a single chain in which sorted depends on input:
input → sorted
Placing components on the workflow using Scala
At this point, we need to understand how the Anduril Scala interface is
syntactically used to construct workflows. The components used in workflows
are provided by Anduril bundles, which are collections of components similar
to R or Python extension packages. Anduril components provide a uniform
interface through which they are used in Scala. The interfaces of components
consist of three things: input ports (files or folders read by the
component), output ports (files or folders written by the component) and
parameters (strings, numbers or Booleans that modify logic). Component
interfaces can be browsed online and generated locally using anduril build-doc.
Instantiating the INPUT component
The line val input = INPUT(path = "data.csv") places on the workflow an
instance of the INPUT component, which exists in a bundle called builtin. The
statement import anduril.builtin._ imports the builtin bundle. The
anduril.builtin package provides a Scala class called INPUT that encapsulates
the interface of this component, and a constructor function that is used to
place instances of INPUT on the workflow. Conceptually, the code in the
builtin bundle looks like the following:
/** Create an instance of the INPUT component, and place it on the workflow. */
def INPUT(path: String, recursive: Boolean = true): INPUT = {
    // 1. Create an instance of INPUT class using given parameters
    // 2. Place the instance on the workflow
    // 3. Return the instance
}
/** Encapsulate the implementation of the INPUT component. Provide access
 * to output ports of the component. */
class INPUT extends org.anduril.runtime.Component {
    /** Handle to the imported file or folder. */
    val out: org.anduril.runtime.Port = ...
    ...
}
INPUT has no input ports, two parameters (path and recursive), and one output
port (out). We create an instance by providing a value ("data.csv") for path;
we can omit recursive because it has a default value. After creating the
instance, we can access the handle of the imported file as input.out. Keep in
mind that during workflow construction, input.out does not contain the actual
contents of the file, because the workflow has not yet been executed. Rather,
it is a handle in the workflow configuration network.
Instantiating the CSVSort component
In a similar manner, val sorted = CSVSort(input.out, types = "Gene=string")
instantiates the CSVSort component. CSVSort lives in the tools bundle
(anduril.tools). Its interface looks like the following:
def CSVSort(in: anduril.builtin.CSV, types: String = ""): CSVSort = { ... }
class CSVSort extends org.anduril.runtime.Component {
    val out: anduril.builtin.CSV = ...
    val status: anduril.builtin.TextFile = ...
    ...
}
type anduril.builtin.CSV = org.anduril.runtime.Port
type anduril.builtin.TextFile = org.anduril.runtime.Port
CSVSort has one input port, named in, whose type is CSV. The type CSV is
provided by the builtin bundle. Port types are used by Anduril to verify the
correctness of workflows: this reduces errors such as trying to provide a ZIP
file for CSVSort. CSVSort also has a String parameter named types (it
actually has other parameters as well, but they are omitted for simplicity).
The types parameter specifies the types of the columns for sorting, such as
numeric or string. From the documentation of CSVSort, we learn that the
default sorting is numeric, so we specify string sorting for the Gene column
in our workflow.
When creating the CSVSort instance, we need to provide a handle to a CSV file.
We obtain this from our INPUT instance. INPUT is a generic component that
can provide input files to any other component: its out port is a valid
CSV file handle.
When the CSVSort instance is placed on the workflow, it is connected to the
INPUT instance to mark a dependency. This happens automatically when we
specify input.out as the value for the in argument.
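The two ideas above, that a port is a handle rather than file contents and that passing a handle records a dependency, can be modeled in a few lines of plain Scala. This is a toy sketch under our own assumptions: Port, Workflow, place and dependenciesOf are hypothetical names and do not correspond to the classes in org.anduril.runtime.

```scala
import scala.collection.mutable

object HandleSketch {
  /** A handle to an output of a component instance. It names the producer
    * and the port but holds no file contents. */
  final case class Port(producer: String, name: String)

  /** Hypothetical stand-in for the workflow under construction. */
  final class Workflow {
    private val edges = mutable.LinkedHashMap.empty[String, Set[String]]

    /** Place an instance on the workflow. Its dependencies are derived
      * from the producers of the ports it is given as inputs. */
    def place(instance: String, inputs: Port*): Port = {
      edges(instance) = inputs.map(_.producer).toSet
      Port(instance, "out")
    }

    def dependenciesOf(instance: String): Set[String] =
      edges.getOrElse(instance, Set.empty[String])
  }

  def main(args: Array[String]): Unit = {
    val workflow = new Workflow
    val inputOut = workflow.place("input")              // no input ports
    val sortedOut = workflow.place("sorted", inputOut)  // consumes the handle of input
    println(s"$sortedOut depends on ${workflow.dependenciesOf("sorted")}")
  }
}
```

Nothing in the sketch reads data.csv: constructing the workflow only builds the network of handles, and the files are touched later, when the workflow is executed.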
CSVSort has two output ports: the sorted CSV file (out), and a simple text
file (status) that indicates whether the input was already sorted. The latter
has the type TextFile, provided by the builtin bundle. The output
corresponding to status is in the file
result_two-components/sorted/status.txt.
Syntax variation for accessing output ports
There are a few syntactic alternatives for accessing the output ports of a
component (in this case, the out port of input):
- input.out accesses using a named field.
- input("out") accesses using a string parameter. This is useful when the port name is accessed dynamically, or you don’t know the type of the component that produced it and cannot rely on the presence of the named field.
- input without a port specifier works for components that have exactly one output port. In this case, the component is interpreted as a port in this context.
- input._1 (or input._2, etc.) accesses using position. The order is defined in the component documentation page.
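For readers curious how one Scala class can offer all four access styles, the following self-contained sketch shows the language features that make each form possible: a field, an apply method, a positional accessor, and an implicit conversion. It is a toy model under our own assumptions, not Anduril's implementation; Component, Port, asPort and describe are hypothetical names.

```scala
import scala.language.implicitConversions

object PortAccessSketch {
  /** Handle to one output of a component instance. */
  final case class Port(owner: String, name: String)

  /** Toy component with exactly one output port, named "out". */
  final class Component(val instanceName: String) {
    /** Access by named field: component.out */
    val out: Port = Port(instanceName, "out")

    /** Access by port name: component("out") */
    def apply(portName: String): Port =
      if (portName == "out") out
      else throw new NoSuchElementException(s"No output port named $portName")

    /** Access by position: component._1 */
    def _1: Port = out
  }

  object Component {
    /** Lets a single-output component be passed where a Port is expected. */
    implicit def asPort(component: Component): Port = component.out
  }

  /** Stand-in for a downstream component that takes a port as its input. */
  def describe(in: Port): String = s"${in.owner}.${in.name}"

  def main(args: Array[String]): Unit = {
    val input = new Component("input")
    val viaField = describe(input.out)
    val viaName = describe(input("out"))
    val viaPosition = describe(input._1)
    val viaComponent = describe(input) // implicit conversion to Port
    // All four expressions refer to the same port.
    println(Seq(viaField, viaName, viaPosition, viaComponent).distinct)
  }
}
```

The named field is checked by the compiler, whereas a lookup by string can only fail at run time, which is one reason to prefer input.out when the component type is known.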