7. Organizing workflows
Anduril supports the creation of workflows with a large number of components and a complex structure. Here we provide some coding-style suggestions to keep such workflows maintainable.
Basic structure: import, processing, export
A common basic workflow structure is to first import input files, then do the actual processing, and lastly to export the key output files to an easily identified location. Note that these three phases refer to code organization; workflow execution occurs in parallel.
We have already seen the INPUT component for importing local data files. Another import component is URLInput, which retrieves files over the Internet.
For exporting, Anduril provides OUTPUT, which copies (or creates symbolic links for) selected results into the output/ folder of the execution folder. Files are named according to the component and port. Using OUTPUT is optional, but it makes the end results easier to locate in a large workflow.
The following example shows this structure. The workflow joins two CSV files and exports the joined CSV file to output/joined-out.csv.
#!/usr/bin/env anduril
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object ThreePhase {
    // Import
    val data1 = INPUT(path = "data1.csv")
    val data2 = INPUT(path = "data2.csv")

    // Processing
    val joined = CSVJoin(data1, data2, intersection = false)

    // Output
    OUTPUT(joined.out)
}
Encapsulating sub-workflows
Large workflows can be divided into more manageable sub-workflows using Scala functions. Here, sub-workflow means a part of the workflow that takes some files as input and produces some files as output. Encapsulation is visible at the Scala code level, but the workflow engine executes a flat workflow as before, though component names reflect the code structure.
Below is a CSV joining workflow that uses functions for importing and processing. This time, we have three CSV files (data1-3.csv) and want to produce two joins (data1.csv plus data2.csv, and data2.csv plus data3.csv). We also want to sort the joined files. We chose not to encapsulate exporting because it is a trivial step. The importData function takes no files as input and produces a map of the input files as its result. The processing step is encapsulated in joinCSV, which takes two CSV files as input and produces one CSV file as output. This step benefits greatly from function encapsulation, because joinCSV is reused to make the two joins. joinCSV is also easier to maintain because it is abstracted from the rest of the workflow (e.g., it does not need to know the names of the inputs).
#!/usr/bin/env anduril
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object JoinFunctions {
    def importData(): NamedMap[CSV] = {
        val allData = NamedMap[CSV]("data")
        allData("data1") = INPUT(path = "data1.csv")
        allData("data2") = INPUT(path = "data2.csv")
        allData("data3") = INPUT(path = "data3.csv")
        allData
    }

    def joinCSV(file1: CSV, file2: CSV): CSV = {
        val joined = CSVJoin(file1, file2, intersection = false)
        val sorted = CSVSort(joined.out, types = "Gene=string")
        return sorted.out
    }

    val data = importData()
    val joined12 = joinCSV(data("data1"), data("data2"))
    val joined23 = joinCSV(data("data2"), data("data3"))
    OUTPUT(joined12)
    OUTPUT(joined23)
}
When executed, the following components are created: joined12-joined, joined12-sorted, joined23-joined and joined23-sorted. Components obtain hierarchical names based on the function calls, which allows tracing between result files and the code structure.
Using multiple Scala files
Large projects benefit from splitting the source into multiple files. This is easily done using the -s SOURCE.scala flag to anduril run.
First, we extract the joinCSV function from our workflow so that we can maintain it separately and reuse it in other projects. We place the following library source file in split-helper.scala. Note that this file would not execute any components if executed by itself; we can also omit the shebang line.
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object JoinLibrary {
    def joinCSV(file1: CSV, file2: CSV): CSV = {
        val joined = CSVJoin(file1, file2, intersection = false)
        val sorted = CSVSort(joined.out, types = "Gene=string")
        return sorted.out
    }
}
Then, we write the main workflow script that imports the library using -s split-helper.scala. This flag is placed in a special comment in the header section of the Scala file, so we don't have to provide it explicitly on the command line.
#!/usr/bin/env anduril
//$OPT -s split-helper.scala
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object SplitMain {
    def importData(): NamedMap[CSV] = {
        val allData = NamedMap[CSV]("data")
        allData("data1") = INPUT(path = "data1.csv")
        allData("data2") = INPUT(path = "data2.csv")
        allData("data3") = INPUT(path = "data3.csv")
        allData
    }

    val data = importData()
    val joined12 = JoinLibrary.joinCSV(data("data1"), data("data2"))
    val joined23 = JoinLibrary.joinCSV(data("data2"), data("data3"))
    OUTPUT(joined12)
    OUTPUT(joined23)
}
Using precompiled JARs (advanced)
Whereas the -s flag provides a convenient way to include a few Scala source files, anduril run can also execute precompiled JAR (Java Archive) files that are produced using SBT or other Scala compilation facilities. Please refer to the SBT documentation for instructions. In your build, ensure that $ANDURIL_HOME/anduril.jar and all necessary $ANDURIL_HOME/bundles/BUNDLE/BUNDLE.jar files are on the classpath.
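As one possible way to satisfy this requirement (a sketch, not taken from the Anduril documentation), an SBT build can add the Anduril JARs as unmanaged dependencies. The snippet below assumes ANDURIL_HOME is set in the environment:

```scala
// build.sbt -- hedged sketch; adjust for your SBT version and layout
val andurilHome = file(sys.env("ANDURIL_HOME"))

// Put anduril.jar and every bundle JAR on the compile classpath
Compile / unmanagedJars += Attributed.blank(andurilHome / "anduril.jar")
Compile / unmanagedJars ++= ((andurilHome / "bundles") ** "*.jar").classpath
```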
To execute a precompiled JAR file stored in workflow.jar, type anduril run workflow.jar. Your JAR file should contain an object with a main(args: Array[String]) method, which is the entry point to the workflow.
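A minimal entry point might look like the following sketch, which mirrors the structure of the script examples above (the object name and input path are hypothetical):

```scala
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object JarWorkflow {
    // Entry point invoked by "anduril run workflow.jar"
    def main(args: Array[String]): Unit = {
        val data = INPUT(path = "data1.csv")
        OUTPUT(data)
    }
}
```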
Running scripts before and after workflow
In some cases, you may need to run setup scripts before executing a workflow and/or cleanup scripts after it. Although these could be wrapped in an encapsulating shell script, Anduril provides a convenient syntax for pre- and postprocessing.
When executing the following workflow, Anduril first runs the //$PRE scripts (in order), then executes the workflow, and finally runs the //$POST scripts. Each script is executed in the shell and can invoke commands such as mounting file systems, copying files, and removing files. Pre- and post-scripts can be combined with //$OPT to pass additional arguments to anduril run.
#!/usr/bin/env anduril
//$PRE echo Preprocessing 1
//$PRE echo Preprocessing 2
//$POST echo Postprocessing 1
//$POST echo Postprocessing 2
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object PrePost {
    // ...
}