Implementation: Composite

Let’s add a second component to the demo bundle. This component is a preprocessor for a simple fictional GeneCSV format: a CSV file with the columns Gene, Value and QualityOK. The preprocessor takes a CSV file as input, removes rows that fail the quality check (QualityOK = 0), and can optionally select a subset of genes. We also want to compute various statistics for the unfiltered CSV file. We call the component PreprocessGeneCSV and implement it as a composite component, because the logic divides naturally into a filtering part and a statistics part.
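The intended filtering behaviour can be sketched in plain Scala on in-memory rows. This is only an illustration of the semantics, not Anduril code: GeneRow, GeneCsvSemantics and preprocess are hypothetical names, QualityOK is modelled as a Boolean, and the sketch assumes that the regular expression must match the whole gene identifier.

```scala
// Plain-Scala illustration only; GeneRow and preprocess are hypothetical
// names and are not part of Anduril.
final case class GeneRow(gene: String, value: Double, qualityOK: Boolean)

object GeneCsvSemantics {
  // Keeps the rows that pass both the gene selection and the quality check.
  def preprocess(rows: Seq[GeneRow],
                 excludeBadQuality: Boolean = true,
                 includeGenes: String = ".*"): Seq[GeneRow] =
    rows.filter { row =>
      // Assumption: the pattern must match the whole gene identifier.
      val geneSelected = row.gene.matches(includeGenes)
      // Bad-quality rows are dropped only when excludeBadQuality is set.
      val qualityAccepted = !excludeBadQuality || row.qualityOK
      geneSelected && qualityAccepted
    }
}
```

With the default parameter values, only the quality check removes rows; setting includeGenes to a narrower pattern such as "BRCA.*" additionally restricts the output to the matching genes.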

Example interface

The interface definition (~/my-bundles/demo/functions/PreprocessGeneCSV/component.xml) is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<component>
    <name>PreprocessGeneCSV</name>
    <version>1.0</version>
    <doc>
        Preprocess a Gene CSV file: remove rows of bad quality, select genes
        by regular expression, and compute statistics for the unfiltered input.
    </doc>
    <author email="author.email@example.com">Author Name</author>
    <inputs>
        <input name="in" type="CSV">
            <doc>Gene CSV file with columns Gene, Value and QualityOK.</doc>
        </input>
    </inputs>
    <outputs>
        <output name="out" type="CSV">
            <doc>Preprocessed CSV file.</doc>
        </output>
        <output name="statistics" type="CSV">
            <doc>Statistics about the input file.</doc>
        </output>
    </outputs>
    <parameters>
        <parameter name="excludeBadQuality" type="boolean" default="true">
            <doc>
                If true, rows having QualityOK=0 are excluded from output.
            </doc>
        </parameter>
        <parameter name="includeGenes" type="string" default=".*">
            <doc>
                Regular expression that matches gene identifiers to be included.
            </doc>
        </parameter>
    </parameters>
</component>

The format is exactly the same as for atomic components, except that the <launcher> element is not used. Instead, Anduril looks for the implementation in function.scala in the component folder.

Scala implementation

The file ~/my-bundles/demo/functions/PreprocessGeneCSV/function.scala contains:

def PreprocessGeneCSV(in: CSV, excludeBadQuality: Boolean, includeGenes: String): (CSV, CSV) = {
    // CSVFilter is placed only when at least one filter is active.
    val hasFilter = (includeGenes != ".*") || excludeBadQuality
    val filteredCSV: CSV =
        if (hasFilter) {
            val regexp: String =
                "Gene=%s".format(includeGenes) +
                (if (excludeBadQuality) ",QualityOK=1" else "")
            val filtered = CSVFilter(in, regexp = regexp)
            filtered.out
        } else {
            in
        }

    // Statistics are computed from the unfiltered input.
    val statisticsScript = INPUT(path = "statistics.r")
    val statistics = REvaluate(statisticsScript, in)

    (filteredCSV, statistics.table)
}

Composite components are implemented using Scala functions. The file function.scala is not a standalone Scala file, but rather a fragment that is inserted into a Scala template. Otherwise, the function is much like what you might write in a workflow definition file.

We used CSVFilter and REvaluate from the tools bundle, which is why we needed to declare a dependency on that bundle in bundle.xml. We optimized the component so that it places CSVFilter only when a filter is actually active. Thus, the set of atomic components placed on the workflow is not constant, and the component can be seen as a template rather than a fixed workflow.

You can use helper functions defined as private def in function.scala to avoid making the main function too long.
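As an illustration, the construction of the filter expression in PreprocessGeneCSV could be moved into such a helper. The following is a minimal sketch: buildFilterRegexp is a hypothetical name, and the enclosing object exists only so that the sketch compiles on its own. In function.scala the helper would be a private def next to the main function.

```scala
object FilterRegexpSketch {
  // Hypothetical helper; in function.scala it would be declared as
  // `private def buildFilterRegexp(...)` alongside PreprocessGeneCSV.
  // It builds the same "column=pattern" string as the inline code in
  // PreprocessGeneCSV.
  def buildFilterRegexp(includeGenes: String, excludeBadQuality: Boolean): String = {
    val geneRule = "Gene=%s".format(includeGenes)
    if (excludeBadQuality) geneRule + ",QualityOK=1" else geneRule
  }
}
```

The main function would then pass the returned string as the regexp parameter of CSVFilter.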

Local resources and relative paths

To implement the statistics computation, we place the code into an external R file (statistics.r) that is imported into the workflow (the R code is omitted for simplicity). When a local file name is used, as here, it is resolved against the folder containing the component, i.e., ~/my-bundles/demo/functions/PreprocessGeneCSV. Thus, we can place resource files into the component folder and import them using relative paths. The same applies to atomic components.

Signature of the Scala function

The name and signature of the function must match the interface declared in component.xml. Input ports and parameters must appear in the same order as in component.xml, and all input ports come before the parameters in the function signature. The mapping between parameter types in component.xml and Scala is as follows:

Interface type    Scala type
boolean           Boolean
float             Double
int               Int
string            String

The return type is a tuple that contains one element for each output port of the component. In our case, the component has two CSV output ports, so the return type is (CSV, CSV).
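The rules above can be sketched with a hypothetical component that has one input port, one parameter of each interface type and two output ports. ExampleComposite and its parameter names are invented for illustration, and the CSV case class is only a stand-in for Anduril's port type so that the sketch compiles outside Anduril.

```scala
// Stand-in for Anduril's CSV port type; not the real Anduril class.
final case class CSV(label: String)

object SignatureSketch {
  // Hypothetical component: the input port comes first, followed by the
  // parameters of interface types boolean, float, int and string.
  // Two output ports give a two-element tuple as the return type.
  def ExampleComposite(in: CSV,
                       flag: Boolean,
                       threshold: Double,
                       count: Int,
                       label: String): (CSV, CSV) = {
    // Placeholder body: a real component would place other components here.
    (in, CSV(label))
  }
}
```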

Array ports are declared with their base type in the Scala signature (for example, CSV); there is no generic counterpart such as Array[CSV].