4. Array ports

You often need to process several data files with Anduril. This and the following sections introduce facilities for working with large data sets. The first question is how to pass multiple data files to a component as input, particularly when the number of files is not fixed. Anduril has a dedicated port category for this case: the array port. Every regular port type has an array counterpart, so you can have CSV port arrays, TextFile port arrays, and so on.

In the following example, we concatenate CSV files into one CSV file using CSV array ports. CSVListJoin is a component that can take either individual CSV files or an array of CSV files as input.

#!/usr/bin/env anduril

import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object ArrayPort {
  val data1 = INPUT(path = "data1.csv")
  val data2 = INPUT(path = "data2.csv")
  val myData = Map("sample1" -> data1, "sample2" -> data2)

  // Implicitly create an array index
  val joined = CSVListJoin(in = myData)
  // Explicit alternative
  val joinedExplicit = CSVListJoin(in = makeArray(myData))
}

When executed, result_array/joined/out.csv contains:

file        Gene    Value   QualityOK
sample1     gene01  1.5     1
sample1     gene02  2.7     0
sample1     gene03  5.8     0
sample1     gene99  3.2     1
sample2     gene01  2.1     0
sample2     gene02  0.3     1
sample2     gene03  3.6     1
sample2     gene99  1.4     1

Understanding the workflow

We modeled our data set as a Scala Map that assigns a unique identifier to each data file. Anduril converts this map into a CSV array when it is passed to CSVListJoin in the definition of joined. In many cases Anduril can do this conversion automatically, but when it cannot infer that an array port is needed, you can convert explicitly with the makeArray function (from org.anduril.runtime), as shown in joinedExplicit.

To understand array ports, it is helpful to know the file system layout related to them. When Anduril creates an array from non-array inputs, it writes an array index file that contains unique keys and filenames. In our case, it looks like the following:

Key         File
sample1     /home/user/data/data1.csv
sample2     /home/user/data/data2.csv

Components that support array inputs, such as CSVListJoin, read these index files and process all files defined in the array.
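To make the index format concrete, here is a minimal sketch in plain Scala that reads such an index into a map from keys to file paths. It is not Anduril code and not how CSVListJoin is implemented: the object name ArrayIndexSketch and the function parseIndex are hypothetical, and the sketch assumes one header row followed by whitespace-separated key and file columns, as in the listing above.

```scala
// Illustrative only: a stand-in for reading an array index, not Anduril's own reader.
object ArrayIndexSketch {

  /** Parses index text with a header row and two columns (key, file path).
    * Assumption: columns are separated by whitespace and keys contain no whitespace.
    */
  def parseIndex(text: String): Map[String, String] =
    text.split("\n").toSeq
      .map(_.trim)
      .filter(_.nonEmpty)
      .drop(1) // skip the "Key File" header row
      .flatMap { line =>
        // Split into the key and the rest of the line (the path).
        line.split("\\s+", 2) match {
          case Array(key, file) => Some(key -> file)
          case _                => None // ignore malformed rows
        }
      }
      .toMap
}
```

A component that accepts an array would then iterate over the resulting key/file pairs and open each file in turn.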

Keys allow files to be tracked by human-readable names. In our case, we defined the keys in the Scala Map, Anduril used them to construct the array port index, and CSVListJoin wrote them to the output so that you can see which row comes from which CSV file (this behavior is configurable).
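The way keys end up in the joined output can be sketched in plain Scala as follows. This only illustrates the idea and is not the CSVListJoin implementation: KeyedJoinSketch and join are hypothetical names, tables are modeled as in-memory rows rather than files, and all tables are assumed to share the same header.

```scala
// Illustrative only: models a keyed concatenation, not the real CSVListJoin.
object KeyedJoinSketch {

  /** Concatenates tables that share a header, prefixing every data row with its key.
    * Each table is a (key, rows) pair whose first row is the header.
    * Assumption: every table has at least a header row.
    */
  def join(tables: Seq[(String, Seq[Seq[String]])]): Seq[Seq[String]] = {
    // The output header is the first table's header with an extra "file" column in front.
    val header = tables.headOption.map { case (_, rows) => "file" +: rows.head }.toSeq
    // Every data row is tagged with the key of the table it came from.
    val body = for {
      (key, rows) <- tables
      row <- rows.drop(1)
    } yield key +: row
    header ++ body
  }
}
```

A Seq of pairs is used instead of a Map here so that the order of the input tables is explicit in the output.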

Extracting sub-maps from components having multiple outputs

Port arrays are mappings from keys to files, and thus an additional step is needed when constructing arrays from components that have multiple output ports. The previous example did not need this step, because INPUT has only one output port and thus INPUT instances can automatically be interpreted as ports.

The standard Scala method Map.mapValues is useful for extracting specific output ports into a sub-map. The following example explicitly extracts the out port and creates a CSV array from sample identifiers to extracted ports:

val subData = myData mapValues { _.out }
val joined = CSVListJoin(in = subData)
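The same extraction pattern can be tried outside Anduril with a plain Scala sketch. The case class TwoPortComponent below is a hypothetical stand-in for a component instance with two output ports, with ports modeled as plain strings. One caveat, independent of Anduril: in Scala 2.13 and later, Map.mapValues is deprecated and returns a lazy view, so the sketch uses map, which builds a strict Map on every Scala version. Which Scala version your Anduril installation uses is not stated here, so treat this as an assumption to check.

```scala
// Illustrative only: plain Scala, no Anduril types involved.
object SubMapSketch {

  // Hypothetical stand-in for a component instance with two output ports.
  final case class TwoPortComponent(out: String, log: String)

  /** Keeps the keys and replaces each component with its `out` port. */
  def outPorts(components: Map[String, TwoPortComponent]): Map[String, String] =
    components.map { case (key, component) => key -> component.out }
}
```

The result has the same keys as the input, which is what lets the sample identifiers carry through into the array index.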

Passing arrays between components

In addition to being created from Scala Maps, arrays can be produced as component outputs. A component's documentation page indicates which of its output ports are arrays. Array ports are passed between components just like regular ports.