4. Array ports
In many cases, you need to process several data files using Anduril. This and the following sections introduce facilities for working with large data sets. The first question to address is how to provide multiple data files as input to a component, particularly when the number of data files is not fixed. Anduril has a specific port category, the array port, for this case. Every regular port type has an array counterpart, so you can have CSV port arrays, TextFile port arrays, etc.
Assume we have the following tab-delimited files (data1.csv and data2.csv):
Gene Value QualityOK
gene01 1.5 1
gene02 2.7 0
gene03 5.8 0
gene99 3.2 1
Gene Value QualityOK
gene01 2.1 0
gene02 0.3 1
gene03 3.6 1
gene99 1.4 1
In the following example, we concatenate the above CSV files into one CSV file using CSV array ports. CSVListJoin is a component that can take either individual CSV files or an array of CSV files as input.
#!/usr/bin/env anduril
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object ArrayPort {
    val data1 = INPUT(path = "data1.csv")
    val data2 = INPUT(path = "data2.csv")
    // Map keys become the array keys; values are the input files
    val myData = Map("sample1" -> data1, "sample2" -> data2)
    // Implicitly create an array index from the Map
    val joined = CSVListJoin(in = myData)
    // Explicit alternative
    val joinedExplicit = CSVListJoin(in = makeArray(myData))
}
When executed, result_array/joined/out.csv contains:
file Gene Value QualityOK
sample1 gene01 1.5 1
sample1 gene02 2.7 0
sample1 gene03 5.8 0
sample1 gene99 3.2 1
sample2 gene01 2.1 0
sample2 gene02 0.3 1
sample2 gene03 3.6 1
sample2 gene99 1.4 1
Understanding the workflow
We modeled our data set using a Scala Map instance that specifies a unique identifier for each data file. Anduril converts this map into a CSV array instance in the call that defines joined. In many cases, Anduril can do this conversion automatically; when it cannot infer the need for an array port, you can convert explicitly using the makeArray function (from org.anduril.runtime), as shown in joinedExplicit.
To understand array ports, it is helpful to know the file system layout related to them. When Anduril creates an array from non-array inputs, it writes an array index file that contains unique keys and filenames. In our case, it looks like the following:
Key File
sample1 /home/user/data/data1.csv
sample2 /home/user/data/data2.csv
Components that support array inputs, such as CSVListJoin, read these index files and process all files defined in the array.
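To make the index layout concrete, here is a minimal plain-Scala sketch of how a reader might turn such an index into an ordered list of key/path pairs. It is not Anduril's own code: the object name ArrayIndexSketch, the function parseIndex, and the assumption that the index is tab-delimited with a single header row are all illustrative.

```scala
object ArrayIndexSketch {
  // Parse the text of an index file into (key, path) pairs, keeping file order.
  // Assumption: tab-delimited layout with one header line ("Key", "File").
  def parseIndex(text: String): Vector[(String, String)] =
    text.split("\n").toVector
      .map(_.trim)
      .filter(_.nonEmpty)
      .drop(1) // skip the header row
      .map { line =>
        val cols = line.split("\t")
        require(cols.length == 2, s"expected two columns: $line")
        cols(0) -> cols(1)
      }

  // The index from the example above, written as a string literal.
  val example: String =
    "Key\tFile\n" +
      "sample1\t/home/user/data/data1.csv\n" +
      "sample2\t/home/user/data/data2.csv\n"
}
```

Calling ArrayIndexSketch.parseIndex(ArrayIndexSketch.example) yields the two pairs in file order, which is the information a component needs to visit every file in the array.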
Keys allow tracking files using human-readable names. In our case, we defined the keys in the Scala Map definition, Anduril used them to construct the array port index, and CSVListJoin wrote them to the output so that you can see which row comes from which CSV file (this behavior is configurable).
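The effect of carrying keys into the output can be sketched in plain Scala. This is only an illustration of the idea, not the CSVListJoin implementation: the names KeyedJoinSketch, Table, and joinWithKeys are hypothetical, tables are assumed to be already parsed into cells, and all inputs are assumed to share the same header.

```scala
object KeyedJoinSketch {
  // A table is a header row followed by data rows; cells are already split.
  type Table = Vector[Vector[String]]

  // Concatenate tables that share a header, prefixing each data row with
  // the array key of the table it came from.
  def joinWithKeys(tables: Seq[(String, Table)], keyColumn: String = "file"): Table = {
    require(tables.nonEmpty, "need at least one table")
    require(tables.forall(_._2.nonEmpty), "every table needs a header row")
    val header = tables.head._2.head
    require(tables.forall(_._2.head == header), "all tables must share one header")
    val rows = tables.flatMap { case (key, table) => table.tail.map(key +: _) }
    (keyColumn +: header) +: rows.toVector
  }
}
```

Given the two example files keyed as sample1 and sample2, the result has the header file, Gene, Value, QualityOK followed by the rows of each file in key order, matching the shape of the output shown earlier.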
Extracting sub-maps from components having multiple outputs
Port arrays are mappings from keys to files, and thus an additional step is needed when constructing arrays from components that have multiple output ports. The previous example did not need this step, because INPUT has only one output port and thus INPUT instances can automatically be interpreted as ports.
The standard Scala method Map.mapValues is useful for extracting specific output ports into a sub-map. The following example explicitly extracts the out port and creates a CSV array from sample identifiers to extracted ports:
val subData = myData mapValues { _.out }
val joined = CSVListJoin(in = subData)
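The same extraction pattern can be tried outside Anduril with ordinary Scala values. In the sketch below, Result is a hypothetical stand-in for a component instance with two output ports, and the file names are made up. It uses map with a pattern match instead of mapValues, because mapValues yields a lazy view and is deprecated from Scala 2.13 onward; map builds a strict Map in any Scala version.

```scala
object SubMapSketch {
  // Hypothetical stand-in for a component instance with two output ports.
  final case class Result(out: String, log: String)

  val results: Map[String, Result] = Map(
    "sample1" -> Result(out = "sample1/out.csv", log = "sample1/log.txt"),
    "sample2" -> Result(out = "sample2/out.csv", log = "sample2/log.txt")
  )

  // Keep the keys and replace each value with its `out` port.
  val outs: Map[String, String] = results.map { case (key, r) => key -> r.out }
}
```

The keys are untouched, so the resulting sub-map can serve as the key-to-file mapping that an array port expects.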
Passing arrays between components
In addition to being created from Scala Maps, arrays can be produced as the output of components. Components that do so indicate it on their documentation pages. Array ports are passed between components just like regular ports.