5. Iteration
Often, we need to process multiple data files using similar methods for each file, and combine the processed files into one. Iteration facilities of Scala are useful for writing workflows for such data sets.
Let’s assume we have two tab-delimited files. data1.csv:
Gene Value QualityOK
gene01 1.5 1
gene02 2.7 0
gene03 5.8 0
gene99 3.2 1
data2.csv:
Gene Value QualityOK
gene01 2.1 0
gene02 0.3 1
gene03 3.6 1
gene99 1.4 1
If our data set is small, we can store metadata in a simple Scala Map that maps human-readable sample identifiers to file names. See below for a more scalable approach.
val samples = Map("sample1" -> "data1.csv", "sample2" -> "data2.csv")
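A Scala Map keeps only one value per key, so a repeated sample identifier would silently drop a file from the workflow. The following is a minimal sketch in plain Scala of a uniqueness check that could run before the Map is built. The object name SampleIds and the helper duplicates are hypothetical and are not part of Anduril.

```scala
object SampleIds {
  // Returns the identifiers that occur more than once in a list of
  // (sampleID, filename) pairs, in sorted order.
  def duplicates(pairs: Seq[(String, String)]): Seq[String] =
    pairs.groupBy(_._1)
      .collect { case (id, entries) if entries.size > 1 => id }
      .toSeq
      .sorted

  def main(args: Array[String]): Unit = {
    val pairs = Seq("sample1" -> "data1.csv", "sample2" -> "data2.csv")
    // Fail early instead of losing a file when the pairs are turned into a Map.
    require(duplicates(pairs).isEmpty, "duplicate sample identifiers: " + duplicates(pairs))
    val samples = pairs.toMap
    println(samples.size)
  }
}
```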
A first (bad) attempt
A first attempt might be to filter each CSV file in a for-loop, store the outputs to a Scala Map, and join all filtered files after the loop. Code:
#!/usr/bin/env anduril
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._
import scala.collection.mutable.Map
object IterationBad {
    val samples = Map("sample1" -> "data1.csv", "sample2" -> "data2.csv")
    val filteredMap = Map[String, CSV]()
    for ((sampleID, filename) <- samples) {
        val input = INPUT(path = filename)
        val filtered = CSVFilter(input, regexp = "QualityOK=1")
        filteredMap(sampleID) = filtered.out
    }
    val joined = CSVListJoin(in = filteredMap)
}
When executed, the workflow prints several lines, including the following two:
[INFO filtered] Executing filtered (anduril.tools.CSVFilter) (SOURCE iteration-bad.scala:12) (COMPONENT-STARTED) (2016-04-28 15:38:10)
[INFO filtered_anduril.tools.CSVFilter_1] Executing filtered_anduril.tools.CSVFilter_1 (anduril.tools.CSVFilter) (SOURCE iteration-bad.scala:12) (COMPONENT-STARTED) (2016-04-28 15:38:10)
After execution, the execution folder contains the correct result file:
file Gene Value QualityOK
sample1 gene01 1.5 1
sample1 gene99 3.2 1
sample2 gene02 0.3 1
sample2 gene03 3.6 1
sample2 gene99 1.4 1
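To make the expected content of the result file concrete, here is a small sketch in plain Scala that reproduces the filter-and-join logic on the two example tables. It does not use Anduril. It assumes that the filter keeps the rows whose QualityOK column equals 1 and that the join prefixes each row with its sample identifier. The object FilterAndJoin and its members are hypothetical names, and the header line of the joined file is omitted.

```scala
object FilterAndJoin {
  // Keep the data rows of a tab-delimited table whose QualityOK column equals "1".
  def filterQualityOK(lines: Seq[String]): Seq[String] = {
    val col = lines.head.split("\t").indexOf("QualityOK")
    lines.tail.filter(line => line.split("\t")(col) == "1")
  }

  // Prefix every surviving row with its sample identifier.
  def join(tables: Seq[(String, Seq[String])]): Seq[String] =
    tables.flatMap { case (sampleID, lines) =>
      filterQualityOK(lines).map(row => s"$sampleID\t$row")
    }

  // The two example tables from the text, one string per line.
  val data1 = Seq("Gene\tValue\tQualityOK",
    "gene01\t1.5\t1", "gene02\t2.7\t0", "gene03\t5.8\t0", "gene99\t3.2\t1")
  val data2 = Seq("Gene\tValue\tQualityOK",
    "gene01\t2.1\t0", "gene02\t0.3\t1", "gene03\t3.6\t1", "gene99\t1.4\t1")

  def main(args: Array[String]): Unit =
    join(Seq("sample1" -> data1, "sample2" -> data2)).foreach(println)
}
```

Running main prints the five data rows listed above, two for sample1 and three for sample2.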
Why is this a bad solution, when it seems to work? Recall from earlier
workflows that it has always been easy to see from the log messages and the
execution folder which Scala call created each component of the workflow. For
example,
val data = INPUT("data.csv")
creates a component named data, based on the variable name. In our iteration
case, Anduril does not have enough context inside the for-loop to generate
easily traceable component names. The first iteration produces a component
named filtered and the second filtered_anduril.tools.CSVFilter_1. It is
difficult to guess which one corresponds to sample1 and which to sample2.
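The collision can be imitated with a toy name registry in plain Scala. This is only an illustration of why a name derived from a variable stops being unique inside a loop. It is not Anduril's implementation, and the suffix rule is an assumption chosen to resemble the names in the log above. ToyNameRegistry and register are hypothetical names.

```scala
import scala.collection.mutable

// Toy registry: NOT Anduril's implementation, only an imitation of the log names.
class ToyNameRegistry {
  private val counts = mutable.Map[String, Int]()

  // The first use of a variable name keeps it as is; later uses of the same
  // name receive a suffix built from the component class and a counter.
  def register(valName: String, componentClass: String): String = {
    val n = counts.getOrElse(valName, 0)
    counts(valName) = n + 1
    if (n == 0) valName else s"${valName}_${componentClass}_$n"
  }
}

object ToyNamingDemo {
  def main(args: Array[String]): Unit = {
    val registry = new ToyNameRegistry
    // Both iterations use the same variable name, so the names carry no sample ID.
    for (sampleID <- Seq("sample1", "sample2"))
      println(sampleID + " -> " + registry.register("filtered", "anduril.tools.CSVFilter"))
  }
}
```

Neither generated name contains the sample identifier, which is the traceability problem described above.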
Proper naming using NamedMap / NamedSeq
The Anduril runtime library (org.anduril.runtime) provides two data structures,
NamedMap and NamedSeq, that behave like standard Scala Map and Seq, but
provide legible names to components that are inserted into them. NamedMap is a
mapping from strings to components, and should be used when string identifiers
for data files are available. NamedSeq is a simpler version that uses integer
indexes. The solution using NamedMap is:
#!/usr/bin/env anduril
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._
object IterationNamedMap {
    val samples = Map("sample1" -> "data1.csv", "sample2" -> "data2.csv")
    val inputMap = NamedMap[INPUT]("input")
    val filteredMap = NamedMap[CSVFilter]("filtered")
    for ((sampleID, filename) <- samples) {
        inputMap(sampleID) = INPUT(path = filename)
        filteredMap(sampleID) = CSVFilter(inputMap(sampleID), regexp = "QualityOK=1")
    }
    val joined = CSVListJoin(in = filteredMap)
}
Before the for-loop, two NamedMap objects are initialized, and they are given
descriptive name prefixes ("input" and "filtered") based on the items
inserted into them. Inside the for-loop, all component assignments are done
using NamedMaps. Generated names are composed of the prefix given in the
NamedMap constructor and the key given in the for-loop. The INPUT components
are named input_sample1 and input_sample2, and the CSVFilter components are
named filtered_sample1 and filtered_sample2.
Proper naming can be verified from execution logs, which include:
[INFO filtered_sample1] Executing filtered_sample1 (anduril.tools.CSVFilter) (SOURCE iteration-namedmap.scala:12) (COMPONENT-STARTED) (2016-04-28 16:14:35)
[INFO filtered_sample2] Executing filtered_sample2 (anduril.tools.CSVFilter) (SOURCE iteration-namedmap.scala:12) (COMPONENT-STARTED) (2016-04-28 16:14:35)
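The naming rule (prefix, underscore, key) can be sketched with a toy stand-in written in plain Scala. ToyNamedMap is a hypothetical class that only mimics the naming behaviour described above. It is not Anduril's NamedMap and it stores plain values instead of components.

```scala
import scala.collection.mutable

// Toy stand-in for the naming behaviour described in the text; not Anduril's NamedMap.
class ToyNamedMap[A](prefix: String) {
  // LinkedHashMap keeps insertion order, so the names come out in a stable order.
  private val items = mutable.LinkedHashMap[String, A]()

  // Enables the assignment syntax: map(key) = value
  def update(key: String, value: A): Unit = items.update(key, value)

  def apply(key: String): A = items(key)

  // Name that each inserted item would carry: prefix, underscore, key.
  def names: Seq[String] = items.keys.toSeq.map(k => s"${prefix}_$k")
}

object ToyNamedMapDemo {
  def main(args: Array[String]): Unit = {
    val filtered = new ToyNamedMap[String]("filtered")
    filtered("sample1") = "data1.csv"
    filtered("sample2") = "data2.csv"
    filtered.names.foreach(println) // filtered_sample1, then filtered_sample2
  }
}
```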
Syntactic sugar using withName
The solution using NamedMap exported all components from the for-loop to the
surrounding code block (inputMap, filteredMap). In some cases, we may
execute several internal steps inside the for-loop, but only wish to export a
subset of them. For this case, Anduril provides a withName function that
executes a code block in an environment that provides a name prefix for all
components created in the block. Example:
#!/usr/bin/env anduril
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._
import scala.collection.mutable.Map
object IterationWithName {
    val samples = Map("sample1" -> "data1.csv", "sample2" -> "data2.csv")
    val filteredMap = Map[String, CSV]()
    for ((sampleID, filename) <- samples) {
        withName(sampleID) {
            val input = INPUT(path = filename)
            val filtered = CSVFilter(input, regexp = "QualityOK=1")
            filteredMap(sampleID) = filtered
        }
    }
    val joined = CSVListJoin(in = filteredMap)
}
withName is a function that takes a name prefix as an argument (here,
sampleID) and inserts this prefix to all names created inside the given code
block. The names follow the pattern PREFIX-INSTANCENAME. In our case, the
names generated are sample1-input, sample2-input, sample1-filtered and
sample2-filtered. CSVFilter instances are exported from the for-loop, and
INPUT instances are hidden.
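The idea of running a block in an environment that supplies a name prefix can be imitated in plain Scala with scala.util.DynamicVariable. This is a sketch of the general technique only. ToyWithName, withPrefix and nameOf are hypothetical names, and nothing here reflects how Anduril implements withName internally.

```scala
import scala.util.DynamicVariable

// Toy imitation of scoped name prefixes; not Anduril's withName.
object ToyWithName {
  // Holds the prefix that is active for the code currently executing, if any.
  private val prefix = new DynamicVariable[Option[String]](None)

  // Run `body` with `p` as the active prefix; the previous value is restored afterwards.
  def withPrefix[A](p: String)(body: => A): A =
    prefix.withValue(Some(p))(body)

  // Name for an instance created in the current scope:
  // PREFIX-INSTANCENAME inside a withPrefix block, the bare name outside.
  def nameOf(instance: String): String =
    prefix.value.map(p => s"$p-$instance").getOrElse(instance)

  def main(args: Array[String]): Unit = {
    for (sampleID <- Seq("sample1", "sample2")) {
      withPrefix(sampleID) {
        println(nameOf("input"))
        println(nameOf("filtered"))
      }
    }
  }
}
```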
Note that here we can use a regular Scala Map instead of NamedMap to collect
CSVFilter instances, because the instances are properly named using val
filtered. It would be safe to use NamedMap here; it detects that components
are already properly named and does not insert its own prefix. When in doubt,
use NamedMap / NamedSeq.
Storing metadata in CSV files
If you have more than half a dozen data files, you probably want to store your
metadata outside the Scala source file. A good option is CSV files, as Anduril
provides convenient means of iterating over them. Of course, you can use any
Scala iteration facilities you prefer. Let’s store metadata in metadata.csv,
which is located in the same folder as the Scala source file:
SampleID Filename
sample1 data1.csv
sample2 data2.csv
We can now iterate over the CSV file using iterCSV from org.anduril.runtime:
#!/usr/bin/env anduril
import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._
import scala.collection.mutable.Map
object IterationCSV {
    val filteredMap = Map[String, CSV]()
    for (row <- iterCSV("metadata.csv")) {
        val sampleID = row("SampleID")
        val filename = row("Filename")
        withName(sampleID) {
            val input = INPUT(path = filename)
            val filtered = CSVFilter(input, regexp = "QualityOK=1")
            filteredMap(sampleID) = filtered
        }
    }
    val joined = CSVListJoin(in = filteredMap)
}
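For readers who prefer plain Scala iteration, the following sketch parses a tab-delimited metadata table into one Map per row, so that columns can be looked up by name as in the loop above. It is an assumption-based illustration and not the implementation of iterCSV. MetadataRows and parse are hypothetical names, and the table is given as an in-memory string to keep the example self-contained.

```scala
object MetadataRows {
  // Parse tab-delimited text with a header line into one Map(column -> value) per data row.
  def parse(text: String): Seq[Map[String, String]] = {
    val lines = text.split("\n").map(_.trim).filter(_.nonEmpty).toSeq
    val header = lines.head.split("\t")
    lines.tail.map(line => header.zip(line.split("\t")).toMap)
  }

  def main(args: Array[String]): Unit = {
    val metadata = "SampleID\tFilename\nsample1\tdata1.csv\nsample2\tdata2.csv\n"
    for (row <- parse(metadata)) {
      val sampleID = row("SampleID")
      val filename = row("Filename")
      println(sampleID + " -> " + filename)
    }
  }
}
```

To read the table from disk instead, the string could be obtained with scala.io.Source.fromFile("metadata.csv").mkString, assuming the file sits in the working directory.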