5. Iteration

Often, we need to process several data files with the same method and then combine the processed files into one. Scala's iteration facilities are useful for writing workflows for such data sets.

Let’s assume we have two tab-delimited files. data1.csv:

Gene    Value   QualityOK
gene01  1.5     1
gene02  2.7     0
gene03  5.8     0
gene99  3.2     1

data2.csv:

Gene    Value   QualityOK
gene01  2.1     0
gene02  0.3     1
gene03  3.6     1
gene99  1.4     1

If our data set is small, we can store metadata using a simple Scala Map, which gives human-readable sample identifiers mapped to file names. See below for a more scalable approach.

val samples = Map("sample1" -> "data1.csv", "sample2" -> "data2.csv")
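
A small plain-Scala sketch (no Anduril involved) of how such a metadata Map behaves: iterating over it yields (identifier, file name) pairs, and identifiers must be unique because a repeated key silently replaces the earlier entry. The object and value names are illustrative only.

```scala
object SampleMapSketch {
  // Sample identifiers mapped to file names, as in the text.
  val samples: Map[String, String] =
    Map("sample1" -> "data1.csv", "sample2" -> "data2.csv")

  // A repeated key raises no error: the later entry replaces the earlier one.
  val collided: Map[String, String] =
    Map("sample1" -> "data1.csv", "sample1" -> "data2.csv")

  // Iterating over a Map yields (key, value) pairs that the for-loop destructures.
  // The result is sorted so that it does not depend on Map iteration order.
  val described: Seq[String] =
    (for ((sampleID, filename) <- samples) yield s"$sampleID reads $filename").toSeq.sorted

  def main(args: Array[String]): Unit = {
    described.foreach(println)
    println(s"entries in collided: ${collided.size}")
  }
}
```

With a repeated key, one of the data files would be dropped from the workflow without any warning, so it is worth checking the identifiers before running.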

A first (bad) attempt

A first attempt might be to filter each CSV file in a for-loop, store the outputs in a Scala Map, and join all filtered files after the loop. Code:

#!/usr/bin/env anduril

import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._
import scala.collection.mutable.Map

object IterationBad {
    val samples = Map("sample1" -> "data1.csv", "sample2" -> "data2.csv")
    val filteredMap = Map[String, CSV]()

    for ((sampleID, filename) <- samples) {
        val input = INPUT(path = filename)
        val filtered = CSVFilter(input, regexp = "QualityOK=1")
        filteredMap(sampleID) = filtered.out
    }

    val joined = CSVListJoin(in = filteredMap)
}

When executed, the workflow prints several lines, including the following two:

[INFO filtered] Executing filtered (anduril.tools.CSVFilter) (SOURCE iteration-bad.scala:12) (COMPONENT-STARTED) (2016-04-28 15:38:10)
[INFO filtered_anduril.tools.CSVFilter_1] Executing filtered_anduril.tools.CSVFilter_1 (anduril.tools.CSVFilter) (SOURCE iteration-bad.scala:12) (COMPONENT-STARTED) (2016-04-28 15:38:10)

After execution, the execution folder contains the correct result file:

file        Gene    Value   QualityOK
sample1     gene01  1.5     1
sample1     gene99  3.2     1
sample2     gene02  0.3     1
sample2     gene03  3.6     1
sample2     gene99  1.4     1
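
To make the expected result concrete, the same filter-and-join logic can be sketched in plain Scala on in-memory copies of the two data files. This is only a model of what the workflow computes, not a description of how CSVFilter or CSVListJoin work internally; the object and value names are made up for the example.

```scala
object FilterJoinSketch {
  // Rows of data1.csv and data2.csv from the text, as (Gene, Value, QualityOK).
  val data: Map[String, Seq[(String, Double, Int)]] = Map(
    "sample1" -> Seq(("gene01", 1.5, 1), ("gene02", 2.7, 0), ("gene03", 5.8, 0), ("gene99", 3.2, 1)),
    "sample2" -> Seq(("gene01", 2.1, 0), ("gene02", 0.3, 1), ("gene03", 3.6, 1), ("gene99", 1.4, 1))
  )

  // Keep rows with QualityOK == 1 and tag each kept row with its sample identifier.
  val joined: Seq[(String, String, Double, Int)] =
    for {
      sampleID <- data.keys.toSeq.sorted
      (gene, value, ok) <- data(sampleID)
      if ok == 1
    } yield (sampleID, gene, value, ok)

  def main(args: Array[String]): Unit =
    joined.foreach { case (sampleID, gene, value, ok) =>
      println(s"$sampleID\t$gene\t$value\t$ok")
    }
}
```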

Why is this a bad solution when it seems to work? Recall from earlier workflows that it has always been easy to see, from the log messages and the execution folder, which Scala call created each component in the workflow. For example, val data = INPUT("data.csv") creates a component named data, based on the variable name. In our iteration case, Anduril does not have enough context inside the for-loop to generate easily traceable component names. The first iteration produces a component named filtered and the second filtered_anduril.tools.CSVFilter_1. It is difficult to guess which one corresponds to sample1 and which to sample2.

Proper naming using NamedMap / NamedSeq

The Anduril runtime library (org.anduril.runtime) provides two data structures, NamedMap and NamedSeq, that behave like the standard Scala Map and Seq but give legible names to the components inserted into them. NamedMap is a mapping from strings to components, and should be used when string identifiers for data files are available. NamedSeq is a simpler version that uses integer indexes. The solution using NamedMap is:

#!/usr/bin/env anduril

import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._

object IterationNamedMap {
    val samples = Map("sample1" -> "data1.csv", "sample2" -> "data2.csv")
    val inputMap = NamedMap[INPUT]("input")
    val filteredMap = NamedMap[CSVFilter]("filtered")

    for ((sampleID, filename) <- samples) {
        inputMap(sampleID) = INPUT(path = filename)
        filteredMap(sampleID) = CSVFilter(inputMap(sampleID), regexp = "QualityOK=1")
    }

    val joined = CSVListJoin(in = filteredMap)
}

Before the for-loop, two NamedMap objects are initialized and given descriptive name prefixes ("input" and "filtered") that reflect the items inserted into them. Inside the for-loop, all component assignments are done through the NamedMaps. Generated names are composed of the prefix given in the NamedMap constructor and the key given in the for-loop. The INPUT components are named input_sample1 and input_sample2, and the CSVFilter components are filtered_sample1 and filtered_sample2.
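
The naming rule described above can be illustrated with a toy model in plain Scala. This is not Anduril code and says nothing about how NamedMap is implemented; componentName is a hypothetical helper introduced only for illustration.

```scala
object NamedMapNamingSketch {
  // Hypothetical helper: join the prefix and the key with an underscore,
  // matching the names reported in the text (e.g. filtered_sample1).
  def componentName(prefix: String, key: String): String = s"${prefix}_$key"

  val sampleIDs: Seq[String] = Seq("sample1", "sample2")

  // One name per (prefix, key) pair, so each component can be traced to its sample.
  val names: Seq[String] =
    for (prefix <- Seq("input", "filtered"); id <- sampleIDs)
      yield componentName(prefix, id)

  def main(args: Array[String]): Unit = names.foreach(println)
}
```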

Proper naming can be verified from execution logs, which include:

[INFO filtered_sample1] Executing filtered_sample1 (anduril.tools.CSVFilter) (SOURCE iteration-namedmap.scala:12) (COMPONENT-STARTED) (2016-04-28 16:14:35)
[INFO filtered_sample2] Executing filtered_sample2 (anduril.tools.CSVFilter) (SOURCE iteration-namedmap.scala:12) (COMPONENT-STARTED) (2016-04-28 16:14:35)

Syntactic sugar using withName

The solution using NamedMap exported all components from the for-loop to the surrounding code block (inputMap, filteredMap). In some cases, we may execute several internal steps inside the for-loop, but only wish to export a subset of them. For this case, Anduril provides a withName function that executes a code block in an environment that provides a name prefix for all components created in the block. Example:

#!/usr/bin/env anduril

import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._
import scala.collection.mutable.Map

object IterationWithName {
    val samples = Map("sample1" -> "data1.csv", "sample2" -> "data2.csv")
    val filteredMap = Map[String, CSV]()

    for ((sampleID, filename) <- samples) {
        withName(sampleID) {
            val input = INPUT(path = filename)
            val filtered = CSVFilter(input, regexp = "QualityOK=1")
            filteredMap(sampleID) = filtered
        }
    }

    val joined = CSVListJoin(in = filteredMap)
}

withName is a function that takes a name prefix as an argument (here, sampleID) and prepends this prefix to the names of all components created inside the given code block. The names follow the pattern PREFIX-INSTANCENAME. In our case, the names generated are sample1-input, sample2-input, sample1-filtered and sample2-filtered. CSVFilter instances are exported from the for-loop, and INPUT instances are hidden.

Note that here we can use a regular Scala Map instead of NamedMap to collect CSVFilter instances, because the instances are properly named using val filtered. It would be safe to use NamedMap here; it detects that components are already properly named and does not insert its own prefix. When in doubt, use NamedMap / NamedSeq.
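
To illustrate hiding internal steps, the following sketch adds a second filtering step inside the withName block and exports only the final result. It reuses only the calls shown above. The second regexp ("Gene=gene0.*") assumes that CSVFilter accepts any Column=regex pair in the same form as "QualityOK=1", and the names IterationWithNameSteps, qualityOK and geneFiltered are made up for this example. It has not been checked against a particular Anduril version.

```scala
#!/usr/bin/env anduril

import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._
import scala.collection.mutable.Map

object IterationWithNameSteps {
    val samples = Map("sample1" -> "data1.csv", "sample2" -> "data2.csv")
    val filteredMap = Map[String, CSV]()

    for ((sampleID, filename) <- samples) {
        withName(sampleID) {
            // Internal steps: named with the sample prefix, but not exported.
            val input = INPUT(path = filename)
            val qualityOK = CSVFilter(input, regexp = "QualityOK=1")
            // Assumed regexp form: keep genes whose name starts with "gene0".
            val geneFiltered = CSVFilter(qualityOK.out, regexp = "Gene=gene0.*")
            // Only the output of the last step leaves the loop.
            filteredMap(sampleID) = geneFiltered.out
        }
    }

    val joined = CSVListJoin(in = filteredMap)
}
```

Following the PREFIX-INSTANCENAME pattern, the components would be expected to appear under names such as sample1-qualityOK and sample1-geneFiltered.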

Storing metadata in CSV files

If you have more than half a dozen data files, you probably want to store your metadata outside the Scala source file. CSV files are a good option, as Anduril provides a convenient means of iterating over them. Of course, you can use any Scala iteration facilities you prefer. Let’s store the metadata in metadata.csv, which is located in the same folder as the Scala source file:

Sample  Filename
sample1 data1.csv
sample2 data2.csv

We can now iterate over the CSV file using iterCSV from org.anduril.runtime:

#!/usr/bin/env anduril

import anduril.builtin._
import anduril.tools._
import org.anduril.runtime._
import scala.collection.mutable.Map

object IterationCSV {
    val filteredMap = Map[String, CSV]()

    for (row <- iterCSV("metadata.csv")) {
        val sampleID = row("Sample")
        val filename = row("Filename")
        withName(sampleID) {
            val input = INPUT(path = filename)
            val filtered = CSVFilter(input, regexp = "QualityOK=1")
            filteredMap(sampleID) = filtered
        }
    }

    val joined = CSVListJoin(in = filteredMap)
}
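
The row-by-column access used in the loop can be modelled in plain Scala as follows. This is a minimal sketch that does not use or describe the implementation of iterCSV; parseRows is a hypothetical helper, and the metadata is held in memory as tab-delimited strings instead of being read from metadata.csv.

```scala
object MetadataRowsSketch {
  // Hypothetical helper: turn tab-delimited lines into one Map per data row,
  // keyed by the column names found in the header line.
  def parseRows(lines: Seq[String]): Seq[Map[String, String]] = {
    val header = lines.head.split("\t").toSeq
    lines.tail.map(line => header.zip(line.split("\t").toSeq).toMap)
  }

  // In-memory copy of metadata.csv from the text.
  val metadata: Seq[String] =
    Seq("Sample\tFilename", "sample1\tdata1.csv", "sample2\tdata2.csv")

  val rows: Seq[Map[String, String]] = parseRows(metadata)

  // Column names must match the header exactly.
  val pairs: Seq[(String, String)] =
    rows.map(row => row("Sample") -> row("Filename"))

  def main(args: Array[String]): Unit =
    pairs.foreach { case (sampleID, filename) => println(s"$sampleID -> $filename") }
}
```

In this model, looking up a column that is absent from the header fails at run time, so the names used in the loop should be checked against the first line of the metadata file.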