Component types
Before implementing a component, we first need to decide its type.
Atomic and composite components
There are two kinds of components: atomic and composite. Atomic components are written in a language such as R, Python, Bash, Java, or Matlab, and roughly correspond to specialized versions of external scripts. In contrast, composite components are written in Scala and correspond to reusable Scala functions in workflows. When an atomic component is used in a workflow, it adds exactly one component (namely, itself) to the workflow. A composite component can flexibly add one or more components to the workflow, some of which may in turn add multiple components.
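The difference in how the two kinds populate a workflow can be illustrated with a small toy model. This is only a sketch and not the framework's API: the Workflow class, the atomic and composite helpers, and the component names ReadCSV, DropColumns, and WriteCSV are all hypothetical, and Python is used here purely for illustration even though real composite components are written in Scala.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Workflow:
    """Toy workflow: just an ordered list of placed component names."""
    placed: List[str] = field(default_factory=list)


def atomic(name):
    """Return a component that adds exactly one node (itself)."""
    def place(workflow):
        workflow.placed.append(name)
    return place


def composite(*parts):
    """Return a component that adds each of its parts in order.

    Parts may themselves be composites, so placement is recursive.
    """
    def place(workflow):
        for part in parts:
            part(workflow)
    return place


# Hypothetical component names, for illustration only.
read_csv = atomic("ReadCSV")
drop_columns = atomic("DropColumns")
write_csv = atomic("WriteCSV")

# A composite built from atomic parts, nested inside another composite.
clean = composite(read_csv, drop_columns)
pipeline = composite(clean, write_csv)

workflow = Workflow()
pipeline(workflow)
print(workflow.placed)  # ['ReadCSV', 'DropColumns', 'WriteCSV']
```

Using the single composite pipeline results in three atomic components in the workflow, whereas using read_csv alone would result in exactly one.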
Implement an atomic component when:
- You need libraries from external languages such as R or Python
- The execution logic is not easily divisible into independent or parallel parts
Implement a composite component when:
- You need to reuse functionality from other components
- You can split the execution of the component into two or more parts that together form a workflow
- In general, when you benefit from workflow features such as automatic parallelization and dependency tracking
As a general rule, if you would embed the logic of the component into a workflow using a single external script, it is a candidate for an atomic component. If you would encapsulate the logic in a Scala function in the workflow, a composite component is probably suitable.
Implementations of atomic components are under components/ in the bundle folder structure, and composite components are under functions/.
Selecting component type for our first component
To add content to the demo bundle, let’s implement a simplified CSV filter component that takes a CSV file as input, excludes certain columns, and writes a filtered CSV file as output. We name our component SimpleCSVFilter.
Based on the criteria above, SimpleCSVFilter is best implemented as an atomic component because it is logically one unit and does not benefit from being divided into parts or from workflow features.
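To make the "logically one unit" point concrete, the core logic of such a filter might look like the following minimal sketch. It uses only Python's standard csv module and deliberately leaves out any framework-specific component interface; the function name filter_csv_columns, its parameters, and the sample data are assumptions made for illustration.

```python
import csv
import io


def filter_csv_columns(source, target, exclude):
    """Copy CSV rows from source to target, dropping the named columns."""
    reader = csv.DictReader(source)
    # Keep the original column order, minus the excluded names.
    kept = [name for name in (reader.fieldnames or []) if name not in exclude]
    writer = csv.DictWriter(
        target,
        fieldnames=kept,
        extrasaction="ignore",  # silently skip the excluded columns in each row
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Hypothetical sample data; a real component would read and write files.
source = io.StringIO("id,name,score\n1,a,0.5\n2,b,0.7\n")
target = io.StringIO()
filter_csv_columns(source, target, exclude={"name"})
print(target.getvalue())
```

The whole task is a single pass over the input with no independent sub-steps, which is why it fits naturally in one external script rather than in a multi-part workflow.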