Filters columns and/or rows from CSV files using flexible criteria.
Columns are filtered based on column names (see "includeColumns"); it is also possible to rename columns (see "rename"). Columns can be reordered using "includeColumns".
Rows can be filtered using several criteria. A row is included in output if all conditions are satisfied. Rows having too many missing values are filtered using the "nonMissing" parameter. The parameters "regexp", "lowBound" and "highBound" filter rows based on cell contents.
If the "auxiliary" input is given, the result contains those rows from the "csv" input file whose values match those from the auxiliary file. Column names for matching are specified using "idColumn" and "matchColumn".
Default settings do not filter out anything. The "negate" parameter reverses all inclusion criteria. The component reads the "csv" input one line at a time and is memory-efficient for files with a large number of lines.
Version | 1.1.1 |
---|---|
Bundle | tools |
Categories | Filter |
Authors | Kristian Ovaska (kristian.ovaska@helsinki.fi), Marko Laakso (Marko.Laakso@Helsinki.FI) |
Issue tracker | View/Report issues |
Source files | component.xml CSVFilter.java |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
in | CSV | Mandatory | CSV file to be filtered. |
aux | CSV | Optional | If given, contains one column (see "matchColumn") whose values are matched to a column in the "csv" input (see "idColumn"). |
includeColumnsFile | IDList | Optional | Column names that should be included in the result file. This input is used together with includeColumns parameter. The actual list represents their union. The order of column names is significant: the output columns are in the order given. Order of this input overrides the parameter in case of duplicates. |
Name | Type | Description |
---|---|---|
out | CSV | Filtered CSV file. |
Name | Type | Default | Description |
---|---|---|---|
colOrder | boolean | false | Enables including columns from the includeColumns parameter before the columns in the includeColumns input. Secify whether to include columns specified by the includeColumns parameter last (default) or first. See also includeColumns input and includeColumns parameter. |
highBound | string | "" | A row is included if the value in given numeric columns are at most as large as given numeric bounds. The parameter has a format COLNAME1=HIGH1,COLNAME2=HIGH2. |
idColumn | string | "" | Column name in "csv" that contains ID values. If the parameter is empty, the first column is used. |
includeColumns | string | "*" | Comma-separated list of column names that should be included in the result file. The special value * includes all columns. The order of column names is significant: the output columns are in the order given. See also includeColumns input and colOrder parameter. The parameter supports regular expressions for column name matching. |
lowBound | string | "" | A row is included if the value in given numeric columns are at least as large as given numeric bounds. The parameter has a format COLNAME1=LOW1,COLNAME2=LOW2. For example, "col=5" includes rows where the column col has a value of at least 5. |
matchColumn | string | "" | Column name in "auxiliary" containing values that must match the ID column in "csv" (specified using "idColumn"). If empty, the first column of "auxiliary" is used. |
negate | boolean | false | If true, all inclusion criteria are negated so that rows or columns are excluded instead of included if they match the criteria. For example, includeColumns=* then excludes all columns (which would be an error since at least one column must be included). |
nonMissing | float | 0 | Include those rows that have non-missing (non-NA) values in at least this many columns. Only those columns that are part of the output are counted. If nonMissing is less than one, it is interpreted as a percentage value. Note that '1' is interpreted as an absolute value, not as 100 percent. |
regexp | string | "" | Row filtering based on regular expression. A row is included in the result if values in the given columns match given regular expressions. The parameter has a format COLNAME1=EXPRESSION,COLNAME2=EXPRESSION2 where COLNAMEs are column names in "csv" and EXPRESSIONs are regular expressions using Java syntax. For example, "col=a|b" includes rows where the column col has a value of "a" or "b". Commas have to be escaped with "\\" if they are used in column names or regular expressions. |
rename | string | "" | Comma-separated list of column renaming rules (OLDNAME=NEWNAME). Parameters that refer to column names refer to old names; renaming is done last. |
Test case | Parameters▼ | IN in |
IN aux |
IN includeColumnsFile |
OUT out |
|
---|---|---|---|---|---|---|
case01_names | properties | in | (missing) | (missing) | out | |
includeColumns=Gene,C3,C2, |
||||||
case02_aux | properties | in | aux | (missing) | out | |
matchColumn=A2 |
||||||
case03_colstar | properties | in | (missing) | (missing) | out | |
includeColumns=*, |
||||||
case04_missing | properties | in | (missing) | (missing) | out | |
includeColumns=C3,C2, |
||||||
case05_bounds | properties | in | (missing) | (missing) | out | |
lowBound=C1=2,C2=3, |
||||||
case06_nofilter | (missing) | in | (missing) | (missing) | out | |
case07_negate | properties | in | (missing) | (missing) | out | |
includeColumns=C1,C2, |
||||||
case08_rename | properties | in | (missing) | (missing) | out | |
includeColumns=Gene,C1,C4, |
||||||
case09_negateCols | properties | in | (missing) | (missing) | out | |
includeColumns=col1, |
||||||
case10_includeCols | properties | in | (missing) | includeColumnsFile | out | |
includeColumns = B,A |
||||||
case11 | properties | in | (missing) | includeColumnsFile | out | |
includeColumns = B,A, |
||||||
case12_aux_empty | (missing) | in | aux | (missing) | out | |
case13_misPercent | properties | in | (missing) | (missing) | out | |
includeColumns=C3,C2, |
||||||
case14_includeCols_regex | properties | in | (missing) | (missing) | out | |
includeColumns = B,A,.*match.* |