Cleans up CSV outputs: optionally removes file headers, quotation marks, and unused columns, and can reorder and rename columns.
Escape sequences that can be used in the missing value symbols and column delimiters:
character | escape |
---|---|
empty string | \e |
space | \s |
carriage return | \r |
new line | \n |
tab | \t |
quotation mark | \q |
semicolon | \c |
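The escape table above can be read as a simple lookup. The following is a minimal Python sketch of such a decoder, not the component's own implementation: the function name `decode_escapes` is hypothetical, and leaving unknown sequences unchanged is an assumption, since the documentation does not say how they are handled.

```python
import re

# Mapping taken from the escape table above (letter after the backslash -> character).
ESCAPES = {
    "e": "",    # empty string
    "s": " ",   # space
    "r": "\r",  # carriage return
    "n": "\n",  # new line
    "t": "\t",  # tab
    "q": '"',   # quotation mark
    "c": ";",   # semicolon
}


def decode_escapes(text):
    """Replace each listed escape sequence with the character it stands for.

    Assumption: sequences that are not in the table are left untouched.
    """
    return re.sub(r"\\([esrntqc])", lambda m: ESCAPES[m.group(1)], text)


if __name__ == "__main__":
    # A backslash followed by "e" decodes to nothing, so the symbol becomes -NA-.
    print(decode_escapes("-N\\eA-"))  # prints -NA-
```

Under this reading, a parameter value of `\e` on its own denotes an empty string, which is otherwise hard to distinguish from an unset parameter.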
Component details:
Property | Value |
---|---|
Version | 2.2 |
Bundle | tools |
Categories | Convert |
Authors | Marko Laakso (Marko.Laakso@Helsinki.FI) |
Source files | component.xml |
Inputs:
Name | Type | Mandatory | Description |
---|---|---|---|
in | CSV | Mandatory | Input file to be cleaned |
Outputs:
Name | Type | Description |
---|---|---|
out | CSV | Simplified CSV |
Parameters:
Name | Type | Default | Description |
---|---|---|---|
autoRename | string | "" | If this delimiter string is not empty, duplicate column names are renamed to be unique. Each new name is formed by appending the delimiter and an integer counter to the column name. For example, a hyphen (-) converts {a, a, a-1, a-3, a} to {a, a-2, a-1, a-3, a-4}. |
columnsIn | string | "" | Comma-separated list of column names for the input columns. An empty string means that the column names are defined on the first input row. |
columnsOut | string | "*" | Comma-separated list of column selections for the output. An asterisk (*) may be used for all columns. |
delimIn | string | "\t" | Column delimiter for the input |
delimSymbol | string | "\t" | Column delimiter for the output |
dropHeader | boolean | false | If true, the column names are omitted from the output. |
fillRows | boolean | false | Accept rows with too few columns and complete them with missing values. |
naIn | string | "NA" | Missing value symbol for the input |
naSymbol | string | "NA" | Missing value symbol for the output |
numberFormat | string | "" | A line feed (\n) separated list of decimal formats for the columns. Each entry consists of the column name and a Java DecimalFormat pattern, separated by an equals sign. For example, rounding to three decimals can be done with myColumn=#0.000. |
rename | string | "" | Comma separated list of column renaming rules (oldname=newname) |
replace | string | "" | A line feed (\n) separated list of column-specific search-and-replace rules. Each entry consists of three lines: the column name, a regular expression for the replacement keys, and the substitution pattern. The syntax of the keys and substitutions follows Java regular expressions. |
rowSkip | int | 0 | Skip this many lines from the input before reading the table. |
skipQuotes | string | "" | Comma-separated list of output column names that should not have quotation marks. An asterisk (*) may be used for all columns. |
trim | boolean | false | Remove leading and trailing whitespace from the field values. |
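The autoRename example can be reproduced with a short sketch. This is only one renaming scheme that is consistent with the documented example, not necessarily the component's actual algorithm: the function name `auto_rename` is hypothetical, and it assumes that the counter starts from the occurrence number of the duplicate and is incremented until the candidate name collides with no other column name.

```python
def auto_rename(names, delim):
    """Rename duplicate column names by appending `delim` and a counter.

    Assumptions (not confirmed by the documentation): the first occurrence
    keeps its name, the counter for a later occurrence starts at its
    occurrence number, and the counter is increased until the new name
    does not clash with any original or already generated name.
    """
    if delim == "":
        # An empty delimiter disables renaming.
        return list(names)

    used = set(names)   # every name that is already taken
    counts = {}         # last counter value used per original name
    result = []
    for name in names:
        counts[name] = counts.get(name, 0) + 1
        if counts[name] == 1:
            result.append(name)
            continue
        n = counts[name]
        candidate = f"{name}{delim}{n}"
        while candidate in used:
            n += 1
            candidate = f"{name}{delim}{n}"
        counts[name] = n
        used.add(candidate)
        result.append(candidate)
    return result


if __name__ == "__main__":
    print(auto_rename(["a", "a", "a-1", "a-3", "a"], "-"))
    # prints ['a', 'a-2', 'a-1', 'a-3', 'a-4']
```

The third `a` skips `a-3` because that name already exists in the input, which matches the documented result `a-4`.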
Test cases:
Test case | Parameters | IN in | OUT out |
---|---|---|---|
case1 | columnsOut = Sample,value,name, | in | out |
case2 | columnsOut = Sample,name, | in | out |
case3 | naSymbol = -N\\eA-, | in | out |
case4 | columnsOut = C,B, | in | out |
case5 | columnsIn = A,B,C, | in | out |
case6 | autoRename = - | in | out |
case7 | fillRows = true | in | out |
case8 | numberFormat = B=%\nA=#.##\nC=O.###E0 | in | out |
case9 | naIn = \\e, | in | out |
case9b | naIn = \\e, | in | out |
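The test inputs are not reproduced here, but the behaviour that case7 (fillRows = true) exercises can be illustrated with a small sketch. The function name `fill_rows` and its arguments are hypothetical, and rejecting rows that are too long is an assumption; the parameter description only covers rows with too few columns.

```python
def fill_rows(rows, width, na_symbol="NA"):
    """Pad rows that have fewer than `width` fields with the missing value symbol.

    Assumption: rows with more fields than the header are treated as errors,
    since the documentation does not describe that case.
    """
    filled = []
    for row in rows:
        if len(row) > width:
            raise ValueError(f"row has {len(row)} fields, expected at most {width}")
        filled.append(list(row) + [na_symbol] * (width - len(row)))
    return filled


if __name__ == "__main__":
    rows = [["1", "2", "3"], ["4"]]
    print(fill_rows(rows, 3))
    # prints [['1', '2', '3'], ['4', 'NA', 'NA']]
```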