Sorts and merges CSV files.
The component can either sort a single CSV or merge an array of CSVs into one sorted CSV. Sorting can use several key fields, each in ascending or descending order. Fields may be strings (any characters), words (letters and digits only), natural numbers, or real numbers.
The component invokes the GNU sort utility and supports external sorting, so files too large to fit into RAM can still be sorted. NOTE: For extremely large input CSVs (bigger than RAM), it is recommended to disable the preliminary sort check (set skipSortCheck=true) if it is not needed.
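To give a feel for what delegating to GNU sort can look like, the sketch below assembles a sort command line for comma-separated input. The helper name `build_sort_command` and its arguments are hypothetical and are not taken from CSVSort.java; only the GNU sort options themselves (`-t`, `-k`, `-s`, `-u`, `-c`) are standard. Header rows and quoted fields are deliberately ignored here.

```python
def build_sort_command(keys, stable=True, unique=False, check_only=False,
                       sort_cmd="sort", paths=()):
    """Build a GNU sort argument list for comma-separated input.

    keys: list of (field_index, numeric, descending) tuples; field_index is 1-based.
    This is an illustrative helper, not the component's actual implementation.
    """
    cmd = [sort_cmd, "-t", ","]      # -t sets the field separator
    if check_only:
        cmd.append("-c")             # only check whether the input is sorted
    if stable:
        cmd.append("-s")             # disable the last-resort comparison
    if unique:
        cmd.append("-u")             # output only the first of lines that compare equal
    for index, numeric, descending in keys:
        # n = numeric comparison, r = reverse (descending) order for this key
        flags = ("n" if numeric else "") + ("r" if descending else "")
        cmd.extend(["-k", f"{index},{index}{flags}"])
    cmd.extend(paths)
    return cmd
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where GNU sort is installed. Because GNU sort spills to temporary files when the data does not fit in memory, a command of this kind also works for inputs larger than RAM.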
Version | 1.2 |
---|---|
Bundle | tools |
Categories | |
Authors | Vladimir Rogojin (vladimir.rogojin@helsinki.fi) |
Requires | GNU sort |
Source files | component.xml CSVSort.java |

Inputs

Name | Type | Mandatory | Description |
---|---|---|---|
in | CSV | Optional | An input CSV. If inArray is defined, this CSV is merged with the CSVs from the array. |
inArray | Array<CSV> | Optional | An array of input CSVs. If defined, all CSVs from the array are merged. If in is defined, it is merged with the array. |

Outputs

Name | Type | Description |
---|---|---|
out | CSV | The sorted CSV |
status | TextFile | Contains sorted if all input CSVs are sorted, unsorted if at least one input file is not sorted, and unknown if no sort check was performed. |
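
The three status values can be illustrated with a small sketch. The function names `is_sorted` and `sort_status` are hypothetical, the check uses a single key column compared as plain strings, and nothing here is taken from the component's source; it only mirrors the rule stated in the table above.

```python
import csv
import io


def is_sorted(rows, key):
    """Return True if consecutive rows are in non-decreasing key order."""
    return all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))


def sort_status(csv_texts, key_column, skip_sort_check=False):
    """Mimic the documented status values for a list of CSV documents.

    Assumption: keys are compared as strings on one column only.
    """
    if skip_sort_check:
        return "unknown"              # no check performed
    for text in csv_texts:
        rows = list(csv.DictReader(io.StringIO(text)))
        if not is_sorted(rows, key=lambda row: row[key_column]):
            return "unsorted"         # one unsorted input is enough
    return "sorted"
```

For example, calling `sort_status` on two CSV texts where the second has its key column out of order yields unsorted, while passing `skip_sort_check=True` yields unknown regardless of content.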

Parameters

Name | Type | Default | Description |
---|---|---|---|
ignoreCase | string | "" | Comma-separated list of key columns to compare case-insensitively. |
keyColumns | string | "" | Ordered, comma-separated list of key columns by which to sort the CSV files. Sorting is performed by the first column, then by the second column, and so on. By default, the first column of the CSVs is the key column. NOTE: all CSVs must have the same columns in the same order. |
mode | string | "" | Comma-separated list of sorting modes, one entry per column, in the format column_name=mode. Modes: asc (ascending), des (descending). By default, all columns are sorted in ascending order. |
noSort | boolean | false | Whether to only check that the inputs are sorted instead of sorting them. If true, no sorting is performed, an empty CSV is produced, and every input CSV is checked for sortedness. |
preprocess | boolean | true | If the input CSVs contain quotes delimiting columns and/or spaces inside field values, use preprocess=true to transform the inputs into a format suitable for the sort utility. |
regex | string | ".*" | Java regular expression that selects, by key, the CSVs from the input array to merge/sort. |
skipSortCheck | boolean | false | Whether to skip the sort check. If true, the initial check of whether the input CSVs are sorted is not performed. Setting it to true is recommended for large input CSVs to improve performance. If no sort check is performed, the status output is unknown. |
sortCmd | string | "sort" | The command to invoke sort utility. |
stable | boolean | true | Enables stable sorting, which disables the last-resort comparison. |
types | string | "" | Comma-separated list of column types, in the format column_name = column_type. The following column types are defined: string (any character sequence), word (sequence of letters and/or digits), natural (non-negative integer), real (real number). By default, all columns are of type natural. |
unique | boolean | false | Return unique rows only. |
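
As a rough illustration of how keyColumns, mode and types interact, the sketch below parses the column_name=value lists and performs a multi-key sort in memory. It is a minimal stand-in that assumes rows are already loaded as dictionaries; the names `parse_mapping`, `CONVERTERS` and `sort_rows` are hypothetical, and the real component delegates the work to GNU sort rather than sorting in Python.

```python
def parse_mapping(spec):
    """Parse 'Col1=des, Col2 = asc' into {'Col1': 'des', 'Col2': 'asc'}."""
    mapping = {}
    for item in spec.split(","):
        if item.strip():
            name, _, value = item.partition("=")
            mapping[name.strip()] = value.strip()
    return mapping


# Assumed mapping from the documented type names to Python converters.
CONVERTERS = {"string": str, "word": str, "natural": int, "real": float}


def sort_rows(rows, key_columns, mode="", types="", default_type="natural"):
    """Sort dict rows by several key columns, first key most significant."""
    modes = parse_mapping(mode)
    column_types = parse_mapping(types)
    columns = [c.strip() for c in key_columns.split(",") if c.strip()]
    result = list(rows)
    # Sort by the least significant key first; Python's sort is stable,
    # so each later pass on a more significant key keeps earlier ordering for ties.
    for column in reversed(columns):
        convert = CONVERTERS[column_types.get(column, default_type)]
        result.sort(key=lambda row: convert(row[column]),
                    reverse=(modes.get(column, "asc") == "des"))
    return result
```

A call such as `sort_rows(rows, "Col1,Col2", mode="Col2=des", types="Col1=string")` orders rows by Col1 ascending as strings, and within equal Col1 values by Col2 descending as natural numbers, so that 10 ranks above 9.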

Test cases

Test case | Parameters | IN in | IN inArray | OUT out | OUT status |
---|---|---|---|---|---|
case1 | (missing) | in | (missing) | out | (missing) |
case10_unique_rows | unique=true | in | (missing) | out | (missing) |
case2 | (missing) | in | (missing) | out | (missing) |
case3 | keyColumns=Col2,Col3,Col4, | in | (missing) | out | (missing) |
case4 | keyColumns=Col2,Col3,Col4, | in | (missing) | out | (missing) |
case5 | keyColumns=Col1,Col3,Col2, | in | inArray | out | (missing) |
case6 | keyColumns=Col1,Col3,Col2, | (missing) | inArray | out | (missing) |
case7 | keyColumns=Col1,Col3,Col2, | (missing) | inArray | out | (missing) |
case8 | keyColumns=MEDIANSURIVALPVALUE, | in | (missing) | out | (missing) |
case9 | keyColumns=Col_2,Col3, | in | (missing) | out | (missing) |