Converts IDs from a column in a CSV file to other type of IDs and/or combines duplicated rows into one. The conversion is done using a conversion table.
This component can be used to make an ID column unique by giving the same CSV file as both "csv" and "conversionTable" input, using the same column for sourceColumn and conversionColumn, and giving unique=true.
Version | 1.0 |
---|---|
Bundle | tools |
Categories | Convert |
Specialties | generic |
Authors | Kristian Ovaska (kristian.ovaska@helsinki.fi) |
Issue tracker | View/Report issues |
Requires | commons-math3-3.2.jar (jar) ; commons-primitives-1.0.jar (jar) |
Source files | component.xml IDConvert.java |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
in | T (generic) | Mandatory | Source CSV file. Contains the column that is to be converted. The column is named by the parameter 'sourceColumn'. |
conversionTable | CSV | Mandatory | Conversion table. Must contain the column named by the parameter 'conversionColumn'. |
Name | Type | Description |
---|---|---|
out | T (generic) | Converted CSV file. This is a modified copy of the source CSV file that contains the converted column. |
Name | Type | Default | Description |
---|---|---|---|
collapseNumeric | string | "median" | Method for collapsing numeric columns with duplicate rows
(when unique=true). Legal values are "median" (take median of
non-NA values), "mean", "sum", "max", "min", "first" (take the first
row), "majority" (take the value that is present on the largest
number of rows), "consensus" (require that all rows have the
same value) and "indicator". Here, "majority" and "consensus" are mostly useful for discrete (integer) values). For "consensus", it is an error if duplicate rows contain distinct values. "indicator" assumes that data are divided into negative (-1), neutral (0) and positive (1) items, such as discretized expression values. The result is (a) 1 if there is a 1 and no -1 value; (b) -1 if there is a -1 and not 1 value; (c) 0 if there is a 0 and no -1 or 1; or (d) NA otherwise. |
collapseString | string | "consensus" | Method for collapsing non-numeric columns with duplicate rows (when unique=true). Legal values are "first" (take the first row), "majority" (take the value that is present on the largest number of rows), "longest" (take the longest value), "shortest" (take the shortest value) and "consensus" (require that all rows have the same value). For "consensus", it is an error if duplicate rows contain distinct values. |
conversionColumn | string | (no default) | Column in 'conversionTable' that is used to fetch new values for the source column. If the parameter is empty, the first column is used. |
dropMissing | boolean | false | Determines whether rows with missing conversion values are dropped. See also 'originalWhenMissing'. |
keyColumn | string | "" | Column in 'conversionTable' that is used to match ID values from the source CSV. If the parameter is empty, the first column is used. |
originalWhenMissing | boolean | true | Determines how values are converted when the item is not found in the conversion table. If true, the original (unconverted) ID is inserted to the output. If false, NA is inserted. If 'dropMissing' is set true, NA rows are dropped from the output. |
sourceColumn | string | "" | Column in 'csv' that is going to be converted. If empty, the first column is used. |
split | boolean | false | If true, split values in the source column by commas before ID conversion. For example, if the source column contains the string "a,b,c", it is first split into "a", "b" and "c"; then, individual IDs are converted; finally, the converted IDs are joined by commas into a string of format "A,B,C". |
splitConverted | boolean | false | If true, split converted values by commas after ID conversion. A row with comma separated values in the target column will be split to otherwise identical rows but so that the rows have a single ID value in the target column. This is done before possible collapsing of duplicate rows. |
targetColumn | string | "" | Target column name in 'converted'. If empty, the column name is the same as the original column (sourceColumn). |
unique | boolean | false | If true, rows with identical identifiers (sourceColumn) are collapsed into one row so that sourceColumn is unique. Collapsing may result in error if duplicate rows contain conflicting values. See parameters collapseNumeric and collapseString for methods to collapse numeric and non-numeric columns, respectively. |
varMeasure | string | "" | Method for calculating the variation of numeric duplicate
values. If the variation is above the threshold given in
varThreshold , duplicate values are collapsed to
NA. Possible values are 'MAD' and 'variance'. 'MAD' stands for the
median absolute deviation and 'variance' for the population
variance. If no values is given, no threshold for combining numeric
values is used. |
varThreshold | float | 2 | Maximum allowed variability in the duplicate values.
If varMeasure is given duplicate values with variance
above the threshold are collapsed to NA |
Test case | Parameters▼ | IN in |
IN conversionTable |
OUT out |
||
---|---|---|---|---|---|---|
case1 | properties | in | conversionTable | out | ||
sourceColumn=Column2, |
||||||
case10_unique_max | properties | in | conversionTable | out | ||
unique=true, |
||||||
case11_unique_min | properties | in | conversionTable | out | ||
unique=true, |
||||||
case12_longest | properties | in | conversionTable | out | ||
unique=true, |
||||||
case13_dropMissing | properties | in | conversionTable | out | ||
sourceColumn=Column2, |
||||||
case14_sum | properties | in | conversionTable | out | ||
unique=true, |
||||||
case15_consensus_numeric | properties | in | conversionTable | (expecting failure) | ||
unique=true, |
||||||
case2_split | properties | in | conversionTable | out | ||
sourceColumn=Column2, |
||||||
case3_unique_median | properties | in | conversionTable | out | ||
unique=true, |
||||||
case4_unique2 | properties | in | conversionTable | out | ||
unique=true, |
||||||
case5_missing | properties | in | conversionTable | out | ||
sourceColumn=Column2, |
||||||
case6_split_converted | properties | in | conversionTable | out | ||
sourceColumn=Column2, |
||||||
case7_split_converted_unique | properties | in | conversionTable | out | ||
sourceColumn=Column2, |
||||||
case8_mad_unique | properties | in | conversionTable | out | ||
unique=true, |
||||||
case9_unique_mean | properties | in | conversionTable | out | ||
unique=true, |