Java implementation of the R CSVJoin component.
Joins rows from two or more CSV files supplied through any of the inputs, optionally using one column as a matching key.
If a key column is not used, the result contains all rows and all columns of the input files. Missing values (NA) may be introduced when a column is not present in all input files. Each column is present once and duplicate rows are removed.
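
The key-less mode described above can be illustrated with a minimal Python sketch. It is not the component's Java code: the function name `combine_without_keys`, the in-memory table representation, the literal `"NA"` marker, and the output row order are all assumptions made for illustration.

```python
NA = "NA"  # assumed textual marker for a missing value


def combine_without_keys(tables):
    """Stack rows from several tables without a join.

    tables: list of (header, rows) pairs, where header is a list of column
    names and rows is a list of dicts keyed by column name (hypothetical
    representation, not the component's internal one).
    """
    # Each column appears once, in order of first appearance.
    columns = []
    for header, _ in tables:
        for name in header:
            if name not in columns:
                columns.append(name)

    result, seen = [], set()
    for _, rows in tables:
        for row in rows:
            # Columns absent from this table are filled with NA.
            full = tuple(row.get(name, NA) for name in columns)
            if full not in seen:  # duplicate rows are removed
                seen.add(full)
                result.append(full)
    return columns, result


csv1 = (["id", "x"], [{"id": "a", "x": "1"}, {"id": "b", "x": "2"}])
csv2 = (["id", "y"], [{"id": "a", "y": "9"}, {"id": "a", "y": "9"}])
columns, rows = combine_without_keys([csv1, csv2])
```

In this example the column `y` is missing from the first table and `x` from the second, so both pick up NA values, and the repeated row of the second table is kept only once.
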
If a key column is used, the rows in each input file are matched using values from the key column. The result file has one row for each key value. In the result, the first column is the key column; its name is obtained from the first CSV input (csv1). If the intersection parameter is true, a key is included in the result if the key value is present in all inputs. If intersection is false, a key is included if it is present in at least one input (key union). Union semantics may introduce NA values in the result. If several input files have the same column, the value is obtained from the first file. However, if the first file contains a missing value (NA) and the second file contains a non-missing value, the non-missing value is used instead.
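
The keyed mode can be sketched in Python as follows. This is an illustration of the rules in the paragraph above, not the component's actual implementation. The name `join_on_key`, the table representation, the `"NA"` marker, and the key ordering are assumptions. So are two generalizations: the first non-missing value is taken across all inputs (the text only describes the first and second file), and the first row is kept when a key repeats within one input.

```python
NA = "NA"  # assumed textual marker for a missing value


def join_on_key(tables, intersection=True):
    """Join tables on a key column.

    tables: list of (key_column, header, rows) triples; rows are dicts
    keyed by column name (hypothetical representation).
    """
    first_key = tables[0][0]  # the result key column is named after csv1

    # Index every table by key value; the first row wins on repeated keys
    # (an assumption, the source does not say how repeats are handled).
    indexed = []
    for key, _, rows in tables:
        index = {}
        for row in rows:
            index.setdefault(row[key], row)
        indexed.append(index)

    # Key union in order of first appearance (ordering is an assumption).
    keys = []
    for index in indexed:
        for k in index:
            if k not in keys:
                keys.append(k)
    if intersection:
        keys = [k for k in keys if all(k in index for index in indexed)]

    # The key column comes first; every other column is listed once.
    columns = [first_key]
    for key, header, _ in tables:
        for name in header:
            if name != key and name not in columns:
                columns.append(name)

    result = []
    for k in keys:
        out = {first_key: k}
        for name in columns[1:]:
            out[name] = NA
            for index in indexed:
                candidate = index.get(k, {}).get(name, NA)
                if candidate != NA:  # earliest non-missing value wins
                    out[name] = candidate
                    break
        result.append(out)
    return columns, result


csv1 = ("id", ["id", "x"], [{"id": "a", "x": "1"}, {"id": "b", "x": "NA"}])
csv2 = ("key", ["key", "x", "y"],
        [{"key": "b", "x": "5", "y": "7"}, {"key": "c", "x": "6", "y": "8"}])

inner_columns, inner_rows = join_on_key([csv1, csv2], intersection=True)
union_columns, union_rows = join_on_key([csv1, csv2], intersection=False)
```

With `intersection=True` only key `b` survives, and its `x` value comes from the second table because the first table holds NA there. With `intersection=False` keys `a`, `b` and `c` are all present, and `a` receives NA for `y`.
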
For more complex join operations, see TableQuery. You may also use CSVListJoin to join multiple large files efficiently.
Property | Value |
---|---|
Version | 1.0 |
Bundle | tools |
Categories | Convert |
Specialties | generic |
Authors | Vladimir Rogojin (vladimir.rogojin@helsinki.fi) |
Issue tracker | View/Report issues |
Source files | component.xml JCSVJoin.java |
Usage | Example with default values |

Inputs

Name | Type | Mandatory | Description |
---|---|---|---|
csv1 | CSV | Optional | CSV file 1. |
csv2 | CSV | Optional | CSV file 2. |
csv3 | CSV | Optional | CSV file 3. |
csv4 | CSV | Optional | CSV file 4. |
csv5 | CSV | Optional | CSV file 5. |
csv6 | CSV | Optional | CSV file 6. |
csv7 | CSV | Optional | CSV file 7. |
csv8 | CSV | Optional | CSV file 8. |
csvDir | BinaryFolder | Optional | Directory containing CSV files. |
array | Array<CSV> | Optional | Array containing CSV files. |

Outputs

Name | Type | Description |
---|---|---|
csv | T (generic) | Result CSV file. |

Parameters

Name | Type | Default | Description |
---|---|---|---|
arrayKeyColumnName | string | "" | Key column name used for all CSV files from the array input. |
csvDirKeyColumnName | string | "" | Key column name used for all CSV files from the csvDir input. |
csvDirRegexp | string | ".*" | If the input port csvDir is connected, this pattern is used to select files from csvDir. Unlike the original R CSVJoin component, all subdirectories of csvDir and all files within them are considered. |
intersection | boolean | true | Defines how keys are handled; only used when useKeys=true. If intersection is true, the result contains a key if the key is present in all input files. If false, the result contains a key if the key is present in at least one input file. |
keyColumnNames | string | "" | Comma-separated list of key column names for the in-ports csv1, csv2, ..., csv8; only used when useKeys=true. The first name refers to csv1, the second to csv2, etc. An empty value refers to the first column. Empty values may be omitted from the end of the list, so all of these are equivalent: "col1" ; "col1," ; "col1,," ; etc. To set the key column for files from csvDir and from array, use the parameters csvDirKeyColumnName and arrayKeyColumnName, respectively. |
minRows | int | 0 | Fail the component if there are fewer than minRows rows of data (excluding the header). |
useKeys | boolean | true | If true, one column from each CSV file is used as the matching key column and the columns are combined. If false, rows are combined without a join. |
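
The keyColumnNames rules in the table above (position maps to port, an empty or omitted entry means the first column) can be sketched in Python. The function `resolve_key_columns` and its arguments are hypothetical, and the sketch assumes the csvN ports are connected consecutively starting from csv1. It does not check that a given name exists in the file.

```python
def resolve_key_columns(key_column_names, headers):
    """Pick the key column for each csvN input.

    key_column_names: the raw parameter value, e.g. "KEY1,KEY2".
    headers: one list of column names per input, in port order
             (csv1 first); a hypothetical representation.
    """
    names = key_column_names.split(",")
    resolved = []
    for position, header in enumerate(headers):
        # Entries past the end of the list count as empty.
        name = names[position] if position < len(names) else ""
        # An empty entry falls back to the first column of that file.
        resolved.append(name or header[0])
    return resolved


headers = [["id", "x"], ["key", "y"], ["k3", "z"]]
equivalent = [resolve_key_columns(value, headers)
              for value in ("col1", "col1,", "col1,,")]
```

All three spellings in `equivalent` resolve to the same key columns, matching the equivalence stated in the parameter description.
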

Test cases

Test case | Parameters | IN csv1 | IN csv2 | IN csv3 | IN csv4 | IN csv5 | IN csv6 | IN csv7 | IN csv8 | IN csvDir | IN array | OUT csv |
---|---|---|---|---|---|---|---|---|---|---|---|---|
case1 | (missing) | csv1 | csv2 | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | csv |
case10_array_dir_files | useKeys=false | csv1 | csv2 | csv3 | csv4 | csv5 | csv6 | csv7 | csv8 | csvDir | array | csv |
case2_union | intersection=false | csv1 | csv2 | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | csv |
case3_nokeys | useKeys=false | csv1 | csv2 | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | csv |
case4_names | keyColumnNames=KEY1,KEY2 | csv1 | csv2 | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | csv |
case5_many | useKeys=false | csv1 | csv2 | csv3 | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | csv |
case6_sepnames | (missing) | csv1 | csv2 | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | csv |
case7_sepnames_nokeys | useKeys=false | csv1 | csv2 | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | csv |
case8_many | useKeys=false | csv1 | csv2 | csv3 | csv4 | csv5 | csv6 | csv7 | csv8 | (missing) | (missing) | csv |
case9_dir | intersection=false | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | csvDir | (missing) | csv |