Produces a CSV file annotated by parsing path names that contain named groups. Each matching file gets its own row. Matching can be UNIX command line style (e.g. "*/*.txt" or "199?.txt") or it can use regular expressions. In both cases, named groups are written as {groupName}. The group name becomes a column name in the CSV file, with the value taken from the filename.

A group may specify a regular expression after a comma, as in {group,subpattern}, and that expression may in turn contain ordinary nested groups written with (). Nested groups are ignored when producing CSV columns, but they take part in matching as usual, so it is possible to express e.g. "files that end in either .fa or .fasta".

The "slashless" mode allows matching against arbitrary path names, so even the groups can contain slashes in their regular expressions. The default group regular expression matches any characters. Exclude unwanted characters with character classes, e.g. "[^\/]". In AndurilScript it is easier to write regular expressions in single quotes, because otherwise escaping becomes difficult.

When using regular expressions, remember to add a $ where the filename should end if some path names may resemble your group naming patterns, because partial matches (a prefix of the path) are allowed by default. The same may apply to the beginning of the filename with "^", though with paths such cases are uncommon.

As a curiosity, the slashless mode performs the directory walk by matching each directory before descending into it, which lets it traverse the minimum number of directories. With patterns like "ends in 9" it will still traverse every possible directory.
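The group syntax described above can be illustrated with a minimal Python sketch. This is not the component's main.py; it only shows one plausible way to turn {group,subpattern} into a Python named group and to write one CSV row per matching path. The function names, the example paths, the comma delimiter, and the choice that "*" and "?" stay within one path component in UNIX style mode are all assumptions made for illustration.

```python
import csv
import io
import re

# {name} or {name,subpattern}; subpatterns that themselves contain braces
# are not handled by this simplified sketch.
GROUP = re.compile(r"\{(\w+)(?:,([^{}]*))?\}")


def glob_to_regex(text):
    """Translate UNIX wildcards to a regex (assumed: no wildcard crosses '/')."""
    out = []
    for ch in text:
        if ch == "*":
            out.append("[^/]*")
        elif ch == "?":
            out.append("[^/]")
        else:
            out.append(re.escape(ch))
    return "".join(out)


def compile_pattern(pattern, slashless=False):
    """Build a Python regex in which every {group} is a named group."""
    parts, pos = [], 0
    for m in GROUP.finditer(pattern):
        literal = pattern[pos:m.start()]
        parts.append(literal if slashless else glob_to_regex(literal))
        name, sub = m.group(1), m.group(2)
        # A group without a subpattern matches any characters.
        parts.append("(?P<%s>%s)" % (name, sub or ".*"))
        pos = m.end()
    tail = pattern[pos:]
    parts.append(tail if slashless else glob_to_regex(tail))
    return re.compile("".join(parts))


def paths_to_csv(paths, pattern, slashless=False, filename_column="File"):
    """Return CSV text with one row per matching path."""
    rx = compile_pattern(pattern, slashless)
    # Only named groups become columns; nested () groups have no name.
    groups = sorted(rx.groupindex, key=rx.groupindex.get)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([filename_column] + groups)
    for path in paths:
        # Slashless mode accepts a match on a prefix of the path.
        m = rx.match(path) if slashless else rx.fullmatch(path)
        if m:
            writer.writerow([path] + [m.group(g) for g in groups])
    return buf.getvalue()


# Hypothetical paths, used only to exercise the sketch.
example = paths_to_csv(
    ["run1/s1/12-miRNA-24H", "run1/s1/7-miRNA-30min", "run1/s1/notes.txt"],
    r"*/*/{Sample,\d*}-miRNA-{Time,\d*(H|min)}",
)
print(example)
```

The nested group (H|min) takes part in matching but produces no column, so the output has only the File, Sample and Time columns.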
Version | 0.1 |
---|---|
Bundle | tools |
Categories | External |
Authors | Lauri Lyly (lauri.lyly@helsinki.fi) |
Issue tracker | View/Report issues |
Requires | python ; Boost regex library ; Boost python library |
Source files | component.xml main.py |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
in | BinaryFolder | Optional | Root directory; filenames are globbed relative to it. If it is omitted, the pattern is assumed to be absolute. For UNIX style globbing, this root is simply added as a prefix. For slashless matching, the directory walk needs a starting directory, which is assumed to be the leading directory part of the pattern that contains no named groups (identified by {} characters), e.g. '/this/part' in '/this/part/but_{not_this}/nor/this'. If that prefix contains special characters, most likely nothing will be found. |
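The starting-directory rule for slashless matching can be sketched as follows. This is an illustrative reading of the description above, not code taken from the component: the function names are hypothetical, the set of characters treated as "special" is a guess, and treating a pattern without any group as all prefix is an assumption.

```python
# Characters assumed to make a prefix unusable as a literal directory name.
SPECIALS = set("*?[]()|+^$\\")


def walk_root(pattern):
    """Return the directory part of the pattern that precedes the first {group}."""
    brace = pattern.find("{")
    # Assumption: with no group at all, the whole pattern is the prefix.
    head = pattern if brace == -1 else pattern[:brace]
    slash = head.rfind("/")
    if slash == -1:
        return ""  # no directory part before the first group
    return head[:slash] or "/"  # keep "/" when the prefix is the filesystem root


def has_special_characters(root):
    """True if the prefix contains characters that are unlikely in a real directory name."""
    return any(ch in SPECIALS for ch in root)


print(walk_root("/this/part/but_{not_this}/nor/this"))
print(has_special_characters(walk_root(".*HUMxdtS.*/{Sample,\\d*}")))
```

For the example from the table, the walk would start at '/this/part'. A prefix such as '.*HUMxdtS.*' is flagged, which matches the warning that nothing will likely be found in that case.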
Name | Type | Description |
---|---|---|
out | CSV | CSV file with a filename column and other columns specified by the named groups. |
Name | Type | Default | Description |
---|---|---|---|
filenameColumn | string | "File" | Column name in which filenames are written |
pattern | string | "*" | Path pattern with UNIX style wildcards (e.g. *.txt) that specifies column names. If slashless mode is used, the pattern is treated as a regular expression instead. Column names are specified by enclosing them in {}. A subpattern for a group can be given after an optional comma, as in {group,subpattern}; the subpattern is always a regular expression (.* instead of * and . instead of ?). |
slashless | boolean | false | Allow the pattern to match the whole path name, so that wildcards can match across multiple directories in a path. In this mode the pattern must be compatible with both the Boost regex library and Python regular expressions. Because this feature depends on a C++ extension, the Boost regex and Boost python libraries must be installed. The module is built automatically. |
Test case | Parameters (properties) | IN in | OUT out |
---|---|---|---|
case_glob | `pattern=*/*/*9.txt,` | (missing) | out |
case_glob_group | `pattern=*/*/{group,[^\\/_]+}_{name},` | (missing) | out |
case_glob_group_root2 | `pattern=*HUMxdtS*/*/{Sample,\\d*}-miRNA-{Time,\\d*H},` | in | out |
case_glob_group_root3 | `pattern=*HUMbzsO*/*/{Sample,[\\dIM-]*}-{Time,\\d*(H\|min)}{Dash,(-\|)}{Replicate,(\\d*\|)}/*{Mate,\\d}.fq.gz,` | in | out |
case_glob_group_root4 | `pattern=*HUMareR*/*/{Sample,\\d*}_DNA_{Number,\\d*}_{INumber,I\\d*}_{Serial1,[^_]*}_{Lane,L\\d}_{Serial2,[^_]*}_{Mate,\\d}.fq.gz,` | in | out |
case_slashless | `pattern=./.*{numEnders,.*(9\|1)}.txt,` | (missing) | out |
case_slashless_group_roo2t | `pattern=.*HUMxdtS.*/{Sample,\\d*}-miRNA-{Time,\\d*H}[^\\/]*$,` | in | out |
case_slashless_group_root | `pattern=.*/{group,[^\\/_]+}_{name,[^\\/]+}$,` | in | out |
case_slashless_group_root3 | `pattern=.*HUMbzsO.*/{Sample,[\\dIM-]*}-{Time,\\d*(H\|min)}(-\|){Replicate,(\\d*\|)}[^\\/]*/[^\\/]*{Mate,\\d}.fq.gz$,` | in | out |
case_slashless_group_root4 | `pattern=.*HUMareR.*/{Sample,\\d*}_DNA_{Number,\\d*}_{INumber,I\\d*}_{Serial1,[^_]*}_{Lane,L\\d}_{Serial2,[^_]*}_{Mate,\\d}.fq.gz,` | in | out |
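Some of the slashless test patterns end in `[^\\/]*$`. The short Python sketch below illustrates why, using the case_slashless_group_roo2t pattern hand-translated to Python syntax (each {name,subpattern} written as a Python named group). The paths are hypothetical, and Python's `re.match` stands in here for the prefix matching described for the component; the Boost regex side is not exercised.

```python
import re

# Hand-translated: {Sample,\d*} becomes (?P<Sample>\d*), and so on.
unanchored = re.compile(r".*HUMxdtS.*/(?P<Sample>\d*)-miRNA-(?P<Time>\d*H)")
anchored = re.compile(r".*HUMxdtS.*/(?P<Sample>\d*)-miRNA-(?P<Time>\d*H)[^/]*$")

# Hypothetical paths, used only for illustration.
paths = [
    "in/HUMxdtS_run/12-miRNA-24H",
    "in/HUMxdtS_run/12-miRNA-24H_rep2",
    "in/HUMxdtS_run/12-miRNA-24H/reads/a.fq",
]


def matching(rx, candidates):
    """Return the candidates for which rx matches a prefix of the path."""
    return [p for p in candidates if rx.match(p)]


print(matching(unanchored, paths))
print(matching(anchored, paths))
```

Without the anchor, all three paths match because a prefix of each satisfies the pattern. With `[^/]*$` appended, the path that continues into a subdirectory is rejected, while a suffix within the same path component (such as `_rep2`) is still accepted.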