
GroupFiles

Produces a CSV file annotated by parsing path names that contain named groups. Each file gets its own row. Matching can use UNIX command-line style wildcards (e.g. "*/*.txt" or "199?.txt") or regular expressions. In both cases, named groups are written as {groupName}. The group name becomes a column name in the CSV file, and the value is taken from the matching part of the file name.

A group may specify a regular expression after a comma, and that expression may itself contain ordinary nested groups using (). Nested groups do not produce CSV columns, but they take part in matching as usual, so it is possible to specify e.g. "files that end in either .fa or .fasta".

The "slashless" mode allows matching against arbitrary path names, so even the groups can contain slashes in their regular expressions. The default regular expression of a group matches any characters. Exclude unwanted characters with character classes, e.g. "[^\/]", where needed. In AndurilScript it is easier to write regular expressions in single quotes, because escaping becomes difficult otherwise.

When using regular expressions, add a $ where the file name should end if some path names can resemble your group naming patterns, because partial matches (a prefix of the path) are allowed by default. The same may apply to the beginning of the file name with "^", though such cases are uncommon with paths.

As a curiosity, the slashless mode performs the directory walk by matching each directory before descending into it, which lets it traverse the minimum number of directories. With patterns like "ends in 9" it will still traverse every possible directory.
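The group syntax described above can be illustrated with a minimal Python sketch. It is not taken from main.py: the helper names to_regex and columns are hypothetical, and the sketch assumes that subpatterns contain no { or } characters of their own. It rewrites {name} and {name,subpattern} into Python named groups and uses prefix matching, which mirrors the partial-match behaviour described for regular expressions.

```python
import re

# Hypothetical helper for illustration only, not the component's code.
# Matches {name} or {name,subpattern}; assumes the subpattern itself
# contains no curly braces.
GROUP = re.compile(r"\{(\w+)(?:,([^{}]*))?\}")


def to_regex(pattern, default=".*"):
    """Rewrite every {name[,subpattern]} into a Python named group."""
    def repl(m):
        name, sub = m.group(1), m.group(2)
        # A group without a subpattern falls back to "match anything".
        return "(?P<%s>%s)" % (name, sub if sub is not None else default)
    return re.compile(GROUP.sub(repl, pattern))


def columns(pattern, path):
    """Return {column name: value} for a path, or None if it does not match.

    re.match anchors only at the start, so a pattern without a trailing $
    also accepts paths that merely begin with a match.
    """
    m = to_regex(pattern).match(path)
    # groupdict() reports named groups only, so nested () groups are ignored.
    return m.groupdict() if m else None
```

Under these assumptions, columns(r'{Sample,\d+}-miRNA', '12-miRNA-extra') still yields a Sample value of 12, whereas the same pattern with a trailing $ rejects that path. A nested group such as (H|min) inside {Time,\d*(H|min)} takes part in matching but produces no column of its own.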

Version 0.1
Bundle tools
Categories External
Authors Lauri Lyly (lauri.lyly@helsinki.fi)
Issue tracker View/Report issues
Requires python ; Boost regex library ; Boost python library
Source files component.xml main.py
Usage Example with default values

Inputs

Name Type Mandatory Description
in BinaryFolder Optional Root directory that file names are globbed relative to. If omitted, the pattern is assumed to be absolute. For UNIX style globbing, this root is simply added as a prefix. For slashless matching, the directory walk needs a starting directory, which is taken to be the directory prefix of the path that contains no named groups (identified by {} characters), e.g. '/this/part' in '/this/part/but_{not_this}/nor/this'. If the prefix contains special characters, most likely nothing will be found.
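The choice of starting directory for slashless matching can be sketched as follows. This is an illustration of the rule stated above, not the component's actual code, and the function name walk_root is hypothetical. It cuts the pattern at the first { and keeps only the complete directory components before it.

```python
def walk_root(pattern):
    """Return the directory prefix of a pattern that precedes any named group.

    Hypothetical helper: the prefix is used literally, which is why regex
    special characters in it would lead to nothing being found.
    """
    brace = pattern.find("{")
    head = pattern if brace < 0 else pattern[:brace]
    # Drop the partial last component by cutting at the final slash.
    slash = head.rfind("/")
    if slash > 0:
        return head[:slash]
    # A lone leading slash means the walk starts at the file system root.
    return "/" if slash == 0 else ""
```

For the example pattern '/this/part/but_{not_this}/nor/this' this sketch returns '/this/part', because 'but_' is only part of a directory name and is therefore dropped.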

Outputs

Name Type Description
out CSV CSV file with a filename column and other columns specified by the named groups.
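As a rough picture of the output layout, the sketch below writes one row per file with the file name column followed by one column per named group. The tab delimiter, the column order and the function name write_rows are assumptions made for this illustration, not details confirmed by the component.

```python
import csv
import io


def write_rows(rows, group_names, filename_column="File"):
    """Render (path, groups) pairs as delimited text.

    Assumptions for illustration: tab-separated output and the file name
    column placed first, followed by the named groups in the given order.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[filename_column] + list(group_names),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for path, groups in rows:
        record = dict(groups)
        record[filename_column] = path
        writer.writerow(record)
    return buf.getvalue()
```

Changing filename_column corresponds to the filenameColumn parameter: the header of the first column changes while the group columns stay as named in the pattern.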

Parameters

Name Type Default Description
filenameColumn string "File" Column name in which filenames are written
pattern string "*" Path pattern with UNIX style wildcards (e.g. *.txt) that also specifies the column names. If slashless mode is used, the pattern is treated as a regular expression instead. Column names are specified by enclosing them in {}. A group may carry a subpattern after an optional comma, as in {group,subpattern}; the subpattern is always a regular expression (so use .* instead of * and . instead of ?).
slashless boolean false Allow the pattern to match the whole path name, so that wildcards can match across multiple directories in a path. This means your pattern must be compatible with both the Boost regex library and Python regular expressions. As this feature depends on a C++ extension, you should have the Boost regex and Boost python libraries installed. The module is built automatically.
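The difference between the two modes can be made concrete with a sketch of the UNIX style mode (slashless=false). It is an approximation under stated assumptions, not the component's implementation: wildcards and groups without a subpattern are assumed to stop at directory separators, the whole path is assumed to be matched, and the name glob_to_regex is hypothetical. Group subpatterns are passed through as regular expressions, while everything else is treated as a literal or a wildcard.

```python
import re

# One token is a named group, a * wildcard, or a ? wildcard.
# Assumes group subpatterns contain no curly braces of their own.
TOKEN = re.compile(r"\{(\w+)(?:,([^{}]*))?\}|(\*)|(\?)")


def glob_to_regex(pattern):
    """Translate a wildcard pattern with {name[,subpattern]} groups to a regex."""
    out, pos = [], 0
    for m in TOKEN.finditer(pattern):
        # Text between tokens is literal, so escape regex metacharacters.
        out.append(re.escape(pattern[pos:m.start()]))
        name, sub, star, _qmark = m.groups()
        if name:
            # Subpatterns are regular expressions and are copied verbatim.
            out.append("(?P<%s>%s)" % (name, sub if sub is not None else "[^/]*"))
        elif star:
            out.append("[^/]*")  # * stays within one directory level
        else:
            out.append("[^/]")   # ? matches one non-slash character
        pos = m.end()
    out.append(re.escape(pattern[pos:]))
    # \Z anchors the end so that the whole path has to match (an assumption).
    return re.compile("".join(out) + r"\Z")
```

With this sketch, the pattern of case_glob_group_root2, written without the properties-file escaping as r'*HUMxdtS*/*/{Sample,\d*}-miRNA-{Time,\d*H}', matches a path such as 'xHUMxdtSy/lane1/12-miRNA-24H' and yields the values 12 and 24H. A slashless pattern would instead be used as a regular expression directly, so .* there may span several directories.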

Test cases

Test case Parameters IN (in) OUT (out)
case_glob properties (missing) out

pattern=*/*/*9.txt,
filenameColumn=TRACK_ID,
slashless=false

case_glob_group properties (missing) out

pattern=*/*/{group,[^\\/_]+}_{name},
filenameColumn=SLACK_ID,
slashless=false

case_glob_group_root2 properties in out

pattern=*HUMxdtS*/*/{Sample,\\d*}-miRNA-{Time,\\d*H},
filenameColumn=File,
slashless=false

case_glob_group_root3 properties in out

pattern=*HUMbzsO*/*/{Sample,[\\dIM-]*}-{Time,\\d*(H|min)}{Dash,(-|)}{Replicate,(\\d*|)}/*{Mate,\\d}.fq.gz,
filenameColumn=File,
slashless=false

case_glob_group_root4 properties in out

pattern=*HUMareR*/*/{Sample,\\d*}_DNA_{Number,\\d*}_{INumber,I\\d*}_{Serial1,[^_]*}_{Lane,L\\d}_{Serial2,[^_]*}_{Mate,\\d}.fq.gz,
filenameColumn=File,
slashless=false

case_slashless properties (missing) out

pattern=./.*{numEnders,.*(9|1)}.txt,
filenameColumn=TRACK_ID,
slashless=true

case_slashless_group_roo2t properties in out

pattern=.*HUMxdtS.*/{Sample,\\d*}-miRNA-{Time,\\d*H}[^\\/]*$,
filenameColumn=File,
slashless=true

case_slashless_group_root properties in out

pattern=.*/{group,[^\\/_]+}_{name,[^\\/]+}$,
filenameColumn=SLACK_ID,
slashless=true

case_slashless_group_root3 properties in out

pattern=.*HUMbzsO.*/{Sample,[\\dIM-]*}-{Time,\\d*(H|min)}(-|){Replicate,(\\d*|)}[^\\/]*/[^\\/]*{Mate,\\d}.fq.gz$,
filenameColumn=File,
slashless=true

case_slashless_group_root4 properties in out

pattern=.*HUMareR.*/{Sample,\\d*}_DNA_{Number,\\d*}_{INumber,I\\d*}_{Serial1,[^_]*}_{Lane,L\\d}_{Serial2,[^_]*}_{Mate,\\d}.fq.gz,
filenameColumn=File,
slashless=true


Generated 2019-02-08 07:42:16 by Anduril 2.0.0