Computes DNA region set operations such as union and overlap. Transformations are defined using an expression syntax such as union(r1, length(5, 15, "r2")). Here, union and length are functions, and r1 and r2 refer to the keys of region sets in the input array. Region set references can be quoted or unquoted. See the SetTransformer component for an analogous interface for computing set operations.
This component is implemented using, and follows the API of, GROK (Genomic Region Operation Kit). See the GROK documentation for details. The GROK API can be used to implement more complex analyses than are possible with the simple function expressions described below; a full Python script can be supplied through the scriptFile input port. The test cases present some common use cases. However, they were not originally written for this Anduril component, so they may not be representative of its real use; contributions are welcome.
Region transformation functions are divided into three basic types. Type P (identity preserving) functions preserve region identity and consider regions as indivisible relations. Type A (annotation preserving) functions modify properties of regions, but migrate annotations to new regions. Type N (non-preserving) functions operate at sequence (location) level and only preserve score annotations; new scores are computed using customizable aggregate functions.
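The difference between region-level (type P) and location-level (type N) operations can be illustrated with a toy model in plain Python. This is not GROK code: the tuple representation (chromosome, start, end) with half-open coordinates and the helper names union_regions and union_locations are assumptions made purely for illustration.

```python
# Toy model, NOT the GROK API: a region is (chromosome, start, end), half-open.

def union_regions(*region_sets):
    """Type P style: regions are indivisible; keep each distinct region."""
    seen = set()
    for regions in region_sets:
        seen.update(regions)
    return sorted(seen)

def union_locations(*region_sets):
    """Type N style: work on covered positions; overlapping regions fuse."""
    merged = []
    for chrom, start, end in sorted(r for rs in region_sets for r in rs):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged

r1 = [("chr1", 100, 200)]
r2 = [("chr1", 150, 300)]

print(union_regions(r1, r2))    # both original regions survive unchanged
print(union_locations(r1, r2))  # a single region covering positions 100-300
```

In this sketch the region-level union keeps the two overlapping regions as separate entries, while the location-level union reports only which positions are covered, so the original region boundaries are lost.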
Function definitions are below. Here, regs... indicates any number of region set arguments, and reg/reg1/reg2 indicate single region sets. [X] indicates an optional argument. Functions may also have integer or string arguments. Functions with definitions of the form reg.function are method calls: "reg" is a region set reference, such as reg1.
Definition | Type | Description |
---|---|---|
union(regs...) | P | Regions that are present in any region set. |
unionL(regs...) | N | Locations that are present in any region set. |
intersection(regs...) | P | Regions that are present in all region sets. |
intersectionL(regs...) | N | Locations that are present in all region sets. |
freq(low, high, regs...) | P | Regions that are present in at least low and at most high region sets. |
freqL(low, high, regs...) | N | Locations that are present in at least low and at most high region sets. |
diff(reg1, reg2) | P | Regions that are present in reg1 but not in reg2. |
diffL(reg1, reg2) | N | Locations that are present in reg1 but not in reg2. |
reg.strand(n) | P | Regions whose strand matches given n (numeric): -1 for reverse, 0 for any or 1 for forward strand. |
reg.expand(start, end) | A | Expand regions by start elements from start of region and by end elements from end of region. Negative values shrink regions. Takes strands into account. |
reg.shift(n) | A | Shift regions by n positions in sense direction (right for forward strand, left for reverse strand). Negative values shift in anti-sense direction. |
reg.flip([fixed]) | A | Change the strand of regions: forward becomes reverse and vice versa. If fixed (numeric: -1/0/1) is given, all regions have this strand instead. |
reg.merge([gap]) | N | Merge regions whose gap is at most gap positions. gap defaults to 0. |
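
The strand-aware behaviour described for expand, shift, flip and merge can be sketched with a toy model. This is one reading of the descriptions in the table, not GROK's implementation: the (start, end, strand) tuples with half-open coordinates, the handling of strand 0 as forward, and the strand-agnostic merge are all assumptions.

```python
# Toy model, NOT the GROK API: a region is (start, end, strand), half-open,
# with strand -1 (reverse), 0 (any) or 1 (forward).

def expand(region, start, end):
    """Grow by `start` at the region's start and `end` at its end (sense-wise)."""
    left, right, strand = region
    if strand == -1:
        # Assumption: on the reverse strand the region starts at its right edge.
        return (left - end, right + start, strand)
    return (left - start, right + end, strand)

def shift(region, n):
    """Move the region n positions in sense direction."""
    left, right, strand = region
    step = -n if strand == -1 else n
    return (left + step, right + step, strand)

def flip(region, fixed=None):
    """Invert the strand, or force it to `fixed` when given."""
    left, right, strand = region
    return (left, right, -strand if fixed is None else fixed)

def merge(regions, gap=0):
    """Fuse regions separated by at most `gap` positions (strand ignored here)."""
    merged = []
    for left, right, _ in sorted(regions):
        if merged and left - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged

fwd = (100, 200, 1)
rev = (100, 200, -1)
print(expand(fwd, 10, 0))  # (90, 200, 1)
print(expand(rev, 10, 0))  # (100, 210, -1)
print(shift(rev, 5))       # (95, 195, -1)
print(flip(fwd))           # (100, 200, -1)
print(merge([(0, 10, 1), (12, 20, 1), (30, 40, 1)], gap=2))  # [(0, 20), (30, 40)]
```

Note how the same expand(10, 0) call extends the forward region to the left and the reverse region to the right, which is what "takes strands into account" is taken to mean in this sketch.
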
Version | 0.3 |
---|---|
Bundle | sequencing |
Categories | |
Authors | Kristian Ovaska (kristian.ovaska@helsinki.fi), Lauri Lyly (lauri.lyly@helsinki.fi) |
Requires | python ; python-dev (DEB) ; installer (bash) |
Source files | component.xml RegionTransformer.py |
Name | Type | Mandatory | Description |
---|---|---|---|
regions | Array<DNARegion2> | Optional | Source region sets. In the transformation expression, each region set is referred to by its key in the array: if the key is "r1", the corresponding FileReader region set is available in the variable "r1". The region sets can also be accessed through the "readers" dictionary; readers["r1"] yields the same object. |
region_set | DNARegion2 | Optional | Single source region set. In the transformation expression, this region set is referred to as "region_set". |
folder | BinaryFolder | Optional | Source region sets - can be of any file type known by GROK. In the transformation expression, each region set is referred to via its filename, as folder["my_regions.csv"]. This yields a FileReader region set. |
scriptFile | PythonSource | Optional | Python script to evaluate instead of the transform function parameter, if specified. The "regions" array's keys are turned into local variables and may be used to refer to corresponding FileReader region stores in the script. The script will have certain other variables visible in its global scope, including all GROK functions, and variables corresponding to the inputs and outputs. |
Name | Type | Description |
---|---|---|
result | DNARegion2 | Result region set. It is set automatically when the transformation expression is given in the "script" parameter rather than through the scriptFile input. A script can also write to it by storing a region set in the "result" variable, which is then written to this output port. Alternatively, a script can write to the file cf.get_output("result") directly, with or without a writer. |
array | Array<DNARegion2> | Optional array of produced region sets. Files are automatically added to the array by writing array["myfile"]. The file extension is appended automatically; it defaults to "csv" and may be changed with e.g. array.set_type("bam"). All of GROK's output types are supported. Inside the script, the variable called array is actually an "AndurilOutputArray" from the Python anduril module. |
folder | BinaryFolder | Optional output folder. The path is visible in the script as "folderOutput". A convenient way to specify the path is then e.g. os.path.join(folderOutput, "myfile.csv"). For this, you need to import the os module. |
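
Building a path inside the output folder, as described above, might look like the following minimal sketch. Here folderOutput is replaced by a temporary directory so that the snippet runs on its own; inside a real script the component provides the variable. The file contents, including the column header, are hypothetical.

```python
import os
import tempfile

# Stand-in for the variable the component provides; in a real script
# "folderOutput" would already be defined.
folderOutput = tempfile.mkdtemp()

out_path = os.path.join(folderOutput, "myfile.csv")
with open(out_path, "w") as handle:
    handle.write("chromosome,start,end\n")  # hypothetical header line
    handle.write("chr1,100,200\n")          # hypothetical region

print(os.listdir(folderOutput))  # ['myfile.csv']
```
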
Name | Type | Default | Description |
---|---|---|---|
script | string | "" | Same as the scriptFile input, except that it can contain only a single expression, whose value is written to the "result" output. The expression is evaluated with Python's "eval" function. Used only if the scriptFile input is unspecified. |
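
How a single expression could be evaluated with eval against a namespace built from the region set keys can be sketched as follows. This is an illustration under stated assumptions, not the component's actual code: the stub diffL operates on plain Python sets of positions instead of GROK region sets, and the way the namespace is assembled is a guess based on the descriptions of the "regions" input and the "readers" dictionary.

```python
# Sketch only: stub stand-ins for region sets and one GROK-style function.
def diffL(reg1, reg2):
    """Stub: positions in reg1 that are not in reg2 (sets of integers here)."""
    return reg1 - reg2

# Hypothetical "regions" input array: key -> region set.
readers = {"r1": {1, 2, 3, 4}, "r2": {3, 4, 5}}

# Assumption: array keys become variables, and "readers" stays accessible too.
namespace = {"diffL": diffL, "readers": readers}
namespace.update(readers)

script = 'diffL(r1, readers["r2"])'
result = eval(script, {"__builtins__": {}}, namespace)
print(result)  # {1, 2}
```

Because eval accepts only an expression, statements such as assignments or imports are not possible in the "script" parameter; those require a full script supplied through the scriptFile input.
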
Test case | Parameters | IN regions | IN region_set | IN folder | IN scriptFile | OUT result | OUT array | OUT folder |
---|---|---|---|---|---|---|---|---|
case1 | properties; script=diffL(r1, r2) | regions | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |
case2-annotations | properties | regions | (missing) | (missing) | scriptFile | result | array | folder |
case3-filtering | properties; script=diffL(r1, r2) | regions | (missing) | (missing) | scriptFile | result | array | (missing) |
case4-overlap | properties; script=diffL(r1, r2) | regions | (missing) | (missing) | scriptFile | result | array | (missing) |
case5-partition | properties; script=diffL(r1, r2) | regions | (missing) | (missing) | scriptFile | result | array | (missing) |
case6-read-iterate | properties; script=diffL(r1, r2) | regions | (missing) | folder | scriptFile | result | array | folder |
case7-setops | properties; script=diffL(r1, r2) | regions | (missing) | folder | scriptFile | result | array | (missing) |
case8-stores | properties; script=diffL(r1, r2) | regions | (missing) | (missing) | scriptFile | result | array | folder |
case9-transformations | properties; script=diffL(r1, r2) | regions | (missing) | (missing) | scriptFile | result | array | (missing) |