Post-processing tool for gene fusion analysis.
Warning: Do not run more than one instance of this component at a time, or the database will break. The Pegasus database cannot handle more than one connection at a time. If you really need to run several instances in parallel, each one needs its own Pegasus installation.
The component takes as input an array of FusionCaller outputs. Each folder must contain a file ".tool" that names the tool used, and a file "final" that is used as the data input for this component. The component analyses the output of several fusion detection tools (ChimeraScan, FusionCatcher, deFuse, SOAPfuse, EricScript, StarFusion):
ChimeraScan output file: "chimeras.bedpe"
FusionCatcher output file: "final-list_candidate-fusion-genes.txt"
deFuse output file: "results.tsv" or "results.filtered.tsv" or "results.classify.tsv"
SOAPfuse output file: "*.final.Fusion.specific.for.genes"
EricScript output file: "*.results.total.tsv" or "*.results.filtered.tsv"
StarFusion output file: "star-fusion.fusion_predictions.abridged.tsv"
The valid values for the .tool file are: "chimerascan", "fusioncatcher", "defuse", "ericscript", "soapfuse", "starfusion". You can also use the short acronyms: CS, FS, DF, ES, SF, Star.
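The check on the .tool file could be sketched in Python as follows. This is a minimal illustration, not the component's own code: `normalize_tool_name` and `read_tool_file` are hypothetical helpers, the pairing of each acronym with a tool assumes both lists above are given in the same order, and the case-insensitive comparison is likewise an assumption.

```python
import os

# Assumption: acronyms pair with tool names in the order both lists are given.
TOOL_ALIASES = {
    "cs": "chimerascan",
    "fs": "fusioncatcher",
    "df": "defuse",
    "es": "ericscript",
    "sf": "soapfuse",
    "star": "starfusion",
}
VALID_TOOLS = set(TOOL_ALIASES.values())


def normalize_tool_name(raw):
    """Return the full tool name for a .tool value, or raise ValueError."""
    # Assumption: surrounding whitespace and letter case are not significant.
    key = raw.strip().lower()
    if key in VALID_TOOLS:
        return key
    if key in TOOL_ALIASES:
        return TOOL_ALIASES[key]
    raise ValueError("Unsupported .tool value: %r" % raw)


def read_tool_file(folder):
    """Read <folder>/.tool and return the normalized tool name."""
    with open(os.path.join(folder, ".tool")) as handle:
        return normalize_tool_name(handle.read())
```

Running such a check over every input folder before launching the component would catch a misspelled tool name early, before any of the longer steps start.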
It is possible to skip the filtering step and use the csvs input to run Pegasus and Oncofuse directly. Each file in the csvs input must list the fusions with the following columns, in this order:
gene1_name gene2_name ensembl_gene_id1 ensembl_gene_id2 start_position1:end_position1:gene_strand1 start_position2:end_position2:gene_strand2 chr1:bp1-chr2:bp2 span_cnt encompassing_cnt
The names of the column headers do not matter, but the column order and the gene coordinate format must be respected. Calls from different tools can be merged in the same file, but each sample needs its own file, since that is how samples are submitted to Pegasus. Since there is no sample column in the file, each filename is used as the sample name.
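To make the expected column layout concrete, the following Python sketch parses one such file into typed records. It is an illustration under stated assumptions rather than the component's parser: the tab delimiter, the presence of a single header row, and the names `Fusion`, `parse_gene_coord`, `parse_breakpoint` and `read_fusions` are all assumptions introduced here.

```python
import csv
import re
from collections import namedtuple

# Field names are hypothetical; only the column order comes from the text above.
Fusion = namedtuple("Fusion", [
    "gene1", "gene2", "ensembl1", "ensembl2",
    "gene1_coord", "gene2_coord", "breakpoint",
    "span_cnt", "encompassing_cnt",
])

BREAKPOINT_RE = re.compile(r"^(.+?):(\d+)-(.+):(\d+)$")


def parse_gene_coord(text):
    """Split 'start:end:strand' into (int start, int end, str strand)."""
    start, end, strand = text.split(":")
    return int(start), int(end), strand


def parse_breakpoint(text):
    """Split 'chr1:bp1-chr2:bp2' into ((chr1, bp1), (chr2, bp2))."""
    match = BREAKPOINT_RE.match(text)
    if match is None:
        raise ValueError("Malformed breakpoint: %r" % text)
    chrom1, bp1, chrom2, bp2 = match.groups()
    return (chrom1, int(bp1)), (chrom2, int(bp2))


def read_fusions(handle, delimiter="\t"):
    """Parse a fusion table into Fusion tuples; the header row is skipped."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # header names do not matter, only the column order
    fusions = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != 9:
            raise ValueError("Expected 9 columns, got %d" % len(row))
        fusions.append(Fusion(
            row[0], row[1], row[2], row[3],
            parse_gene_coord(row[4]), parse_gene_coord(row[5]),
            parse_breakpoint(row[6]), int(row[7]), int(row[8]),
        ))
    return fusions
```

A row that fails to parse here would most likely be rejected downstream as well, so a check of this kind is a cheap way to validate hand-made files before submitting them.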
The Pegasus installation created by the installation script found in sequencing/lib allows Pegasus to run with the hg38 build. Errors may have been introduced during this conversion, so use it at your own risk.
The component consists of several steps:
The component outputs are: a main output file containing a list of fusions with high oncogenic probability (as described in step 4), and an output folder. The output folder contains:
PEGASUS_HOME=Pegasus_home_directory
java -jar $PEGASUS_HOME/jars/QueryFusionDatabase.jar -t FUSIONS_COMPLETE_ID -c deleteAll -d $PEGASUS_HOME/resources/hsqldb-2.2.7/hsqldb/mydb
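If the `deleteAll` command above needs to be issued from a script, its argument list could be assembled in Python along these lines. This is only a sketch: `build_delete_all_command` is a hypothetical helper, and the jar path, flags and database path are copied from the command shown above rather than taken from the Pegasus documentation.

```python
import os


def build_delete_all_command(pegasus_home):
    """Return the argument list for the deleteAll command shown above."""
    jar = os.path.join(pegasus_home, "jars", "QueryFusionDatabase.jar")
    database = os.path.join(
        pegasus_home, "resources", "hsqldb-2.2.7", "hsqldb", "mydb"
    )
    return [
        "java", "-jar", jar,
        "-t", "FUSIONS_COMPLETE_ID",
        "-c", "deleteAll",
        "-d", database,
    ]
```

The resulting list can be passed to `subprocess.check_call`. Given the single-connection limitation of the Pegasus database, it should only be run while no instance of the component is active.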
Before running the component, install the required tools (FusionMatcher, Pegasus, Oncofuse) with the respective installation scripts in $ANDURIL_HOME/bundles/sequencing/lib/install_scripts/.
Version | 1.0 |
---|---|
Bundle | sequencing |
Categories | Analysis Annotation |
Authors | Gabriele Partel (gabrielepartel@gmail.com) |
Issue tracker | View/Report issues |
Requires | Python 2.7.x ; HTSeq (python) ; numpy (python) ; pandas (python) ; sklearn (python) ; biomaRt (R-bioconductor) ; installer (bash) |
Source files | component.xml main.sh |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
in | Array<BinaryFolder> | Optional | Input folders (e.g. from FusionCaller). Each folder must contain a .tool file with the name of the tool used. |
csvs | BinaryFolder | Optional | A folder containing all properly formatted csv files, with each file named ID_TOOL. |
Name | Type | Description |
---|---|---|
out | CSV | Output file containing a list of fusion predictions with high oncogenic probability. |
folder | BinaryFolder | Output folder with results from all the steps. |
Name | Type | Default | Description |
---|---|---|---|
datasetID | string | "tumor" | Identifying dataset ID. |
filter | int | 1 | Minimum number of spanning reads mapping across the fusion junction. Fusions with fewer supporting reads spanning the junction are filtered out. |
keepDB | boolean | true | If false, erases Pegasus fusion database before run. |
matching | string | "subset" | FusionMatcher matching method (overlap, subset, egm). Overlap matches when two gene sets have one or more genes in common. Subset matches when one gene set is a subset of the other. EGM is exact gene matching: all genes in both sets need to be identical to match (see the FusionMatcher documentation for details). |
pegasus_home | string | "" | Path to the Pegasus installation. If left blank, Anduril's installation of Pegasus is used. |
runFuma | boolean | false | Run FuMa (FusionMatcher). It tends to take a long time to run. |
skipFiltering | boolean | false | Set to true if you have already filtered the fusions and have them in the correct format (gene1_name gene2_name gene1_ID gene2_ID gene1_coord gene2_coord bp spanning encompassing); the filtering step is then skipped and the csvs input is used. |
th1 | string | "0.7" | Threshold 1 used in Step 4. |
th2 | string | "0.8" | Threshold 2 used in Step 4. |
th3 | string | "0.4" | Threshold 3 used in Step 4. |
th4 | string | "0.5" | Threshold 4 used in Step 4. |
tissueType | string | "AVG" | Oncofuse tissue_type parameter, which tells Oncofuse to use its own pre-built gene expression libraries. There are four pre-built libraries, corresponding to the four supported tissue types: EPI (epithelial origin), HEM (hematological origin), MES (mesenchymal origin) and AVG (average expression, if tissue source is unknown). |
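The three values of the matching parameter can be read as set operations. The sketch below restates the descriptions in the table in Python. It is not FusionMatcher's implementation, and `genes_match` is a hypothetical name used only for illustration.

```python
def genes_match(set_a, set_b, method="subset"):
    """Compare two gene sets using the matching rules described above."""
    a, b = frozenset(set_a), frozenset(set_b)
    if method == "overlap":
        # At least one gene is present in both sets.
        return bool(a & b)
    if method == "subset":
        # One set is entirely contained in the other.
        return a <= b or b <= a
    if method == "egm":
        # Exact gene matching: both sets hold the same genes.
        return a == b
    raise ValueError("Unknown matching method: %r" % method)
```

Read this way, the methods get progressively stricter: every pair of non-empty sets that matches under egm also matches under subset, and every such pair that matches under subset also matches under overlap.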
Test case | Parameters | IN in | IN csvs | OUT out | OUT folder |
---|---|---|---|---|---|
case1 | properties (keepDB=false) | in | (missing) | (missing) | folder |
case2 | properties (keepDB=false) | (missing) | csvs | (missing) | folder |