

Post-processing tool for gene fusion analysis.

Warning: Do not run more than one instance of this component at a time. The Pegasus database cannot handle more than one connection at a time, and concurrent runs will break it. If you need to run several instances in parallel, use a separate Pegasus installation for each of them.

The component takes as input an array of FusionCaller outputs. Each input folder must contain a file ".tool" that names the tool used, and a file "final" that is used as the data input for this component. The component analyses the outputs of several fusion detection tools (ChimeraScan, FusionCatcher, deFuse, SOAPfuse, EricScript, StarFusion):

  • ChimeraScan output file: "chimeras.bedpe"
  • FusionCatcher output file: "final-list_candidate-fusion-genes.txt"
  • deFuse output file: "results.tsv" or "results.filtered.tsv" or "results.classify.tsv"
  • SOAPfuse output file: "*.final.Fusion.specific.for.genes"
  • EricScript output file: "*.results.total.tsv" or "*.results.filtered.tsv"
  • StarFusion output file: "star-fusion.fusion_predictions.abridged.tsv"

The valid values for the ".tool" file are: "chimerascan", "fusioncatcher", "defuse", "ericscript", "soapfuse", "starfusion". You can also use the short acronyms: CS, FS, DF, ES, SF, Star.
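The folder requirements above can be checked before a run. The following Python 3 sketch is only an illustration: `check_input_folder` is a hypothetical helper, not part of the component, and it assumes that the ".tool" value is matched exactly (case-sensitively) against the names and acronyms listed above.

```python
import os
import tempfile

# Valid ".tool" values as listed in this documentation.
VALID_TOOL_NAMES = {"chimerascan", "fusioncatcher", "defuse",
                    "ericscript", "soapfuse", "starfusion"}
VALID_TOOL_ACRONYMS = {"CS", "FS", "DF", "ES", "SF", "Star"}


def check_input_folder(folder):
    """Return the tool name stored in folder/.tool, or raise ValueError.

    Hypothetical helper. Assumption: values are compared exactly,
    without any case normalization.
    """
    tool_path = os.path.join(folder, ".tool")
    final_path = os.path.join(folder, "final")
    if not os.path.isfile(tool_path):
        raise ValueError("missing .tool file in " + folder)
    if not os.path.exists(final_path):
        raise ValueError("missing 'final' file in " + folder)
    with open(tool_path) as handle:
        tool = handle.read().strip()
    if tool not in VALID_TOOL_NAMES and tool not in VALID_TOOL_ACRONYMS:
        raise ValueError("unknown tool name: " + repr(tool))
    return tool


# Demo on a throwaway folder with placeholder content.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, ".tool"), "w") as handle:
    handle.write("defuse\n")
open(os.path.join(demo_dir, "final"), "w").close()
print(check_input_folder(demo_dir))  # prints: defuse
```

Running a check like this over every folder in the input array would surface a missing or misspelled ".tool" file before the longer steps start.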

It is possible to skip the filtering step and use the csvs input to run Pegasus and Oncofuse directly. Each CSV file must list the fusions with the following columns, in this order:

    gene1_name gene2_name ensembl_gene_id1 ensembl_gene_id2 start_position1:end_position1:gene_strand1 start_position2:end_position2:gene_strand2 chr1:bp1-chr2:bp2 span_cnt encompassing_cnt

The names of the column headers do not matter, but the column order and the gene coordinate format must be respected. Calls from different tools can be merged in the same file, but each sample needs its own file, since that is how samples are submitted to Pegasus. Because the file has no sample column, each filename is used as the sample name.
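The column layout above can be illustrated with a small Python 3 parser. This is a sketch under assumptions, not the component's own reader: the tab delimiter, the presence of one header row, and the names `Fusion`, `parse_gene_coord`, `parse_breakpoint` and `read_fusions` are all hypothetical, and the sample row contains placeholder values only.

```python
import csv
import io
from collections import namedtuple

# Field names follow the column order given in this documentation.
Fusion = namedtuple("Fusion", [
    "gene1_name", "gene2_name", "ensembl_gene_id1", "ensembl_gene_id2",
    "gene1_coord", "gene2_coord", "breakpoint", "span_cnt", "encompassing_cnt",
])


def parse_gene_coord(text):
    """Split 'start:end:strand' into (int start, int end, strand string)."""
    start, end, strand = text.split(":")
    return int(start), int(end), strand


def parse_breakpoint(text):
    """Split 'chr1:bp1-chr2:bp2' into ((chr1, bp1), (chr2, bp2))."""
    left, right = text.split("-")
    chrom1, bp1 = left.split(":")
    chrom2, bp2 = right.split(":")
    return (chrom1, int(bp1)), (chrom2, int(bp2))


def read_fusions(handle, delimiter="\t"):
    """Read fusions by column position; the delimiter is an assumption."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # header names are ignored, only the column order matters
    fusions = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        fusions.append(Fusion(
            row[0], row[1], row[2], row[3],
            parse_gene_coord(row[4]), parse_gene_coord(row[5]),
            parse_breakpoint(row[6]), int(row[7]), int(row[8]),
        ))
    return fusions


# Placeholder data, not a real fusion call.
example = (
    "g1\tg2\tid1\tid2\tcoord1\tcoord2\tbp\tspan\tencompassing\n"
    "GENE_A\tGENE_B\tENSG_A\tENSG_B\t100:200:+\t300:400:-\tchr1:150-chr2:350\t5\t7\n"
)
fusions = read_fusions(io.StringIO(example))
print(fusions[0].breakpoint)  # prints: (('chr1', 150), ('chr2', 350))
```

Reading by position rather than by header name mirrors the rule that header names do not matter while the column order does.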

The Pegasus installation created by the installation script found in sequencing/lib allows Pegasus to run with the hg38 build. Errors may have been introduced during this conversion, so use it at your own risk.

The component consists of several steps:

  1. Step1: It matches predicted fusion events (fusion detection tool outputs) according to chromosomal location and corresponding annotated gene using FusionMatcher (https://github.com/yhoogstrate/fuma), producing the overlapping set of fusions shared between samples and tools.
  2. Step2: It filters out fusion candidates if: a) the number of supporting reads mapping across the fusion junction is less than the "filter" parameter, b) the genes involved in the fusion are homologous, c) the genes involved in the fusion are not functionally annotated.
  3. Step3: It prioritizes and annotates fusion candidates predicting their oncogenic potential using Pegasus* (https://github.com/RabadanLab/Pegasus) and Oncofuse (http://www.unav.es/genetica/oncofuse.html).
  4. Step4: It combines the results of Pegasus and Oncofuse, listing in the main output file only the fusions whose oncogenic probability values (according to Pegasus and Oncofuse) meet the following thresholds:
    1. Pegasus oncogenic probability >= "th1" parameter
    2. Oncofuse oncogenic probability >= "th2" parameter
    3. (Pegasus oncogenic probability >= "th3" parameter) and (Oncofuse oncogenic probability >= "th4" parameter). (N.b.: th3 and th4 should be smaller than th1 and th2)
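The threshold logic of Step 4 can be sketched in Python 3 as follows. This is an interpretation, not the component's code: it assumes that a fusion is kept when at least one of the three conditions holds, which is the reading suggested by the note that th3 and th4 should be smaller than th1 and th2. The function name `passes_thresholds` is hypothetical, and the default values are the parameter defaults documented for this component.

```python
def passes_thresholds(pegasus_p, oncofuse_p, th1=0.7, th2=0.8, th3=0.4, th4=0.5):
    """Return True if a fusion would be listed in the main output file.

    Assumption: the three conditions of Step 4 are combined with OR.
    """
    strong_pegasus = pegasus_p >= th1                   # condition 1
    strong_oncofuse = oncofuse_p >= th2                 # condition 2
    joint = pegasus_p >= th3 and oncofuse_p >= th4      # condition 3
    return strong_pegasus or strong_oncofuse or joint


print(passes_thresholds(0.75, 0.10))  # prints: True  (condition 1)
print(passes_thresholds(0.45, 0.55))  # prints: True  (condition 3)
print(passes_thresholds(0.30, 0.70))  # prints: False (no condition met)
```

Under this reading, the lower joint thresholds let a fusion through when both tools agree moderately, even if neither score is high on its own.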

The component outputs are a main output file containing the list of fusions with high oncogenic probability (as described in Step 4) and an output folder. The output folder contains:

  1. "overlapping.coord.txt" produced by FusionMatcher, describing the overlapping set of fusions.
  2. List of fusions that didn't pass the filtering step ("SAMPLE_ID_TOOL.filtered_out.txt").
  3. Oncofuse output files ("SAMPLE_ID_TOOL.oncofuse.txt").
  4. Pegasus output file ("pegasus.output.txt").
  • Pegasus has an internal database in which it stores every fusion it analyzes, so that it can keep track of fusions that are found again in other datasets in future runs. To erase this database, run the command:
        java -jar $PEGASUS_HOME/jars/QueryFusionDatabase.jar -t FUSIONS_COMPLETE_ID -c deleteAll -d $PEGASUS_HOME/resources/hsqldb-2.2.7/hsqldb/mydb
  • Consider installing your own Pegasus if you want to test the database.
  • Before running the component, install the required tools (FusionMatcher, Pegasus, Oncofuse) with their respective installation scripts in $ANDURIL_HOME/bundles/sequencing/lib/install_scripts/.
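If the database reset needs to be scripted, the erase command from the notes above can be assembled in Python 3. This is a minimal sketch: `build_erase_command` is a hypothetical helper, and the jar path, options and database path are copied from the command shown above rather than taken from the Pegasus documentation.

```python
import os


def build_erase_command(pegasus_home):
    """Build the argument list for the database-erasing command.

    Hypothetical helper; it only assembles the command and does not run it.
    """
    jar = os.path.join(pegasus_home, "jars", "QueryFusionDatabase.jar")
    database = os.path.join(
        pegasus_home, "resources", "hsqldb-2.2.7", "hsqldb", "mydb")
    return ["java", "-jar", jar,
            "-t", "FUSIONS_COMPLETE_ID",
            "-c", "deleteAll",
            "-d", database]


# To actually erase the database (destructive), one could run:
#   import subprocess
#   subprocess.check_call(build_erase_command(os.environ["PEGASUS_HOME"]))
command = build_erase_command("/path/to/pegasus")  # placeholder path
print(" ".join(command))
```

Passing the command as a list avoids shell quoting problems if the installation path contains spaces.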

    Version 1.0
    Bundle sequencing
    Categories Analysis Annotation
    Authors Gabriele Partel (gabrielepartel@gmail.com)
    Requires Python 2.7.x ; HTSeq (python) ; numpy (python) ; pandas (python) ; sklearn (python) ; biomaRt (R-bioconductor) ; installer (bash)
    Source files component.xml main.sh


    Inputs

    Name  Type                 Mandatory  Description
    in    Array<BinaryFolder>  Optional   Input folders (e.g. from FusionCaller). Each folder must contain a ".tool" file naming the tool used.
    csvs  BinaryFolder         Optional   Folder containing all properly formatted CSV files, each file named ID_TOOL.


    Outputs

    Name    Type          Description
    out     CSV           Output file containing a list of fusion predictions with high oncogenic probability.
    folder  BinaryFolder  Output folder with results from all the steps.


    Parameters

    Name           Type     Default   Description
    datasetID      string   "tumor"   Identifier of the dataset.
    filter         int      1         Minimum number of spanning reads mapping across the fusion junction. Fusions supported by fewer junction-spanning reads are filtered out.
    keepDB         boolean  true      If false, erases the Pegasus fusion database before the run.
    matching       string   "subset"  FusionMatcher matching method (overlap, subset, egm). Overlap matches when two gene sets have one or more genes in common. Subset matches when one gene set is a subset of the other. EGM is exact gene matching: all genes in both sets need to be identical to match (see the FusionMatcher documentation for details).
    pegasus_home   string   ""        If left blank, Anduril's installation of Pegasus is used.
    runFuma        boolean  false     Run FusionMatcher (fuma). It tends to take a long time to run.
    skipFiltering  boolean  false     Skip the filtering step. Set to true if you have already filtered the fusions and have them in the correct format (gene1_name gene2_name gene1_ID gene2_ID gene1_coord gene2_coord bp spanning encompassing).
    th1            string   "0.7"     Threshold 1 used in Step 4.
    th2            string   "0.8"     Threshold 2 used in Step 4.
    th3            string   "0.4"     Threshold 3 used in Step 4.
    th4            string   "0.5"     Threshold 4 used in Step 4.
    tissueType     string   "AVG"     Oncofuse tissue_type parameter, which tells Oncofuse to use its own pre-built gene expression libraries. There are four pre-built libraries, corresponding to the four supported tissue types: EPI (epithelial origin), HEM (hematological origin), MES (mesenchymal origin) and AVG (average expression, if tissue source is unknown).
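The three values of the "matching" parameter can be illustrated with plain set operations. The Python 3 sketch below only mirrors the descriptions given for this parameter; it is not FusionMatcher's implementation, and `gene_sets_match` is a hypothetical name. Gene names in the demo are placeholders.

```python
def gene_sets_match(genes_a, genes_b, method="subset"):
    """Compare two gene sets following the descriptions of the matching methods.

    Note: with plain set semantics an empty set is a subset of any set;
    how the real tool treats empty gene sets is not covered here.
    """
    a, b = set(genes_a), set(genes_b)
    if method == "overlap":
        return len(a & b) > 0          # one or more genes in common
    if method == "subset":
        return a <= b or b <= a        # one set contained in the other
    if method == "egm":
        return a == b                  # exact gene matching
    raise ValueError("unknown matching method: " + repr(method))


print(gene_sets_match({"GENE_A", "GENE_B"}, {"GENE_B", "GENE_C"}, "overlap"))  # prints: True
print(gene_sets_match({"GENE_A", "GENE_B"}, {"GENE_B", "GENE_C"}, "subset"))   # prints: False
print(gene_sets_match({"GENE_A"}, {"GENE_A", "GENE_B"}, "subset"))             # prints: True
```

The methods are ordered from most permissive (overlap) to strictest (egm): any pair that matches under egm also matches under subset, and any pair of non-empty sets that matches under subset also matches under overlap.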

    Test cases

    Test case  Parameters  in         csvs       out        folder
    case1      properties  in         (missing)  (missing)  folder
    case2      properties  (missing)  csvs       (missing)  folder


    Generated 2019-02-08 07:42:12 by Anduril 2.0.0