Calls CNV ratios/gains/losses using the existing programs PanelDoc (developed by Nord et al. 2011) or CNVPanelizer (Bioconductor package), or both, with custom filtering available. The original scripts have been modified; however, no functional changes have been made. Both programs use depth of coverage to call CNVs from targeted sequencing data. It is recommended to study the original documentation before running this component. All comments and recommendations given here are based on personal experience with the programs and do not represent the opinions of the original authors.
All parameters needed for PanelDoc are marked with 'PanelDoc' and all parameters for CNVPanelizer with 'CNVPanelizer'.
PanelDoc
The scripts implemented in this component were created by Nord et al. and are described in the article 'Accurate and exact CNV identification from targeted high-throughput sequence data' (Nord et al., 2011; PubMed ID: 21486468) and in the GitHub repository https://github.com/ammawla/PanelDoC. The component also includes some custom analyses of the PanelDoc output: run_Filtering, run_per_gene and run_Combine_results. To make the scripts easier to use, the input structure in this Anduril component differs from the one described in the GitHub README; however, the names match the original inputs, so the contents are the same.
To run the component you will need a set of input files and two folders, which must be named 'coverage' and 'Genome'. The coverage folder should contain all coverage files needed in the analysis, meaning one coverage file for each region in each sample, named [Samplename]_[regionname].depth. The extension must be .depth, the sample name must match the sample name in the sample file (see parameter 'samples') and the region name must match the region name in the partition file (see parameter 'partition_regions'). For example: Patient01_tissue01_cancer_geneXX.depth. Each file should contain three columns: chromosome (format: chr1), position and coverage. Each targeted base needs to be included. The 'Genome' folder must contain a separate fasta file for each chromosome (chr1.fa, chr2.fa, etc.). The remaining inputs are described in the parameters section. In short, you will need a bedfile ('bedfile2'), a partition file ('partition_regions'; this can be created from the bedfile), a list of samples ('samples'), known CNVs if any ('knowncnvs') and some additional annotation files that can be downloaded from the UCSC genome browser ('chain_self', 'refFlat'). You will also need to define which parts of the analysis you want to run ('run_PanelDoc', 'run_Filtering', 'run_per_gene', 'run_Combine_results', 'calling_algorithm', 'run_CNVPanelizer') and the minimum median coverage for your data ('minimum_coverage'). The minimum coverage is an important numerical parameter; the rest can be run with default values.
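Since most problems come from the input files, it can help to check the coverage folder before starting a run. The following is a minimal sketch, not part of the component: the helper names `read_column` and `missing_depth_files` are hypothetical, and it assumes that the sample file and the partition file are comma-separated with the header columns 'Sample' and 'PartitionName'.

```python
import csv
from pathlib import Path


def read_column(path, column, delimiter=","):
    """Return the non-empty values of one named column from a delimited file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        values = [(row.get(column) or "").strip() for row in reader]
    return [value for value in values if value]


def missing_depth_files(coverage_dir, samples, regions):
    """List expected [Samplename]_[regionname].depth files that do not exist."""
    coverage_dir = Path(coverage_dir)
    expected = [f"{sample}_{region}.depth" for sample in samples for region in regions]
    return [name for name in expected if not (coverage_dir / name).is_file()]


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own inputs.
    samples = read_column("samples.csv", "Sample")
    regions = read_column("partition_regions.csv", "PartitionName")
    for name in missing_depth_files("coverage/", samples, regions):
        print("missing coverage file:", name)
```

An empty result means every sample and region combination has a coverage file; the check does not validate the contents of the files.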
Eight subfolders will be created as outputs in the component results folder. The most important one is the 'calls' folder, where the final copy number ratios are written together with any outputs from additional filtering, combined results etc. Another important file is 'Platform_basedata.csv'. Creating this file is time consuming, so if you need to run the same set of samples several times, or if the pipeline crashed after creating this file (marked by the printed message "Platform data prepared"), you can fetch the file from the outputs and give it as the input parameter 'partition_platform' (with 'prep_partition' set to FALSE). This saves time since the program does not need to start from the beginning.
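Reusing the platform file can be sketched as follows. This is only an illustration: the function name `platform_parameters` is hypothetical, and since the exact location of 'Platform_basedata.csv' inside the outputs is not stated here, the sketch searches the whole results folder for it.

```python
from pathlib import Path


def platform_parameters(previous_results_dir):
    """Suggest prep_partition / partition_platform values for a rerun.

    If a Platform_basedata.csv from an earlier run is found, reuse it;
    otherwise the platform data has to be generated again.
    """
    matches = sorted(Path(previous_results_dir).rglob("Platform_basedata.csv"))
    if matches:
        return {"prep_partition": "FALSE", "partition_platform": str(matches[0])}
    return {"prep_partition": "TRUE", "partition_platform": ""}
```

The returned values correspond to parameters 21 and 22 and still need to be entered in the component call by hand.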
The pipeline is quite time consuming (it can take from 10 minutes up to two weeks) and requires a fair amount of memory (~40GB for a set of 10-15 samples with 500 regions). Having fewer regions reduces the running time; however, we do not recommend running only one or two chromosomes at a time, since large CNVs will affect the normalization if the set is too small. The pipeline has been tested with the 'Tumor' pipeline and has been used successfully with default parameters. The 'Blood' pipeline runs without errors; however, no further analysis of its results has been done. Creating gene graphics has not been tested. Most problems that arise during execution are due to issues with the input files, so please make sure to follow the instructions and test your data with a smaller sample set first. Make sure to always add the trailing slash to the paths. Since the program does not use a pool of normal samples, normal samples can be run together with the cancer samples. In this case you can filter the cancer samples against the normal samples at the end, removing unreliable signal. This does not, however, remove any germline variants present in the normal samples. The filtering is only possible with the 'Tumor' pipeline, with 2 mclust_states, and if the names of the normal samples match the names of the cancer samples exactly apart from ending in "normal" instead of "cancer". It is recommended to also check the original calls when analysing the data, since the filtering might not be reliable for other data sets.
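The naming rule for the filtering step can be checked before the run. The sketch below is an illustration only: the function name `pair_cancer_with_normal` is hypothetical, and it assumes that sample names (without file extension) end in "cancer" or "normal", as in Patient01_tissue01_cancer.

```python
def pair_cancer_with_normal(sample_names):
    """Map each cancer sample to its matching normal sample, or None if absent."""
    names = set(sample_names)
    pairs = {}
    for name in sample_names:
        if name.endswith("cancer"):
            # Same name with the trailing "cancer" swapped for "normal".
            normal = name[: -len("cancer")] + "normal"
            pairs[name] = normal if normal in names else None
    return pairs


if __name__ == "__main__":
    example = [
        "Patient01_tissue01_cancer",
        "Patient01_tissue01_normal",
        "Patient02_tissue01_cancer",
    ]
    for cancer, normal in pair_cancer_with_normal(example).items():
        print(cancer, "->", normal if normal else "no matching normal sample")
```

Cancer samples that map to None have no normal sample with a matching name in the batch and are worth checking before enabling run_Filtering.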
CNVPanelizer
CNVPanelizer is originally an R package from Bioconductor (https://bioconductor.org/packages/release/bioc/html/CNVPanelizer.html, Oliveira C, Wolf T (2018). CNVPanelizer: Reliable CNV detection in targeted sequencing applications. R package version 1.12.0.). For CNVPanelizer you need a folder with cancer bam files, a folder with normal bam files (these do not need to be matched, since the program uses a pool of normals) and a bedfile.
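The 'bams' and 'normals' folders must contain a bai file for each bam file. A minimal sketch for finding bam files without an index is shown below; the function name `bams_without_index` is hypothetical, and because the expected index naming is not stated here, both sample.bam.bai and sample.bai are accepted.

```python
from pathlib import Path


def bams_without_index(bam_dir):
    """Return the names of .bam files in bam_dir that have no .bai index file."""
    missing = []
    for bam in sorted(Path(bam_dir).glob("*.bam")):
        # Accept either naming convention: sample.bam.bai or sample.bai.
        candidates = [bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai")]
        if not any(candidate.is_file() for candidate in candidates):
            missing.append(bam.name)
    return missing
```

A missing index can be created with `samtools index sample.bam`, which by default writes sample.bam.bai next to the bam file.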
Checklist for using the component:
In the parameter description the parameters are numbered.
1. Define parameters 1-5. Which algorithm do you want to use, and how do you want to process the outputs?
2. Based on your choices in step one, create and define the needed inputs for parameters 6-20.
3. Define parameters 21-29. Is this the first run or do you already have some of the intermediate files?
4. Define parameters 30-50. Default parameters can probably be used.
Version | 1.0 |
---|---|
Bundle | sequencing |
Categories | Analysis |
Authors | Ingrid Schulman (Anna.Schulman@helsinki.FI) |
Issue tracker | View/Report issues |
Requires | biocLite (R-package) ; install.sh (bash) |
Source files | component.xml comp_initialize_changed.R |
Usage | Example with default values |
Name | Type | Description |
---|---|---|
bedgraph | Latex | Output folder for bedgraphs. |
calls | Latex | Output folder for CNV ratio calls, filtered calls (if present), per gene results (if present), and combined CNV ratios from PanelDoc and CNVPanelizer (if present). |
raw | Latex | Output folder for raw data. |
normalized | Latex | Output folder for normalized data. Can be used as input for other segmentation algorithms. |
PDFs | Latex | Output folder for graphical content and normalized CNV calls. |
QC_Metrics | Latex | Output folder for quality control. |
General_output | Latex | Output folder with general information from the run. |
CNVPanelizer_results | Latex | Output folder for CNVPanelizer results. |
Name | Type | Default | Description |
---|---|---|---|
Genome | string | "" | 7. PanelDoc: Folder containing fasta file for each chromosome separately in the format chr1.fa, chr2.fa, ..., chrX.fa. |
Remove_duplicates | boolean | false | 20. CNVPanelizer: Whether duplicates should be removed (true or false). Setting this to true is only recommended for Ion Torrent data. |
annotate_cnvs | string | "TRUE" | 29. PanelDoc: TRUE/FALSE. If TRUE, annotate CNVs to genes. |
bams | string | "" | 17. CNVPanelizer: Path to directory with bam and bai files of cancer samples to be included in the run. Bai files can easily be created with samtools. |
bedfile | string | "" | 19. CNVPanelizer: Path to bedfile used in CNVPanelizer. |
bedfile2 | string | "" | 12. PanelDoc: Path to bedfile used in PanelDoc. This file needs to contain ALL regions present in the analysis and NO OTHER non-targeted or skipped regions. It needs to have at least the following columns, without a header: chromosome (format: "chr1"), start, end, name, score and strand. We recommend using gene IDs (ENSXXX.YY) as names. For more information see the PanelDoc documentation and bed format standards. |
call_cnvs | string | "TRUE" | 28. PanelDoc: TRUE/FALSE. If TRUE, call CNVs using sliding window method. If not TRUE, no CNV calling performed. |
calling_algorithm | string | "Tumor" | 11. PanelDoc: Which calling algorithm to be used. For tumor derived data (tissue, ctDNA) use option "Tumor". For blood use "Blood". |
chain_self | string | "" | 16. PanelDoc: Path to chainSelfLink file. This can be downloaded from the UCSC genome browser. |
coverage | string | "" | 8. PanelDoc: Folder containing coverage for each combination of sample and region, in the format [Samplename]_[regionname].depth. The contents of the file should be three columns: chromosome (format: chr1), position and coverage. Each targeted base needs to be included. |
gain_call_signal | float | 1.3 | 42. PanelDoc: Signal criteria for gain call. |
gain_extend_signal | float | 1.2 | 40. PanelDoc: Expected signal for gain seed extension. |
gain_many_call_signal | float | 1.8 | 36. PanelDoc: Signal criteria for gain call. |
gain_many_extend_signal | float | 1.7 | 34. PanelDoc: Expected signal for gain seed extension. |
gain_many_seed_signal | float | 1.8 | 32. PanelDoc: Expected signal for gain seed generation. |
gain_seed_signal | float | 1.3 | 38. PanelDoc: Expected signal for gain seed generation. |
gene_graphics | string | "FALSE" | 23. PanelDoc: TRUE/FALSE. If TRUE, gene is plotted in graphical output. This is only possible if a) Partition regions represent genes b) There is a PartitionRefseq-column in the partition file with annotation information. Creating gene graphics has not been validated in this component. |
genelist_for_combining | string | "" | 6. Path to list of matched gene names and gene IDs. The PanelDoc pipeline might get confused if there are gene names with the same locations. Because of this we recommend using gene IDs (ENSXXXX.YY) as gene names in the bedfile and partition file (parameters 12-13). This, however, requires a csv file of matched gene names and gene IDs to combine the results with CNVPanelizer. Please include the following columns: 'Chr' (format chrX), 'Start', 'End', 'Gene' and 'GeneID'. For example, 'Gene' = 'BRCA2', 'GeneID' = 'ENSG00000139618.14'. The file should be tab separated. |
generate_median | string | "TRUE" | 24. PanelDoc: TRUE/FALSE. Should raw median coverage be generated. |
knowncnvs | string | "" | 14. PanelDoc: Path to file with known CNVs. If there are no known CNVs, leave this parameter unspecified. If there are known CNVs, specifying them will improve normalization, especially if the regions are large. |
loss_call_signal | float | .7 | 43. PanelDoc: Signal criteria for loss call. |
loss_extend_signal | float | .8 | 41. PanelDoc: Expected signal for loss seed extension. |
loss_none_call_signal | float | .2 | 37. PanelDoc: Signal criteria for loss call. |
loss_none_extend_signal | float | .3 | 35. PanelDoc: Expected signal for loss seed extension. |
loss_none_seed_signal | float | .2 | 33. PanelDoc: Expected signal for loss seed generation. |
loss_seed_signal | float | .7 | 39. PanelDoc: Expected signal for loss seed generation. |
max_uncertainty | float | .1 | 49. PanelDoc: Maximum permitted uncertainty for state calling using mclust for tumors. Lower = more stringent. Not relevant for sliding window CNV calling. |
maximum_selfchain | int | 2 | 31. PanelDoc: Set maximum self-chain repeat count for base inclusion in variant calling. |
mclust_states | int | 2 | 50. PanelDoc: Maximum number of states for mixture model used in 'Tumor' pipeline. Max is 9. Analysis optimized for 2 states. More states likely to generate false positives. |
minimum_bait | int | 1 | 30. PanelDoc: Set minimum distance from non-targeted region for base inclusion in variant calling. |
minimum_base_pass | int | 180 | 46. PanelDoc: Minimum base count that pass criteria for call. |
minimum_base_window | int | 200 | 45. PanelDoc: Minimum window base count for call. |
minimum_coverage | string | "" | 10. PanelDoc: Minimum median coverage for base inclusion in variant calling. Change this to a coverage appropriate for your data. |
minimum_zscore | int | 1 | 44. PanelDoc: Minimum z score for gc content normalization. |
normals | string | "" | 18. CNVPanelizer: Path to directory with bam and bai files of normal samples to be included in the pool of normals. Bai files can easily be created with samtools. |
output_bedgraph | string | "FALSE" | 25. PanelDoc: TRUE/FALSE. If TRUE, bedgraphs are written for raw coverage and normalized ratio data. |
output_normalized | string | "TRUE" | 26. PanelDoc: TRUE/FALSE. If TRUE, output data is written for raw and normalized coverage. |
output_ratio | string | "TRUE" | 27. PanelDoc: TRUE/FALSE. If TRUE, ratio table is written for each partition. This is useful if another segmentation method is to be used. |
partition_platform | string | "" | 22. PanelDoc: Path to the genomic data for bed platform generated previously if existing (see parameter prep_partition), else leave empty. |
partition_regions | string | "" | 13. PanelDoc: Path to partition file. This is a file that specifies the discrete regions to be included in the run. The file should contain the following columns: PartitionName,PartitionChr,PartitionStart,PartitionEnd and optionally the column PartitionRefseq with a RefSeq ID which is used for gene_graphics. Needs to contain at least two regions. We recommend to use geneIDs as names, and these names need to match the names used in the bedfile. |
pass_count | int | 37 | 48. PanelDoc: Number of bases in window required to pass criteria for call. |
prep_partition | string | "TRUE" | 21. PanelDoc: TRUE/FALSE. If TRUE, genomic data for bed platform is generated. This step is the most time consuming in PanelDoc. Once this data has been generated for the samples and regions in use, set this to FALSE and specify the partition_platform parameter as the path to the file. |
refFlat | string | "" | 15. PanelDoc: Path to file with gene annotations. This can be downloaded from the UCSC genome browser. |
run_CNVPanelizer | string | "TRUE" | 4. CNVPanelizer: Run CNVPanelizer analysis. |
run_Combine_results | string | "FALSE" | 5. Combine results from PanelDoc and CNVPanelizer. This will automatically convert PanelDoc results to per gene before combining. This step requires the parameter 'genelist_for_combining' to be specified. |
run_Filtering | string | "TRUE" | 2. PanelDoc: Filter out unreliable signal from cancer samples based on normal samples in the same run. This step requires normal samples to be present in the batch and to have exactly the same name as the matching cancer sample, except for ending in normal.[extension] where the cancer sample ends in cancer.[extension]. This requires parameter 50 (mclust_states) to be set to its default value 2. |
run_PanelDoc | string | "TRUE" | 1. PanelDoc: Run PanelDoc analysis. |
run_per_gene | string | "TRUE" | 3. PanelDoc: Output a table with all samples from the batch with results given per gene. Otherwise results will be given for regions smaller than genes. The per-gene result is a weighted mean of the region results, weighted by region size. |
samples | string | "" | 9. PanelDoc: Path to CSV file with sample names to be included in the run, without extensions. Example: Patient01_sample01_cancer. File consists of header "Sample" followed by each sample name on a separate line. Make sure there are no extra spaces in the file. Needs to be a csv file and there needs to be at least two samples in a run. This can also have additional columns "Lane" and "Index", please see original documentation for more information. |
window_size | int | 50 | 47. PanelDoc: Window size for scanning. |
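The per-gene summary described for run_per_gene (a mean weighted by region size) can be illustrated with a short sketch. This is not the component's own code: the function name `per_gene_ratio` and the input layout are hypothetical, and region size is assumed to be end minus start.

```python
def per_gene_ratio(regions):
    """Combine region-level ratios into one size-weighted mean ratio per gene.

    regions: iterable of (gene, start, end, ratio) tuples.
    """
    totals = {}
    for gene, start, end, ratio in regions:
        size = end - start
        weighted_sum, total_size = totals.get(gene, (0.0, 0))
        totals[gene] = (weighted_sum + ratio * size, total_size + size)
    # Genes whose regions have zero total size are skipped to avoid division by zero.
    return {gene: w / t for gene, (w, t) in totals.items() if t > 0}


if __name__ == "__main__":
    example = [
        ("BRCA2", 100, 200, 1.0),  # 100 bases at ratio 1.0
        ("BRCA2", 200, 500, 1.4),  # 300 bases at ratio 1.4
    ]
    print(per_gene_ratio(example))
```

In the example the larger region dominates, so the BRCA2 value is about 1.3 rather than the unweighted mean of 1.2.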
Test case | Parameters | OUT bedgraph | OUT calls | OUT raw | OUT normalized | OUT PDFs | OUT QC_Metrics | OUT General_output | OUT CNVPanelizer_results |
---|---|---|---|---|---|---|---|---|---|
Test_PanelDoc | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |