
PanelDoc

Calls CNV ratios, gains and losses using the existing programs PanelDoc (developed by Nord et al. 2011), CNVPanelizer (Bioconductor package), or both, with optional custom filtering. The original scripts have been modified, but no functional changes have been made. Both programs use depth of coverage to call CNVs from targeted sequencing data. It is recommended to study the original documentation before running this component. All comments and recommendations given here are based on personal experience with the programs and do not represent the opinions of their authors.

All parameters needed for PanelDoc are marked with 'PanelDoc' and all parameters for CNVPanelizer with 'CNVPanelizer'.

PanelDoc

The scripts implemented in this component were created by Nord et al. and are described in the article 'Accurate and exact CNV identification from targeted high-throughput sequence data, Nord et al., 2011 (Pubmed ID: 21486468)' and in the GitHub repository https://github.com/ammawla/PanelDoC. The component also includes some custom analyses of the PanelDoc output: run_Filtering, run_per_gene and run_Combine_results. To make the scripts easier to use, the input structure in this Anduril component differs from the one described in the GitHub readme. The names match the original inputs, however, so the contents are the same.

To run the component you need a set of input files and two folders, which must be named 'coverage' and 'Genome'.

The 'coverage' folder should contain all coverage files needed in the analysis: one coverage file for each region in each sample, named [Samplename]_[regionname].depth. The extension must be .depth, the sample name must match the sample name in the sample file (see parameter 'samples') and the region name must match the region name in the partition file (see parameter 'partition_regions'). For example: Patient01_tissue01_cancer_geneXX.depth. Each file should contain three columns: chromosome (format: chr1), position and coverage. Every targeted base needs to be included. The 'Genome' folder should contain a separate fasta file for each chromosome (chr1.fa, chr2.fa, etc.).

The rest of the inputs are defined in the parameters section. In short, you need a bedfile ('bedfile2'), a partition file ('partition_regions', which can be created from the bedfile), a list of samples ('samples'), known CNVs ('knowncnvs') and some additional annotation files that can be downloaded from the UCSC genome browser ('chain_self', 'refFlat'). You also need to define which parts of the analysis to run ('run_PanelDoc', 'run_Filtering', 'run_per_gene', 'run_Combine_results', 'run_CNVPanelizer'), the calling algorithm ('calling_algorithm') and the minimum median coverage for your data ('minimum_coverage'). The minimum coverage is an important numerical parameter that should be set according to your data; the other numerical parameters can be run with their default values.
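Since most execution problems come from the input files, it can help to check the coverage folder before starting a run. The following is a minimal sketch, not part of the component: the function names are hypothetical, and it only assumes the naming rule [Samplename]_[regionname].depth and a sample file with the header "Sample" as described above.

```python
import csv
from pathlib import Path


def read_sample_names(samples_csv):
    """Read sample names from the 'Sample' column of the sample file."""
    with open(samples_csv, newline="") as handle:
        names = [row["Sample"].strip() for row in csv.DictReader(handle)]
    # Ignore rows whose sample name is empty after stripping spaces.
    return [name for name in names if name]


def missing_depth_files(coverage_dir, sample_names, region_names):
    """Return expected [Samplename]_[regionname].depth files that are absent."""
    coverage_dir = Path(coverage_dir)
    missing = []
    for sample in sample_names:
        for region in region_names:
            file_name = f"{sample}_{region}.depth"
            if not (coverage_dir / file_name).is_file():
                missing.append(file_name)
    return missing
```

The region names passed in would be the names used in the partition file. An empty result means every sample and region combination has a coverage file; it does not check the contents of the files.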

Eight subfolders are created as outputs in the component results folder. The most important one is the 'calls' folder, where the final copy number ratios are written together with any outputs from additional filtering, combined results etc. Another important file is 'Platform_basedata.csv'. Creating this file is time consuming, so if you need to run the same set of samples several times, or if the pipeline crashed AFTER creating this file (marked by the printed message "Platform data prepared"), you can simply fetch the file from the outputs and give it as the input parameter 'partition_platform'. This saves time since the program does not need to start from the beginning.
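The reuse of 'Platform_basedata.csv' can be expressed as a small decision rule. This is an illustrative sketch only: the function name is hypothetical, and the location of the file inside the outputs is not assumed, so the path has to be supplied by the user. The returned values follow the descriptions of 'prep_partition' and 'partition_platform' in the parameters section.

```python
from pathlib import Path


def platform_parameters(basedata_path):
    """Choose 'prep_partition' and 'partition_platform' for the next run."""
    if basedata_path and Path(basedata_path).is_file():
        # A previously generated file exists: skip the slow preparation step.
        return {"prep_partition": "FALSE", "partition_platform": str(basedata_path)}
    # No usable file: let the component generate the platform data again.
    return {"prep_partition": "TRUE", "partition_platform": ""}
```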

The pipeline is quite time consuming (a run can take from 10 minutes up to two weeks) and memory intensive (about 40 GB for a set of 10-15 samples with 500 regions). Having fewer regions reduces the running time, but we do not recommend running only one or two chromosomes at a time, since large CNVs will affect the normalization if the set is too small.

The component has been tested with the 'Tumor' pipeline and has been used successfully with default parameters. The 'Blood' pipeline runs without errors, but its results have not been analysed further. Creating gene graphics has not been tested. Most problems that arise during execution are due to issues with the input files, so please follow the instructions carefully and test your data with a smaller sample set first. Make sure to always add the trailing slash to the paths.

Since the program does not use a pool of normal samples, normal samples can be run together with the cancer samples. In this case you can filter the cancer samples against the normal samples at the end, removing unreliable signal. This does not, however, remove any germline variants present in the normal samples. The filtering is only possible with the 'Tumor' pipeline, with mclust_states set to 2, and if the name of each normal sample matches the name of its cancer sample exactly apart from ending in "normal" instead of "cancer". It is recommended to also check the original calls when analysing the data, since the filtering might not be reliable for other data sets.
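The naming rule for filtering against normal samples can be checked before a run. The sketch below is hypothetical helper code, not the component's own filtering: it only assumes that sample names end in "cancer" or "normal" and are otherwise identical, as described above.

```python
def pair_cancer_with_normal(sample_names):
    """Pair each '...cancer' sample with its '...normal' counterpart.

    Returns a dict of cancer name -> normal name, plus a list of cancer
    samples for which no matching normal sample was found.
    """
    names = set(sample_names)
    pairs = {}
    unpaired = []
    for name in sample_names:
        if not name.endswith("cancer"):
            continue
        # Swap the trailing 'cancer' for 'normal'; the rest must be identical.
        normal = name[: -len("cancer")] + "normal"
        if normal in names:
            pairs[name] = normal
        else:
            unpaired.append(name)
    return pairs, unpaired
```

Cancer samples reported as unpaired would have no normal sample to be filtered against.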

CNVPanelizer

This is originally an R package from Bioconductor (https://bioconductor.org/packages/release/bioc/html/CNVPanelizer.html, Oliveira C, Wolf T (2018). CNVPanelizer: Reliable CNV detection in targeted sequencing applications. R package version 1.12.0.). For CNVPanelizer you need a folder with cancer bam files, a folder with normal bam files (these do not need to be matched, since the program uses a pool of normals) and a bedfile.
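The 'bams' and 'normals' parameters expect bai files next to the bam files, so a quick check for missing indexes may save a failed run. This is a minimal sketch with a hypothetical function name. It assumes an index is named either sample.bam.bai (the default name written by `samtools index sample.bam`) or sample.bai; which of the two the component expects is not stated here.

```python
from pathlib import Path


def bams_without_index(bam_dir):
    """Return names of bam files in bam_dir that have no bai file beside them."""
    missing = []
    for bam in sorted(Path(bam_dir).glob("*.bam")):
        candidates = (
            bam.with_name(bam.name + ".bai"),  # sample.bam.bai
            bam.with_suffix(".bai"),           # sample.bai
        )
        if not any(candidate.is_file() for candidate in candidates):
            missing.append(bam.name)
    return missing
```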

Checklist for using the component:

In the parameter description the parameters are numbered.

1. Define parameters 1-5. Which algorithm do you want to use, and how do you want to process the outputs?

2. Based on your choices in step one, create and define the needed inputs for parameters 6-20.

3. Define parameters 21-29. Is this the first run or do you already have some of the intermediate files?

4. Define parameters 30-50. Default parameters can probably be used.

Version 1.0
Bundle sequencing
Categories Analysis
Authors Ingrid Schulman (Anna.Schulman@helsinki.FI)
Issue tracker View/Report issues
Requires biocLite (R-package) ; install.sh (bash)
Source files component.xml comp_initialize_changed.R
Usage Example with default values

Outputs

Name Type Description
bedgraph Latex Output folder for bedgraphs.
calls Latex Output folder for CNV ratio calls, filtered calls (if present), per gene results (if present), and combined CNV ratios from PanelDoc and CNVPanelizer (if present).
raw Latex Output folder for raw data.
normalized Latex Output folder for normalized data. Can be used as input for other segmentation algorithms.
PDFs Latex Output folder for graphical content and normalized CNV calls.
QC_Metrics Latex Output folder for quality control.
General_output Latex Output folder with general information from the run.
CNVPanelizer_results Latex Output folder for CNVPanelizer results.

Parameters

Name Type Default Description
Genome string "" 7. PanelDoc: Folder containing fasta file for each chromosome separately in the format chr1.fa, chr2.fa, ..., chrX.fa.
Remove_duplicates boolean false 20. CNVPanelizer: Whether duplicates should be removed (true or false). true is only recommended for Ion Torrent data.
annotate_cnvs string "TRUE" 29. PanelDoc: TRUE/FALSE. If TRUE, annotate CNVs to genes.
bams string "" 17. CNVPanelizer: Path to directory with bam and bai files of cancer samples to be included in the run. Bai files can easily be created with samtools.
bedfile string "" 19. CNVPanelizer: Path to bedfile used in CNVPanelizer.
bedfile2 string "" 12. PanelDoc: Path to bedfile used in PanelDoc. This file needs to contain ALL regions present in the analysis and NO OTHER non-targeted or skipped regions. It needs to have at least the following columns, without a header: chromosome (format: "chr1"), start, end, name, score and strand. We recommend using gene IDs (ENSXXX.YY) as names. For more information see the PanelDoc documentation and bed format standards.
call_cnvs string "TRUE" 28. PanelDoc: TRUE/FALSE. If TRUE, call CNVs using sliding window method. If not TRUE, no CNV calling performed.
calling_algorithm string "Tumor" 11. PanelDoc: Which calling algorithm to be used. For tumor derived data (tissue, ctDNA) use option "Tumor". For blood use "Blood".
chain_self string "" 16. PanelDoc: Path to chainSelfLink file. This can be downloaded from the UCSC genome browser.
coverage string "" 8. PanelDoc: Folder containing coverage for each combination of sample and region, in the format [Samplename]_[regionname].depth. The contents of the file should be three columns: chromosome (format: chr1), position and coverage. Each targeted base needs to be included.
gain_call_signal float 1.3 42. PanelDoc: Signal criteria for gain call.
gain_extend_signal float 1.2 40. PanelDoc: Expected signal for gain seed extension.
gain_many_call_signal float 1.8 36. PanelDoc: Signal criteria for gain call.
gain_many_extend_signal float 1.7 34. PanelDoc: Expected signal for gain seed extension.
gain_many_seed_signal float 1.8 32. PanelDoc: Expected signal for gain seed generation.
gain_seed_signal float 1.3 38. PanelDoc: Expected signal for gain seed generation.
gene_graphics string "FALSE" 23. PanelDoc: TRUE/FALSE. If TRUE, gene is plotted in graphical output. This is only possible if a) Partition regions represent genes b) There is a PartitionRefseq-column in the partition file with annotation information. Creating gene graphics has not been validated in this component.
genelist_for_combining string "" 6. Path to list of matched gene names and gene IDs. The PanelDoc pipeline might get confused if several gene names share the same location. Because of this we recommend using gene IDs (ENSXXXX.YY) as gene names in the bedfile and partition file (parameters 12-13). This, however, requires a csv file of matched gene names and gene IDs to combine the results with CNVPanelizer. Please include the following columns: 'Chr' (format chrX), 'Start', 'End', 'Gene' and 'GeneID'. For example: 'Gene' = 'BRCA2', 'GeneID' = 'ENSG00000139618.14'. File should be tab separated.
generate_median string "TRUE" 24. PanelDoc: TRUE/FALSE. Should raw median coverage be generated.
knowncnvs string "" 14. PanelDoc: Path to file with known CNVs. If there are no known CNVs, leave this parameter unspecified. If there are known CNVs, specifying them will improve normalization, especially if the regions are large.
loss_call_signal float .7 43. PanelDoc: Signal criteria for loss call.
loss_extend_signal float .8 41. PanelDoc: Expected signal for loss seed extension.
loss_none_call_signal float .2 37. PanelDoc: Signal criteria for loss call.
loss_none_extend_signal float .3 35. PanelDoc: Expected signal for loss seed extension.
loss_none_seed_signal float .2 33. PanelDoc: Expected signal for loss seed generation.
loss_seed_signal float .7 39. PanelDoc: Expected signal for loss seed generation.
max_uncertainty float .1 49. PanelDoc: Maximum permitted uncertainty for state calling using mclust for tumors. Lower = more stringent. Not relevant for sliding window CNV calling.
maximum_selfchain int 2 31. PanelDoc: Set maximum self-chain repeat count for base inclusion in variant calling.
mclust_states int 2 50. PanelDoc: Maximum number of states for mixture model used in 'Tumor' pipeline. Max is 9. Analysis is optimized for 2 states. More states are likely to generate false positives.
minimum_bait int 1 30. PanelDoc: Set minimum distance from non-targeted region for base inclusion in variant calling.
minimum_base_pass int 180 46. PanelDoc: Minimum base count that pass criteria for call.
minimum_base_window int 200 45. PanelDoc: Minimum window base count for call.
minimum_coverage string "" 10. PanelDoc: Minimum median coverage for base inclusion in variant calling. Change this to an appropriate coverage according to your data.
minimum_zscore int 1 44. PanelDoc: Minimum z score for gc content normalization.
normals string "" 18. CNVPanelizer: Path to directory with bam and bai files of normal samples to be included in the pool of normals. Bai files can easily be created with samtools.
output_bedgraph string "FALSE" 25. PanelDoc: TRUE/FALSE. If TRUE, bedgraphs are written for raw coverage and normalized ratio data.
output_normalized string "TRUE" 26. PanelDoc: TRUE/FALSE. If TRUE, output data is written for raw and normalized coverage.
output_ratio string "TRUE" 27. PanelDoc: TRUE/FALSE. If TRUE, ratio table is written for each partition. This is useful if another segmentation method is to be used.
partition_platform string "" 22. PanelDoc: Path to the genomic data for bed platform generated previously if existing (see parameter prep_partition), else leave empty.
partition_regions string "" 13. PanelDoc: Path to partition file. This is a file that specifies the discrete regions to be included in the run. The file should contain the following columns: PartitionName, PartitionChr, PartitionStart, PartitionEnd and optionally the column PartitionRefseq with a RefSeq ID, which is used for gene_graphics. Needs to contain at least two regions. We recommend using gene IDs as names, and these names need to match the names used in the bedfile.
pass_count int 37 48. PanelDoc: Number of bases in window required to pass criteria for call.
prep_partition string "TRUE" 21. PanelDoc: TRUE/FALSE. If TRUE, genomic data for bed platform is generated. This step is the most time consuming in PanelDoc. Once this data has been generated for the samples and regions in use, set to FALSE and specify the partition_platform parameter as the path to the file.
refFlat string "" 15. PanelDoc: Path to file with gene annotations. This can be downloaded from the UCSC genome browser.
run_CNVPanelizer string "TRUE" 4. CNVPanelizer: Run CNVPanelizer analysis.
run_Combine_results string "FALSE" 5. Combine results from PanelDoc and CNVPanelizer. This will automatically convert PanelDoc results to per gene before combining. This step requires the parameter 'genelist_for_combining' to be specified.
run_Filtering string "TRUE" 2. PanelDoc: Filter out unreliable signal from cancer samples based on normal samples in the same run. This step requires that normal samples are present in the batch of samples to be run and that each has exactly the same name as its cancer sample, ending in normal.[extension] where the cancer sample ends in cancer.[extension]. This requires parameter 50 (mclust_states) to be set to its default value 2.
run_PanelDoc string "TRUE" 1. PanelDoc: Run PanelDoc analysis.
run_per_gene string "TRUE" 3. PanelDoc: Output a table with all samples from the batch with results given per gene. Otherwise results will be given for regions smaller than genes. The per-gene result is a weighted mean based on region size.
samples string "" 9. PanelDoc: Path to CSV file with sample names to be included in the run, without extensions. Example: Patient01_sample01_cancer. File consists of header "Sample" followed by each sample name on a separate line. Make sure there are no extra spaces in the file. Needs to be a csv file and there needs to be at least two samples in a run. This can also have additional columns "Lane" and "Index", please see original documentation for more information.
window_size int 50 47. PanelDoc: Window size for scanning.
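
The per-gene result of 'run_per_gene' is described as a weighted mean based on region size. A minimal sketch of such a calculation is given below. It is not the component's code: the function name and the input layout (gene, start, end, ratio) are assumptions, and region size is taken as end minus start with end greater than start.

```python
from collections import defaultdict


def per_gene_ratio(regions):
    """Size-weighted mean ratio per gene.

    regions: iterable of (gene, start, end, ratio) tuples, one per region.
    """
    weighted_sum = defaultdict(float)
    total_size = defaultdict(int)
    for gene, start, end, ratio in regions:
        size = end - start  # assumed size convention, see lead-in
        weighted_sum[gene] += ratio * size
        total_size[gene] += size
    return {gene: weighted_sum[gene] / total_size[gene] for gene in weighted_sum}
```

For example, a gene covered by a 100-base region with ratio 1.0 and a 300-base region with ratio 2.0 gets (100 * 1.0 + 300 * 2.0) / 400 = 1.75, so the larger region dominates the result.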

Test cases

Test case Parameters OUT bedgraph OUT calls OUT raw OUT normalized OUT PDFs OUT QC_Metrics OUT General_output OUT CNVPanelizer_results
Test_PanelDoc (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing)

Generated 2019-02-08 07:42:12 by Anduril 2.0.0