This function applies machine learning to improve the input variants. The input variant files are assumed to be "raw" in the sense that they come straight from the caller. The Genome Analysis Toolkit (GATK) is used along with several background ("true site") files that can be downloaded from the GATK resource bundle. For more information about the specific annotations available, please see the GATK documentation. The default annotations used here are the recommended ones for most data sets.
The two step procedure of VariantRecalibrator:
Complete documentation:
Also check out the additional discussion on VQSR and the FAQ describing the recommended arguments and training sets.
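
The two-step procedure can be pictured as two consecutive GATK invocations: VariantRecalibrator builds the model from the training resources, and ApplyRecalibration filters the input with it. The following is a minimal sketch that assumes GATK 3.x command-line conventions. It is not the component's actual implementation (that lives in function.scala and may differ). The function name, the intermediate file names and the resource flag strings passed by the caller are illustrative assumptions.

```python
def vqsr_commands(jar, reference, variants, resources, annotations,
                  mode="SNP", truth=99.0, memory="4g"):
    """Return two argument lists, one per VQSR step (GATK 3.x style flags).

    resources maps a resource name to (vcf_path, flag_string), where the
    flag string is supplied by the caller, e.g.
    "known=false,training=true,truth=true,prior=15.0".
    """
    base = ["java", "-Xmx" + memory, "-jar", jar]
    # Hypothetical intermediate and output file names, for illustration only.
    recal, tranches, out = "vqsr.recal", "vqsr.tranches", "calls.vcf"

    # Step 1: train the model on the annotations of the training sites.
    step1 = base + ["-T", "VariantRecalibrator", "-R", reference,
                    "-input", variants, "-mode", mode,
                    "-recalFile", recal, "-tranchesFile", tranches]
    for name, (path, flags) in resources.items():
        step1 += ["-resource:%s,%s" % (name, flags), path]
    for annotation in annotations.split(","):
        step1 += ["-an", annotation]

    # Step 2: apply the model and filter at the requested truth level.
    step2 = base + ["-T", "ApplyRecalibration", "-R", reference,
                    "-input", variants, "-mode", mode,
                    "--ts_filter_level", str(truth),
                    "-recalFile", recal, "-tranchesFile", tranches,
                    "-o", out]
    return step1, step2
```

The lists can be passed to `subprocess.run` as they are. Running the pair once in SNP mode and once in INDEL mode, each with its own annotation list, is one way to mirror the separate snpAnno and indelAnno parameters.
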
Version | 1.0 |
---|---|
Bundle | sequencing |
Categories | VariationAnalysis |
Authors | Rony Lindell (rony.lindell@helsinki.fi) |
Issue tracker | View/Report issues |
Source files | component.xml function.scala |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
reference | FASTA | Mandatory | The reference fasta file. |
variants | VCF | Optional | Input (merged) vcf file. See 'files' parameter for adding multiple files. The file can be a single-sample or a merged multi-sample vcf file. |
hapmap | VCF | Optional | File with very high confidence hapmap training data. |
omni | VCF | Optional | File with true polymorphic SNP sites from the Omni genotyping array. |
hcsnp | VCF | Optional | File with high confidence snps from the 1000 Genomes project. |
dbsnp | VCF | Optional | File with lower confidence SNPs from latest dbSNP distribution. |
mills | VCF | Optional | File with indel high confidence training data from the Mills dataset. |
Name | Type | Description |
---|---|---|
calls | VCF | Final recalibrated vcf file. |
Name | Type | Default | Description |
---|---|---|---|
capture | boolean | true | This will make various parameters specific for exome sequencing (or other similar "capture" technology). If 'false', the data will be assumed to be whole-genome or similar. |
files | string | "" | A "-input"-tag separated list of paths to multiple vcf files (single- or multi-sample), e.g. files="-input FILE1.vcf -input FILE2.vcf ... -input FILEN.vcf". |
gatk | string | "" | Path to the GATK directory containing the 'GenomeAnalysisTK.jar' file. If an empty string is given (default), the GATK_HOME environment variable is assumed to point to the GATK directory where GenomeAnalysisTK.jar is located. |
indelAnno | string | "QD,FS,HaplotypeScore,ReadPosRankSum,InbreedingCoeff,MQRankSum" | Names of the annotations that will be used in the indel model given in a comma-separated list. Note that MQ (RMS mapping quality) and MQRankSum should usually be left out in the indel model. |
memory | string | "4g" | The amount of Java heap memory allocated to GATK, given in the format "4g" for 4 gigabytes or "2560m" for 2560 megabytes (2.5 gigabytes). |
snpAnno | string | "QD,HaplotypeScore,MQRankSum,ReadPosRankSum,FS,MQ,InbreedingCoeff" | Names of the annotations that will be used in the snp model given in a comma-separated list. Note that DP (depth of coverage) should not be used for capture data (e.g. exome). DP annotation will however be automatically added to the list when 'capture' is false. |
threads | int | 1 | The number of parallel threads allocated to each run. |
truth | float | 99.0 | Level of true variant probability at which to start filtering. A lower value should add to the sensitivity but decrease the specificity. |
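
Three of the parameters above follow small rules that are easy to get wrong when preparing values by hand: the "-input"-separated 'files' string, the automatic addition of DP to 'snpAnno' when 'capture' is false, and the fallback to GATK_HOME when 'gatk' is empty. The sketch below restates those rules in Python. It is only an illustration of the documented behaviour, not the component's own code, and all function names are hypothetical.

```python
import os


def build_files_arg(paths):
    """Join vcf paths into the '-input'-separated string expected by 'files'."""
    return " ".join("-input " + path for path in paths)


def snp_annotations(snp_anno, capture):
    """Return the SNP annotation list, adding DP for non-capture data."""
    names = [name for name in snp_anno.split(",") if name]
    # DP is documented as unsuitable for capture data, so it is only added
    # when 'capture' is false (whole-genome or similar).
    if not capture and "DP" not in names:
        names.append("DP")
    return names


def resolve_gatk_jar(gatk, env=os.environ):
    """Locate GenomeAnalysisTK.jar from 'gatk' or, if empty, from GATK_HOME."""
    home = gatk if gatk else env.get("GATK_HOME", "")
    if not home:
        raise ValueError("neither 'gatk' nor GATK_HOME is set")
    return os.path.join(home, "GenomeAnalysisTK.jar")
```

For example, `build_files_arg(["FILE1.vcf", "FILE2.vcf"])` yields a string that can be given directly as the 'files' parameter.
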
Test case | Parameters | IN reference | IN variants | IN hapmap | IN omni | IN hcsnp | IN dbsnp | IN mills | OUT calls |
---|---|---|---|---|---|---|---|---|---|
case1 | properties | reference | variants | hapmap | omni | (missing) | dbsnp | mills | (expecting failure) |
# Run using less memory and simple annotations, |