Quality control function for high-throughput sequencing data. It takes an array of paired-end or single-end FASTQ reads, trims them, and filters out low-quality sequences. It is used in whole genome (WGS), including bisulfite, or targeted (exome, RNA-seq, scRNA-seq) sequencing analyses. The function carries out the following steps:
Most parameters work the same way with all tools, but some behave slightly differently (e.g. stringency) or are specific to one of the tools.
QCFasta may fail to create the quality report if the sample keys are single numbers from 1-10; it is recommended to use keys that are clearly strings, not bare numbers. If the fastQCfolders input is used, the parameters readKey and mateKey must be set. Each is a substring of the names FastQC gave to the folders containing the read and mate statistics. Usually read files end in "_1.fq" and mate files end in "_2.fq"; in that case set readKey="_1" and mateKey="_2". If you used SeqQC, the words "read" and "mate" were added to the folder names, so you do not need to modify readKey and mateKey: "read" and "mate" are their default values.
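As a rough illustration of how readKey and mateKey act as substring selectors, the following Python sketch sorts folder names into reads and mates. The function name `classify_fastqc_folders` and the folder names are hypothetical, and this is not the component's actual implementation.

```python
def classify_fastqc_folders(folder_names, read_key="read", mate_key="mate"):
    """Split folder names into reads, mates, and ambiguous names by substring match."""
    reads, mates, unknown = [], [], []
    for name in folder_names:
        has_read = read_key in name
        has_mate = mate_key in name
        if has_read and not has_mate:
            reads.append(name)
        elif has_mate and not has_read:
            mates.append(name)
        else:
            # Matches both keys or neither key: cannot be classified safely.
            unknown.append(name)
    return reads, mates, unknown


# Hypothetical folder names for files ending in "_1.fq" and "_2.fq".
folders = ["sampleA_1_fastqc", "sampleA_2_fastqc",
           "sampleB_1_fastqc", "sampleB_2_fastqc"]
reads, mates, unknown = classify_fastqc_folders(folders, read_key="_1", mate_key="_2")
```

In this sketch a name such as "s_10_2_fastqc" contains both "_1" and "_2" and ends up in `unknown`, which shows why the keys should be chosen so that they match only one kind of folder.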
Overview of tricky parameters: More accurate descriptions of the parameters can be found in the documentation of each tool or on the individual components for Trimmomatic, TrimGalore, and Fastx (SmallRNAprep).
Version | 5.0 |
---|---|
Bundle | sequencing |
Categories | Quality control |
Authors | Alejandra Cervera (alejandra.cervera@helsinki.fi), Erkka Valo (erkka.valo@helsinki.fi) |
Issue tracker | View/Report issues |
Source files | component.xml function.scala |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
reads | Array<FASTQ> | Mandatory | An array of reads. |
mates | Array<FASTQ> | Optional | An array of mates for the reads if the data is paired-end. The matching of reads to mates is done by the array keys. |
fastQCfolders | Array<BinaryFolder> | Optional | If FastQC output is available you can skip running it again by providing the output folders as an array. |
adapter | FASTA | Optional | Adapter file in fasta format that can only be used with trimmomatic, either single-cell or bulk. |
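Because reads are matched to mates by array key, both arrays need to use the same key for the same sample. The Python sketch below models the two arrays as plain dictionaries to show the idea; `pair_by_key`, the sample keys, and the file names are hypothetical, and the real component may handle unmatched keys differently.

```python
def pair_by_key(reads, mates):
    """Pair read and mate files that share an array key."""
    paired = {key: (reads[key], mates[key]) for key in reads if key in mates}
    reads_only = sorted(key for key in reads if key not in mates)
    mates_only = sorted(key for key in mates if key not in reads)
    return paired, reads_only, mates_only


# Hypothetical arrays: key -> FASTQ file name.
reads = {"sampleA": "sampleA_1.fq", "sampleB": "sampleB_1.fq"}
mates = {"sampleA": "sampleA_2.fq", "sampleC": "sampleC_2.fq"}
paired, reads_only, mates_only = pair_by_key(reads, mates)
```

Here only "sampleA" is paired; "sampleB" has no mate and "sampleC" has no read, so a key mismatch between the two arrays is easy to spot before running the workflow.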
Name | Type | Description |
---|---|---|
qcReads | Array<FASTQ> | An array of sequences that passed the quality control step |
qcMates | Array<FASTQ> | An array of mate sequences that passed the quality control step |
report | HTML | Quality control report |
table | CSV | Statistics of all processed samples. |
qcUnpairedReads | Array<FASTQ> | An array of sequences of which only the read passed the quality control step |
qcUnpairedMates | Array<FASTQ> | An array of sequences of which only the mate passed the quality control step |
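The relationship between the four sequence outputs can be summarised with a small Python sketch. It assumes, based only on the descriptions above, that a pair goes to qcReads and qcMates when both members pass, and to one of the unpaired outputs when only one member passes. The function `route_pair` is hypothetical.

```python
def route_pair(read_passed, mate_passed):
    """Return the output ports that receive sequences from one read/mate pair."""
    if read_passed and mate_passed:
        return ["qcReads", "qcMates"]
    if read_passed:
        return ["qcUnpairedReads"]
    if mate_passed:
        return ["qcUnpairedMates"]
    # Neither member passed quality control: the pair is dropped.
    return []


both = route_pair(True, True)
read_only = route_pair(True, False)
mate_only = route_pair(False, True)
```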
Name | Type | Default | Description |
---|---|---|---|
adapterSeq | string | "" | Adapter specified directly as a string for TrimGalore or FastX; for Trimmomatic you can either provide the fasta file as input or specify here the Illumina adapter to use: TruSeq2-SE.fa, TruSeq2-PE.fa, TruSeq3-SE.fa, or TruSeq3-PE.fa. |
crop | int | -1 | Trim bases from the end of the read so that it is at most crop bases long. Only for Trimmomatic and FastX. The default value (-1) only works for Trimmomatic; for FastX it must be set to the desired length (e.g. 32). |
extra | string | "" | Extra parameters for Trim Galore! or for FastX. |
gzip | boolean | false | Defines if the output sequences should be gzipped or not. |
headcrop | int | 0 | The number of bases to remove from the start of the read. No trimming is done if the value is set to 0. |
isSinglecell | boolean | false | Defines whether the read files are from Linnarsson's single-cell RNA-seq protocol (STRT). |
keepBothReads | boolean | false | Defines whether to keep the reverse read. After read-through has been detected by palindrome mode and the adapter sequence removed, the reverse read contains the same sequence information as the forward read, albeit in reverse complement. Only for Trimmomatic. |
mateKey | string | "mate" | Key to distinguish mates from reads in the fastQCfolders input. |
minLength | int | 20 | Reads shorter than minLength will be removed. In paired-end sequencing also the corresponding mate is removed. No trimming is done if the value is set to 0. |
minPercent | int | 20 | Minimum percentage of bases that must have at least minQuality for a read to be kept. Only FastX. |
minQuality | int | 20 | Bases below this quality threshold will be trimmed from the 5' end of the sequence. |
palindromeClip | int | 30 | Specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment. Only Trimmomatic. |
percent | float | 0.3 | Percentage of good quality reads needed to keep the file. |
qual | string | "" | Quality encoding used by the sequencer (phred33 or phred64). If empty, FastQC is used to guess the encoding. |
readKey | string | "read" | Key to distinguish reads from mates in the fastQCfolders input. |
simpleClip | int | 12 | Threshold specifying how accurate the match between any adapter sequence and a read must be. Each matching base adds just over 0.6 to the score. Only Trimmomatic. |
slidingWindow | string | "null" | Sliding window trimming where the sequence is cut if the average quality of the bases within the sliding window falls below the defined threshold. A string specifies the window size and the average required quality in the sliding window. The format is windowSize:requiredQuality. For example, 4:15 (window size = 4; required quality = 15). No trimming is done if value is set to 'null'. Only Trimmomatic. |
stringency | int | 2 | Minimum overlap of the sequence with the adapter for the bases to be trimmed (TrimGalore and FastX); allowed mismatches with the adapter (Trimmomatic). |
temSwitPrimer | string | "GGG" | Minimal template-switching generated Gs. |
threads | int | 2 | Number of threads to use for the multi-threading components. |
tool | string | "trimmomatic" | Choose trimmomatic, trimGalore or fastx for adapter removal and quality trimming |
trailing | int | 30 | Remove bases from the end of the read, if quality value is below the given threshold. Only Trimmomatic. |
umiLen | int | 6 | The length of unique molecular identifiers (UMIs). Only used when isSinglecell is true. |
umiSliding | string | "1:17" | Sliding window for UMIs. For example, when umiSliding = 1:17, any UMI bases with a quality lower than 17 will be removed. Only used when isSinglecell is true. |
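To make the windowSize:requiredQuality format of slidingWindow concrete, here is a simplified Python sketch of sliding window trimming. It scans from the start of the read and cuts at the first window whose average quality falls below the threshold. The function names are hypothetical, and Trimmomatic's own implementation may differ in detail (for example in how it treats the bases inside the failing window).

```python
def parse_sliding_window(spec):
    """Parse 'windowSize:requiredQuality'; 'null' disables trimming."""
    if spec == "null":
        return None
    size_text, quality_text = spec.split(":")
    return int(size_text), float(quality_text)


def sliding_window_trim(qualities, spec):
    """Return how many bases to keep from the start of the read."""
    parsed = parse_sliding_window(spec)
    if parsed is None:
        return len(qualities)
    size, required = parsed
    for start in range(0, len(qualities) - size + 1):
        window = qualities[start:start + size]
        if sum(window) / size < required:
            # Cut the read where the first low-quality window begins.
            return start
    return len(qualities)


# Five good bases followed by four poor ones (hypothetical Phred scores).
qualities = [30, 30, 30, 30, 30, 10, 10, 10, 10]
kept = sliding_window_trim(qualities, "4:15")        # 5
untouched = sliding_window_trim(qualities, "null")   # 9
```

With the 4:15 setting the window starting at the fifth base still averages exactly 15, so it is kept; the next window averages 10 and triggers the cut.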
Test case | Parameters | IN reads | IN mates | IN fastQCfolders | IN adapter | OUT qcReads | OUT qcMates | OUT report | OUT table | OUT qcUnpairedReads | OUT qcUnpairedMates |
---|---|---|---|---|---|---|---|---|---|---|---|
case1 | properties: readKey=_1, | reads | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |
case2 | properties: readKey=_1, | reads | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |
case3 | properties: readKey=_1, | reads | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |
case4 | properties: readKey =_reads, | reads | mates | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |
case5 | properties: readKey =_reads, | reads | mates | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |
case6 | properties: tool=trimmomatic | reads | mates | fastQCfolders | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |