FastQScreen screens a library of sequences in FastQ format against a set of sequence databases (for example vectors, viruses, or ribosomal RNA) so you can check whether the composition of the library matches what you expect.
FastQScreen is intended to be used as part of a QC pipeline. It searches a sequence dataset against a set of bowtie databases, then generates both a text and a graphical summary of the results.
Ideally, the output will show a high percentage of reads that did not align to the sequences provided in dbList, since most of your reads should align to the genome of interest, which should not be included in dbList.
To parallelize the component execution, use the "custom_cpu" metadata annotation.
Version | 1.0 |
---|---|
Bundle | sequencing |
Categories | Preprocessing |
Authors | Gabriele Partel (gabrielepartel@gmail.com) |
Issue tracker | View/Report issues |
Requires | Bowtie ; Bowtie2 ; GD::Graph ; installer (bash) |
Source files | component.xml main.sh |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
dbList | CSV | Mandatory | Tab-separated file that lists the databases to screen against. For each database, provide a database name (which must not contain spaces) and the location of the bowtie indices you created for that database. |
reads | FASTQ | Mandatory | Reads in FASTQ format. |
mates | FASTQ | Optional | Mates in FASTQ format. |
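As an illustration of the dbList input described above, the following is a minimal sketch of generating such a tab-separated file. The header names (`Database`, `Index`), the database names, and the index paths are all hypothetical; the actual column names expected by the component are not stated here and should be checked against the component's example file.

```python
import csv
import io


def build_db_list(databases):
    """Return tab-separated dbList text for (name, index_path) pairs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Assumption: the header names below are placeholders, not confirmed
    # by the component documentation.
    writer.writerow(["Database", "Index"])
    for name, index_path in databases:
        # The documentation states that database names can't contain spaces.
        if any(ch.isspace() for ch in name):
            raise ValueError(f"database name contains whitespace: {name!r}")
        writer.writerow([name, index_path])
    return buffer.getvalue()


# Hypothetical database names and bowtie index locations.
db_list_text = build_db_list([
    ("PhiX", "/data/indices/phix/phix"),
    ("rRNA", "/data/indices/rrna/rrna"),
])
```

Writing `db_list_text` to a file would give one candidate dbList input; rejecting names with whitespace early avoids a malformed configuration.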
Name | Type | Description |
---|---|---|
folder | BinaryFolder | Output folder. |
NoHitPercentage | CSV | Percentage of reads that did not align to any of the genomes provided. |
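The NoHitPercentage value can be thought of as a simple ratio. The sketch below shows one way to compute it, assuming a hypothetical in-memory representation in which each read is paired with the set of database names it aligned to; the component's actual internal bookkeeping and output layout are not described here.

```python
def no_hit_percentage(hits_per_read):
    """Return the percentage of reads with no alignment to any database.

    hits_per_read: one collection of database names per read
    (hypothetical representation; an empty collection means no hit).
    """
    total = len(hits_per_read)
    if total == 0:
        # No reads were screened, so there is nothing to report.
        return 0.0
    unaligned = sum(1 for hits in hits_per_read if not hits)
    return 100.0 * unaligned / total


# Three of four reads hit none of the screened databases.
example = no_hit_percentage([set(), {"PhiX"}, set(), set()])
```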
Name | Type | Default | Description |
---|---|---|---|
aligner | string | "bowtie2" | Specify the aligner to use for the mapping. Valid arguments are 'bowtie' or 'bowtie2'. |
bisulfite | boolean | false | Set to true when processing bisulfite libraries. Either conventional or bisulfite libraries may be specified, but not both simultaneously. |
illumina1_3 | boolean | false | If true, assumes that the quality values are encoded in Illumina v1.3 format. If false, Sanger format is assumed. |
nohits | boolean | false | If true, writes the sequences that did not map to any of the specified genomes to a file. If the subset option is also specified, only reads from the temporary dataset that failed to align to the reference genomes are written to the output file. |
subset | int | 100000 | Don't use the whole sequence file; instead, create a temporary dataset of approximately this number of reads (within a factor of 2). If the real dataset is smaller than twice the specified size, the whole dataset is used. Subsets are taken evenly from throughout the original dataset. To process the whole dataset, set this parameter to 0. |
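The behaviour of the subset parameter can be sketched as follows. This is an illustration of the rules stated in the table, not the component's actual code: the use of integer division to pick a sampling step, and the function names `sampling_interval` and `sample_evenly`, are assumptions made for the example.

```python
def sampling_interval(total_reads, subset):
    """Return the step between sampled reads; 1 means use every read."""
    # subset == 0 requests the whole dataset; a dataset smaller than
    # twice the subset size is also used in full.
    if subset == 0 or total_reads < 2 * subset:
        return 1
    # Assumption: taking every n-th read spreads the sample evenly and
    # keeps its size within a factor of 2 of the requested subset.
    return total_reads // subset


def sample_evenly(reads, subset):
    """Return reads taken at a regular interval across the whole list."""
    return reads[::sampling_interval(len(reads), subset)]


# 10 reads with subset=3 gives a step of 3 and keeps 4 reads.
sampled = sample_evenly(list(range(10)), 3)
```

With the default of 100000, a file of 250000 reads would be sampled at every second read under this sketch, while a file of 150000 reads would be used in full.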
Test case | Parameters | IN dbList | IN reads | IN mates | OUT folder | OUT NoHitPercentage |
---|---|---|---|---|---|---|
case1 | properties | dbList | reads | mates | (missing) | (missing) |
case2 | (missing) | dbList | reads | (missing) | (missing) | (missing) |
case3 | properties | dbList | reads | (missing) | (missing) | (missing) |
nohits=true |