Removes tags from the reads in FASTQ/FASTA files. Specific tags for each end of the read can be given in a tags file, or they can be predicted. Overrepresented sequences found by FastQC (SeqQC component) can be given as input using the QCParser overrepresented output. With -predict, the predicted tags are written to tags.csv in the output folder; with -stats, the information is saved in stats.csv.
Version | 2.1 |
---|---|
Bundle | sequencing |
Categories | |
Authors | Alejandra Cervera (alejandra.cervera@helsinki.fi) |
Issue tracker | View/Report issues |
Source files | component.xml AdaptorRemoval.sh |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
reads | BinaryFile | Mandatory | Input files in FASTQ/FASTA format. |
tags | BinaryFile | Optional | A file with the tags to be trimmed. The file needs to have tag3 or tag5 as the first field and the tag as the second field. |
overSeqs | BinaryFile | Optional | Overrepresented sequences obtained directly from the SeqQC component (via QCParser), or a file with adaptors or tags that should be trimmed without specifying from which end. Illumina adaptors are trimmed regardless of the percentage; other overrepresented sequences are trimmed only if they exceed the percentage parameter. |
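
The two optional inputs described above can be illustrated with a minimal Python sketch. It is not the component's code (the component itself is implemented in AdaptorRemoval.sh), and it rests on several assumptions: that the fields of the tags file are separated by tabs, spaces, or commas; that lines starting with `#` are comments; and that an Illumina adaptor can be recognised by the word "Illumina" in the source annotation of an overrepresented sequence. The function names and the `(sequence, percent, source)` record layout are hypothetical.

```python
def parse_tags(text):
    """Read tags-file text into a dict such as {"tag5": "ACGT", "tag3": "TTGCA"}.

    The first field must be tag3 or tag5 and the second field is the tag.
    Assumption: fields are separated by tabs, spaces, or commas.
    """
    tags = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        fields = line.replace(",", " ").split()
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected two fields")
        end, sequence = fields[0], fields[1]
        if end not in ("tag3", "tag5"):
            raise ValueError(f"line {line_no}: first field must be tag3 or tag5")
        tags[end] = sequence.upper()
    return tags


def select_overrepresented(records, percentage=15):
    """Pick the overrepresented sequences that would be trimmed.

    records: iterable of (sequence, percent, source) tuples (hypothetical layout).
    Illumina adaptors are kept regardless of their percentage; any other
    sequence is kept only if its percentage exceeds the threshold.
    """
    selected = []
    for sequence, percent, source in records:
        # Assumption: adaptors are identified by their source annotation.
        is_illumina = "illumina" in source.lower()
        if is_illumina or percent > percentage:
            selected.append(sequence)
    return selected
```

The default of 15 for `percentage` mirrors the component's `percentage` parameter; the strict comparison reflects the wording "exceed" and is an interpretation rather than confirmed behaviour.
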
Name | Type | Description |
---|---|---|
arrayOut | Array<BinaryFile> | The key "read" points to the trimmed file if trimming was performed, or to the unmodified input read file if no trimming occurred. The keys "stats" and "tags" point to the corresponding files, which may or may not exist depending on the parameters chosen. |
removalLog | BinaryFile | Log file |
predictLog | BinaryFile | Log file |
Name | Type | Default | Description |
---|---|---|---|
extra | string | "" | One or more TagCleaner options given as a string, e.g. extra="-trim_within". More options are listed on the TagCleaner documentation site, or use extra="-help" to see them. This parameter accepts any parameter of the TagCleaner script, but currently it only applies when trimming, i.e. it cannot be used during the prediction step. |
fastq | boolean | true | Defines if the input file is FASTQ (true) or FASTA (false). |
matrix | string | "exact" | How to match the tags to the sequences: union, subset, or exact. |
mm3 | int | 1 | Maximum number of allowed mismatches at the 3'-end. The 5'- and 3'-ends of the reads are configured independently to account for differences in tag sequences caused by the limitations of the sequencing method used to generate the datasets. In most cases the 3'-end will show fewer tag matches at low mismatch counts, because tags at the ends of incompletely sequenced fragments are incomplete or missing. [TagCleaner default: 0] |
mm5 | int | 1 | Maximum number of allowed mismatches at the 5'-end. [TagCleaner default: 0] |
out_format | int | 0 | To change the output format, use one of the following options. If not defined, the output format will be the same as the input format. 1 (FASTA only), 2 (FASTA and QUAL) or 3 (FASTQ) |
percentage | int | 15 | The tags have to be overrepresented by this percentage to be trimmed. |
predict | boolean | false | Use this option to have TagCleaner predict the tag sequences. It will attempt to predict the tag at either or both sites, if possible. The algorithm implemented for the tag prediction assumes the randomness of a typical metagenome. Datasets that do not contain random sequences from organisms in an environment, but rather contain, for example, 16S data may cause incorrect detection of the tag sequences. However, the tag sequences will most likely be over-predicted and can be redefined by the user prior to data processing. The tag sequence prediction uses filtered base frequencies instead of raw base frequencies. This allows a more accurate prediction as it accounts for incomplete and shifted tag sequences. The output values are separated by tabs with the header line: "#Param Tag_Sequence Tag_Length Percent_Explained". If no tags are reported, then no tags could be identified in the data set. Cannot be used in combination with -tag3 or -tag5 or -stats. When using this option, no trimming will be performed. |
stats | boolean | false | Prints the number of tag sequences matching for different numbers of mismatches. In combination with -split, the number of sequences with fragment-to-fragment concatenations is printed as well. The output values are separated by tabs with the header line: "#Param Mismatches_or_Splits Number_of_Sequences Percentage Percentage_Sum". In TagCleaner itself, -stats cannot be used in combination with -predict and requires -tag5 or -tag3. In this component, if predict is true, the tags are predicted first and the stats are then obtained. stats cannot be used in combination with trim, so if trim is also true only stats will be run. |
tag3 | boolean | true | Set to true if tag sequence at 3'-end will be trimmed. |
tag5 | boolean | true | Set to true if tag sequence at 5'-end will be trimmed. |
trim | boolean | true | If true then it trims the tags in the read_tags file. |
verbose | boolean | false | Prints status and info messages during processing. |
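
The `predict` and `stats` descriptions above both specify tab-separated output with a header line that starts with `#`. As an illustration, such output could be read in Python along the following lines. This is a sketch under assumptions: the function name is hypothetical, and it is not confirmed here that tags.csv and stats.csv contain exactly this layout, only that TagCleaner's output is described this way.

```python
import csv
import io


def read_tagcleaner_table(text):
    """Parse tab-separated TagCleaner-style output into a list of dicts.

    The first non-empty line must be the header and start with '#',
    e.g. "#Param<TAB>Tag_Sequence<TAB>Tag_Length<TAB>Percent_Explained".
    The leading '#' is stripped from the first column name.
    """
    header = None
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip empty lines
        if header is None:
            if not fields[0].startswith("#"):
                raise ValueError("expected a header line starting with '#'")
            header = [fields[0].lstrip("#")] + fields[1:]
            continue
        # All values are kept as strings; convert them as needed downstream.
        rows.append(dict(zip(header, fields)))
    return rows
```

An empty result then corresponds to the case described for `predict`, where no tags could be identified in the data set.
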
Test case | Parameters | IN reads | IN tags | IN overSeqs | OUT arrayOut | OUT removalLog | OUT predictLog |
---|---|---|---|---|---|---|---|
case1 | properties | reads | (missing) | (missing) | arrayOut | (missing) | (missing) |
# Testing AdaptorRemoval component, |
case2 | properties | reads | (missing) | (missing) | (missing) | (missing) | (missing) |
# Testing AdaptorRemoval component, |
case3 | properties | reads | tags | (missing) | arrayOut | (missing) | (missing) |
# Testing AdaptorRemoval component, |