

Removes tags from the reads in FASTQ/FASTA files. Specific tags for each end of the read can be given in a tags file, or they can be predicted. Overrepresented sequences found by FastQC (SeqQC component) can be given as input using the overrepresented output of QCParser. With -predict, the predicted tags are written to tags.csv in the output folder; with -stats, the statistics are written to stats.csv.

Version 2.1
Bundle sequencing
Authors Alejandra Cervera (alejandra.cervera@helsinki.fi)
Issue tracker View/Report issues
Source files component.xml AdaptorRemoval.sh
Usage Example with default values


Name Type Mandatory Description
reads BinaryFile Mandatory Input files in FASTQ/FASTA format.
tags BinaryFile Optional A file with the tags to be trimmed. The file needs to have tag3 or tag5 as the first field and the tag as the second field.
overSeqs BinaryFile Optional Overrepresented sequences obtained directly from the SeqQC component (via QCParser), or a file with adaptors or tags that should be trimmed without specifying from which end. Illumina adaptors are trimmed regardless of the percentage; other overrepresented sequences are trimmed only if they exceed the percentage parameter.
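
The tags file layout described above (tag3 or tag5 as the first field, the tag as the second) can be checked before a run with a small reader. This is only a sketch: the field separator is not stated here, so whitespace separation is an assumption, as is skipping blank lines and lines starting with "#". The function name read_tags is hypothetical and is not part of the component.

```python
from typing import Dict, Iterable, List


def read_tags(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group tag sequences by read end ("tag3" or "tag5")."""
    tags: Dict[str, List[str]] = {"tag3": [], "tag5": []}
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and "#" comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        # Assumption: fields are separated by whitespace (tab or space).
        fields = line.split()
        if len(fields) < 2 or fields[0] not in tags:
            raise ValueError(f"unexpected tags line: {raw!r}")
        tags[fields[0]].append(fields[1])
    return tags
```

A file opened with open(path) can be passed directly as lines, since the function only iterates over it.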


Name Type Description
arrayOut Array<BinaryFile> The key "read" points to the trimmed file if trimming was performed, or to the unmodified input read file if no trimming occurred. The keys "stats" and "tags" point to the corresponding files, which may or may not exist depending on the parameters chosen.
removalLog BinaryFile Log file of the tag removal (trimming) step.
predictLog BinaryFile Log file of the tag prediction step.
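
The "tags" file produced by a prediction run is tab-separated with the header line documented for the predict parameter ("#Param Tag_Sequence Tag_Length Percent_Explained"). A minimal sketch for reading it follows. The function name parse_predictions is hypothetical, and the assumptions that Tag_Length is an integer and Percent_Explained a plain number are inferred from the column names.

```python
import csv
import io
from typing import Dict, List, Union

PREDICT_HEADER = ["#Param", "Tag_Sequence", "Tag_Length", "Percent_Explained"]


def parse_predictions(text: str) -> List[Dict[str, Union[str, int, float]]]:
    """Parse the tab-separated prediction table into a list of records."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(rows, None)
    if header != PREDICT_HEADER:
        raise ValueError(f"unexpected header: {header!r}")
    records = []
    for row in rows:
        if not row:  # skip blank lines
            continue
        param, sequence, length, percent = row
        records.append({
            "param": param,
            "sequence": sequence,
            "length": int(length),      # assumption: integer column
            "percent": float(percent),  # assumption: numeric column
        })
    return records
```

An empty result (header only) corresponds to the case where no tags could be identified.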


Name Type Default Description
extra string "" Give one or more options as a string, e.g. extra="-trim_within". More options are listed on the Tag Cleaner documentation site, or use extra=-help to see them. This parameter accepts any option of the TagCleaner script, but for now it applies only when trimming, i.e., it cannot be used during the prediction step.
fastq boolean true Defines if the input file is FASTQ (true) or FASTA (false).
matrix string "exact" One of union, subset, or exact; controls how the tags are matched to the sequences.
mm3 int 1 Maximum number of allowed mismatches at the 3'-end. The 5'- and 3'-ends of the reads are configured independently to account for differences in tag sequences caused by limitations of the sequencing method used to generate the datasets. In most cases the 3'-end shows fewer tag matches at low mismatch counts, because tags at the ends of incompletely sequenced fragments are incomplete or missing. TagCleaner's own default is 0; this component defaults to 1.
mm5 int 1 Maximum number of allowed mismatches at the 5'-end. TagCleaner's own default is 0; this component defaults to 1.
out_format int 0 Output format: 1 (FASTA only), 2 (FASTA and QUAL), or 3 (FASTQ). If not defined, the output format is the same as the input format.
percentage int 15 The tags have to be overrepresented by this percentage to be trimmed.
predict boolean false Have TagCleaner predict the tag sequences. It attempts to predict the tag at either or both sites, if possible. The prediction algorithm assumes the randomness of a typical metagenome. Datasets that do not contain random sequences from organisms in an environment (for example, 16S data) may cause incorrect detection of the tag sequences. In such cases the tag sequences will most likely be over-predicted and can be redefined by the user before data processing. The prediction uses filtered base frequencies instead of raw base frequencies, which makes it more accurate because it accounts for incomplete and shifted tag sequences. The output values are separated by tabs with the header line: "#Param Tag_Sequence Tag_Length Percent_Explained". If no tags are reported, no tags could be identified in the dataset. TagCleaner's -predict option cannot be combined with -tag3, -tag5, or -stats. When this option is used, no trimming is performed.
stats boolean false Prints the number of tag sequences matching for different numbers of mismatches. In combination with -split, the number of sequences with fragment-to-fragment concatenations is printed as well. The output values are separated by tabs with the header line: "#Param Mismatches_or_Splits Number_of_Sequences Percentage Percentage_Sum". TagCleaner's -stats option cannot be combined with -predict and requires -tag5 or -tag3. In this component, if predict is true, the tags are predicted first and the stats are then obtained. stats cannot be combined with trim: if trim is also true, only the stats are run.
tag3 boolean true Set to true to trim the tag sequence at the 3'-end.
tag5 boolean true Set to true to trim the tag sequence at the 5'-end.
trim boolean true If true, trims the tags listed in the read_tags file.
verbose boolean false Prints status and info messages during processing.
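
The interaction between the overSeqs input and the percentage parameter can be illustrated with a short selection rule: Illumina adaptors are always kept for trimming, while other overrepresented sequences are kept only when they exceed the threshold. This is a sketch, not the component's code. How the component recognises an Illumina adaptor is not stated here, so the case-insensitive match on a source description is an assumption, as are the function name select_for_trimming and the (sequence, percentage, source) tuple layout.

```python
from typing import Iterable, List, Tuple


def select_for_trimming(
    over_seqs: Iterable[Tuple[str, float, str]],
    percentage: float = 15,
) -> List[str]:
    """Return the sequences that would be trimmed under the stated rule."""
    selected = []
    for sequence, percent, source in over_seqs:
        # Assumption: an adaptor is identified by "Illumina" in its source text.
        is_illumina = "illumina" in source.lower()
        # Other sequences must strictly exceed the percentage threshold.
        if is_illumina or percent > percentage:
            selected.append(sequence)
    return selected
```

With the default percentage of 15, a non-adaptor sequence at exactly 15 percent is not selected, because the rule as described requires the threshold to be exceeded.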

Test cases

Test case Parameters IN (reads, tags, overSeqs) OUT (arrayOut, removalLog, predictLog)
case1 properties reads (missing) (missing) arrayOut (missing) (missing)

# Testing AdaptorRemoval component,
fastq = true,
predict = true,
trim = true,
verbose = false,
mm3 = 1,
mm5 = 1

case2 properties reads (missing) (missing) (missing) (missing) (missing)

# Testing AdaptorRemoval component,
fastq = true,
predict = true,
trim = false,
verbose = true,
stats = true,
tag3 = true,
tag5 = true

case3 properties reads tags (missing) arrayOut (missing) (missing)

# Testing AdaptorRemoval component,
fastq = true,
predict = false,
trim = true,
verbose = false,
percentage = 25,
stats = false,
tag5 = true,
tag3 = true

Generated 2019-02-08 07:42:12 by Anduril 2.0.0