
AssemblyQuality

Evaluate the quality of a de novo genome assembly using a variety of metrics. The input assembly may come from SGA or another assembler.

The quality metrics are partially based on "Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species" (GigaScience 2013, 2:10). Metrics are computed separately for contigs and scaffolds. Some metrics (denoted R) are computed only when a reference genome is provided. These are computed for the whole genome, and some (denoted C) also individually for each reference chromosome. Metrics are as follows:
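As an illustration of the length-based statistics, the report output plots a generalization of N50. The sketch below is a minimal Python version of the usual Nx definition (the length of the shortest sequence in the smallest set of longest sequences that covers at least x percent of the total assembly length). The function names `nx` and `nx_curve` are hypothetical and the contig lengths are made up for the example; this is not the component's own implementation, which may differ in detail.

```python
def nx(lengths, x=50):
    """Return the Nx length for a list of sequence lengths.

    Sequences are taken from longest to shortest until they cover at
    least x percent of the total length; the length of the last sequence
    taken is Nx. With x=50 this is the familiar N50.
    """
    total = sum(lengths)
    if total == 0:
        return 0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    return 0


def nx_curve(lengths, step=10):
    """Return (x, Nx) pairs for x = step, 2*step, ..., up to 100."""
    return [(x, nx(lengths, x)) for x in range(step, 101, step)]


# Made-up contig lengths, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
example_n50 = nx(example_lengths)
example_curve = nx_curve(example_lengths, step=50)
```

Plotting the pairs from `nx_curve` gives a coverage versus length curve of the kind the report describes, with N50 as the single point at x = 50.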

Version 0.1
Bundle sequencing
Categories Assembly
Authors Kristian Ovaska (kristian.ovaska@helsinki.fi)
Requires Scala, BWA
Source files component.xml, function.scala

Inputs

Name Type Mandatory Description
contigs FASTA Mandatory Assembled contigs.
scaffolds FASTA Mandatory Assembled scaffolds.
reference FASTA Optional Reference genome. If present, contigs and scaffolds are aligned to this genome and additional metrics are computed. The FASTA file must be accompanied by a BWT index created with "bwa index", in the same directory. This can be done with the ReferenceIndexer component. For performance, a FASTA index (in Samtools format) is also recommended.
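If the reference is prepared by hand rather than with ReferenceIndexer, the two indexes can be built roughly as follows. This sketch assumes the index comes from BWA's `bwa index` command (BWA is listed under Requires) and that the Samtools-format FASTA index comes from `samtools faidx`; both tools must be on the PATH. The helper names `index_commands` and `build_indexes` are hypothetical.

```python
import shutil
import subprocess


def index_commands(reference_fasta):
    """Return the commands that index a reference FASTA in place."""
    ref = str(reference_fasta)
    return [
        # Writes the BWA index files next to the FASTA, using the FASTA
        # path as the default prefix.
        ["bwa", "index", ref],
        # Writes the Samtools FASTA index as <reference>.fai.
        ["samtools", "faidx", ref],
    ]


def build_indexes(reference_fasta):
    """Run each indexing command, failing early if a tool is missing."""
    for command in index_commands(reference_fasta):
        if shutil.which(command[0]) is None:
            raise RuntimeError(command[0] + " was not found on PATH")
        subprocess.run(command, check=True)
```

Calling `build_indexes("reference.fa")` (the file name is a placeholder) leaves the index files in the same directory as the FASTA file, which is where the component expects them.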

Outputs

Name Type Description
report Latex Assembly coverage vs. contig/scaffold length plots (generalization of N50).
contigStats CSV Contig statistics that do not depend on the reference. There are two columns: Statistic (name of the metric) and Value.
scaffoldStats CSV Scaffold statistics that do not depend on the reference. There are two columns: Statistic (name of the metric) and Value.
contigReferenceStats CSV Contig statistics that depend on the reference (R): coverage, validity, multiplicity and parsimony. There are three columns: Statistic, Chromosome and Value. Statistics for all chromosomes combined are in rows with Chromosome=ALL. If the reference is not given, this file is empty.
scaffoldReferenceStats CSV Scaffold statistics that depend on the reference (R): coverage, validity, multiplicity and parsimony. There are three columns: Statistic, Chromosome and Value. Statistics for all chromosomes combined are in rows with Chromosome=ALL. If the reference is not given, this file is empty.
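The reference-dependent tables can be read downstream along these lines. This is a sketch under assumptions: the delimiter is taken to be a tab (adjust the `delimiter` argument if the files differ), the function name `read_reference_stats` is hypothetical, and the sample rows are placeholders invented for the example, not real component output. Only the column names Statistic, Chromosome and Value and the Chromosome=ALL convention come from the descriptions above.

```python
import csv
import io


def read_reference_stats(handle, chromosome="ALL", delimiter="\t"):
    """Map Statistic to Value for one chromosome.

    The default chromosome "ALL" selects the rows that combine all
    chromosomes. Values are returned as strings, exactly as stored.
    An empty file (no reference given) yields an empty dict.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {
        row["Statistic"]: row["Value"]
        for row in reader
        if row["Chromosome"] == chromosome
    }


# Placeholder content; statistic spellings and values are assumptions.
sample = (
    "Statistic\tChromosome\tValue\n"
    "coverage\tALL\t0.9\n"
    "coverage\tchr1\t0.8\n"
    "validity\tALL\t0.7\n"
)
genome_wide = read_reference_stats(io.StringIO(sample))
per_chromosome = read_reference_stats(io.StringIO(sample), chromosome="chr1")
```

For a real run, pass an open file handle for contigReferenceStats or scaffoldReferenceStats instead of the in-memory sample.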

Test cases

Test case | Parameters | IN contigs | IN scaffolds | IN reference | OUT report | OUT contigStats | OUT scaffoldStats | OUT contigReferenceStats | OUT scaffoldReferenceStats
case1 | (missing) | contigs | scaffolds | reference | (missing) | (missing) | (missing) | (missing) | (missing)
case2_noref | (missing) | contigs | scaffolds | (missing) | (missing) | (missing) | (missing) | (missing) | (missing)
case3_empty | (missing) | contigs | scaffolds | reference | (missing) | contigStats | scaffoldStats | contigReferenceStats | scaffoldReferenceStats

Generated 2019-02-08 07:42:21 by Anduril 2.0.0