Evaluate the quality of a de novo genome assembly using a variety of metrics. The input assembly may come from SGA or another assembler.
The quality metrics are partially based on "Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species" (GigaScience 2013, 2:10). Metrics are computed separately for contigs and scaffolds. Some metrics (marked R) are computed only when a reference genome is provided. These metrics are computed for the whole genome, and some (marked C) are also computed individually for each reference chromosome. Metrics are as follows:
Version | 0.1 |
---|---|
Bundle | sequencing |
Categories | Assembly |
Authors | Kristian Ovaska (kristian.ovaska@helsinki.fi) |
Requires | Scala ; BWA |
Source files | component.xml function.scala |
Name | Type | Mandatory | Description |
---|---|---|---|
contigs | FASTA | Mandatory | Assembled contigs. |
scaffolds | FASTA | Mandatory | Assembled scaffolds. |
reference | FASTA | Optional | Reference genome. If present, contigs and scaffolds are aligned to this genome and additional metrics are computed. The FASTA file must be accompanied by a BWA index created with "bwa index", located in the same directory. The index can be created with the ReferenceIndexer component. For performance, a FASTA index (in Samtools format) is also recommended. |
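
Before running the component with a reference, it can help to check that the index files sit next to the FASTA file. The following is a minimal sketch, not part of the component: `missing_index_files` is a hypothetical helper, and the suffix list assumes the files that classic BWA's `bwa index` writes when given the FASTA path as prefix (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`) plus the `.fai` file written by `samtools faidx`. Which of these files the component actually requires is not stated here.

```python
from pathlib import Path

# Assumption: suffixes appended to the FASTA file name by classic `bwa index`.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")
# Suffix appended by `samtools faidx` (recommended, not mandatory).
FAIDX_SUFFIX = ".fai"


def missing_index_files(reference, require_faidx=False):
    """Return the names of index files that are absent next to `reference`."""
    ref = Path(reference)
    suffixes = BWA_INDEX_SUFFIXES
    if require_faidx:
        suffixes = suffixes + (FAIDX_SUFFIX,)
    # Index files live in the same directory and extend the full file name.
    candidates = [ref.with_name(ref.name + suffix) for suffix in suffixes]
    return [path.name for path in candidates if not path.exists()]
```

An empty return value means every expected file was found; otherwise the list names the files to create before passing the reference to the component.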
Name | Type | Description |
---|---|---|
report | Latex | Assembly coverage vs. contig/scaffold length plots (generalization of N50). |
contigStats | CSV | Contig statistics that do not depend on the reference. There are two columns: Statistic (name of the metric) and Value. |
scaffoldStats | CSV | Scaffold statistics that do not depend on the reference. There are two columns: Statistic (name of the metric) and Value. |
contigReferenceStats | CSV | Contig statistics that depend on the reference (R): coverage, validity, multiplicity and parsimony. There are three columns: Statistic, Chromosome and Value. Statistics for all chromosomes combined are in rows with Chromosome=ALL. If the reference is not given, this file is empty. |
scaffoldReferenceStats | CSV | Scaffold statistics that depend on the reference (R): coverage, validity, multiplicity and parsimony. There are three columns: Statistic, Chromosome and Value. Statistics for all chromosomes combined are in rows with Chromosome=ALL. If the reference is not given, this file is empty. |
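
The report plots are described as a generalization of N50. As a rough illustration of that idea, and not the component's actual code, the sketch below computes Nx: the largest length L such that sequences of length at least L together cover x percent of the total assembled length. The names `nx` and `nx_curve` are made up for this example. Whether the report normalizes by assembly size or by reference size is not stated here; the sketch assumes assembly size.

```python
def nx(lengths, percent):
    """Return the Nx value of a list of contig or scaffold lengths.

    `percent` is an integer in 1..100; percent=50 gives the usual N50.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length
    return ordered[-1]  # not reached for valid input; kept as a safe fallback


def nx_curve(lengths, percents=range(10, 101, 10)):
    """Return (percent, Nx) pairs, i.e. the points of a coverage vs. length curve."""
    return [(p, nx(lengths, p)) for p in percents]
```

For example, for lengths 80, 70, 50, 40, 30 and 20 (total 290), the two longest sequences already cover half of the total, so `nx(lengths, 50)` is 70.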
Test case | Parameters | IN contigs | IN scaffolds | IN reference | OUT report | OUT contigStats | OUT scaffoldStats | OUT contigReferenceStats | OUT scaffoldReferenceStats |
---|---|---|---|---|---|---|---|---|---|
case1 | (missing) | contigs | scaffolds | reference | (missing) | (missing) | (missing) | (missing) | (missing) |
case2_noref | (missing) | contigs | scaffolds | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |
case3_empty | (missing) | contigs | scaffolds | reference | (missing) | contigStats | scaffoldStats | contigReferenceStats | scaffoldReferenceStats |
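
Downstream code that reads contigReferenceStats or scaffoldReferenceStats has to cope with the no-reference case (as in case2_noref), where these files are empty. The following is a minimal sketch of such a reader under stated assumptions: the column names Statistic, Chromosome and Value come from the output descriptions, while the tab delimiter, the function names and the sample values are assumptions made for illustration. Values are kept as strings because their numeric format is not specified here.

```python
import csv
import io


def read_reference_stats(handle, delimiter="\t"):
    """Return {statistic: {chromosome: value}} from a reference stats CSV.

    An empty or header-only file yields an empty dictionary.
    The tab delimiter is an assumption; pass another one if needed.
    """
    stats = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        stats.setdefault(row["Statistic"], {})[row["Chromosome"]] = row["Value"]
    return stats


def genome_wide(stats):
    """Keep only the rows for all chromosomes combined (Chromosome=ALL)."""
    return {name: by_chrom["ALL"] for name, by_chrom in stats.items() if "ALL" in by_chrom}


# Hypothetical sample content; the values are invented for this example.
sample = (
    "Statistic\tChromosome\tValue\n"
    "coverage\tALL\t0.9\n"
    "coverage\tchr1\t0.95\n"
    "validity\tALL\t0.8\n"
)
stats = read_reference_stats(io.StringIO(sample))
print(genome_wide(stats))
```

With a real file, replace the `io.StringIO` object with `open(path, newline="")`.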