
Cuffdiff

Cufflinks includes a program, "Cuffdiff", that you can use to find significant changes in transcript expression, splicing, and promoter use.

Cuffdiff takes a GTF2/GFF3 file of transcripts as input, along with two or more SAM files containing the fragment alignments for two or more samples. It produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes. It also tracks changes in the relative abundance of transcripts sharing a common transcription start site, and in the relative abundances of the primary transcripts of each gene. Tracking the former allows one to see changes in splicing, and the latter lets one see changes in relative promoter use within a gene.

If you have more than one replicate for a sample, supply the SAM files for the sample as a single comma-separated list. It is not necessary to have the same number of replicates for each sample.

Cuffdiff requires that transcripts in the input GTF be annotated with certain attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use. These attributes are:

Attribute Description
tss_id The ID of this transcript's inferred start site. Determines which primary transcript this processed transcript is believed to come from. Cuffcompare appends this attribute to every transcript reported in the .combined.gtf file.
p_id The ID of the coding sequence this transcript contains. This attribute is attached by Cuffcompare to the .combined.gtf records only when it is run with a reference annotation that includes CDS records. Further, differential CDS analysis is only performed when all isoforms of a gene have p_id attributes, because neither Cufflinks nor Cuffcompare attempts to assign an open reading frame to transcripts.

Note: If an arbitrary GTF/GFF3 file is used as input (instead of the .combined.gtf file produced by Cuffcompare), these attributes will not be present, but Cuffcompare can still be used to obtain these attributes with a command like this:

cuffcompare -s /path/to/genome_seqs.fa -r annotation.gtf annotation.gtf

The cuffcmp.combined.gtf file created by this command will have the tss_id and p_id attributes added to each record, and it can be used as input for Cuffdiff.
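Before running Cuffdiff, it can be useful to confirm that the input GTF actually carries these attributes. The following is a minimal sketch and not part of Cufflinks or this component. The function names `parse_gtf_attributes` and `missing_attributes` are hypothetical, and the code assumes standard GTF2 attribute syntax (`key "value";` pairs in the ninth tab-separated column).

```python
import re

# Matches GTF2 attribute pairs such as: tss_id "TSS1";
_ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(line):
    """Return the attribute dict of one GTF record, or None for comments and short lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None
    return dict(_ATTR_RE.findall(fields[8]))


def missing_attributes(lines, required=("tss_id", "p_id")):
    """Map each required attribute to the set of transcript_ids that lack it."""
    missing = {name: set() for name in required}
    for line in lines:
        attrs = parse_gtf_attributes(line)
        if attrs is None or "transcript_id" not in attrs:
            continue
        for name in required:
            if name not in attrs:
                missing[name].add(attrs["transcript_id"])
    return missing
```

On a real file, an open file handle can be passed as `lines`. An empty set for `tss_id` and for `p_id` suggests that the file is ready for the full set of Cuffdiff tests.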

Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group. There are four FPKM tracking files: isoforms.fpkm_tracking, genes.fpkm_tracking, cds.fpkm_tracking, and tss_groups.fpkm_tracking.
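The summing rule can be illustrated with a short sketch. It is not Cuffdiff's implementation; the function `sum_fpkm_by_group`, the record layout, and the FPKM values are invented for illustration.

```python
from collections import defaultdict


def sum_fpkm_by_group(transcripts, group_key):
    """Sum transcript FPKMs over every transcript sharing the same group_key value."""
    totals = defaultdict(float)
    for t in transcripts:
        group = t.get(group_key)
        if group is not None:  # transcripts without the attribute join no group
            totals[group] += t["fpkm"]
    return dict(totals)


# Made-up values for illustration only.
transcripts = [
    {"transcript_id": "T1", "gene_id": "G1", "tss_id": "TSS1", "fpkm": 10.0},
    {"transcript_id": "T2", "gene_id": "G1", "tss_id": "TSS1", "fpkm": 5.0},
    {"transcript_id": "T3", "gene_id": "G1", "tss_id": "TSS2", "fpkm": 2.5},
]

gene_fpkm = sum_fpkm_by_group(transcripts, "gene_id")  # {"G1": 17.5}
tss_fpkm = sum_fpkm_by_group(transcripts, "tss_id")    # {"TSS1": 15.0, "TSS2": 2.5}
```

Grouping by `gene_id` corresponds to the gene-level tracking file and grouping by `tss_id` to the primary transcript (TSS group) tracking file.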

Note: In this component, the default value of every numeric option is -1. A value of -1 means that Cuffdiff's own default is used; that default differs from option to option and is given in the description of each parameter.
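The -1 convention, together with the comma-separated replicate lists described earlier, can be sketched as follows. This is an illustrative sketch and not the component's actual implementation, which lives in Cuffdiff.java and is not shown here. The function `build_cuffdiff_command` and the mapping from component parameter names to flags are assumptions; the flags used (`--FDR`, `--num-threads`, `--min-alignment-count`) are Cuffdiff command-line options.

```python
# Assumed mapping from component parameter names to Cuffdiff flags.
NUMERIC_FLAGS = {
    "FDR": "--FDR",
    "num_threads": "--num-threads",
    "min_alignment_count": "--min-alignment-count",
}


def build_cuffdiff_command(transcripts_gtf, samples, params):
    """Return the argument list for a cuffdiff call.

    samples is a list of samples, each a list of replicate SAM paths.
    Numeric parameters equal to -1 are omitted so Cuffdiff applies its own default.
    """
    if len(samples) < 2:
        raise ValueError("Cuffdiff needs at least two samples")
    cmd = ["cuffdiff"]
    for name, flag in NUMERIC_FLAGS.items():
        value = params.get(name, -1)
        if value != -1:
            cmd += [flag, str(value)]
    cmd.append(transcripts_gtf)
    # Replicates of one sample are joined into a single comma-separated argument.
    cmd += [",".join(replicates) for replicates in samples]
    return cmd


cmd = build_cuffdiff_command(
    "cuffcmp.combined.gtf",
    [["a_rep1.sam", "a_rep2.sam"], ["b_rep1.sam"]],
    {"FDR": 0.01, "num_threads": -1, "min_alignment_count": -1},
)
print(" ".join(cmd))
# cuffdiff --FDR 0.01 cuffcmp.combined.gtf a_rep1.sam,a_rep2.sam b_rep1.sam
```

The SAM file names are placeholders. Only `FDR` reaches the command line here because the other two parameters are left at -1.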

Version 1.0
Bundle sequencing
Categories
Specialties generic
Authors Alejandra Cervera (alejandra.cervera@helsinki.fi)
Issue tracker View/Report issues
Requires Cufflinks ; installer (bash)
Source files component.xml Cuffdiff.java
Usage Example with default values
Deprecated

This component is no longer in use; the last update to the software was 4 years ago.

Type parameters (generics)

Inputs

Name Type Mandatory Description
transcripts GTF Mandatory A transcript annotation file produced by Cufflinks, Cuffcompare, or another source.
array Array<T1> (generic) Mandatory Either the sample file or the directory with each sample file and its replicates. Two sample files are mandatory; more samples and/or replicates are optional.
mask GFF Optional Tells Cuffdiff to ignore all reads that could have come from transcripts in this GTF file. We recommend including in this file any annotated rRNA, mitochondrial transcripts, and other abundant transcripts you wish to ignore in your analysis. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates. It sets the -M/--mask-file option in Cuffdiff.
genome FASTA Optional Providing Cufflinks with a multifasta file via this option instructs it to run the bias detection and correction algorithm, which can significantly improve the accuracy of transcript abundance estimates. It sets the -b/--frag-bias-correct parameter in Cufflinks. See How Cufflinks Works for more details.

Outputs

Name Type Description
isoforms FPKM_tracking Transcript FPKMs
genes FPKM_tracking Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene_id
cds FPKM_tracking Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id
tss_groups FPKM_tracking Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id
isoform_exp Diff Transcript differential FPKM
gene_exp Diff Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id
cds_exp Diff Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id
tss_group_exp Diff Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id
splicing Diff This tab-delimited file lists, for each primary transcript, the amount of overloading detected among its isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript. Only primary transcripts from which two or more isoforms are spliced are listed in this file.
cds_output Diff This tab-delimited file lists, for each gene, the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples. Only genes producing two or more distinct CDS (i.e. multi-protein genes) are listed here.
promoters Diff This tab-delimited file lists, for each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples. Only genes producing two or more distinct primary transcripts (i.e. multi-promoter genes) are listed here.
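Since the Diff outputs are tab-delimited, they can be post-processed with standard tools. The sketch below filters rows flagged as significant. The function `significant_rows` is hypothetical, and the column names (`test_id`, `sample_1`, `sample_2`, `q_value`, `significant`) are a subset assumed from typical Cuffdiff output; check them against the header of your own files before relying on them.

```python
import csv
import io


def significant_rows(handle, flag_column="significant", yes_value="yes"):
    """Yield the rows of a tab-delimited file whose flag column equals yes_value."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row.get(flag_column) == yes_value:
            yield row


# Tiny in-memory stand-in for a Diff file; the values are invented.
example = io.StringIO(
    "test_id\tsample_1\tsample_2\tq_value\tsignificant\n"
    "G1\tconditionA\tconditionB\t0.001\tyes\n"
    "G2\tconditionA\tconditionB\t0.7\tno\n"
)
hits = [row["test_id"] for row in significant_rows(example)]  # ["G1"]
```

For a file on disk, open it in text mode and pass the handle in place of the `io.StringIO` object.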

Parameters

Name Type Default Description
FDR float -1 The allowed false discovery rate. The default is 0.05.
compatible_hits_norm boolean false With this option, Cufflinks counts only those fragments compatible with some reference transcript towards the number of mapped hits used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is inactive by default, and can only be used in combination with --GTF. Use with either RABT or ab initio assembly is not supported.
emit_count_tables boolean false Cuffdiff will output a file for each condition (called sample_counts.txt) containing the fragment counts, fragment count variances, and fitted variance model.
frag_len_mean int -1 This is the expected (mean) fragment length. The default is 200bp. Note: Cufflinks now learns the fragment length mean for each SAM file, so using this option is no longer recommended with paired-end reads.
frag_len_std_dev int -1 The standard deviation for the distribution on fragment lengths. The default is 80bp. Note: Cufflinks now learns the fragment length standard deviation for each SAM file, so using this option is no longer recommended with paired-end reads.
help boolean false Prints the help message and exits.
labels string "" Specify a label for each sample, which will be included in various output files produced by Cuffdiff.
library_type string "" In cases where Cufflinks cannot determine the platform and protocol used to generate input reads, you can supply this information manually, which will allow Cufflinks to infer source strand information with certain protocols. The available options are listed in the Cufflinks manual. For paired-end data, we currently only support protocols where reads point towards each other.
max_bundle_frags int -1 Sets the maximum number of fragments a locus may have before being skipped. Skipped loci are marked with status HIDATA. Default: 1000000
max_mle_iterations int -1 Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000
min_alignment_count int -1 The minimum number of alignments in a locus needed to conduct significance testing on changes in that locus observed between samples. If no testing is performed, changes in the locus are deemed not significant, and the locus' observed changes don't contribute to correction for multiple testing. The default is 10 fragment alignments.
min_isoform_fraction int -1 After calculating isoform abundance for a gene, Cufflinks filters out transcripts that it believes are very low abundance, because isoforms expressed at extremely low levels often cannot reliably be assembled, and may even be artifacts of incompletely spliced precursors of processed transcripts. This parameter is also used to filter out introns that have far fewer spliced alignments supporting them. The default is 0.1, or 10% of the most abundant isoform (the major isoform) of the gene.
multi_read_correct boolean false Tells Cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome. See How Cufflinks Works for more details.
no_update_check boolean false Turns off the automatic routine that contacts the Cufflinks server to check for a more recent version.
num_importance_samples int -1 Sets the number of importance samples generated for each locus during abundance estimation. Default: 1000
num_threads int -1 Use this many threads to align reads. The default is 1.
poisson_dispersion boolean false Use the Poisson fragment dispersion model instead of learning one in each condition.
quiet boolean false Suppress messages other than serious warnings and errors.
time_series boolean false Instructs Cuffdiff to analyze the provided samples as a time series, rather than testing for differences between all pairs of samples. Samples should be provided in increasing time order at the command line (e.g. first time point SAM, second time point SAM, etc.).
total_hits_norm boolean true With this option, Cufflinks counts all fragments, including those not compatible with any reference transcript, towards the number of mapped hits used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is active by default.
upper_quartile_norm boolean false With this option, Cufflinks normalizes by the upper quartile of the number of fragments mapping to individual loci instead of the total number of sequenced fragments. This can improve robustness of differential expression calls for less abundant genes and transcripts.
verbose boolean false Print lots of status updates and other diagnostic information.
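The difference between the default mode and time_series can be pictured as a difference in which sample pairs are tested. The sketch below assumes that time-series mode compares each time point only with the next one; this reading of the option, and the function `comparison_pairs`, are assumptions and should be checked against the Cufflinks manual.

```python
from itertools import combinations


def comparison_pairs(labels, time_series=False):
    """List the sample pairs to be tested.

    Assumption: in time-series mode each sample is compared only with the next one.
    """
    if time_series:
        return list(zip(labels, labels[1:]))
    return list(combinations(labels, 2))


labels = ["t0", "t1", "t2", "t3"]
all_pairs = comparison_pairs(labels)           # 6 pairs
series_pairs = comparison_pairs(labels, True)  # 3 consecutive pairs
```

With four samples, testing all pairs gives six comparisons, while the assumed time-series pairing gives three, which is why the order of the samples on the command line matters in that mode.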

Test cases

Test case Parameters IN transcripts IN array IN mask IN genome OUT isoforms OUT genes OUT cds OUT tss_groups OUT isoform_exp OUT gene_exp OUT cds_exp OUT tss_group_exp OUT splicing OUT cds_output OUT promoters
case1 properties transcripts array (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing)

upper_quartile_norm=true,
verbose=true,
num_threads=3,
multi_read_correct=true,
compatible_hits_norm=true,
max_bundle_frags=1000000,
labels=conditionA,conditionB

case2 (missing) transcripts array (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing) (missing)

Generated 2019-02-08 07:42:12 by Anduril 2.0.0