Cufflinks includes a program, "Cuffdiff", that you can use to find significant changes in transcript expression, splicing, and promoter use.
Cuffdiff takes a GTF2/GFF3 file of transcripts as input, along with two or more SAM files containing the fragment alignments for two or more samples. It produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes. It also tracks changes in the relative abundance of transcripts sharing a common transcription start site, and in the relative abundances of the primary transcripts of each gene. Tracking the former reveals changes in splicing, and tracking the latter reveals changes in relative promoter use within a gene.
If you have more than one replicate for a sample, supply the SAM files for the sample as a single comma-separated list. It is not necessary to have the same number of replicates for each sample.
Cuffdiff requires that transcripts in the input GTF be annotated with certain attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use. These attributes are:
Attribute | Description |
---|---|
tss_id | The ID of this transcript's inferred start site. Determines which primary transcript this processed transcript is believed to come from. Cuffcompare appends this attribute to every transcript reported in the .combined.gtf file. |
p_id | The ID of the coding sequence this transcript contains. This attribute is attached by Cuffcompare to the .combined.gtf records only when it is run with a reference annotation that includes CDS records. Further, differential CDS analysis is only performed when all isoforms of a gene have p_id attributes, because neither Cufflinks nor Cuffcompare attempts to assign an open reading frame to transcripts. |
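The input convention described above (one transcript file, followed by one comma-separated list of SAM files per sample) can be sketched as a small helper that assembles the command line. This is only an illustration: the function name `build_cuffdiff_command` and the file names are hypothetical, and the helper only builds and prints the command instead of running Cuffdiff. The `-o` and `-L` flags are Cuffdiff's output-directory and labels options.

```python
import shlex


def build_cuffdiff_command(transcripts_gtf, samples, labels=None, output_dir=None):
    """Assemble a cuffdiff argument list (hypothetical helper, not part of the component).

    samples is a list with one entry per sample; each entry is the list of
    replicate SAM files for that sample. Replicate counts may differ.
    """
    if len(samples) < 2:
        raise ValueError("Cuffdiff needs at least two samples")
    cmd = ["cuffdiff"]
    if output_dir:
        cmd += ["-o", output_dir]
    if labels:
        if len(labels) != len(samples):
            raise ValueError("need exactly one label per sample")
        cmd += ["-L", ",".join(labels)]
    cmd.append(transcripts_gtf)
    # Replicates of one sample are joined into a single comma-separated argument.
    cmd += [",".join(replicates) for replicates in samples]
    return cmd


# Example file names are placeholders.
cmd = build_cuffdiff_command(
    "cuffcmp.combined.gtf",
    [["c1_r1.sam", "c1_r2.sam"], ["c2_r1.sam"]],
    labels=["control", "treated"],
)
print(shlex.join(cmd))
# prints: cuffdiff -L control,treated cuffcmp.combined.gtf c1_r1.sam,c1_r2.sam c2_r1.sam
```

The first sample has two replicates and the second has one, which is allowed because each sample is passed as its own argument.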
Note: If an arbitrary GTF/GFF3 file is used as input (instead of the .combined.gtf file produced by Cuffcompare), these attributes will not be present, but Cuffcompare can still be used to obtain these attributes with a command like this:
cuffcompare -s /path/to/genome_seqs.fa -r annotation.gtf annotation.gtf
The resulting cuffcmp.combined.gtf file created by this command will have the tss_id and p_id attributes added to each record and this file can be used as input for cuffdiff.
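Before passing a GTF file to Cuffdiff, it can be useful to check whether the tss_id and p_id attributes are actually present. The following is a minimal sketch under the assumption that the file follows the usual GTF layout (nine tab-separated columns, with `key "value";` pairs in the last one); the function names are hypothetical and the example records are made up.

```python
import re

# Matches key "value" pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_attributes(attr_field):
    """Return the attribute column of one GTF record as a dict."""
    return dict(ATTR_RE.findall(attr_field))


def missing_attribute_counts(gtf_lines, required=("tss_id", "p_id")):
    """Count transcripts lacking each required attribute.

    Only the first record seen for each transcript_id is inspected.
    """
    counts = {name: 0 for name in required}
    seen = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = parse_gtf_attributes(fields[8])
        transcript_id = attrs.get("transcript_id")
        if transcript_id is None or transcript_id in seen:
            continue
        seen.add(transcript_id)
        for name in required:
            if name not in attrs:
                counts[name] += 1
    return counts


# Made-up records: TCONS_2 has no p_id.
example_records = [
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "XLOC_1"; transcript_id "TCONS_1"; tss_id "TSS1"; p_id "P1";',
    'chr1\tCufflinks\texon\t300\t400\t.\t+\t.\tgene_id "XLOC_1"; transcript_id "TCONS_1"; tss_id "TSS1"; p_id "P1";',
    'chr1\tCufflinks\texon\t100\t250\t.\t+\t.\tgene_id "XLOC_1"; transcript_id "TCONS_2"; tss_id "TSS1";',
]
print(missing_attribute_counts(example_records))
```

For a real file, pass an open file handle instead of the list. A non-zero p_id count for any isoform of a gene would mean that gene is left out of the differential CDS analysis, as described above.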
Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group. There are four FPKM tracking files: isoforms.fpkm_tracking, genes.fpkm_tracking, cds.fpkm_tracking, and tss_groups.fpkm_tracking.
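The summation described above can be illustrated with a few lines of code. This sketch is not Cuffdiff's implementation; it only shows how transcript FPKMs roll up into gene, primary transcript (tss_id), and coding sequence (p_id) groups. The function name `sum_fpkm_by` and all values are invented for the example.

```python
from collections import defaultdict


def sum_fpkm_by(transcripts, key):
    """Sum transcript FPKMs over the group named by key (e.g. gene_id or tss_id)."""
    totals = defaultdict(float)
    for transcript in transcripts:
        group = transcript.get(key)
        if group is None:
            # A transcript without this attribute (e.g. no p_id) joins no group.
            continue
        totals[group] += transcript["fpkm"]
    return dict(totals)


# Invented values for one gene with two start sites and three isoforms.
transcripts = [
    {"id": "TCONS_1", "gene_id": "XLOC_1", "tss_id": "TSS1", "p_id": "P1", "fpkm": 10.0},
    {"id": "TCONS_2", "gene_id": "XLOC_1", "tss_id": "TSS1", "p_id": "P2", "fpkm": 5.0},
    {"id": "TCONS_3", "gene_id": "XLOC_1", "tss_id": "TSS2", "p_id": "P1", "fpkm": 2.5},
]

gene_fpkm = sum_fpkm_by(transcripts, "gene_id")  # {'XLOC_1': 17.5}
tss_fpkm = sum_fpkm_by(transcripts, "tss_id")    # {'TSS1': 15.0, 'TSS2': 2.5}
cds_fpkm = sum_fpkm_by(transcripts, "p_id")      # {'P1': 12.5, 'P2': 5.0}
```

Note that the p_id grouping cuts across tss_id: TCONS_1 and TCONS_3 share a coding sequence although they start at different sites.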
Note: In this component, every numeric option defaults to -1. A value of -1 means that Cuffdiff's own default is used; that default differs from option to option and is given in the description of each parameter.
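One way to picture the -1 convention is a translation step that forwards only explicitly set options to Cuffdiff. The sketch below is an assumption about how such a mapping could look, not the code in Cuffdiff.java; the helper name and the selection of options are illustrative, while the long flag names are those of the Cuffdiff command line.

```python
# Component parameter name -> Cuffdiff long option (small illustrative subset).
NUMERIC_FLAGS = {
    "FDR": "--FDR",
    "num_threads": "--num-threads",
    "min_alignment_count": "--min-alignment-count",
    "max_mle_iterations": "--max-mle-iterations",
}
BOOLEAN_FLAGS = {
    "upper_quartile_norm": "--upper-quartile-norm",
    "time_series": "--time-series",
}


def build_option_args(params):
    """Translate component parameters into Cuffdiff arguments (hypothetical helper)."""
    args = []
    for name, flag in NUMERIC_FLAGS.items():
        value = params.get(name, -1)
        if value != -1:
            # Only explicitly set values are forwarded; -1 leaves Cuffdiff's default in place.
            args += [flag, str(value)]
    for name, flag in BOOLEAN_FLAGS.items():
        if params.get(name, False):
            args.append(flag)
    return args


option_args = build_option_args({"FDR": 0.01, "num_threads": -1, "upper_quartile_norm": True})
print(option_args)
# prints: ['--FDR', '0.01', '--upper-quartile-norm']
```

Here num_threads stays at -1, so no thread option is emitted and Cuffdiff would fall back to its default of 1.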
Version | 1.0 |
---|---|
Bundle | sequencing |
Categories | |
Specialties | generic |
Authors | Alejandra Cervera (alejandra.cervera@helsinki.fi) |
Issue tracker | View/Report issues |
Requires | Cufflinks ; installer (bash) |
Source files | component.xml Cuffdiff.java |
Usage | Example with default values |
Deprecated | This component is no longer in use; the software was last updated 4 years ago. |
Name | Type | Mandatory | Description |
---|---|---|---|
transcripts | GTF | Mandatory | A transcript annotation file produced by cufflinks, cuffcompare, or other source. |
array | Array<T1> (generic) | Mandatory | Either the sample file or the directory with each sample file and its replicates. Two sample files are mandatory, more samples and/or replicates are optional. |
mask | GFF | Optional | Tells Cuffdiff to ignore all reads that could have come from transcripts in this GTF file. We recommend including in this file any annotated rRNA, mitochondrial transcripts, and other abundant transcripts you wish to ignore in your analysis. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates. It sets the -M/--mask-file option in Cuffdiff. |
genome | FASTA | Optional | Providing Cufflinks with a multifasta file via this option instructs it to run our new bias detection and correction algorithm which can significantly improve accuracy of transcript abundance estimates. It sets -b/--frag-bias-correct parameter in Cufflinks. See How Cufflinks Works for more details. |
Name | Type | Description |
---|---|---|
isoforms | FPKM_tracking | Transcript FPKMs |
genes | FPKM_tracking | Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene_id |
cds | FPKM_tracking | Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id |
tss_groups | FPKM_tracking | Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id |
isoform_exp | Diff | Transcript differential FPKM |
gene_exp | Diff | Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id |
cds_exp | Diff | Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id |
tss_group_exp | Diff | Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id |
splicing | Diff | This tab delimited file lists, for each primary transcript, the amount of overloading detected among its isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript. Only primary transcripts from which two or more isoforms are spliced are listed in this file. |
cds_output | Diff | This tab delimited file lists, for each gene, the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples. Only genes producing two or more distinct CDS (i.e. multi-protein genes) are listed here. |
promoters | Diff | This tab delimited file lists, for each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples. Only genes producing two or more distinct primary transcripts (i.e. multi-promoter genes) are listed here. |
Name | Type | Default | Description |
---|---|---|---|
FDR | float | -1 | The allowed false discovery rate. The default is 0.05. |
compatible_hits_norm | boolean | false | With this option, Cufflinks counts only those fragments compatible with some reference transcript towards the number of mapped hits used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is inactive by default, and can only be used in combination with --GTF. Use with either RABT or ab initio assembly is not supported. |
emit_count_tables | boolean | false | Cuffdiff will output a file for each condition (called sample_counts.txt) containing the fragment counts, fragment count variances, and fitted variance model. |
frag_len_mean | int | -1 | This is the expected (mean) fragment length. The default is 200bp. Note: Cufflinks now learns the fragment length mean for each SAM file, so using this option is no longer recommended with paired-end reads. |
frag_len_std_dev | int | -1 | The standard deviation for the distribution on fragment lengths. The default is 80bp. Note: Cufflinks now learns the fragment length standard deviation for each SAM file, so using this option is no longer recommended with paired-end reads. |
help | boolean | false | Prints the help message and exits. |
labels | string | "" | Specify a label for each sample, which will be included in various output files produced by Cuffdiff. |
library_type | string | "" | In cases where Cufflinks cannot determine the platform and protocol used to generate input reads, you can supply this information manually, which will allow Cufflinks to infer source strand information with certain protocols. The available options are listed in the Cufflinks manual. For paired-end data, we currently only support protocols where reads point towards each other. |
max_bundle_frags | int | -1 | Sets the maximum number of fragments a locus may have before being skipped. Skipped loci are marked with status HIDATA. Default: 1000000 |
max_mle_iterations | int | -1 | Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000 |
min_alignment_count | int | -1 | The minimum number of alignments in a locus needed to conduct significance testing on changes in that locus observed between samples. If no testing is performed, changes in the locus are deemed not significant, and the locus' observed changes don't contribute to correction for multiple testing. The default is 10 fragment alignments. |
min_isoform_fraction | int | -1 | After calculating isoform abundance for a gene, Cufflinks filters out transcripts that it believes are very low abundance, because isoforms expressed at extremely low levels often cannot reliably be assembled, and may even be artifacts of incompletely spliced precursors of processed transcripts. This parameter is also used to filter out introns that have far fewer spliced alignments supporting them. The default is 0.1, or 10% of the most abundant isoform (the major isoform) of the gene. |
multi_read_correct | boolean | false | Tells Cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome. See How Cufflinks Works for more details. |
no_update_check | boolean | false | Turns off the automatic routine that contacts the Cufflinks server to check for a more recent version. |
num_importance_samples | int | -1 | Sets the number of importance samples generated for each locus during abundance estimation. Default: 1000 |
num_threads | int | -1 | Use this many threads to align reads. The default is 1. |
poisson_dispersion | boolean | false | Use the Poisson fragment dispersion model instead of learning one in each condition. |
quiet | boolean | false | Suppress messages other than serious warnings and errors. |
time_series | boolean | false | Instructs Cuffdiff to analyze the provided samples as a time series, rather than testing for differences between all pairs of samples. Samples should be provided in increasing time order at the command line (e.g., first time point SAM, second time point SAM, etc.). |
total_hits_norm | boolean | true | With this option, Cufflinks counts all fragments, including those not compatible with any reference transcript, towards the number of mapped hits used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is active by default. |
upper_quartile_norm | boolean | false | With this option, Cufflinks normalizes by the upper quartile of the number of fragments mapping to individual loci instead of the total number of sequenced fragments. This can improve robustness of differential expression calls for less abundant genes and transcripts. |
verbose | boolean | false | Print lots of status updates and other diagnostic information. |
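The effect of time_series on the set of tests can be sketched as follows. The default behaviour (all pairs of samples) is stated in the parameter description; that time-series mode compares each time point only with the next one is an assumption made for this illustration, and the function name and labels are hypothetical.

```python
from itertools import combinations


def planned_comparisons(labels, time_series=False):
    """List the sample pairs that would be tested (illustrative only)."""
    if time_series:
        # Assumption: each time point is compared with its successor only.
        return list(zip(labels, labels[1:]))
    # Default: every pair of samples is tested.
    return list(combinations(labels, 2))


labels = ["t0", "t1", "t2", "t3"]
all_pairs = planned_comparisons(labels)
consecutive = planned_comparisons(labels, time_series=True)
print(len(all_pairs), len(consecutive))
# prints: 6 3
```

Under this assumption, the order in which samples are supplied matters in time-series mode, which is why they should be given in increasing time order.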
Test case | Parameters | IN transcripts | IN array | IN mask | IN genome | OUT isoforms | OUT genes | OUT cds | OUT tss_groups | OUT isoform_exp | OUT gene_exp | OUT cds_exp | OUT tss_group_exp | OUT splicing | OUT cds_output | OUT promoters |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
case1 | properties: upper_quartile_norm=true | transcripts | array | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |
case2 | (missing) | transcripts | array | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) | (missing) |