Cufflinks

Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.

To use this component, Cufflinks must be installed, which in turn requires the Boost C++ libraries and SAM tools. See the Cufflinks manual for more information about installing the software.

Cufflinks takes a text file of SAM alignments, or a binary SAM (BAM) file as input. The SAM file supplied to Cufflinks must be sorted by reference position. If you aligned your reads with TopHat, your alignments will be properly sorted already. If you used another tool, you may want to make sure they are properly sorted as follows:

sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted
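
Before re-sorting a large file, it can be useful to check whether the records are already in order. The following is a minimal sketch, not part of the component: it checks that alignment records are ordered by reference name (column 3) and then numerically by position (column 4), the same keys the `sort` command above uses, and it skips `@` header lines. The function name and the sample records are invented for illustration, and reference names are compared as plain strings, which is an assumption about the collation `sort` applies on your system.

```python
def is_sorted_by_reference_position(sam_lines):
    """Return True if alignment records are ordered by (RNAME, POS)."""
    previous_key = None
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        key = (fields[2], int(fields[3]))  # column 3 = RNAME, column 4 = POS
        if previous_key is not None and key < previous_key:
            return False
        previous_key = key
    return True


# Illustrative records only (truncated to the first four SAM columns).
sorted_records = [
    "@HD\tVN:1.0\tSO:coordinate",
    "read1\t0\tchr1\t100",
    "read2\t0\tchr1\t250",
    "read3\t16\tchr2\t50",
]
unsorted_records = [
    "read2\t0\tchr1\t250",
    "read1\t0\tchr1\t100",
]

print(is_sorted_by_reference_position(sorted_records))    # True
print(is_sorted_by_reference_position(unsorted_records))  # False
```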

Note: In this component, every numeric option defaults to -1. When an option is left at -1, Cufflinks' own default is used instead; that default differs from option to option and is given in the description of each parameter.
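
The -1 convention from the note can be pictured as follows. This is a hedged sketch rather than the component's actual code (which lives in Cufflinks.java and is not shown here): numeric parameters left at -1 are simply omitted from the command line so that Cufflinks falls back to its own defaults. The mapping table covers only three parameters and is an illustrative assumption.

```python
# Hypothetical subset of parameters mapped to Cufflinks long options.
# The component's real mapping is in Cufflinks.java and is not shown here.
FLAG_BY_PARAMETER = {
    "frag_len_mean": "--frag-len-mean",
    "max_intron_length": "--max-intron-length",
    "num_threads": "--num-threads",
}

UNSET = -1  # component-level sentinel meaning "use the Cufflinks default"


def numeric_arguments(parameters):
    """Build command-line arguments, skipping options left at the sentinel."""
    arguments = []
    for name, flag in FLAG_BY_PARAMETER.items():
        value = parameters.get(name, UNSET)
        if value != UNSET:
            arguments += [flag, str(value)]
    return arguments


settings = {"frag_len_mean": -1, "max_intron_length": 100000, "num_threads": 4}
print(numeric_arguments(settings))
# ['--max-intron-length', '100000', '--num-threads', '4']
```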

Version 1.0
Bundle sequencing
Categories
Authors Alejandra Cervera (alejandra.cervera@helsinki.fi)
Issue tracker View/Report issues
Requires Cufflinks ; installer (bash)
Source files component.xml Cufflinks.java
Usage Example with default values
Deprecated

This component is no longer used: the last update to the software was 4 years ago, and it is now supported through the ExpressionQuantifier component.

Inputs

Name Type Mandatory Description
alignment AlignedReadSet Mandatory The aligned RNA-seq reads in SAM or BAM format.
reference_annotation1 GTF Optional Tells Cufflinks to use the supplied reference annotation (a GFF/GTF file) to estimate isoform expression. It will not assemble novel transcripts, and the program will ignore alignments not structurally compatible with any reference transcript. It sets the -G/--GTF option in Cufflinks.
reference_annotation2 GTF Optional Tells Cufflinks to use the supplied reference annotation (GFF/GTF) to guide RABT assembly. Reference transcripts will be tiled with faux-reads to provide additional information in assembly. Output will include all reference transcripts as well as any novel genes and isoforms that are assembled. It sets the -g/--GTF-guide option in Cufflinks.
mask GFF Optional Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF file. We recommend including in this file any annotated rRNA, mitochondrial transcripts, and other abundant transcripts you wish to ignore in your analysis. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates. It sets the -M/--mask-file option in Cufflinks.
genome FASTA Optional Providing Cufflinks with a multifasta file via this option instructs it to run its bias detection and correction algorithm, which can significantly improve the accuracy of transcript abundance estimates. It sets the -b/--frag-bias-correct option in Cufflinks. See How Cufflinks Works for more details.
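
To make the relationship between the optional inputs and the Cufflinks flags concrete, here is a rough sketch that assembles a command line from whichever inputs are connected. The flags are the ones named in the descriptions above; the function name and file names are hypothetical, and the component's real invocation logic may differ.

```python
# Flags as listed in the input descriptions above.
FLAG_BY_INPUT = {
    "reference_annotation1": "--GTF",
    "reference_annotation2": "--GTF-guide",
    "mask": "--mask-file",
    "genome": "--frag-bias-correct",
}


def build_command(alignment, optional_inputs):
    """Assemble a Cufflinks command line; the alignment file goes last."""
    command = ["cufflinks"]
    for name, flag in FLAG_BY_INPUT.items():
        path = optional_inputs.get(name)
        if path:  # inputs that are missing or empty are left out
            command += [flag, path]
    command.append(alignment)
    return command


# Hypothetical file names, for illustration only.
command = build_command(
    "hits.sorted.bam",
    {"reference_annotation1": "genes.gtf", "genome": "genome.fa"},
)
print(" ".join(command))
# cufflinks --GTF genes.gtf --frag-bias-correct genome.fa hits.sorted.bam
```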

Outputs

Name Type Description
transcripts GTF This GTF file contains Cufflinks' assembled isoforms. The first 8 columns are standard GTF, and the last column contains attributes, some of which are also standardized ("gene_id" and "transcript_id"). There is one GTF record per row, and each record represents either a transcript or an exon within a transcript.
isoforms FPKM_tracking This file contains the estimated isoform-level expression values in the generic FPKM Tracking Format. Note, however, that because there is only one sample, the "q" format is not used.
genes FPKM_tracking This file contains the estimated gene-level expression values in the generic FPKM Tracking Format. Note, however, that because there is only one sample, the "q" format is not used.
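
As an illustration of how the transcripts output can be consumed, the sketch below splits one GTF record into its fixed columns and turns the final attribute column into a dictionary, so that "gene_id" and "transcript_id" can be looked up by name. The function name and the sample record are invented for this example; real Cufflinks records carry additional attributes.

```python
def parse_gtf_record(line):
    """Split one GTF line into its feature type, coordinates, and attributes."""
    columns = line.rstrip("\n").split("\t")
    attributes = {}
    # The ninth and last column holds semicolon-separated key "value" pairs.
    for entry in columns[8].split(";"):
        entry = entry.strip()
        if not entry:
            continue
        key, _, value = entry.partition(" ")
        attributes[key] = value.strip().strip('"')
    return {
        "seqname": columns[0],
        "feature": columns[2],
        "start": int(columns[3]),
        "end": int(columns[4]),
        "strand": columns[6],
        "attributes": attributes,
    }


# Made-up record for illustration only.
example = 'chr1\tCufflinks\texon\t1000\t1500\t.\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1";'
record = parse_gtf_record(example)
print(record["feature"], record["attributes"]["transcript_id"])  # exon CUFF.1.1
```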

Parameters

Name Type Default Description
compatible_hits_norm boolean false With this option, Cufflinks counts only those fragments compatible with some reference transcript towards the number of mapped hits used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is inactive by default, and can only be used in combination with --GTF. Use with either RABT or ab initio assembly is not supported.
frag_len_mean int -1 This is the expected (mean) fragment length. The default is 200bp. Note: Cufflinks now learns the fragment length mean for each SAM file, so using this option is no longer recommended with paired-end reads.
frag_len_std_dev int -1 The standard deviation for the distribution on fragment lengths. The default is 80bp. Note: Cufflinks now learns the fragment length standard deviation for each SAM file, so using this option is no longer recommended with paired-end reads.
help boolean false Prints the help message and exits.
intron_overhang_tolerance int -1 The number of bp allowed to enter the intron of a reference transcript when determining if an assembled transcript should be merged with it (i.e., the assembled transcript is not novel). The default is 50 bp.
junc_alpha int -1 The alpha value for the binomial test used during false positive spliced alignment filtration. Default: 0.001
label string "" Cufflinks will report transfrags in GTF format, with a prefix given by this option. The default prefix is "CUFF".
library_type string "" In cases where Cufflinks cannot determine the platform and protocol used to generate input reads, you can supply this information manually, which will allow Cufflinks to infer source strand information with certain protocols. The available options are: ff-firststrand, ff-secondstrand, ff-unstranded, fr-firststrand, fr-secondstrand, fr-unstranded (default), and transfrags. For paired-end data, only protocols where the reads point towards each other are currently supported.
max_bundle_length int -1 Maximum genomic length allowed for a given bundle. The default is 3,500,000 bp.
max_intron_length int -1 The maximum intron length. Cufflinks will not report transcripts with introns longer than this, and will ignore SAM alignments with REF_SKIP CIGAR operations longer than this. The default is 300,000.
max_mle_iterations int -1 Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000
min_frags_per_transfrag int -1 Assembled transfrags supported by fewer than this many aligned RNA-Seq fragments are not reported. Default: 10.
min_intron_length int -1 Minimum intron size allowed in genome. The default is 50 bp.
min_isoform_fraction int -1 After calculating isoform abundance for a gene, Cufflinks filters out transcripts that it believes are very low abundance, because isoforms expressed at extremely low levels often cannot reliably be assembled, and may even be artifacts of incompletely spliced precursors of processed transcripts. This parameter is also used to filter out introns that have far fewer spliced alignments supporting them. The default is 0.1, or 10% of the most abundant isoform (the major isoform) of the gene.
multi_read_correct boolean false Tells Cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome. See How Cufflinks Works for more details.
no_faux_reads boolean false This option disables tiling of the reference transcripts with faux reads. Use this if you only want to use sequencing reads in assembly but do not want to output assembled transcripts that lie within reference transcripts. All reference transcripts in the input annotation will also be included in the output.
no_update_check boolean false Turns off the automatic routine that contacts the Cufflinks server to check for a more recent version.
num_importance_samples int -1 Sets the number of importance samples generated for each locus during abundance estimation. Default: 1000
num_threads int -1 Use this many threads when processing the aligned reads. The default is 1.
overhang_tolerance int -1 The number of bp allowed to enter the intron of a transcript when determining if a read or another transcript is mappable to/compatible with it. The default is 8 bp based on the default bowtie/TopHat parameters.
overhang_tolerance_3 int -1 The number of bp allowed to overhang the 3' end of a reference transcript when determining if an assembled transcript should be merged with it (i.e., the assembled transcript is not novel). The default is 600 bp.
pre_mrna_fraction int -1 Some RNA-Seq protocols produce a significant number of reads that originate from incompletely spliced transcripts, and these reads can confound the assembly of fully spliced mRNAs. Cufflinks uses this parameter to filter out alignments that lie within the intronic intervals implied by the spliced alignments. The minimum depth of coverage in the intronic region covered by the alignment is divided by the number of spliced reads, and if the result is lower than this parameter value, the intronic alignments are ignored. The default is 15%.
quiet boolean true Suppress messages other than serious warnings and errors.
small_anchor_fraction int -1 Spliced reads with less than this percent of their length on each side of the junction are considered suspicious and are candidates for filtering prior to assembly. Default: 0.09.
total_hits_norm boolean true With this option, Cufflinks counts all fragments, including those not compatible with any reference transcript, towards the number of mapped hits used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is active by default.
trim_3_avgcov_thresh int -1 Minimum average coverage required to attempt 3' trimming. The default is 10.
trim_3_dropoff_frac int -1 The fraction of average coverage below which to trim the 3' end of an assembled transcript. The default is 0.1.
upper_quartile_norm boolean false With this option, Cufflinks normalizes by the upper quartile of the number of fragments mapping to individual loci instead of the total number of sequenced fragments. This can improve robustness of differential expression calls for less abundant genes and transcripts.
verbose boolean false Print lots of status updates and other diagnostic information.
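
The min_isoform_fraction filter described in the table amounts to a threshold relative to the major isoform of each gene. The following is a simplified sketch of that idea, not Cufflinks' actual implementation: the function name and abundance values are invented, and treating an isoform exactly at the threshold as kept is an assumption.

```python
def filter_minor_isoforms(abundances, min_isoform_fraction=0.1):
    """Keep isoforms at or above the given fraction of the major isoform."""
    if not abundances:
        return {}
    major = max(abundances.values())  # abundance of the major isoform
    threshold = min_isoform_fraction * major
    return {name: value for name, value in abundances.items() if value >= threshold}


# Invented abundances for three isoforms of one gene.
gene_isoforms = {"isoform_a": 120.0, "isoform_b": 30.0, "isoform_c": 6.0}
print(filter_minor_isoforms(gene_isoforms))
# {'isoform_a': 120.0, 'isoform_b': 30.0}
```

With the default of 0.1, the threshold here is 12.0, so isoform_c (6.0) is dropped while the other two are kept.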

Test cases

Test case Parameters alignment (IN) reference_annotation1 (IN) reference_annotation2 (IN) mask (IN) genome (IN) transcripts (OUT) isoforms (OUT) genes (OUT)
case1 properties alignment (missing) (missing) (missing) (missing) transcripts isoforms genes

# Testing cufflinks component

case2 properties alignment (missing) (missing) (missing) (missing) (missing) (missing) (missing)

# Testing cufflinks component

case3 properties alignment (missing) (missing) (missing) (missing) (missing) (missing) (missing)

# Testing cufflinks component


Generated 2019-02-08 07:42:12 by Anduril 2.0.0