
STAR

Spliced Transcripts Alignment to a Reference for RNA-seq

The STAR component was implemented because TopHat is around 50 times slower, with run times counted in days. STAR's output is also compatible with Cufflinks, and its output quality is comparable. So unless you need the extreme speed of Sawfish, STAR is a good option, particularly if you want to benefit from paired-end reads.

Consult the STAR webpage. To find it with a search engine, search for "STAR rna-seq"; searching for "STAR" alone will not turn up anything useful.

The output will be sorted only if SortedByCoord is chosen as mainAlignmentType.

The STARGenome component may be used to generate the genome input for this component. The word "genome" is very confusing, but it is the term STAR uses. It actually means something like a transcriptome index.

Two-pass mode for novel junction alignment

STAR can be run in so-called 2-pass mode. The splice junctions found in the first pass are used to build a new genome index, which requires the original FASTA files. The idea is that in the second pass "STAR will not discover any new junctions but will align spliced reads with short overhangs across the previously detected junctions." See this thread.

In practice, this takes four steps.

  1. Primary alignment passes, yielding splice junctions
  2. Processing the splice junctions CSV output to include only the interesting ones, such as those that are unannotated and have many unique reads
  3. Genome generation from splice junctions
  4. Second alignment passes
This design allows you both to reuse the spliced genome for different raw reads and to combine novel splice junctions from multiple samples into a single STAR genome.
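Step 2 above can be sketched in Python as follows. This is only an illustration, not the component's own code: it assumes the tab-separated spliceJunctions output with the header added by this component, and the function name, the threshold of 5 unique reads and the sample rows are all invented for the example.

```python
import csv
import io

def filter_novel_junctions(tsv_text, min_unique_reads=5):
    """Keep unannotated junctions supported by enough uniquely mapping reads.

    min_unique_reads is a hypothetical threshold; choose one that suits your data.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        is_unannotated = row["Annotated"] == "0"
        unique_reads = int(row["UniqueMapping"])
        if is_unannotated and unique_reads >= min_unique_reads:
            kept.append(row)
    return kept

# Invented sample data using the column names of the spliceJunctions output.
HEADER = ("Chromosome\tStart\tEnd\tStrand\tIntronMotif\tAnnotated\t"
          "UniqueMapping\tMultiMapping\tMaxOverhang\n")
SAMPLE_JUNCTIONS = HEADER + (
    "chr1\t14830\t14969\t2\t2\t1\t12\t3\t38\n"   # annotated: dropped
    "chr1\t15039\t15795\t2\t2\t0\t9\t0\t41\n"    # unannotated, 9 unique reads: kept
    "chr1\t16766\t16857\t2\t2\t0\t1\t4\t20\n"    # unannotated, 1 unique read: dropped
)

for junction in filter_novel_junctions(SAMPLE_JUNCTIONS):
    print(junction["Chromosome"], junction["Start"], junction["End"])
```

Junctions from several samples could be filtered this way and concatenated before the genome generation step.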

The genome generation steps take an hour at 24 threads and consume 30 GB of disk space.

Inputs and parameters

For an alignment pass you need reads in FASTQ or FASTA files and a STAR genome.

The STARGenome component may be used to generate the genome input for this component, and also for the two-pass mode.

The two-pass mode's genome generation step is enabled by providing the genomeFasta and spliceJunctions inputs. The rest of the inputs are ignored.

The most up-to-date input parameters and their defaults can be found in the parametersDefault file in the STAR source directory (e.g. /opt/share/STAR). You may want to copy this file to use as a template for the parameters input. Any additional flags you pass in the "options" parameter will override settings in the file. Finally, the flags "outFileNamePrefix" and "outStd" are overridden by this component.

Options may also be used to stream compressed or encrypted input data, for example by using utilities such as zcat, acat or ccat.
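As a rough illustration of choosing a decompression command, the sketch below maps a reads file name to a "--readFilesCommand" fragment for the options parameter. The commands are the ones mentioned in the description of the reads input; the function name and the extension rules are assumptions made for this example only.

```python
def read_files_option(filename):
    """Suggest a --readFilesCommand fragment based on the file name (hypothetical helper)."""
    name = filename.lower()
    # Check tar archives first, because ".tar.gz" also ends with ".gz".
    if name.endswith((".tar.gz", ".tgz")):
        command = "tar Ozxf"
    elif name.endswith(".tar.bz2"):
        command = "tar Ojxf"   # j instead of z for bz2 compression
    elif name.endswith(".gz"):
        command = "zcat"
    else:
        return ""              # no suggestion for other file names
    return "--readFilesCommand " + command

for example in ("sample_1.fq.gz", "sample_1.tar.gz", "sample_1.tar.bz2", "sample_1.fastq"):
    print(example, "->", repr(read_files_option(example)))
```

Before a long run, it is worth testing the chosen command on the command line, e.g. "zcat myfile.fq.gz | head".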

Genome preloading into shared memory for the alignment

The default is to use shared memory, which saves around 30 gigabytes per STAR process after the first one.

In new versions of STAR the genome is shared even if you use "genomeLoad=LoadAndRemove", and unloaded when the last STAR exits. In older versions it would still load many copies.

If STAR crashes, the shared memory segment may be left in memory *permanently*, tying up that memory until it is removed by hand or the host is rebooted. This is why you must make sure it is unloaded. Right now this can only be done manually, so if you can, choose a fixed list of hosts and verify after the run that nothing is left.

Helper scripts for genome indices

These are contained in the STAR component directory. There is an example script for loading, unloading and generating genomes.

list_shared_memory.sh does what it says: it displays the shared memory segments and also tells you how to unload them.

Error situations in genome loading

Knowing a few shell commands is crucial in error situations. When STAR crashes during loading, it leaves a stale copy of the genome in shared memory, permanently occupying a large amount of it. STAR has also been observed to hang indefinitely because of a corrupt genome in memory.

Two commands are enough to solve all issues.

ipcs -m lists the shared memory segments and shows how many processes are still attached to each one (the nattch column). ipcrm -m removes a segment from memory; give it the shmid identifier shown by ipcs -m.

After starting the alignment runs, you can check that nattch matches the number of alignment processes on the node, which confirms that memory sharing is in use. Something is wrong if you see more than one such segment from STAR.

If a genome was loaded incompletely (for example because STAR was killed), it must be unloaded manually, or it will stay in memory until reboot.
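The check described above can be sketched in Python. This is a sketch under stated assumptions: it assumes the column layout printed by ipcs -m on Linux (key, shmid, owner, perms, bytes, attach count printed as nattch, status), and the function names, the 1 GB size threshold and the sample listing are invented for the example.

```python
def parse_ipcs(ipcs_output):
    """Parse the text printed by `ipcs -m` into one dict per shared memory segment."""
    segments = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hexadecimal key; header and separator lines do not.
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        segments.append({
            "shmid": int(fields[1]),
            "owner": fields[2],
            "bytes": int(fields[4]),
            "nattch": int(fields[5]),
        })
    return segments

def orphaned_segments(segments, min_bytes=10**9):
    """Large segments with no attached process: candidates for `ipcrm -m <shmid>`.

    min_bytes is a hypothetical cut-off to skip small, unrelated segments.
    """
    return [s for s in segments if s["nattch"] == 0 and s["bytes"] >= min_bytes]

# Invented listing for illustration only.
SAMPLE_IPCS = """\
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x6a050014 98304      alice      666        29845510144 3
0x6a050015 131073     alice      666        29845510144 0
0x00000000 163842     bob        600        524288     2          dest
"""

for segment in orphaned_segments(parse_ipcs(SAMPLE_IPCS)):
    print("ipcrm -m", segment["shmid"])
```

On a real host the input text could come from subprocess.run(["ipcs", "-m"], capture_output=True, text=True).stdout. The sketch only prints the removal command, so that a person can confirm the segment really is a leftover genome before removing it.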

To sort the output or to produce it in BAM format, use the option --outSAMtype [SAM|BAM|None] [Unsorted] [SortedByCoordinate]. When SortedByCoordinate is used, --limitBAMsortRAM must be defined as well.
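The rule that sorted output also needs a memory limit can be captured in a small helper that builds the text for the options parameter. This is a minimal sketch: the function name is hypothetical, the memory value in the usage line is an arbitrary example, and restricting coordinate sorting to BAM output is an assumption of this sketch rather than something stated on this page.

```python
def out_sam_type_options(fmt="BAM", sort="Unsorted", limit_bam_sort_ram=None):
    """Build the --outSAMtype part of the options string (hypothetical helper)."""
    if fmt not in ("SAM", "BAM", "None"):
        raise ValueError("fmt must be SAM, BAM or None")
    if fmt == "None":
        return "--outSAMtype None"
    if sort not in ("Unsorted", "SortedByCoordinate"):
        raise ValueError("sort must be Unsorted or SortedByCoordinate")
    options = ["--outSAMtype", fmt, sort]
    if sort == "SortedByCoordinate":
        # Assumption of this sketch: coordinate sorting is only used with BAM.
        if fmt != "BAM":
            raise ValueError("SortedByCoordinate is only handled for BAM here")
        if limit_bam_sort_ram is None:
            raise ValueError("SortedByCoordinate needs --limitBAMsortRAM")
        options += ["--limitBAMsortRAM", str(limit_bam_sort_ram)]
    return " ".join(options)

# 10000000000 bytes is an arbitrary example value.
print(out_sam_type_options("BAM", "SortedByCoordinate", 10_000_000_000))
```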

Version 2.0
Bundle sequencing
Categories Alignment
Authors Lauri Lyly (lauri.lyly@helsinki.fi)
Issue tracker View/Report issues
Requires STAR ; samtools ; atool (DEB) ; installer (bash)
Source files component.xml main.sh
Usage Example with default values

Inputs

Name Type Mandatory Description
genome BinaryFolder Mandatory A STAR genome, which can be generated from a required FASTA file, and optional annotation and optional list of splice junctions. Either download or generate a genome (STARGenome component) for your annotation and read length. Standard genomes generated so far should go to: /mnt/csc-gc5/resources/STAR/
reads Array<BinaryFile> Mandatory FASTA or FASTQ file containing reads for the alignment. Note: If your files are gzipped, you need to specify a parameter to STAR telling how to uncompress them, e.g. "--readFilesCommand zcat" or "--readFilesCommand acat". zcat should work even if the file is uncompressed. You can test these alternatives on the command line by e.g. "zcat myfile.fq.gz | head". In some cases, single fastq files are .tar.gz or .tgz compressed - in that case use "tar Ozxf", replacing z with j for bz2 compression. The acat utility works for typical compressed formats, which is why it is the default. acat supports only compressed files. For acat you need to have the atool package installed.
mates Array<BinaryFile> Optional FASTA or FASTQ file containing mates. Required for paired end data.
parameters TextFile Optional This file overrides default STAR parameters, but will itself be overridden by the command line. Use parametersDefault from STAR source as template.

Outputs

Name Type Description
folder BinaryFolder All files created by STAR in the output folder.
alignment AlignedReadSet (Sorted) alignment. A coordinate sorted file will be indexed, i.e. there is a .bai file.
spliceJunctions CSV Splice junctions. This CSV file is created by adding a header to STAR output. ("Chromosome\tStart\tEnd\tStrand\tIntronMotif\tAnnotated\tUniqueMapping\tMultiMapping\tMaxOverhang"):
  1. Column 1: chromosome
  2. Column 2: first base of the intron (1-based)
  3. Column 3: last base of the intron (1-based)
  4. Column 4: strand
  5. Column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
  6. Column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
  7. Column 7: number of uniquely mapping reads crossing the junction
  8. Column 8: number of multi-mapping reads crossing the junction
  9. Column 9: maximum spliced alignment overhang
Of these, the following awk expression is relevant for 2-pass STAR: "{if($5>0){print $1,$2,$3,strChar[$4]}}". In other words, it keeps the junctions whose intron motif is classified (column 5 is non-zero) and prints the chromosome, the intron boundaries and the strand. The rest of the columns are useful for filtering interesting junctions in 2-pass STAR.
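For readers less familiar with awk, here is a Python sketch of the same selection. The strand mapping (0 to ".", 1 to "+", 2 to "-") stands in for the strChar array, which is not defined on this page, and the tab-separated output is likewise an assumption; the function name and sample rows are invented.

```python
# Assumed meaning of the strand column: 0 = undefined, 1 = +, 2 = -.
STRAND_CHAR = {"0": ".", "1": "+", "2": "-"}

def to_sjdb_lines(sj_text):
    """Keep junctions with a classified intron motif and emit chrom, start, end, strand."""
    lines = []
    for line in sj_text.splitlines():
        fields = line.split("\t")
        # Skip blank lines and the header added by this component.
        if len(fields) < 5 or fields[0] == "Chromosome":
            continue
        chrom, start, end, strand, motif = fields[:5]
        if int(motif) > 0:  # same test as $5>0 in the awk expression
            lines.append("\t".join((chrom, start, end, STRAND_CHAR[strand])))
    return lines

# Invented sample rows for illustration.
SAMPLE_SJ = (
    "Chromosome\tStart\tEnd\tStrand\tIntronMotif\tAnnotated\tUniqueMapping\tMultiMapping\tMaxOverhang\n"
    "chr1\t14830\t14969\t2\t2\t1\t12\t3\t38\n"
    "chr2\t500\t900\t1\t1\t0\t7\t0\t30\n"
    "chr3\t100\t250\t0\t0\t0\t2\t0\t15\n"   # non-canonical motif: skipped
)

for sjdb_line in to_sjdb_lines(SAMPLE_SJ):
    print(sjdb_line)
```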

Parameters

Name Type Default Description
genomeLoad string "LoadAndRemove" LoadAndRemove works for parallel STAR instances and if everything goes fine, should free memory after the last STAR exits. LoadAndKeep, LoadAndRemove, Remove, LoadAndExit and NoSharedMemory are the options.
mainAlignmentType string "" Depending on the parameters, more than one alignment may be produced (e.g. sortedByCoord or toTranscriptome). The string defined here selects which alignment is linked to the alignment output of this component. The alignments not selected will still be available in the folder output.
options string "--readFilesCommand acat --outSAMattributes All" Appended to STAR command line. Overrides all other ways of specifying parameters. Defaults can be found in the parametersDefault file in the STAR source directory. See the reads input for relevant options. Some parameters are important for Cufflinks compatibility (quoting from a web discussion somewhere):
  1. For non-strand-specific data, you need to use STAR option --outSAMstrandField intronMotif which will add the XS attribute to all canonically spliced alignments using their introns' motifs - that's exactly what Cufflinks needs.
  2. For strand-specific data, you do not need any extra parameters for STAR runs, but you need to use --library-type option for Cufflinks. For example, for the "standard" dUTP protocol you need to use --library-type fr-firststrand in Cufflinks.
Often needed options:
  1. --outFilterIntronMotifs RemoveNoncanonicalUnannotated
  2. --outFilterType BySJout
threads int 1 Number of threads passed to STAR. Pass a similar parameter separately in sortOptions!
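The Cufflinks compatibility advice quoted in the options description can be summarised as a small lookup. This is only a sketch: the function name is hypothetical, and the strand-specific branch covers only the "standard" dUTP protocol mentioned above, so other protocols would need a different --library-type value.

```python
def cufflinks_compat_options(strand_specific):
    """Return extra options for STAR and Cufflinks (hypothetical helper).

    The strand-specific case assumes the "standard" dUTP protocol.
    """
    if strand_specific:
        # No extra STAR options; Cufflinks must be told the library type.
        return {"star": "", "cufflinks": "--library-type fr-firststrand"}
    # Non-strand-specific data: let STAR add the XS attribute from intron motifs.
    return {"star": "--outSAMstrandField intronMotif", "cufflinks": ""}

print(cufflinks_compat_options(strand_specific=False)["star"])
print(cufflinks_compat_options(strand_specific=True)["cufflinks"])
```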

Test cases

Test case: case1
Parameters: properties (genomeLoad=NoSharedMemory)
IN genome: (missing)
IN reads: reads
IN mates: mates
IN parameters: (missing)
OUT folder: (missing)
OUT alignment: (missing)
OUT spliceJunctions: (missing)


Generated 2019-02-08 07:42:12 by Anduril 2.0.0