Spliced Transcripts Alignment to a Reference for RNA-seq
The STAR component was implemented because TopHat is around 50 times slower, and its run time is counted in days. STAR's output is also compatible with Cufflinks, and the output quality is comparable. So if you do not need the extreme speed of Sawfish but still want to benefit from paired-end reads, STAR is a good option.
Consult the STAR webpage. To search the web for STAR, use "STAR rna-seq". Otherwise you will find nothing relevant.
The output will be sorted only if SortedByCoord is chosen as mainAlignmentType.
The STARGenome component may be used to generate the genome input for this component. The word "genome" is very confusing, but it is the term STAR uses. It actually means something like a transcriptome index.
STAR can be run in a so-called 2-pass mode. The splice junctions from the first pass are used to build a new genome index, but you need the original FASTA files for this. The idea is this: "STAR will not discover any new junctions but will align spliced reads with short overhangs across the previously detected junctions." See this thread.
In practice, this takes four steps.
The genome generation steps take an hour with 24 threads and consume 30 GB of disk space.
For an alignment pass you need reads in FASTQ or FASTA files and a STAR genome.
The STARGenome component may be used to generate the genome input for this component, and also for the two-pass mode.
The two-pass mode's genome generation step is enabled by providing the genomeFasta and spliceJunctions inputs. The rest of the inputs are ignored.
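Outside the component, on the plain STAR command line, the four steps could be sketched roughly as follows. This is a dry run that only prints commands; the function name, the directory names (genome1, genome2) and the file names are hypothetical placeholders, and the component itself may invoke STAR differently.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the four STAR commands of a manual two-pass run.
# All paths and the function name are hypothetical placeholders.
two_pass_commands() {
    local fasta=$1 reads=$2 mates=$3 threads=$4
    # Step 1: build the initial genome index from the FASTA file.
    echo "STAR --runMode genomeGenerate --genomeDir genome1 --genomeFastaFiles $fasta --runThreadN $threads"
    # Step 2: first alignment pass; STAR writes junctions to pass1_SJ.out.tab.
    echo "STAR --genomeDir genome1 --readFilesIn $reads $mates --runThreadN $threads --outFileNamePrefix pass1_"
    # Step 3: rebuild the index, adding the junctions found in the first pass.
    # 100 is STAR's default overhang; adjust it to your read length.
    echo "STAR --runMode genomeGenerate --genomeDir genome2 --genomeFastaFiles $fasta --sjdbFileChrStartEnd pass1_SJ.out.tab --sjdbOverhang 100 --runThreadN $threads"
    # Step 4: second alignment pass against the new index.
    echo "STAR --genomeDir genome2 --readFilesIn $reads $mates --runThreadN $threads --outFileNamePrefix pass2_"
}

two_pass_commands ref.fa reads_1.fq reads_2.fq 24
```

Printing the commands first makes it easy to review them before spending an hour on each genome generation step.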
The most up-to-date input parameters and their defaults can be found in the parametersDefault file in the STAR source directory (e.g. /opt/share/STAR). You may want to copy this file to use as a template for the parameters input. Any additional flags you pass in the "options" parameter will override settings in the file. Finally, the flags "outFileNamePrefix" and "outStd" are overridden by this component.
Options may also be used to stream compressed or encrypted input data, for example by using utilities such as zcat, acat or ccat.
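As a rough illustration, the sketch below picks a value for --readFilesCommand from the file extension. The helper name pick_read_command is hypothetical and not part of the component; the mapping simply follows the alternatives described for the reads input, and treating plain .fq/.fastq/.fa/.fasta files as needing no command is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggest a --readFilesCommand value for a read file.
pick_read_command() {
    case "$1" in
        *.tar.gz|*.tgz) echo "tar Ozxf" ;;     # single FASTQ inside a gzipped tarball
        *.tar.bz2)      echo "tar Ojxf" ;;     # same, but bz2-compressed
        *.gz)           echo "zcat" ;;         # plain gzip
        *.fq|*.fastq|*.fa|*.fasta) echo "" ;;  # uncompressed: no command needed
        *)              echo "acat" ;;         # other compressed formats (needs atool)
    esac
}

for f in sample_1.fq.gz sample_2.tgz sample_3.fastq; do
    cmd=$(pick_read_command "$f")
    if [ -n "$cmd" ]; then
        echo "$f: --readFilesCommand $cmd"
    else
        echo "$f: no --readFilesCommand needed"
    fi
done
```

The tarball patterns are listed before the plain .gz pattern so that a .tar.gz file is not mistaken for a simple gzip file.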
The default is to use shared memory, which saves around 30 gigabytes per STAR process after the first one.
In new versions of STAR the genome is shared even if you use "genomeLoad=LoadAndRemove", and unloaded when the last STAR exits. In older versions it would still load many copies.
If STAR crashes, the shared memory segment may be left *permanently* in memory, which would be a really, really bad thing. This is why you must make sure it is unloaded. Right now this can only be done manually. So if you can, choose a list of hosts and verify after running that there is nothing left.
Helper scripts are contained in the STAR component directory. There is an example script for loading, unloading and generating genomes.
list_shared_memory.sh does what it says: it displays the shared memory segments and also tells you how to unload them.
Knowing a few command line commands is crucial for error situations. When STAR crashes during loading, it will leave a trash copy of the genome in memory, permanently using lots of it. STAR has also been observed to simply hang and do nothing forever, because of a corrupt genome in memory.
Two commands are enough to solve all issues.
ipcs -m will list shared memory segments, allowing you to see how many processes are still using the memory block (the nattch column). The ipcrm -m command may be used to remove a segment from memory, specifying the shmid identifier from the list command.
After starting the alignment runs, you can check that nattch reflects the number of alignment processes on a node, to confirm memory sharing is being used. Something is wrong if you see more than one such line from STAR.
If a genome was loaded incompletely (STAR was killed or such), it must be manually unloaded, or it will stay there until reboot.
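A minimal sketch of the check described above: the filter below reads ipcs -m output and prints the shmid of every segment whose attach count (the nattch column) is zero. The function name is hypothetical, the sample values are made up, and the column positions assume the usual Linux layout of ipcs -m. Not every shared segment on a host belongs to STAR, so review the list before removing anything.

```shell
#!/usr/bin/env bash
# Reads `ipcs -m` output on stdin and prints the shmid of each segment
# that has no attached processes (nattch == 0).
# Assumed column order: key shmid owner perms bytes nattch status.
unattached_segments() {
    awk '$1 ~ /^0x/ && $6 == 0 { print $2 }'
}

# Sample input in the shape printed by `ipcs -m` (values are made up).
sample='
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 32768      alice      666        28000000000 3
0x0000002a 65537      alice      666        28000000000 0
'

# Prints 65537, the only segment in the sample with nattch == 0.
printf '%s\n' "$sample" | unattached_segments
```

On a real host the same filter would be fed with `ipcs -m | unattached_segments`, and each reported segment could then be removed with `ipcrm -m <shmid>` once you are sure it is a leftover STAR genome.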
For sorted output or output in BAM format, use the option --outSAMtype [SAM|BAM|None] [Unsorted] [SortedByCoordinate]. When SortedByCoordinate is used, --limitBAMsortRAM needs to be defined as well.
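For example, a string for the options parameter could be assembled as below. The helper name bam_sort_options is hypothetical; the sketch assumes you think of the sorting memory limit in GiB, while --limitBAMsortRAM expects a value in bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build options for coordinate-sorted BAM output.
# Argument: RAM limit for BAM sorting in GiB (converted to bytes for STAR).
bam_sort_options() {
    local gib=$1
    echo "--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM $(( gib * 1024 * 1024 * 1024 ))"
}

# Prints: --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 32212254720
bam_sort_options 30
```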
Version | 2.0 |
---|---|
Bundle | sequencing |
Categories | Alignment |
Authors | Lauri Lyly (lauri.lyly@helsinki.fi) |
Issue tracker | View/Report issues |
Requires | STAR ; samtools ; atool (DEB) ; installer (bash) |
Source files | component.xml main.sh |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
genome | BinaryFolder | Mandatory | A STAR genome, which can be generated from a required FASTA file, and optional annotation and optional list of splice junctions. Either download or generate a genome (STARGenome component) for your annotation and read length. Standard genomes generated so far should go to: /mnt/csc-gc5/resources/STAR/ |
reads | Array<BinaryFile> | Mandatory | FASTA or FASTQ file containing reads for the alignment. Note: If your files are gzipped, you need to specify a parameter to STAR telling how to uncompress them, e.g. "--readFilesCommand zcat" or "--readFilesCommand acat". zcat should work even if the file is uncompressed. You can test these alternatives on the command line by e.g. "zcat myfile.fq.gz | head". In some cases, single fastq files are .tar.gz or .tgz compressed - in that case use "tar Ozxf", replacing z with j for bz2 compression. The acat utility works for typical compressed formats, which is why it is the default. acat supports only compressed files. For acat you need to have the atool package installed. |
mates | Array<BinaryFile> | Optional | FASTA or FASTQ file containing mates. Required for paired end data. |
parameters | TextFile | Optional | This file overrides default STAR parameters, but will itself be overridden by the command line. Use parametersDefault from STAR source as template. |
Name | Type | Description |
---|---|---|
folder | BinaryFolder | All files created by STAR in the output folder. |
alignment | AlignedReadSet | (Sorted) alignment. A coordinate sorted file will be indexed, i.e. there is a .bai file. |
spliceJunctions | CSV | Splice junctions. This CSV file is created by adding a header to the STAR output ("Chromosome\tStart\tEnd\tStrand\tIntronMotif\tAnnotated\tUniqueMapping\tMultiMapping\tMaxOverhang"). |
Name | Type | Default | Description |
---|---|---|---|
genomeLoad | string | "LoadAndRemove" | LoadAndRemove works for parallel STAR instances and if everything goes fine, should free memory after the last STAR exits. LoadAndKeep, LoadAndRemove, Remove, LoadAndExit and NoSharedMemory are the options. |
mainAlignmentType | string | "" | Depending on the parameters, more than one alignment may be produced (e.g. sortedByCoord or toTranscriptome). The string defined here selects which alignment is linked to the alignment output of this component. Alignments that are not selected are still available in the folder output. |
options | string | "--readFilesCommand acat --outSAMattributes All" | Appended to the STAR command line. Overrides all other ways of specifying parameters. Defaults can be found in the parametersDefault file in the STAR source directory. See the reads input for relevant options. Some parameters are important for Cufflinks compatibility. |
threads | int | 1 | Number of threads passed to STAR. Pass a similar parameter separately in sortOptions! |
Test case | Parameters | IN genome | IN reads | IN mates | IN parameters | OUT folder | OUT alignment | OUT spliceJunctions |
---|---|---|---|---|---|---|---|---|
case1 | properties (genomeLoad=NoSharedMemory) | (missing) | reads | mates | (missing) | (missing) | (missing) | (missing) |