3. Data Preprocessing: Quality Controls, Alignment, Methylation Extraction and Filtering

This pipeline includes all steps to preprocess bisulfite sequencing data. It takes raw fastq files as input and produces methylation calls as output. Methylation calls can then be used to estimate differential methylation in the downstream analysis.


1) Data concatenation

If data from the same sample are spread out though several lanes of the same experiment, you might want to concatenate the fastq files into a single one. If data from the same patient come from different samples and/or are sequenced during different experiment, then it is better to preprocess them separately and merge the methylation calls after step 4.

2) Quality control and trimming

Fastq data have to meet certain quality criteria in order to be considered reliable for further analysis. Moreover parts of the read belonging to adaptors or having very low quality have to be trimmed. This step can be performed using the QCFasta component. Here we provide one example that can be included in the workflow. For more details you can check the QC pipeline.

3) Alignment

Bisulfite sequencing data require specific alignement tools, designed to handle the C>T conversion that occurs during the bisulfite treatment. There are two main classes of alignment tools: wild-card or three letter aligner. Bismark, the most commonly used aligner, belongs to the latter and is the one included in this pipeline (Bismark compoment). An alternative could be provided by the wild-card aligner BSMAP. Fastq files containing reads that survived step 2 can be given as input and the default output of this step is a BAM file.

4) Methylation extraction

Extraction of methylation levels for Cs across the genome can be performed with the alignment software (Bismark component) or with a separate one (MethylExtract component). It can be limited to CpGs or extended to Cs across all context. NOTE: MethylExtract requires a BAM file as input and, for some reasons, doesn’t work with SSBS data.

5) Filtering

This step is required only for SSBS data. It’s necessary to make sure that only the mapped CpGs that align to the regions targeted by SureSelect are selected. The SureSelect target regions can be extended of their 200bp flanking regions, to include the signal of reads overlapping them only in part. Since these regions are provided in hg19 coordinates, LiftOver can be used to convert them to the required genome assembly (in this case GRCh38). The mapping can then be performed using an REvaluate component and the following code Map_to_SureSelect.R as input script. The output is in the same format as the one in step 4, it simply contains less sites.

#!/usr/bin/env anduril
//$OPT -d /PATH/TO/FOLDER/execute_Preprocessing
//$OPT --threads 16
//$OPT --wrapper slurm-prefix
//$PRE startdate=$( date )
//$POST echo - "eRunning time:\n$startdate -> $( date )"

import anduril.builtin._
import anduril.tools._
import anduril.sequencing._
import anduril.microarray._
import org.anduril.runtime._

object Preprocessing {

        // Get raw reads and mates
        val reads = INPUT(path="/path/to/reads/BSfastQreads.csv")
        val mates = INPUT(path="//path/to/reads/BSfastQmates.csv")

        // 1) Concatenate data from different lanes but same sample //

        val fastqReads=BashEvaluate(
                echo -e "Key\tValue" > @out1@
                echo "Merging reads"
                while read line; do
                        key=$( echo $line | awk '{ print $1 }' )
                        echo "- $key"
                        r1=$( echo $line | awk '{ print $2 }' )
                        r2=$( echo $line | awk '{ print $3 }' )
                        cat $r1 $r2 > @arrayOut1@/${key}_reads.fq.gz
                        addarray arrayOut1 "$key" "${key}_reads.fq.gz"
                        check=$( zcat @arrayOut1@/${key}_reads.fq.gz | md5sum )
                        echo -e "${key}_reads\t$check" >> @out1@
                done < <( noheader @var1@ )

        val fastqMates=BashEvaluate(
                echo -e "Key\tValue" > @out1@
                echo "Merging mates"
                while read line; do
                        key=$( echo $line | awk '{ print $1 }' )
                        echo "- $key"
                        r1=$( echo $line | awk '{ print $2 }' )
                        r2=$( echo $line | awk '{ print $3 }' )
                        cat $r1 $r2 > @arrayOut1@/${key}_mates.fq.gz
                        addarray arrayOut1 "$key" "${key}_mates.fq.gz"
                        check=$( zcat @arrayOut1@/${key}_mates.fq.gz | md5sum )
                        echo -e "${key}_mates\t$check" >> @out1@
                done < <( noheader @var1@ )

        // 2) Quality control and trimming //

        val fastQCmeth = QCFasta(
            tool="trimGalore", // trimmomatic is also available


        // 3) Alignment and 4) Methylation extraction with Bismark //

        // The reference genome has to be converted into a format suitable for bisultife sequencing alignment.
        // Check Bismark documentation for more information.
        // Input here the folder containing the reference genome.
        val refGenome = INPUT(path="/path/to/refs/annotations/GRCh38d1vd1_Bisulfite_Converted")

        val alignMethCpG = NamedMap[BismarkAlign]("alignMethCpG")
        val alignCpG = NamedMap[BAM]("alignCpG")
        val bismarkExtractCpG = NamedMap[CSV]("bismarkExtractCpG")

        val repCpG = NamedMap[HTMLFile]("repCpG")

        for ((sample,f) <- iterArray(fastQCmeth.qcReads)) {

            alignMethCpG(sample) = BismarkAlign(
                alignExtra="-X 1200",        // Maximum insert size
                extractExtra="--no_overlap", // When read and mate overlap, count as only one
                extract=true, // set to false if you extract the methylation with MethylExtract
                // deleteNonCpG=true, set to false if you CHG and CHH sites needed as well

            // alignMethCpG(sample)._execute="once"

            alignCpG(sample) = alignMethCpG(sample).alignment
            bismarkExtractCpG(sample) = alignMethCpG(sample).methylationCalls

            repCpG(sample) = alignMethCpG(sample).report

        val alignmentsCpG = Array2CSV(in=alignCpG)

        val bismarkExtractedCpG = Array2CSV(in=bismarkExtractCpG)

        val reportsCpG = Array2CSV(in=repCpG)


        // 4) Methylation Extraction //

        // Enable this section if you set extract = false in Bismark

        val extracted = NamedMap[MethylExtract]("extracted")
        val extracted_destrF = NamedMap[MethylExtract]("extracted_destrF")
        val extractMeth = NamedMap[CSV]("extractMeth")
        val extractMeth_destrF = NamedMap[CSV]("extractMeth_destrF")

        for (sample <- iterCSV(alignmentsCpG.out)) {

            extracted(sample("Key")) = MethylExtract(
                bam = INPUT(path=sample("File")),
                refGenome = refGenome,
                destrand = true,
                threads = 6,
                firstIgnor = 10,
                lastIgnore = 5,
                removeChr = false


            extracted_destrF(sample("Key")) = MethylExtract(
                bam = INPUT(path=sample("File")),
                refGenome = refGenome,
                destrand = false,
                threads = 6,
                firstIgnor = 10,
                lastIgnore = 5,
                removeChr = false


            extractMeth(sample("Key")) = extracted(sample("Key")).CpGmeth
            extractMeth_destrF(sample("Key")) = extracted_destrF(sample("Key")).CpGmeth                               

        val extractMeth_CpG = Array2CSV(in=extractMeth)
        val extractMeth_destrF_CpG = Array2CSV(in=extractMeth_destrF)

        OUTPUT(in=extractMeth_destrF_CpG.out) */

        // 5) Filter only on-target CpGs //

        val hg19ToHg38_over_chain = INPUT(path="/path/to/hg19ToHg38.over.chain")

        // Get target regions and add column names
        val SureSelect_targets_padded_200_hg19 = CSVCleaner(
            in         = INPUT(path="/path/to/SureSelect_Info_hg19/S03770311_Padded_200.bed"),
            columnsIn  = "chr,start,end",
            columnsOut = "*",
            rowSkip    = 2

        // Add IDs
        val SureSelect_targets_padded_200_hg19_with_ID = CSVTransformer(
            csv1        = SureSelect_targets_padded_200_hg19.out,
            columnNames = "c(colnames(csv1), 'id')",
            transform1  = "csv1[,1]",
            transform2  = "formatC(csv1[,2], format='d')",
            transform3  = "formatC(csv1[,3], format='d')",
            transform4  = "paste0(csv1$chr, ':', csv1$start, '-', csv1$end)"

        // Convert target regions from hg19 to GRCh38
        val SureSelect_targets_padded_200_GRCh38 = BashEvaluate(
            var1   = SureSelect_targets_padded_200_hg19_with_ID.out,
            var2   = hg19ToHg38_over_chain,
            script = """
                        tail -n +2 @var1@ | sed 's/\"//g' > @out1@
                        liftOver @out1@ @var2@ @folder1@/output.bed @folder1@/unlifted.bed"""

        // Filter only the CpGs overlapping target regions
        val Rscript = INPUT(path="/path/to/Map_to_SureSelect.R")

        val Mapped_CpGs_padded_200 = REvaluate(
            script = Rscript,
            var1   = SureSelect_targets_padded_200_GRCh38.folder1,
            var2   = All_SSBS_samples.out


        // Check how many reads intesect target regions
        val Intersect_SureSelect_Targets_padded_200 = NamedMap[BashEvaluate]("Intersect_SureSelect_Targets_padded_200")

        val Intersecting_SureSelect_Targets_padded_200 = NamedMap[BinaryFile]("Intersecting_SureSelect_Targets_padded_200")

        for (sample <- iterCSV(alignmentsCpG.out)) {

            Intersect_SureSelect_Targets_padded_200(sample("Key")) = BashEvaluate(
                  intersectBed -abam @param1@ -b @var1@/output.bed > @folder1@/reads.in.regions.bam

                  total=$( samtools view -c -F 260 @param1@ )
                  targeted=$( samtools view -c -F 260 @folder1@/reads.in.regions.bam )

                  ( tsvecho Key Reads
                    printf "%s\t%d\n" @param2@"_Aligned" "${total}"
                    printf "%s\t%d\n" @param2@"_Targeted" "${targeted}" ) > @out1@
            Intersecting_SureSelect_Targets_padded_200(sample("Key")) = Intersect_SureSelect_Targets_padded_200(sample("Key")).out1

        val Intersected_SureSelect_Targets_padded_200 = Array2CSV(in=Intersecting_SureSelect_Targets_padded_200)