RNA-seq

When you click on the tiled button “RNA-seq” in the Work Space of the on the start page, the following listing will appear:

_images/rna_seq1.png

RNA-seq preprocessing

SRA to FASTQ

This workflow can be used to convert SRA data files (e. g. from NGS/RNA-seq experiments) into FASTQ files. The FASTQ format is widely used by a number of tools and the geneXplain platform is among them; on the other hand, NGS data are often collected in SRA format, thus the conversion of SRA format into FASTQ format is an important function. An example of public data stored in SRA format can be found here (https://www.ncbi.nlm.nih.gov/sra?term=SRP051443) and can be uploaded directly via FTP import into the geneXplain platform. The workflow “SRA to FASTQ” can be found on the Start page, under the NGS/RNA-seq button.

_images/rna_seq2.png

To launch the workflow, follow these steps:

Step1. Open the workflow input form from the Start page. It looks as shown below:

_images/rna_seq3.png

Step 2. Specify the folder with the SRA files in the field Input folder. You can drag it from your project within the tree area and drop it in the box beside the folder pictogram. Alternatively, you may click on the field (select element) and a new window will be opened, where you can select the input folder.

Example of an input folder

It contains 12 files in SRA format as shown below. Please note, this folder occupies 1.5 GB work space.

_images/image102.png

The output folder name and folder path is automatically created, but can be changed in a user-specific way.

Press [Run workflow] and wait till the workflow is completed.

Example of an output folder:

https://platform.genexplain.com/bioumlweb/#de=data/Examples/RNA-Seq%20analysis%20of%20human%20esophageal%20squamous%20cell%20carcinoma%20(ESCC),%20GSE32424,%20FASTQ%20files/Data/Fastq%20files

_images/7299fb0d9f33457de394643a9ed929ae.png

The output folder contains 12 files with the same names and now with the extension fastq. Please note, the size of the output folder is 16.6 GB.

Note Working with NGS data in SRA and FASTQ formats requires substantial work space available in your user account. Feel free to contact us (info@genexplain.com) to upgrade your account with additional disk space.

Convert genome coordinates with Lift-Over

While working with NGS data, quite often it might quite often be required to convert positions from one genome assembly to another one, and The Lift-Over program can be applied for this task. Within the geneXplain platform, you can find it on the Start page under the NGS button, as highlighted below.

_images/rna_seq4.png

This tool is based on the LiftOver utility and Chain track from the UC Santa Cruz Genome Browser.

It converts coordinates and annotations between assemblies and genomes. The input is a track with genomic positions according to a particular genome assembly, and the output is a track with positions according to another genome assembly.

To launch the workflow, follow these steps:

Step1. Open the input form from the Start page. It looks as shown below:

_images/rna_seq5.png

Step 2. Specify the input track in the field Convert coordinates of. You can drag it from your project within the tree area and drop it in the box of the field.

The further steps of the workflow are demonstrated by means of the tables in one of the pre-prepared examples. You can find these tables here:

https://platform.genexplain.com/bioumlweb/#de=data/Examples/RNA-Seq%20analysis%20of%20human%20esophageal%20squamous%20cell%20carcinoma%20(ESCC),%20GSE32424,%20FASTQ%20files/Data/Lift_over

Step 3. Specify the Mapping of the input track by selecting the desired genome conversion from the drop-down menu.

_images/a2bb67da68b0935a6cedb06b63202029.png

In the majority of cases not everything can be remapped to another assembly. The minimum ratio of bases that must be re-mapped is by default 0.95, you can change this by typing in this field.

Step 4. Allow multiple output regions: choose Yes or No.

Step 5. Define where the output tracks should be located in the tree. The method produces two output tracks, one containing all the mapped coordinates and the other containing the unmapped coordinates (if existing).

After filling out all input fields press [Run] and wait till the method is completed.

The output is a folder with two tracks as shown below:

_images/27dc92886573ab2a57bb19677b75ff2f.png

Alignment of FASTQ with Bowtie

Bowtie is a short-read aligner designed to be ultrafast and memory-efficient. It was developed by Ben Langmead and Cole Trapnell (Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25). The Bowtie method can be found in the Galaxy section of the platform (analyses/Galaxy/solexa_tools/bowtie_wrapper).

Input format: Bowtie accepts files in Sanger FASTQ format. Often, the sequence files are represented in SRA format (Sequence Read Archives). This format is used to deposit sequences at the Sequence Read Archives at NCBI, EBI, and DDBJ. Please consider converting the SRA files into FASTQ files before starting Bowtie. You can use the conversion tool Convert Files in SRA format to FASTQ or Convert Files in SRA format to paired FASTQ in the Galaxy section of the platform (analyses/Galaxy/sra_toolkit/fastq-dump).

To launch the Bowtie tool, follow these steps:

Step1. Open the Bowtie input form from the Start page by clicking on Alignment of FASTQ with Bowtie option in the RNA-seq preprocessing subsection. It will open in the main Work Space and looks as shown below:

_images/rna_seq6.png

Step 2. Specify the reference genome.

The field Will you select a reference genome from your history or use a built-in index? defines the reference genome. If you keep the default “Use a built-in index” the program makes an alignment to the reference genome which is provided as part of the platform (the genome builds for human, mouse and rat are provided). Depending on the species used, please specify hg19 for human, mm9 for mouse and rn4 for rat in the next field Select a reference genome.

If you select the “Use one from the history” option in the 1st field, two new fields will appear in the form: Select the reference genome and Choose whether to use Default options for building indices or to Set your own, as shown in the screenshot below, highlighted by the red ovals. In this case you should provide a preloaded reference genome (preloaded in Fasta, EMBL or GeneBank formats) and choose how to build sequence indices which will be used by the alignment algorithm of Bowtie.

_images/rna_seq7.png

Step 3. The field Is this library mate-paired? defines the type of the sequence library which was used in NGS sequencing. The default is Single-end, the alternative is Paired-end. You should know this detail about your library of short reads.

Step 4. Specify the input file in the field FASTQ file. You can either drag-and-drop or select the file name from the Tree area. Here, as an example, we use data from a published RNA-seq experiment analyzing the human esophageal squamous cell carcinoma (ESCC), GSE32424. FASTQ files are found here: https://platform.genexplain.com/bioumlweb/#de=data/Examples/RNA-Seq%20analysis%20of%20human%20esophageal%20squamous%20cell%20carcinoma%20(ESCC),%20GSE32424,%20FASTQ%20files/Data/Fastq%20files

This example contains results of an RNA-seq Illumina NGS sequencing of twelve clinical samples from human esophageal squamous cell carcinoma (ESCC) (seven tumors and five non-tumors). The authors provided sequences as so-called non-aligned BAM files. We loaded these BAM files directly from GEO as one archive using the ftp uploading function of the geneXplain platform. After that, we converted the non-aligned BAM files into FASTQ files using the tool: SAM to FASTQ from the NGS: Picard (beta) subsection of the Galaxy section of the platform (analyses/Galaxy/picard_beta/picard_SamToFastq).

Step 5. The field Bowtie settings to use is set by default to Commonly used. If you change it to the Full parameter list from the drop-down menu, a full list of parameters of the alignment algorithm is enabled for editing. The full description of all these parameters is given in the original paper of Langmead et al. mentioned above which describes the algorithm of Bowtie.

Step 6. Please keep the box Suppress the header in the output SAM file unchecked (as it is by default) to generate SAM/BAM output files suitable for further use by Cufflinks tool.

Step 7. Set the output file name (for the output BAM file) in the field Map with Bowtie for Illumina… and press the button [Run].

Tip It is recommended to save the output file into a separate folder containing all BAM output files from one particular experiment. This will allow you to run the next workflow for the quantification of all SAM/BAM files from this defined folder.

Results

The result of this method is one BAM file which is generated by the Bowtie program as the result of the alignment of the sequence reads from the input FASTQ file to the reference genome.

Generated BAM file is a track and has the (_images/track.jpg) icon in the tree. As usual for all tracks, it opens in the Work Space. You can see the positions of each aligned read in the genome and upon zooming-in you get all detailed information about each read complete with sequence, length, and quality.

_images/rna_seq8.png

In the info box you can see information about the output BAM file. The number of aligned and not aligned reads and overall file size is shown.

_images/rna_seq9.png

Note The input FASTQ and output BAM files of the Bowtie tool require a considerable amount of working space. One FASTQ file can occupy several GB of space. If you need more space for storage and work with your FASTQ and BAM files, please feel free to ask for details (info@genexplain.com).

Find genome variations and indels from RNA-seq

The challenge of obtaining accurate variant calls from RNA-seq data is substantial. The workflow is based on a framework to discover genotype variations published by De Pristo et al., Nature Genetics 43:491-498, 2011. The process applied includes initial read mapping, local realignment around indels, base quality score recalibration, SNP discovery and genotyping to find all potential variants.

The workflow can be found in the section “RNA-seq Preprocessing”.

_images/rna_seq10.png

Step 1. Open the workflow input form from the Start page. It will open in the main Work Space and looks as shown below:

_images/c37940fe72dc4dc3427f093d16c17cbe.png

Step 2. Specify the input file in FASTQ format in the field Input fastq file.
To specify the fastq file, you can drag & drop it from your project within the tree area. Alternatively, you may click on the pink field “select element” and a new window will open, where you select the input track. After having selected the track, press the [Ok] button.

Step 3. Specify the Minimum read segment length. By default a minimum length of 25 is given.

Step 4. Define where the folder with the results should be located in your project tree. You can do so by clicking on the pink field “select element” in the field OutputFolder, and a new window will be opened, where you can select the location of the results folder and define its name.

Start the workflow by pressing the [Run workflow] button.

In the following example we took as input the fastq file [SRR349741.fastq]

https://platform.genexplain.com/bioumlweb/#de=data/Examples/RNA-Seq%20analysis%20of%20human%20esophageal%20squamous%20cell%20carcinoma%20(ESCC),%20GSE32424,%20FASTQ%20files/Data/Fastq%20files/SRR349741.fastq

The output folder can be found here: https://ge.genexplain.com/bioumlweb/#de=data/Examples/RNA-Seq%20analysis%20of%20human%20esophageal%20squamous%20cell%20carcinoma%20(ESCC)%2C%20GSE32424%2C%20FASTQ%20files/Data/SRR349741.fastq%20(Genome%20variants%20and%20indels%20from%20RNA-seq)/summary_hisat2

The output folder contains several files and subfolders with all results of the analysis.

_images/rna_seq11.png

The first step of the workflow is an alignment of all reads of fastq file using the HISAT2 tool. In the result folder one can see a sub-folder “tmp” which contains all found indels. They are stored as tracks and can be opened in the genome browser by double-click on each of the tracks. Each short line (arrow in the higher zoom) represents an aligned “read” from the fastq file.

_images/rna_seq12.png

The initial alignments are sorted and reordered to prepare the next quality checking steps. The results of these two steps are stored in the folder tmp as the two files reorder.bam and sorted.bam.

The next step removes duplicates. The purpose is to mitigate the effects of PCR amplification bias introduced during library construction. Two read pairs are considered duplicate if they align to the same genomic position. The resulting MarkDuplikates1.log file is stored in the log folder and the MarkDuplikates1.stat file is stored in the stat folder.

The next step is a local realignment. Read mapping algorithms operate on each read independently, locally realign reads such that the number of mismatching bases is minimized across all the reads. Output files are Realigner.log and TargetCreator.log in the log folder, ddup1.bam, Realigned.bam and realigner.intervals in the tmp folder.

The realigned BAM file is used again to remove duplicates (output MarkDuplicates2.log and MarkDuplicates2.stat), because realignment may change genomic positions of read pairs. After this step additional duplicates can be identified. The next step is a recalibration of base quality values. For each base in each read various covariates (such as reported quality score, position in read, dinucleotide, read GC-content) are calculated. Using these values the algorithm builds the model that predicts sequencing errors. Then it applies this model to calculate an empirical base quality score and overwrites the phred quality score currently in the read. Output is a new BAM file (Good.bam).

_images/333e8c1a2463fe86fdf34093a73e4bba.png

This file is used for the unified GATK which is the next step of this workflow (Genome Analysis Toolkit) genotyper method to detect the SNP-indels (table in VCF format) which the user can visualize by double click.

_images/rna_seq13.png

After Zooming in information of variation on nucleotide basis is shown.

_images/rna_seq14.png

In the next step each identified variation (SNP_indels) is analysed with the help of the “variant_effect_predictor” algorithm

https://platform.genexplain.com/bioumlweb/#de=analyses/Galaxy/ensembl/variant_effect_predictor

As a result it creates a final table that gives detailed information about each variation as shown in the example below:

https://platform.genexplain.com/bioumlweb/#de=data/Examples/RNA-Seq%20analysis%20of%20human%20esophageal%20squamous%20cell%20carcinoma%20(ESCC)%2C%20GSE32424%2C%20FASTQ%20files/Data/SRR349741.fastq%20(Genome%20variants%20and%20indels%20from%20RNA-seq_hg38%20single)/variant%20effects

_images/rna_seq1.png

Detect differentially expressed gene (DEG)

In order to perform further analyses of the results of the workflow Quantification of RNA-seq with Cufflinks for FASTQ files it is recommended to join all resulting gene tables into one table using the function “Join several tables” of the platform. The joint table can be used for detection of differentially expressed genes using the Limma or EBarrays functions.

_images/rna_seq22.png

It should be noted here that to perform a Limma or EBarray analysis of the RNA-Seq data you should select the option Unnormalized counts in the input field Input log-base for either of these two methods, as shown below for Limma.

Estimate differential expression using Linear Models for MicroArrays (LIMMA)

Limma estimates differential expression between specified conditions / groups.

This tool provides an interface for the popular and comprehensive Limma package. The platform tool computes differential expression between up to five conditions / groups. The groups consist of columns of a data table that contains normalized measurement values, e.g. from a normalized microarray experiment. Furthermore, one can estimate differential expression for normalized or un-normalized count data as derived from RNA-seq experiments.

All possible contrasts between groups are considered and their output is stored in a common folder. Conditions are compared in the specified order from first to fifth. E.g. given conditions named A, B and C, the output will contain the contrasts AvsB, AvsC and BvsC.

It is necessary to provide a unique name for each group. Also, at least two data columns are required per group.

_images/rna_seq23.png

The input parameters for Limma are described in the following.

Input table: This table contains the columns to analyze.

Input log-base: Here you can specify the scale of the input data. If the log-base is log2, the tool will use the data values as is. If your data are from RNA-seq, you can select Normalized counts or Unnormalized counts.

1-5. Condition / group name: One can specify up to five groups of columns. Please note that unnamed groups are not considered; a name is not assigned automatically.

1-5. Columns: These fields contain the selected columns. Please note that column selections are not considered without a corresponding name. Columns can only be specified once and there need to be two columns per group.

Output folder: The output folder will contain one output file for each pair of conditions.

An example output table is shown at the end of this section. Its columns are explained in the following. Those highlighted in bold are shown in the default view. The other columns can be included on demand via the Columns tab of the lower right panel (available with opened output table).

logFC: Fold change (log)

CI.025: Fold change (Lower confidence interval)

CI.975: Fold change (Upper confidence interval)

AveExpr: Average log2-expression for the probe over all arrays

t: Moderated T-statistic

P.Value: P-value Differential expression

adj.P.Val: Adjusted P-value (Benjamini-Hochberg)

B: Log-odds that the gene / probe presents differential expression

_images/36f276275b4442d6cfb0fa1434cf8f9f.png

Reference:
Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds), Springer, New York, 2005.

Estimate differential expression by the gene expression mixture model of EBarrays

EBarrays estimates differential expression between specified conditions / groups.

This tool provides for differential expression analysis using the EBarrays package. The platform tool can compare up to five conditions / groups. The groups consist of columns of a data table that contains normalized measurement values, e.g. from a normalized microarray experiment.

EBarrays sets up a mixture model matching the specified groups. Differential expression is identified when components for a pattern describe the distribution of measurement values well. Then probe / gene values in the corresponding group were significantly different from their values in the other groups. This is reflected by high posterior probabilities in the column named after that group.

The package estimates a critical posterior probability cutoff for the given FDR level on the basis of the fitted mixture model. Probes / genes exceeding this cutoff in some condition / group are indicated by a value of 1 (instead of -1) in the output column named “condition name Sig”. Hence, to isolate the targets differentially expressed in a condition of interest, e.g. condition named “treatment”, filter the table for all rows with a value of 1 in the column “treatment Sig”. The direction of differential expression can be derived from the fold change column “condition name FC”, which contains the log2-fold changes.

_images/rna_seq24.png

The input parameters for EBarrays are described in the following.

Input table: This table contains the columns to analyze.

Input log-base: Here you can specify the scale of the input data. If the log-base is none, the tool will use the data values as is. If your data are from RNA-seq, you can select Normalized counts or Unnormalized counts.

1-5. Condition / group name: One can specify up to five groups of columns. Please note that unnamed groups are not considered; a name is not assigned automatically. Fields for the first two groups need to be set.

1-5. Columns: These fields contain the selected columns. Please note that column selections are not considered without a corresponding name. Columns can only be specified once and there need to be two columns per group. Fields for the first two groups need to be set.

1-5. Is control: Use this field to mark the control column group. One such group is required.

Output folder: The output folder will contain one output file for each pair of conditions.

It is necessary to provide a unique name for each group. Also, at least two data columns are required per group and one group needs to be marked as a control group.

Besides the main output table containing differential expression estimates for each probe / gene, EBarrays provides two diagnostic plots named EBarrays CCV and EBarrays Marginal fit. These plots enable a judgment about whether assumptions of the approach hold and how well the fitted model represents the data (please refer to the documentation of the EBarrays Bioconductor package for further details). Examples of an output table, a CCV plot and a Marginal fit plot are shown at the end of this section.

_images/image107.png

_images/931951aad4aaba4c23b5dd6413b946a6.png

_images/69183a6af9a6bef8bb30ceebc3fad7c5.png

Reference:
Kendziorski, C.M., Newton, M.A., Lan, H., Gould, M.N. (2003). On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine 22:3899-3914.