GitHub - lucianhu/rnaseq_nextflow: RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.

Introduction

nf-core/rnaseq is a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. It takes a samplesheet and FASTQ files as input, performs quality control (QC), trimming and (pseudo-)alignment, and produces a gene expression matrix and extensive QC report.

This repository is a customized version of nf-core/rnaseq v3.14.0, adapted with specific plugins for the UKHD environment. The pipeline includes essential default parameters and disables unnecessary features. For detailed information, refer to the full nf-core/rnaseq documentation here.

RNA-seq Data Analysis Workflow

Prepare Reference Files: Obtain the reference genome, annotation file, and bed file.
Sub-sample FastQ Files and Infer Strandedness: Use fq and Salmon to sub-sample FastQ files and auto-infer strandedness.
Read Quality Control (QC): Assess read quality with FastQC.
Adapter and Quality Trimming: Trim adapters, low-quality bases, and polyA tails using Trim Galore! (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/).
Alignment and Quantification
- Perform alignment with STAR.
- Quantify transcripts with Salmon.
Sort and Index Alignments: Use SAMtools to sort and index alignment files.
Duplicate Read Marking: Mark (but do not remove) duplicate reads with Picard MarkDuplicates.
Transcript Assembly and Quantification: Assemble transcripts and perform quantification using StringTie.
Create BigWig Coverage Files: Generate bigWig coverage tracks with BEDTools and bedGraphToBigWig, which can be loaded into IGV to inspect read distribution across genomic features.
Extensive Quality Control
- Technical/Biological Read Duplication: Assess with dupRadar.
- Various RNA-seq QC Metrics: Evaluate using Qualimap and RSeQC.
Present QC Results: Present quality control metrics for raw reads, alignment, gene biotype, sample similarity, and strand-specificity using MultiQC and R.

Installation

First, install Nextflow using Conda and set up the required plugins and environment variables.

# Install Nextflow via Conda
conda install -c bioconda nextflow

# Clone the pipeline repository
git clone https://github.com/lucianhu/rnaseq_nextflow.git

# Copy plugins to the Nextflow directory (create it first if it does not exist)
mkdir -p ~/.nextflow
cp -r rnaseq_nextflow/plugins ~/.nextflow/

# Set the Conda cache directory for Nextflow
echo "export NXF_CONDA_CACHEDIR=$HOME/.conda/cache_nextflow" >> ~/.bashrc

# Source the updated bashrc to apply the changes
source ~/.bashrc
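
To confirm that the steps above took effect, a small check can be run in a new shell. This is a minimal sketch and not part of the repository: the function name `check_setup` is hypothetical, and it assumes the plugins were copied to `~/.nextflow/plugins` and that `NXF_CONDA_CACHEDIR` is exported as shown above.

```shell
# Hypothetical helper (not shipped with the pipeline): report whether the
# installation steps left the expected pieces in place.
check_setup() {
    setup_rc=0
    if command -v nextflow >/dev/null 2>&1; then
        echo "ok: nextflow found on PATH"
    else
        echo "missing: nextflow is not on PATH"
        setup_rc=1
    fi
    if [ -d "$HOME/.nextflow/plugins" ]; then
        echo "ok: plugins directory at $HOME/.nextflow/plugins"
    else
        echo "missing: plugins directory at $HOME/.nextflow/plugins"
        setup_rc=1
    fi
    if [ -n "${NXF_CONDA_CACHEDIR:-}" ]; then
        echo "ok: NXF_CONDA_CACHEDIR=$NXF_CONDA_CACHEDIR"
    else
        echo "missing: NXF_CONDA_CACHEDIR is not set"
        setup_rc=1
    fi
    return "$setup_rc"
}
```

Calling `check_setup` prints one line per item and returns a non-zero status if anything is missing.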

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow.

Usage

Prepare a Samplesheet

First, prepare a samplesheet with your input data. The samplesheet should be a CSV file formatted as follows:

samplesheet.csv

sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L001_R1_001.fastq.gz,AEG588A1_S1_L001_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP2,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP2,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
CONTROL_REP3,AEG588A1_S1_L005_R1_001.fastq.gz,AEG588A1_S1_L005_R2_001.fastq.gz,auto
CONTROL_REP3,AEG588A1_S1_L006_R1_001.fastq.gz,AEG588A1_S1_L006_R2_001.fastq.gz,auto
TREATMENT_REP1,AEG588A1_S1_L007_R1_001.fastq.gz,AEG588A1_S1_L007_R2_001.fastq.gz,auto
TREATMENT_REP1,AEG588A1_S1_L008_R1_001.fastq.gz,AEG588A1_S1_L008_R2_001.fastq.gz,auto
TREATMENT_REP2,AEG588A1_S1_L009_R1_001.fastq.gz,AEG588A1_S1_L009_R2_001.fastq.gz,auto
TREATMENT_REP2,AEG588A1_S1_L010_R1_001.fastq.gz,AEG588A1_S1_L010_R2_001.fastq.gz,auto
TREATMENT_REP3,AEG588A1_S1_L011_R1_001.fastq.gz,AEG588A1_S1_L011_R2_001.fastq.gz,auto
TREATMENT_REP3,AEG588A1_S1_L012_R1_001.fastq.gz,AEG588A1_S1_L012_R2_001.fastq.gz,auto

Each row represents a single-end fastq file or a pair of fastq files (paired-end). Rows with the same sample identifier are considered technical replicates and will be merged automatically. The strandedness of the library preparation will be automatically inferred if set to auto.
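
The rules above can be checked before launching a run. The following is a minimal sketch rather than anything the pipeline provides: the function name `check_samplesheet` is hypothetical, and the accepted strandedness values (`auto`, `forward`, `reverse`, `unstranded`) and the `.fastq.gz`/`.fq.gz` suffix rule are taken from upstream nf-core/rnaseq and assumed to apply unchanged to this customized version.

```shell
# Hypothetical helper: basic sanity checks for a samplesheet CSV.
# Prints one line per problem and returns non-zero if any were found.
check_samplesheet() {
    awk -F',' '
        { sub(/\r$/, "") }    # tolerate Windows line endings
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2,strandedness") {
                print "line 1: unexpected header: " $0
                bad = 1
            }
            next
        }
        $0 == "" { next }     # skip blank lines
        NF != 4 {
            print "line " NR ": expected 4 columns, found " NF
            bad = 1
            next
        }
        $1 == "" || $2 == "" {
            print "line " NR ": sample and fastq_1 must not be empty"
            bad = 1
        }
        # fastq_2 may be empty for single-end data
        $2 !~ /\.f(ast)?q\.gz$/ || ($3 != "" && $3 !~ /\.f(ast)?q\.gz$/) {
            print "line " NR ": FASTQ names should end in .fastq.gz or .fq.gz"
            bad = 1
        }
        $4 !~ /^(auto|forward|reverse|unstranded)$/ {
            print "line " NR ": unknown strandedness: " $4
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, `check_samplesheet samplesheet.csv && echo "samplesheet looks fine"` prints the confirmation only when no problems are reported. The check is purely syntactic; it does not verify that the FASTQ files exist or that R1 and R2 files are correctly paired.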

Prepare a Parameters File

Prepare a YAML file with the necessary pipeline parameters.

nf-params-rna.yaml

input: '/path/to/samplesheet.csv'
outdir: '/path/to/output_dir'
fasta: '/path/to/references/genome.fa'
gtf: '/path/to/references/annotation.gtf'
gene_bed: '/path/to/references/annotation.bed' # If this file is not available, the pipeline will automatically create it
star_index: '/path/to/genome/star_index_dir'
save_trimmed: true
save_align_intermeds: true
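
A mistyped path in this file is only discovered once the run has started, so it can help to check the paths first. The sketch below is an illustration under stated assumptions: `check_param_paths` is a hypothetical name, and the parsing is deliberately naive, handling only the flat one-line `key: 'value'` layout shown above (it is not a YAML parser and would mis-handle paths containing `#`). `gene_bed` and `outdir` are left out because they may legitimately not exist yet.

```shell
# Hypothetical helper: confirm that the input paths named in a flat
# "key: 'value'" params file exist on disk.
check_param_paths() {
    params_file=$1
    missing=0
    for key in input fasta gtf star_index; do
        # Take the text after "key:", then strip a trailing comment,
        # trailing whitespace, and surrounding quotes.
        value=$(sed -n "s/^${key}:[[:space:]]*//p" "$params_file" \
            | sed -e "s/[[:space:]]*#.*$//" \
                  -e "s/[[:space:]]*$//" \
                  -e "s/^['\"]//" -e "s/['\"]$//" \
            | head -n 1)
        if [ -z "$value" ]; then
            echo "not set: $key"
        elif [ ! -e "$value" ]; then
            echo "not found: $key -> $value"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_param_paths nf-params-rna.yaml` lists any path that does not exist and returns a non-zero status in that case; keys that are absent from the file are reported but not treated as errors.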

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

Run the Pipeline

Execute the Nextflow pipeline using the following command:

$ nextflow run /path/to/rnaseq_nextflow/main.nf -profile conda -params-file nf-params-rna.yaml -work-dir /path/to/work_dir

Warning

Running the pipeline with Docker generally requires root-equivalent privileges on the host (for example, membership of the docker group), which is a security risk on shared systems. The Conda profile avoids this and lets users run the pipeline without special permissions.

Run Testing Sample

To run a test sample, navigate to the test datasets directory and execute the pipeline:

$ cd rnaseq_nextflow/test-datasets
$ nextflow run /path/to/rnaseq_nextflow/main.nf -profile conda -params-file nf-params-rna.yaml -work-dir /path/to/work_dir

This processes the test datasets and confirms that the pipeline is functioning correctly.

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

For a summary of the results and a brief explanation of the tools used in the UKHD system, please visit: UKHD Output Documentation. For detailed outputs and reports with full parameters, refer to the nf-core output documentation.

This pipeline quantifies RNA-sequenced reads against genes/transcripts in the genome and normalizes the data. However, it does not perform statistical comparisons to determine significance using FDR or P-values. For further analysis, you can examine the output files in statistical environments such as R or Julia, or use the nf-core/differentialabundance pipeline.
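
Before moving to R or Julia, a quick look at the count matrix from the shell can catch obvious problems such as a sample with almost no reads. The sketch below assumes a tab-separated gene-count table whose first two columns are `gene_id` and `gene_name`, followed by one column per sample; upstream nf-core/rnaseq writes such a file as `star_salmon/salmon.merged.gene_counts.tsv`, but both the file name and the layout are assumptions for this customized version. The function name `sum_counts_per_sample` is hypothetical.

```shell
# Hypothetical helper: sum each sample column of a gene-count TSV whose
# first two columns are gene_id and gene_name (an assumed layout).
sum_counts_per_sample() {
    awk -F'\t' '
        NR == 1 {
            # Remember the sample names from the header row.
            for (i = 3; i <= NF; i++) name[i] = $i
            ncol = NF
            next
        }
        { for (i = 3; i <= ncol; i++) total[i] += $i }
        END {
            for (i = 3; i <= ncol; i++) printf "%s\t%.1f\n", name[i], total[i]
        }
    ' "$1"
}
```

The output is one line per sample with its total count. This is only a sanity check; it is not a substitute for the statistical analysis mentioned above.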

Online videos

A short talk about the history, current status and functionality on offer in this pipeline was given by Harshil Patel (@drpatelh) on 8th February 2022 as part of the nf-core/bytesize series.

You can find numerous talks on the nf-core events page on various topics, including writing pipelines/modules in Nextflow DSL2, using nf-core tooling, and running nf-core pipelines, as well as more generic content such as contributing to GitHub. Please check them out!

Credits

These scripts were originally written for use at the National Genomics Infrastructure, part of SciLifeLab in Stockholm, Sweden, by Phil Ewels (@ewels) and Rickard Hammarén (@Hammarn).

The pipeline was re-written in Nextflow DSL2 and is primarily maintained by Harshil Patel (@drpatelh) from Seqera Labs, Spain.

The pipeline workflow diagram was initially designed by Sarah Guinchard (@G-Sarah) and James Fellows Yates (@jfy133); further modifications were made by Harshil Patel (@drpatelh) and Maxime Garcia (@maxulysse).

Many thanks to others who have helped out along the way too.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #rnaseq channel (you can join with this invite).

Citations

If you use nf-core/rnaseq for your analysis, please cite it using the following doi: 10.5281/zenodo.1400710

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
