nf-core/rnaseq is a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. It takes a samplesheet and FASTQ files as input, performs quality control (QC), trimming and (pseudo-)alignment, and produces a gene expression matrix and extensive QC report.
This repository is a customized version of nf-core/rnaseq v3.14.0, adapted with specific plugins for the UKHD environment. The pipeline includes essential default parameters and disables unnecessary features. For detailed information, refer to the full nf-core/rnaseq documentation here.
- Prepare Reference Files: Obtain the reference genome, annotation file, and BED file.
- Sub-sample FastQ Files and Infer Strandedness: Use fq and Salmon to sub-sample FastQ files and auto-infer strandedness.
- Read Quality Control (QC): Assess read quality with FastQC.
- Adapter and Quality Trimming: Trim adapters, low-quality bases, and polyA tails using [Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/).
- Alignment and Quantification
- Sort and Index Alignments: Use SAMtools to sort and index alignment files.
- Duplicate Read Marking: Mark duplicate reads (do not remove) with picard MarkDuplicates.
- Transcript Assembly and Quantification: Assemble transcripts and perform quantification using StringTie.
- Create BigWig Coverage Files: Create bigWig files for IGV to visualize coverage tracks and analyze read distribution and genomic features using BEDTools and bedGraphToBigWig.
- Extensive Quality Control
- Present QC Results: Present quality control metrics for raw reads, alignment, gene biotype, sample similarity, and strand-specificity using MultiQC and R.
First, install Nextflow using Conda and set up the required plugins and environment variables.
# Install Nextflow via Conda
conda install -c bioconda nextflow
# Clone the pipeline repository
git clone https://github.com/lucianhu/rnaseq_nextflow.git
# Copy the plugins directory into the Nextflow home directory
mkdir -p ~/.nextflow
cp -r rnaseq_nextflow/plugins ~/.nextflow/
# Set the Conda cache directory for Nextflow
echo "export NXF_CONDA_CACHEDIR=$HOME/.conda/cache_nextflow" >> ~/.bashrc
# Source the updated bashrc to apply the changes
source ~/.bashrc
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow.
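Before moving on, it can help to confirm that the setup took effect. The following is a minimal sketch, not part of the pipeline: the helper name `check_tool` is hypothetical, and it only reports whether a command is on the PATH and what the Conda cache variable is currently set to.

```shell
# Hypothetical helper: succeed if the named command is on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check the tools installed above; '|| true' keeps going so both are reported.
check_tool conda || true
check_tool nextflow || true

# Show where Nextflow will cache Conda environments
# (prints <unset> if ~/.bashrc has not been sourced in this shell).
echo "NXF_CONDA_CACHEDIR=${NXF_CONDA_CACHEDIR:-<unset>}"
```

If either tool is reported as missing, repeat the corresponding installation step before running the pipeline.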
First, prepare a samplesheet with your input data. The samplesheet should be a CSV file formatted as follows:
samplesheet.csv
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L001_R1_001.fastq.gz,AEG588A1_S1_L001_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP2,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP2,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
CONTROL_REP3,AEG588A1_S1_L005_R1_001.fastq.gz,AEG588A1_S1_L005_R2_001.fastq.gz,auto
CONTROL_REP3,AEG588A1_S1_L006_R1_001.fastq.gz,AEG588A1_S1_L006_R2_001.fastq.gz,auto
TREATMENT_REP1,AEG588A1_S1_L007_R1_001.fastq.gz,AEG588A1_S1_L007_R2_001.fastq.gz,auto
TREATMENT_REP1,AEG588A1_S1_L008_R1_001.fastq.gz,AEG588A1_S1_L008_R2_001.fastq.gz,auto
TREATMENT_REP2,AEG588A1_S1_L009_R1_001.fastq.gz,AEG588A1_S1_L009_R2_001.fastq.gz,auto
TREATMENT_REP2,AEG588A1_S1_L010_R1_001.fastq.gz,AEG588A1_S1_L010_R2_001.fastq.gz,auto
TREATMENT_REP3,AEG588A1_S1_L011_R1_001.fastq.gz,AEG588A1_S1_L011_R2_001.fastq.gz,auto
TREATMENT_REP3,AEG588A1_S1_L012_R1_001.fastq.gz,AEG588A1_S1_L012_R2_001.fastq.gz,auto
Each row represents a single-end fastq file or a pair of fastq files (paired-end). Rows with the same sample identifier are considered technical replicates and will be merged automatically. The strandedness of the library preparation will be automatically inferred if set to auto.
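To preview which rows will be treated as technical replicates, you can count the rows per sample identifier. This is a minimal sketch that assumes a comma-separated samplesheet with a header row and no quoted commas inside fields; the function name `rows_per_sample` is illustrative and not part of the pipeline.

```shell
# Illustrative helper: print "sample,row_count" for each sample identifier.
# Assumes a header on line 1 and plain comma-separated fields.
rows_per_sample() {
  awk -F',' 'NR > 1 && $1 != "" { n[$1]++ } END { for (s in n) print s "," n[s] }' "$1" | sort
}

# Usage: samples with a count above 1 have rows that will be merged.
if [ -f samplesheet.csv ]; then
  rows_per_sample samplesheet.csv
fi
```

For the example samplesheet above, every sample identifier appears on two rows, so each sample would be reported with a count of 2.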
Prepare a YAML file with the necessary pipeline parameters.
nf-params-rna.yaml
input: '/path/to/samplesheet.csv'
outdir: '/path/to/output_dir'
fasta: '/path/to/references/genome.fa'
gtf: '/path/to/references/annotation.gtf'
gene_bed: '/path/to/references/annotation.bed' # If this file is not available, the pipeline will automatically create it
star_index: '/path/to/genome/star_index_dir'
save_trimmed: true
save_align_intermeds: true
Warning
Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
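A mistyped path in the parameter file is an easy mistake to make, so it may be worth checking the paths before launching. The sketch below assumes the simple `key: '/absolute/path'` layout shown above and is not a general YAML parser; the helper name `missing_param_paths` is hypothetical. It skips `outdir`, on the assumption that the output directory does not need to exist beforehand.

```shell
# Hypothetical helper: list quoted absolute paths in a params file that do not exist.
# Handles only simple "key: '/path'" lines as in the example above, not general YAML.
# 'outdir' is skipped on the assumption that it is created at run time.
missing_param_paths() {
  sed -n "s/^\([A-Za-z_]*\):[[:space:]]*'\(\/[^']*\)'.*/\1 \2/p" "$1" |
    while read -r key path; do
      [ "$key" = "outdir" ] && continue
      [ -e "$path" ] || echo "$key: $path"
    done
}

# Usage: no output means every listed path was found.
if [ -f nf-params-rna.yaml ]; then
  missing_param_paths nf-params-rna.yaml
fi
```

Boolean entries such as `save_trimmed: true` are ignored because they are not quoted paths.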
Execute the Nextflow pipeline using the following command:
$ nextflow run /path/to/rnaseq_nextflow/main.nf -profile conda -params-file nf-params-rna.yaml -work-dir path/to/work_dir
Warning
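Running the pipeline with Docker generally requires administrative privileges, because access to the Docker daemon is effectively root access, and this can be a security risk on shared systems. Using Conda avoids this and allows users to run Nextflow without special permissions.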
To run a test sample, navigate to the test datasets directory and execute the pipeline:
$ cd rnaseq_nextflow/test-datasets
$ nextflow run /path/to/rnaseq_nextflow/main.nf -profile conda -params-file nf-params-rna.yaml -work-dir path/to/work_dir
This will process the test datasets and confirm that the pipeline is functioning correctly.
For more details and further functionality, please refer to the usage documentation and the parameter documentation.
For a summary of the results and a brief explanation of the tools used in the UKHD system, please visit: UKHD Output Documentation. For detailed outputs and reports with full parameters, refer to the nf-core output documentation.
This pipeline quantifies RNA-sequenced reads against genes/transcripts in the genome and normalizes the data. However, it does not perform statistical comparisons to determine significance using FDR or P-values. For further analysis, you can examine the output files in statistical environments such as R or Julia, or use the nf-core/differentialabundance pipeline.
A short talk about the history, current status and functionality on offer in this pipeline was given by Harshil Patel (@drpatelh) on 8th February 2022 as part of the nf-core/bytesize series.
You can find numerous talks on the nf-core events page covering various topics, including writing pipelines/modules in Nextflow DSL2, using nf-core tooling, and running nf-core pipelines, as well as more generic content like contributing to GitHub. Please check them out!
These scripts were originally written for use at the National Genomics Infrastructure, part of SciLifeLab in Stockholm, Sweden, by Phil Ewels (@ewels) and Rickard Hammarén (@Hammarn).
The pipeline was re-written in Nextflow DSL2 and is primarily maintained by Harshil Patel (@drpatelh) from Seqera Labs, Spain.
The pipeline workflow diagram was initially designed by Sarah Guinchard (@G-Sarah) and James Fellows Yates (@jfy133); further modifications were made by Harshil Patel (@drpatelh) and Maxime Garcia (@maxulysse).
Many thanks to others who have helped out along the way too, including (but not limited to):
- Alex Peltzer
- Colin Davenport
- Denis Moreno
- Edmund Miller
- Gregor Sturm
- Jacki Buros Novik
- Lorena Pantano
- Matthias Zepper
- Maxime Garcia
- Olga Botvinnik
- @orzechoj
- Paolo Di Tommaso
- Rob Syme
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the Slack #rnaseq channel (you can join with this invite).
If you use nf-core/rnaseq for your analysis, please cite it using the following doi: 10.5281/zenodo.1400710
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.


