
Celatlas Spatial - Spatial Transcriptomics Analysis Pipeline


Celatlas Spatial is a comprehensive spatial transcriptomics analysis pipeline that provides end-to-end processing from raw sequencing data to publication-ready results. Version 1.7.0 introduces major enhancements including H&E staining-guided analysis (HE mode), flexible parameter configuration, and optimized multi-threading support.

Installation & Setup

Installation Methods

Choose the installation method that best fits your situation:

# Option 1: Download and unzip the package, then install
# 1. Download the zip file from the release page
# 2. Navigate to the directory where you downloaded the file
# 3. Unzip the file to a folder
unzip celatlas_spatial-main.zip
cd celatlas_spatial-main
pip install dist/celatlas_spatial-1.7.0-py3-none-any.whl

# Option 2: git clone
git clone [repository_url]
cd Celatlas_spatial
pip install dist/celatlas_spatial-1.7.0-py3-none-any.whl

Download Deep Learning Model (Required)

IMPORTANT: Download the Swin-UNET model before running analysis:

# 1. Download swin_tiny_model.tar.gz (97MB) from GitHub Release

# 2. Extract to workspace src directory
mkdir -p /mnt/strna/celatlas_spatial/src
tar -xzf swin_tiny_model.tar.gz -C /mnt/strna/celatlas_spatial/src/

# If using custom workspace:
# mkdir -p $CELATLAS_WORKSPACE/src
# tar -xzf swin_tiny_model.tar.gz -C $CELATLAS_WORKSPACE/src/

# 3. Verify extraction
ls -lh /mnt/strna/celatlas_spatial/src/swin_tiny.pth
# Should show ~106MB swin_tiny.pth file

Model Details:

  • Filename: swin_tiny.pth
  • Size: ~106MB
  • Location: $CELATLAS_WORKSPACE/src/swin_tiny.pth
  • Purpose: Tissue segmentation for all analysis modes

Verification

Verify installation success:

celatlas_spatial --help

Requirements

  • Python: >= 3.9
  • External tools: STAR, SAMtools, Subread, TRUST4 (install via conda)
  • Python packages: Automatically installed (PyTorch, Pandas, NumPy, Plotly, Jinja2)
  • Hardware (HE mode):
    • GPU: NVIDIA GPU with CUDA 12 support (recommended for deep learning tissue segmentation)
    • CPU: Multi-core processor (16+ cores recommended)
    • RAM: 64GB+ (128GB recommended for large tissue sections)
    • Storage: 100GB+ free space for temporary files and results

Prerequisites

Important: STAR and the other external tools (SAMtools, Subread, TRUST4) are not installed by pip. Install them separately before using Celatlas Spatial:

conda install --file conda_pkgs.txt

Deep Learning Model Setup (Required)

The Swin-UNET model (swin_tiny.pth, ~106MB) must be present at $CELATLAS_WORKSPACE/src/swin_tiny.pth before any analysis mode is run; see "Download Deep Learning Model (Required)" above for the download and extraction steps. The default workspace is /mnt/strna/celatlas_spatial and can be changed with the CELATLAS_WORKSPACE environment variable.

Reference Genome Setup

Before running analysis, you need to build reference genome indices:

1. Create reference directory structure:

mkdir -p reference/Homo_sapiens
or
mkdir -p reference/Mus_musculus
or
mkdir -p reference/others

2. Download genome files (example for human):

Note: For detailed parameters and the latest guidelines, refer to the official STAR documentation at https://github.com/alexdobin/STAR.

cd reference/Homo_sapiens
# Download genome FASTA
wget http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# Download GTF annotation
wget http://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
gunzip Homo_sapiens.GRCh38.110.gtf.gz

3. Build STAR index:

celatlas_spatial rna mkref \
    --genome_name Homo_sapiens \
    --fasta reference/Homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    --gtf reference/Homo_sapiens/Homo_sapiens.GRCh38.110.gtf \
    --thread 8

Quick Start

Using the Celatlas Pipeline Script

After installation, you can run the complete spatial transcriptomics analysis pipeline using the Celatlas.sh script. Version 1.7.0 introduces flexible parameter modes - you can use traditional positional arguments or modern named arguments, and optionally add performance tuning parameters.

# Modern usage (Recommended - with named arguments)
Celatlas.sh --sample <name> --chip <chip> --casno <case> --chemistry <chem> \
            --species <species> --method <method> --mode <mode> [OPTIONS]

# Legacy usage (Positional arguments - backward compatible)
# 7-parameter form:
Celatlas.sh <sample_name> <chip_number> <casno> <chemistry> <species> <method> <mode> [OPTIONS]
# 6-parameter form (sample_name omitted):
Celatlas.sh <chip_number> <casno> <chemistry> <species> <method> <mode> [OPTIONS]

# Hybrid mode (Recommended for flexibility)
Celatlas.sh <positional args...> --thread 32 --bin 10,50,100

Required Parameters

| Parameter | Description | Valid Values | Example |
| --- | --- | --- | --- |
| sample_name | Sample identifier (optional in 6-param mode) | Any string | DEMO |
| chip_number | Chip/slide number | Unique identifier | ST110001_A1 |
| casno | Case/project number for organizing results | Project ID | HE_TEST |
| chemistry | Chemistry version of the kit | BBV0, BBV2.4, BBV3 | BBV2.4 |
| species | Target species for alignment | Homo_sapiens, Mus_musculus, others | Mus_musculus |
| method | Analysis method (see detailed explanation below) | HE, ssDNA, gene_expr | HE |
| mode | Pipeline mode | strna, scrna | strna |

Analysis Methods - Detailed Explanation

Understanding the --method parameter is crucial for successful analysis:

| Method | Full Name | When to Use | Required Files | Key Features |
| --- | --- | --- | --- | --- |
| HE | H&E Staining Guided Analysis | Most common: when you have H&E stained histology images | FASTQ files; {chip}_he.tif (or .png, .jpg); barcode position files | ✨ NEW in v1.7.0; gene expression-based tissue detection; HE image registration; combined visualization; optimal for histological context |
| ssDNA | Single-strand DNA Probes | When using fluorescent ssDNA tissue imaging | FASTQ files; {chip}.tif (tissue image); barcode position files | Tissue boundary from imaging; traditional image segmentation; alias for image mode |
| gene_expr | Gene Expression Only | When you have NO imaging data | FASTQ files; barcode position files | Pure computational tissue detection; UMI-based segmentation; no image required |

🚨 CRITICAL WARNING: HE Image File Naming

The _he suffix is MANDATORY!

| Your Chip Number | ✅ Correct Filename | ❌ Common Mistakes |
| --- | --- | --- |
| ST110001_A1 | ST110001_A1_he.tif | ST110001_A1.tif (missing _he) |
| ST110001 | ST110001_he.png OR ST110001_HE.png | ST110001.png (missing _he/_HE) |
| ST110001 | ST110001_he.jpg OR ST110001_HE.jpg | ST110001he.tif (missing underscore) |

Note: The suffix is case-insensitive: _he, _HE, _He, and _hE are all supported.

Pipeline behavior:

  • With _he suffix → HE mode (gene expression + H&E registration)
  • Without _he suffix → ssDNA mode (looks for {chip}.tif fluorescent image)
  • Wrong format (e.g. missing underscore) → ERROR: HE mode requires H&E staining image!

Important Notes:

  • HE mode is the recommended method for most applications because it combines gene expression data with histological information
  • The HE image filename must follow the pattern: {chip_number}_he.(tif|png|jpg|jpeg)
  • For example, if chip_number=ST110001_A1, the HE image should be named ST110001_A1_he.tif (or .png, .jpg, .jpeg)
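The naming rule can be checked before a run is started. The following is a minimal sketch and not part of Celatlas Spatial: the function name `is_valid_he_image` is hypothetical, and it assumes the chip number must match exactly while the `_he` suffix and the extension are matched case-insensitively, as the note above states.

```python
import re

# Suffix and extension pattern taken from the rule {chip_number}_he.(tif|png|jpg|jpeg)
_HE_SUFFIX = re.compile(r"_he\.(tif|png|jpg|jpeg)", re.IGNORECASE)


def is_valid_he_image(filename: str, chip_number: str) -> bool:
    """Return True if filename is exactly {chip_number}_he.<ext>."""
    if not filename.startswith(chip_number):
        return False
    # Everything after the chip number must be the suffix plus extension,
    # so extra text such as "_he_v2.tif" is rejected.
    remainder = filename[len(chip_number):]
    return _HE_SUFFIX.fullmatch(remainder) is not None
```

Calling `is_valid_he_image("ST110001_A1he.tif", "ST110001_A1")` returns False because the underscore is missing, which is one of the mistakes listed above.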

Optional Parameters (NEW in v1.7.0)

These parameters allow fine-tuning of performance and resolution:

| Parameter | Description | Default | Valid Range | Recommendation |
| --- | --- | --- | --- | --- |
| --thread | Number of CPU threads | Auto-detect | 1-128 | Use 80% of available cores, e.g., --thread 64 on an 80-core machine |
| --bin | Bin sizes in micrometers (comma-separated) | 50 | 10,20,50,100 | 10: highest resolution (~single cell); 20: high resolution; 50: balanced (recommended); 100: more genes, lower resolution |
| --insertR2 | Insert R2 fragment size (bp) | 150 | 50-300 | Match your sequencing read length |
| --cell_num | Expected number of cells/spots | 50000 | 1000-200000 | Adjust based on tissue size |
| --pixelSize | Pixel size in micrometers | 0.5 | 0.1-2.0 | Check your imaging system specs |
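As an illustration of the --thread recommendation and the --bin value list, the sketch below derives a thread count from the core count and validates a comma-separated bin string. The helper names `recommended_threads` and `parse_bins` are hypothetical and are not provided by the pipeline; the 80% rule, the 1-128 range, and the valid bin sizes come from the table above.

```python
VALID_BINS = (10, 20, 50, 100)  # bin sizes (micrometers) listed as valid for --bin
MAX_THREADS = 128               # upper end of the documented --thread range


def recommended_threads(total_cores: int, percent: int = 80) -> int:
    """Return roughly `percent` of the cores, clamped to the range 1-128."""
    threads = total_cores * percent // 100  # integer math avoids float rounding
    return max(1, min(threads, MAX_THREADS))


def parse_bins(value: str) -> list[int]:
    """Parse a --bin string such as '10,50,100' and reject unsupported sizes."""
    bins = [int(token) for token in value.split(",") if token.strip()]
    invalid = [b for b in bins if b not in VALID_BINS]
    if not bins or invalid:
        raise ValueError(f"unsupported bin sizes: {invalid or value!r}")
    return bins
```

On an 80-core machine `recommended_threads(80)` gives 64, matching the example in the table.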

Path Customization (NEW in v1.7.0)

For users with shared data environments or custom storage layouts, you can now specify custom paths for reference genomes, images, and FASTQ files:

| Parameter | Description | Default | Use Case |
| --- | --- | --- | --- |
| --reference_dir | Custom reference genome directory | $CELATLAS_WORKSPACE/reference | Shared reference across multiple projects |
| --mask_dir | Custom mask/barcode directory | $CELATLAS_WORKSPACE/ST_mask | Centralized barcode storage |
| --image_dir | Custom image directory | $CELATLAS_WORKSPACE/images | Separate image storage location |
| --fastq_dir | Custom FASTQ directory | $CELATLAS_WORKSPACE/fastq/$chemistry | Avoid duplicate data copies, use original sequencing location |
| --fastq_name | Custom FASTQ filename prefix | chip_number | When FASTQ files have different names than chip numbers |

Benefits:

  • ✅ Avoid data duplication: Point directly to original sequencing files
  • ✅ Shared resources: Multiple users can share reference genomes and images
  • ✅ Flexible organization: Adapt to existing storage infrastructure
  • ✅ Storage savings: No need to copy large files into workspace

FASTQ File Format Support (Enhanced in v1.7.0)

The pipeline now supports 4 FASTQ naming formats with automatic detection:

| Priority | Format | Pattern | Example |
| --- | --- | --- | --- |
| 1 | Multi-lane (Recommended) | {chip}_S*_L*_R1_*.fastq.gz | ST110001_A1_S1_L001_R1_001.fastq.gz, ST110001_A1_S1_L002_R1_001.fastq.gz |
| 2 | Multi-fold (Legacy) | {chip}_fold{1-5}_1.fq.gz | ST110001_A1_fold1_1.fq.gz, ST110001_A1_fold2_1.fq.gz |
| 3 | Simple format | {chip}_1.fq.gz | ST110001_A1_1.fq.gz, ST110001_A1_2.fq.gz |
| 4 | _R1/_R2 format | {chip}_R1.fq.gz | ST110001_A1_R1.fq.gz, ST110001_A1_R2.fq.gz |

Note: The pipeline automatically detects and uses the first available format.
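The priority order can be sketched as a loop over glob patterns that stops at the first format with at least one R1 file. This is an illustration under stated assumptions, not the pipeline's own detection code: `detect_fastq_format` and `FASTQ_FORMATS` are hypothetical names, and only the R1 patterns from the table are checked.

```python
from pathlib import Path

# (label, R1 glob template) in the documented priority order
FASTQ_FORMATS = [
    ("multi-lane", "{prefix}_S*_L*_R1_*.fastq.gz"),
    ("multi-fold", "{prefix}_fold[1-5]_1.fq.gz"),
    ("simple", "{prefix}_1.fq.gz"),
    ("R1/R2", "{prefix}_R1.fq.gz"),
]


def detect_fastq_format(fastq_dir, prefix):
    """Return (label, sorted R1 paths) for the first matching format, else None."""
    directory = Path(fastq_dir)
    for label, template in FASTQ_FORMATS:
        matches = sorted(directory.glob(template.format(prefix=prefix)))
        if matches:
            return label, matches
    return None
```

The `prefix` argument would be the chip number by default, or the value passed to --fastq_name when the files are named differently.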

Usage Examples

Example 1: Basic HE Mode (Most Common)

# Using H&E stained image for tissue analysis
# Required files: ST110001_A1_he.tif (the "_he" suffix is mandatory), FASTQ files, barcode files
bash Celatlas.sh DEMO ST110001_A1 HE_test BBV2.4 Mus_musculus HE strna

🚨 CRITICAL: HE Mode File Naming Rule

The HE image MUST have the _he suffix before the file extension!

✅ Correct: ST110001_A1_he.tif (or .png, .jpg, .jpeg)
✅ Also correct: ST110001_A1_HE.tif → Case-insensitive (both _he and _HE work)
❌ Wrong: ST110001_A1.tif → This will be treated as an ssDNA image!
❌ Wrong: ST110001_A1he.tif → Missing underscore before he

Why this matters: Without the _he suffix, the pipeline cannot find the H&E image, so HE mode fails with an error.

Example 2: HE Mode with Performance Optimization

# High-performance mode with 64 threads and single bin size
bash Celatlas.sh DEMO ST110001_A1 HE_test BBV2.4 Mus_musculus HE strna \
  --thread 64 --bin 50

Example 3: Multi-Resolution Analysis

# Generate multiple bin sizes for comparison (10µm, 20µm, 50µm, 100µm)
# Useful for exploring optimal spatial resolution
bash Celatlas.sh DEMO ST110001_A1 multi_res BBV2.4 Mus_musculus HE strna \
  --thread 32 --bin 10,20,50,100

Example 4: Gene Expression Only (No Image)

# When you don't have tissue images
bash Celatlas.sh DEMO ST110001_A1 expr_only BBV2.4 Mus_musculus gene_expr strna

Example 5: ssDNA Probe Mode

# When using fluorescent ssDNA tissue imaging
# Required: ST110001_A1.tif (tissue image)
bash Celatlas.sh DEMO ST110001_A1 ssDNA_test BBV2.4 Homo_sapiens ssDNA strna \
  --thread 48

Example 6: scRNA-seq Mode

# Single-cell RNA-seq analysis (no spatial information)
bash Celatlas.sh DEMO ST110001_A1 scrna_test BBV0 Homo_sapiens gene_expr scrna

Example 7: Full Custom Configuration

# Complete parameter customization
bash Celatlas.sh DEMO ST110001_A1 custom_test BBV2.4 Homo_sapiens HE strna \
  --thread 48 \
  --bin 10,20,50,100 \
  --insertR2 120 \
  --cell_num 80000 \
  --pixelSize 0.5

Example 8: Pure Named Arguments (Good for Scripts)

# Recommended for automation and reproducibility
bash Celatlas.sh \
  --sample DEMO \
  --chip ST110001_A1 \
  --casno batch_001 \
  --chemistry BBV2.4 \
  --species Mus_musculus \
  --method HE \
  --mode strna \
  --thread 32 \
  --bin 50
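For automation, the named-argument form can be assembled programmatically. The sketch below is a hypothetical helper (`build_command` is not part of the package) that checks the required parameters and the documented values for --method and --mode, then returns an argument list suitable for `subprocess.run`. It uses only flags that appear in this README.

```python
REQUIRED = ("chip", "casno", "chemistry", "species", "method", "mode")
VALID_METHODS = {"HE", "ssDNA", "gene_expr"}
VALID_MODES = {"strna", "scrna"}


def build_command(params: dict, script: str = "Celatlas.sh") -> list[str]:
    """Turn a parameter dict into ['bash', script, '--key', 'value', ...]."""
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    if params["method"] not in VALID_METHODS:
        raise ValueError(f"unknown method: {params['method']}")
    if params["mode"] not in VALID_MODES:
        raise ValueError(f"unknown mode: {params['mode']}")
    command = ["bash", script]
    for key, value in params.items():
        command += [f"--{key}", str(value)]
    return command
```

Passing the returned list to `subprocess.run(command, check=True)` avoids shell quoting issues when chip or case identifiers contain unusual characters.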

Example 9: Custom Directories for Shared Data Environment ✨ NEW

# Ideal for production environments with centralized storage
bash Celatlas.sh \
  --chip ST110001_A1 \
  --casno shared_proj_001 \
  --chemistry BBV2.4 \
  --species Mus_musculus \
  --method HE \
  --mode strna \
  --reference_dir /data/shared/reference \
  --fastq_dir /data/sequencing/run001 \
  --mask_dir /data/shared/masks \
  --image_dir /data/shared/images

Why use custom directories?

  • Avoid copying 100+ GB FASTQ files
  • Share reference genomes across multiple projects
  • Use original sequencing output location directly

Example 10: Custom FASTQ Name (Different from Chip Number) ✨ NEW

# When FASTQ files have different naming than chip numbers
bash Celatlas.sh \
  --chip ST110001_A1 \
  --casno test_001 \
  --chemistry BBV2.4 \
  --species Mus_musculus \
  --method HE \
  --mode strna \
  --fastq_name "Sample_ABC_XYZ"

# Pipeline will look for:
#   Sample_ABC_XYZ_S1_L001_R1_001.fastq.gz (multi-lane)
#   Sample_ABC_XYZ_fold1_1.fq.gz (multi-fold)
#   Sample_ABC_XYZ_1.fq.gz (simple)
#   Sample_ABC_XYZ_R1.fq.gz (R1/R2 format)

Configuration

Sample Naming Convention (IMPORTANT - Changed in v1.7.0)

Understanding the sample naming system is critical for file organization:

Two-Part Naming System

The pipeline uses a two-part naming system to distinguish between sample identity and technical replicate:

  1. sample_name (Optional): Biological sample identifier

    • Example: DEMO, Sample001, MouseBrain_A
    • Used for: Grouping related samples, metadata tracking
    • Can be omitted (will show as "N/A" in reports)
  2. chip_number (Required): Chip/slide identifier (the actual data identifier)

    • Example: ST110001_A1, ST110001
    • Used for: Finding FASTQ files, matching image files, creating output directories
    • This is the primary identifier used throughout the pipeline

File Naming Requirements

All input files must use the chip_number as the base name:

| File Type | Naming Pattern | Example | Notes |
| --- | --- | --- | --- |
| FASTQ (Multi-lane) | {chip}_S*_L*_R1_*.fastq.gz, {chip}_S*_L*_R2_*.fastq.gz | ST110001_A1_S1_L001_R1_001.fastq.gz, ST110001_A1_S1_L002_R1_001.fastq.gz | ✅ Recommended format; automatically detects all lanes |
| FASTQ (Multi-fold) | {chip}_fold1_1.fq.gz, {chip}_fold2_1.fq.gz | ST110001_A1_fold1_1.fq.gz, ST110001_A1_fold1_2.fq.gz | Legacy format; supports fold1-fold5 |
| FASTQ (Simple) | {chip}_1.fq.gz, {chip}_2.fq.gz | ST110001_A1_1.fq.gz, ST110001_A1_2.fq.gz | Simple format |
| FASTQ (_R1/_R2) | {chip}_R1.fq.gz, {chip}_R2.fq.gz | ST110001_A1_R1.fq.gz, ST110001_A1_R2.fq.gz | Common alternative format |
| HE Image | {chip}_he.tif (or .png, .jpg, .jpeg) | ST110001_A1_he.tif | Required for HE mode |
| Tissue Image | {chip}.tif | ST110001_A1.tif | Required for ssDNA mode |
| Barcode Position | {chip}.barcodeToPos.h5, {chip}_FilterBarcodes.csv, {chip}_tissue_bbox.csv | ST110001_A1.barcodeToPos.h5 | Required for spatial modes |

⚠️ IMPORTANT: HE Mode Image Naming

The most common error in HE mode is incorrect image file naming. Please note:

  • Mandatory suffix: The HE image MUST have _he before the file extension
  • Case-insensitive: _he, _HE, _He, and _hE are all accepted
  • With underscore: Must be _he, not he (a missing underscore will fail)
  • No extra text: Must be {chip}_he.ext, not {chip}_he_anything.ext

Examples:

  • ✅ ST110001_A1_he.tif → Correct
  • ✅ ST110001_he.png → Correct
  • ✅ ST110001_A1_HE.tif → Correct (suffix is case-insensitive)
  • ❌ ST110001_A1.tif → Will be treated as ssDNA mode
  • ❌ ST110001_A1he.tif → Missing underscore, will not be detected

FASTQ File Detection Priority

The pipeline searches for FASTQ files in this order:

Priority 1: Multi-lane format (Recommended)
  ├── Pattern: {chip}_S*_L*_R1_*.fastq.gz
  ├── Example: ST110001_A1_S1_L001_R1_001.fastq.gz
  │           ST110001_A1_S1_L002_R1_001.fastq.gz
  └── Auto-detects and merges all lanes

Priority 2: Multi-fold format (Legacy)
  ├── Pattern: {chip}_fold{1-5}_1.fq.gz
  ├── Example: ST110001_A1_fold1_1.fq.gz
  │           ST110001_A1_fold2_1.fq.gz
  └── Supports up to 5 fold files

Priority 3: Single file format (Simple)
  ├── Pattern: {chip}_1.fq.gz
  ├── Example: ST110001_A1_1.fq.gz
  └── Single file pair

Priority 4: _R1/_R2 format
  ├── Pattern: {chip}_R1.fq.gz / {chip}_R2.fq.gz
  ├── Example: ST110001_A1_R1.fq.gz
  │           ST110001_A1_R2.fq.gz
  └── Common alternative naming convention

Custom FASTQ Directory & Name: You can override the default FASTQ location and naming with the --fastq_dir and --fastq_name parameters (see Examples 9 and 10 above).

Comparison with Previous Versions

| Version | Sample Identifier | File Base Name | Example Command |
| --- | --- | --- | --- |
| v1.7.0 | chip_number (primary); sample_name (optional metadata) | Uses chip_number | Celatlas.sh DEMO ST110001_A1 ... |
| v1.6.x | sample only | Uses sample | Celatlas.sh Sample001 ... |

Key Change: In v1.7.0, the chip_number parameter is now the primary identifier for file matching, while sample_name is optional metadata. This allows better organization when multiple samples share the same chip.

Practical Examples

Example 1: Standard HE Mode with Multi-lane FASTQ

# File structure:
# /mnt/strna/celatlas_spatial/
# ├── fastq/BBV2.4/
# │   ├── ST110001_A1_S1_L001_R1_001.fastq.gz
# │   ├── ST110001_A1_S1_L001_R2_001.fastq.gz
# │   ├── ST110001_A1_S1_L002_R1_001.fastq.gz
# │   └── ST110001_A1_S1_L002_R2_001.fastq.gz
# ├── images/
# │   └── ST110001_A1_he.tif
# └── ST_mask/
#     ├── ST110001_A1.barcodeToPos.h5
#     ├── ST110001_A1_FilterBarcodes.csv
#     └── ST110001_A1_tissue_bbox.csv

bash Celatlas.sh DEMO ST110001_A1 HE_test BBV2.4 Mus_musculus HE strna

Example 2: Gene Expression Mode with Simple FASTQ

# File structure:
# /mnt/strna/celatlas_spatial/
# ├── fastq/BBV2.4/
# │   ├── ST110001_1.fq.gz
# │   └── ST110001_2.fq.gz
# └── ST_mask/
#     ├── ST110001.barcodeToPos.h5
#     ├── ST110001_FilterBarcodes.csv
#     └── ST110001_tissue_bbox.csv

bash Celatlas.sh DEMO ST110001 test_001 BBV2.4 Homo_sapiens gene_expr strna

Environment Variables

You can customize the pipeline behavior using environment variables:

| Variable | Description | Default Value |
| --- | --- | --- |
| CELATLAS_WORKSPACE | Main workspace directory | /mnt/strna/celatlas_spatial |
| MAX_PARALLEL_FILES | Maximum parallel file processing | 3 |

Setting Custom Paths

Option 1: Export environment variables

export CELATLAS_WORKSPACE="/home/user/my_workspace"
bash Celatlas.sh DEMO ST110001 CAS250801 BBV2.4 Mus_musculus image strna

Option 2: One-time usage

CELATLAS_WORKSPACE="/home/user/workspace" \
bash Celatlas.sh DEMO ST110001 CAS250801 BBV2.4 Homo_sapiens image strna

Option 3: Create a configuration script

#!/bin/bash
# create_config.sh
export CELATLAS_WORKSPACE="/your/workspace/path"
export MAX_PARALLEL_FILES=4

# Usage: source create_config.sh && Celatlas.sh ...

Directory Structure

The pipeline expects and creates the following directory structure:

$CELATLAS_WORKSPACE/
├── binSegment/          # Segmentation results
├── images/              # Input images
├── ST_mask/             # Barcode position files
├── fastq/               # FASTQ files organized by chemistry
│   └── BBV2.4/
├── reference/           # Reference genomes
│   ├── Homo_sapiens/
│   ├── Mus_musculus/
│   └── others/
├── src/                 # Source files and models
├── rawdata/             # Raw data per sample
└── results/             # Analysis results
    └── {casno}/         # Organized by case number
        └── {chip_number}/   # Chip-specific results
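A workspace with this layout can be prepared ahead of the first run. The following sketch is an assumption-laden convenience and not a pipeline command: `resolve_workspace` and `create_skeleton` are hypothetical names, the default path is the one this README gives for the workspace, and `ST_mask` is included because it is the documented default for --mask_dir.

```python
import os
from pathlib import Path

DEFAULT_WORKSPACE = "/mnt/strna/celatlas_spatial"  # default stated in this README
SUBDIRS = [
    "binSegment", "images", "ST_mask", "fastq/BBV2.4",
    "reference/Homo_sapiens", "reference/Mus_musculus", "reference/others",
    "src", "rawdata", "results",
]


def resolve_workspace(env=None) -> Path:
    """Return CELATLAS_WORKSPACE if set, otherwise the documented default."""
    env = os.environ if env is None else env
    return Path(env.get("CELATLAS_WORKSPACE", DEFAULT_WORKSPACE))


def create_skeleton(workspace: Path) -> list[Path]:
    """Create the expected sub-directories; existing ones are left untouched."""
    created = []
    for sub in SUBDIRS:
        path = workspace / sub
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

A typical call would be `create_skeleton(resolve_workspace())`, after which FASTQ files, images, and barcode files can be copied or linked into place.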

Pipeline Steps

The Celatlas.sh script runs the following analysis steps. The workflow varies depending on the selected method (HE, ssDNA, or gene_expr):

Standard Workflow (All Methods)

  1. Sample Processing (00.sample)

    • Sample validation and metadata collection
    • Chemistry detection and verification
    • System resource check
  2. Barcode Extraction (01.barcode)

    • Spatial barcode identification from R1 reads
    • Whitelist-based barcode correction
    • UMI extraction
    • Performance: Multi-file parallel processing (configurable with MAX_PARALLEL_FILES)
  3. Adapter Trimming (02.cutadapt)

    • Quality control and adapter removal from R2 reads
    • NextSeq-specific quality trimming
    • Insert size filtering based on --insertR2 parameter
  4. Sequence Alignment (03.star)

    • STAR alignment to reference genome
    • Threading: Utilizes --thread parameter for parallel alignment
    • Adaptive parameters based on mode (stricter for spatial, relaxed for scRNA)
  5. Feature Counting (04.featureCounts)

    • Gene expression quantification
    • Counts reads per gene per barcode
    • Threading: Multi-core counting enabled
  6. UMI Counting (05.count)

    • UMI deduplication
    • Cell/spot calling using EmptyDrops algorithm
    • Expected cell number based on --cell_num parameter
    • Generates filtered and raw count matrices

Method-Specific Segmentation (Step 6)

The binSegment step varies significantly based on the selected method:

🔵 HE Mode (Recommended - NEW in v1.7.0)

06.binSegment - Gene Expression + H&E Image Registration

This is the most advanced mode combining computational and visual tissue detection:

Step 6A: Gene Expression-based Tissue Detection
  ├── Aggregate UMI counts into spatial grid (default: 20µm bins)
  ├── Apply UMI threshold filtering (default: ≥30 UMIs)
  ├── Enhance signal using percentile normalization (p_low=5, p_high=95)
  ├── Generate tissue mask from gene expression heatmap
  └── Create GEM (Gene Expression Matrix) tissue boundary

Step 6B: H&E Image Registration
  ├── Load H&E stained histology image ({chip}_he.tif/png/jpg)
  ├── Tissue segmentation using Swin-UNET deep learning model
  ├── Extract tissue region from H&E image
  ├── Register H&E tissue to GEM tissue boundary
  │   ├── Registration method: Affine transformation (default)
  │   ├── Alignment: SimpleITK with mutual information metric
  │   └── Initial alignment: Contour-based initialization
  └── Generate aligned overlay visualization

Step 6C: Spatial Binning
  ├── Generate bins at specified resolutions (--bin parameter)
  │   ├── bin10/  - 10µm bins (~single cell resolution)
  │   ├── bin20/  - 20µm bins (high resolution)
  │   ├── bin50/  - 50µm bins (balanced, recommended)
  │   └── bin100/ - 100µm bins (more genes, lower resolution)
  ├── Assign barcodes to spatial bins
  ├── Aggregate UMI counts per bin
  └── Generate count matrices for each bin size
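The grid aggregation and threshold steps of Step 6A (and the per-bin aggregation of Step 6C) can be pictured with a small sketch. It is a simplified illustration, not the pipeline's implementation: the function names are hypothetical, spot coordinates are assumed to be already expressed in micrometers, and the 20µm grid and the ≥30 UMI threshold are the defaults quoted above.

```python
from collections import defaultdict


def aggregate_to_grid(spots, bin_size_um=20.0):
    """Sum UMI counts of (x_um, y_um, umi) spots into square bins.

    Returns a dict mapping (column, row) bin indices to the total UMI count.
    """
    grid = defaultdict(int)
    for x_um, y_um, umi in spots:
        key = (int(x_um // bin_size_um), int(y_um // bin_size_um))
        grid[key] += umi
    return dict(grid)


def tissue_mask(grid, umi_min_threshold=30):
    """Keep only bins whose total UMI count reaches the threshold."""
    return {key for key, total in grid.items() if total >= umi_min_threshold}
```

Changing `bin_size_um` to 10, 50, or 100 gives the coarser or finer grids that the --bin parameter selects for the output matrices.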

Key Features:

  • Dual tissue detection: Uses both gene expression AND histology
  • Robust registration: Aligns H&E image to gene expression coordinates
  • Interactive visualization: HTML report shows HE-GEM overlay with toggle controls
  • Quality metrics: Registration accuracy, tissue coverage statistics

Output Files (HE Mode):

06.binSegment/
├── {chip}_tissue_HE.jpg              # Segmented H&E tissue region
├── {chip}_1_tissue.png               # Gene expression tissue mask
├── {chip}_2_he_registered.jpg        # Registered H&E image
├── {chip}_3_overlay_combined.jpg     # HE + GEM overlay
├── {chip}_3c_gem_heatmap_only.png    # Pure GEM heatmap (for download)
└── square_bin/
    ├── {chip}_bin10/
    │   ├── filtered_feature_bc_matrix/
    │   ├── stat.txt
    │   └── downsample.tsv
    ├── {chip}_bin50/
    └── {chip}_bin100/

🟢 ssDNA Mode (Image-based)

06.binSegment - Fluorescent Image Segmentation

Traditional image-based tissue detection using fluorescent ssDNA probes:

Step 6: Image Segmentation + Spatial Binning
  ├── Load tissue image ({chip}.tif)
  ├── Tissue segmentation using Swin-UNET model
  ├── Extract tissue boundary
  ├── Generate spatial bins (--bin parameter)
  ├── Assign barcodes to bins
  └── Create count matrices per bin

Required: High-quality fluorescent tissue image ({chip}.tif)

🟡 gene_expr Mode (No Image)

06.binSegment - Pure Computational Segmentation

Tissue detection based solely on gene expression data:

Step 6: Gene Expression-based Segmentation
  ├── Aggregate UMI counts into spatial grid (--gem-bin-size, default: 20µm)
  ├── Apply UMI threshold (--umi-min-threshold, default: ≥30)
  ├── Enhance signal (--enhance-params)
  ├── Detect tissue boundary from expression pattern
  ├── Generate spatial bins (--bin parameter)
  └── Create count matrices per bin
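The "enhance signal" step is described for HE mode as percentile normalization with p_low=5 and p_high=95. One common reading of that phrase is to clip values to the 5th and 95th percentiles and rescale them to the range 0-1. The sketch below implements that reading in plain Python; it is an assumption about the technique, not the pipeline's code, and the format of --enhance-params is not documented here.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile (p in 0-100) of an ascending list."""
    if not sorted_values:
        raise ValueError("percentile of an empty list is undefined")
    rank = (len(sorted_values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def percentile_normalize(values, p_low=5, p_high=95):
    """Clip values to [p_low, p_high] percentiles and rescale to 0-1."""
    ordered = sorted(values)
    low = percentile(ordered, p_low)
    high = percentile(ordered, p_high)
    if high == low:
        # Flat signal: nothing to stretch, return zeros of the same length
        return [0.0 for _ in values]
    return [(min(max(v, low), high) - low) / (high - low) for v in values]
```

Clipping at the percentiles keeps a few very bright bins from compressing the rest of the heatmap, which is the usual motivation for this kind of contrast stretch.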

Advantages: No imaging required, works with any spatial platform

Analysis and Visualization (Steps 7-8)

  1. Spatial Analysis (07.analysis)

    • Dimensionality reduction (UMAP, t-SNE*)
    • Clustering (Leiden algorithm)
    • Marker gene identification
    • Spatial expression patterns
    • Performance:
      • Uses --thread for parallel computation
      • Smart t-SNE handling: For high-resolution data (bin10, bin20), t-SNE is automatically skipped to prevent memory issues
      • UMAP is always computed for all bin sizes
  2. Report Generation

    • Interactive HTML report with Plotly visualizations
    • HE Mode: Includes Image Alignment Viewer with HE-GEM overlay
    • All Modes: QC metrics, spatial plots, clustering results
    • Downloadable figures (PNG format)

Performance Optimization

The pipeline includes several performance enhancements in v1.7.0:

  • Auto-threading: Automatically detects CPU cores if --thread not specified
  • BLAS thread limiting: Prevents segmentation faults on high-core servers (auto-limits to 16 threads)
  • Parallel barcode extraction: Multiple FASTQ files processed concurrently
  • Smart t-SNE skipping: High-resolution bins skip t-SNE to avoid memory issues
  • Intermediate file cleanup: Optional automatic cleanup of intermediate files (controlled by CLEAN_INTERMEDIATE variable)

Thread Configuration Details:

# Auto-detected (recommended for most users)
bash Celatlas.sh ... HE strna  # Uses all available cores

# Manual specification (for high-performance servers)
bash Celatlas.sh ... HE strna --thread 64

# Conservative mode (for shared servers)
bash Celatlas.sh ... HE strna --thread 16

Internal Thread Management:

  • Analysis threads: Set by --thread parameter (e.g., 64)
  • BLAS threads: Auto-limited to 16 (prevents OpenBLAS segfault)
  • Environment variables set:
    • OMP_NUM_THREADS: Analysis threads
    • OPENBLAS_NUM_THREADS: Limited to 16 (max)
    • MKL_NUM_THREADS: Limited to 16 (max)
    • NUMEXPR_NUM_THREADS: Analysis threads
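The mapping from --thread to these variables can be written out explicitly. This sketch mirrors the rules listed above (analysis threads passed through, BLAS libraries capped at 16); `thread_environment` is a hypothetical helper shown for illustration, and the pipeline sets these variables itself.

```python
BLAS_THREAD_CAP = 16  # cap described above for OpenBLAS and MKL


def thread_environment(threads: int) -> dict[str, str]:
    """Return the thread-related environment variables for a --thread value."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    blas_threads = min(threads, BLAS_THREAD_CAP)
    return {
        "OMP_NUM_THREADS": str(threads),
        "OPENBLAS_NUM_THREADS": str(blas_threads),
        "MKL_NUM_THREADS": str(blas_threads),
        "NUMEXPR_NUM_THREADS": str(threads),
    }
```

With --thread 64, the OpenMP and NumExpr variables receive 64 while the two BLAS variables stay at 16.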

Output Files

The pipeline generates comprehensive outputs organized by case number and sample:

Output Directory Structure

$CELATLAS_WORKSPACE/results/{casno}/{chip_number}/
├── 00.sample/
│   └── stat.json                           # Sample metadata and chemistry info
├── 01.barcode/
│   ├── {chip}_1.fq.gz                      # Barcode-corrected R1
│   ├── {chip}_2.fq.gz                      # Barcode-corrected R2
│   └── stat.json                           # Barcode extraction statistics
├── 02.cutadapt/
│   ├── {chip}_clean_2.fq.gz                # Adapter-trimmed R2
│   └── stat.json                           # Trimming statistics
├── 03.star/
│   ├── {chip}_Aligned.sortedByCoord.out.bam
│   ├── {chip}_Log.final.out                # Alignment summary
│   └── stat.json                           # STAR statistics
├── 04.featureCounts/
│   ├── {chip}_nameSorted.bam               # Name-sorted BAM for counting
│   ├── {chip}_counts.txt                   # Gene-level counts
│   └── stat.json                           # Feature counting stats
├── 05.count/
│   ├── {chip}_count_detail.txt             # Detailed UMI counts per barcode
│   ├── {chip}_filtered_feature_bc_matrix/  # Filtered count matrix (cells)
│   │   ├── barcodes.tsv.gz
│   │   ├── features.tsv.gz
│   │   └── matrix.mtx.gz
│   ├── {chip}_raw_feature_bc_matrix/       # Raw count matrix (all barcodes)
│   └── stat.json                           # Cell calling statistics
├── 06.binSegment/
│   ├── {chip}_tissue_HE.jpg                # [HE mode] Segmented H&E tissue
│   ├── {chip}_1_tissue.png                 # [HE mode] Gene expression mask
│   ├── {chip}_2_he_registered.jpg          # [HE mode] Registered H&E
│   ├── {chip}_3_overlay_combined.jpg       # [HE mode] HE+GEM overlay
│   ├── {chip}_3c_gem_heatmap_only.png      # [HE mode] Pure GEM heatmap
│   └── square_bin/
│       ├── {chip}_bin10/                   # 10µm resolution
│       │   ├── filtered_feature_bc_matrix/
│       │   ├── stat.txt                    # Bin-specific statistics
│       │   └── downsample.tsv              # Saturation curve data
│       ├── {chip}_bin20/                   # 20µm resolution
│       ├── {chip}_bin50/                   # 50µm resolution (recommended)
│       └── {chip}_bin100/                  # 100µm resolution
├── 07.analysis/
│   ├── {chip}_bin{size}_cluster.tsv        # Cluster assignments
│   ├── {chip}_bin{size}_umap.tsv           # UMAP coordinates
│   ├── {chip}_bin{size}_tsne.tsv           # t-SNE coordinates (if generated)
│   ├── {chip}_bin{size}_markers.csv        # Marker genes per cluster
│   └── spatial_plots/                      # Spatial visualization PNGs
├── {chip}_spatial_analysis_report.html     # ⭐ Main interactive report
└── pipeline.log                            # Complete pipeline execution log
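When a run stops early, the layout above makes it easy to see how far it got. The sketch below is a hypothetical troubleshooting helper, not a pipeline tool: it treats steps 00 to 05 as complete when their stat.json exists (as shown in the tree) and, because no stat.json is listed for 06.binSegment and 07.analysis, only checks that those directories exist.

```python
from pathlib import Path

STEPS_WITH_STAT = ["00.sample", "01.barcode", "02.cutadapt",
                   "03.star", "04.featureCounts", "05.count"]
STEPS_DIR_ONLY = ["06.binSegment", "07.analysis"]


def step_status(result_dir) -> dict[str, bool]:
    """Map each step directory name to True if it looks complete."""
    root = Path(result_dir)
    status = {}
    for step in STEPS_WITH_STAT:
        status[step] = (root / step / "stat.json").is_file()
    for step in STEPS_DIR_ONLY:
        status[step] = (root / step).is_dir()
    return status


def first_incomplete_step(result_dir):
    """Return the first step that is not complete, or None if all are."""
    for step, done in step_status(result_dir).items():
        if not done:
            return step
    return None
```

The `result_dir` argument would be `$CELATLAS_WORKSPACE/results/{casno}/{chip_number}`; pipeline.log in the same directory remains the authoritative record of what happened.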

Key Output Files

| File | Description | Use Case |
| --- | --- | --- |
| {chip}_spatial_analysis_report.html | Main deliverable: interactive HTML report with all QC metrics, plots, and visualizations | Share with collaborators, publication figures |
| 06.binSegment/square_bin/{chip}_bin50/filtered_feature_bc_matrix/ | Count matrix for downstream analysis | Load into Seurat, Scanpy, or other tools |
| 06.binSegment/{chip}_3_overlay_combined.jpg | [HE mode] HE-GEM alignment visualization | Verify registration quality |
| 07.analysis/{chip}_bin{size}_cluster.tsv | Spatial cluster assignments | Downstream spatial analysis |
| pipeline.log | Complete execution log with timing and errors | Troubleshooting, performance analysis |
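For a quick look at a count matrix without installing an analysis framework, matrix.mtx.gz can be read with the standard library. The sketch assumes the file is a Matrix Market coordinate file with 1-based indices, as the 10x-style file names (barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz) suggest; that format is an assumption, and `read_mtx_gz` is a hypothetical helper.

```python
import gzip


def read_mtx_gz(path):
    """Read a gzipped Matrix Market coordinate file.

    Returns ((n_rows, n_cols), entries) where entries maps 0-based
    (row, col) index pairs to values.
    """
    with gzip.open(path, "rt") as handle:
        header = handle.readline()
        if not header.startswith("%%MatrixMarket"):
            raise ValueError("not a Matrix Market file")
        line = handle.readline()
        while line.startswith("%"):  # skip comment lines
            line = handle.readline()
        if not line.strip():
            raise ValueError("missing size line")
        n_rows, n_cols, _n_entries = (int(token) for token in line.split())
        entries = {}
        for line in handle:
            if not line.strip():
                continue
            row, col, value = line.split()
            entries[(int(row) - 1, int(col) - 1)] = float(value)
    return (n_rows, n_cols), entries
```

For real analyses, a dedicated loader is the better choice. If the directory follows the 10x convention, Scanpy's `read_10x_mtx` is designed for this layout, though compatibility with this pipeline's output should be verified on your own data.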

Report Features (HTML)

The interactive HTML report includes:

QC Metrics Section:

  • Total reads, mapped reads, genes detected
  • Sequencing saturation curve
  • Barcode rank plot
  • Median genes per square

Spatial Visualization:

  • UMI distribution heatmap
  • Gene expression overlay on tissue
  • Cluster spatial distribution

Clustering Analysis:

  • UMAP/t-SNE embeddings colored by cluster
  • Top marker genes per cluster
  • Cluster composition statistics

[HE Mode Only] Image Alignment Viewer:

  • Interactive overlay of H&E and gene expression
  • Toggle visibility of HE/GEM layers
  • Opacity slider for blend control
  • Download buttons for HE-only, GEM-only, or combined views

Command Line Tools

Individual pipeline components are also available as command-line tools:

# RNA analysis subcommands
celatlas_spatial rna sample --help
celatlas_spatial rna barcode --help
celatlas_spatial rna cutadapt --help
celatlas_spatial rna star --help
celatlas_spatial rna featureCounts --help
celatlas_spatial rna count --help
celatlas_spatial rna binSegment --help
celatlas_spatial rna analysis --help

Additional Notes

All software dependencies and system requirements are detailed in the Installation & Setup section above. Python packages are automatically installed with pip, but bioinformatics tools like STAR and SAMtools need to be installed separately as described in the Prerequisites section.

Support

For issues and questions:

  • Create an issue on GitHub
  • Contact: rd@celatlas.com or lqs60667106@gmail.com

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use Celatlas Spatial in your research, please cite:

https://github.com/celatlas/Celatlas_spatial/

About

Celatlas Spatial is a powerful Python-based bioinformatics toolkit designed for analyzing spatial transcriptomics data. This pipeline provides an end-to-end solution for processing, analyzing, and visualizing spatially resolved single-cell RNA sequencing experiments.
