TBEL CDMC-Recommended Data Analysis Methods/Pipelines

bgtaylor1
Dec 11, 2024

Updated: Jan 10

Provided by Ziyi Li & Lulu Shang

For further assistance, please contact TBEL-CDMC@mdanderson.org.

The TBEL program held its inaugural Data Science mini-Symposium on October 13th, 2024. Data science leaders from ARTNet (Dr. Alan Hutson) and PSRC (Dr. Linghua Wang) shared their experiences in coordinating data analysis efforts and molecular data analysis. Data scientists and trainees from five TBEL sites also presented their data challenges and shared their analytical approaches to early lesion analysis. The recommended list of methods and pipelines has been compiled primarily from presentations at this meeting, with additional insights from CDMC researchers. This list will be updated continuously as the research program advances and new insights are gained from future Data Science mini-symposia.

________________________________________________________________________________________________________

Spatial Transcriptomics

Schematic plot	Purpose	Methods
	QC, Normalization, Integration	SpotClean
	Cluster	BayesSpace, BASS, SpaGCN
	Differential Analysis	Seurat, RegionalST
	SVG detection	SPARK, SPARK-X, nnSVG, CELINA
	Dimension reduction and multi-dataset integration	SpatialPCA, PRECAST
	Multi-omics integration	SpatialGLUE
	Deconvolution	CARD, Redeconve, SPOTlight, RCTD, BayesPrism
	Imputation and alignment	CytoSPACE, PASTA, CellTrek, Tangram
	Visualization	SpatialView
	Cell-cell interaction	SpaCCI, CellChat, MultiNicheNetR, COMMOT, CellPhoneDB, NicheDE/NicheNet
	Power analysis	PoweREST
	Super-resolution	iStar, BayesSpace
	Cell segmentation	Baysor, SCS, DeepCell, Cellpose, StarDist
	Integrating histology and morphology in ST analysis	METI

Recommended analysis steps: quality control (QC), normalization, data integration, clustering, differential expression analysis, and cell-cell interaction analysis.

QC, Normalization, Integration

SpotClean[L1] improves the quality of spatial transcriptomics data by removing ambient RNA contamination, ensuring that gene expression measurements accurately reflect local cellular environments. This decontamination process enhances normalization and integration across datasets, allowing for more reliable cross-sample and cross-condition analyses.

Clustering (BayesSpace and Louvain are recommended over BASS and SpaGCN in experience-sharing by TBEL members, but these methods may perform differently in other problems)

BayesSpace[L2] is a Bayesian tool that enhances clustering and spatial resolution in spatial transcriptomics data by leveraging spatial relationships between spots, refining clusters at a sub-spot level. This spatially-aware approach enables the detection of finer tissue structures and distinct spatial domains, offering deeper insights into cellular organization and tissue architecture.
BASS[L3] (Bayesian Analytics for Spatial Segmentation) is a Bayesian method for clustering spatial transcriptomics data that combines spatial smoothing with gene expression information to improve clustering accuracy. By incorporating spatial dependencies between neighboring spots, BASS provides more precise and biologically meaningful clusters, enhancing the analysis of tissue structure and cellular composition.
SpaGCN[L4] (Spatial Graph Convolutional Network) is a deep learning method for clustering spatial transcriptomics data that combines gene expression with spatial information by modeling tissue structure as a graph. Leveraging graph convolutional networks, SpaGCN improves the detection of spatial domains within tissues, providing more accurate insights into cell organization and microenvironments.

Differential Expression Analysis (DE Analysis)

Wilcoxon test in the Seurat package
RegionalST[ZL5] is a computational method for spatial transcriptomics data analysis that identifies regions of interest (ROIs) and compares signals across ROIs. The “GetCellTypeSpecificDE_withProp()” function in the RegionalST package (https://github.com/ziyili20/RegionalST) provides cell type-specific differential analysis by incorporating cell type proportions into the DE process.
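To make the Wilcoxon test mentioned above concrete, here is a minimal plain-Python sketch of a two-sided rank-sum test using a normal approximation without tie correction. It is only an illustration of the statistic, not Seurat's or RegionalST's implementation; the function name `rank_sum_test` and the toy expression values are hypothetical.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test, normal approximation."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    # Assign midranks so that tied values share the average of their positions.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2                      # U statistic for group x
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # assumption: no tie correction
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))             # two-sided p-value
    return u1, p

# Toy expression of one gene in two groups of spots (made-up numbers).
tumor = [5.1, 6.3, 7.2, 8.8, 9.4]
normal = [1.2, 2.5, 3.1, 3.9, 4.6]
u_stat, p_value = rank_sum_test(tumor, normal)
```

In practice the test is run once per gene and the resulting p-values are adjusted for multiple testing.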

Spatially variable gene (SVG) detection (SPARK-X has good performance on large datasets and can handle diverse data types, making it a more broadly applicable choice. CELINA is specifically designed to identify cell type-specific spatially variable genes. However, the best method may depend on the specific dataset and research question, so it's always advisable to compare multiple methods.)

SPARK [Sh6] (Spatial Pattern Recognition via Kernels) is designed for identifying SVGs in spatial transcriptomics data. It employs spatial kernel-based tests to detect genes with significant spatial expression patterns.
SPARK-X[Sh7] employs spatial kernel-based tests to detect genes with significant spatial expression patterns, and offers improved computational efficiency and scalability for large datasets.
nnSVG [Sh8] uses nearest-neighbor Gaussian processes to capture spatial dependencies in gene expression. It provides a scalable statistical framework for identifying genes with significant spatial patterns while accounting for the overall expression level.
CELINA [Sh9] focuses on the detection of a subset of SVGs that display diverse spatial expression patterns within a given cell type.
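As a rough intuition for what "spatially variable" means, the sketch below computes Moran's I, a classic spatial autocorrelation statistic, for one gene on a small grid of spots. This is a deliberately simpler stand-in and not the kernel-based or Gaussian-process tests used by the tools listed above; the function name `morans_i`, the binary neighbor rule, and the toy grid are assumptions made for illustration.

```python
def morans_i(values, coords, radius=1.0):
    """Moran's I with binary weights: spots within `radius` count as neighbors."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = 0.0      # sum over neighbor pairs of dev_i * dev_j
    w_sum = 0.0    # total number of (directed) neighbor pairs
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            if dx * dx + dy * dy <= radius * radius:
                num += dev[i] * dev[j]
                w_sum += 1.0
    denom = sum(d * d for d in dev)
    return (n / w_sum) * (num / denom)

# Toy 4x4 grid of spots with two made-up expression patterns.
grid = [(x, y) for x in range(4) for y in range(4)]
striped = [1.0 if x >= 2 else 0.0 for x, _ in grid]   # expressed in one half only
checker = [float((x + y) % 2) for x, y in grid]       # alternating pattern

i_striped = morans_i(striped, grid)   # positive: neighbors look alike
i_checker = morans_i(checker, grid)   # negative: neighbors differ
```

Values near zero would suggest no spatial structure; dedicated SVG methods add proper null distributions and multiple-testing control on top of this idea.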

Dimension reduction and multi-sample data integration

SpatialPCA [Sh10] is a spatially-aware dimension reduction method specifically designed for spatial transcriptomics data. It extracts low-dimensional representations of the data while preserving the spatial correlation structure across tissue locations. It enables effective downstream analyses, such as spatial domain detection, trajectory inference and high-resolution map reconstruction, by incorporating the rich localization information inherent in spatial transcriptomics data.
PRECAST [Sh11] is a probabilistic model that simultaneously estimates embeddings for cellular biological effects, performs spatial clustering, and aligns the estimated embeddings across multiple tissue sections.

Multi-omics data integration

SpatialGLUE [Sh12] is a computational method for integrating spatial multi-omics data. It constructs spatial proximity and feature similarity graphs for each modality, using attention aggregation to adaptively integrate different data types. This approach is applicable to various spatial multi-omics platforms and demonstrates improved resolution in identifying spatial domains across different tissue types.

Deconvolution (CARD and RCTD are recommended over Redeconve and SPOTlight in experience-sharing by TBEL members, but these methods may perform differently in other problems)

CARD[ZL13] (Conditional AutoRegressive-based Deconvolution) deconvolves spatial transcriptomics data by estimating the cell-type composition of each spot from reference single-cell RNA-seq profiles. A conditional autoregressive model borrows information from neighboring spots, which improves accuracy.
Redeconve[ZL14] is a reference-based deconvolution tool that estimates cell-type proportions in spatial transcriptomics data using predefined single-cell references. By aligning spatial and single-cell data, it infers cell-type distributions across tissue sections, helping to reveal spatial patterns.
SPOTlight[ZL15] is a machine learning-based method that uses non-negative matrix factorization (NMF) to deconvolve spatial transcriptomics data, leveraging single-cell RNA-seq references. This approach assigns cell-type proportions to each spatial spot, providing insights into cellular heterogeneity and spatial organization.
RCTD[ZL16] (Robust Cell Type Decomposition) performs cell-type deconvolution by modeling the expression of each spot as a weighted combination of reference cell types from single-cell RNA-seq data. It uses a probabilistic approach to estimate cell-type proportions, allowing for robust cell-type mapping across spatial spots.
BayesPrism[ZL17] is a Bayesian deconvolution tool that uses single-cell RNA-seq reference data to infer cell-type composition and gene expression profiles in spatial transcriptomics data. By accounting for uncertainty in cell-type proportions, BayesPrism provides robust estimates of cellular composition, enhancing insights into the spatial organization of tissues.
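The common idea behind reference-based deconvolution, modeling each spot as a non-negative mixture of reference cell-type profiles, can be sketched with plain non-negative least squares. The code below solves it by projected gradient descent in pure Python. It is not the probabilistic model of CARD, RCTD, or any other tool listed here; the function name `deconvolve`, the step size, and the two toy signatures are assumptions chosen so the example converges.

```python
def deconvolve(spot, profiles, steps=2000, lr=0.001):
    """Estimate non-negative weights w so that sum_k w[k] * profiles[k] ~ spot."""
    k = len(profiles)
    w = [1.0 / k] * k
    for _ in range(steps):
        fitted = [sum(w[c] * profiles[c][g] for c in range(k))
                  for g in range(len(spot))]
        resid = [f - s for f, s in zip(fitted, spot)]
        for c in range(k):
            grad = sum(r * p for r, p in zip(resid, profiles[c]))
            w[c] = max(0.0, w[c] - lr * grad)   # gradient step, then clip at zero
    total = sum(w)
    return [x / total for x in w]               # rescale weights to proportions

# Hypothetical reference signatures over three genes (rows: two cell types).
profiles = [[10.0, 0.0, 2.0],
            [0.0, 10.0, 2.0]]
spot = [7.0, 3.0, 2.0]      # constructed as 0.7 * type 1 + 0.3 * type 2
proportions = deconvolve(spot, profiles)
```

The fixed learning rate only suits data on this toy scale; real tools also model count noise, platform effects, and spatial correlation between spots.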

Imputation and Alignment with single cell RNA-seq data

CytoSPACE[ZL18] is a computational tool designed for spatial transcriptomics analysis that maps single-cell RNA-seq data onto spatial transcriptomics data to infer cell-type locations across tissue sections. By leveraging both gene expression and spatial information, CytoSPACE provides high-resolution insights into tissue organization and cellular distribution within spatially resolved datasets.
Tangram[ZL19] is a computational method that aligns single-cell RNA-seq data with spatial transcriptomics data, mapping cell types onto spatial locations across tissue sections. This approach enables high-resolution reconstruction of cellular organization, facilitating insights into spatial tissue architecture.
PASTA[L20] is a method for pathway-oriented spatial gene expression imputation. It integrates pathway information, cell type, and spatial proximity to improve prediction accuracy, robustness, and biological relevance, and is reported to perform well on both simulated and real-world datasets.
CellTrek[L21] is a spatial mapping tool that uses gene expression and spatial information to map single-cell RNA-seq data onto spatial transcriptomics data. It accurately reconstructs the spatial positions of cells, providing a detailed view of cell-type distribution and tissue organization.

Visualization

SpatialView[L22] is an interactive web application for visualizing spatial transcriptomics data, including multiple samples from one experiment. By displaying gene expression together with spatial coordinates, SpatialView helps researchers explore spatially organized cell populations and examine tissue architecture in greater detail.

Cell-cell interaction analysis (SpaCCI and CellChat v2 are recommended in experience-sharing by TBEL members, but these methods may perform differently in other problems)

SpaCCI[L23] is a method for spatially aware cell-cell interaction quantification that models each spot in spatial transcriptomics data (e.g., 10x Visium) as a mixture of cell types, estimating cell type-specific gene expression with non-negative least squares regression. By borrowing information from neighboring spots, SpaCCI captures both local and global cell-cell interaction patterns, providing insights into cellular interactions and underlying biological processes.
CellChat v2 [L24] is an updated tool for inferring cell-cell communication networks from scRNA-seq and spatial transcriptomics data by analyzing ligand-receptor interactions. It offers advanced visualization and analysis capabilities, helping researchers understand intercellular signaling in complex tissues.
MultiNicheNetR[L25] is a framework for analyzing cell-cell interactions in spatial transcriptomics data, focusing on identifying niche-specific signaling pathways. It uses multi-modal data to infer communication networks, revealing insights into cell interactions within different tissue regions.
COMMOT[L26] is a spatial analysis tool that models cell-cell interactions in spatial transcriptomics data by considering both spatial proximity and gene expression profiles. It identifies communication patterns and signaling networks, providing insights into spatially regulated cellular interactions.
CellPhoneDB[L27] is a database and computational tool for predicting cell-cell interactions by analyzing ligand-receptor pairs in single-cell and spatial transcriptomics data. It enables identification of key signaling pathways and interactions between cell types in tissue samples.
NicheDE/NicheNet [L28] is a method for exploring cell-cell interactions in spatial transcriptomics by linking ligand-receptor interactions to downstream gene expression changes in target cells. This approach allows researchers to identify niche-specific signaling and predict the impact of signals on recipient cells.
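Many ligand-receptor approaches start from a simple quantity: how strongly a ligand is expressed in one cell type and its receptor in another. The sketch below scores a pair as the product of the two group means. This is a simplification for illustration and is not the statistic of any specific tool above; the function name `lr_score`, the cell labels, and the expression values are hypothetical.

```python
from statistics import mean

def lr_score(expr, labels, ligand, receptor, sender, receiver):
    """Mean ligand expression in sender cells times mean receptor expression in receiver cells."""
    lig = mean(expr[ligand][i] for i, lab in enumerate(labels) if lab == sender)
    rec = mean(expr[receptor][i] for i, lab in enumerate(labels) if lab == receiver)
    return lig * rec

# Toy data: four cells, one ligand gene and one receptor gene (made-up values).
labels = ["tumor", "tumor", "Tcell", "Tcell"]
expr = {
    "CD274": [4.0, 6.0, 0.0, 0.0],   # ligand, expressed in tumor cells here
    "PDCD1": [0.0, 0.0, 3.0, 1.0],   # receptor, expressed in T cells here
}

tumor_to_t = lr_score(expr, labels, "CD274", "PDCD1", "tumor", "Tcell")
t_to_tumor = lr_score(expr, labels, "CD274", "PDCD1", "Tcell", "tumor")
```

Published tools typically add permutation-based significance testing, curated interaction databases, and, for spatial data, a constraint that sender and receiver are physically close.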

Power Calculation and Sample Size Justification

PoweREST[L29] (https://lanshui.shinyapps.io/PoweREST/) is a computational tool designed to perform power calculations for spatial transcriptomics experiments, helping researchers determine the optimal sample size and experimental design for detecting spatial gene expression patterns. By simulating spatial data and evaluating statistical power, PoweREST enables efficient planning of spatial transcriptomics studies, ensuring sufficient sensitivity for biological discoveries.
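The general logic of simulation-based power analysis can be sketched in a few lines: simulate many experiments at a given sample size and effect size, test each one, and report the fraction that reach significance. The code below is a generic two-group illustration with normally distributed data and a z-type test. It is not PoweREST's procedure; the function name `simulated_power` and all parameter values are assumptions.

```python
import math
import random

def simulated_power(n_per_group, effect, sd=1.0, z_crit=1.96, n_sim=500, seed=0):
    """Fraction of simulated two-group experiments in which |z| exceeds z_crit."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    n = n_per_group
    hits = 0
    for _ in range(n_sim):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(effect, sd) for _ in range(n)]
        mean_a, mean_b = sum(a) / n, sum(b) / n
        var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
        var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
        z = (mean_b - mean_a) / math.sqrt(var_a / n + var_b / n)
        if abs(z) > z_crit:
            hits += 1
    return hits / n_sim

# Estimated power at three candidate sample sizes for a one-SD effect.
power_by_n = {n: simulated_power(n, effect=1.0) for n in (5, 15, 30)}
null_rate = simulated_power(30, effect=0.0)   # should sit near the nominal 5%
```

A real spatial transcriptomics power analysis would replace the normal model with one fitted to pilot data, including spot-level correlation and the planned multiple-testing correction.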

Super-resolution recovery

iStar[L30] (Inferring Super-resolution Tissue Architecture) integrates spatial transcriptomics data with high-resolution histology images to predict gene expression at a much finer, near-single-cell resolution. The enhanced resolution reveals patterns of gene expression that drive tissue organization.
BayesSpace[L31] provides super-resolution recovery in addition to clustering.

Cell segmentation

Baysor[L32] is a segmentation method that optimizes 2D or 3D cell boundaries by considering the joint likelihood of transcriptional composition and cell morphology. It can perform segmentation based on detected transcripts alone or incorporate co-stains, making it versatile for various imaging-based spatial transcriptomics technologies.
SCS [L33] (Subcellular Spatial transcriptomics Cell Segmentation) combines imaging data with sequencing data to improve cell segmentation accuracy in high-resolution spatial transcriptomics. It uses a transformer neural network to adaptively learn the position of each spot relative to the cell center, outperforming traditional image-based segmentation methods in subcellular resolution datasets.
DeepCell[L34] is a deep learning library for cell segmentation. While not specifically designed for spatial transcriptomics, it can be applied to segment cells in staining images associated with spatial transcriptomics data.
Cellpose[L35] is a deep learning-based cell segmentation method. It can be used to segment cells in various types of microscopy images, including those from spatial transcriptomics experiments.
StarDist [L36] is a deep learning-based object detection method that can be used for cell nuclei segmentation. It's particularly useful for densely packed cells and can be applied to nuclear stains in spatial transcriptomics data.

Integrating histology and morphology in ST analysis

METI[L37] (Morphology-Enhanced Spatial Transcriptome Analysis Integrator) is an analysis framework that combines spatial transcriptomics data with histology images and cell morphology. It uses both gene expression and morphological information to map cancer cells and components of the tumor microenvironment within tissue sections.

________________________________________________________________________________________________________

Schematic plot	Purpose	Methods
	Data preprocessing	Seurat
	Clustering	Seurat, SC3, etc.
	Annotation	SingleR, EasyCellType, NeuCA
	End-to-end data processing and analysis pipeline	SCRATCH
	Pseudotime analysis	Monocle
	Pan-cancer T cell analysis	TCellMap

Single cell RNA-seq data

Data pre-processing

Seurat [L38] provides a comprehensive pipeline for preprocessing single-cell RNA sequencing (scRNA-seq) data, including quality control, normalization, identification of highly variable genes, and dimensionality reduction. It then uses clustering and visualization techniques, such as PCA, t-SNE, and UMAP, to identify and interpret cell types and relationships within the data, supporting further analyses like differential expression.
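As a small illustration of the normalization step, the sketch below reproduces the arithmetic of Seurat's default log-normalization: each count is divided by the cell's total count, multiplied by a scale factor of 10,000, and transformed with log1p. It is written in plain Python rather than through Seurat's R interface; the function name `log_normalize` and the toy counts are hypothetical.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Log-normalize one cell's gene counts: log1p(count / total * scale_factor)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]   # empty cell: nothing to normalize
    return [math.log1p(c / total * scale_factor) for c in counts]

cell = [0, 3, 7, 90]                   # toy UMI counts for four genes in one cell
normalized = log_normalize(cell)
```

Because every cell is rescaled to the same total before the log transform, differences in sequencing depth between cells are largely removed.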

Clustering

Seurat (Louvain and Leiden algorithms; optimized for clustering large datasets)
SC3[L39] (Single-Cell Consensus Clustering) is an R/Bioconductor package for clustering single-cell RNA-seq data that combines multiple clustering algorithms to create a robust consensus-based clustering result. It filters low-quality cells and genes, selects the most variable genes, applies various clustering algorithms, and integrates the solutions into a consensus matrix, enabling stable clustering and insightful visualizations of cellular heterogeneity.
There are many clustering methods available for scRNA-seq. See the review paper by Duò et al. (2018), F1000Research[L40].
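The consensus idea described for SC3 can be illustrated with a toy consensus matrix: for every pair of cells, record the fraction of clustering runs that place them in the same cluster. The sketch below shows only this step, not SC3's full pipeline; the function name `consensus_matrix` and the three example labelings are hypothetical.

```python
def consensus_matrix(labelings):
    """Fraction of clusterings in which each pair of cells shares a cluster."""
    n = len(labelings[0])
    k = len(labelings)
    return [
        [sum(lab[i] == lab[j] for lab in labelings) / k for j in range(n)]
        for i in range(n)
    ]

runs = [
    [0, 0, 1, 1],   # clustering from run A
    [1, 1, 0, 0],   # run B: same partition, different label names
    [0, 0, 0, 1],   # run C: disagrees about the third cell
]
consensus = consensus_matrix(runs)
```

Because only co-membership is counted, arbitrary label names in each run do not matter; a final clustering is then typically obtained by clustering the consensus matrix itself.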

Annotation

SingleR[L41] is an automated cell-type annotation tool for single-cell RNA-seq data that uses reference datasets of pure cell types to label individual cells. It compares each cell’s gene expression profile to reference profiles, enabling accurate and scalable cell-type identification.
EasyCellType[L42] is a user-friendly tool for automated cell-type annotation in single-cell RNA-seq data that leverages a combination of gene markers and pretrained classifiers. It allows researchers to quickly assign cell types, even with limited bioinformatics experience, enhancing accessibility to single-cell analysis.
NeuCA[L43] (Neural network-based Cell Annotation) is a supervised, neural network-based method for cell-type annotation in single-cell RNA-seq data. It is trained on labeled reference data and can use the hierarchical structure of cell types to improve accuracy when cell types are closely correlated.
There are many supervised and unsupervised cell type annotation methods. See our review paper (Sun et al., 2022, Briefings in Bioinformatics[L44]) for more discussion.
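Reference-based annotation, in its simplest form, compares each cell's expression profile with reference profiles and assigns the best-matching label. The sketch below does this with Spearman correlation in plain Python. It is a bare-bones illustration of the idea and omits the gene selection and fine-tuning steps that tools such as SingleR perform; the function names and the toy reference are hypothetical.

```python
def ranks(values):
    """Midranks: tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the midranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

def annotate(cell, reference):
    """Return the reference label whose profile correlates best with the cell."""
    return max(reference, key=lambda label: spearman(cell, reference[label]))

# Hypothetical reference profiles over five genes, and one query cell.
reference = {"T cell": [9, 8, 0, 1, 0], "B cell": [0, 1, 9, 8, 1]}
query = [5, 7, 1, 0, 0]
label = annotate(query, reference)
```

The sketch assumes every profile has some variation across genes; a constant profile would make the correlation undefined.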

End-to-end data processing and analysis pipeline

SCRATCH[L45] (Single-cell RNA-Seq Toolkit and Pipeline for Cancer Research) is a comprehensive computational toolkit designed to process, analyze, and visualize single-cell RNA-seq data specifically for cancer research applications. It includes modules for quality control, normalization, clustering, differential expression analysis, and pathway enrichment, providing an end-to-end pipeline to study cellular heterogeneity and tumor microenvironments.

Pseudo-time analysis

Monocle[L46] is a tool for single-cell RNA-seq (scRNA-seq) analysis that specializes in trajectory inference, enabling the study of dynamic cellular processes such as differentiation. By ordering cells along a pseudotime axis, Monocle helps researchers explore cell fate decisions, developmental pathways, and gene expression changes over time.

Pan-cancer T cell analysis

TCellMap[L47] is a computational method for analyzing single-cell RNA-seq (scRNA-seq) data that focuses on mapping and characterizing T cell populations in various biological contexts. It identifies distinct T cell subtypes, tracks their activation states, and enables exploration of functional diversity within T cell populations, facilitating insights into immune responses and disease mechanisms.

_______________________________________________________________________________________________________

Schematic plot	Purpose	Methods
	General analysis, preprocessing, diversity analysis, visualization	Immunarch
	Clustering and Antigen-specificity Prediction	DeepTCR, ClusTCR, TCRMatch, GIANA, GLIPH/GLIPH2

Bulk TCR data

General analysis, preprocessing, diversity analysis, visualization

Immunarch[L48] is an R package designed for the analysis of T-cell receptor (TCR) and B-cell receptor (BCR) sequencing data, providing tools for repertoire diversity, clonotype tracking, and visualization. It offers a comprehensive workflow for exploring immune repertoire dynamics, aiding in the understanding of immune responses.
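Repertoire diversity is often summarized with Shannon entropy and a derived clonality score. The sketch below computes both from a list of clonotype labels in plain Python, to show what such a summary measures. It does not use Immunarch's R functions; the function names and the placeholder clonotype labels are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(clonotypes):
    """Shannon entropy (natural log) of clonotype frequencies."""
    counts = Counter(clonotypes)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def clonality(clonotypes):
    """1 - normalized entropy: 0 for a perfectly even repertoire, 1 for a single clone."""
    n_unique = len(set(clonotypes))
    if n_unique < 2:
        return 1.0
    return 1.0 - shannon_entropy(clonotypes) / math.log(n_unique)

# Toy repertoires; each entry is one sequenced receptor (placeholder labels).
even = ["cloneA", "cloneB", "cloneC", "cloneD"]
expanded = ["cloneA"] * 97 + ["cloneB", "cloneC", "cloneD"]

even_clonality = clonality(even)           # close to 0
expanded_clonality = clonality(expanded)   # close to 1: one dominant clone
```

A high clonality score indicates that a few expanded clones dominate the repertoire, which is often of interest when comparing lesions or time points.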

Clustering and Antigen-specificity Prediction (DeepTCR and GLIPH2 are recommended in experience-sharing by TBEL members, but these methods may perform differently in other problems)

DeepTCR[L49] is a deep learning framework for analyzing TCR sequences, leveraging neural networks to identify patterns in immune receptor data associated with antigen specificity. It enables high-throughput analysis and predictive modeling of TCR-antigen interactions, facilitating insights into immune recognition.
ClusTCR[L50] is a clustering tool for TCR repertoire data that groups similar TCR sequences based on their CDR3 region, enabling the identification of clonally related T-cells. It provides insights into TCR diversity and helps uncover shared antigen specificity among T-cell clones.
TCRMatch[L51] is a computational tool that matches TCR sequences to known antigens by comparing TCR sequence motifs, helping to predict T-cell specificity. It facilitates the identification of TCR-antigen pairs, advancing research in immune response and immunotherapy.
GIANA[L52] (Geometric Isometry-based TCR AligNment Algorithm) is a fast TCR clustering tool that converts CDR3 sequences into numeric vectors through an isometric transformation and groups similar sequences by nearest-neighbor search. It scales to large repertoires and supports the study of immune specificity by clustering TCRs likely to respond to the same antigens.
GLIPH/GLIPH2 [L53] (Grouping of Lymphocyte Interactions by Paratope Hotspots) and its updated version GLIPH2 are algorithms that cluster TCR sequences based on shared antigen-specific motifs, identifying T-cells with similar antigen specificities. They are widely used for analyzing immune repertoires to understand T-cell targeting and antigen recognition patterns.
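A simple way to see what sequence-based TCR clustering does is to link CDR3 sequences of equal length that differ by at most one amino acid and then take connected groups. The sketch below implements that rule with a small union-find structure. It is a simplified illustration and not the algorithm of any specific tool listed above; the function name `cluster_cdr3` and the example sequences are made up.

```python
def cluster_cdr3(seqs, max_mismatch=1):
    """Group equal-length CDR3 sequences connected by <= max_mismatch substitutions."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if len(seqs[i]) != len(seqs[j]):
                continue
            mismatches = sum(a != b for a, b in zip(seqs[i], seqs[j]))
            if mismatches <= max_mismatch:
                parent[find(i)] = find(j)   # merge the two clusters

    groups = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return sorted(groups.values(), key=len, reverse=True)

# Illustrative CDR3 amino-acid sequences (made up for this example).
cdr3s = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSLGQGYEQFF", "CASSPTGNTEAFF"]
clusters = cluster_cdr3(cdr3s)
```

The first and third sequences differ at two positions but still share a cluster because each is one substitution away from the second. The all-pairs comparison is quadratic in the number of sequences, which is why dedicated tools rely on faster indexing or embedding strategies.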

_____________________________________________________________________________________________________

Multiplex immunofluorescence (MxIF)

Cell segmentation

QuPath[L54] is an open-source software platform designed for the analysis and annotation of MxIF and other whole-slide imaging data. It provides tools for cell segmentation, phenotype classification, and spatial analysis, enabling researchers to analyze complex tissue structures and cellular interactions in high-resolution images.

_______________________________________________________________________________________________________

Bulk RNA-seq data

Pattern detection

CoGAPS[L55] (Coordinated Gene Activity in Pattern Sets) is a matrix factorization method that identifies gene expression patterns in single-cell and spatial transcriptomics data. By decomposing expression matrices into biologically relevant patterns, CoGAPS helps uncover underlying cellular processes and spatial heterogeneity in complex tissue samples.

[L1]https://www.nature.com/articles/s41467-022-30587-y

[L2]https://www.nature.com/articles/s41587-021-00935-2

[L3]https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02734-7

[L4]https://www.nature.com/articles/s41592-021-01255-8

[ZL5]https://academic.oup.com/bioinformatics/article/40/4/btae186/7641536

[Sh6]https://www.nature.com/articles/s41592-019-0701-7

[Sh7]https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02404-0

[Sh8]https://www.nature.com/articles/s41467-023-39748-z

[Sh9]https://github.com/pekjoonwu/CELINA

[Sh10]https://www.nature.com/articles/s41467-022-34879-1

[Sh11]https://www.nature.com/articles/s41467-023-35947-w

[Sh12]https://www.nature.com/articles/s41592-024-02316-4

[ZL13]https://www.nature.com/articles/s41587-022-01273-7

[ZL14]https://www.nature.com/articles/s41467-023-43600-9

[ZL15]https://academic.oup.com/nar/article/49/9/e50/6129341?login=false

[ZL16]https://www.nature.com/articles/s41587-021-00830-w

[ZL17]https://www.nature.com/articles/s41587-022-01517-6

[ZL18]https://www.nature.com/articles/s41587-023-01697-9

[ZL19]https://www.nature.com/articles/s41592-021-01264-7

[L20]https://github.com/rx-li/PASTA

[L21]https://www.nature.com/articles/s41587-022-01233-1

[L22]https://academic.oup.com/bioinformatics/article/40/3/btae117/7623009

[L23]https://litingku.github.io/SpaCCI/

[L24]https://www.nature.com/articles/s41596-024-01045-4

[L25]https://www.biorxiv.org/content/10.1101/2023.06.13.544751v1

[L26]https://www.nature.com/articles/s41592-022-01728-4

[L27]https://github.com/ventolab/CellphoneDB

[L28]https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03159-6

[L29]https://www.biorxiv.org/content/10.1101/2024.08.30.610564v1

[L30]https://www.nature.com/articles/s41587-023-02019-9

[L31]https://www.nature.com/articles/s41587-021-00935-2

[L32]https://www.nature.com/articles/s41587-021-01044-w

[L33]https://pmc.ncbi.nlm.nih.gov/articles/PMC10312435/

[L34]https://deepcell.org/about

[L35]https://www.nature.com/articles/s41592-022-01663-4

[L36]https://stardist.net/

[L37]https://www.nature.com/articles/s41467-024-51708-9

[L38]https://satijalab.org/seurat/

[L39]https://www.bioconductor.org/packages/release/bioc/html/SC3.html

[L40]Duò, Angelo, Mark D. Robinson, and Charlotte Soneson. "A systematic performance evaluation of clustering methods for single-cell RNA-seq data." F1000Research 7 (2018).

[L41]https://www.nature.com/articles/s41590-018-0276-y

[L42]https://www.bioconductor.org/packages/release/bioc/html/EasyCellType.html

[L43]https://www.nature.com/articles/s41598-021-04473-4

[L44]https://academic.oup.com/bib/article/23/2/bbab567/6502554

[L45]https://aacrjournals.org/cancerres/article/84/6_Supplement/863/741469/Abstract-863-Scratch-A-highly-modular-pipeline-for

[L46]https://www.nature.com/articles/s41586-019-0969-x

[L47]https://pubmed.ncbi.nlm.nih.gov/37248301/

[L48]https://immunarch.com/

[L49]https://www.nature.com/articles/s41467-021-21879-w

[L50]https://academic.oup.com/bioinformatics/article/37/24/4865/6300511

[L51]https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2021.640725/full

[L52]https://www.nature.com/articles/s41467-021-25006-7