Data integration with SCTransform-normalized single-cell data in SeuratV5

Although the official tutorial for the new version (v5) of Seurat has documented the new features in great detail, the standard workflow for working with the SCTransform normalization method¹ and multi-sample integration²_^,³ became scattered across multiple pages. This has made it slightly difficult for users to follow the procedures correctly and perform the analyses as desired.

Here in this tutorial, we will summarize the workflow for performing SCTransform and data integration using Seurat version 5.

We will utilize two publicly available datasets of zebrafish early embryos. Dataset 1 is from Wagner et al. Sciecne (2018)⁴, and dataset 2 is from Farrell et al. Science (2018)⁵. Both datasets include the developmental timepoint of 6 hpf (hours post fertilization). We will try to integrate the two datasets using Seurat version 5 after performing SCTransform normalization on each sample individualy.

As a demonstration, we will utilize the “merged” version of these two datasets made available by the TOME project⁶. The “merged” SeuratObject can be downloaded here. It already included all the cells from both projects into one SeuratObject.
Typically, merging datasets like these requires significant (and often arduous) preprocessing. You might need to remap all the raw data to a consistent version of the reference genome or, if working directly from mapped count matrices, reduce the datasets to genes common to both datasets. Fortunately, these steps have already been completed by the TOME project, simplifying our task considerably. TOME project, thank you so much!

Load data

The cells from the 6 hpf developmental timepoint can be found in the following file. Let’s load it into R.

seurat_object_hpf6.rds

library(Seurat)
seurat_object <- readRDS("/path/to/TOME_zebrafish/seurat_object_hpf6.rds")

# Check the Seurat version of the loaded SeuratObject
Version(seurat_object)
# [1] ‘3.2.3’

The loaded SeuratObject was created under Seurat version 3.2.3. Let’s update it to Seurat version 5.

# Update to SeuratV5
seurat_object_V5 <- UpdateSeuratObject(seurat_object)

# Check the Seurat version again
seurat_object_V5@version
# [1] ‘5.0.1’

Here is a preview of the loaded SeuratObject:

# Preview of the metadata data frame
seurat_object_V5@meta.data %>% str()

'data.frame':   6718 obs. of  8 variables:
 $ orig.ident  : Factor w/ 5 levels "DEW042","DEW043",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ nCount_RNA  : num  3657 4097 3646 2352 2009 ...
 $ nFeature_RNA: int  1548 1579 1492 1048 1007 1092 644 1624 686 815 ...
 $ sample      : Factor w/ 6718 levels "DEW042_AAAACCTCC_AACAATCC",..: 5693 5694 5695 5696 5697 5698 5699 5700 5701 5702 ...
 $ stage       : Factor w/ 1 level "hpf6": 1 1 1 1 1 1 1 1 1 1 ...
 $ group       : Factor w/ 2 levels "F","W": 1 1 1 1 1 1 1 1 1 1 ...
 $ cell_state  : Factor w/ 9 levels "hpf6:apoptotic like",..: 4 4 8 8 3 4 4 4 7 4 ...
 $ cell_type   : Factor w/ 9 levels "apoptotic like",..: 4 4 8 8 3 4 4 4 7 4 ...

The merged SeuratObject contains data from five samples. Next, we will integrate them

Steps before integration: Split data, normalization, and PCA

Before performing integration, the data first has to be split into individual samples (i.e. a separate count matrix for each sample). Next each sample is normalized using SCTransform, and then PCA is performed to reduce the dimensionality of the expression data.

First, to split the data into individual samples (sample info is stored in the orig.ident column in the @meta.data):

seurat_object_V5[["RNA"]] <- split(
	seurat_object_V5[["RNA"]], 
	f = seurat_object_V5$orig.ident\
)

Now the count matrix is split into multiple matrices, one for each sample.

An object of class Seurat 
20418 features across 6718 samples within 1 assay 
Active assay: RNA (20418 features, 0 variable features)
 10 layers present: counts.ZFS, counts.DEW042, counts.DEW043, counts.DEW044, counts.DEW045, data.ZFS, data.DEW042, data.DEW043, data.DEW044, data.DEW045

Next, perform SCTransform. By default, normalization will be performed on each sample separately. If you skip the split() step above, then SCTransform will be performed on all the samples together.

# SCTransform
seurat_object_V5 <- SCTransform(seurat_object_V5, vst.flavor = "v2")

The next step to perform is PCA. By default, the calculated SCTransform residual (stored as scale.data) will be served as the input to the PCA step.

# PCA
seurat_object_V5 <- RunPCA(seurat_object_V5)

Now the data is ready for performing integration.

Data without integration

Before performing integration, let’s look at what the data look like without integration first.

# Data without integration
obj <- seurat_object_V5

#### Identify cell clusters
obj <- FindNeighbors(obj, 
                     dims = 1:30, 
                     reduction = "pca")
obj <- FindClusters(obj, 
                    resolution = 0.8, 
                    algorithm = 1)
obj <- RunUMAP(obj, 
               dims = 1:30, 
               reduction = "pca", 
               reduction.name = "umap.unintegrated")

Visualize as UMAP plots (using scCustomize⁷ for generating prettier plots), showing (1) which sample each cell is from, (2) clustering, and (3) cell types annotated by the authors.

The plots show that even without integration, cells of the same type already clustered with each other.
However, some cells from the ZFS sample (which is from a different study) obviously did not merge with the other cells of the same type (e.g. cluster 4 colored in light green).

# Load scCustomize library
library(scCustomize)

# Set the color for each sample
colors_samples <- DiscretePalette_scCustomize(
    num_colors = 5,
    palette = "varibow"
)


# UMAP plot: show sample of origin
DimPlot_scCustom(
    obj,
    reduction = 'umap.unintegrated', 
    label = FALSE, 
    group.by = "orig.ident",
    colors_use = colors_samples,
    #pt.size = 0.8
    ) +
    xlab(glue("UMAP (unintegrated) 1")) +
    ylab(glue("UMAP (unintegrated) 2")) & 
    ggtitle(NULL) & coord_fixed()



# UMAP plot: show cell clusters
DimPlot_scCustom(
    obj,
    reduction = 'umap.unintegrated', 
    label = FALSE, 
    group.by = "seurat_clusters",
    #colors_use = colors,
    #pt.size = 0.8
    ) +
    xlab(glue("UMAP (unintegrated) 1")) +
    ylab(glue("UMAP (unintegrated) 2")) & 
    ggtitle(NULL) & coord_fixed()


# UMAP plot: color by annotated cell types
DimPlot_scCustom(
    obj,
    reduction = 'umap.unintegrated', 
    label = FALSE, 
    group.by = "cell_type",
    #colors_use = colors,
    #pt.size = 0.8
    ) +
    xlab(glue("UMAP (unintegrated) 1")) +
    ylab(glue("UMAP (unintegrated) 2")) & 
    ggtitle(NULL) & coord_fixed()

Data integration

Next, let’s perform a Seurat CCA integration. Remember to specify the normalization.method = "SCT" option inside the function.

# Data integration
obj <- seurat_object_V5

obj <- IntegrateLayers(
    object = obj, 
    method = CCAIntegration,
    orig.reduction = "pca", 
    new.reduction = "integrated.cca",
    normalization.method = "SCT",
    verbose = TRUE
)

Identify the cell clusters by the following code. Note that now the reduction to call became the integrated data.

obj <- FindNeighbors(obj, 
                     reduction = "integrated.cca", 
                     dims = 1:30)
obj <- FindClusters(obj, 
                    resolution = 0.8)
obj <- RunUMAP(obj, 
               reduction = "integrated.cca", 
               dims = 1:30, 
               reduction.name = "umap.integrated")

Now let’s visualize the clustering result again. The ZFS sample-specific clusters have disppeared and now merge quite well with the cells from the other samples.

# Plot: UMAP; colored by sample
DimPlot_scCustom(
    obj,
    reduction = 'umap.integrated', 
    label = FALSE, 
    group.by = "orig.ident",
    colors_use = colors_samples
    ) +
    xlab(glue("UMAP (integrated) 1")) +
    ylab(glue("UMAP (integrated) 2")) &
    ggtitle(NULL) & coord_fixed()


# Plot: UMAP; colored by cluster
DimPlot_scCustom(
    obj,
    reduction = 'umap.integrated', 
    label = FALSE, 
    group.by = "seurat_clusters",
    #colors_use = colors,
    #pt.size = 0.8
    ) +
    xlab(glue("UMAP (integrated) 1")) +
    ylab(glue("UMAP (integrated) 2")) &
    ggtitle(NULL) & coord_fixed()


# Plot: UMAP; colored by cell type
DimPlot_scCustom(
    obj,
    reduction = 'umap.integrated', 
    label = FALSE, 
    group.by = "cell_type",
    #colors_use = colors,
    #pt.size = 0.8
    ) +
    xlab(glue("UMAP (integrated) 1")) +
    ylab(glue("UMAP (integrated) 2")) &
    ggtitle(NULL) & coord_fixed()

Identification of differentially expressed genes

In order to identify the differnetially expressed genes for each cluster, we first have to merge multiple SCTransform-normalized models back into one model⁸. Then the remaining steps are just the same as the standard workflow.

There is an optional step to merge the individual count matrices back into one big matrix. This may be required for some of the subsequent analyses.
- For example, the FindConservedMarkers() function compares data in the "RNA" assay by default; this requires the raw count matrices to be joined in advance.

# Identify markers
#### First merging multiple SCT models
obj <- PrepSCTFindMarkers(obj, assay = "SCT", verbose = TRUE)

all_markers <- FindAllMarkers(
    obj, 
    only.pos = TRUE, 
    min.pct = 0.25,   # Set your own parameter
    logfc.threshold = 0.25   # Set your own parameter
)

#### Find conserved markers across samples
obj[['RNA']] <- JoinLayers(obj[['RNA']])

conserved_markers <- FindConservedMarkers(
    obj, 
    ident.1 = 8,
    grouping.var = "orig.ident",
    verbose = TRUE
)

Please don’t hesitate to contact me if you spot any typos or mistakes. My email is jason.ckleong AT gmail.com

Author: Jason Cheok Kuan Leong @RCIES, SOKENDAI, Japan

https://satijalab.org/seurat/articles/sctransform_vignette ↩︎
https://satijalab.org/seurat/articles/integration_introduction ↩︎
https://satijalab.org/seurat/articles/seurat5_integration ↩︎
Wagner, D. E. et al. Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo. Science 360, 981–987 (2018). ↩︎
Farrell, J. A. et al. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science 360, (2018). ↩︎
Qiu, C. et al. Systematic reconstruction of cellular trajectories across mouse embryogenesis. Nat Genet 54, 328-341 (2022). ↩︎
Marsh, S. E. scCustomize: Custom visualizations & functions for streamlined analyses of single cell sequencing. Zenodo (2021). https://doi.org/10.5281/zenodo.5706430. ↩︎
More details can be found here: https://satijalab.org/seurat/reference/prepsctfindmarkers ↩︎

BIOLOGIST J

Data integration with SCTransform-normalized single-cell data in SeuratV5

Leave a ReplyCancel reply

Data integration with SCTransform-normalized single-cell data in SeuratV5

Load data

Steps before integration: Split data, normalization, and PCA

Data without integration

Data integration

Identification of differentially expressed genes

Share this:

Like this:

Leave a ReplyCancel reply

Discover more from BIOLOGIST J