Calculate enrichment scores from scRNA-seq data

Input a Seurat object or scRNA-seq matrix, calculate the enrichment scores of AUCell, UCell, singscore, ssgsea, JASMINE and viper. Then, return a Seurat object including these score matrix.

irGSEA.score(
  object = NULL,
  assay = NULL,
  slot = "data",
  min.cells = 3,
  min.feature = 0,
  seeds = 123,
  ncores = 4,
  custom = F,
  geneset = NULL,
  geneset.weight = NULL,
  msigdb = T,
  species = "Homo sapiens",
  category = "H",
  subcategory = NULL,
  geneid = "symbol",
  minGSSize = 1,
  maxGSSize = 500,
  chunk = F,
  chunk.size = 5000,
  add = F,
  overwrite = T,
  method = c("AUCell", "UCell", "singscore", "ssgsea"),
  aucell.MaxRank = NULL,
  ucell.MaxRank = NULL,
  kcdf = "Gaussian",
  JASMINE.method = "oddsratio",
  VISION.latentSpace = NULL,
  VISION.dimRed = NULL,
  VISION.dimRedComponents = NULL
)

Arguments

object

Seurat object (V4 or V5), Assay object (V4 or V5), scRNA-seq sparse matrix/matrix/data.frame (Genes X Cells).

assay

Name of assay to use, defaults to the active assay. The parameter works when a seurat object is input. If input assay object, the assay object will regarded as RNA assay and included in output Seurat object.

slot

Default data. The parameter works if a seurat object is input. AddModuleScore, VAM should be data, and VISION, gficf, pagoda2 should be counts.

min.cells

The minimum detected cells per gene, default 3. The parameter worsk if a scRNA-seq matrix is input.

min.feature

The minimum genes per cell, default 0. The parameter works if a scRNA-seq matrix is input.

seeds

Default 123

ncores

Default 4

custom

Default False. Set it to true when input own genesets.

geneset

Default NULL. Input own genesets as a list. Each element in the list is a gene set. You can also specify positive and negative genes by adding a + or - sign in special gene set.

geneset.weight

Default NULL. Input is a list and the length of list should be equal to geneset. Each element in the list represent a gene set. The element is a vector. And the name of vector are gene names. And you can also specify the weight of each gene by a specific number. The parameter works if parameter "geneset" is not null and "wsum", "wmean", "mdt", "viper", "Sargent" are selected in "method".

msigdb

Default True. You can acquire the collection gene sets from msigdb database. It will be ignored when custom is set to true.

species

Default Homo sapiens. Use msigdbr::msigdbr_show_species() to view all available species. The parameter works if msigdb is True.

category

Default H. You can acquire the Hallmarker gene sets. Use msigdbr::msigdbr_collections to view all available collections gene sets. The parameter works if msigdb is True.

subcategory

Default NULL. The parameter works if msigdb is True.

geneid

Default symbol. Other options are "entrez" and "ensembl". The parameter works if msigdb is True.

minGSSize

Minimum number of genes in one gene set.

maxGSSize

Maximal number of genes in one gene set.

chunk

Divide the matrix according to chunks before scoring. The parameter works for AUCell, UCell, singscore, ssgsea, JASMINEh and viper. Default False. If the number of cells is greater than 50000, the parameter works.

chunk.size

Number of cells to be processed simultaneously (lower size requires slightly more computation but reduces memory demands). Default 5000.

add

Whether to add the new gene set scoring matrix based on the original gene set scoring matrix. Default False.

overwrite

Default True. The parameter works while the add is true. The same geneset name exists in two gene scoring matrices, the newly added geneset will overwrite the previous geneset if the overwrite is true. The newly added geneset will be forcibly renamed as "geneset's name + serial number" if the overwrite is false.

method

A vector. Default c("AUCell", "UCell", "singscore", "ssgsea", "JASMINE", "viper"). `AUCell (https://doi.org/10.1038/nmeth.4463)`: AUCell uses the area-under-the-curve (AUC) to calculate whether a gene set is enriched within the molecular readouts of each cell. AUC can be calculated using by default thetop 5 in the ranking.

`UCell (https://doi.org/10.1016/j.csbj.2021.06.043)`: UCell calculates gene signature scores based on the Mann-Whitney U statistic. And the U statistic is closely related to the area-under-the-curve (AUC) statistic for ROC curves

`singscore (https://doi.org/10.1093/nar/gkaa802)`: Singscore uses rank-based statistics to analyze each sample’s gene expression profile and scores the expression activities of gene sets at a single-sample level.

`ssgsea (https://doi.org/10.1038/nature08460)`: ssGSEA ranks gene expression within each cell separately, then the enrichment score of the gene set of each cell is calculated by K-S like random walk statistic.

`JASMINE (https://doi.org/10.7554/eLife.71994)`: JASMINE calculates the approximate mean using gene ranks among expressed genes and the enrichment of the signature in expressed genes. The two are then scaled to 0–1 and averaged to result in the final JASMINE score. JASMINE considers the enrichment of signature genes in expressed genes to counter dropout effects, and meanwhile, evaluates the average expression level of the expressed signature genes.

`VAM (https://doi.org/10.1093/nar/gkaa582)`: VAM generates scores from scRNA-seq data using a variation of the classic Mahalanobis multivariate distance measure.

`scSE (https://doi.org/10.1093/nar/gkz601)`: scSE, single cell signature explorer, measures a signature using normalized total expression of the signature genes.

`VISION (https://doi.org/10.1038/s41467-019-12235-0)`: Scores in Vision are calculated by averaging expressed genes for each gene set. To account for the influence of sample-level metrics (the number of UMIs/reads per cells), scores are then corrected by their means and standard deviations.

`wsum (https://doi.org/10.1093/bioadv/vbac016)`: First, multiply each gene by its associated weight which then are summed to an enrichment score wsum. Furthermore, permutations of random gene can be performed to obtain a null distribution that can be used to compute a z-score norm_wsum, or a corrected estimate corr_wsum by multiplying wsum by the minus log10 of the obtained empirical p-value. We use corr_wsum as the enrichment score for the gene set.

`wmean (https://doi.org/10.1093/bioadv/vbac016)`: Weighted Mean (WMEAN) is similar to WSUM but it divides the obtained enrichment score by the sum of the absolute value of weights

`mdt (https://doi.org/10.1093/bioadv/vbac016)`: MDT fits a multivariate regression random forest for each sample.

`viper (https://doi.org/10.1038/ng.3593)`: VIPER estimates biological activities by performing a three-tailed enrichment score calculation. First by a one-tail approach, based on the absolute value of the gene expression signature (i.e., genes are rank-sorted from the less invariant between groups to the most differentially expressed, regardless of the direction of change); and then by a two-tail approach, where the positions of the genes whose expression is repressed by the regulator are inverted in the gene expression signature before computing the enrichment score. The one-tail and two-tail enrichment score estimates are integrated while weighting their contribution based on the estimated mode of regulation through three-tail approach. The contribution of each target gene from a given regulon to the enrichment score is also weighted based on the regulator-target gene interaction confidence.

`GSVApy (https://doi.org/10.1038/ng.3593)`: The python version of GSVA is wrapped by the decoupler-py package.

`gficf (https://doi.org/10.1093/nargab/lqad024)`: gficf takes advantage of the informative biological signals spreading across the latent factors of gene expression values obtained from non-negative matrix factorization. It uses NMF and FGSEA to estimate enrichment scores.

`GSVA (https://doi.org/10.1186/1471-2105-14-7)`: GSVA, Gene Set Variation Analysis, starts by transforming the input molecular readouts matrix to a readout-level statistic using Gaussian kernel estimation of the cumulative density function. Then, readout-level statistics are ranked per sample and normalized to up-weight the two tails of the rank distribution. Afterwards, an enrichment score (ES) is calculated as in GSEA, using the running sum statistic. Finally, the ES can be normalized by subtracting the largest negative ES from the largest positive ES.

`zscore (https://doi.org/10.1371/journal.pcbi.1000217)`: zscore aggregates the expression of several genes of the gene set. The gene expression is scaled by mean and standard deviation over cells. Then, the enrichment scores for each cell are calculated by averaging scaled gene expression of all genes within each gene set.

`plage (https://doi.org/10.1186/1471-2105-6-225)`: plag captures enrichment scores from singular value decompositions (SVD) strategy. PLAGE first standardizes gene expression matrix across cells. For a submatrix which genes in a particular gene set, the first coefficient of right-singular vector in SVD of this matrix is extracted as enrichment scores.

`ssGSEApy (https://doi.org/10.1093/bioinformatics/btac757)`: The python version of ssGSEA is wrapped by the python package GSEApy.

`AddModuleScore (https://doi.org/10.1126/science.aad0501)`: Calculate the average expression levels of each program (cluster) on single cell level, subtracted by the aggregated expression of control feature sets. All analyzed features are binned based on averaged expression, and the control features are randomly selected from each bin.

`pagoda2 (https://doi.org/10.1038/nbt.4038)`: pagoda2 fits an error model for each cell to depict its properties, and residual variance of each gene in the cell is re-normalized subsequently. Then, the enrichment scores of each gene set is quantified by its first weighted principal component.

`Sargent (https://doi.org/10.1016/j.mex.2023.102196)`: Sargent sorts non-zero expressed genes from high to low expression for a given cell, and transforms an input gene-by-cell expression matrix into a corresponding gene-set-by-cell assignment score matrix. Then, Sargent calculates a gini-index among assignment scores per cell, transforming the gene-set-by-cell assignment score matrix to a distribution of indexes.

aucell.MaxRank

The threshold to calculate the AUC. Default only the top 5 of the expressed genes are used to checks whether the gene set are within the top 5 to change the threshold. The parameter works if "AUCell" is selected in "method".

ucell.MaxRank

Maximum number of genes to rank per cell. Above this rank, a given gene is considered as not expressed. Default 1500 when it set to NULL. You can input special number, such as 2000, to increase the rank range. The parameter works if "UCell" is selected in "method".

kcdf

Default Gaussian if input expression values are continuous in logarithmic scale. Other option is "Poisson" if input expression values are integer counts. The parameter works if "ssgsea" is selected in "method".

JASMINE.method

Default oddsratio. You can choose oddsratio or likelihood.

VISION.latentSpace

latent space for expression data. Numeric matrix or dataframe with dimensions CELLS x COMPONENTS. Latent space will be used to create the neighbors graph. if a latent space has been computed elsewhere (either via PCA, or other factor analysis methods such as Harmony, ZIFA, ZINB-WaVE or scVI) it can be provided via the latentSpace argument as a data frame. When the input to VISION.latentSpace is present, the inputs to VISION.dimRed and VISION.dimRedComponents are ignored.

VISION.dimRed

Dimensionality reduction of Seurat object to use for the latentSpace. Default is to look for "pca" and use that if it exists. Of course, you can also enter "harmony", if you want to execute VISION in the harmony dimension.

VISION.dimRedComponents

number of components to use for the selected dimensionality reduction. Default is to use all components.

Value

Seurat object including score matrix. If input Seurat object (V4) or Assay object (V4), return Seurat object (V4) including score matrix. If input Seurat object (V5) or Assay object (V5), return Seurat object (V5) including score matrix.

Examples

if (FALSE) {
# load PBMC dataset by R package SeuratData
library(Seurat)
library(SeuratData)
# download 3k PBMCs from 10X Genomics
InstallData("pbmc3k")
library(Seurat)

# Seurat object (V4)
library(RcppML)
library(irGSEA)
data("pbmc3k.final")
pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
msigdb = T, species = "Homo sapiens", category = "H", geneid = "symbol",
method = c("AUCell", "UCell", "singscore", "ssgsea"), kcdf = 'Gaussian')


# Seurat object (V5)
data("pbmc3k.final")
pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)
pbmc3k.final2 <- CreateSeuratObject(counts = CreateAssay5Object(
GetAssayData(pbmc3k.final, assay = "RNA", slot = "counts")),
meta.data = pbmc3k.final[[]])
pbmc3k.final2 <- NormalizeData(pbmc3k.final2)
pbmc3k.final2 <- irGSEA.score(object = pbmc3k.final2, assay = "RNA", slot = "data",
msigdb = T, species = "Homo sapiens", category = "H", geneid = "symbol",
method = c("AUCell", "UCell", "singscore", "ssgsea"), kcdf = 'Gaussian')

# Assay object (V5)
data("pbmc3k.final")
pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)
pbmc3k.final3 <- CreateAssay5Object(counts = GetAssayData(pbmc3k.final,
assay = "RNA", slot = "counts"))
pbmc3k.final3 <- NormalizeData(pbmc3k.final3)
pbmc3k.final3 <- irGSEA.score(object = pbmc3k.final3, assay = "RNA", slot = "data",
msigdb = T, species = "Homo sapiens", category = "H", geneid = "symbol",
method = c("AUCell", "UCell", "singscore", "ssgsea"), kcdf = 'Gaussian')

# Data.fram, Matrix, or dgmatrix
pbmc3k.final2 <- irGSEA.score(object = GetAssayData(pbmc3k.final,
assay = "RNA", slot = "counts"),
assay = "RNA", slot = "data", min.cells = 3, min.feature = 0,
method = c("AUCell", "UCell", "singscore", "ssgsea"), kcdf = 'Poisson')

# custom geneset
markers <- list()
markers$Tcell_gd <- c("TRDC+", "TRGC1+", "TRGC2+", "TRDV1+","TRAC-","TRBC1-","TRBC2-")
markers$Tcell_NK <- c("FGFBP2+", "SPON2+", "KLRF1+", "FCGR3A+", "CD3E-","CD3G-")
markers$Tcell_CD4 <- c("CD4","CD40LG")
markers$Tcell_CD8 <- c("CD8A","CD8B")
markers$Tcell_Treg <- c("FOXP3","IL2RA")
pbmc3k.final3 <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = markers, method = c("AUCell", "UCell", "singscore", "ssgsea"),
kcdf = 'Gaussian')
}