Cell type enrichment analysis and cellular deconvolution are essential for understanding the heterogeneity of complex tissues in bulk transcriptomics data. xCell2 is an R package that builds upon the original xCell methodology (Aran et al., 2017), offering improved algorithms and enhanced performance. The key advancement in xCell 2.0 is its genericity: users can now utilize any reference, including single-cell RNA-Seq data, to train an xCell2 reference object for analysis.
This package is particularly useful for researchers working with bulk transcriptomics data who want to infer the cellular composition of their samples. By leveraging reference data from various sources, xCell 2.0 offers a flexible and powerful tool for understanding the cellular heterogeneity in complex tissues.
This vignette provides an overview of the package’s features and step-by-step guidance on:
- Preparing input data
- Generating custom xCell2 reference objects
- Performing cell type enrichment analysis
- Interpreting results and best practices
Whether you are new to cell type deconvolution or an experienced bioinformatician, this guide will help you leverage xCell2 effectively in your research.
To install xCell2 from Bioconductor, use:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("xCell2")
To install the development version from GitHub, use:
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("AlmogAngel/xCell2")
xCell2 relies on several Bioconductor packages. Most dependencies are installed automatically. If you encounter any issues, you may need to install some packages manually, particularly ontoProc (version 1.26.4 or higher).
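One way to confirm the ontoProc requirement before loading the package is to compare the installed version against the minimum. The helper below, has_min_version, is a hypothetical convenience function and not part of xCell2; it is a minimal sketch that uses only base R utilities.

```r
# Hypothetical helper (not part of xCell2): TRUE only if `pkg` is installed
# and its version is at least `min_version`.
has_min_version <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  utils::packageVersion(pkg) >= min_version
}

# Print a hint when the ontoProc requirement stated above is not met
if (!has_min_version("ontoProc", "1.26.4")) {
  message(
    "ontoProc is missing or older than 1.26.4; ",
    "consider running BiocManager::install(\"ontoProc\")"
  )
}
```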
After installation, load the package with:
library(xCell2)
xCell2Train
One of the key features of xCell2 is the ability to create custom reference objects tailored to your specific research needs. This section will guide you through the process of generating a custom xCell2 reference object using the xCell2Train function.
Creating a custom reference allows you to:
- Incorporate cell types specific to your research area
- Use the latest single-cell RNA-seq data as a reference
- Adapt the tool for non-standard organisms or tissues
Before using xCell2Train, you need to prepare two key inputs:
1. Reference Gene Expression Matrix:
- Can be generated from various platforms: microarray, bulk RNA-Seq, or single-cell RNA-Seq
- Genes should be in rows, samples/cells in columns
- Should be normalized to both gene length and library size. Can be in either linear or logarithmic space.
2. Labels Data Frame, with the following columns:
- "ont": Cell type ontology (e.g., “CL:0000545” or NA if not applicable)
- "label": Cell type name (e.g., “T-helper 1 cell”)
- "sample": Sample/cell identifier matching column names in the reference matrix
- "dataset": Source dataset or subject identifier
Let’s walk through an example using a subset of the Database of Immune Cell Expression (DICE):
# Load the demo data
data(dice_demo_ref, package = "xCell2")
# Extract reference matrix
dice_ref <- as.matrix(dice_demo_ref@assays@data$logcounts)
colnames(dice_ref) <- make.unique(colnames(dice_ref))
# Prepare labels data frame
dice_labels <- as.data.frame(dice_demo_ref@colData)
dice_labels$ont <- NA
dice_labels$sample <- colnames(dice_ref)
dice_labels$dataset <- "DICE"
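Before training, it can help to confirm that the two inputs agree with each other. The sketch below uses a hypothetical helper, check_reference_inputs, which is not part of xCell2, and toy data with made-up values. It only checks the requirements described in this section (the four label columns, named rows and columns, and sample identifiers that match the matrix columns); xCell2Train may perform additional checks of its own.

```r
# Hypothetical helper (not part of xCell2): sanity-check a reference matrix
# and its labels data frame before training.
check_reference_inputs <- function(ref, labels) {
  required <- c("ont", "label", "sample", "dataset")
  missing_cols <- setdiff(required, colnames(labels))
  if (length(missing_cols) > 0) {
    stop("labels is missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (is.null(rownames(ref)) || is.null(colnames(ref))) {
    stop("ref needs gene names as row names and sample/cell IDs as column names")
  }
  if (anyDuplicated(colnames(ref)) > 0) {
    stop("ref has duplicated column names; consider make.unique()")
  }
  if (!setequal(labels$sample, colnames(ref))) {
    stop("labels$sample does not match the column names of ref")
  }
  invisible(TRUE)
}

# Toy inputs (made-up values, for illustration only): 2 genes x 3 samples
toy_ref <- matrix(
  c(5, 0, 2, 7, 1, 3),
  nrow = 2,
  dimnames = list(c("GENE1", "GENE2"), c("s1", "s2", "s3"))
)
toy_labels <- data.frame(
  ont = NA,
  label = c("B cells", "B cells", "Monocytes"),
  sample = colnames(toy_ref),
  dataset = "toy"
)

# Returns invisibly when all checks pass, stops with a message otherwise
check_reference_inputs(toy_ref, toy_labels)
```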
You can skip the following step if:
- You don’t want to use ontology to avoid cell type dependencies (not recommended)
- You are sure that there are no cell type dependencies in your reference
To improve the quality of your custom reference, assign cell type ontologies using a controlled vocabulary:
dice_labels[dice_labels$label == "B cells", ]$ont <- "CL:0000236"
dice_labels[dice_labels$label == "Monocytes", ]$ont <- "CL:0000576"
dice_labels[dice_labels$label == "NK cells", ]$ont <- "CL:0000623"
dice_labels[dice_labels$label == "T cells, CD8+", ]$ont <- "CL:0000625"
dice_labels[dice_labels$label == "T cells, CD4+", ]$ont <- "CL:0000624"
dice_labels[dice_labels$label == "T cells, CD4+, memory", ]$ont <- "CL:0000897"
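When many cell types need an ontology ID, a named lookup vector can replace the repeated assignments. The sketch below is an alternative under the assumption that the label column holds plain character strings; toy_labels and ont_map are illustrative names, and the IDs are the ones used in the assignments above.

```r
# Toy labels data frame standing in for dice_labels (illustration only)
toy_labels <- data.frame(
  label = c("B cells", "Monocytes", "NK cells", "B cells"),
  ont = NA_character_
)

# Named vector mapping each label to its Cell Ontology ID
ont_map <- c(
  "B cells" = "CL:0000236",
  "Monocytes" = "CL:0000576",
  "NK cells" = "CL:0000623"
)

# Look up every label at once; labels absent from the map stay NA.
# as.character() guards against a factor column being used as integer codes.
toy_labels$ont <- unname(ont_map[as.character(toy_labels$label)])
toy_labels
```

Keeping the mapping in one place makes it easier to review the IDs against the ontology before training.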
Use the xCell2GetLineage function to check cell type dependencies:
xCell2::xCell2GetLineage(labels = dice_labels, outFile = "demo_dice_dep.tsv")
## loading from cache
## Warning in xCell2::xCell2GetLineage(labels = dice_labels, outFile =
## "demo_dice_dep.tsv"): It is recommended that you manually check the cell type
## lineage file: demo_dice_dep.tsv
Open demo_dice_dep.tsv and verify the lineage assignments.
Note that “T cells, CD4+, memory” is assigned as a descendant of “T cells, CD4+”.
With our inputs prepared, we can now create the xCell2 reference object. Simply run the following command:
set.seed(123) # (optional) For reproducibility
DICE_demo.xCell2Ref <- xCell2::xCell2Train(
  ref = dice_ref,
  labels = dice_labels,
  refType = "rnaseq"
)
## Finding dependencies using cell type ontology...
## loading from cache
## Generating signatures...
## Learning linear transformation and spillover parameters...
## Your custom xCell2 reference object is ready!
## > Please consider sharing with others here: https://dviraran.github.io/xCell2ref
Note that we set a seed for reproducibility, because generating pseudo-bulk samples from an scRNA-Seq reference relies on random sampling of cells.
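The effect of the seed can be illustrated with a toy pseudo-bulk step. The function make_pseudobulk below is hypothetical and is not xCell2's internal code; it is a minimal sketch, assuming a pseudo-bulk sample is the sum of counts over randomly chosen cells, that shows why the same seed gives the same sample.

```r
# Toy single-cell count matrix: 3 genes x 10 cells (made-up values)
set.seed(1)
toy_counts <- matrix(
  rpois(30, lambda = 5),
  nrow = 3,
  dimnames = list(paste0("GENE", 1:3), paste0("cell", 1:10))
)

# Hypothetical pseudo-bulk step: sum the counts of n_cells randomly chosen cells.
# This only illustrates why a seed matters.
make_pseudobulk <- function(counts, n_cells, seed) {
  set.seed(seed)
  picked <- sample(colnames(counts), n_cells)
  rowSums(counts[, picked, drop = FALSE])
}

pb1 <- make_pseudobulk(toy_counts, n_cells = 4, seed = 123)
pb2 <- make_pseudobulk(toy_counts, n_cells = 4, seed = 123)

# Same seed, same sampled cells, same pseudo-bulk profile
identical(pb1, pb2)
```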
Key Parameters of xCell2Train:
- ref: Your prepared reference gene expression matrix
- labels: The labels data frame you created
- refType: Type of reference data (“rnaseq”, “array”, or “sc”)
- useOntology: Whether to use ontological integration (default: TRUE)
- numThreads: Number of threads for parallel processing
For a full list of parameters and their descriptions, refer to the xCell2Train function documentation.
After creating your custom reference, you can use it for cell type enrichment analysis with xCell2Analysis. We’ll cover this in the next section. Remember, creating a robust reference is crucial for accurate results. Take time to ensure your input data is high-quality and properly annotated.
xCell2 offers pre-trained reference objects that can be easily downloaded and used for your analysis. These references cover various tissue types and are based on well-curated datasets.
Dataset | Study | Species | Normalization | nSamples/Cells | nCellTypes | Platform | Tissues |
---|---|---|---|---|---|---|---|
BlueprintEncode | Martens JHA and Stunnenberg HG (2013), The ENCODE Project Consortium (2012), Aran D (2019) | Homo sapiens | TPM | 259 | 43 | RNA-seq | Mixed |
ImmGenData | The Immunological Genome Project Consortium (2008), Aran D (2019) | Mus musculus | RMA | 843 | 19 | Microarray | Immune/Blood |
Immune Compendium | Zaitsev A (2022) | Homo sapiens | TPM | 3626 | 40 | RNA-seq | Immune/Blood |
LM22 | Chen B (2019) | Homo sapiens | RMA | 113 | 22 | Microarray | Mixed |
MouseRNAseqData | Benayoun B (2019) | Mus musculus | TPM | 358 | 18 | RNA-seq | Mixed |
Pan Cancer | Nofech-Mozes I (2023) | Homo sapiens | Counts | 25084 | 29 | scRNA-seq | Tumor |
Tabula Muris Blood | The Tabula Muris Consortium (2018) | Mus musculus | Counts | 11145 | 6 | scRNA-seq | Bone Marrow, Spleen, Thymus |
Tabula Sapiens Blood | The Tabula Sapiens Consortium (2022) | Homo sapiens | Counts | 11921 | 18 | scRNA-seq | Blood, Lymph_Node, Spleen, Thymus, Bone Marrow |
TME Compendium | Zaitsev A (2022) | Homo sapiens | TPM | 8146 | 25 | RNA-seq | Tumor |
You can also quickly access popular pre-trained references that are available within the xCell2 package:
data(BlueprintEncode.xCell2Ref)
Or download a pre-trained reference directly within R using the download.file() function:
# Set the URL of the pre-trained reference
ref_url <- "https://dviraran.github.io/xCell2refs/references/BlueprintEncode.xCell2Ref.rds"
# Set the local filename to save the reference
local_filename <- "BlueprintEncode.xCell2Ref.rds"
# Download the file
download.file(ref_url, local_filename, mode = "wb")
# Load the downloaded reference
BlueprintEncode.xCell2Ref <- readRDS(local_filename)
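To avoid downloading the same file on every run, the download can be wrapped in a small caching helper. The function load_cached_reference below is hypothetical and not part of xCell2; it is a sketch that skips the download when the file already exists locally and then reads it with readRDS().

```r
# Hypothetical helper (not part of xCell2): download an .rds reference only
# if it is not already on disk, then load it.
load_cached_reference <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")
  }
  readRDS(dest)
}

# Usage with the URL from the example above (requires network access):
# BlueprintEncode.xCell2Ref <- load_cached_reference(
#   "https://dviraran.github.io/xCell2refs/references/BlueprintEncode.xCell2Ref.rds",
#   "BlueprintEncode.xCell2Ref.rds"
# )
```

Delete the local file to force a fresh download, for example after a reference has been updated.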
Remember to choose a reference that’s appropriate for your specific tissue type and experimental context. The choice of reference can impact your results, so it’s important to select one that closely matches your biological system.
xCell2Analysis
After creating or obtaining an xCell2 reference object, the next step is to use it for cell type enrichment analysis on your bulk RNA-seq data. This section will guide you through using the xCell2Analysis function and interpreting its results.
Before running the analysis, ensure you have:
1. An xCell2 reference object
2. A bulk gene expression matrix to analyze
For this example, we’ll use a pre-loaded demo reference and a sample bulk expression dataset:
# Load the demo reference object
data(DICE_demo.xCell2Ref, package = "xCell2")
# Load a sample bulk expression dataset
data(mix_demo, package = "xCell2")
xCell2Analysis
Now, let’s perform the cell type enrichment analysis:
xcell2_results <- xCell2::xCell2Analysis(
  mix = mix_demo,
  xcell2object = DICE_demo.xCell2Ref
)
Key Parameters:
- mix: Your bulk mixture gene expression data (genes in rows, samples in columns)
- xcell2object: An S4 object of class xCell2Object (your reference)
- minSharedGenes: Minimum fraction of shared genes required (default: 0.9)
- spillover: Whether to use spillover correction (default: TRUE)
- numThreads: Number of threads for parallel processing (default: 1)
For a full list of parameters and their descriptions, refer to the xCell2Analysis function documentation.
The xCell2Analysis function returns a matrix of cell type enrichment scores:
- Rows represent cell types
- Columns represent samples from your input mixture
- Higher scores indicate a stronger presence of that cell type in the sample
Important considerations:
- Scores are relative, not absolute proportions
- Compare scores across samples to identify differences in cell type composition
- Consider the biological context of your samples when interpreting results
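The following sketch shows one way to read such a matrix. The object toy_scores is a made-up stand-in with the same layout as the result described above (cell types in rows, samples in columns); its values are not real xCell2 output. Because scores are relative, the comparisons are made across samples within a cell type.

```r
# Toy score matrix: cell types in rows, samples in columns (values are made up).
# matrix() fills column by column, so each line below is one sample.
toy_scores <- matrix(
  c(0.30, 0.05, 0.10,   # sampleA
    0.10, 0.25, 0.02,   # sampleB
    0.20, 0.15, 0.40),  # sampleC
  nrow = 3,
  dimnames = list(
    c("B cells", "Monocytes", "NK cells"),
    c("sampleA", "sampleB", "sampleC")
  )
)

# Within one cell type, rank samples from highest to lowest score
sort(toy_scores["Monocytes", ], decreasing = TRUE)

# Sample with the highest score for each cell type
top_sample <- colnames(toy_scores)[apply(toy_scores, 1, which.max)]
names(top_sample) <- rownames(toy_scores)
top_sample
```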
Once you have your xCell2 results, you can:
- Correlate cell type enrichment scores with clinical or experimental variables
- Perform differential enrichment analysis between sample groups
- Use the scores as features for machine learning models
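As a rough illustration of the first two options, the sketch below correlates the scores of one cell type with a continuous variable and compares them between two groups. All values, the variable names (nk_scores, age, group), and the choice of rank-based tests are assumptions made for this example; they are not prescribed by xCell2.

```r
# Toy scores for one cell type across six samples (made-up values)
nk_scores <- c(s1 = 0.10, s2 = 0.12, s3 = 0.15, s4 = 0.30, s5 = 0.35, s6 = 0.40)

# Hypothetical sample annotations, in the same order as the scores
age <- c(30, 35, 41, 52, 60, 66)
group <- factor(c("control", "control", "control", "treated", "treated", "treated"))

# Rank-based correlation with a continuous variable
age_cor <- cor.test(nk_scores, age, method = "spearman")
age_cor$estimate

# Non-parametric comparison between two sample groups
group_test <- wilcox.test(nk_scores ~ group)
group_test$p.value
```

With real data, repeating such tests over many cell types calls for multiple-testing correction, for example with p.adjust().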
Remember, xCell2 provides estimates of relative cell type abundance. For absolute quantification, additional experimental validation may be necessary.
If you encounter issues:
- Ensure your input data is properly formatted
- Check that your mix and reference use the same gene annotation system
- Try adjusting the minSharedGenes parameter if many genes are missing
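A quick overlap check can reveal an annotation mismatch before running the analysis. The helper shared_gene_fraction below is hypothetical and not part of xCell2, and the gene identifiers are toy values; how xCell2Analysis itself computes the shared fraction may differ from this sketch.

```r
# Hypothetical helper (not part of xCell2): fraction of reference genes that
# also appear in the mixture.
shared_gene_fraction <- function(mix_genes, ref_genes) {
  ref_genes <- unique(ref_genes)
  mean(ref_genes %in% mix_genes)
}

# Toy gene identifiers (illustration only)
ref_genes <- c("CD19", "CD3E", "CD14", "NKG7")
mix_symbols <- c("CD19", "CD3E", "CD14", "GAPDH")
mix_ensembl <- c("ENSG00000177455", "ENSG00000198851")  # Ensembl-style IDs

shared_gene_fraction(mix_symbols, ref_genes)  # 3 of 4 reference genes found
shared_gene_fraction(mix_ensembl, ref_genes)  # no overlap: annotation mismatch
```

A fraction near zero usually points to different identifier systems (for example symbols versus Ensembl IDs) and is better fixed by converting identifiers than by lowering minSharedGenes.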
For more detailed troubleshooting, refer to the package documentation or seek help on the xCell2 GitHub issues page.
If you use xCell2 in your research, please cite:
Angel A, Naom L, Nabel-Levy S, Aran D. xCell 2.0: Robust Algorithm for Cell Type Proportion Estimation Predicts Response to Immune Checkpoint Blockade. bioRxiv 2024.
Aran, D., Hu, Z., & Butte, A. J. (2017). xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome biology, 18, 1-14.
Aran, D. (2021). Extracting insights from heterogeneous tissues. Nature Computational Science, 1(4), 247-248.
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Jerusalem
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xCell2_0.99.0 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 GSEABase_1.67.0
## [3] rlang_1.1.4 magrittr_2.0.3
## [5] matrixStats_1.4.0 compiler_4.4.1
## [7] RSQLite_2.3.7 reshape2_1.4.4
## [9] png_0.1-8 vctrs_0.6.5
## [11] RcppZiggurat_0.1.6 quadprog_1.5-8
## [13] stringr_1.5.1 pkgconfig_2.0.3
## [15] crayon_1.5.3 fastmap_1.2.0
## [17] dbplyr_2.5.0 XVector_0.45.0
## [19] utf8_1.2.4 promises_1.3.0
## [21] rmarkdown_2.28 tzdb_0.4.0
## [23] pracma_2.4.4 graph_1.83.0
## [25] UCSC.utils_1.1.0 purrr_1.0.2
## [27] bit_4.0.5 xfun_0.47
## [29] Rfast_2.1.0 zlibbioc_1.51.1
## [31] cachem_1.1.0 GenomeInfoDb_1.41.1
## [33] jsonlite_1.8.8 blob_1.2.4
## [35] later_1.3.2 DelayedArray_0.31.11
## [37] BiocParallel_1.39.0 parallel_4.4.1
## [39] singscore_1.25.0 R6_2.5.1
## [41] stringi_1.8.4 bslib_0.8.0
## [43] limma_3.61.9 reticulate_1.39.0
## [45] GenomicRanges_1.57.1 jquerylib_0.1.4
## [47] Rcpp_1.0.13 bookdown_0.40
## [49] SummarizedExperiment_1.35.1 knitr_1.48
## [51] readr_2.1.5 IRanges_2.39.2
## [53] httpuv_1.6.15 Matrix_1.7-0
## [55] igraph_2.0.3 tidyselect_1.2.1
## [57] rstudioapi_0.16.0 abind_1.4-5
## [59] yaml_2.3.10 codetools_0.2-20
## [61] minpack.lm_1.2-4 curl_5.2.2
## [63] plyr_1.8.9 lattice_0.22-6
## [65] tibble_3.2.1 withr_3.0.1
## [67] Biobase_2.65.1 shiny_1.9.1
## [69] KEGGREST_1.45.1 evaluate_0.24.0
## [71] ontologyIndex_2.12 RcppParallel_5.1.9
## [73] BiocFileCache_2.13.0 Biostrings_2.73.1
## [75] pillar_1.9.0 BiocManager_1.30.25
## [77] filelock_1.0.3 MatrixGenerics_1.17.0
## [79] DT_0.33 stats4_4.4.1
## [81] generics_0.1.3 vroom_1.6.5
## [83] ggplot2_3.5.1 BiocVersion_3.20.0
## [85] S4Vectors_0.43.2 hms_1.1.3
## [87] munsell_0.5.1 scales_1.3.0
## [89] xtable_1.8-4 glue_1.7.0
## [91] tools_4.4.1 ontologyPlot_1.7
## [93] AnnotationHub_3.13.3 ontoProc_1.27.4
## [95] locfit_1.5-9.10 annotate_1.83.0
## [97] XML_3.99-0.17 grid_4.4.1
## [99] tidyr_1.3.1 edgeR_4.3.14
## [101] colorspace_2.1-1 AnnotationDbi_1.67.0
## [103] GenomeInfoDbData_1.2.12 cli_3.6.3
## [105] rappdirs_0.3.3 fansi_1.0.6
## [107] S4Arrays_1.5.7 dplyr_1.1.4
## [109] Rgraphviz_2.49.0 gtable_0.3.5
## [111] sass_0.4.9 digest_0.6.37
## [113] BiocGenerics_0.51.1 SparseArray_1.5.31
## [115] paintmap_1.0 htmlwidgets_1.6.4
## [117] memoise_2.0.1 htmltools_0.5.8.1
## [119] lifecycle_1.0.4 httr_1.4.7
## [121] statmod_1.5.0 mime_0.12
## [123] bit64_4.0.5