Label Projection
Automated cell type annotation from rich, labeled reference data
8 datasets · 16 methods · 2 control methods · 3 metrics
A major challenge for integrating single-cell datasets is creating matching cell type annotations for each cell. One of the most common strategies for annotating cell types is “cluster-then-annotate”, whereby cells are aggregated into clusters based on feature similarity and then manually characterised based on differential gene expression or previously identified marker genes. Recently, methods have emerged that build on this strategy by annotating cells automatically using known marker genes. However, these strategies make integrating atlas-scale datasets difficult, as annotations produced independently for each dataset may not match.
To ensure that the cell type labels in newly generated datasets match existing reference datasets, some methods align cells to a previously annotated reference dataset and then project labels from the reference to the new dataset.
Here, we compare methods for annotation based on a reference dataset. The datasets consist of two or more samples of single-cell profiles that have been manually annotated with matching labels. These datasets are then split into training and test batches, and the task of each method is to train a cell type classifier on the training set and project those labels onto the test set.
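The task setup can be sketched in a few lines. The snippet below uses synthetic data in place of a normalised expression matrix and curated cell type labels (both are stand-ins, not the benchmark's actual data), and a k-nearest-neighbours classifier in PCA space as the example method:

```python
# Minimal sketch of the label-projection task on synthetic data.
# X stands in for a normalised expression matrix, y for curated
# cell type labels; both are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_cells, n_genes, n_types = 300, 50, 3
y = rng.integers(n_types, size=n_cells)               # "cell type" labels
X = rng.normal(size=(n_cells, n_genes)) + y[:, None]  # type-dependent expression

# Split cells into a labelled training batch and an unlabelled test batch
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Project labels: fit a classifier in PCA space, predict on the test batch
pca = PCA(n_components=10).fit(X_train)
clf = KNeighborsClassifier(n_neighbors=5).fit(pca.transform(X_train), y_train)
y_pred = clf.predict(pca.transform(X_test))
print(accuracy_score(y_test, y_pred))
```

In the benchmark, the train/test split is either random or by experimental batch; the latter additionally tests robustness to batch effects.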
Summary
Metrics
- Accuracy (Grandini, Bagli, and Visani 2020): The proportion of cells that are assigned the correct label.
- F1 score (Grandini, Bagli, and Visani 2020): A weighted average of precision and recall over all class labels, where each class contributes to the score in proportion to its frequency in the dataset. The F1 score reaches its best value at 1 and its worst at 0.
- Macro F1 score (Grandini, Bagli, and Visani 2020): The macro F1 score is an unweighted F1 score, where each class contributes equally, regardless of its frequency.
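All three metrics can be computed with scikit-learn; the toy labels below are illustrative, and the benchmark's own implementation may differ in details such as score scaling:

```python
# The three benchmark metrics on a toy prediction, via scikit-learn.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["B", "B", "B", "T", "T", "NK"]
y_pred = ["B", "B", "T", "T", "T", "NK"]

acc = accuracy_score(y_true, y_pred)                  # fraction of correct labels
f1_w = f1_score(y_true, y_pred, average="weighted")   # classes weighted by frequency
f1_macro = f1_score(y_true, y_pred, average="macro")  # every class weighted equally
print(acc, f1_w, f1_macro)
```

The macro F1 score rewards methods that also get rare cell types right, whereas accuracy and weighted F1 are dominated by the abundant ones.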
Results
Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.
Details
Methods
- K-neighbors classifier (log CP10k) (Cover and Hart 1967): K-neighbors classifier uses the “k-nearest neighbours” approach, which is a popular machine learning algorithm for classification and regression tasks. The assumption underlying KNN in this context is that cells with similar gene expression profiles tend to belong to the same cell type. For each unlabelled cell, this method computes the k labelled cells (in this case, 5) with the smallest distance in PCA space, and assigns that cell the most common cell type among its k nearest neighbors. Links: Docs.
- K-neighbors classifier (log scran) (Cover and Hart 1967): K-neighbors classifier uses the “k-nearest neighbours” approach, which is a popular machine learning algorithm for classification and regression tasks. The assumption underlying KNN in this context is that cells with similar gene expression profiles tend to belong to the same cell type. For each unlabelled cell, this method computes the k labelled cells (in this case, 5) with the smallest distance in PCA space, and assigns that cell the most common cell type among its k nearest neighbors. Links: Docs.
- Logistic regression (log CP10k) (Hosmer Jr, Lemeshow, and Sturdivant 2013): Logistic Regression estimates parameters of a logistic function for multivariate classification tasks. Here, we use 100-dimensional whitened PCA coordinates as independent variables, and the model minimises the cross entropy loss over all cell type classes. Links: Docs.
- Logistic regression (log scran) (Hosmer Jr, Lemeshow, and Sturdivant 2013): Logistic Regression estimates parameters of a logistic function for multivariate classification tasks. Here, we use 100-dimensional whitened PCA coordinates as independent variables, and the model minimises the cross entropy loss over all cell type classes. Links: Docs.
- Majority Vote (Open Problems for Single Cell Analysis Consortium 2022): Assigns every cell in the test data the most common label in the training data. Links: Docs.
- Multilayer perceptron (log CP10k) (Hinton 1989): MLP or “Multi-Layer Perceptron” is a type of artificial neural network that consists of multiple layers of interconnected neurons. Each neuron computes a weighted sum of all neurons in the previous layer and transforms it with a nonlinear activation function. The output layer provides the final prediction, and network weights are updated by gradient descent to minimize the cross entropy loss. Here, the input data is 100-dimensional whitened PCA coordinates for each cell, and we use two hidden layers of 100 neurons each. Links: Docs.
- Multilayer perceptron (log scran) (Hinton 1989): MLP or “Multi-Layer Perceptron” is a type of artificial neural network that consists of multiple layers of interconnected neurons. Each neuron computes a weighted sum of all neurons in the previous layer and transforms it with a nonlinear activation function. The output layer provides the final prediction, and network weights are updated by gradient descent to minimize the cross entropy loss. Here, the input data is 100-dimensional whitened PCA coordinates for each cell, and we use two hidden layers of 100 neurons each. Links: Docs.
- Random Labels (Open Problems for Single Cell Analysis Consortium 2022): Random assignment of predicted labels proportionate to label abundance in training data. Links: Docs.
- scANVI (All genes) (Xu et al. 2021): scANVI or “single-cell ANnotation using Variational Inference” is a semi-supervised variant of the scVI (Lopez et al. 2018) algorithm. Like scVI, scANVI uses deep neural networks and stochastic optimization to model uncertainty caused by technical noise and bias in single-cell transcriptomics measurements. However, scANVI also leverages cell type labels in the generative modelling. In this approach, scANVI is used to predict the cell type labels of the unlabelled test data. Links: Docs.
- scANVI (Seurat v3 2000 HVG) (Xu et al. 2021): scANVI or “single-cell ANnotation using Variational Inference” is a semi-supervised variant of the scVI (Lopez et al. 2018) algorithm. Like scVI, scANVI uses deep neural networks and stochastic optimization to model uncertainty caused by technical noise and bias in single-cell transcriptomics measurements. However, scANVI also leverages cell type labels in the generative modelling. In this approach, scANVI is used to predict the cell type labels of the unlabelled test data. Links: Docs.
- scArches+scANVI (All genes) (Lotfollahi et al. 2020): scArches+scANVI or “Single-cell architecture surgery” is a deep learning method for mapping new datasets onto a pre-existing reference model, using transfer learning and parameter optimization. It first uses scANVI to build a reference model from the training data, and then applies scArches to map the test data onto the reference model and make predictions. Links: Docs.
- scArches+scANVI (Seurat v3 2000 HVG) (Lotfollahi et al. 2020): scArches+scANVI or “Single-cell architecture surgery” is a deep learning method for mapping new datasets onto a pre-existing reference model, using transfer learning and parameter optimization. It first uses scANVI to build a reference model from the training data, and then applies scArches to map the test data onto the reference model and make predictions. Links: Docs.
- scArches+scANVI+xgboost (All genes) (Lotfollahi et al. 2020): scArches+scANVI or “Single-cell architecture surgery” is a deep learning method for mapping new datasets onto a pre-existing reference model, using transfer learning and parameter optimization. It first uses scANVI to build a reference model from the training data, and then applies scArches to map the test data onto the reference model. In this variant, the final cell type predictions are made by an XGBoost classifier. Links: Docs.
- scArches+scANVI+xgboost (Seurat v3 2000 HVG) (Lotfollahi et al. 2020): scArches+scANVI or “Single-cell architecture surgery” is a deep learning method for mapping new datasets onto a pre-existing reference model, using transfer learning and parameter optimization. It first uses scANVI to build a reference model from the training data, and then applies scArches to map the test data onto the reference model. In this variant, the final cell type predictions are made by an XGBoost classifier. Links: Docs.
- Seurat reference mapping (SCTransform) (Hao et al. 2021): Seurat reference mapping is a cell type label transfer method provided by the Seurat package. Gene expression counts are first normalised by SCTransform before computing PCA. Then it finds mutual nearest neighbours, known as transfer anchors, between the labelled and unlabelled part of the data in PCA space, and computes each cell’s distance to each of the anchor pairs. Finally, it uses the labelled anchors to predict cell types for unlabelled cells based on these distances. Links: Docs.
- True Labels (Open Problems for Single Cell Analysis Consortium 2022): Perfect assignment: the predicted labels are copied directly from the true test labels. Links: Docs.
- XGBoost (log CP10k) (Chen and Guestrin 2016): XGBoost is a gradient boosting decision tree model that learns an ensemble of tree structures, each splitting on input features and their values to reach a prediction decision, and combines the predictions from all its trees. Here, input features are normalised gene expression values. Links: Docs.
- XGBoost (log scran) (Chen and Guestrin 2016): XGBoost is a gradient boosting decision tree model that learns an ensemble of tree structures, each splitting on input features and their values to reach a prediction decision, and combines the predictions from all its trees. Here, input features are normalised gene expression values. Links: Docs.
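Several of the methods above share the same pipeline shape: project cells into a whitened 100-dimensional PCA space, then fit a classifier that minimises cross entropy. A sketch of the logistic regression variant, on synthetic stand-in data (the data and dimensions here are illustrative assumptions):

```python
# Sketch of the logistic-regression pipeline described above:
# whitened PCA coordinates as features, multinomial cross-entropy loss.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
y = rng.integers(4, size=400)                    # toy "cell type" labels
X = rng.normal(size=(400, 200)) + 2 * y[:, None]  # toy expression matrix

model = make_pipeline(
    PCA(n_components=100, whiten=True),  # whitened PCA coordinates
    LogisticRegression(max_iter=1000),   # minimises cross entropy over classes
).fit(X, y)
print(model.score(X, y))  # training accuracy
```

Swapping `LogisticRegression` for `KNeighborsClassifier(n_neighbors=5)` or `MLPClassifier(hidden_layer_sizes=(100, 100))` gives the corresponding KNN and MLP variants.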
Baseline methods
- Random Labels: Random assignment of predicted labels in proportion to label abundance in the training data.
- True Labels: Perfect assignment: the predicted labels are copied directly from the true test labels.
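The Random Labels control (and the related Majority Vote method listed above) can be sketched with plain numpy; the labels below are illustrative:

```python
# Two simple control/baseline strategies on toy training labels.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
y_train = np.array(["B"] * 6 + ["T"] * 3 + ["NK"])  # toy training labels
n_test = 5

# Random Labels: sample test labels in proportion to training abundance
labels, counts = np.unique(y_train, return_counts=True)
random_pred = rng.choice(labels, size=n_test, p=counts / counts.sum())

# Majority Vote: assign the most common training label to every test cell
majority_pred = np.full(n_test, Counter(y_train).most_common(1)[0][0])
print(random_pred, majority_pred)
```

Together with True Labels, these controls bound the scaled scores: a method performing at Random Labels level scores 0, and True Labels defines the ceiling of 1.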
Datasets
- CeNGEN (split by batch) (Hammarlund et al. 2018): 100k FACS-isolated C. elegans neurons from 17 experiments sequenced on 10x Genomics. Split into train/test by experimental batch. Dimensions: 100955 cells, 22469 genes. 169 cell types (avg. 597±800 cells per cell type).
- CeNGEN (random split) (Hammarlund et al. 2018): 100k FACS-isolated C. elegans neurons from 17 experiments sequenced on 10x Genomics. Split into train/test randomly. Dimensions: 100955 cells, 22469 genes. 169 cell types (avg. 597±800 cells per cell type).
- Pancreas (by batch) (Luecken et al. 2021): Human pancreatic islet scRNA-seq data from 6 datasets across technologies (CEL-seq, CEL-seq2, Smart-seq2, inDrop, Fluidigm C1, and SMARTER-seq). Split into train/test by experimental batch. Dimensions: 16382 cells, 18771 genes. 14 cell types (avg. 1170±1703 cells per cell type).
- Pancreas (random split) (Luecken et al. 2021): Human pancreatic islet scRNA-seq data from 6 datasets across technologies (CEL-seq, CEL-seq2, Smart-seq2, inDrop, Fluidigm C1, and SMARTER-seq). Split into train/test randomly. Dimensions: 16382 cells, 18771 genes. 14 cell types (avg. 1170±1703 cells per cell type).
- Pancreas (random split with label noise) (Luecken et al. 2021): Human pancreatic islet scRNA-seq data from 6 datasets across technologies (CEL-seq, CEL-seq2, Smart-seq2, inDrop, Fluidigm C1, and SMARTER-seq). Split into train/test randomly with 20% label noise. Dimensions: 16382 cells, 18771 genes. 14 cell types (avg. 1170±1703 cells per cell type).
- Tabula Muris Senis Lung (random split) (Tabula Muris Consortium 2020): All lung cells from Tabula Muris Senis, a 500k cell-atlas from 18 organs and tissues across the mouse lifespan. Split into train/test randomly. Dimensions: 24540 cells, 17985 genes. 39 cell types (avg. 629±999 cells per cell type).
- Zebrafish (by laboratory) (Wagner et al. 2018): 90k cells from zebrafish embryos throughout the first day of development, with and without a knockout of chordin, an important developmental gene. Split into train/test by laboratory. Dimensions: 26022 cells, 25258 genes. 24 cell types (avg. 1084±1156 cells per cell type).
- Zebrafish (random split) (Wagner et al. 2018): 90k cells from zebrafish embryos throughout the first day of development, with and without a knockout of chordin, an important developmental gene. Split into train/test randomly. Dimensions: 26022 cells, 25258 genes. 24 cell types (avg. 1084±1156 cells per cell type).
Quality control results
✓ All checks succeeded!