# Label Projection

Automated cell type annotation from rich, labeled reference data

## Description

A major challenge for integrating single cell datasets is creating matching cell type annotations for each cell. One of the most common strategies for annotating cell types is referred to as “cluster-then-annotate”, whereby cells are aggregated into clusters based on feature similarity and then manually characterized based on differential gene expression or previously identified marker genes. Recently, methods have emerged that build on this strategy and annotate cells automatically using known marker genes. However, these strategies make atlas-scale integration difficult, because annotations produced independently for each dataset may not match.

To ensure that the cell type labels in newly generated datasets match existing reference datasets, some methods align cells to a previously annotated reference dataset and then *project* labels from the reference to the new dataset.

Here, we compare methods for annotation based on a reference dataset. The datasets consist of two or more samples of single cell profiles that have been manually annotated with matching labels. These datasets are then split into training and test batches, and the task of each method is to train a cell type classifier on the training set and project those labels onto the test set.
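The two splitting strategies used in this task (by experimental batch, or randomly) can be sketched in plain Python. This is a toy illustration using hypothetical `id`/`batch` records, not the benchmark's actual implementation:

```python
import random

def split_random(cells, test_frac=0.2, seed=0):
    """Randomly assign a fraction of cells to the test set."""
    rng = random.Random(seed)
    shuffled = cells[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

def split_by_batch(cells, test_batches):
    """Hold out entire experimental batches as the test set."""
    train = [c for c in cells if c["batch"] not in test_batches]
    test = [c for c in cells if c["batch"] in test_batches]
    return train, test

# nine toy cells spread over three batches
cells = [{"id": i, "batch": f"b{i % 3}"} for i in range(9)]
train, test = split_by_batch(cells, test_batches={"b2"})
```

Splitting by batch is the harder setting: the classifier must generalize across technical variation between experiments, rather than interpolating within a batch it has already seen.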

## Summary

## Metrics

**Accuracy**^{1}: The proportion of cells whose predicted label matches the ground-truth label.

**F1 score**^{1}: The F1 score is a weighted average of the precision and recall over all class labels, where 1 is the best value and 0 the worst. Each class contributes to the score in proportion to its frequency in the dataset.

**Macro F1 score**^{1}: The macro F1 score is an unweighted F1 score, where each class contributes equally, regardless of its frequency.
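As a minimal, dependency-free sketch of these three metrics (function names are my own; the benchmark itself uses standard library implementations such as scikit-learn's):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Proportion of predictions matching the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_per_class(y_true, y_pred, label):
    """F1 score for a single class: harmonic mean of precision and recall."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_weighted(y_true, y_pred):
    """Per-class F1, weighted by each class's frequency in y_true."""
    counts = Counter(y_true)
    return sum(f1_per_class(y_true, y_pred, c) * n
               for c, n in counts.items()) / len(y_true)

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    classes = set(y_true)
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)
```

The macro variant matters for this task because cell type frequencies are highly skewed: a method that only labels abundant types well can still score a high weighted F1, but its macro F1 will be dragged down by the rare types it misses.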

## Results

Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.

## Details

## Methods

**K-neighbors classifier (log CP10k)**^{6}: K-neighbors classifier uses the “k-nearest neighbors” approach, a popular machine learning algorithm for classification and regression tasks. The assumption underlying KNN in this context is that cells with similar gene expression profiles tend to belong to the same cell type. For each unlabelled cell, this method finds the \(k\) labelled cells (in this case, 5) with the smallest distance in PCA space, and assigns that cell the most common cell type among its \(k\) nearest neighbors. Links: Docs.

**K-neighbors classifier (log scran)**^{6}: K-neighbors classifier uses the “k-nearest neighbors” approach, a popular machine learning algorithm for classification and regression tasks. The assumption underlying KNN in this context is that cells with similar gene expression profiles tend to belong to the same cell type. For each unlabelled cell, this method finds the \(k\) labelled cells (in this case, 5) with the smallest distance in PCA space, and assigns that cell the most common cell type among its \(k\) nearest neighbors. Links: Docs.
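The voting logic can be illustrated in a few lines of plain Python. This toy sketch (helper name `knn_predict` is my own, and 2-D coordinates stand in for the PCA space used in the benchmark) shows only the nearest-neighbor vote:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Assign the most common label among the k nearest training cells."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: math.dist(train_X[i], query),
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

In practice the distances are computed in 100-dimensional PCA space over tens of thousands of cells, so efficient implementations use indexed neighbor search rather than a full sort.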

**Logistic regression (log CP10k)**^{2}: Logistic Regression estimates parameters of a logistic function for multivariate classification tasks. Here, we use 100-dimensional whitened PCA coordinates as independent variables, and the model minimises the cross entropy loss over all cell type classes. Links: Docs.

**Logistic regression (log scran)**^{2}: Logistic Regression estimates parameters of a logistic function for multivariate classification tasks. Here, we use 100-dimensional whitened PCA coordinates as independent variables, and the model minimises the cross entropy loss over all cell type classes. Links: Docs.
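A rough scikit-learn sketch of this setup, on synthetic data. The benchmark uses 100 whitened PCA components; the toy example below uses 10 because the fake expression matrix is small, and the two "cell types" are made trivially separable:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# toy "expression" matrix: two populations with shifted means
X = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(3, 1, (50, 20))])
y = np.array(["alpha"] * 50 + ["beta"] * 50)

# whiten=True decorrelates and rescales the PCA coordinates,
# mirroring the whitened PCA inputs described above
clf = make_pipeline(PCA(n_components=10, whiten=True), LogisticRegression())
clf.fit(X, y)
preds = clf.predict(X)
```

With more than two cell types, `LogisticRegression` fits a multinomial model minimising the cross entropy loss across all classes, as described above.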

**Majority Vote**^{9}: Assignment of all predicted labels as the most common label in the training data. Links: Docs.
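This baseline reduces to a few lines; the helper name below is my own:

```python
from collections import Counter

def majority_vote(train_labels, n_test):
    """Predict the single most common training label for every test cell."""
    top = Counter(train_labels).most_common(1)[0][0]
    return [top] * n_test
```

It serves as a floor: any method that does not beat the majority vote has learned nothing useful about the minority cell types.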

**Multilayer perceptron (log CP10k)**^{3}: MLP or “Multi-Layer Perceptron” is a type of artificial neural network that consists of multiple layers of interconnected neurons. Each neuron computes a weighted sum of all neurons in the previous layer and transforms it with a nonlinear activation function. The output layer provides the final prediction, and network weights are updated by gradient descent to minimize the cross entropy loss. Here, the input data is 100-dimensional whitened PCA coordinates for each cell, and we use two hidden layers of 100 neurons each. Links: Docs.

**Multilayer perceptron (log scran)**^{3}: MLP or “Multi-Layer Perceptron” is a type of artificial neural network that consists of multiple layers of interconnected neurons. Each neuron computes a weighted sum of all neurons in the previous layer and transforms it with a nonlinear activation function. The output layer provides the final prediction, and network weights are updated by gradient descent to minimize the cross entropy loss. Here, the input data is 100-dimensional whitened PCA coordinates for each cell, and we use two hidden layers of 100 neurons each. Links: Docs.
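A minimal scikit-learn sketch of this architecture on synthetic data (the hidden layer sizes match the configuration described above; the inputs here are fake features, not PCA coordinates):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# toy feature matrix: two well-separated "cell type" populations
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(3, 1, (50, 10))])
y = np.array(["alpha"] * 50 + ["beta"] * 50)

# two hidden layers of 100 neurons, as in the benchmark configuration
mlp = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=500, random_state=0)
mlp.fit(X, y)
preds = mlp.predict(X)
```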

**Random Labels**^{9}: Random assignment of predicted labels proportionate to label abundance in training data. Links: Docs.
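A minimal sketch of this baseline (function name is my own), sampling labels in proportion to their training frequency:

```python
import random
from collections import Counter

def random_labels(train_labels, n_test, seed=0):
    """Sample predicted labels in proportion to their training frequency."""
    rng = random.Random(seed)
    counts = Counter(train_labels)
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n_test)
```

Together with the majority-vote baseline, this anchors the lower end of the score range before scaling.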

**scANVI (All genes)**^{7}: scANVI or “single-cell ANnotation using Variational Inference” is a semi-supervised variant of the scVI (Lopez et al. 2018) algorithm. Like scVI, scANVI uses deep neural networks and stochastic optimization to model uncertainty caused by technical noise and bias in single-cell transcriptomics measurements. However, scANVI also leverages cell type labels in the generative modelling. In this approach, scANVI is used to predict the cell type labels of the unlabelled test data. Links: Docs.

**scANVI (Seurat v3 2000 HVG)**^{7}: scANVI or “single-cell ANnotation using Variational Inference” is a semi-supervised variant of the scVI (Lopez et al. 2018) algorithm. Like scVI, scANVI uses deep neural networks and stochastic optimization to model uncertainty caused by technical noise and bias in single-cell transcriptomics measurements. However, scANVI also leverages cell type labels in the generative modelling. In this approach, scANVI is used to predict the cell type labels of the unlabelled test data. Links: Docs.

**scArches+scANVI (All genes)**^{8}: scArches+scANVI or “Single-cell architecture surgery” is a deep learning method for mapping new datasets onto a pre-existing reference model, using transfer learning and parameter optimization. It first uses scANVI to build a reference model from the training data, and then applies scArches to map the test data onto the reference model and make predictions. Links: Docs.

**scArches+scANVI (Seurat v3 2000 HVG)**^{8}: scArches+scANVI or “Single-cell architecture surgery” is a deep learning method for mapping new datasets onto a pre-existing reference model, using transfer learning and parameter optimization. It first uses scANVI to build a reference model from the training data, and then applies scArches to map the test data onto the reference model and make predictions. Links: Docs.

**Seurat reference mapping (SCTransform)**^{5}: Seurat reference mapping is a cell type label transfer method provided by the Seurat package. Gene expression counts are first normalised by SCTransform before computing PCA. Then it finds mutual nearest neighbours, known as transfer anchors, between the labelled and unlabelled part of the data in PCA space, and computes each cell’s distance to each of the anchor pairs. Finally, it uses the labelled anchors to predict cell types for unlabelled cells based on these distances. Links: Docs.

**True Labels**^{9}: Perfect assignment of the predicted labels from the test labels. Links: Docs.

**XGBoost (log CP10k)**^{4}: XGBoost is a gradient boosting decision tree model that learns an ensemble of trees, each splitting on input features and their values to reach a prediction, and combines the predictions of all its trees. Here, input features are normalised gene expression values. Links: Docs.

**XGBoost (log scran)**^{4}: XGBoost is a gradient boosting decision tree model that learns an ensemble of trees, each splitting on input features and their values to reach a prediction, and combines the predictions of all its trees. Here, input features are normalised gene expression values. Links: Docs.
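The benchmark uses the `xgboost` package itself; as a dependency-light sketch of the same gradient-boosted-trees idea, the example below swaps in scikit-learn's `GradientBoostingClassifier` on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# toy normalised-expression matrix for two "cell type" populations
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(2, 1, (50, 10))])
y = np.array(["alpha"] * 50 + ["beta"] * 50)

gbt = GradientBoostingClassifier(n_estimators=50, random_state=0)
gbt.fit(X, y)
preds = gbt.predict(X)
```

Unlike the PCA-based methods above, the tree ensemble here consumes normalised expression values directly, so it can exploit individual marker genes without a linear projection step.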

## Baseline methods

**Random Labels**: Random assignment of predicted labels proportionate to label abundance in training data.

**True Labels**: Perfect assignment of the predicted labels from the test labels.

## Datasets

**CeNGEN (split by batch)**^{12}: 100k FACS-isolated *C. elegans* neurons from 17 experiments sequenced on 10x Genomics. Split into train/test by experimental batch. Dimensions: 100955 cells, 22469 genes. 169 cell types (avg. 597±800 cells per cell type).

**CeNGEN (random split)**^{12}: 100k FACS-isolated *C. elegans* neurons from 17 experiments sequenced on 10x Genomics. Split into train/test randomly. Dimensions: 100955 cells, 22469 genes. 169 cell types (avg. 597±800 cells per cell type).

**Pancreas (by batch)**^{10}: Human pancreatic islet scRNA-seq data from 6 datasets across technologies (CEL-seq, CEL-seq2, Smart-seq2, inDrop, Fluidigm C1, and SMARTER-seq). Split into train/test by experimental batch. Dimensions: 16382 cells, 18771 genes. 14 cell types (avg. 1170±1703 cells per cell type).

**Pancreas (random split)**^{10}: Human pancreatic islet scRNA-seq data from 6 datasets across technologies (CEL-seq, CEL-seq2, Smart-seq2, inDrop, Fluidigm C1, and SMARTER-seq). Split into train/test randomly. Dimensions: 16382 cells, 18771 genes. 14 cell types (avg. 1170±1703 cells per cell type).

**Pancreas (random split with label noise)**^{10}: Human pancreatic islet scRNA-seq data from 6 datasets across technologies (CEL-seq, CEL-seq2, Smart-seq2, inDrop, Fluidigm C1, and SMARTER-seq). Split into train/test randomly with 20% label noise. Dimensions: 16382 cells, 18771 genes. 14 cell types (avg. 1170±1703 cells per cell type).

**Tabula Muris Senis Lung (random split)**^{11}: All lung cells from Tabula Muris Senis, a 500k cell-atlas from 18 organs and tissues across the mouse lifespan. Split into train/test randomly. Dimensions: 24540 cells, 17985 genes. 39 cell types (avg. 629±999 cells per cell type).

**Zebrafish (by laboratory)**^{13}: 90k cells from zebrafish embryos throughout the first day of development, with and without a knockout of chordin, an important developmental gene. Split into train/test by laboratory. Dimensions: 26022 cells, 25258 genes. 24 cell types (avg. 1084±1156 cells per cell type).

**Zebrafish (random split)**^{13}: 90k cells from zebrafish embryos throughout the first day of development, with and without a knockout of chordin, an important developmental gene. Split into train/test randomly. Dimensions: 26022 cells, 25258 genes. 24 cell types (avg. 1084±1156 cells per cell type).

## Download raw data


## Quality control results

✓ All checks succeeded!