A major challenge for integrating single-cell datasets is creating matching cell type annotations for each cell. One of the most common annotation strategies is "cluster-then-annotate", in which cells are aggregated into clusters based on feature similarity and then manually characterized using differential gene expression or previously identified marker genes. Recently, methods have emerged that build on this strategy and annotate cells using known marker genes. However, these strategies make it difficult to integrate atlas-scale datasets, because the annotations assigned in one dataset may not match those assigned in another.
To ensure that the cell type labels in newly generated datasets match existing reference datasets, some methods align cells to a previously annotated reference dataset and then project labels from the reference to the new dataset.
Here, we compare methods for annotation based on a reference dataset. Each dataset consists of two or more samples of single-cell profiles that have been manually annotated with matching labels. The datasets are split into training and test batches, and the task of each method is to train a cell type classifier on the training set and project those labels onto the test set.
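The train-then-project task described above can be sketched in a few lines. This is a minimal illustration only: the toy counts, the `alpha`/`beta` labels, and every function name are hypothetical, log CP10k is assumed to mean `log(1 + counts / total * 10,000)`, and a nearest-centroid classifier stands in for the benchmarked methods (it is not one of them).

```python
import math
import random
from collections import defaultdict

def log_cp10k(counts):
    """Scale one cell's counts to 10,000 total, then apply log(1 + x) (assumed definition)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * 1e4) for c in counts]

def fit_centroids(profiles, labels):
    """Training step: average the normalized profiles of each cell type label."""
    sums = {}
    sizes = defaultdict(int)
    for profile, label in zip(profiles, labels):
        if label not in sums:
            sums[label] = [0.0] * len(profile)
        sums[label] = [s + x for s, x in zip(sums[label], profile)]
        sizes[label] += 1
    return {label: [s / sizes[label] for s in vec] for label, vec in sums.items()}

def project_labels(centroids, profiles):
    """Projection step: give each cell the label of its closest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda lab: sq_dist(centroids[lab], p)) for p in profiles]

# Toy data: 3 genes and two hypothetical cell types with distinct marker genes.
rng = random.Random(0)

def simulate(label):
    base = [50, 5, 5] if label == "alpha" else [5, 50, 5]
    return [b + rng.randint(0, 5) for b in base]

labels = ["alpha", "beta"] * 20
cells = [log_cp10k(simulate(lab)) for lab in labels]

# Random split into a training set (30 cells) and a test set (10 cells).
order = list(range(len(cells)))
rng.shuffle(order)
train_idx, test_idx = order[:30], order[30:]

centroids = fit_centroids([cells[i] for i in train_idx], [labels[i] for i in train_idx])
predicted = project_labels(centroids, [cells[i] for i in test_idx])
true_test = [labels[i] for i in test_idx]
```

A split by batch or by laboratory would differ only in how `train_idx` and `test_idx` are chosen: whole batches would be assigned to one side of the split rather than individual cells.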
Metrics for label projection aim to characterize how accurately each classifier assigns cell type labels to cells in the test set.
- Accuracy: The fraction of cells in the test set that receive the correct label.
- F1 score: For each class, the F1 score is the harmonic mean of precision and recall, ranging from 0 (worst) to 1 (best). The reported F1 score is a weighted average of the per-class scores, in which each class contributes in proportion to its frequency in the dataset.
- Macro F1 score: The macro F1 score is the unweighted average of the per-class F1 scores, so each class contributes equally regardless of its frequency.
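The three metrics can be written out directly from their definitions. The sketch below is a plain-Python illustration rather than the benchmark's own implementation; the function names and the toy labels are assumptions made for the example.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true frequency."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(f1[label] * support[label] for label in f1) / len(y_true)

def macro_f1(y_true, y_pred):
    """Per-class F1 averaged with equal weight for every class."""
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1.values()) / len(f1)

# Toy example: a common class ("T") and a rare class ("B"); one rare cell is mislabeled.
y_true = ["T"] * 8 + ["B"] * 2
y_pred = ["T"] * 8 + ["T", "B"]

print(round(accuracy(y_true, y_pred), 3))     # 0.9
print(round(weighted_f1(y_true, y_pred), 3))  # 0.886
print(round(macro_f1(y_true, y_pred), 3))     # 0.804
```

In this example a single error on the rare class lowers the macro F1 score far more than the accuracy or the weighted F1 score, which is why the macro score is informative when rare cell types matter.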
| Dataset | Method (normalization) |
|---|---|
| CeNGEN (random split) | Logistic regression (log CP10k) |
| CeNGEN (split by batch) | Logistic regression (log CP10k) |
| Pancreas (by batch) | Multilayer perceptron (log CP10k) |
| Pancreas (random split with label noise) | Logistic regression (log CP10k) |
| Pancreas (random split) | Multilayer perceptron (log CP10k) |
| Tabula Muris Senis Lung (random split) | Multilayer perceptron (log scran) |
| Zebrafish (by laboratory) | Seurat reference mapping (SCTransform) |
| Zebrafish (random split) | Seurat reference mapping (SCTransform) |