*This page is under construction*

*Realigning multimodal measurements of the same cell*

Several recently described technologies allow simultaneous measurement of different aspects of cell state. For example, sci-CAR jointly profiles RNA expression and chromatin accessibility in the same cell, and CITE-seq measures surface protein abundance and RNA expression from each cell. However, these joint profiling methods involve tradeoffs relative to unimodal measurements.

Joint methods can be more expensive, lower throughput, or noisier than measuring a single modality at a time. It is therefore useful to develop methods that can integrate measurements of the same biological system obtained using different technologies.

Here the goal is to learn a latent space in which observations of the same cell acquired using different modalities are embedded close together. A perfect result places each pair of matched observations at identical coordinates in the latent space.

Metrics for multimodal data integration aim to characterize how well the aligned datasets correspond to the ground truth.

**kNN AUC**: Let $f(i) \in F$ be the scRNA-seq measurement of cell $i$, and $g(i) \in G$ be the scATAC-seq measurement of cell $i$. kNN-AUC calculates the average percentage overlap of neighborhoods of $f(i)$ in $F$ with neighborhoods of $g(i)$ in $G$. Higher is better.

**MSE**: Mean squared error (MSE) is the average distance between each pair of matched observations of the same cell in the learned latent space. Lower is better.
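As a rough sketch of the two metrics (not the benchmark's exact implementation — the real kNN-AUC averages the overlap over a range of neighborhood sizes, and the function names and fixed `k` here are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_overlap(F, G, k=10):
    """Average fraction of shared k-nearest neighbors between paired
    embeddings, where F[i] and G[i] are the two modality measurements
    of cell i. (Sketch of a single-k slice of kNN-AUC.)"""
    idx_f = NearestNeighbors(n_neighbors=k).fit(F).kneighbors(F, return_distance=False)
    idx_g = NearestNeighbors(n_neighbors=k).fit(G).kneighbors(G, return_distance=False)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(idx_f, idx_g)]))

def mse(F, G):
    """Mean squared distance between matched observations of each cell."""
    return float(np.mean(np.sum((F - G) ** 2, axis=1)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(knn_overlap(X, X))  # identical embeddings: 1.0
print(mse(X, X))          # perfectly aligned pairs: 0.0
```

A perfect alignment scores 1.0 on the overlap and 0.0 on the MSE, matching the "higher/lower is better" directions above.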

Rank | kNN AUC | MSE | Memory (GB) | Runtime (min) | Name | Code version | Year
---|---|---|---|---|---|---|---
3 | 0.04 | 1.00 | 4.16 | 2.72 | Harmonic Alignment (log scran) | v0.0 | 2020
2 | 0.06 | 1.00 | 0.51 | 1.00 | Harmonic Alignment (sqrt CPM) | v0.0 | 2020
4 | 0.05 | 1.08 | 0.87 | 0.15 | Mutual Nearest Neighbors (log CPM) | v3.3.6 | 2018
5 | 0.03 | 1.00 | 4.17 | 1.84 | Mutual Nearest Neighbors (log scran) | v3.3.6 | 2018
1 | 0.20 | 0.58 | 0.33 | 0.04 | Procrustes | v1.5.3 | 1975

Rank | kNN AUC | MSE | Memory (GB) | Runtime (min) | Name | Code version | Year
---|---|---|---|---|---|---|---
2 | 0.07 | 1.00 | 3.60 | 1.62 | Harmonic Alignment (log scran) | v0.0 | 2020
5 | 0.04 | 1.00 | 0.62 | 0.66 | Harmonic Alignment (sqrt CPM) | v0.0 | 2020
3 | 0.05 | 0.92 | 0.99 | 0.40 | Mutual Nearest Neighbors (log CPM) | v3.3.6 | 2018
4 | 0.07 | 1.01 | 3.59 | 1.27 | Mutual Nearest Neighbors (log scran) | v3.3.6 | 2018
1 | 0.08 | 0.89 | 0.50 | 0.22 | Procrustes | v1.5.3 | 1975

Rank | kNN AUC | MSE | Memory (GB) | Runtime (min) | Name | Code version | Year
---|---|---|---|---|---|---|---
4 | 0.05 | 1.00 | 4.06 | 5.36 | Harmonic Alignment (log scran) | v0.0 | 2020
2 | 0.05 | 1.00 | 1.47 | 3.43 | Harmonic Alignment (sqrt CPM) | v0.0 | 2020
5 | 0.05 | 1.01 | 1.18 | 0.73 | Mutual Nearest Neighbors (log CPM) | v3.3.6 | 2018
3 | 0.06 | 1.06 | 4.04 | 2.52 | Mutual Nearest Neighbors (log scran) | v3.3.6 | 2018
1 | 0.07 | 0.94 | 0.76 | 0.38 | Procrustes | v1.5.3 | 1975
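The Procrustes baseline in these tables is essentially the classical orthogonal alignment of two paired point sets. A minimal sketch using SciPy on synthetic data (illustrative only; the benchmark's wrapper adds its own preprocessing):

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(42)
F = rng.normal(size=(200, 5))  # stand-in for one modality's embedding

# The second "modality" here is a rotated, scaled, shifted copy of F.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
G = 3.0 * F @ Q + 1.5

# procrustes() standardizes both sets, then finds the best orthogonal map.
F_aligned, G_aligned, disparity = procrustes(F, G)
print(disparity)  # ~0: G is an exact similarity transform of F
```

Because Procrustes only fits a single global rotation/scaling/translation, it is cheap (note its low memory and runtime above) but cannot model nonlinear distortions between modalities.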

*Using cell labels from a reference dataset to annotate an unseen dataset*

A major challenge for integrating single-cell datasets is creating matching cell type annotations for each cell. One of the most common annotation strategies is "cluster-then-annotate", whereby cells are aggregated into clusters based on feature similarity and then manually characterized based on differential gene expression or previously identified marker genes. Recently, methods have emerged that build on this strategy and annotate cells using known marker genes. However, these strategies pose a difficulty for integrating atlas-scale datasets, as the particular annotations may not match.

To ensure that the cell type labels in newly generated datasets match existing reference datasets, some methods align cells to a previously annotated reference dataset and then *project* labels from the reference to the new dataset.

Here, we compare methods for annotation based on a reference dataset. The datasets consist of two or more samples of single-cell profiles that have been manually annotated with matching labels. These datasets are then split into training and test batches, and the task of each method is to train a cell type classifier on the training set and project those labels onto the test set.
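The train/project setup can be sketched with scikit-learn as follows; the synthetic data here is a stand-in for an annotated expression matrix, not the benchmark data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for normalized expression with known cell-type labels.
X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)
X_ref, X_new, y_ref, y_new = train_test_split(X, y, random_state=0)

# Train a classifier on the annotated reference batch...
clf = LogisticRegression(max_iter=1000).fit(X_ref, y_ref)

# ...then project its labels onto the unseen batch.
projected = clf.predict(X_new)
print((projected == y_new).mean())  # accuracy on the held-out batch
```

The benchmarked logistic regression and multilayer perceptron baselines follow this same fit/predict pattern, differing only in the classifier and the normalization of the input counts.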

Metrics for label projection aim to characterize how well each classifier correctly assigns cell type labels to cells in the test set.

**Accuracy**: The average fraction of correctly assigned labels.

**F1 score**: The F1 score can be interpreted as a weighted average of precision and recall, reaching its best value at 1 and its worst at 0. The micro-averaged F1 score reported below is computed globally, by counting total true positives, false negatives, and false positives across all classes.
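These metrics are available directly in scikit-learn; a small sketch on toy labels (the benchmark's exact averaging choices may differ):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 1, 1, 2, 2]  # manual annotations on the test batch
y_pred = [0, 1, 1, 1, 2, 0]  # labels projected by a classifier

print(accuracy_score(y_true, y_pred))              # 4/6 correct = 0.666...
print(f1_score(y_true, y_pred, average="weighted"))
print(f1_score(y_true, y_pred, average="micro"))
```

For single-label multiclass problems, micro F1 equals accuracy, which is why the two columns track each other closely in the tables below.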

Rank | Accuracy | F1 score | Micro F1 score | Memory (GB) | Runtime (min) | Name | Code version | Year
---|---|---|---|---|---|---|---|---
3 | 0.96 | 0.96 | 0.96 | 5.70 | 0.21 | Logistic regression (log CPM) | v0.23.2 | 2013
4 | 0.96 | 0.96 | 0.96 | 7.52 | 4.67 | Logistic regression (log scran) | v0.23.2 | 2013
2 | 0.96 | 0.96 | 0.96 | 5.01 | 0.23 | Multilayer perceptron (log CPM) | v0.23.2 | 1990
1 | 0.96 | 0.96 | 0.96 | 7.76 | 5.00 | Multilayer perceptron (log scran) | v0.23.2 | 1990

Rank | Accuracy | F1 score | Micro F1 score | Memory (GB) | Runtime (min) | Name | Code version | Year
---|---|---|---|---|---|---|---|---
4 | 0.99 | 0.99 | 0.99 | 4.98 | 0.22 | Logistic regression (log CPM) | v0.23.2 | 2013
3 | 0.99 | 0.99 | 0.99 | 8.06 | 4.77 | Logistic regression (log scran) | v0.23.2 | 2013
2 | 0.99 | 0.99 | 0.99 | 5.90 | 0.23 | Multilayer perceptron (log CPM) | v0.23.2 | 1990
1 | 0.99 | 0.99 | 0.99 | 7.49 | 4.75 | Multilayer perceptron (log scran) | v0.23.2 | 1990

Rank | Accuracy | F1 score | Micro F1 score | Memory (GB) | Runtime (min) | Name | Code version | Year
---|---|---|---|---|---|---|---|---
2 | 0.23 | 0.28 | 0.23 | 1.77 | 0.64 | Logistic regression (log CPM) | v0.23.2 | 2013
1 | 0.24 | 0.28 | 0.24 | 11.94 | 7.76 | Logistic regression (log scran) | v0.23.2 | 2013
4 | 0.20 | 0.22 | 0.20 | 1.76 | 0.45 | Multilayer perceptron (log CPM) | v0.23.2 | 1990
3 | 0.22 | 0.26 | 0.22 | 10.50 | 7.65 | Multilayer perceptron (log scran) | v0.23.2 | 1990

Rank | Accuracy | F1 score | Micro F1 score | Memory (GB) | Runtime (min) | Name | Code version | Year
---|---|---|---|---|---|---|---|---
3 | 0.83 | 0.82 | 0.83 | 1.68 | 0.77 | Logistic regression (log CPM) | v0.23.2 | 2013
1 | 0.84 | 0.84 | 0.84 | 11.05 | 7.98 | Logistic regression (log scran) | v0.23.2 | 2013
4 | 0.82 | 0.82 | 0.82 | 1.66 | 0.53 | Multilayer perceptron (log CPM) | v0.23.2 | 1990
2 | 0.83 | 0.83 | 0.83 | 10.59 | 7.76 | Multilayer perceptron (log scran) | v0.23.2 | 1990