# Multimodal Data Integration

Realigning multimodal measurements of the same cell

Several recently described technologies allow for simultaneous measurement of different aspects of cell state. For example, sci-CAR jointly profiles RNA expression and chromatin accessibility in the same cell, and CITE-seq measures surface protein abundance and RNA expression from each cell. However, these joint profiling methods have several tradeoffs compared to unimodal measurements.

Joint methods can be more expensive, lower throughput, or noisier than measuring a single modality at a time. It is therefore useful to develop methods that can integrate measurements of the same biological system obtained using different technologies.

Here the goal is to learn a latent space in which observations of the same cell, acquired using different modalities, are embedded close together. A perfect result has each pair of observations sharing the same coordinates in the latent space.
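As a toy illustration of this goal, the sketch below aligns a synthetic "second modality" onto the first using an ordinary Procrustes transform from SciPy, the simplest of the baselines benchmarked below. The data, rotation, and noise level are invented for the example:

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # embedding of 100 cells, modality 1
theta = np.pi / 3                             # arbitrary rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Y = X @ R + 0.01 * rng.normal(size=X.shape)   # modality 2: rotated, noisy copy

# Procrustes finds the translation/rotation/scaling of Y that best matches X;
# the returned disparity is the residual sum of squares after alignment.
X_std, Y_aligned, disparity = procrustes(X, Y)
print(disparity)  # small disparity: the two modalities now share coordinates
```

Real multimodal data is of course not a rigid transform of itself, which is why the nonlinear methods in the tables below exist; but this captures the shape of the task.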

## The metrics

Metrics for multimodal data integration aim to characterize how well the aligned datasets correspond to the ground truth.

• kNN AUC: Let $f(i) \in F$ be the scRNA-seq measurement of cell $i$, and $g(i) \in G$ be the scATAC-seq measurement of cell $i$. kNN-AUC calculates the average percentage overlap of neighborhoods of $f(i)$ in $F$ with neighborhoods of $g(i)$ in $G$. Higher is better.
• MSE: Mean squared error (MSE) is the average squared distance between the paired observations of each cell in the learned latent space. Lower is better.
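These two metrics can be sketched as follows. This is an illustrative implementation, not the benchmark's exact code: the real kNN-AUC averages the neighborhood overlap over a range of neighborhood sizes (hence "area under the curve"), while this sketch evaluates a single $k$:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_overlap(F, G, k=10):
    """Average fraction of shared k-nearest neighbours between paired embeddings.

    F[i] and G[i] are the two embeddings of cell i; kNN-AUC averages this
    overlap over a range of k values, shown here for a single k.
    """
    idx_f = NearestNeighbors(n_neighbors=k).fit(F).kneighbors(F, return_distance=False)
    idx_g = NearestNeighbors(n_neighbors=k).fit(G).kneighbors(G, return_distance=False)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(idx_f, idx_g)]
    return float(np.mean(overlaps))

def mse(F, G):
    """Mean squared distance between paired observations in the latent space."""
    return float(np.mean(np.sum((F - G) ** 2, axis=1)))
```

A perfect alignment (`F` identical to `G`) gives an overlap of 1.0 and an MSE of 0.0, matching the "higher is better" / "lower is better" directions above.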

## The results

### CITE-seq Cord Blood Mononuclear Cells

| Rank | kNN AUC | MSE | Memory (GB) | Runtime (min) | Name | Code | Year |
|------|---------|-----|-------------|---------------|------|------|------|
| 3 | 0.04 | 1.00 | 4.16 | 2.72 | Harmonic Alignment (log scran) | v0.0 | 2020 |
| 2 | 0.06 | 1.00 | 0.51 | 1.00 | Harmonic Alignment (sqrt CPM) | v0.0 | 2020 |
| 4 | 0.05 | 1.08 | 0.87 | 0.15 | Mutual Nearest Neighbors (log CPM) | v3.3.6 | 2018 |
| 5 | 0.03 | 1.00 | 4.17 | 1.84 | Mutual Nearest Neighbors (log scran) | v3.3.6 | 2018 |
| 1 | 0.20 | 0.58 | 0.33 | 0.04 | Procrustes | v1.5.3 | 1975 |

### sciCAR Cell Lines

| Rank | kNN AUC | MSE | Memory (GB) | Runtime (min) | Name | Code | Year |
|------|---------|-----|-------------|---------------|------|------|------|
| 2 | 0.07 | 1.00 | 3.60 | 1.62 | Harmonic Alignment (log scran) | v0.0 | 2020 |
| 5 | 0.04 | 1.00 | 0.62 | 0.66 | Harmonic Alignment (sqrt CPM) | v0.0 | 2020 |
| 3 | 0.05 | 0.92 | 0.99 | 0.40 | Mutual Nearest Neighbors (log CPM) | v3.3.6 | 2018 |
| 4 | 0.07 | 1.01 | 3.59 | 1.27 | Mutual Nearest Neighbors (log scran) | v3.3.6 | 2018 |
| 1 | 0.08 | 0.89 | 0.50 | 0.22 | Procrustes | v1.5.3 | 1975 |

### sciCAR Mouse Kidney

| Rank | kNN AUC | MSE | Memory (GB) | Runtime (min) | Name | Code | Year |
|------|---------|-----|-------------|---------------|------|------|------|
| 4 | 0.05 | 1.00 | 4.06 | 5.36 | Harmonic Alignment (log scran) | v0.0 | 2020 |
| 2 | 0.05 | 1.00 | 1.47 | 3.43 | Harmonic Alignment (sqrt CPM) | v0.0 | 2020 |
| 5 | 0.05 | 1.01 | 1.18 | 0.73 | Mutual Nearest Neighbors (log CPM) | v3.3.6 | 2018 |
| 3 | 0.06 | 1.06 | 4.04 | 2.52 | Mutual Nearest Neighbors (log scran) | v3.3.6 | 2018 |
| 1 | 0.07 | 0.94 | 0.76 | 0.38 | Procrustes | v1.5.3 | 1975 |

# Label projection

Using cell labels from a reference dataset to annotate an unseen dataset

A major challenge for integrating single-cell datasets is creating matching cell type annotations for each cell. One of the most common strategies for annotating cell types is referred to as “cluster-then-annotate”, whereby cells are aggregated into clusters based on feature similarity and then manually characterized based on differential gene expression or previously identified marker genes. Recently, methods have emerged that build on this strategy and annotate cells using known marker genes. However, these strategies pose a difficulty for integrating atlas-scale datasets, as the particular annotations may not match across datasets.

To ensure that the cell type labels in newly generated datasets match existing reference datasets, some methods align cells to a previously annotated reference dataset and then project labels from the reference to the new dataset.

Here, we compare methods for annotation based on a reference dataset. The datasets consist of two or more samples of single-cell profiles that have been manually annotated with matching labels. These datasets are then split into training and test batches, and the task of each method is to train a cell type classifier on the training set and project those labels onto the test set.
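A minimal sketch of this setup, using scikit-learn's `LogisticRegression` on invented data (the array shapes, feature matrix, and integer labels are placeholders, not real single-cell data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: rows are cells, columns are normalized gene features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))
y_train = rng.integers(0, 3, size=200)   # manually annotated cell type labels
X_test = rng.normal(size=(50, 20))       # unseen batch with no annotations

# Train on the annotated batch, then project its labels onto the test batch.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
```

The benchmarked multilayer perceptron baseline follows the same fit/predict pattern; only the classifier changes.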

## The metrics

Metrics for label projection aim to characterize how well each classifier correctly assigns cell type labels to cells in the test set.

• Accuracy: The fraction of correctly assigned labels.
• F1 score: The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.
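Both metrics are available in scikit-learn. The small worked example below (with invented labels) also shows why micro-averaged F1 equals accuracy in single-label multiclass classification, which is why the two columns in the tables below often agree:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical cell type annotations for six test cells.
y_true = ["alpha", "beta", "beta", "delta", "alpha", "beta"]
y_pred = ["alpha", "beta", "alpha", "delta", "alpha", "beta"]

acc = accuracy_score(y_true, y_pred)                   # fraction of correct labels
macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean of per-class F1
micro_f1 = f1_score(y_true, y_pred, average="micro")   # global precision/recall; equals accuracy here
print(acc, macro_f1, micro_f1)
```

Macro F1 weights each cell type equally regardless of abundance, so it penalizes misclassifying rare cell types more heavily than accuracy does.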

## The results

### Pancreas (by batch)

| Rank | Accuracy | F1 score | Micro F1 score | Memory (GB) | Runtime (min) | Name | Code | Year |
|------|----------|----------|----------------|-------------|---------------|------|------|------|
| 3 | 0.96 | 0.96 | 0.96 | 5.70 | 0.21 | Logistic regression (log CPM) | v0.23.2 | 2013 |
| 4 | 0.96 | 0.96 | 0.96 | 7.52 | 4.67 | Logistic regression (log scran) | v0.23.2 | 2013 |
| 2 | 0.96 | 0.96 | 0.96 | 5.01 | 0.23 | Multilayer perceptron (log CPM) | v0.23.2 | 1990 |
| 1 | 0.96 | 0.96 | 0.96 | 7.76 | 5.00 | Multilayer perceptron (log scran) | v0.23.2 | 1990 |

### Pancreas (random split)

| Rank | Accuracy | F1 score | Micro F1 score | Memory (GB) | Runtime (min) | Name | Code | Year |
|------|----------|----------|----------------|-------------|---------------|------|------|------|
| 4 | 0.99 | 0.99 | 0.99 | 4.98 | 0.22 | Logistic regression (log CPM) | v0.23.2 | 2013 |
| 3 | 0.99 | 0.99 | 0.99 | 8.06 | 4.77 | Logistic regression (log scran) | v0.23.2 | 2013 |
| 2 | 0.99 | 0.99 | 0.99 | 5.90 | 0.23 | Multilayer perceptron (log CPM) | v0.23.2 | 1990 |
| 1 | 0.99 | 0.99 | 0.99 | 7.49 | 4.75 | Multilayer perceptron (log scran) | v0.23.2 | 1990 |

### Zebrafish (by labels)

| Rank | Accuracy | F1 score | Micro F1 score | Memory (GB) | Runtime (min) | Name | Code | Year |
|------|----------|----------|----------------|-------------|---------------|------|------|------|
| 2 | 0.23 | 0.28 | 0.23 | 1.77 | 0.64 | Logistic regression (log CPM) | v0.23.2 | 2013 |
| 1 | 0.24 | 0.28 | 0.24 | 11.94 | 7.76 | Logistic regression (log scran) | v0.23.2 | 2013 |
| 4 | 0.20 | 0.22 | 0.20 | 1.76 | 0.45 | Multilayer perceptron (log CPM) | v0.23.2 | 1990 |
| 3 | 0.22 | 0.26 | 0.22 | 10.50 | 7.65 | Multilayer perceptron (log scran) | v0.23.2 | 1990 |

### Zebrafish (random split)

| Rank | Accuracy | F1 score | Micro F1 score | Memory (GB) | Runtime (min) | Name | Code | Year |
|------|----------|----------|----------------|-------------|---------------|------|------|------|
| 3 | 0.83 | 0.82 | 0.83 | 1.68 | 0.77 | Logistic regression (log CPM) | v0.23.2 | 2013 |
| 1 | 0.84 | 0.84 | 0.84 | 11.05 | 7.98 | Logistic regression (log scran) | v0.23.2 | 2013 |
| 4 | 0.82 | 0.82 | 0.82 | 1.66 | 0.53 | Multilayer perceptron (log CPM) | v0.23.2 | 1990 |
| 2 | 0.83 | 0.83 | 0.83 | 10.59 | 7.76 | Multilayer perceptron (log scran) | v0.23.2 | 1990 |