# Multimodal Data Integration

Alignment of cellular profiles from two different modalities

3 datasets · 5 methods · 2 control methods · 2 metrics

Cellular function is regulated by the complex interplay of different types of biological molecules (DNA, RNA, proteins, etc.), which determine the state of a cell. Several recently described technologies allow for simultaneous measurement of different aspects of cellular state. For example, sci-CAR jointly profiles RNA expression and chromatin accessibility in the same cell, and CITE-seq measures surface protein abundance and RNA expression from each cell. These technologies enable us to better understand cellular function; however, such datasets are still rare, and profiling multiple modalities at once involves tradeoffs.

Joint methods can be more expensive, lower throughput, or noisier than measuring a single modality at a time. It is therefore useful to develop methods capable of integrating measurements of the same biological system that were obtained using different technologies on different cells.

Here the goal is to learn a latent space where cells profiled by different technologies in different modalities are matched if they have the same state. We use jointly profiled data as ground truth so that we can evaluate whether observations of the same cell acquired using different modalities end up close together. A perfect result has each pair of observations sharing the same coordinates in the latent space.

## Summary

## Metrics

- **kNN Area Under the Curve** (Stanley et al. 2020): Let \(f(i) \in F\) be the scRNA-seq measurement of cell \(i\), and \(g(i) \in G\) be the scATAC-seq measurement of cell \(i\). kNN-AUC calculates the average percentage overlap of neighborhoods of \(f(i)\) in \(F\) with neighborhoods of \(g(i)\) in \(G\). Higher is better.
- **Mean squared error** (Lance et al. 2022): Mean squared error (MSE) is the average squared distance between each pair of matched observations of the same cell in the learned latent space. Lower is better.
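The two metrics can be illustrated with a minimal NumPy sketch. It follows the verbal definitions above rather than the benchmark's actual implementation: the function names, the choice of neighborhood sizes `ks`, and the use of a plain average over `ks` as the "area under the curve" are all assumptions made for illustration.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding


def knn_indices(x, k):
    """Indices of the k nearest neighbors of each row of x, excluding itself."""
    d = pairwise_sq_dists(x, x)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]


def neighborhood_overlap(f, g, k):
    """Mean fraction of k-nearest neighbors shared between F and G, per cell."""
    nf, ng = knn_indices(f, k), knn_indices(g, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nf, ng)]
    return float(np.mean(shared)) / k


def knn_auc(f, g, ks):
    """Overlap averaged over several neighborhood sizes (assumed AUC proxy)."""
    return float(np.mean([neighborhood_overlap(f, g, k) for k in ks]))


def paired_mse(f, g):
    """Mean squared distance between matched rows (same cell) of F and G."""
    return float(np.mean(np.sum((f - g) ** 2, axis=1)))


# Hypothetical data: row i of each array is the same cell in each modality.
rng = np.random.default_rng(0)
f = rng.normal(size=(50, 5))
g_perfect = f.copy()                 # perfectly aligned latent coordinates
g_shuffled = f[rng.permutation(50)]  # cell pairing destroyed
ks = [5, 10, 20]

print(knn_auc(f, g_perfect, ks), paired_mse(f, g_perfect))
print(knn_auc(f, g_shuffled, ks), paired_mse(f, g_shuffled))
```

A perfectly aligned pair of embeddings gives full neighborhood overlap and zero MSE, while shuffling the pairing lowers the overlap and raises the MSE.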

## Results

Results table of the scores per method, dataset and metric (after scaling). The “Overall mean” dataset is the mean value across all datasets.

## Details

## Methods

- **Harmonic Alignment (log scran)** (Stanley et al. 2020): Harmonic alignment embeds cellular data from each modality into a common space by computing a mapping between the 100-dimensional diffusion maps of each modality. The mapping is obtained by computing an isometric transformation of the eigenmaps and concatenating the resulting diffusion maps into a joint 200-dimensional space. This joint diffusion map space is used as output for the task.
- **Harmonic Alignment (sqrt CP10k)** (Stanley et al. 2020): Harmonic alignment embeds cellular data from each modality into a common space by computing a mapping between the 100-dimensional diffusion maps of each modality. The mapping is obtained by computing an isometric transformation of the eigenmaps and concatenating the resulting diffusion maps into a joint 200-dimensional space. This joint diffusion map space is used as output for the task.
- **Mutual Nearest Neighbors (log CP10k)** (Haghverdi et al. 2018): Mutual nearest neighbors (MNN) embeds cellular data from each modality into a common space by computing a mapping between modality-specific 100-dimensional SVD embeddings. The embeddings are integrated using the FastMNN version of the MNN algorithm, which generates an embedding of the second modality mapped to the SVD space of the first. This corrected joint SVD space is used as output for the task.
- **Mutual Nearest Neighbors (log scran)** (Haghverdi et al. 2018): Mutual nearest neighbors (MNN) embeds cellular data from each modality into a common space by computing a mapping between modality-specific 100-dimensional SVD embeddings. The embeddings are integrated using the FastMNN version of the MNN algorithm, which generates an embedding of the second modality mapped to the SVD space of the first. This corrected joint SVD space is used as output for the task.
- **Procrustes superimposition** (Gower 1975): Procrustes superimposition embeds cellular data from each modality into a common space by aligning the 100-dimensional SVD embeddings to one another using an isomorphic transformation that minimizes the root mean squared distance between points. The unmodified SVD embedding and the transformed second modality are used as output for the task.
- **Random Features** (Open Problems for Single Cell Analysis Consortium 2022): A 20-dimensional SVD is computed on the first modality and is then randomly permuted twice, once for use as the output for each modality, producing random features with no correlation between modalities.
- **True Features** (Open Problems for Single Cell Analysis Consortium 2022): A 20-dimensional SVD is computed on the first modality, and this same embedding is used as output for both modalities, producing perfectly aligned features from each modality.
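As an illustration of the Procrustes superimposition entry above, here is a minimal NumPy sketch of the classical orthogonal Procrustes solution applied to two SVD embeddings. It is not the benchmark's code: the helper names, the small embedding dimension, and the simulated data are assumptions, and unlike the description above this sketch also centers and rescales the first embedding before alignment.

```python
import numpy as np


def svd_embed(x, n_components):
    """Project centered data onto its top singular directions."""
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def procrustes_align(x, y):
    """Rotate and scale y onto x after centering both and scaling to unit norm.

    Returns the standardized x and the transformed y.
    """
    x0 = x - x.mean(axis=0)
    y0 = y - y.mean(axis=0)
    x0 = x0 / np.linalg.norm(x0)
    y0 = y0 / np.linalg.norm(y0)
    # Orthogonal Procrustes: the rotation minimizing ||y0 @ R - x0||_F
    u, s, vt = np.linalg.svd(y0.T @ x0)
    rotation = u @ vt
    scale = s.sum()  # optimal isotropic scale when y0 has unit norm
    return x0, scale * (y0 @ rotation)


# Hypothetical paired data: two modalities generated from one shared latent state.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mode1 = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(200, 30))
mode2 = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(200, 20))

emb1, emb2 = svd_embed(mode1, 5), svd_embed(mode2, 5)
ref, aligned = procrustes_align(emb1, emb2)
print("mean squared distance after alignment:",
      np.mean(np.sum((ref - aligned) ** 2, axis=1)))
```

Because the fitted transformation is restricted to a rotation (or reflection) and a single scale factor, it cannot distort the internal geometry of the second embedding; it only repositions it relative to the first.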

## Baseline methods

- **Random Features**: 20-dimensional SVD is computed on the first modality, and is then randomly permuted twice, once for use as the output for each modality, producing random features with no correlation between modalities.
- **True Features**: 20-dimensional SVD is computed on the first modality, and this same embedding is used as output for both modalities, producing perfectly aligned features from each modality.
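The two control methods are simple enough to sketch directly. The following is a hedged NumPy illustration, not the benchmark's implementation: the function names are hypothetical, and it assumes that "randomly permuted" means shuffling the rows (cells) of the embedding.

```python
import numpy as np


def svd_embed(x, n_components=20):
    """Project centered data onto its top singular directions."""
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def true_features(mode1, n_components=20):
    """Positive control: the same embedding is returned for both modalities."""
    emb = svd_embed(mode1, n_components)
    return emb, emb.copy()


def random_features(mode1, n_components=20, seed=0):
    """Negative control: two independent row shuffles of one embedding."""
    rng = np.random.default_rng(seed)
    emb = svd_embed(mode1, n_components)
    n = emb.shape[0]
    return emb[rng.permutation(n)], emb[rng.permutation(n)]


# Hypothetical first-modality matrix: 60 cells by 40 features.
mode1 = np.random.default_rng(42).normal(size=(60, 40))

true_a, true_b = true_features(mode1)
rand_a, rand_b = random_features(mode1)
print("true features MSE:", np.mean(np.sum((true_a - true_b) ** 2, axis=1)))
print("random features MSE:", np.mean(np.sum((rand_a - rand_b) ** 2, axis=1)))
```

The positive control bounds the best achievable score (identical coordinates for every pair), while the negative control keeps the overall distribution of the embedding but breaks the cell-to-cell pairing.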

## Datasets

- **CITE-seq Cord Blood Mononuclear Cells** (Stoeckius et al. 2017): 8k cord blood mononuclear cells sequenced by CITE-seq, a multimodal addition to the 10x scRNA-seq platform that allows simultaneous measurement of RNA and protein.
- **sciCAR Cell Lines** (Cao et al. 2018): 5k cells from a time-series of dexamethasone treatment sequenced by sci-CAR, a combinatorial indexing-based co-assay that jointly profiles chromatin accessibility and mRNA.
- **sciCAR Mouse Kidney** (Cao et al. 2018): 11k cells from adult mouse kidney sequenced by sci-CAR, a combinatorial indexing-based co-assay that jointly profiles chromatin accessibility and mRNA.

## Quality control results

✓ All checks succeeded!

## Raw results

## Visualization of raw results

## References

Cao et al. 2018. *Science* 361 (6409): 1380–85. https://doi.org/10.1126/science.aau0730.

Gower 1975. *Psychometrika* 40 (1): 33–51. https://doi.org/10.1007/bf02291478.

Haghverdi et al. 2018. *Nature Biotechnology* 36 (5): 421–27. https://doi.org/10.1038/nbt.4091.

Lance et al. 2022. *bioRxiv*. https://doi.org/10.1101/2022.04.11.487796.

Stanley et al. 2020. *Proceedings of the 2020 SIAM International Conference on Data Mining*, 316–24. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611976236.36.

Stoeckius et al. 2017. *Nature Methods* 14 (9): 865–68. https://doi.org/10.1038/nmeth.4380.