# Dimensionality reduction for visualisation

Reduction of high-dimensional datasets to 2D for visualization & interpretation

4 datasets · 23 methods · 3 control methods · 10 metrics

Dimensionality reduction is one of the key challenges in single-cell data representation. Routine single-cell RNA sequencing (scRNA-seq) experiments measure cells in roughly 20,000-30,000 dimensions (i.e., features: mostly gene transcripts, but also other functional elements encoded in mRNA, such as lncRNAs). Since the inception of scRNA-seq, experiments have grown in the number of cells measured. Originally, cutting-edge SmartSeq experiments would yield a few hundred cells at best. Now it is not uncommon to see experiments that yield over 100,000 cells, or even more than 1 million.

Each *feature* in a dataset functions as a single dimension. While each of the ~30,000 dimensions measured in each cell contributes to an underlying data structure, the overall structure of the data is challenging to display in a few dimensions because of data sparsity and the *“curse of dimensionality”* (distances in high-dimensional data don’t distinguish data points well). Thus, we need a way to reduce the dimensionality of the data for visualization and interpretation.
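The remark that distances in high-dimensional data do not distinguish points well can be illustrated with a small experiment. The sketch below is not part of the benchmark; it assumes uniformly random points (a hypothetical stand-in for real expression data) and reports the relative contrast between the farthest and nearest neighbour of one query point, which typically shrinks as the number of dimensions grows.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over distances from one query point to all others."""
    rng = random.Random(seed)
    # Uniform random points: an illustrative assumption, not real scRNA-seq data.
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    for n_dims in (2, 20, 200, 2000):
        print(n_dims, round(relative_contrast(200, n_dims), 3))
```

A contrast near zero means the nearest and farthest points are almost equally far away, which is why neighbourhoods become hard to interpret without reducing the dimensionality first.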

## Metrics

- **continuity** (Zhang, Shang, and Zhang 2021): Continuity measures error of hard extrusions based on nearest neighbor coranking.
- **Density preservation** (Narayan, Berger, and Cho 2021): Similarity between local densities in the high-dimensional data and the reduced data.
- **Distance correlation** (Schober, Boer, and Schwarte 2018): Spearman correlation between all pairwise Euclidean distances in the original and dimension-reduced data.
- **Distance correlation (spectral)** (Coifman and Lafon 2006): Spearman correlation between all pairwise diffusion distances in the original and dimension-reduced data.
- **local continuity meta criterion** (Zhang, Shang, and Zhang 2021): The local continuity meta criterion is the co-KNN size with baseline removal, which favors locality.
- **global property** (Zhang, Shang, and Zhang 2021): The global property metric is a summary of the global co-KNN.
- **local property** (Zhang, Shang, and Zhang 2021): The local property metric is a summary of the local co-KNN.
- **co-KNN size** (Zhang, Shang, and Zhang 2021): co-KNN size counts how many points are in both k-nearest neighbors before and after the dimensionality reduction.
- **co-KNN AUC** (Zhang, Shang, and Zhang 2021): co-KNN AUC is the area under the co-KNN curve.
- **trustworthiness** (Venna and Kaski 2001): A measurement of similarity between the rank of each point’s nearest neighbors in the high-dimensional data and the reduced data.
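Two of the metrics above are simple enough to sketch directly. The code below is an illustrative pure-Python version, not the benchmark's implementation: `co_knn_size` is assumed here to be the mean fraction of shared k-nearest neighbours (the benchmark may normalise differently), and `distance_correlation` is the Spearman correlation of all pairwise Euclidean distances. All function names are hypothetical.

```python
import math


def pairwise_distances(points):
    """Full Euclidean distance matrix as nested lists."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def knn_sets(dist, k):
    """For each point, the set of indices of its k nearest neighbours (self excluded)."""
    sets = []
    for i, row in enumerate(dist):
        order = sorted((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        sets.append(set(order[:k]))
    return sets


def co_knn_size(high, low, k):
    """Mean fraction of k-nearest neighbours shared before and after reduction."""
    high_sets = knn_sets(pairwise_distances(high), k)
    low_sets = knn_sets(pairwise_distances(low), k)
    shared = sum(len(a & b) for a, b in zip(high_sets, low_sets))
    return shared / (len(high) * k)


def ranks(values):
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def distance_correlation(high, low):
    """Spearman correlation between pairwise distances in both spaces."""
    dh, dl = pairwise_distances(high), pairwise_distances(low)
    n = len(high)
    upper_h = [dh[i][j] for i in range(n) for j in range(i + 1, n)]
    upper_l = [dl[i][j] for i in range(n) for j in range(i + 1, n)]
    return pearson(ranks(upper_h), ranks(upper_l))


if __name__ == "__main__":
    # Toy data: the third coordinate is constant, so dropping it loses nothing.
    high = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (3, 3, 0), (5, 0.5, 0)]
    low = [p[:2] for p in high]
    print(co_knn_size(high, low, k=2), distance_correlation(high, low))
```

Both functions build the full distance matrix, so this sketch only suits small inputs; it is meant to make the definitions concrete rather than to scale to the datasets listed below.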

## Results

Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.
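As a rough illustration of how a scaled score and the “Overall mean” entry could be computed, the sketch below rescales a raw score between two reference values and then averages across datasets. The choice of reference values (for example, scores of the control methods) is an assumption made for this example; the text above does not specify the scaling. The function names and the numbers are hypothetical.

```python
from statistics import mean


def scale_score(raw, low_ref, high_ref):
    """Linearly rescale a raw score so that low_ref maps to 0 and high_ref to 1."""
    if high_ref == low_ref:
        raise ValueError("reference scores must differ")
    return (raw - low_ref) / (high_ref - low_ref)


def overall_mean(scaled_by_dataset):
    """Mean of one method's scaled scores across all datasets."""
    return mean(scaled_by_dataset.values())


if __name__ == "__main__":
    # Made-up numbers for a single method and metric, for illustration only.
    scaled = {
        "dataset_a": scale_score(0.75, low_ref=0.5, high_ref=1.0),
        "dataset_b": scale_score(0.9, low_ref=0.4, high_ref=0.9),
    }
    print(scaled, overall_mean(scaled))
```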

## Details

## Methods

- **densMAP (logCP10k)** (Narayan, Berger, and Cho 2021): densMAP is a modification of UMAP that adds an extra cost term in order to preserve information about the relative local density of the data. It is performed on the same inputs as UMAP.
- **densMAP (logCP10k, 1kHVG)** (Narayan, Berger, and Cho 2021): densMAP is a modification of UMAP that adds an extra cost term in order to preserve information about the relative local density of the data. It is performed on the same inputs as UMAP.
- **densMAP PCA (logCP10k)** (Narayan, Berger, and Cho 2021): densMAP is a modification of UMAP that adds an extra cost term in order to preserve information about the relative local density of the data. It is performed on the same inputs as UMAP.
- **densMAP PCA (logCP10k, 1kHVG)** (Narayan, Berger, and Cho 2021): densMAP is a modification of UMAP that adds an extra cost term in order to preserve information about the relative local density of the data. It is performed on the same inputs as UMAP.
- **Diffusion maps** (Coifman and Lafon 2006): Diffusion maps uses an affinity matrix to describe the similarity between data points, which is then transformed into a graph Laplacian. The eigenvalue-weighted eigenvectors of the graph Laplacian are then used to create the embedding. Diffusion maps is calculated on the logCPM expression matrix.
- **NeuralEE (CPU) (Default)** (Xiong et al. 2020): NeuralEE is a neural network implementation of elastic embedding. It is a non-linear method that preserves pairwise distances between data points. NeuralEE uses a neural network to optimize an objective function that measures the difference between pairwise distances in the original high-dimensional space and the two-dimensional space. It is computed on both the recommended input from the package authors of 500 HVGs selected from a logged expression matrix (without sequencing depth scaling) and the default logCPM matrix with 1000 HVGs.
- **NeuralEE (CPU) (logCP10k, 1kHVG)** (Xiong et al. 2020): NeuralEE is a neural network implementation of elastic embedding. It is a non-linear method that preserves pairwise distances between data points. NeuralEE uses a neural network to optimize an objective function that measures the difference between pairwise distances in the original high-dimensional space and the two-dimensional space. It is computed on both the recommended input from the package authors of 500 HVGs selected from a logged expression matrix (without sequencing depth scaling) and the default logCPM matrix with 1000 HVGs.
- **PCA (logCP10k)** (Pearson 1901): PCA or “Principal Component Analysis” is a linear method that finds orthogonal directions in the data that capture the most variance. We select only the first two principal components as the two-dimensional embedding. PCA is calculated on the logCPM expression matrix with and without selecting 1000 HVGs.
- **PCA (logCP10k, 1kHVG)** (Pearson 1901): PCA or “Principal Component Analysis” is a linear method that finds orthogonal directions in the data that capture the most variance. We select only the first two principal components as the two-dimensional embedding. PCA is calculated on the logCPM expression matrix with and without selecting 1000 HVGs.
- **PHATE (default)** (Moon et al. 2019): PHATE or “Potential of Heat-diffusion for Affinity-based Transition Embedding” uses the potential of heat diffusion to preserve trajectories in a dataset via a diffusion process. It is an affinity-based method that creates an embedding by finding the dominant eigenvalues of a Markov transition matrix. We evaluate several variants, including the recommended square-root transformed CPM matrix as input, this input with the gamma parameter set to zero, and the normal logCPM transformed matrix with and without HVG selection.
- **PHATE (logCP10k, 1kHVG)** (Moon et al. 2019): PHATE or “Potential of Heat-diffusion for Affinity-based Transition Embedding” uses the potential of heat diffusion to preserve trajectories in a dataset via a diffusion process. It is an affinity-based method that creates an embedding by finding the dominant eigenvalues of a Markov transition matrix. We evaluate several variants, including the recommended square-root transformed CPM matrix as input, this input with the gamma parameter set to zero, and the normal logCPM transformed matrix with and without HVG selection.
- **PHATE (logCP10k)** (Moon et al. 2019): PHATE or “Potential of Heat-diffusion for Affinity-based Transition Embedding” uses the potential of heat diffusion to preserve trajectories in a dataset via a diffusion process. It is an affinity-based method that creates an embedding by finding the dominant eigenvalues of a Markov transition matrix. We evaluate several variants, including the recommended square-root transformed CPM matrix as input, this input with the gamma parameter set to zero, and the normal logCPM transformed matrix with and without HVG selection.
- **PHATE (gamma=0)** (Moon et al. 2019): PHATE or “Potential of Heat-diffusion for Affinity-based Transition Embedding” uses the potential of heat diffusion to preserve trajectories in a dataset via a diffusion process. It is an affinity-based method that creates an embedding by finding the dominant eigenvalues of a Markov transition matrix. We evaluate several variants, including the recommended square-root transformed CPM matrix as input, this input with the gamma parameter set to zero, and the normal logCPM transformed matrix with and without HVG selection.
- **PyMDE Preserve Distances (logCP10k)** (Agrawal, Ali, and Boyd 2021): PyMDE is a Python implementation of minimum-distortion embedding. It is a non-linear method that preserves distances between cells or neighborhoods in the high-dimensional space. It is computed with options to preserve distances between cells or neighborhoods, with the logCPM matrix with and without HVG selection as input.
- **PyMDE Preserve Distances (logCP10k, 1kHVG)** (Agrawal, Ali, and Boyd 2021): PyMDE is a Python implementation of minimum-distortion embedding. It is a non-linear method that preserves distances between cells or neighborhoods in the high-dimensional space. It is computed with options to preserve distances between cells or neighborhoods, with the logCPM matrix with and without HVG selection as input.
- **PyMDE Preserve Neighbors (logCP10k)** (Agrawal, Ali, and Boyd 2021): PyMDE is a Python implementation of minimum-distortion embedding. It is a non-linear method that preserves distances between cells or neighborhoods in the high-dimensional space. It is computed with options to preserve distances between cells or neighborhoods, with the logCPM matrix with and without HVG selection as input.
- **PyMDE Preserve Neighbors (logCP10k, 1kHVG)** (Agrawal, Ali, and Boyd 2021): PyMDE is a Python implementation of minimum-distortion embedding. It is a non-linear method that preserves distances between cells or neighborhoods in the high-dimensional space. It is computed with options to preserve distances between cells or neighborhoods, with the logCPM matrix with and without HVG selection as input.
- **Random Features** (Open Problems for Single Cell Analysis Consortium 2022): Randomly generated two-dimensional coordinates from a normal distribution.
- **Spectral Features** (Open Problems for Single Cell Analysis Consortium 2022): Use 1000-dimensional diffusion maps as an embedding.
- **True Features** (Open Problems for Single Cell Analysis Consortium 2022): Use of the original feature inputs as the ‘embedding’.
- **t-SNE (logCP10k)** (van der Maaten and Hinton 2008): t-SNE or t-distributed Stochastic Neighbor Embedding converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. We use the implementation in the scanpy package with the result of PCA on the logCPM expression matrix (with and without HVG selection).
- **t-SNE (logCP10k, 1kHVG)** (van der Maaten and Hinton 2008): t-SNE or t-distributed Stochastic Neighbor Embedding converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. We use the implementation in the scanpy package with the result of PCA on the logCPM expression matrix (with and without HVG selection).
- **UMAP (logCP10k)** (McInnes, Healy, and Melville 2018): UMAP or Uniform Manifold Approximation and Projection is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. We perform UMAP on the logCPM expression matrix before and after HVG selection and with and without PCA as a pre-processing step.
- **UMAP (logCP10k, 1kHVG)** (McInnes, Healy, and Melville 2018): UMAP or Uniform Manifold Approximation and Projection is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. We perform UMAP on the logCPM expression matrix before and after HVG selection and with and without PCA as a pre-processing step.
- **UMAP PCA (logCP10k)** (McInnes, Healy, and Melville 2018): UMAP or Uniform Manifold Approximation and Projection is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. We perform UMAP on the logCPM expression matrix before and after HVG selection and with and without PCA as a pre-processing step.
- **UMAP PCA (logCP10k, 1kHVG)** (McInnes, Healy, and Melville 2018): UMAP or Uniform Manifold Approximation and Projection is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. We perform UMAP on the logCPM expression matrix before and after HVG selection and with and without PCA as a pre-processing step.
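Most method variants above share the same two pre-processing labels, “logCP10k” and “1kHVG”. The sketch below shows one plausible reading of them, assuming logCP10k means log(1 + counts per 10,000) per cell and using a plain variance ranking as a simplified stand-in for highly variable gene selection. Neither function is taken from the benchmark code, and both names are hypothetical.

```python
import math


def log_cp10k(counts):
    """Scale each cell (row) to 10,000 total counts, then apply log(1 + x)."""
    normalised = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell stays all-zero rather than dividing by zero.
            normalised.append([0.0] * len(cell))
            continue
        normalised.append([math.log1p(c / total * 10_000) for c in cell])
    return normalised


def top_variable_genes(matrix, n_genes):
    """Indices of the n_genes columns with the largest variance, in column order.

    Assumption: real HVG selection usually also models the mean-variance trend;
    this plain variance ranking is only an illustration.
    """
    n_cells = len(matrix)
    variances = []
    for g in range(len(matrix[0])):
        column = [row[g] for row in matrix]
        mu = sum(column) / n_cells
        variances.append(sum((v - mu) ** 2 for v in column) / n_cells)
    ranked = sorted(range(len(variances)), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_genes])


if __name__ == "__main__":
    counts = [[1, 1, 8], [0, 5, 5], [2, 0, 0]]  # toy cells x genes matrix
    logged = log_cp10k(counts)
    keep = top_variable_genes(logged, n_genes=2)
    print(keep, [[row[g] for g in keep] for row in logged])
```

The “1kHVG” variants would correspond to keeping 1000 genes at the second step before running the embedding method; here only two are kept because the toy matrix has three genes.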

## Baseline methods

- **Random Features**: Randomly generated two-dimensional coordinates from a normal distribution.
- **Spectral Features**: Use 1000-dimensional diffusion maps as an embedding.
- **True Features**: Use of the original feature inputs as the ‘embedding’.
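The Random Features baseline is the easiest of the three to picture. A minimal sketch, assuming a standard normal distribution and a fixed seed (both choices are assumptions for this example, and the function name is hypothetical):

```python
import random


def random_features(n_cells, seed=0):
    """Return one random 2D coordinate per cell, ignoring the input data entirely."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_cells)]


if __name__ == "__main__":
    for x, y in random_features(3):
        print(round(x, 3), round(y, 3))
```

Because the coordinates carry no information about the cells, any metric score this baseline achieves marks the level expected by chance.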

## Datasets

- **Mouse hematopoietic stem cell differentiation** (Nestorowa et al. 2016): 1.6k hematopoietic stem and progenitor cells from mouse bone marrow. Sequenced by Smart-seq2. 1920 cells x 43258 features with 3 cell type labels.
- **Mouse myeloid lineage differentiation** (Olsson et al. 2016): Myeloid lineage differentiation from mouse blood. Sequenced by SMARTseq in 2016 by Olsson et al. 660 cells x 112815 features with 4 cell type labels.
- **5k Peripheral blood mononuclear cells** (10x Genomics 2019): 5k Peripheral Blood Mononuclear Cells (PBMCs) from a healthy donor. Sequenced on 10X v3 chemistry in July 2019 by 10X Genomics. 5247 cells x 20822 features with no cell type labels.
- **Zebrafish** (Wagner et al. 2018): 90k cells from zebrafish embryos throughout the first day of development, with and without a knockout of chordin, an important developmental gene. Dimensions: 26022 cells, 25258 genes. 24 cell types (avg. 1084±1156 cells per cell type).

## Quality control results

| Category | Name | Value | Condition | Severity |
|---|---|---|---|---|
| Raw results | Dataset 'zebrafish_labs' %missing | 0.60 | pct_missing <= .1 | ✗✗✗ |
| Raw results | Metric 'continuity' %missing | 0.25 | pct_missing <= .1 | ✗✗ |
| Raw results | Metric 'lcmc' %missing | 0.25 | pct_missing <= .1 | ✗✗ |
| Raw results | Metric 'qglobal' %missing | 0.25 | pct_missing <= .1 | ✗✗ |
| Raw results | Metric 'qlocal' %missing | 0.25 | pct_missing <= .1 | ✗✗ |
| Raw results | Metric 'qnn' %missing | 0.25 | pct_missing <= .1 | ✗✗ |
| Raw results | Metric 'qnn_auc' %missing | 0.25 | pct_missing <= .1 | ✗✗ |
| Raw results | Method 'densmap_logCP10k' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'densmap_logCP10k_1kHVG' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'densmap_pca_logCP10k' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'densmap_pca_logCP10k_1kHVG' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'diffusion_map' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'neuralee_default' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'neuralee_logCP10k_1kHVG' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'pca_logCP10k' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'pca_logCP10k_1kHVG' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'phate_default' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'phate_logCP10k' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'phate_logCP10k_1kHVG' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'phate_sqrt' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'pymde_distances_log_cp10k' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'pymde_distances_log_cp10k_hvg' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'pymde_neighbors_log_cp10k' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'pymde_neighbors_log_cp10k_hvg' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'random_features' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'spectral_features' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'true_features' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'tsne_logCP10k' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'tsne_logCP10k_1kHVG' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'umap_logCP10k' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'umap_logCP10k_1kHVG' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'umap_pca_logCP10k' %missing | 0.15 | pct_missing <= .1 | ✗ |
| Raw results | Method 'umap_pca_logCP10k_1kHVG' %missing | 0.15 | pct_missing <= .1 | ✗ |
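Each row in the quality control table reports the fraction of missing raw results for one dataset, metric, or method, tested against the condition `pct_missing <= .1`. A minimal sketch of such a check, assuming missing results are represented as `None` (an assumption; the actual pipeline is not shown here) and leaving out the severity column, whose rule is not stated:

```python
def pct_missing(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def check_missing(name, values, threshold=0.1):
    """Summarise one QC row: the missing fraction and whether it meets the condition."""
    fraction = pct_missing(values)
    return {
        "name": name,
        "value": round(fraction, 2),
        "passed": fraction <= threshold,
    }


if __name__ == "__main__":
    # Hypothetical scores for one metric across four (method, dataset) pairs.
    print(check_missing("Metric 'example'", [0.8, None, 0.6, 0.7]))
```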

## References

Agrawal, Ali, and Boyd. 2021. *Foundations and Trends in Machine Learning* 14 (3): 211–378. https://doi.org/10.1561/2200000090.

Coifman and Lafon. 2006. *Applied and Computational Harmonic Analysis* 21 (1): 5–30. https://doi.org/10.1016/j.acha.2006.04.006.

McInnes, Healy, and Melville. 2018. *arXiv*. https://doi.org/10.48550/arxiv.1802.03426.

Moon et al. 2019. *Nature Biotechnology* 37 (12): 1482–92. https://doi.org/10.1038/s41587-019-0336-3.

Narayan, Berger, and Cho. 2021. *Nature Biotechnology* 39 (6): 765–74. https://doi.org/10.1038/s41587-020-00801-7.

Nestorowa et al. 2016. *Blood* 128 (8): e20–31. https://doi.org/10.1182/blood-2016-05-716480.

Olsson et al. 2016. *Nature* 537 (7622): 698–702. https://doi.org/10.1038/nature19348.

Pearson. 1901. *The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science* 2 (11): 559–72. https://doi.org/10.1080/14786440109462720.

Schober, Boer, and Schwarte. 2018. *Anesthesia & Analgesia* 126 (5): 1763–68. https://doi.org/10.1213/ane.0000000000002864.

van der Maaten and Hinton. 2008. *Journal of Machine Learning Research* 9 (86): 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html.

Venna and Kaski. 2001. *Artificial Neural Networks ICANN 2001*, 485–91. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-44668-0_68.

Wagner et al. 2018. *Science* 360 (6392): 981–87. https://doi.org/10.1126/science.aar4362.

Xiong et al. 2020. *Frontiers in Genetics* 11. https://doi.org/10.3389/fgene.2020.00786.

Zhang, Shang, and Zhang. 2021. *Heliyon* 7 (2): e06199. https://doi.org/10.1016/j.heliyon.2021.e06199.