## The task

Dimensionality reduction is one of the key challenges in single-cell data representation. Routine single-cell RNA sequencing (scRNA-seq) experiments measure cells in roughly 20,000-30,000 dimensions (i.e., features - mostly gene transcripts but also other functional elements encoded in mRNA such as lncRNAs). Since its inception, scRNA-seq experiments have been growing in terms of the number of cells measured. Originally, cutting-edge SmartSeq experiments would yield a few hundred cells, at best. Now, it is not uncommon to see experiments that yield over [100,000 cells] (https://www.nature.com/articles/s41586-018-0590-4) or even > 1 million cells.

Each *feature* in a dataset functions as a single dimension. While each of the ~30,000
dimensions measured in each cell (not accounting for roughly 75-90% data dropout per
cell, another issue entirely), likely contribute to some sort of data structure, the
overall structure of the data is diluted due to the *“curse of
dimensionality”*. In short, it’s
difficult to visualize the contribution of each individual gene in a way that makes
sense to the human eye, i.e., two or three dimensions (at most). Thus, we need to find a
way to dimensionally reduce
the data for visualization and interpretation.

## The methods

### Principal component analysis (PCA)

*Adapted from the sklearn
documentation*.

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.

It uses the `scipy.sparse.linalg`

ARPACK implementation of the truncated SVD as provided
by scanpy.

### t-distributed stochastic neighbor embedding (t-SNE)

*Adapted from the sklearn
documentation*.

t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.

It is highly recommended to use another dimensionality reduction method to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples. The implemented version first applies PCA with 50 dimensions before calling the function provided by scanpy.

### Uniform manifold approximation and projection (UMAP)

*Adapted from the umap-learn
documentation*.

UMAP is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. The first phase consists of constructing a fuzzy topological representation, based on nearest neighbours. The second phase is simply optimizing the low dimensional representation to have as close a fuzzy topological representation as possible to the full-dimensional data as measured by cross entropy.

The implemented version first applies PCA with 50 dimensions and calculates a nearest-neighbour graph before calling the modified implementation in scanpy.

### densMAP

A modification of UMAP that adds an extra cost term in order to preserve information about the relative local density of the data.

The implemented version first applies PCA with 50 dimensions before calling the function from umap-learn.

**Variants:**

- The (logCPM-normalized, 1000 HVG) expression matrix
- 50 principal components

### Potential of heat-diffusion for affinity-based transition embedding (PHATE)

The five main steps of PHATE are:

- Compute the pairwise distances from the data matrix.
- Transform the distances to affinities to encode local information.
- Learn global relationships via the diffusion process.
- Encode the learned relationships using the potential distance.
- Embed the potential distance information into low dimensions for visualization.

This implementation is from the phate package

**Variants:**

- The square-root CPM transformed expression matrix
- 50 principal components of the logCPM-normalised, 1000 HVG expression matrix

### ivis

ivis is a machine learning library for reducing dimensionality of very large datasets using Siamese Neural Networks.

### NeuralEE

A neural network implementation of elastic embedding implemented in the NeuralEE package.

**Variants:**

- Scaled 500 HVGs from a logged expression matrix (no library size normalization)
- LogCPM-normalised, 1000 HVG expression matrix

### scvis

A neural network generative model that uses the t-SNE objective as a constraint implemented in the scvis package.

## The metrics

### Root mean square error

$$ RMSE = \sqrt{ \sum_{i=1}^{n} \frac{(\hat{y}_i - y_i)^2}{n} } $$

Where $y_i$ is the sum of pairwise euclidean distances between each value embedded in low-dimensional space and $\hat{y_i}$ is the sum of pairwise euclidean distances between each value in the original, high-dimensional space. The goal, in terms of preservation of this space is to minimize the difference between these terms. Finding the root-mean of the square of all differences (Root mean square error or $RMSE$ is a simple way to represent this as a scalar, which can then be used to compare to other methods.

Kruskel’s
stress
uses the RMSE, more or less in the now commonly-used MDS (multi-dimensional scaling). We
can calculate and plot Kruskel’s stress to get an idea where the majority of distortion
of the * topography* of the data in high-dimensional space.

### Trustworthiness

*Adapted from the sklearn
documentation.*

Trustworthiness expresses to what extent the local structure in an embedding is retained. The trustworthiness is within [0, 1]. It is defined as

$$ T(k) = 1 - \frac{2}{nk (2n - 3k - 1)} \sum^n_{i=1} \sum_{j \in \mathcal{N}_{i}^{k}} \max(0, (r(i, j) - k)) $$

where for each sample i, $\mathcal{N}_{i}^{k}$ are its k nearest neighbors in the output space, and every sample j is its $r(i, j)$-th nearest neighbor in the input space. In other words, any unexpected nearest neighbors in the output space are penalised in proportion to their rank in the input space.

References:

- “Neighborhood Preservation in Nonlinear Projection Methods: An Experimental Study” J. Venna, S. Kaski
- “Learning a Parametric Embedding by Preserving Local Structure” L.J.P. van der Maaten

### Density preservation

The local density preservation metric that is part of the cost function for densMAP. Some parts of this are re-implemented as the are not exposed my the umap-learn package.

### NN Ranking

A set of metrics from the pyDRMetrics package. The implementation uses a slightly modified version of the original source code rather than the PyPI package that is now available.

- Continuity - Measures error of hard extrusions
- co-KNN size - Counts how many points are in both k-nearest neighbors before and after the DR
- co-KNN AUC - The area under the co-KNN curve
- Local continuity meta criterion - co-KNN size with baseline removal which favors locality more
- Local property metric - Summary of the local co-KNN
- Global property metric - Summary of the global co-KNN

## Example results

Above is a “complex heatmap”, which aims to show the regions that contribute the most stress. You can see that while a majority of the stress comes from the left side of the plot (as shown by the top of the complex heat map), the center of that left set of clusters does not contribute much to the stress, leading us to believe that by the measure of RMSE, the topology is relatively well-preserved. The stress mostly comes from the clusters at the top and bottom of that group of clusters spread across the second PC.

We performed principle component analysis, obtaining the first 50 components. We can then calculate the relative stress using the RMSE for each, in comparison to the original data, $y$. As one might suspect, the more components used, the lower the amount of distortion of the original data.

We can make this comparison across multiple dimensionality reduction methods. We can see
that *t*-SNE seems to distort the data the least, in terms of pairwise euclidean
distances. This does not necessarily mean the data is best represented by *t*-SNE,
however. There are multiple means of measuring the “goodness” of a dimensionality
reduction; RMSE is simply one of them.

Dataset | Best Method | Paper | Code |
---|---|---|---|

5k Peripheral blood mononuclear cells | densMAP PCA (logCPM, 1kHVG) | v0.5.3 | |

Mouse hematopoietic stem cell differentiation | densMAP (logCPM, 1kHVG) | v0.5.3 | |

Mouse myeloid lineage differentiation | NeuralEE (CPU) (logCPM, 1kHVG) | v0.1.6 |