Dimensionality reduction is one of the key challenges in single-cell data representation. Routine single-cell RNA sequencing (scRNA-seq) experiments measure cells in roughly 20,000-30,000 dimensions (i.e., features: mostly gene transcripts, but also other functional elements encoded in mRNA, such as lncRNAs). Since its inception, scRNA-seq has grown steadily in the number of cells measured per experiment. Originally, cutting-edge SmartSeq experiments would yield a few hundred cells at best; now it is not uncommon to see experiments that yield over 100,000 cells or even more than 1 million cells.
Each feature in a dataset functions as a single dimension. While each of the ~30,000 dimensions measured in each cell (setting aside the roughly 75-90% data dropout per cell, another issue entirely) likely contributes to some underlying structure, that structure is diluted by the “curse of dimensionality”. In short, it is difficult to visualize the contribution of each individual gene in a way that makes sense to the human eye, i.e., in two or at most three dimensions. We therefore need to reduce the dimensionality of the data for visualization and interpretation.
Adapted from the sklearn documentation.
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.
It uses the scipy.sparse.linalg ARPACK implementation of the truncated SVD, as provided by scanpy.
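A minimal sketch of how this step can be run through scanpy is shown below; the example dataset and preprocessing calls are illustrative assumptions rather than the exact benchmark pipeline.

```python
import scanpy as sc

# Illustrative data and preprocessing (assumptions, not the benchmark's exact pipeline).
adata = sc.datasets.pbmc3k()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Centered (not scaled) PCA using the ARPACK truncated SVD solver.
sc.pp.pca(adata, n_comps=50, svd_solver="arpack", zero_center=True)
print(adata.obsm["X_pca"].shape)  # (n_cells, 50)
```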
Adapted from the sklearn documentation.
t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
It is highly recommended to use another dimensionality reduction method to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples. The implemented version first applies PCA with 50 dimensions before calling the function provided by scanpy.
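A hedged sketch of that two-stage setup with scanpy, reusing the preprocessed `adata` object from the PCA sketch above (the perplexity and random seed are assumptions):

```python
import scanpy as sc

# Reduce to 50 principal components first, then run t-SNE on that representation.
sc.pp.pca(adata, n_comps=50, svd_solver="arpack")
sc.tl.tsne(adata, use_rep="X_pca", perplexity=30, random_state=0)

# The two-dimensional embedding is stored in adata.obsm["X_tsne"].
```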
Adapted from the umap-learn documentation.
UMAP is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. The first phase consists of constructing a fuzzy topological representation, based on nearest neighbours. The second phase is simply optimizing the low dimensional representation to have as close a fuzzy topological representation as possible to the full-dimensional data as measured by cross entropy.
The implemented version first applies PCA with 50 dimensions and calculates a nearest-neighbour graph before calling the modified implementation in scanpy.
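A sketch of that pipeline with scanpy might look as follows, again reusing the preprocessed `adata`; the neighbour count and `min_dist` values are assumptions:

```python
import scanpy as sc

# PCA, then a k-nearest-neighbour graph, then UMAP on that graph.
sc.pp.pca(adata, n_comps=50, svd_solver="arpack")
sc.pp.neighbors(adata, n_neighbors=15, use_rep="X_pca")
sc.tl.umap(adata, min_dist=0.5)

# The embedding is stored in adata.obsm["X_umap"].
```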
A modification of UMAP that adds an extra cost term in order to preserve information about the relative local density of the data.
The implemented version first applies PCA with 50 dimensions before calling the function from umap-learn.
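A hedged sketch using umap-learn directly; the cells-by-genes matrix `X` and the parameter values are assumptions:

```python
import umap
from sklearn.decomposition import PCA

# PCA to 50 dimensions, then UMAP with the density-preserving objective enabled.
X_pca = PCA(n_components=50).fit_transform(X)  # X: cells x genes matrix (assumed)
embedding = umap.UMAP(densmap=True, random_state=0).fit_transform(X_pca)
```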
Variants:
The five main steps of PHATE are:
This implementation is from the phate package.
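A minimal usage sketch with the phate package; the input matrix `X` and the parameter values are assumptions:

```python
import phate

# Run PHATE directly on a cells x genes matrix X (assumed to be preprocessed).
phate_op = phate.PHATE(n_components=2, random_state=0)
Y_phate = phate_op.fit_transform(X)
```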
Variants:
ivis is a machine learning library for reducing dimensionality of very large datasets using Siamese Neural Networks.
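A hedged sketch of running ivis on a PCA-reduced matrix; the parameter names and values here are assumptions about the ivis API rather than the benchmark's exact settings:

```python
from ivis import Ivis
from sklearn.decomposition import PCA

# Train the siamese network on a PCA-reduced matrix (X: cells x genes, assumed).
X_pca = PCA(n_components=50).fit_transform(X)
model = Ivis(embedding_dims=2, k=15)
embedding = model.fit_transform(X_pca)
```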
A neural network implementation of elastic embedding, provided by the NeuralEE package.
Variants:
A neural network generative model that uses the t-SNE objective as a constraint, implemented in the scvis package.
$$ RMSE = \sqrt{ \sum_{i=1}^{n} \frac{(\hat{y}_i - y_i)^2}{n} } $$
Where $y_i$ is the sum of pairwise Euclidean distances from each value embedded in the low-dimensional space and $\hat{y}_i$ is the corresponding sum of pairwise Euclidean distances in the original, high-dimensional space. The goal, in terms of preserving this space, is to minimize the difference between these terms. Taking the root of the mean of the squared differences (root mean square error, or $RMSE$) is a simple way to represent this as a scalar, which can then be used to compare methods.
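One way to read that definition in code is the following hedged sketch, where the helper name `distance_rmse` and the per-cell distance sums are our own interpretation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def distance_rmse(X, Z):
    """RMSE between per-cell sums of pairwise Euclidean distances in the
    original space X and the embedding Z (an interpretation of the text above)."""
    y_hat = squareform(pdist(X, metric="euclidean")).sum(axis=1)  # high-dimensional
    y = squareform(pdist(Z, metric="euclidean")).sum(axis=1)      # low-dimensional
    return np.sqrt(np.mean((y_hat - y) ** 2))
```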
Kruskal’s stress uses the RMSE, more or less, as in the now commonly used MDS (multi-dimensional scaling). We can calculate and plot Kruskal’s stress to get an idea of where the majority of the distortion of the data’s high-dimensional topography occurs.
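A sketch of Kruskal’s stress-1 under its standard definition on pairwise Euclidean distances (the helper name is hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist

def kruskal_stress(X, Z):
    # Stress-1: sqrt( sum (d_high - d_low)^2 / sum d_high^2 )
    d_high = pdist(X, metric="euclidean")
    d_low = pdist(Z, metric="euclidean")
    return np.sqrt(np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2))
```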
Adapted from the sklearn documentation.
Trustworthiness expresses to what extent the local structure in an embedding is retained. The trustworthiness is within [0, 1]. It is defined as
$$ T(k) = 1 - \frac{2}{nk (2n - 3k - 1)} \sum^n_{i=1} \sum_{j \in \mathcal{N}_{i}^{k}} \max(0, (r(i, j) - k)) $$
where for each sample i, $\mathcal{N}_{i}^{k}$ are its k nearest neighbors in the output space, and every sample j is its $r(i, j)$-th nearest neighbor in the input space. In other words, any unexpected nearest neighbors in the output space are penalised in proportion to their rank in the input space.
References:
* “Neighborhood Preservation in Nonlinear Projection Methods: An Experimental Study”, J. Venna, S. Kaski
* “Learning a Parametric Embedding by Preserving Local Structure”, L.J.P. van der Maaten
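scikit-learn exposes this metric directly; a minimal usage sketch, assuming `X` is the (optionally PCA-reduced) input matrix and `Z` a two-dimensional embedding from one of the methods above:

```python
from sklearn.manifold import trustworthiness

# Value in [0, 1]; higher means local neighbourhoods are better preserved.
score = trustworthiness(X, Z, n_neighbors=15)
```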
The local density preservation metric that is part of the cost function for densMAP. Some parts of this are re-implemented, as they are not exposed by the umap-learn package.
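A hedged sketch of how such a score can be obtained: umap-learn can return the local radii it estimates in the original and embedded spaces, and their Pearson correlation can serve as the density-preservation score (the re-implemented pieces in the benchmark may differ):

```python
import numpy as np
import umap

# With output_dens=True, fit_transform returns the embedding plus local radii
# in the original and embedded spaces (X: cells x genes matrix, assumed).
embedding, r_orig, r_emb = umap.UMAP(densmap=True, output_dens=True).fit_transform(X)
density_preservation = np.corrcoef(r_orig, r_emb)[0, 1]
```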
A set of metrics from the pyDRMetrics package. The implementation uses a slightly modified version of the original source code rather than the PyPI package that is now available.
Above is a “complex heatmap”, which aims to show the regions that contribute the most stress. While the majority of the stress comes from the left side of the plot (as shown by the top annotation of the complex heatmap), the center of that left set of clusters contributes little to the stress, suggesting that, by the measure of RMSE, the topology there is relatively well preserved. The stress mostly comes from the clusters at the top and bottom of that group, spread across the second PC.
We performed principal component analysis, obtaining the first 50 components. We can then calculate the relative stress, using the RMSE, for each number of components in comparison to the original data, $y$. As one might suspect, the more components used, the less the original data is distorted.
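A sketch of that sweep, reusing the hypothetical `distance_rmse` helper defined earlier:

```python
from sklearn.decomposition import PCA

# RMSE of distance sums for embeddings built from the first k components, k = 1..50.
X_pca = PCA(n_components=50).fit_transform(X)  # X: cells x genes matrix (assumed)
stress_by_k = [distance_rmse(X, X_pca[:, :k]) for k in range(1, 51)]
```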
We can make this comparison across multiple dimensionality reduction methods. t-SNE appears to distort the data the least in terms of pairwise Euclidean distances. This does not necessarily mean that the data is best represented by t-SNE, however; there are multiple ways of measuring the “goodness” of a dimensionality reduction, and RMSE is only one of them.