Dimensionality reduction for visualisation

The task

Dimensionality reduction is one of the key challenges in single-cell data representation. Routine single-cell RNA sequencing (scRNA-seq) experiments measure cells in roughly 20,000-30,000 dimensions (i.e., features: mostly gene transcripts, but also other functional elements encoded in mRNA, such as lncRNAs). Since their inception, scRNA-seq experiments have grown steadily in the number of cells measured. Originally, cutting-edge SmartSeq experiments would yield a few hundred cells at best. Now it is not uncommon to see experiments that yield over 100,000 cells, or even more than 1 million cells.

Each feature in a dataset functions as a single dimension. While each of the ~30,000 dimensions measured in each cell contributes to an underlying data structure, the overall structure of the data is challenging to display in a few dimensions because of data sparsity and the “curse of dimensionality” (distances in high-dimensional data don’t distinguish data points well). Thus, we need a way to reduce the dimensionality of the data for visualization and interpretation.
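
The remark that high-dimensional distances distinguish points poorly can be illustrated with a small sketch. It assumes uniformly random points rather than real scRNA-seq counts, and the helper name relative_contrast is hypothetical. It measures how much the farthest neighbor of a query point differs from the nearest one, relative to the nearest distance.

```python
import numpy as np


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max - min) / min over distances from one query point to the rest.

    Values near 0 mean that all points look almost equally far away.
    """
    rng = np.random.default_rng(seed)
    # Assumption: points drawn uniformly from the unit hypercube (toy data).
    points = rng.random((n_points, n_dims))
    query = points[0]
    dists = np.linalg.norm(points[1:] - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


# Contrast for the same number of points in increasingly many dimensions.
for n_dims in (2, 20, 200, 2000, 20000):
    print(n_dims, relative_contrast(200, n_dims))
```

On this toy data the contrast shrinks as the number of dimensions grows, which is the sense in which raw distances become less informative.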

The metrics

  • Distance correlation: the Spearman correlation between ground truth distances in the high-dimensional data and Euclidean distances in the dimension-reduced data. Because it is rank-based, it is invariant to scalar multiplication of the embedding. Distance correlation computes the high-dimensional distances in Euclidean space, while Distance correlation (spectral) computes diffusion distances (i.e., Euclidean distances on the Laplacian Eigenmap).
  • Trustworthiness: a measure of similarity between the ranks of each point’s nearest neighbors in the high-dimensional data and in the reduced data (Venna & Kaski, 2001).
  • Density preservation: similarity between local densities in the high-dimensional data and the reduced data (Narayan, Berger & Cho, 2020).
  • NN Ranking: a set of metrics from pyDRMetrics relating to the preservation of nearest neighbors in the high-dimensional data and the reduced data.
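
As a rough illustration of the first metric, the Euclidean variant of distance correlation can be sketched as below. This is a minimal sketch, not the benchmark's own implementation: the function name distance_correlation, the toy data, and the use of PCA via SVD as the embedding are all assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def distance_correlation(high_dim: np.ndarray, low_dim: np.ndarray) -> float:
    """Spearman correlation between pairwise Euclidean distances in two spaces."""
    d_high = pdist(high_dim, metric="euclidean")  # condensed pairwise distances
    d_low = pdist(low_dim, metric="euclidean")
    rho, _ = spearmanr(d_high, d_low)
    return float(rho)


rng = np.random.default_rng(0)

# Toy "high-dimensional" data (hypothetical): 3 latent factors spread over 50 features.
latent = rng.normal(size=(100, 3))
loadings = rng.normal(size=(3, 50))
X = latent @ loadings + 0.1 * rng.normal(size=(100, 50))

# A simple 2D embedding: projection onto the first two principal components.
centered = X - X.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
embedding = centered @ components[:2].T

# An embedding unrelated to X, for comparison.
random_embedding = rng.normal(size=(100, 2))

print("PCA embedding:       ", distance_correlation(X, embedding))
print("PCA embedding x 10:  ", distance_correlation(X, 10.0 * embedding))
print("unrelated embedding: ", distance_correlation(X, random_embedding))
```

Rescaling the embedding leaves the score unchanged because only the ranks of the distances enter the Spearman correlation. For the trustworthiness metric, scikit-learn provides sklearn.manifold.trustworthiness(X, X_embedded, n_neighbors=5), which could be applied to the same pair of arrays.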