Denoising

Removing noise in sparse single-cell RNA-sequencing count data

3 datasets · 11 methods · 2 control methods · 2 metrics

Single-cell RNA-Seq protocols only detect a fraction of the mRNA molecules present in each cell. As a result, the measurements (UMI counts) observed for each gene and each cell are associated with generally high levels of technical noise (Grün et al., 2014). Denoising describes the task of estimating the true expression level of each gene in each cell. In the single-cell literature, this task is also referred to as imputation, a term which is typically used for missing data problems in statistics. Similar to the use of the terms “dropout”, “missing data”, and “technical zeros”, this terminology can create confusion about the underlying measurement process (Sarkar and Stephens, 2021).

A key challenge in evaluating denoising methods is the general lack of a ground truth. A recent benchmark study (Hou et al., 2020) relied on flow-sorted datasets, mixture control experiments (Tian et al., 2019), and comparisons with bulk RNA-Seq data. Since each of these approaches suffers from specific limitations, it is difficult to combine these different approaches into a single quantitative measure of denoising accuracy. Here, we instead rely on an approach termed molecular cross-validation (MCV), which was specifically developed to quantify denoising accuracy in the absence of a ground truth (Batson et al., 2019). In MCV, the observed molecules in a given scRNA-Seq dataset are first partitioned between a training and a test dataset. Next, a denoising method is applied to the training dataset. Finally, denoising accuracy is measured by comparing the result to the test dataset. The authors show that both in theory and in practice, the measured denoising accuracy is representative of the accuracy that would be obtained on a ground truth dataset.
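
The partitioning step of MCV can be illustrated with a minimal sketch in which every observed molecule is assigned to the training set with a fixed probability, which amounts to binomial thinning of each UMI count. The function name `splitMolecules`, the default training fraction of 0.9, and the toy count matrix are assumptions made for illustration; they are not taken from the benchmark code.

```javascript
// Split every observed molecule into the train or test set independently.
// counts: array of cells, each an array of non-negative integer UMI counts.
// trainFraction and rng are illustrative parameters (assumptions).
function splitMolecules(counts, trainFraction = 0.9, rng = Math.random) {
  const train = [];
  const test = [];
  for (const row of counts) {
    const trainRow = [];
    const testRow = [];
    for (const n of row) {
      // One Bernoulli draw per molecule: binomial thinning of the count n.
      let kept = 0;
      for (let i = 0; i < n; i++) {
        if (rng() < trainFraction) kept++;
      }
      trainRow.push(kept);
      testRow.push(n - kept);
    }
    train.push(trainRow);
    test.push(testRow);
  }
  return { train, test };
}

// Toy example: 2 cells x 3 genes.
const counts = [[3, 0, 12], [0, 5, 1]];
const { train, test } = splitMolecules(counts, 0.9);
```

By construction, the training and test counts of each entry sum to the observed count, so no molecule is used for both denoising and evaluation.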

Summary

poss_dataset_ids = dataset_info
  .map(d => d.dataset_id)
  .filter(d => results.map(r => r.dataset_id).includes(d))
poss_method_ids = method_info
  .map(d => d.method_id)
  .filter(d => results.map(r => r.method_id).includes(d))
poss_metric_ids = metric_info
  .map(d => d.metric_id)
  .filter(d => results.map(r => Object.keys(r.scaled_scores)).flat().includes(d))

results_long = results.flatMap(d => {
  return Object.entries(d.scaled_scores).map(([metric_id, value]) =>
    ({
      method_id: d.method_id,
      dataset_id: d.dataset_id,
      metric_id: metric_id,
      score: value
    })
  )
}).filter(d => method_ids.includes(d.method_id) && metric_ids.includes(d.metric_id) && dataset_ids.includes(d.dataset_id))

results_resources = results.flatMap(d => {
  return ({
    method_id: d.method_id,
    dataset_id: d.dataset_id,
    ...d.resources
  })
})

function label_time(time) {
  if (time < 1e-5) return "0s";
  if (time < 1) return "<1s";
  if (time < 60) return `${Math.floor(time)}s`;
  if (time < 3600) return `${Math.floor(time / 60)}m`;
  if (time < 3600 * 24) return `${Math.floor(time / 3600)}h`;
  if (time < 3600 * 24 * 7) return `${Math.floor(time / 3600 / 24)}d`;
  return ">7d"; // Also reached when time is NaN (missing), since every comparison above is then false
}

function label_memory(x_mb, include_mb = true) {
  if (!include_mb && x_mb < 1e3) return "<1G";
  if (x_mb < 1) return "<1M";
  if (x_mb < 1e3) return `${Math.round(x_mb)}M`;
  if (x_mb < 1e6) return `${Math.round(x_mb / 1e3)}G`;
  if (x_mb < 1e9) return `${Math.round(x_mb / 1e6)}T`;
  return ">1P";
}

function aggregate_scores(obj) {
  return d3.mean(obj.map(val => {
    if (val.score === undefined || isNaN(val.score)) return 0;
    return Math.min(1, Math.max(0, val.score))
  }));
}

function mean_na_rm(x) {
  return d3.mean(x.filter(d => !isNaN(d)));
}

function transpose_list_of_objects(list) {
  return Object.fromEntries(Object.keys(list[0]).map(key => [key, list.map(d => d[key])]))
}

overall = d3.groups(results_long, d => d.method_id)
  .map(([method_id, values]) => ({method_id, mean_score: aggregate_scores(values)}))

per_dataset = d3.groups(results_long, d => d.method_id)
  .map(([method_id, values]) => {
    const datasets = d3.groups(values, d => d.dataset_id)
      .map(([dataset_id, values]) => ({["dataset_" + dataset_id]: aggregate_scores(values)}))
      .reduce((a, b) => ({...a, ...b}), {})
    return {method_id, ...datasets}
  })

per_metric = d3.groups(results_long, d => d.method_id)
  .map(([method_id, values]) => {
    const metrics = d3.groups(values, d => d.metric_id)
      .map(([metric_id, values]) => ({["metric_" + metric_id]: aggregate_scores(values)}))
      .reduce((a, b) => ({...a, ...b}), {})
    return {method_id, ...metrics}
  })

resources = d3.groups(results_resources, d => d.method_id)
  .map(([method_id, values]) => {
    // Exit code 137 (128 + SIGKILL) is treated as out of memory, 143 (128 + SIGTERM) as a timeout
    const error_pct_oom = d3.mean(values, d => d.exit_code === 137)
    const error_pct_timeout = d3.mean(values, d => d.exit_code === 143)
    const error_pct_error = d3.mean(values, d => d.exit_code > 0) - error_pct_oom - error_pct_timeout
    const error_pct_ok = 1 - error_pct_oom - error_pct_timeout - error_pct_error
    const mean_peak_memory_mb = mean_na_rm(values.map(d => d.peak_memory_mb))
    const mean_disk_read_mb = mean_na_rm(values.map(d => d.disk_read_mb))
    const mean_disk_write_mb = mean_na_rm(values.map(d => d.disk_write_mb))
    const mean_duration_sec = mean_na_rm(values.map(d => d.duration_sec))
    return ({
      method_id,
      error_pct_error,
      error_pct_oom,
      error_pct_timeout,
      error_pct_ok,
      // error_reason: {
      //   "Memory limit exceeded": error_pct_oom,
      //   "Time limit exceeded": error_pct_timeout,
      //   "Execution error": error_pct_error,
      //   "No error": error_pct_ok
      // },
      error_reason: [error_pct_oom, error_pct_timeout, error_pct_error, error_pct_ok],
      mean_cpu_pct: mean_na_rm(values.map(d => d.cpu_pct)),
      mean_peak_memory_mb,
      mean_peak_memory_log: -Math.log10(mean_peak_memory_mb),
      mean_peak_memory_str: " " + label_memory(mean_peak_memory_mb) + " ",
      mean_disk_read_mb,
      mean_disk_read_log: -Math.log10(mean_disk_read_mb),
      mean_disk_read_str: " " + label_memory(mean_disk_read_mb) + " ",
      mean_disk_write_mb,
      mean_disk_write_log: -Math.log10(mean_disk_write_mb),
      mean_disk_write_str: " " + label_memory(mean_disk_write_mb) + " ",
      mean_duration_sec,
      mean_duration_log: -Math.log10(mean_duration_sec),
      mean_duration_str: " " + label_time(mean_duration_sec) + " "
    })
  })

summary_all = method_info
  .filter(d => show_con || !d.is_baseline)
  .filter(d => method_ids.includes(d.method_id))
  .map(method => {
    const method_id = method.method_id
    const method_name = method.method_name
    const mean_score = overall.find(d => d.method_id === method_id).mean_score
    const datasets = per_dataset.find(d => d.method_id === method_id)
    const metrics = per_metric.find(d => d.method_id === method_id)
    const resources_ = resources.find(d => d.method_id === method_id)
    return {method_id, method_name, mean_score, ...datasets, ...metrics, ...resources_}
  })
  .sort((a, b) => b.mean_score - a.mean_score)

// make sure the first entry contains all columns
column_info = [
  {id: "method_name", name: "Name", label: null, group: "method", geom: "text", palette: null},
  {id: "mean_score", name: "Score", group: "overall", geom: "bar", palette: "overall"},
  {id: "error_reason", name: "Error reason", group: "overall", geom: "pie", palette: "error_reason"},
  ...dataset_info
    .filter(d => dataset_ids.includes(d.dataset_id)).map(d => ({id: "dataset_" + d.dataset_id, name: d.dataset_name, group: "dataset", geom: "funkyrect", palette: "dataset"}))
    .sort((a, b) => a.name.localeCompare(b.name)),
  ...metric_info
    .filter(d => metric_ids.includes(d.metric_id)).map(d => ({id: "metric_" + d.metric_id, name: d.metric_name, group: "metric", geom: "funkyrect", palette: "metric"}))
    .sort((a, b) => a.name.localeCompare(b.name)),
  {id: "mean_cpu_pct", name: "%CPU", group: "resources", geom: "funkyrect", palette: "resources"},
  {id: "mean_peak_memory_log", name: "Peak memory", label: "mean_peak_memory_str", group: "resources", geom: "rect", palette: "resources"},
  {id: "mean_disk_read_log", name: "Disk read", label: "mean_disk_read_str", group: "resources", geom: "rect", palette: "resources"},
  {id: "mean_disk_write_log", name: "Disk write", label: "mean_disk_write_str", group: "resources", geom: "rect", palette: "resources"},
  {id: "mean_duration_log", name: "Duration", label: "mean_duration_str", group: "resources", geom: "rect", palette: "resources"}
].map(d => {
  if (d.id === "method_name") {
    return {...d, options: {width: 15, hjust: 0}}
  } else if (d.id === "is_baseline") {
    return {...d, options: {width: 1}}
  } else if (d.geom === "bar") {
    return {...d, options: {width: 4}}
  } else {
    return d
  }
})

column_groups = [
  {group: "method", palette: null, level1: ""},
  {group: "overall", palette: "overall", level1: "Overall"},
  {group: "error_reason", palette: "error_reason", level1: "Error reason"},
  {group: "dataset", palette: "dataset", level1: dataset_info.length >= 3 ? "Datasets" : ""},
  {group: "metric", palette: "metric", level1: metric_info.length >= 3 ? "Metrics" : ""},
  {group: "resources", palette: "resources", level1: "Resources"}
]

palettes = [
  {
    overall: "Greys",
    dataset: "Blues",
    metric: "Reds",
    resources: "YlOrBr",
    error_reason: {
      colors: ["#8DD3C7", "#FFFFB3", "#BEBADA", "#FFFFFF"],
      names: ["Memory limit exceeded", "Time limit exceeded", "Execution error", "No error"]
    }
  }
][0]

funkyheatmap(
    transpose_list_of_objects(summary_all),
    transpose_list_of_objects(column_info),
    [],
    transpose_list_of_objects(column_groups),
    [],
    palettes,
    {
        fontSize: 14,
        rowHeight: 26,
        rootStyle: 'max-width: none',
        colorByRank: color_by_rank,
        theme: {
            oddRowBackground: 'var(--bs-body-bg)',
            evenRowBackground: 'var(--bs-button-hover)',
            textColor: 'var(--bs-body-color)',
            strokeColor: 'var(--bs-body-color)',
            headerColor: 'var(--bs-body-color)',
            hoverColor: 'var(--bs-body-color)'
        }
    },
    scale_column
);

Figure 1: Overview of the results per method. This figure shows the mean of the scaled scores (group Overall), the mean scores per dataset (group Dataset) and the mean scores per metric (group Metric).

Display settings

viewof color_by_rank = Inputs.toggle({label: "Color by rank:", value: true})
viewof scale_column = Inputs.toggle({label: "Minmax column:", value: false})
viewof show_con = Inputs.toggle({label: "Show control methods:", value: true})

Filter datasets

viewof dataset_ids = Inputs.checkbox(
  dataset_info.filter(d => poss_dataset_ids.includes(d.dataset_id)),
  {
    keyof: d => d.dataset_name,
    valueof: d => d.dataset_id,
    value: dataset_info.map(d => d.dataset_id),
    label: "Datasets:"
  }
)

Filter methods

viewof method_ids = Inputs.checkbox(
  method_info.filter(d => poss_method_ids.includes(d.method_id)),
  {
    keyof: d => d.method_name,
    valueof: d => d.method_id,
    value: method_info.map(d => d.method_id),
    label: "Methods:"
  }
)

Filter metrics

viewof metric_ids = Inputs.checkbox(
  metric_info.filter(d => poss_metric_ids.includes(d.metric_id)),
  {
    keyof: d => d.metric_name,
    valueof: d => d.metric_id,
    value: metric_info.map(d => d.metric_id),
    label: "Metrics:"
  }
)

funkyheatmap = (await require('d3@7').then(d3 => {
  window.d3 = d3;
  window._ = _;
  return import('https://unpkg.com/funkyheatmapjs@0.2.5');
})).default;

Results

Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.

Dataset info

Pancreas (inDrop)

Human pancreatic islet scRNA-seq data from 6 datasets across technologies (CEL-seq, CEL-seq2, Smart-seq2, inDrop, Fluidigm C1, and SMARTER-seq). Here we use only the inDrop1 batch, which includes 1937 cells × 15502 genes (Luecken et al. 2021).

1k Peripheral blood mononuclear cells

1k Peripheral Blood Mononuclear Cells (PBMCs) from a healthy donor. Sequenced on 10X v3 chemistry in November 2018 by 10X Genomics (10x Genomics 2018).

Tabula Muris Senis Lung

All lung cells from Tabula Muris Senis, a 500k cell-atlas from 18 organs and tissues across the mouse lifespan. Here we use just 10x data from lung. 24540 cells × 16160 genes across 3 time points (Tabula Muris Consortium 2020).

Method info

ALRA (log norm)

ALRA (Adaptively-thresholded Low Rank Approximation) is a method for imputation of missing values in single cell RNA-sequencing data. Given a normalised scRNA-seq expression matrix, it first imputes values using rank-k approximation, using singular value decomposition. Next, a symmetric distribution is fitted to the near-zero imputed values for each gene (row) of the matrix. The right “tail” of this distribution is then used to threshold the accepted nonzero entries. This same threshold is then used to rescale the matrix, once the “biological zeros” have been removed (Linderman, Zhao, and Kluger 2018). Links: Docs.

ALRA (log norm, reversed normalization)

ALRA (sqrt norm)

ALRA (sqrt norm, reversed normalization)

DCA

DCA (Deep Count Autoencoder) is a method to remove the effect of dropout in scRNA-seq data. DCA takes into account the count structure, overdispersed nature and sparsity of scRNA-seq datatypes using a deep autoencoder with a zero-inflated negative binomial (ZINB) loss. The autoencoder is then applied to the dataset, where the mean of the fitted negative binomial distributions is used to fill each entry of the imputed matrix (Eraslan et al. 2019). Links: Docs.

KNN smoothing

KNN-smoothing is a method for denoising data based on the k-nearest neighbours. Given a normalised scRNA-seq matrix, KNN-smoothing calculates a k-nearest neighbour matrix using Euclidean distances between cell pairs. Each cell’s denoised expression is then defined as the average expression of its neighbours (Open Problems for Single Cell Analysis Consortium 2022). Links: Docs.
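
The description above can be sketched in a few lines of plain JavaScript. This is a brute-force illustration on a tiny dense matrix, not the benchmarked implementation; the names `sqDist` and `knnSmooth` are hypothetical, and counting a cell as its own nearest neighbour is an assumption.

```javascript
// Squared Euclidean distance between two expression vectors of equal length.
function sqDist(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += (a[i] - b[i]) ** 2;
  return s;
}

// Replace each cell by the mean expression of its k nearest cells.
// The cell itself is included (distance 0), which is an assumption of this sketch.
function knnSmooth(X, k) {
  return X.map((cell) => {
    const neighbours = X
      .map((other, j) => ({ j, d: sqDist(cell, other) }))
      .sort((p, q) => p.d - q.d)
      .slice(0, k)
      .map((p) => X[p.j]);
    // Average each gene over the selected neighbours.
    return cell.map((_, g) =>
      neighbours.reduce((sum, nb) => sum + nb[g], 0) / neighbours.length
    );
  });
}

// Toy example: 4 cells x 2 genes forming two well-separated pairs.
const X = [[0, 0], [2, 0], [10, 10], [12, 10]];
const smoothed = knnSmooth(X, 2);
// smoothed is [[1, 0], [1, 0], [11, 10], [11, 10]]
```

With k = 2, each cell is averaged with its closest partner, so the two cells in each pair collapse onto their shared mean.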

Iterative KNN smoothing

Iterative kNN-smoothing is a method to repair or denoise noisy scRNA-seq expression matrices. Given a scRNA-seq expression matrix, KNN-smoothing first applies initial normalisation and smoothing. Then, a chosen number of principal components is used to calculate Euclidean distances between cells. Minimally sized neighbourhoods are initially determined from these Euclidean distances, and expression profiles are shared between neighbouring cells. Then, the resultant smoothed matrix is used as input to the next step of smoothing, where the size (k) of the considered neighbourhoods is increased, leading to greater smoothing. This process continues until a chosen maximum k value has been reached, at which point the iteratively smoothed object is then optionally scaled to yield a final result (Wagner, Yan, and Yanai 2018). Links: Docs.

MAGIC

MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputation and denoising of noisy or dropout-prone single cell RNA-sequencing data. Given a normalised scRNA-seq expression matrix, it first calculates Euclidean distances between each pair of cells in the dataset. These distances are then transformed using a Gaussian kernel function and row-normalised to give a normalised affinity matrix. A t-step Markov process is then calculated by raising this affinity matrix to the power t. Finally, the powered affinity matrix is right-multiplied by the normalised data, so that each imputed value is a per-gene average weighted by the affinities of cells. The resultant imputed matrix is then rescaled, to more closely match the magnitude of measurements in the normalised (input) matrix (Dijk et al. 2018). Links: Docs.
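
The diffusion steps described above can be sketched as follows. This is a simplified illustration on a small dense matrix: it uses a fixed kernel bandwidth `sigma` and omits the final rescaling step, both of which are simplifying assumptions, and the names `matmul` and `magicSketch` are hypothetical rather than part of the MAGIC package.

```javascript
// Dense matrix product of A (n x m) and B (m x p).
function matmul(A, B) {
  return A.map((row) =>
    B[0].map((_, j) => row.reduce((s, a, k) => s + a * B[k][j], 0))
  );
}

// X: cells x genes (already normalised). sigma and t are illustrative defaults.
function magicSketch(X, sigma = 1, t = 3) {
  // Gaussian kernel applied to pairwise squared Euclidean distances.
  const K = X.map((a) =>
    X.map((b) => {
      const d2 = a.reduce((s, v, g) => s + (v - b[g]) ** 2, 0);
      return Math.exp(-d2 / (2 * sigma * sigma));
    })
  );
  // Row-normalise so that each row sums to 1 (a Markov transition matrix).
  const M = K.map((row) => {
    const total = row.reduce((s, v) => s + v, 0);
    return row.map((v) => v / total);
  });
  // Raise the transition matrix to the power t.
  let Mt = M;
  for (let i = 1; i < t; i++) Mt = matmul(Mt, M);
  // Each imputed value is an affinity-weighted average over cells.
  return matmul(Mt, X);
}

// Toy example: 4 cells x 2 genes in two groups.
const Xm = [[1, 0], [1.2, 0.1], [8, 9], [8.1, 9.2]];
const imputed = magicSketch(Xm, 1, 3);
```

Because every row of the powered matrix is non-negative and sums to 1, each imputed value is a weighted average of the observed values for that gene and stays within their range.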

MAGIC (approximate)

MAGIC (approximate, reversed normalization)

MAGIC (reversed normalization)

Control method info

No denoising

Denoised outputs are defined from the unmodified input data

Perfect denoising

Denoised outputs are defined from the target data

Metric info

Mean-squared error

The mean squared error between the denoised counts of the training dataset and the true counts of the test dataset after reweighting by the train/test ratio (Batson, Royer, and Webber 2019).

Poisson loss

The Poisson log likelihood of observing the true counts of the test dataset under the distribution specified by the denoised dataset (Batson, Royer, and Webber 2019).
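
Both metrics can be sketched as below. This is a hedged illustration, not the benchmark's implementation: it assumes that "reweighting" means scaling the denoised values by the ratio of total test counts to total training counts, it expresses the Poisson metric as a negative log likelihood with the constant log(x!) term dropped, and the small `eps` guarding the logarithm is an arbitrary choice. The actual metric code may apply further normalisation or transformation. All function names are hypothetical.

```javascript
// Sum of all entries of a matrix given as an array of rows.
function sum2d(M) {
  return M.reduce((s, row) => s + row.reduce((a, v) => a + v, 0), 0);
}

// Rescale denoised values to the depth of the test split (assumed reweighting).
function rescale(denoised, trainCounts, testCounts) {
  const scale = sum2d(testCounts) / sum2d(trainCounts);
  return denoised.map((row) => row.map((v) => v * scale));
}

// Mean squared error over all entries.
function mse(pred, target) {
  let total = 0, n = 0;
  pred.forEach((row, i) => row.forEach((v, j) => {
    total += (v - target[i][j]) ** 2;
    n++;
  }));
  return total / n;
}

// Mean Poisson negative log likelihood, omitting the log(x!) term,
// which does not depend on the prediction.
function poissonNll(pred, target, eps = 1e-6) {
  let total = 0, n = 0;
  pred.forEach((row, i) => row.forEach((mu, j) => {
    total += mu - target[i][j] * Math.log(mu + eps);
    n++;
  }));
  return total / n;
}

// Toy example: the "denoised" matrix is simply the training counts.
const trainCounts = [[1, 2], [3, 4]];
const testCounts = [[0, 1], [1, 0]];
const rescaled = rescale(trainCounts, trainCounts, testCounts);
const mseScore = mse(rescaled, testCounts);
const poissonScore = poissonNll(rescaled, testCounts);
```

For both quantities, lower values indicate that the denoised training data predict the held-out test counts more closely.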

Quality control results

Category	Name	Value	Condition	Severity
Scaling	Worst score knn_smoothing poisson	-10.298315	worst_score >= -1	✗✗✗
Scaling	Worst score alra_sqrt poisson	-2.301203	worst_score >= -1	✗✗

Normalisation visualisation

References

10x Genomics. 2018. “1k PBMCs from a Healthy Donor (V3 Chemistry).” https://www.10xgenomics.com/resources/datasets/1-k-pbm-cs-from-a-healthy-donor-v-3-chemistry-3-standard-3-0-0.

Batson, Joshua, Loïc Royer, and James Webber. 2019. “Molecular Cross-Validation for Single-Cell RNA-Seq.” bioRxiv. https://doi.org/10.1101/786269.

Dijk, David van, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Ambrose J. Carr, Cassandra Burdziak, et al. 2018. “Recovering Gene Interactions from Single-Cell Data Using Data Diffusion.” Cell 174 (3): 716–729.e27. https://doi.org/10.1016/j.cell.2018.05.061.

Eraslan, Gökcen, Lukas M. Simon, Maria Mircea, Nikola S. Mueller, and Fabian J. Theis. 2019. “Single-Cell RNA-Seq Denoising Using a Deep Count Autoencoder.” Nature Communications 10 (1). https://doi.org/10.1038/s41467-018-07931-2.

Linderman, George C., Jun Zhao, and Yuval Kluger. 2018. “Zero-Preserving Imputation of scRNA-Seq Data Using Low-Rank Approximation.” bioRxiv. https://doi.org/10.1101/397588.

Luecken, Malte D., M. Büttner, K. Chaichoompu, A. Danese, M. Interlandi, M. F. Mueller, D. C. Strobl, et al. 2021. “Benchmarking Atlas-Level Data Integration in Single-Cell Genomics.” Nature Methods 19 (1): 41–50. https://doi.org/10.1038/s41592-021-01336-8.

Open Problems for Single Cell Analysis Consortium. 2022. “Open Problems.” https://openproblems.bio.

Tabula Muris Consortium. 2020. “A Single-Cell Transcriptomic Atlas Characterizes Ageing Tissues in the Mouse.” Nature 583 (7817): 590–95. https://doi.org/10.1038/s41586-020-2496-1.

Wagner, Florian, Yun Yan, and Itai Yanai. 2018. “K-Nearest Neighbor Smoothing for High-Throughput Single-Cell RNA-Seq Data.” bioRxiv. https://doi.org/10.1101/217737.