# Batch integration feature

Removing batch effects while preserving biological variation (feature output)

3 datasets · 18 methods · 7 control methods · 11 metrics

This is a sub-task of the overall batch integration task. Batch (or data) integration combines datasets across batches that arise from various biological and technical sources. Methods that integrate batches typically produce one or more of three types of output: a corrected feature matrix, a joint embedding across batches, or an integrated cell-cell similarity graph (e.g., a kNN graph). This sub-task covers all methods that can output feature matrices. Other sub-tasks for batch integration cover:

- graphs, and
- embeddings

This sub-task was taken from a benchmarking study of data integration methods.


## Metrics

- **ARI** (Luecken et al. 2021): The Adjusted Rand Index (ARI) compares the overlap of two clusterings. It counts both correct overlaps and correct disagreements between the two clusterings.
- **Cell Cycle Score** (Luecken et al. 2021): The cell-cycle conservation score evaluates how well the cell-cycle effect can be captured before and after integration.
- **Graph connectivity** (Luecken et al. 2021): The graph connectivity metric assesses whether the kNN graph representation, G, of the integrated data connects all cells with the same cell identity label.
- **HVG conservation** (Luecken et al. 2021): This metric computes the average percentage of overlapping highly variable genes per batch before and after integration.
- **Isolated label F1** (Luecken et al. 2021): Isolated cell labels are the labels present in the fewest batches in the integration task. The score evaluates how well these isolated labels separate from other cell identities based on clustering.
- **Isolated label Silhouette** (Luecken et al. 2021): This score evaluates the compactness of the label or labels shared by the fewest batches. It indicates how well rare cell types are preserved after integration.
- **kBET** (Büttner et al. 2018): kBET determines whether the label composition of a cell's k-nearest neighborhood is similar to the expected (global) label composition. The test is repeated for a random subset of cells, and the results are summarized as a rejection rate over all tested neighborhoods.
- **NMI** (Luecken et al. 2021): Normalized Mutual Information (NMI) compares the overlap of two clusterings. We used NMI to compare the cell-type labels with Louvain clusters computed on the integrated dataset.
- **PC Regression** (Luecken et al. 2021): This compares the variance explained by batch before and after integration. It returns a score between 0 and 1 (`scaled=True`), with 0 if the variance contribution has not changed. The larger the score, the more the variance contributions differ before and after integration.
- **Silhouette** (Luecken et al. 2021): The absolute silhouette width is computed on cell identity labels, measuring their compactness.
- **Batch ASW** (Luecken et al. 2021): The absolute silhouette width is computed over batch labels per cell. Because 0 indicates that batches are well mixed and any deviation from 0 indicates a batch effect, we use 1 - abs(ASW) to map the score to the range [0, 1].
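To make the pair-counting idea behind ARI concrete, here is a minimal Python sketch that computes the index from the contingency counts of two label vectors. It is an illustration only, not the benchmark's implementation; the function name `adjusted_rand_index` and the toy labels are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same cells")
    n = len(labels_a)

    # Contingency counts: cells that fall in cluster i of A and cluster j of B.
    joint_sizes = Counter(zip(labels_a, labels_b))
    sizes_a = Counter(labels_a)
    sizes_b = Counter(labels_b)

    # Number of cell pairs grouped together in both clusterings, in A, and in B.
    together_both = sum(comb(c, 2) for c in joint_sizes.values())
    together_a = sum(comb(c, 2) for c in sizes_a.values())
    together_b = sum(comb(c, 2) for c in sizes_b.values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: nothing to disagree about
    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. one single cluster in both partitions
    return (together_both - expected) / (maximum - expected)


# Hypothetical toy data: annotated cell types versus clusters found after integration.
cell_types = ["B", "B", "T", "T"]
clusters = [0, 0, 1, 2]
score = adjusted_rand_index(cell_types, clusters)
```

In this toy case the clustering splits the T cells into two groups, so the score falls below 1. Two partitions that are identical up to a renaming of the clusters score 1.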

## Results

Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.

## Details

## Methods

- **Random Integration by Batch** (Open Problems for Single Cell Analysis Consortium 2022): Feature values, embedding coordinates, and graph connectivity are all randomly permuted within each batch label.
- **Random Embedding by Celltype** (Open Problems for Single Cell Analysis Consortium 2022): Cells are embedded as a one-hot encoding of cell-type labels.
- **Random Graph by Celltype** (Open Problems for Single Cell Analysis Consortium 2022): Cells are embedded as a one-hot encoding of cell-type labels. A graph is then built on this embedding.
- **Random Integration by Celltype** (Open Problems for Single Cell Analysis Consortium 2022): Feature values, embedding coordinates, and graph connectivity are all randomly permuted within each cell-type label.
- **Combat (full/scaled)** (Johnson, Li, and Rabinovic 2006): ComBat uses an Empirical Bayes (EB) approach to correct for batch effects. It estimates batch-specific parameters by pooling information across genes in each batch and shrinks the estimates towards the overall mean of the batch effect estimates across all genes. These parameters are then used to adjust the data for batch effects, leading to more accurate and reproducible results.
- **Combat (full/unscaled)** (Johnson, Li, and Rabinovic 2006): ComBat uses an Empirical Bayes (EB) approach to correct for batch effects. It estimates batch-specific parameters by pooling information across genes in each batch and shrinks the estimates towards the overall mean of the batch effect estimates across all genes. These parameters are then used to adjust the data for batch effects, leading to more accurate and reproducible results.
- **Combat (hvg/scaled)** (Johnson, Li, and Rabinovic 2006): ComBat uses an Empirical Bayes (EB) approach to correct for batch effects. It estimates batch-specific parameters by pooling information across genes in each batch and shrinks the estimates towards the overall mean of the batch effect estimates across all genes. These parameters are then used to adjust the data for batch effects, leading to more accurate and reproducible results.
- **Combat (hvg/unscaled)** (Johnson, Li, and Rabinovic 2006): ComBat uses an Empirical Bayes (EB) approach to correct for batch effects. It estimates batch-specific parameters by pooling information across genes in each batch and shrinks the estimates towards the overall mean of the batch effect estimates across all genes. These parameters are then used to adjust the data for batch effects, leading to more accurate and reproducible results.
- **FastMNN feature (full/scaled)** (Lun 2019): fastMNN performs a multi-sample PCA to reduce dimensionality, identifies MNN pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step.
- **FastMNN feature (full/unscaled)** (Lun 2019): fastMNN performs a multi-sample PCA to reduce dimensionality, identifies MNN pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step.
- **FastMNN feature (hvg/scaled)** (Lun 2019): fastMNN performs a multi-sample PCA to reduce dimensionality, identifies MNN pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step.
- **FastMNN feature (hvg/unscaled)** (Lun 2019): fastMNN performs a multi-sample PCA to reduce dimensionality, identifies MNN pairs in the low-dimensional space, and then corrects the target batch towards the reference using locally weighted correction vectors. The corrected target batch is then merged with the reference. The process is repeated with the next target batch, except for the PCA step.
- **MNN (full/scaled)** (Haghverdi et al. 2018): MNN first detects mutual nearest neighbours in two of the batches and infers a projection of the second onto the first batch. After that, additional batches are added iteratively.
- **MNN (full/unscaled)** (Haghverdi et al. 2018): MNN first detects mutual nearest neighbours in two of the batches and infers a projection of the second onto the first batch. After that, additional batches are added iteratively.
- **MNN (hvg/scaled)** (Haghverdi et al. 2018): MNN first detects mutual nearest neighbours in two of the batches and infers a projection of the second onto the first batch. After that, additional batches are added iteratively.
- **MNN (hvg/unscaled)** (Haghverdi et al. 2018): MNN first detects mutual nearest neighbours in two of the batches and infers a projection of the second onto the first batch. After that, additional batches are added iteratively.
- **No Integration** (Open Problems for Single Cell Analysis Consortium 2022): Cells are embedded by PCA on the unintegrated data. A graph is built on this PCA embedding.
- **No Integration by Batch** (Open Problems for Single Cell Analysis Consortium 2022): Cells are embedded by computing PCA independently on each batch.
- **Random Integration** (Open Problems for Single Cell Analysis Consortium 2022): Feature values, embedding coordinates, and graph connectivity are all randomly permuted.
- **SCALEX (full)** (Xiong et al. 2022): SCALEX is a method for integrating heterogeneous single-cell data online using a VAE framework. Its generalised encoder disentangles batch-related components from batch-invariant biological components, which are then projected into a common cell-embedding space.
- **SCALEX (hvg)** (Xiong et al. 2022): SCALEX is a method for integrating heterogeneous single-cell data online using a VAE framework. Its generalised encoder disentangles batch-related components from batch-invariant biological components, which are then projected into a common cell-embedding space.
- **Scanorama gene output (full/scaled)** (Hie, Bryson, and Berger 2019): Scanorama is an extension of the MNN method. Unlike MNN, it finds mutual nearest neighbours over all batches and embeds observations into a joint hyperplane.
- **Scanorama gene output (full/unscaled)** (Hie, Bryson, and Berger 2019): Scanorama is an extension of the MNN method. Unlike MNN, it finds mutual nearest neighbours over all batches and embeds observations into a joint hyperplane.
- **Scanorama gene output (hvg/scaled)** (Hie, Bryson, and Berger 2019): Scanorama is an extension of the MNN method. Unlike MNN, it finds mutual nearest neighbours over all batches and embeds observations into a joint hyperplane.
- **Scanorama gene output (hvg/unscaled)** (Hie, Bryson, and Berger 2019): Scanorama is an extension of the MNN method. Unlike MNN, it finds mutual nearest neighbours over all batches and embeds observations into a joint hyperplane.

## Baseline methods

- **Random Integration by Batch**: Feature values, embedding coordinates, and graph connectivity are all randomly permuted within each batch label.
- **Random Embedding by Celltype**: Cells are embedded as a one-hot encoding of cell-type labels.
- **Random Graph by Celltype**: Cells are embedded as a one-hot encoding of cell-type labels. A graph is then built on this embedding.
- **Random Integration by Celltype**: Feature values, embedding coordinates, and graph connectivity are all randomly permuted within each cell-type label.
- **No Integration**: Cells are embedded by PCA on the unintegrated data. A graph is built on this PCA embedding.
- **No Integration by Batch**: Cells are embedded by computing PCA independently on each batch.
- **Random Integration**: Feature values, embedding coordinates, and graph connectivity are all randomly permuted.
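Two of these baselines are simple enough to sketch directly. The Python below is a rough illustration, not the benchmark's code: it assumes that "permuted within each label" means shuffling whole cell rows among cells that share a label, and the function names `permute_within_groups` and `one_hot_embedding` are hypothetical.

```python
import random
from collections import defaultdict


def permute_within_groups(rows, groups, seed=0):
    """Shuffle rows among cells sharing a group label; the labels stay in place."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)

    permuted = list(rows)
    for indices in members.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Cell at position `target` receives the row of a random cell in its group.
        for target, source in zip(indices, shuffled):
            permuted[target] = rows[source]
    return permuted


def one_hot_embedding(cell_types):
    """Embed each cell as a one-hot vector over the sorted set of cell-type labels."""
    categories = sorted(set(cell_types))
    position = {name: i for i, name in enumerate(categories)}
    return [
        [1.0 if position[label] == i else 0.0 for i in range(len(categories))]
        for label in cell_types
    ]


# Hypothetical toy data: five cells, two features, two batches.
features = [[1, 0], [2, 0], [3, 0], [0, 7], [0, 8]]
batches = ["a", "a", "a", "b", "b"]
shuffled = permute_within_groups(features, batches)
embedding = one_hot_embedding(["T", "B", "T"])
```

Passing cell-type labels instead of batch labels to `permute_within_groups` gives the by-celltype variant, and passing a single constant label for every cell gives the fully random variant.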

## Datasets

- **Immune (by batch)** (Luecken et al. 2021): Human immune cells from peripheral blood and bone marrow taken from 5 datasets comprising 10 batches across technologies (10X, Smart-seq2).
- **Lung (Vieira Braga et al.)** (Luecken et al. 2021): Human lung scRNA-seq data from 3 datasets with 32,472 cells, from Vieira Braga et al. Technologies: 10X and Drop-seq.
- **Pancreas (by batch)** (Luecken et al. 2021): Human pancreatic islet scRNA-seq data from 6 datasets across technologies (CEL-seq, CEL-seq2, Smart-seq2, inDrop, Fluidigm C1, and SMARTER-seq).

## Quality control results

✓ All checks succeeded!


## References

Büttner et al. 2018. *Nature Methods* 16 (1): 43–49. https://doi.org/10.1038/s41592-018-0254-1.

Haghverdi et al. 2018. *Nature Biotechnology* 36 (5): 421–27. https://doi.org/10.1038/nbt.4091.

Hie, Bryson, and Berger 2019. *Nature Biotechnology* 37 (6): 685–91. https://doi.org/10.1038/s41587-019-0113-3.

Johnson, Li, and Rabinovic 2006. *Biostatistics* 8 (1): 118–27. https://doi.org/10.1093/biostatistics/kxj037.

Luecken et al. 2021. *Nature Methods* 19 (1): 41–50. https://doi.org/10.1038/s41592-021-01336-8.

Xiong et al. 2022. *Nature Communications* 13 (1). https://doi.org/10.1038/s41467-022-33758-z.