```
poss_dataset_ids = dataset_info
.map(d => d.dataset_id)
.filter(d => results.map(r => r.dataset_id).includes(d))
poss_method_ids = method_info
.map(d => d.method_id)
.filter(d => results.map(r => r.method_id).includes(d))
poss_metric_ids = metric_info
.map(d => d.metric_id)
.filter(d => results.map(r => Object.keys(r.scaled_scores)).flat().includes(d))
```

# Perturbation Prediction

Predicting how small molecules change gene expression in different cell types.

1 datasets · 6 methods · 6 control methods · 5 metrics

Task info Method info Metric info Dataset info Results

Human biology can be complex, in part due to the function and interplay of the body’s approximately 37 trillion cells, which are organized into tissues, organs, and systems. However, recent advances in single-cell technologies have provided unparalleled insight into the function of cells and tissues at the level of DNA, RNA, and proteins. Yet leveraging single-cell methods to develop medicines requires mapping causal links between chemical perturbations and the downstream impact on cell state. These experiments are costly and labor intensive, and not all cells and tissues are amenable to high-throughput transcriptomic screening. If data science could help accurately predict chemical perturbations in new cell types, it could accelerate and expand the development of new medicines.

Several methods have been developed for drug perturbation prediction, most of which are variations on the autoencoder architecture (Dr.VAE, scGEN, and ChemCPA). However, these methods lack proper benchmarking datasets with diverse cell types to determine how well they generalize. The largest available training dataset is the NIH-funded Connectivity Map (CMap), which comprises over 1.3M small molecule perturbation measurements. However, the CMap includes observations of only 978 genes, less than 5% of all genes. Furthermore, the CMap data is comprised almost entirely of measurements in cancer cell lines, which may not accurately represent human biology.

This task aims to predict how small molecules change gene expression in different cell types. This task was a Kaggle competition as part of the NeurIPS 2023 competition track.

The task is to predict the gene expression profile of a cell after a small molecule perturbation. For this competition, we designed and generated a novel single-cell perturbational dataset in human peripheral blood mononuclear cells (PBMCs). We selected 144 compounds from the Library of Integrated Network-Based Cellular Signatures (LINCS) Connectivity Map dataset (PMID: 29195078) and measured single-cell gene expression profiles after 24 hours of treatment. The experiment was repeated in three healthy human donors, and the compounds were selected based on diverse transcriptional signatures observed in CD34+ hematopoietic stem cells (data not released). We performed this experiment in human PBMCs because the cells are commercially available with pre-obtained consent for public release and PBMCs are a primary, disease-relevant tissue that contains multiple mature cell types (including T-cells, B-cells, myeloid cells, and NK cells) with established markers for annotation of cell types. To supplement this dataset, we also measured cells from each donor at baseline with joint scRNA and single-cell chromatin accessibility measurements using the 10x Multiome assay. We hope that the addition of rich multi-omic data for each donor and cell type at baseline will help establish biological priors that explain the susceptibility of particular genes to exhibit perturbation responses in difference biological contexts.

## Summary

## Display settings

## Filter datasets

## Filter methods

## Filter metrics

## Results

Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.

## Dataset info

## Show

### NeurIPS2023 scPerturb DGE

Differential gene expression sign(logFC) * -log10(p-value) values after 24 hours of treatment with 144 compounds in human PBMCs (**TBD?**).

For this competition, we designed and generated a novel single-cell perturbational dataset in human peripheral blood mononuclear cells (PBMCs). We selected 144 compounds from the Library of Integrated Network-Based Cellular Signatures (LINCS) Connectivity Map dataset (PMID: 29195078) and measured single-cell gene expression profiles after 24 hours of treatment. The experiment was repeated in three healthy human donors, and the compounds were selected based on diverse transcriptional signatures observed in CD34+ hematopoietic stem cells (data not released). We performed this experiment in human PBMCs because the cells are commercially available with pre-obtained consent for public release and PBMCs are a primary, disease-relevant tissue that contains multiple mature cell types (including T-cells, B-cells, myeloid cells, and NK cells) with established markers for annotation of cell types. To supplement this dataset, we also measured cells from each donor at baseline with joint scRNA and single-cell chromatin accessibility measurements using the 10x Multiome assay. We hope that the addition of rich multi-omic data for each donor and cell type at baseline will help establish biological priors that explain the susceptibility of particular genes to exhibit perturbation responses in difference biological contexts.

## Method info

## Show

### LSTM-GRU-CNN Ensemble

An ensemble of LSTM, GRU, and 1D CNN models. Links: Docs.

An ensemble of LSTM, GRU, and 1D CNN models with a variety of input features derived from ChemBERTa embeddings, one-hot encoding of cell type/small molecule pairs, and various statistical measures of target gene expression. The models were trained with a combination of MSE, MAE, LogCosh, and BCE loss functions to improve their robustness and predictive performance. The approach also included data augmentation techniques to ensure generalization and account for noise in the data.

### NN retraining with pseudolabels

Neural networks with pseudolabeling and ensemble modelling. Links: Docs.

The prediction system is two staged, so I publish two versions of the notebook. The first stage predicts pseudolabels. To be honest, if I stopped on this version, I would not be the third. The predicted pseudolabels on all test data (255 rows) are added to training in the second stage.

**Stage 1 preparing pseudolabels**: The main part of this system is a neural network. Every neural network and its environment was optimized by optuna. Hyperparameters that have been optimized: a dropout value, a number of neurons in particular layers, an output dimension of an embedding layer, a number of epochs, a learning rate, a batch size, a number of dimension of truncated singular value decomposition. The optimization was done on custom 4-folds cross validation. In order to avoid overfitting to cross validation by optuna I applied 2 repeats for every fold and took an average. Generally, the more, the better. The optuna’s criterion was MRRMSE. Finally, 7 models were ensembled. Optuna was applied again to determine best weights of linear combination. The prediction of test set is the pseudolabels now and will be used in second stage.

**Stage 2 retraining with pseudolabels**: The pseudolabels (255 rows) were added to the training dataset. I applied 20 models with optimized parameters in different experiments for a model diversity. Optuna selected optimal weights for the linear combination of the prediction again. Models had high variance, so every model was trained 10 times on all dataset and the median of prediction is taken as a final prediction. The prediction was additionally clipped to colwise min and max.

### JN-AP-OP2

Deep learning architecture composed of 2 modules: a sample-centric MLP and a gene-centric MLP. Links: Docs.

We first encode each sample using leave-one-out encoder based on compound and cell type. This produces X with the dimension of n_samples, n_genes, n_encode, where n_encode is 2. Then, X is passed to a MLP1 sample-wise with input of n_samples, n_genes*n_encode, which outputs the same dimension data. The purpose of this MLP is to learn inter-gene relationships. Then, we group the output of MLP1 with X (original encoded data) and feed it to MLP2 which receives n_smaples*n_genes, (n_encode + n_encode) and results n_samples*n_genes. This MLP2 trains on each (compound, cell_type, gene) combination. This is to overcome the underdetermination problem due to lack of sufficient (compound, cell_type) samples.

### ScAPE

Neural network model for drug effect prediction (**pablormier2023scape?**). Links: Docs.

ScAPE is utilises a neural network (NN) model to estimate drug effects on gene expression in peripheral blood mononuclear cells (PBMCs). The model took drug and cell features as input, with these features primarily derived from the median of signed log-pvalues and log fold-changes grouped by drug and cell type. The NN was trained using a leave-one-drug-out cross-validation strategy, focusing on NK cells as a representative cell type due to their similarity to B cells and Myeloid cells in principal component analysis. Model performance was evaluated by comparing its predictions against two baselines: predicting zero effect and predicting the median log-pvalue for each drug. The final submission combined predictions from models trained on different gene and drug subsets, aiming to enhance overall prediction accuracy.

### Transformer Ensemble

An ensemble of four transformer models, trained on diverse feature sets, with a cluster-based sampling strategy and robust validation for optimal performance. Links: Docs.

This method employs an ensemble of four transformer models, each with different weights and trained on slightly varying feature sets. The feature engineering process involved one-hot encoding of categorical labels, target encoding using mean and standard deviation, and enriching the feature set with the standard deviation of target variables. Additionally, the dataset was carefully examined to ensure data cleanliness. A sophisticated sampling strategy based on K-Means clustering was employed to partition the data into training and validation sets, ensuring a representative distribution. The model architecture leveraged sparse and dense feature encoding, along with a transformer for effective learning.

### Py-boost

Py-boost predicting t-scores. Links: Docs.

An ensemble of four models was considered:

- Py-boost (a ridge regression-based recommender system)
- ExtraTrees (a decision tree ensemble with target-encoded features)
- a k-nearest neighbors recommender system
- a ridge regression model

Each model offered distinct strengths and weaknesses: ExtraTrees and knn were unable to extrapolate beyond the training data, while ridge regression provided extrapolation capability. To enhance model performance, data augmentation techniques were used, including averaging differential expressions for compound mixtures and adjusting cell counts to reduce biases.

In the end, only the py-boost model is used for generating predictions.

## Control method info

## Show

### Ground truth

Returns the ground truth predictions

The identity function that returns the ground-truth information as the output.

### Mean per gene

Baseline method that returns mean of gene’s outcomes

Baseline method that predicts for a gene the mean of its outcomes of all samples.

### Mean per cell type and gene

Baseline method that returns mean of cell type’s outcomes

Baseline method that predicts for a cell type the mean of its outcomes of all compounds.

### Mean per compound and gene

Baseline method that returns mean of compound’s outcomes

Baseline method that predicts for a compound the mean of its outcomes of all samples.

### Sample

Sample predictions from the training data

This method samples the training data to generate predictions.

### Zeros

Baseline method that predicts all zeros

Baseline method that predicts all zeros.

## Metric info

## Show

### Mean Rowwise RMSE

The mean of the root mean squared error (RMSE) of each row in the matrix.

We use the **Mean Rowwise Root Mean Squared Error** to score submissions, computed as follows:

\textrm{MRRMSE} = \frac{1}{R}\sum_{i=1}^R\left(\frac{1}{n} \sum_{j=1}^{n} (y_{ij} - \widehat{y}_{ij})^2\right)^{1/2}

where (R) is the number of scored rows, and (y_{ij}) and (\widehat{y}_{ij}) are the actual and predicted values, respectively, for row (i) and column (j), and (n) bis the number of columns.

### Mean Rowwise MAE

The mean of the absolute error (MAE) of each row in the matrix.

We use the **Mean Rowwise Absolute Error** to score submissions, computed as follows:

\textrm{MRMAE} = \frac{1}{R}\sum_{i=1}^R\left(\frac{1}{n} \sum_{j=1}^{n}y_{ij} - \widehat{y}_{ij}\right)

where (R) is the number of scored rows, and (y_{ij}) and (\widehat{y}_{ij}) are the actual and predicted values, respectively, for row (i) and column (j), and (n) bis the number of columns.

### Mean Rowwise Pearson

The mean of Pearson correlations per row (perturbation).

The **Mean Pearson Correlation** is computed as follows:

\textrm{Mean-Pearson} = \frac{1}{R}\sum_{i=1}^R\frac{\textrm{Cov}(\mathbf{y}_i, \mathbf{\hat{y}}_i)}{\textrm{Var}(\mathbf{y}_i) \cdot \textrm{Var}(\mathbf{\hat{y}}_i)}

where (R) is the number of scored rows, and (\mathbf{y}_i) and (\mathbf{\hat{y}}_i) are the actual and predicted values, respectively, for row (i).

### Mean Rowwise Spearman

The mean of Spearman correlations per row (perturbation).

The **Mean Spearman Correlation** is computed as follows:

\textrm{Mean-Pearson} = \frac{1}{R}\sum_{i=1}^R\frac{\textrm{Cov}(\mathbf{r}_i, \mathbf{\hat{r}}_i)}{\textrm{Var}(\mathbf{r}_i) \cdot \textrm{Var}(\mathbf{\hat{r}}_i)}

where (R) is the number of scored rows, and (\mathbf{r}_i) and (\mathbf{\hat{r}}_i) are the ranks of the actual and predicted values, respectively, for row (i).

### Mean Rowwise Cosine

The mean of cosine similarities per row (perturbation).

The **Mean Cosine Similarity** is computed as follows:

\textrm{Mean-Cosine} = \frac{1}{R}\sum_{i=1}^R\frac{\mathbf{y}_i\cdot \mathbf{\hat{y}}_i}{\\mathbf{y}_i\ \\mathbf{\hat{y}}_i\}

where (R) is the number of scored rows, and (\mathbf{y}_i) and (\mathbf{\hat{y}}_i) are the actual and predicted values, respectively, for row (i).

## Quality control results

## Show

Category | Name | Value | Condition | Severity |
---|---|---|---|---|

Method info | Pct 'paper_reference' missing | 0.4166667 | percent_missing(method_info, field) | ✗✗ |

Metric info | Pct 'paper_reference' missing | 1.0000000 | percent_missing(metric_info, field) | ✗✗ |