Task 1: Modality Prediction
Predicting the flow of information from DNA to RNA and RNA to Protein
Experimental techniques that measure multiple modalities within the same single cell are becoming increasingly available. The demand for these measurements is driven by the promise of deeper insight into the state of a cell. Yet the modalities are also intrinsically linked. We know that DNA must be accessible (ATAC data) to produce mRNA (expression data), and mRNA in turn is used as a template to produce protein (protein abundance). These processes are often regulated by the same molecules that they produce: for example, a protein may bind DNA to prevent the production of more mRNA. Understanding these regulatory processes would be transformative for synthetic biology and drug target discovery. Any method that can predict one modality from another must account for these regulatory processes, but the continued demand for multi-modal data shows that this is not trivial.
This task requires translating information between multiple layers of gene regulation. In some ways, this is similar to the task of machine translation. In machine translation, the same sentiment is expressed in multiple languages and the goal is to train a model to represent the same meaning in a different language. In this context, the same cellular state is measured in two different feature sets and the goal of this task is to translate the information about cellular state from one modality to the other.
The following section describes the task API for the Modality Prediction task. Competitors must submit their code as a Viash component. To facilitate creation of these components, starter kits have been provided.
Input data formats
Inputs to methods
Method components should expect three inputs, --input_train_mod1, --input_train_mod2, and --input_test_mod1. They are all paths to AnnData h5ad files with the attributes below. More information on AnnData objects can be found in the AnnData documentation.
mod1 and mod2 refer to the modality of the datasets as defined by adata.var['feature_types']. One file will always have "GEX" and the other will be either "ATAC" or "ADT". For the purposes of the competition, components should expect the following 4 combinations of modalities (mod1 → mod2):
- GEX → ATAC
- ATAC → GEX
- GEX → ADT
- ADT → GEX
Objective for methods
Submission components must predict mod2 for the cells provided in --input_test_mod1. For methods that do not involve pre-trained models, training data is also provided in the --input_train_mod[1|2] files. For methods that involve pre-trained models, these training datasets can be ignored.
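To make the objective concrete, the following is a minimal sketch of a baseline, assuming the three inputs have already been loaded and converted to dense NumPy arrays. The function name predict_mod2_ridge, the ridge penalty alpha, and the synthetic data are illustrative assumptions; none of them are part of the task API or the starter kits.

```python
import numpy as np


def predict_mod2_ridge(train_mod1, train_mod2, test_mod1, alpha=1.0):
    """Hypothetical baseline: ridge regression from mod1 features to mod2 features.

    train_mod1 : (n_train, n_var1), train_mod2 : (n_train, n_var2),
    test_mod1  : (n_test, n_var1). Returns an array of shape (n_test, n_var2).
    """
    # Center both modalities so that the intercept is carried by the training means.
    mu1 = train_mod1.mean(axis=0)
    mu2 = train_mod2.mean(axis=0)
    x = train_mod1 - mu1
    y = train_mod2 - mu2
    # Closed-form ridge solution: W = (X^T X + alpha * I)^-1 X^T Y
    gram = x.T @ x + alpha * np.eye(x.shape[1])
    weights = np.linalg.solve(gram, x.T @ y)
    return (test_mod1 - mu1) @ weights + mu2


# Synthetic stand-in data (not competition data): mod2 is a linear function of mod1.
rng = np.random.default_rng(0)
train_mod1 = rng.normal(size=(50, 5))
true_weights = rng.normal(size=(5, 3))
train_mod2 = train_mod1 @ true_weights + 2.0
test_mod1 = rng.normal(size=(10, 5))

pred = predict_mod2_ridge(train_mod1, train_mod2, test_mod1, alpha=1e-8)
```

The closed form builds an n_var1 × n_var1 matrix, so with thousands of genes or peaks as mod1 a dimensionality reduction step beforehand would likely be needed.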
Attributes of input data
The input data objects have the following attributes:
adata
    Input AnnData object for modality 1 or 2

    Attributes
    ----------
    adata.X : ndarray, shape=(n_obs, n_var)
        Sparse profile matrix of given modality. If .var['feature_types'] == "GEX" or "ADT", values in adata.X represent expression counts for each gene. If .var['feature_types'] == "ATAC", values represent counts of reads in peaks for chromatin accessibility.
    adata.uns['dataset_id'] : str
        The name of the dataset.
    adata.obs["batch"] : ndarray, shape=(n_obs,)
        The batch from which the data was sequenced. Has format "s[1-4]d[1-9]" indicating the site and donor associated with the batch.
    adata.obs_names : ndarray, shape=(n_obs,)
        Ids for the cells.
    adata.var['feature_types'] : ndarray, shape=(n_var,)
        The modality of this file, should be equal to "GEX", "ATAC" or "ADT".
    adata.var_names : ndarray, shape=(n_var,)
        Ids for the features.
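The batch labels follow the pattern "s[1-4]d[1-9]", so site and donor can be recovered with a regular expression, for example to hold out one site or donor for local validation. The helper name parse_batch below is hypothetical and only illustrates the label format described above.

```python
import re

# Pattern taken from the attribute description: site 1-4, donor 1-9.
BATCH_PATTERN = re.compile(r"s([1-4])d([1-9])")


def parse_batch(batch):
    """Split a batch label such as "s1d2" into (site, donor) integers."""
    match = BATCH_PATTERN.fullmatch(batch)
    if match is None:
        raise ValueError(f"unexpected batch label: {batch!r}")
    return int(match.group(1)), int(match.group(2))


site, donor = parse_batch("s1d2")
```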
Examples of how to load and process the data are contained in the starter kits for the respective programming language.
Normalization and transformation of data for the prediction task
To make the task more straightforward, we have followed common practices for normalizing and transforming data of each modality. The raw data is also provided in adata.layers as described below. Please note, the performance metric will be calculated on the normalized and transformed data stored in adata.X for each of the modality types below.
For full details on preprocessing, see the Data Preprocessing notes.
For this task, gene expression data stored in adata.X for the training and test data has been size-factor normalized and log1p transformed. Raw UMI counts are available in adata.layers["counts"]. Size factors are accessible in
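As a rough illustration of what size-factor normalization followed by a log1p transform does to a count matrix, here is a minimal sketch on a dense toy array. The function name normalize_gex is hypothetical, and the size factors are computed here as library size divided by mean library size, which is an assumption; the size factors shipped with the data may have been estimated differently.

```python
import numpy as np


def normalize_gex(counts, size_factors):
    """Divide each cell's counts by its size factor, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    size_factors = np.asarray(size_factors, dtype=float)
    # size_factors has one entry per cell (row), so broadcast it across genes.
    return np.log1p(counts / size_factors[:, None])


# Toy counts: 2 cells x 3 genes.
counts = np.array([[0, 2, 8], [1, 0, 3]])
# Assumed size factors: library size relative to the mean library size.
library_size = counts.sum(axis=1)
size_factors = library_size / library_size.mean()
normalized = normalize_gex(counts, size_factors)
```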
For this task, ATAC data stored in adata.X for the training and test data has been binarized and subset to 10000 random peaks. The raw UMI counts for each peak can be found in adata.layers.
For this task, ADT derived protein abundance measures have been centered log-ratio (CLR) normalized. Raw ADT counts can be found in adata.layers.
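For readers unfamiliar with the centered log-ratio transform, the sketch below shows a textbook variant: take the log of each count plus a pseudocount, then subtract the per-cell mean of those logs. The function name clr_normalize, the pseudocount of 1, and centering within each cell (rather than within each protein) are all assumptions; the exact CLR variant applied to the competition data may differ.

```python
import numpy as np


def clr_normalize(counts, pseudocount=1.0):
    """Centered log-ratio per cell: log(x + p) minus the cell's mean log(x + p)."""
    log_counts = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Subtracting the mean log is the same as dividing by the geometric mean.
    return log_counts - log_counts.mean(axis=1, keepdims=True)


# Toy ADT counts: 2 cells x 3 proteins.
adt_counts = np.array([[0, 9, 99], [4, 4, 4]])
adt_clr = clr_normalize(adt_counts)
```

Each row of the result sums to zero, so the values describe the abundance of a protein relative to the other proteins measured in the same cell.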
Output data formats
This component should output only one h5ad file whose path is specified via --output, containing the predicted profile values of modality 2 for the test cells only. It must have the following attributes:
adata
    Output AnnData object containing predictions for modality 2 in the "test" cells

    Attributes
    ----------
    adata.X : ndarray, shape=(n_obs, n_var)
        Sparse profile matrix.
    adata.uns['dataset_id'] : str
        The name of the dataset.
    adata.uns['method_id'] : str
        The name of the prediction method. This is used to track submissions.
    adata.var['feature_types'] : ndarray, shape=(n_var,)
        The modality of this file, should be equal to "GEX", "ATAC" or "ADT".
    adata.obs_names : ndarray, shape=(n_obs,)
        Ids for the cells.
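Before writing the output file, it can help to check the prediction object against the attribute list above. The sketch below is a hypothetical helper (check_prediction is not part of the task API) that only relies on attribute names, so it is demonstrated on a plain stand-in object rather than a real AnnData. Requiring the cells to appear in the same order as in the test input is an assumption made here, not a stated rule.

```python
from types import SimpleNamespace

ALLOWED_FEATURE_TYPES = {"GEX", "ATAC", "ADT"}


def check_prediction(adata, expected_obs_names, expected_n_var):
    """Return a list of problems found in a prediction object (empty if none)."""
    problems = []
    for key in ("dataset_id", "method_id"):
        if key not in adata.uns:
            problems.append(f"missing adata.uns[{key!r}]")
    expected_shape = (len(expected_obs_names), expected_n_var)
    if tuple(adata.X.shape) != expected_shape:
        problems.append(f"adata.X has shape {tuple(adata.X.shape)}, expected {expected_shape}")
    # Assumption: predictions are expected in the same cell order as the test input.
    if list(adata.obs_names) != list(expected_obs_names):
        problems.append("adata.obs_names do not match the test cells")
    feature_types = set(adata.var["feature_types"])
    if len(feature_types) != 1 or not feature_types <= ALLOWED_FEATURE_TYPES:
        problems.append(f"unexpected feature_types: {sorted(feature_types)}")
    return problems


# Stand-in with the same attribute names as an AnnData object, for illustration only.
example = SimpleNamespace(
    X=SimpleNamespace(shape=(2, 3)),
    uns={"dataset_id": "example_dataset"},  # method_id deliberately left out
    var={"feature_types": ["ADT", "ADT", "ADT"]},
    obs_names=["cell_1", "cell_2"],
)
problems = check_prediction(example, ["cell_1", "cell_2"], expected_n_var=3)
```

Here problems contains a single entry reporting the missing method_id; once that key is set, the list is empty.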
Performance in task 1 is measured using the root mean squared error between the observed and predicted values for modality 2 in the test set. Lower values are better.
The metric function used to evaluate the prediction has the following structure (this example employs Python syntax; the R evaluation function is functionally equivalent):
def calculate_rmse(adata_mod2, adata_mod2_answer):
    '''Function to calculate RMSE between prediction and solution for the test sets

    Params
    ------
    adata_mod2 : AnnData, shape=(n_obs, n_var)
        User-submitted prediction for expression of mod2 in cells from the test set
    adata_mod2_answer : AnnData, shape=(n_obs, n_var)
        Measured values for expression of mod2 in the test set

    Returns
    -------
    rmse : float
        The root mean squared error between the predicted and observed values
        for all features in the test set.
    '''
    from sklearn.metrics import mean_squared_error

    # squared=False returns the root of the mean squared error (RMSE).
    # Recent scikit-learn releases provide root_mean_squared_error for the same purpose.
    return mean_squared_error(adata_mod2.X, adata_mod2_answer.X, squared=False)
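For local validation without scikit-learn, the same kind of score can be approximated with NumPy alone. This sketch assumes dense arrays and assumes that, for two-dimensional inputs, the scikit-learn call takes the root of the squared error per feature (column) and then averages over features; the function name rmse_per_feature_mean is hypothetical.

```python
import numpy as np


def rmse_per_feature_mean(prediction, answer):
    """Root mean squared error per feature (column), averaged over features."""
    difference = np.asarray(prediction, dtype=float) - np.asarray(answer, dtype=float)
    per_feature_rmse = np.sqrt((difference ** 2).mean(axis=0))
    return float(per_feature_rmse.mean())


# Toy example: 2 cells x 2 features, with one value predicted wrongly.
answer = np.array([[0.0, 1.0], [2.0, 3.0]])
prediction = np.array([[0.0, 1.0], [2.0, 7.0]])
score = rmse_per_feature_mean(prediction, answer)
```

In the toy example the first feature is predicted exactly and the second has a per-feature RMSE of sqrt(8), so the averaged score is sqrt(2).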
For this task, five prizes of $1000 will be awarded to the submissions for each of the following criteria:
- Best performance predicting GEX → ATAC
- Best performance predicting ATAC → GEX
- Best performance predicting GEX → ADT
- Best performance predicting ADT → GEX
- Best performance on average across modalities
Terms and Conditions apply.