## Predicting the flow of information from DNA to RNA and RNA to Protein

Experimental techniques that measure multiple modalities within the same single cell are becoming increasingly available. The demand for these measurements is driven by the promise of deeper insight into the state of a cell. Yet the modalities are also intrinsically linked. We know that DNA must be accessible (ATAC data) to produce mRNA (expression data), and mRNA in turn is used as a template to produce protein (protein abundance). These processes are often regulated by the same molecules that they produce: for example, a protein may bind DNA to prevent the production of more mRNA. Understanding these regulatory processes would be transformative for synthetic biology and drug target discovery. Any method that can predict one modality from another must have accounted for these regulatory processes, but the demand for multi-modal data shows that this is not trivial.

This task requires translating information between multiple layers of gene regulation. In some ways, this is similar to the task of machine translation. In machine translation, the same sentiment is expressed in multiple languages and the goal is to train a model to represent the same meaning in a different language. In this context, the same cellular state is measured in two different feature sets and the goal of this task is to translate the information about cellular state from one modality to the other.

The following section describes the task API for the Modality Prediction task. Competitors must submit their code as a Viash component. To facilitate creation of these components, starter kits have been provided.

### Input data formats

The data format and attributes provided in the input data are tailored to each task and may differ from the publicly released benchmarking dataset. Only the attributes listed in the following section will be accessible to methods submitted to the competition.

#### Inputs to methods

Method components should expect three inputs: --input_train_mod1, --input_train_mod2, and --input_test_mod1. All three are paths to AnnData h5ad files with the attributes described below. More information on AnnData objects can be found in the AnnData documentation. mod1 and mod2 refer to the modalities of the datasets, as defined by .var['feature_types']. One file will always have the feature type "GEX", and the other will have "ATAC" or "ADT". For the purposes of the competition, components should expect the following four combinations of modalities:

| mod1 | mod2 |
| --- | --- |
| "GEX" | "ATAC" |
| "ATAC" | "GEX" |
| "GEX" | "ADT" |
| "ADT" | "GEX" |

#### Objective for methods

Submission components must predict mod2 for the cells provided in --input_test_mod1. For methods that do not involve pre-trained models, training data is also provided in the --input_train_mod[1|2] files. For methods that involve pre-trained models, these training datasets can be ignored.
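To make the objective concrete, the following is a minimal baseline sketch, not the organizers' reference method. It assumes the three matrices have already been loaded as dense NumPy arrays (the names `x_train`, `y_train`, and `x_test` are hypothetical, and the data here is synthetic) and fits a ridge regression from mod1 to mod2 in closed form.

```python
import numpy as np


def fit_ridge(x_train, y_train, alpha=1.0):
    """Fit a ridge regression from mod1 features to mod2 features."""
    x_mean = x_train.mean(axis=0)
    y_mean = y_train.mean(axis=0)
    x_centered = x_train - x_mean
    y_centered = y_train - y_mean
    # Closed-form solution of (X^T X + alpha * I) W = X^T Y.
    gram = x_centered.T @ x_centered + alpha * np.eye(x_centered.shape[1])
    weights = np.linalg.solve(gram, x_centered.T @ y_centered)
    intercept = y_mean - x_mean @ weights
    return weights, intercept


def predict(x_test, weights, intercept):
    """Predict mod2 values for the test cells."""
    return x_test @ weights + intercept


# Synthetic stand-ins for the train and test matrices (not competition data).
rng = np.random.default_rng(0)
true_w = rng.normal(size=(5, 3))
true_b = rng.normal(size=3)
x_train = rng.normal(size=(50, 5))
y_train = x_train @ true_w + true_b
x_test = rng.normal(size=(10, 5))

weights, intercept = fit_ridge(x_train, y_train, alpha=1e-8)
pred = predict(x_test, weights, intercept)
print(pred.shape)
```

With real data the feature dimension is in the thousands, so a dimensionality reduction step before the regression would likely be needed for a solve like this to be practical.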

Note that you do not need to return predictions for all four combinations of inputs and outputs. We will independently rank and award prizes for each combination, as described below in Prizes. For more details, see the FAQs.
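Since a submission may cover only some of the combinations, a component can check early which pair it was handed. The sketch below makes the command-line interface concrete using `argparse`; in an actual Viash component the argument handling is typically provided by Viash, so treat this as an illustration only. The flag names come from this page, while the file names and the `SUPPORTED_PAIRS` subset are hypothetical.

```python
import argparse

# The four (mod1, mod2) combinations defined for the competition.
ALL_PAIRS = {("GEX", "ATAC"), ("ATAC", "GEX"), ("GEX", "ADT"), ("ADT", "GEX")}

# Hypothetical: this example method only handles GEX <-> ADT.
SUPPORTED_PAIRS = {("GEX", "ADT"), ("ADT", "GEX")}


def build_parser():
    """Create a parser for the three inputs and the single output."""
    parser = argparse.ArgumentParser(description="Modality prediction (sketch)")
    for flag in ("--input_train_mod1", "--input_train_mod2",
                 "--input_test_mod1", "--output"):
        parser.add_argument(flag, required=True, help="Path to an h5ad file")
    return parser


def is_supported_pair(mod1, mod2):
    """Return True if this method handles the given modality pair."""
    return (mod1, mod2) in ALL_PAIRS and (mod1, mod2) in SUPPORTED_PAIRS


# Example invocation with placeholder file names.
args = build_parser().parse_args([
    "--input_train_mod1", "train_mod1.h5ad",
    "--input_train_mod2", "train_mod2.h5ad",
    "--input_test_mod1", "test_mod1.h5ad",
    "--output", "prediction.h5ad",
])
print(args.input_test_mod1, args.output)
```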

#### Attributes of input data

The input data objects have the following attributes:

adata: input AnnData object for modality 1 or 2, with the following attributes:

- adata.X: sparse profile matrix of the given modality. If .var['feature_types'] is "GEX" or "ADT", values in adata.X represent expression counts for each gene. If .var['feature_types'] is "ATAC", values represent counts of reads in peaks for chromatin accessibility.
- Dataset name: the name of the dataset.
- Batch: the batch from which the data was sequenced. It has the format "s[1-4]d[1-9]", indicating the site and donor associated with the batch.
- adata.obs_names: IDs for the cells.
- .var['feature_types']: the modality of this file; should be equal to "GEX", "ATAC" or "ADT".
- adata.var_names: IDs for the features.
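The batch label format described above can be split into its site and donor parts, which may be useful when holding out a site or donor for local validation. This is a small sketch; the function name `parse_batch` is hypothetical.

```python
import re

# Batch labels have the form "s[1-4]d[1-9]": site number, then donor number.
BATCH_PATTERN = re.compile(r"s([1-4])d([1-9])")


def parse_batch(batch):
    """Split a batch label such as "s1d2" into (site, donor) integers."""
    match = BATCH_PATTERN.fullmatch(batch)
    if match is None:
        raise ValueError(f"Unexpected batch label: {batch!r}")
    site, donor = match.groups()
    return int(site), int(donor)


print(parse_batch("s1d2"))  # (1, 2)
```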


Examples of how to load and process the data are contained in the starter kits for the respective programming language.

#### Normalization and transformation of data for the prediction task

To make the task more straightforward, we have followed common practices for normalizing and transforming data of each modality. The raw data is also provided in adata.layers as described below. Please note, the performance metric will be calculated on the normalized and transformed data stored in adata.X for each of the modality types below.

For full details on preprocessing, see the Data Preprocessing notes.

GEX

For this task, gene expression data stored in adata.X for the training and test data has been size-factor normalized and log1p transformed. Raw UMI counts are available in adata.layers["counts"]. Size factors are accessible in adata.obs["size_factors"].
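As a rough illustration of how the normalized values relate to the raw counts, the sketch below assumes that size-factor normalization means dividing each cell's counts by its size factor before the log1p transform. The exact procedure is given in the Data Preprocessing notes; the function name and the toy values here are hypothetical.

```python
import numpy as np


def normalize_gex(counts, size_factors):
    """Divide each cell (row) by its size factor, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    size_factors = np.asarray(size_factors, dtype=float)
    return np.log1p(counts / size_factors[:, None])


# Toy data: 2 cells x 2 genes.
counts = np.array([[0, 2], [4, 4]])
size_factors = np.array([1.0, 2.0])
normalized = normalize_gex(counts, size_factors)
print(normalized)
```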

ATAC

For this task, ATAC data stored in adata.X for the training and test data has been binarized and subset to 10000 random peaks. The raw UMI counts for each peak can be found in adata.layers["counts"].
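The two ATAC steps can be sketched as follows, assuming binarization maps any non-zero count to 1 and that the peak subset is a uniform random sample without replacement. The function names and the seed are hypothetical, and this does not reproduce the actual peaks chosen for the competition.

```python
import numpy as np


def binarize(counts):
    """Map every non-zero count to 1 and every zero to 0."""
    return (np.asarray(counts) > 0).astype(np.float32)


def random_peak_subset(matrix, n_peaks, seed=0):
    """Keep n_peaks randomly chosen columns (peaks), in their original order."""
    rng = np.random.default_rng(seed)
    keep = np.sort(rng.choice(matrix.shape[1], size=n_peaks, replace=False))
    return matrix[:, keep], keep


# Toy data: 2 cells x 4 peaks.
counts = np.array([[0, 3, 1, 0], [2, 0, 0, 5]])
binary = binarize(counts)
subset, kept_columns = random_peak_subset(binary, n_peaks=2)
print(binary)
print(subset.shape)
```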

ADT

For this task, ADT-derived protein abundance measures have been centered log-ratio (CLR) normalized. Raw ADT counts can be found in adata.layers["counts"].
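Several CLR variants are in use; they differ in the pseudocount and in whether the ratio is taken across proteins within a cell or across cells for each protein, and this page does not say which was applied. The sketch below shows one common form as an assumption: a per-cell CLR with a pseudocount of 1, that is, log1p of each count minus the mean log1p value of that cell.

```python
import numpy as np


def clr_per_cell(counts):
    """Centered log-ratio per cell (row), using log1p to handle zero counts."""
    log_counts = np.log1p(np.asarray(counts, dtype=float))
    # Subtracting the row mean of the logs divides by the geometric mean.
    return log_counts - log_counts.mean(axis=1, keepdims=True)


# Toy data: 2 cells x 2 proteins.
counts = np.array([[1, 3], [0, 0]])
clr = clr_per_cell(counts)
print(clr)
```

By construction each row of the result sums to zero, which is a quick sanity check for any per-cell CLR implementation.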

### Output data formats

This component should output only one h5ad file whose path is specified via --output, containing the predicted profile values of modality 2 for the test cells only. It must have the following attributes:

adata: output AnnData object containing predictions for modality 2 in the "test" cells, with the following attributes:

- adata.X: sparse profile matrix.
- Dataset name: the name of the dataset.
- Method name: the name of the prediction method. This is used to track submissions.
- .var['feature_types']: the modality of this file; should be equal to "GEX", "ATAC" or "ADT".
- adata.obs_names: IDs for the cells.


### Metric

Performance in task 1 is measured using the root mean squared error between the observed and predicted values for modality 2 in the test set. Lower values are better.
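As a quick arithmetic check of the definition, the sketch below computes the RMSE over all cell-by-feature entries of a tiny matrix. It assumes the error is averaged over every entry; the helper name `rmse` and the toy values are hypothetical.

```python
import numpy as np


def rmse(predicted, observed):
    """Root mean squared error over all entries of two equally shaped arrays."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))


# Only one of the four entries differs, by 2: mean squared error is 4 / 4 = 1.
predicted = np.array([[1.0, 2.0], [3.0, 4.0]])
observed = np.array([[1.0, 2.0], [3.0, 6.0]])
print(rmse(predicted, observed))  # 1.0
```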

The metric function used to evaluate the prediction has the following structure (this example employs Python syntax; the R evaluation function is functionally equivalent):

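    def calculate_rmse(adata_mod2, adata_mod2_answer):
        '''Function to calculate RMSE between prediction and solution for the test set

        Params
        ------
        adata_mod2 : anndata.AnnData
            User-submitted prediction for expression of mod2 in cells from the test set
        adata_mod2_answer : anndata.AnnData
            Measured values for expression of mod2 in the test set

        Returns
        -------
        rmse : float
            The root mean squared error between the predicted and observed values for
            all features in the test set.
        '''
        import numpy as np
        from scipy import sparse
        from sklearn.metrics import mean_squared_error

        # The source listing ends at the import; the lines below are an assumed completion.
        def to_dense(x):
            return x.toarray() if sparse.issparse(x) else np.asarray(x)

        mse = mean_squared_error(to_dense(adata_mod2_answer.X), to_dense(adata_mod2.X))
        return float(np.sqrt(mse))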