Our goal is to facilitate the development of novel computational methods that address open problems in the single-cell field. We focus on bridging the gap between experts in computer science and machine learning and the biological problems associated with single-cell data. We want to identify important problems, aggregate standardized datasets, and create a platform to benchmark novel methods against the current state of the art using a common set of test metrics.
We want to build a diverse and inclusive community to support the Open Problems. As such, we welcome any individual who wants to get involved and agrees to follow our Code of Conduct. We are currently supported by the Chan Zuckerberg Initiative and welcome participation from labs across the single-cell and machine learning communities, in particular labs already involved in the Single Cell Biology Seed Networks.
We have broken down the development of Single Cell Open Problems into tasks. A task is a specific, quantifiable problem that addresses an open problem in the single-cell field. An example of a task is Multimodal Data Integration, in which the goal is to align single-cell measurements of different -omics modalities so that we can build increasingly complex characterizations of cell types and states. We evaluate methods for this task using datasets where multimodal data is measured on the exact same cells (e.g. joint single-cell RNA and ATAC profiling) to assess which methods correctly match these measurements without using cell barcodes.
Each task is composed of three components: datasets, metrics, and methods.
All of the code for these tasks is hosted in an open source GitHub repository.
Datasets are collections of single-cell measurements that can be used for benchmarking a task. To add a dataset, you need to set up two components: a data downloader and a task-specific data loader.
The role of the data downloader is to grab the data from a public repository, perform any necessary preprocessing, and return an
The API of a data downloader is:
function dataset(bool test=False) -> AnnData adata
If test is True, then the function should still load the full dataset but return only a small subset of the same data (preferably <200 cells and <500 genes) that can be used for testing purposes. The loaded AnnData objects are then used to evaluate the various methods.
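The downloader contract described above can be sketched as follows. This is a minimal illustration, not the project's implementation: to stay dependency-free it uses a hypothetical `ToyAnnData` stand-in rather than the real AnnData class, and `_load_full_dataset` is a hypothetical placeholder that builds synthetic counts instead of downloading anything. The subset sizes (100 cells, 200 genes) are assumptions chosen to fall under the suggested limits.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAnnData:
    """Hypothetical stand-in for AnnData: a cells x genes matrix plus per-cell annotations."""

    X: list  # rows are cells, columns are genes
    obs: dict = field(default_factory=dict)  # column name -> one value per cell

    @property
    def shape(self):
        n_cells = len(self.X)
        n_genes = len(self.X[0]) if n_cells else 0
        return (n_cells, n_genes)


def _load_full_dataset():
    """Hypothetical placeholder for the download and preprocessing step."""
    n_cells, n_genes = 300, 600
    X = [[(i * j) % 7 for j in range(n_genes)] for i in range(n_cells)]
    return ToyAnnData(X=X, obs={"cell_id": [f"cell{i}" for i in range(n_cells)]})


def dataset(test=False):
    """Load the full dataset; if test is True, return only a small subset of it."""
    adata = _load_full_dataset()
    if test:
        max_cells, max_genes = 100, 200  # assumed sizes, below 200 cells / 500 genes
        keep = min(max_cells, adata.shape[0])
        adata = ToyAnnData(
            X=[row[:max_genes] for row in adata.X[:keep]],
            obs={key: values[:keep] for key, values in adata.obs.items()},
        )
    return adata
```

Note that the test branch subsets the per-cell annotations together with the matrix, so the small object stays internally consistent.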
Next, we need a task-specific data loader that loads the data in a way that’s formatted correctly for a given task. The specific data format for each task can be found in the README.md file in each openproblems/tasks/<task_name> directory. Generally speaking, adata.X should contain UMI counts (or equivalent). For example, the label projection task has the following requirements:
Datasets should contain the following attributes:
adata.obs["labels"] (ground truth cell type labels)
adata.obs["is_train"] (train vs. test boolean)
Note that a single dataset can sometimes be prepared in multiple ways for a single task. For example, in the zebrafish dataset, which comprises single-cell profiles from two different labs, we can create a train/test split based on the lab or by randomly splitting cells regardless of where they were measured.
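The two ways of splitting mentioned above can be sketched as follows. This is an illustration under assumptions: the per-cell annotations are held in a plain dict of lists standing in for adata.obs, and the "lab" column, the function names, the 80% training fraction, and the seed are all hypothetical.

```python
import random


def split_by_lab(labs, train_lab):
    """Mark cells from train_lab as training cells; all other cells become test cells."""
    return [lab == train_lab for lab in labs]


def split_randomly(n_cells, train_fraction=0.8, seed=0):
    """Mark a random subset of cells as training cells, ignoring where they were measured."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_train = int(round(n_cells * train_fraction))
    train_idx = set(rng.sample(range(n_cells), n_train))
    return [i in train_idx for i in range(n_cells)]


# Hypothetical per-cell annotations standing in for adata.obs
obs = {"lab": ["lab_a", "lab_a", "lab_b", "lab_b", "lab_a"]}

is_train_by_lab = split_by_lab(obs["lab"], train_lab="lab_a")
is_train_random = split_randomly(len(obs["lab"]))
```

Either result could be stored as the is_train annotation, which would give two differently prepared versions of the same dataset for the same task.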
Metrics are used to compare the output of each method and are task-specific. You can find a thorough discussion of model evaluation metrics in the sklearn User Guide. The API of a metric is as follows:
function metric(adata) -> float
We encourage developers to submit a variety of metrics for each task since each method for model evaluation has specific biases. For example, a recent comparison of dataset integration methods used 14 different evaluation metrics to compare methods.
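As an illustration of the metric API, here is a minimal sketch of an accuracy metric for the label projection task. It assumes the annotations named elsewhere in this document (labels, labels_pred, is_train) and scores only the test cells; that scoring choice is an assumption. A `SimpleNamespace` holding a dict of lists is used as a hypothetical stand-in for an AnnData object.

```python
from types import SimpleNamespace


def accuracy(adata):
    """Fraction of test cells whose predicted label matches the ground truth."""
    obs = adata.obs
    test_pairs = [
        (true, pred)
        for true, pred, is_train in zip(obs["labels"], obs["labels_pred"], obs["is_train"])
        if not is_train  # only held-out cells are scored
    ]
    if not test_pairs:
        raise ValueError("no test cells to evaluate")
    n_correct = sum(true == pred for true, pred in test_pairs)
    return n_correct / len(test_pairs)


# Hypothetical stand-in for an AnnData object with per-cell annotations
adata = SimpleNamespace(
    obs={
        "labels": ["T cell", "T cell", "B cell", "B cell", "T cell"],
        "labels_pred": ["T cell", "B cell", "B cell", "T cell", "T cell"],
        "is_train": [True, True, False, False, False],
    }
)

score = accuracy(adata)  # two of the three test cells are labeled correctly
```

Accuracy alone can be misleading when cell types are imbalanced, which is one reason to report several metrics per task.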
Methods are the backbone of Single Cell Open Problems, and we hope that this is where most of the development will occur. Like metrics, methods are task-specific. The exact API of each method can be found in each openproblems/tasks/<task_name>/methods/ directory. For example, the label projection task has the following API:
Methods should assign cell type labels to adata.obs['labels_pred'] using only the labels from the training data. The true labels are contained in
function _labelprojection(adata) -> adata.obs['labels_pred']
An underscore (_) precedes the method name because the function is not intended to be called directly during benchmarking. Instead, this base method is combined with preprocessing functions to create a full method, as described in the following section.
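To make the base-method API concrete, here is a minimal sketch of a trivial baseline that assigns the most common training label to every cell. The name `_majority_vote` is hypothetical, the object is a `SimpleNamespace` stand-in rather than a real AnnData, and returning the annotated object is an assumption about the calling convention.

```python
from collections import Counter
from types import SimpleNamespace


def _majority_vote(adata):
    """Baseline: predict the most common training label for every cell."""
    obs = adata.obs
    # Only labels of training cells may be used to fit the method
    train_labels = [
        label for label, is_train in zip(obs["labels"], obs["is_train"]) if is_train
    ]
    if not train_labels:
        raise ValueError("no training cells available")
    majority_label = Counter(train_labels).most_common(1)[0][0]
    obs["labels_pred"] = [majority_label] * len(obs["labels"])
    return adata


# Hypothetical stand-in for an AnnData object with per-cell annotations
adata = SimpleNamespace(
    obs={
        "labels": ["T cell", "B cell", "T cell", "B cell", "B cell"],
        "is_train": [True, True, True, False, False],
    }
)

adata = _majority_vote(adata)
```

Although "B cell" is the most common label overall, the baseline predicts "T cell" because it sees only the three training cells.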
Preprocessing is a major factor affecting the output of a single cell analysis pipeline, yet there is little consensus on the optimal set of preprocessing steps for any given single cell task. Our approach is to provide multiple preprocessing options that can be easily combined with any given method.
Functions for normalization (accounting for varying UMIs per cell) and transformation (scaling for differences in average detection of each gene) can be found in
We currently provide three flavors of normalization:
log_cpm: log-transformed, counts-per-million normalized
sqrt_cpm: sqrt-transformed, counts-per-million normalized
log_scran_pooling: log-transformed, scran normalized
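The arithmetic behind the two counts-per-million flavors can be sketched in plain Python. This is an illustration, not the project's code: it assumes the log transform is log(1 + x), operates on a list of per-cell count lists rather than an AnnData object, and leaves out scran pooling, which depends on an external package.

```python
import math


def counts_per_million(counts):
    """Scale each cell (row) so that its counts sum to one million."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        scale = 1e6 / total if total > 0 else 0.0  # empty cells stay all-zero
        normalized.append([value * scale for value in cell])
    return normalized


def log_cpm(counts):
    """Counts-per-million followed by log(1 + x); the pseudocount of 1 is an assumption."""
    return [[math.log1p(v) for v in cell] for cell in counts_per_million(counts)]


def sqrt_cpm(counts):
    """Counts-per-million followed by a square-root transform."""
    return [[math.sqrt(v) for v in cell] for cell in counts_per_million(counts)]


# Two cells with the same gene proportions but a 10x difference in total UMIs
counts = [[1, 3], [10, 30]]
log_values = log_cpm(counts)
sqrt_values = sqrt_cpm(counts)
```

After normalization the two cells become identical, which is the point of accounting for varying UMIs per cell.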
To define a full method, we need to combine the base method above with a preprocessing function. For example, in the logistic_regression.py script, we define a _logistic_regression() base function and then combine it with preprocessing steps in the
Our current infrastructure evaluates the performance of a method once its code has been merged into the master branch. Before that, we encourage you to use the test version of each dataset to make sure a new method runs properly on all of them. After that, the easiest way to evaluate the method against all the datasets assigned to a task is to submit a pull request on GitHub.
To add a new task, you need to collect all three of the above components of a task: datasets, metrics, and methods.
We’d love to see new tasks added to the framework, and our core group of developers can help get new tasks off the ground. We already have some proposed tasks in our GitHub Issues tracker. Join us on Slack to get started!
If you have any questions or would like to get more involved, please join us on the CZI Science Slack. Once you’ve created an account, look for the