Philosophy

OpenProblems is a living benchmarking platform designed to address and measure progress towards open challenges in single-cell genomics. It follows the Common Task Framework (CTF), which has driven innovation in machine learning research by providing clear definitions and quantifications of progress (Donoho 2017).

The platform combines an open GitHub repository with community-defined tasks, an automated benchmarking workflow, and a website for exploring the results. Each task consists of datasets, methods, and metrics (Figure 1). Datasets define the input and ground truth for a task, methods attempt to solve the task, and metrics evaluate the success of a method on a given dataset.

graph LR
  classDef component fill:#decbe4,stroke:#333
  classDef anndata fill:#d9d9d9,stroke:#333
  loader[/Dataset<br/>loader/]:::component
  dataset[Dataset]:::anndata
  method[/Method/]:::component
  output[Output]:::anndata
  metric[/Metric/]:::component
  score[Score]:::anndata
  loader --> dataset --- method --> output --- metric --> score
Figure 1: The structure of an OpenProblems task. Legend: Grey rectangles represent AnnData .h5ad files, while purple rhomboids represent Viash components.
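
The relationship between datasets, methods, and metrics can be illustrated with a minimal plain-Python sketch. It is not the OpenProblems implementation: real tasks exchange AnnData .h5ad files between Viash components, whereas here plain lists stand in for the data. All names (Dataset, load_toy_dataset, double_method, identity_method, mean_absolute_error, run_task) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Dataset:
    """Input for a method plus the ground truth used for scoring."""
    name: str
    inputs: List[float]
    truth: List[float]


# A method maps the dataset input to an output (here: predictions).
Method = Callable[[List[float]], List[float]]
# A metric compares a method's output with the ground truth.
Metric = Callable[[List[float], List[float]], float]


def load_toy_dataset() -> Dataset:
    """Hypothetical dataset loader: the ground truth is twice the input."""
    return Dataset(name="toy", inputs=[1.0, 2.0, 3.0], truth=[2.0, 4.0, 6.0])


def double_method(inputs: List[float]) -> List[float]:
    """A method that happens to solve the toy task exactly."""
    return [2.0 * x for x in inputs]


def identity_method(inputs: List[float]) -> List[float]:
    """A baseline method that returns its input unchanged."""
    return list(inputs)


def mean_absolute_error(output: List[float], truth: List[float]) -> float:
    """A metric: lower is better, 0.0 is a perfect score."""
    return sum(abs(o - t) for o, t in zip(output, truth)) / len(truth)


def run_task(
    dataset: Dataset,
    methods: Dict[str, Method],
    metrics: Dict[str, Metric],
) -> Dict[Tuple[str, str], float]:
    """Run every method on the dataset and score each output with every metric."""
    scores = {}
    for method_name, method in methods.items():
        output = method(dataset.inputs)
        for metric_name, metric in metrics.items():
            scores[(method_name, metric_name)] = metric(output, dataset.truth)
    return scores


scores = run_task(
    load_toy_dataset(),
    methods={"double": double_method, "identity": identity_method},
    metrics={"mae": mean_absolute_error},
)
```

Because methods and metrics only share the dataset and output structures, a new method or metric can be added without changing any of the others, which is the property the task structure in Figure 1 is meant to provide.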

A benchmarking pipeline in OpenProblems consists of AnnData datasets (Virshup et al. 2021) and Viash components (Cannoodt et al. 2021). Both contribute to consistency and interoperability in OpenProblems by promoting a standardized data format and a modular component structure.

AnnData, short for “Annotated Data”, is a data structure with an accompanying on-disk file format (.h5ad) designed for handling annotated, high-dimensional biological data. In OpenProblems, AnnData serves as the standard data format for both the input and output files of components, ensuring a consistent exchange of data between the components of the benchmarking pipelines.

Viash is a tool that facilitates the creation of modular pipeline components by allowing developers to combine a code block or script with a small amount of metadata. Viash components are used in OpenProblems for dataset loaders, dataset processors, methods, and metrics, enabling developers to focus on the core functionality of their components without worrying about the chosen pipeline framework.

To make community involvement as easy as possible, the OpenProblems infrastructure relies on automated workflows built with GitHub Actions, Nextflow, and AWS Batch, together with Viash components. When a community member adds a task, dataset, method, or metric, the contribution is automatically tested in the cloud. Once all tests pass and the contribution is merged into the main repository, its results are automatically submitted to the OpenProblems website.

The OpenProblems repositories consist mainly of the main repository and the website. For detailed information on how this project is structured, see the reference documentation.

OpenProblems aims to raise the standards for method selection and evaluation in single-cell data science by offering a platform that quantitatively defines open challenges, determines current state-of-the-art solutions, promotes method development, and monitors progress towards these goals. By leveraging Viash components, the platform ensures consistency, modularity, and interoperability, making it a valuable resource for data analysts, method developers, and the single-cell genomics community at large.

References

Cannoodt, Robrecht, Hendrik Cannoodt, Eric Van de Kerckhove, Andy Boschmans, Dries De Maeyer, and Toni Verbeiren. 2021. “Viash: From Scripts to Pipelines.” arXiv. https://doi.org/10.48550/ARXIV.2110.11494.
Donoho, David. 2017. “50 Years of Data Science.” Journal of Computational and Graphical Statistics 26 (4): 745–66. https://doi.org/10.1080/10618600.2017.1384734.
Virshup, Isaac, Sergei Rybakov, Fabian J. Theis, Philipp Angerer, and F. Alexander Wolf. 2021. “Anndata: Annotated Data.” bioRxiv. https://doi.org/10.1101/2021.12.16.473007.