Note: This documentation has recently been migrated. You can find the old documentation here.

Repository

In the OpenProblems codebase, the src directory contains Viash components that manage various aspects of the project, such as common datasets, tasks, and common processing components. The target folder is where artifacts generated from these Viash components are stored, including Dockerized Nextflow modules. The resources_test directory contains the test resources required for running unit tests on the Viash components. It is important to note that these test resources are not stored within the git repository. Instead, they are obtained by running a sync resources script in the scripts directory.

The main data flow of the pipeline is shown in Figure 1. The common dataset components create common dataset objects, which are used in one or more tasks.

Figure 1: Flow of data in OpenProblems benchmarks. All datasets are processed by a common processing pipeline. Further task-specific processing can occur prior to the task-specific benchmarking workflow. Legend: Grey rectangles are AnnData .h5ad files, purple rhomboids are Viash components.
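
The fan-out described above, where one common dataset feeds several tasks, can be sketched as a toy model. This is only an illustration under stated assumptions: plain dictionaries stand in for AnnData .h5ad files, and every function name, dataset ID, and task ID below is hypothetical rather than part of the OpenProblems codebase.

```python
# Toy model of the data flow in Figure 1.
# Assumption: plain dicts stand in for AnnData .h5ad files, and all
# function names and IDs are hypothetical placeholders.

def load_raw_dataset(dataset_id):
    """Dataset loader: produce a raw dataset."""
    return {"dataset_id": dataset_id, "stage": "raw"}

def process_common(raw):
    """Common processing pipeline: raw dataset -> common dataset."""
    return {**raw, "stage": "common"}

def process_for_task(common, task_id):
    """Task-specific processing that runs before benchmarking."""
    return {**common, "stage": "task", "task_id": task_id}

common = process_common(load_raw_dataset("example_dataset"))

# One common dataset can be reused by one or more tasks.
task_datasets = [process_for_task(common, t) for t in ("task_a", "task_b")]
```

Each stage returns a new dictionary instead of modifying its input, which mirrors the way every component in the figure reads one file and writes another.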

Directory Structure

  • src/common: This subdirectory contains helper components that help with creating new components, unit testing other components, or managing task results.
  • src/datasets: The dataset processing pipeline uses dataset loaders to create raw dataset files. The raw dataset files are then processed to generate common dataset files. Common dataset files are used in one or more tasks.
  • src/tasks/<task_id>: Each task should contain a data processor (to transform common datasets into task-specific datasets), methods, control methods (for quality control), and metrics.
  • resources_test: This directory contains the test resources required for running unit tests on the Viash components.
  • target: This directory contains the artifacts built from the Viash components in the src directory.
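
As a rough illustration of the layout above, the following sketch maps a repository-relative path to the area it belongs to. The function name `describe_location` and the returned labels are hypothetical and exist only for this example; the directory names are the ones listed above.

```python
from pathlib import PurePosixPath

def describe_location(path):
    """Map a repository-relative path to the area it belongs to.

    Hypothetical helper for illustration; not part of the codebase.
    """
    parts = PurePosixPath(path).parts
    if not parts:
        return "unknown"
    if parts[0] == "resources_test":
        return "test resources"
    if parts[0] == "target":
        return "built artifacts"
    if parts[0] == "src" and len(parts) > 1:
        if parts[1] == "common":
            return "helper components"
        if parts[1] == "datasets":
            return "dataset processing"
        # src/tasks/<task_id>/... belongs to the task named by <task_id>.
        if parts[1] == "tasks" and len(parts) > 2:
            return f"task: {parts[2]}"
    return "unknown"
```

For example, `describe_location("src/tasks/my_task/methods/foo")` returns `"task: my_task"`, where `my_task` and `foo` are placeholder names.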

Technology stack

  • AnnData: A file format designed for handling annotated, high-dimensional biological data. In OpenProblems, AnnData serves as the standard data format for both input and output files of components, ensuring a consistent and seamless exchange of data between different components of the benchmarking pipelines.

  • AWS: Amazon Web Services provides scalable and cost-effective cloud computing and storage. AWS is used to store datasets and test resources, and to run the Nextflow benchmarking pipelines.

  • CELLxGENE Census: A cloud-based library of single-cell RNA sequencing (scRNA-seq) datasets, developed by the Chan Zuckerberg Initiative. OpenProblems uses the CELLxGENE Census platform to fetch datasets for benchmarking.

  • Docker: Provides a consistent and reproducible environment for building, packaging, and deploying applications and dependencies across different platforms. Docker images are generated by Viash and stored on ghcr.io.

  • GitHub Actions: A continuous integration and continuous deployment (CI/CD) platform integrated with GitHub. This project uses GitHub Actions to continuously build and unit test the components in the project.

  • Nextflow: A workflow management system that simplifies the design, deployment, and execution of complex data processing pipelines, enabling seamless scaling and parallelization. All Nextflow modules are generated by Viash and are stored in the target/nextflow/ folder in the project releases (and on the main_build branch).

  • Python: A widely used, high-level programming language, offering extensive libraries and packages for data manipulation, analysis, and machine learning. Most of the OpenProblems components are written in Python.

  • R: A programming language and software environment for statistical computing and graphics, widely used in data analysis and bioinformatics. OpenProblems also offers support for R components.

  • Viash: A tool that facilitates the creation of modular pipeline components by allowing developers to combine a code block or script with a small amount of metadata. Viash components are used in OpenProblems for dataset loaders, dataset processors, methods, and metrics, enabling developers to focus on the core functionality of their components without worrying about the chosen pipeline framework.