graph LR
  classDef component fill:#decbe4,stroke:#333
  classDef anndata fill:#d9d9d9,stroke:#333
  subgraph Common dataset components
    dataset_loader[/Dataset<br/>loader/]:::component
    raw_dataset[Raw<br/>dataset]:::anndata
    preprocessing[/Pre-processing/]:::component
    common_dataset[Common<br/>dataset]:::anndata
  end
  subgraph Task-specific components
    task_benchmark[/Benchmarking<br/>workflow/]:::component
    results[Results]:::anndata
  end
  dataset_loader --> raw_dataset --> preprocessing --> common_dataset --> task_benchmark --> results
openproblems-v2
In the OpenProblems codebase, the src directory contains Viash components that manage various aspects of the project, such as common datasets, tasks, and common processing components. The target folder is where artifacts generated from these Viash components are stored, including Dockerized Nextflow modules. The resources_test directory contains the test resources required for running unit tests on the Viash components. Note that these test resources are not stored within the git repository. Instead, they are obtained by running the sync test resources component (see “Getting started”).
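As a quick sanity check of the layout described above, the following minimal sketch reports which of the three directories are absent from a checkout. The helper name missing_dirs is hypothetical and not part of the OpenProblems codebase; only the directory names come from the text. The demo runs against a temporary directory so that it does not depend on a real clone.

```python
import tempfile
from pathlib import Path

# Directory names taken from the description above.
EXPECTED_DIRS = ("src", "target", "resources_test")


def missing_dirs(repo_root):
    """Return the expected top-level directories that do not exist yet.

    Hypothetical helper for illustration; not part of OpenProblems.
    """
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Demo on a throwaway directory that only contains src.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "src").mkdir()
    demo_missing = missing_dirs(tmp)

print(demo_missing)
```

Because resources_test is not tracked in git, seeing it reported as missing in a fresh clone is expected until the test resources have been synced.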
The main data flow of the pipeline is shown in Figure 1. The common dataset components create common dataset objects which are used in one or more tasks.
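The data flow can be mimicked with a few plain Python functions. This is only a sketch under loud assumptions: real components are Viash components that exchange AnnData files, whereas here dictionaries stand in for datasets, and the function names, toy counts, normalisation step, and task names are all hypothetical.

```python
# Illustrative stand-ins only: dictionaries replace AnnData files and
# plain functions replace Viash components.

def dataset_loader():
    """Dataset loader: produce a raw dataset (hypothetical toy counts)."""
    return {"counts": [[1, 0, 3], [0, 2, 5]]}


def preprocessing(raw_dataset):
    """Pre-processing: derive a common dataset from the raw dataset."""
    # Toy normalisation (an assumption): scale each row to sum to 1.
    normalized = [[v / sum(row) for v in row] for row in raw_dataset["counts"]]
    return {**raw_dataset, "normalized": normalized}


def task_benchmark(common_dataset, task_name):
    """Benchmarking workflow: consume the common dataset for one task."""
    return {"task": task_name, "n_obs": len(common_dataset["normalized"])}


# The common dataset is created once ...
common_dataset = preprocessing(dataset_loader())

# ... and can then be used in one or more tasks.
results = [task_benchmark(common_dataset, t) for t in ("task_a", "task_b")]
print(results)
```

The point of the split is that loading and pre-processing happen once per dataset, while each task-specific workflow only depends on the shared common dataset.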
Technology stack
AnnData: A file format designed for handling annotated, high-dimensional biological data (Virshup et al. 2021). In OpenProblems, AnnData serves as the standard data format for both input and output files of components, ensuring a consistent and seamless exchange of data between different components of the benchmarking pipelines.
AWS: Amazon Web Services provides scalable and cost-effective cloud computing and storage. AWS is used to store datasets and test resources, and to run the Nextflow benchmarking pipelines.
Docker: Provides a consistent and reproducible environment for building, packaging, and deploying applications and dependencies across different platforms. Docker images are generated by Viash and stored on ghcr.io.
GitHub Actions: A continuous integration and continuous deployment (CI/CD) platform integrated with GitHub. This project uses GitHub Actions to continuously build and unit test the components in the project.
Nextflow: A workflow management system that simplifies the design, deployment, and execution of complex data processing pipelines, enabling seamless scaling and parallelization. All Nextflow modules are generated by Viash and are stored in the target/nextflow/ folder in the project releases (and on the main_build branch).
Python: A widely used, high-level programming language, offering extensive libraries and packages for data manipulation, analysis, and machine learning. Most of the OpenProblems components are written in Python.
R: A programming language and software environment for statistical computing and graphics, widely used in data analysis and bioinformatics. OpenProblems also offers support for R components.
Viash: A tool that facilitates the creation of modular pipeline components by allowing developers to combine a code block or script with a small amount of metadata (Cannoodt et al. 2021). Viash components are used in OpenProblems for dataset loaders, dataset processors, methods, and metrics, enabling developers to focus on the core functionality of their components without worrying about the chosen pipeline framework.
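To make the "script plus a small amount of metadata" idea more concrete, here is a rough sketch of what the script half of a Python component might look like. It assumes the Viash convention of a placeholder block between the VIASH START and VIASH END markers, which holds development defaults and is replaced with the real parameter values when the component is built. The argument names input and output, the JSON stand-in files, and the sorting step are all hypothetical; an actual OpenProblems component would read and write AnnData files instead.

```python
import json
import tempfile
from pathlib import Path

# Setup for this sketch only: create a small input file to work on.
workdir = Path(tempfile.mkdtemp())
(workdir / "input.json").write_text(json.dumps({"values": [3, 1, 2]}))

## VIASH START
# Placeholder parameters for interactive development. The argument names
# are hypothetical; in a real component they are declared in its metadata.
par = {
    "input": str(workdir / "input.json"),
    "output": str(workdir / "output.json"),
}
## VIASH END

# Core functionality: read the input, transform it, write the output.
# JSON is a stand-in here; real components exchange AnnData files.
data = json.loads(Path(par["input"]).read_text())
data["sorted_values"] = sorted(data["values"])
Path(par["output"]).write_text(json.dumps(data))
```

The script itself only refers to the par dictionary, which is what lets the same code run interactively during development and inside a generated Docker or Nextflow wrapper.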