Concepts

OpenProblems benchmarking pipelines consist of Viash components and AnnData HDF5 files.

graph LR
  loader[/Dataset<br/>loader/]:::component
  dataset[Dataset]:::anndata
  method[/Method/]:::component
  output[Output]:::anndata
  metric[/Metric/]:::component
  score[Score]:::anndata
  loader --> dataset --- method --> output --- metric --> score
Figure 1: The structure of an OpenProblems task. Legend: Grey rectangles represent AnnData .h5ad files, while purple rhomboids represent Viash components.

AnnData file format

AnnData, short for “Annotated Data”, is a file format for handling annotated, high-dimensional biological data (Virshup et al. 2021). In the context of OpenProblems, AnnData is used as the standard data format for both input and output files of components. This ensures a consistent and seamless exchange of data between different components of the benchmarking pipelines, allowing developers to focus on the core functionality of their components without worrying about data format compatibility.

Figure 2: AnnData objects have a structured format that includes the main data matrix (X, e.g. gene expression values), annotations of observations (obs, e.g. cell metadata), annotations of variables (var, e.g. gene metadata), and unstructured annotations (uns). This organization makes it easy to work with complex datasets while maintaining data integrity and ensuring a standardized structure across different components.

Files with the .h5ad extension represent AnnData objects stored in an HDF5 file. AnnData objects can be opened in Python using the anndata.read_h5ad() function, and in R using the anndata::read_h5ad() function. Technically it can be read in any language using an HDF5 library.

Common dataset format

In OpenProblems, most of the AnnData datasets will have a subset of the following slots:

AnnData object
  obs: 'celltype', 'batch', 'tissue', 'size_factors'
  var: 'hvg', 'hvg_score'
  obsm: 'X_pca'
  obsp: 'knn_distances', 'knn_connectivities'
  varm: 'pca_loadings'
  layers: 'counts', 'normalized'
  uns: 'dataset_id', 'dataset_name', 'data_url', 'data_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'pca_variance', 'knn'
Note

In OpenProblems, the X slot in the AnnData objects is typically not defined (None in Python, NULL in R). Instead, the raw counts and normalised expression data are defined as layers.

Please visit the reference documentation on the common dataset file format used by OpenProblems for more information on each of the different slots.

Dataset format specifications

OpenProblems tasks contain specifications for the required and optional slots in AnnData files. These specifications are defined in the src/tasks/*/api/file_*.yaml. For more information on how these specifications are formatted, see “Design the API”.

Viash component

A Viash component is a combination of a code block or script and a small amount of metadata that makes it easy to generate pipeline modules, facilitating the separation of component functionality from the pipeline workflow (Cannoodt et al. 2021). This enables developers to create reusable, modular, and robust components for OpenProblems, focusing on the specific functionality without having to worry about the chosen pipeline framework.

A Viash component consists of three main parts: a Viash config, a script, and one or more unit tests (Figure 3).

Figure 3: Viash supports robust pipeline development by allowing users to build their component as a standalone executable (with auto-generated CLI), build a Docker container to run the script inside, or turn the component into a standalone Nextflow module.

Example component

When creating a new Viash component, you will need to create a script and a config file. Below is an example of a simple Viash component written in Python. For a detailed explanation of the structure of a Viash component in more programming languages (e.g. Bash, Python, R), please visit review documentation on how to create a Viash component.

Script

To help with prototyping, a Viash component’s script should be runnable. You can do this by defining a Viash placeholder code block denoted by the ## VIASH START and ## VIASH END comments.

The Viash placeholder should contain a dictionary (or named list) which contains example values for all of the input/output arguments your component needs. During execution, Viash will remove the placeholder and replace it with the actual parameter values at runtime.

script.py
import anndata

## VIASH START
# Note: This codeblock is for prototyping and
# is removed by Viash at runtime.
par = {
  'input': 'test_resource.txt',
  'output': 'output.txt'
}
## VIASH END

# Print par
print(f"par: {par}")

# Read input file
adata = anndata.read_h5ad(par["input"])

# Print adata
print(f"adata: {adata}")

# Write output file
adata.write_h5ad(par["output"])
Tip

It’s possible to store helper functions in separate files. Visit the Viash documentation for more information.

Viash config

Next, we’ll write a Viash config file for this component.

config.vsh.yaml
functionality:
  name: mycomponent
  description: |
    A multiline description.
  arguments:
    - name: "--input"
      type: file
      description: Input h5ad
      example: input.h5ad
      required: true
    - name: "--output"
      type: file
      direction: output
      description: Output g5ad
      example: output.h5ad
      required: true
  resources:
    - type: python_script
      path: script.py
    tests:
    - type: python_script
      path: test.py
platforms:
  - type: docker
      image: "python:3.10"
      setup:
        - type: python
          pypi: anndata
  - type: native
  - type: nextflow
Tip

After adding more arguments to the config, you can update your script using the viash config inject

Component format specifications

OpenProblems tasks contain specifications for the common config fields between the same component types. These specifications are defined in the src/tasks/*/comp_*.yaml files. For more information on how these specifications are formatted, see “Design the API”.

Commonly used commands

Below is a list of common Viash commands to develop and maintain components. Check out this Cheat sheet to get a printable version of the same information. Please visit the Viash guide or reference documentation for more in-depth information.

Run a component

Display help text
viash run path/to/config.vsh.yaml -- --help
Run a component
viash run path/to/config.vsh.yaml -- --input dataset.h5ad --output output.h5ad

Build a component

Generate an executable
viash build path/to/config.vsh.yaml --output bin --setup cachedbuild
Display help text
bin/mycomponent --help
Run executable
bin/mycomponent --input dataset.h5ad --output output.h5ad
Build all components in the Label Projection task
viash ns build --query "label_projection" --setup cachedbuild --parallel

Test a component

Run unit test
viash test path/to/config.vsh.yaml
Run all unit tests in the Label Projection task
viash ns test --query "label_projection" --parallel

Edit script

Update the VIASH STARTVIASH END placeholder
viash config inject path/to/config.vsh.yaml

View component

View config
viash config view path/to/config.vsh.yaml

Common concepts

Loading helper functions from external files

It is possible to add additional resources such as a file containing helper functions or other resources. All you need to do is list those files under the functionality.resources section of your component and refer to them in your script using meta["resources_dir"] + /myresource.txt. Please visit the Viash documentation for concrete examples on how to add helper functions and other resources to your component.

Merging YAMLs (__merge__)

The __merge__ field is used to merge another YAML into a Viash config. One of its uses is in making sure that all of the components in a task has the same API.

Each task in OpenProblems contains strict definitions of the input/output file interface of its components and the file formats of those files. These interfaces are stored as YAML files in the api subdirectory of each task.

References

Cannoodt, Robrecht, Hendrik Cannoodt, Eric Van de Kerckhove, Andy Boschmans, Dries De Maeyer, and Toni Verbeiren. 2021. “Viash: From Scripts to Pipelines.” arXiv. https://doi.org/10.48550/ARXIV.2110.11494.
Virshup, Isaac, Sergei Rybakov, Fabian J. Theis, Philipp Angerer, and F. Alexander Wolf. 2021. “Anndata: Annotated Data.” https://doi.org/10.1101/2021.12.16.473007.