graph LR loader[/Dataset<br/>loader/]:::component dataset[Dataset]:::anndata method[/Method/]:::component output[Output]:::anndata metric[/Metric/]:::component score[Score]:::anndata loader --> dataset --- method --> output --- metric --> score
Concepts
OpenProblems benchmarking pipelines consist of Viash components and AnnData HDF5 files.
AnnData file format
AnnData, short for “Annotated Data”, is a file format for handling annotated, high-dimensional biological data (Virshup et al. 2021). In the context of OpenProblems, AnnData is used as the standard data format for both input and output files of components. This ensures a consistent and seamless exchange of data between different components of the benchmarking pipelines, allowing developers to focus on the core functionality of their components without worrying about data format compatibility.
X
, e.g. gene expression values), annotations of observations (obs
, e.g. cell metadata), annotations of variables (var
, e.g. gene metadata), and unstructured annotations (uns
). This organization makes it easy to work with complex datasets while maintaining data integrity and ensuring a standardized structure across different components.Files with the .h5ad
extension represent AnnData objects stored in an HDF5 file. AnnData objects can be opened in Python using the anndata.read_h5ad()
function, and in R using the anndata::read_h5ad()
function. Technically it can be read in any language using an HDF5 library.
Common dataset format
In OpenProblems, most of the AnnData datasets will have a subset of the following slots:
AnnData object
obs: 'celltype', 'batch', 'tissue', 'size_factors'
var: 'hvg', 'hvg_score'
obsm: 'X_pca'
obsp: 'knn_distances', 'knn_connectivities'
varm: 'pca_loadings'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_name', 'data_url', 'data_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'pca_variance', 'knn'
In OpenProblems, the X slot in the AnnData objects is typically not defined (None
in Python, NULL
in R). Instead, the raw counts and normalised expression data are defined as layers.
Please visit the reference documentation on the common dataset file format used by OpenProblems for more information on each of the different slots.
Dataset format specifications
OpenProblems tasks contain specifications for the required and optional slots in AnnData files. These specifications are defined in the src/tasks/*/api/file_*.yaml
. For more information on how these specifications are formatted, see “Design the API”.
Viash component
A Viash component is a combination of a code block or script and a small amount of metadata that makes it easy to generate pipeline modules, facilitating the separation of component functionality from the pipeline workflow (Cannoodt et al. 2021). This enables developers to create reusable, modular, and robust components for OpenProblems, focusing on the specific functionality without having to worry about the chosen pipeline framework.
A Viash component consists of three main parts: a Viash config, a script, and one or more unit tests (Figure 3).
Example component
When creating a new Viash component, you will need to create a script and a config file. Below is an example of a simple Viash component written in Python. For a detailed explanation of the structure of a Viash component in more programming languages (e.g. Bash, Python, R), please visit review documentation on how to create a Viash component.
Script
To help with prototyping, a Viash component’s script should be runnable. You can do this by defining a Viash placeholder code block denoted by the ## VIASH START
and ## VIASH END
comments.
The Viash placeholder should contain a dictionary (or named list) which contains example values for all of the input/output arguments your component needs. During execution, Viash will remove the placeholder and replace it with the actual parameter values at runtime.
script.py
import anndata
## VIASH START
# Note: This codeblock is for prototyping and
# is removed by Viash at runtime.
= {
par 'input': 'test_resource.txt',
'output': 'output.txt'
}## VIASH END
# Print par
print(f"par: {par}")
# Read input file
= anndata.read_h5ad(par["input"])
adata
# Print adata
print(f"adata: {adata}")
# Write output file
"output"]) adata.write_h5ad(par[
It’s possible to store helper functions in separate files. Visit the Viash documentation for more information.
Viash config
Next, we’ll write a Viash config file for this component.
config.vsh.yaml
functionality:
name: mycomponent
description: |
A multiline description. arguments:
- name: "--input"
type: file
description: Input h5ad
example: input.h5ad
required: true
- name: "--output"
type: file
direction: output
description: Output g5ad
example: output.h5ad
required: true
resources:
- type: python_script
path: script.py
tests:
- type: python_script
path: test.py
platforms:
- type: docker
image: "python:3.10"
setup:
- type: python
pypi: anndata
- type: native
- type: nextflow
After adding more arguments to the config, you can update your script using the viash config inject
Component format specifications
OpenProblems tasks contain specifications for the common config fields between the same component types. These specifications are defined in the src/tasks/*/comp_*.yaml
files. For more information on how these specifications are formatted, see “Design the API”.
Commonly used commands
Below is a list of common Viash commands to develop and maintain components. Check out this Cheat sheet to get a printable version of the same information. Please visit the Viash guide or reference documentation for more in-depth information.
Run a component
Display help text
viash run path/to/config.vsh.yaml -- --help
Run a component
viash run path/to/config.vsh.yaml -- --input dataset.h5ad --output output.h5ad
Build a component
Generate an executable
viash build path/to/config.vsh.yaml --output bin --setup cachedbuild
Display help text
bin/mycomponent --help
Run executable
bin/mycomponent --input dataset.h5ad --output output.h5ad
Build all components in the Label Projection task
viash ns build --query "label_projection" --setup cachedbuild --parallel
Test a component
Run unit test
viash test path/to/config.vsh.yaml
Run all unit tests in the Label Projection task
viash ns test --query "label_projection" --parallel
Edit script
Update the VIASH START
–VIASH END
placeholder
viash config inject path/to/config.vsh.yaml
View component
View config
viash config view path/to/config.vsh.yaml
Common concepts
Loading helper functions from external files
It is possible to add additional resources such as a file containing helper functions or other resources. All you need to do is list those files under the functionality.resources
section of your component and refer to them in your script using meta["resources_dir"] + /myresource.txt
. Please visit the Viash documentation for concrete examples on how to add helper functions and other resources to your component.
Merging YAMLs (__merge__
)
The __merge__
field is used to merge another YAML into a Viash config. One of its uses is in making sure that all of the components in a task has the same API.
Each task in OpenProblems contains strict definitions of the input/output file interface of its components and the file formats of those files. These interfaces are stored as YAML files in the api
subdirectory of each task.