Add a method

This guide will show you how to add a new method to the pipeline.

A method is a technique to solve a specific problem when analysing omics data. Its performance is assessed by comparing it to other methods and control methods.

You will do this by creating a new Viash component. The examples below are given for both Python and R. The Label Projection task is used throughout the guide, so replace any occurrence of "label_projection" with your task of interest.

Tip

Make sure you have followed the “Getting started” guide.

Step 1: Create a new component

Use the create_component component to start creating a new method.

viash run src/common/create_component/config.vsh.yaml -- \
  --task label_projection \
  --type method \
  --name my_method_py \
  --language python
Check inputs
Check language
Check API file
Read API file
Create output dir
Create config
Create script
Done!

This will create a new folder at src/tasks/label_projection/methods/my_method_py containing a Viash config and a script.

src/tasks/label_projection/methods/my_method_py
    ├── script.py                    Script for running the method.
    ├── config.vsh.yaml              Config file for method.
    └── ...                          Optional additional resources.

To create an R method instead, run:

viash run src/common/create_component/config.vsh.yaml -- \
  --task label_projection \
  --type method \
  --name my_method_r \
  --language r
Check inputs
Check language
Check API file
Read API file
Create output dir
Create config
Create script
Done!

This will create a new folder at src/tasks/label_projection/methods/my_method_r containing a Viash config and a script.

src/tasks/label_projection/methods/my_method_r
    ├── script.R                     Script for running the method.
    ├── config.vsh.yaml              Config file for method.
    └── ...                          Optional additional resources.
Tip

Some tasks have multiple method subtypes (e.g. batch_integration), which will require you to use a different value for --type corresponding to the desired method subtype.

Change the --name to a unique name for your method. It must match the regex [a-z][a-z0-9_]* (snake_case).
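
To check a candidate name before running create_component, the regex above can be applied directly. This is a minimal sketch; the helper name is_valid_component_name is hypothetical and is not part of the OpenProblems tooling.

```python
import re

# Pattern quoted in this guide for valid component names.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_component_name(name: str) -> bool:
    """Return True if the whole name matches the required pattern."""
    # fullmatch ensures the entire string matches, not just a prefix.
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["my_method_py", "knn2", "MyMethod", "2fast", "my-method"]:
    print(candidate, is_valid_component_name(candidate))
```

Names that start with a digit, contain uppercase letters or contain hyphens are rejected, because the pattern only allows a lowercase first letter followed by lowercase letters, digits or underscores.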

  • A config file contains metadata of the component and the dependencies required to run it. In steps 2 and 3 we will fill in the required information.
  • A script contains the code to run the method. In step 4 we will edit the script.

Step 2: Fill in metadata

The Viash config contains metadata of your method, which script is used to run it, and the required dependencies.

Generated config file

This is what the config.vsh.yaml generated by the create_component component looks like:

Contents of config.vsh.yaml
# The API specifies which type of component this is.
# It contains specifications for:
#   - The input/output files
#   - Common parameters
#   - A unit test
__merge__: ../../api/comp_method.yaml

functionality:
  # A unique identifier for your component (required).
  # Can contain only lowercase letters, numbers or underscores.
  name: my_method_py

  # Metadata for your component
  info:
    # A relatively short label, used when rendering visualisations (required)
    label: My Method Py
    # A one sentence summary of how this method works (required). Used when 
    # rendering summary tables.
    summary: "FILL IN: A one sentence summary of this method."
    # A multi-line description of how this component works (required). Used
    # when rendering reference documentation.
    description: |
      FILL IN: A (multi-line) description of how this method works.
    # Which normalisation method this component prefers to use (required).
    preferred_normalization: log_cp10k
    # A reference key from the bibtex library at src/common/library.bib (required).
    reference: bibtex_reference_key
    # URL to the documentation for this method (required).
    documentation_url: https://url.to/the/documentation
    # URL to the code repository for this method (required).
    repository_url: https://github.com/organisation/repository

  # Component-specific parameters (optional)
  # arguments:
  #   - name: "--n_neighbors"
  #     type: "integer"
  #     default: 5
  #     description: Number of neighbors to use.

  # Resources required to run the component
  resources:
    # The script of your component (required)
    - type: python_script
      path: script.py
    # Additional resources your script needs (optional)
    # - type: file
    #   path: weights.pt

platforms:
  # Specifications for the Docker image for this component.
  - type: docker
    image: ghcr.io/openproblems-bio/base_python:1.0.2
    # Add custom dependencies here (optional). For more information, see
    # https://viash.io/reference/config/platforms/docker/#setup .
    # setup:
    #   - type: python
    #     packages: scib==1.1.3

  # This platform allows running the component natively
  - type: native
  # Allows turning the component into a Nextflow module / pipeline.
  - type: nextflow
    directives:
      label: [midtime, midmem, midcpu]
Contents of config.vsh.yaml
# The API specifies which type of component this is.
# It contains specifications for:
#   - The input/output files
#   - Common parameters
#   - A unit test
__merge__: ../../api/comp_method.yaml

functionality:
  # A unique identifier for your component (required).
  # Can contain only lowercase letters, numbers or underscores.
  name: my_method_r

  # Metadata for your component
  info:
    # A relatively short label, used when rendering visualisations (required)
    label: My Method R
    # A one sentence summary of how this method works (required). Used when 
    # rendering summary tables.
    summary: "FILL IN: A one sentence summary of this method."
    # A multi-line description of how this component works (required). Used
    # when rendering reference documentation.
    description: |
      FILL IN: A (multi-line) description of how this method works.
    # Which normalisation method this component prefers to use (required).
    preferred_normalization: log_cp10k
    # A reference key from the bibtex library at src/common/library.bib (required).
    reference: bibtex_reference_key
    # URL to the documentation for this method (required).
    documentation_url: https://url.to/the/documentation
    # URL to the code repository for this method (required).
    repository_url: https://github.com/organisation/repository

  # Component-specific parameters (optional)
  # arguments:
  #   - name: "--n_neighbors"
  #     type: "integer"
  #     default: 5
  #     description: Number of neighbors to use.

  # Resources required to run the component
  resources:
    # The script of your component (required)
    - type: r_script
      path: script.R
    # Additional resources your script needs (optional)
    # - type: file
    #   path: weights.pt

platforms:
  # Specifications for the Docker image for this component.
  - type: docker
    image: ghcr.io/openproblems-bio/base_r:1.0.2
    # Add custom dependencies here (optional). For more information, see
    # https://viash.io/reference/config/platforms/docker/#setup .
    # setup:
    #   - type: r
    #     packages: tidyverse

  # This platform allows running the component natively
  - type: native
  # Allows turning the component into a Nextflow module / pipeline.
  - type: nextflow
    directives:
      label: [midtime, midmem, midcpu]

Required metadata fields

Please edit the functionality.info section in the config file to fill in the necessary metadata.

  • .__merge__: The API specifies which type of component this is. It contains specifications for:

    • The input/output files
    • Common parameters
    • A unit test
  • .functionality.name: A unique identifier. Can only contain lowercase letters, numbers or underscores.

  • .functionality.info.label: A unique, human-readable, short label. Used for creating summary tables and visualisations.

  • .functionality.info.summary: A one sentence summary of purpose and methodology. Used for creating overview tables.

  • .functionality.info.description: A longer description (one or more paragraphs). Used for creating reference documentation and supplementary information.

  • .functionality.info.preferred_normalization: Which normalization method a component prefers.

    Each value corresponds to a normalization component in the directory src/datasets/normalization.

  • .functionality.info.reference: A bibtex reference key to the paper where the component is described.

  • .functionality.info.documentation_url: The URL to the documentation of the used software library.

  • .functionality.info.repository_url: The URL to the repository of the used software library.
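
As a quick self-check before running the unit tests, the fields listed above can be scanned for missing values or leftover template text. The sketch below works on a plain dictionary that mirrors the functionality.info section. The helper name unfinished_info_fields and the rule of flagging values that start with "FILL IN" are assumptions made for illustration; the authoritative check is the unit test provided by the API file.

```python
# Fields marked as required in the generated config shown above.
REQUIRED_INFO_FIELDS = [
    "label",
    "summary",
    "description",
    "preferred_normalization",
    "reference",
    "documentation_url",
    "repository_url",
]


def unfinished_info_fields(info: dict) -> list:
    """Return the required fields that are missing, empty or still templated."""
    problems = []
    for field in REQUIRED_INFO_FIELDS:
        value = info.get(field)
        if value is None or not str(value).strip():
            # Field is absent or blank.
            problems.append(field)
        elif str(value).startswith("FILL IN"):
            # Field still holds the generated placeholder text.
            problems.append(field)
    return problems


# Values copied from the generated Python config, before any editing.
example_info = {
    "label": "My Method Py",
    "summary": "FILL IN: A one sentence summary of this method.",
    "description": "FILL IN: A (multi-line) description of how this method works.\n",
    "preferred_normalization": "log_cp10k",
    "reference": "bibtex_reference_key",
    "documentation_url": "https://url.to/the/documentation",
    "repository_url": "https://github.com/organisation/repository",
}

print(unfinished_info_fields(example_info))
```

This sketch only detects the "FILL IN" placeholders. Example values such as bibtex_reference_key or the example URLs still need to be replaced by hand.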

Step 3: Add dependencies

Each component has its own set of dependencies, because different components might have conflicting dependencies.

For your convenience, we have created two base images that can be used for Python or R scripts. These images can be found in the OpenProblems GitHub repo base-images. Click on the packages to view the URL you need to use. You are not required to use these images, but make sure the required packages are installed so that OpenProblems works properly.

Update the setup definition in the platforms section of the config file. This section describes the packages that need to be installed in the Docker image and are required for your method to run.

If you’re using a custom image, use the following minimum setup:

platforms:
  - type: docker
    image: your custom image
    setup:
      - type: apt
        packages:
          - procps
      - type: python
        packages:
          - anndata~=0.8.0
          - scanpy
          - pyyaml
          - requests
          - jsonschema

For a custom image used with an R script:

platforms:
  - type: docker
    image: your custom image
    setup:
      - type: apt
        packages:
          - procps
          - libhdf5-dev
          - libgeos-dev
          - python3
          - python3-pip
          - python3-dev
          - python-is-python3
      - type: python
        packages:
          - rpy2
          - anndata~=0.8.0
          - scanpy
          - pyyaml
          - requests
          - jsonschema
      - type: r
        packages:
          - anndata
          - BiocManager

Please check out this guide for more information on how to add extra package dependencies.

Note

After making changes to the component’s dependencies, you will need to rebuild the Docker container as follows:

viash run src/tasks/label_projection/methods/my_method_py/config.vsh.yaml -- \
  ---setup cachedbuild
Output
[notice] Building container 'ghcr.io/openproblems-bio/label_projection/methods/my_method_py:dev' with Dockerfile

Step 4: Edit script

A component’s script typically has five sections:

  1. Imports and libraries
  2. Argument values
  3. Read input data
  4. Generate results
  5. Write output data to file

This is what the script generated by the create_component component looks like:

Contents of script.py
import anndata as ad

## VIASH START
# Note: this section is auto-generated by viash at runtime. To edit it, make changes
# in config.vsh.yaml and then run `viash config inject config.vsh.yaml`.
par = {
  'input_train': 'resources_test/label_projection/pancreas/train.h5ad',
  'input_test': 'resources_test/label_projection/pancreas/test.h5ad',
  'output': 'output.h5ad'
}
meta = {
  'functionality_name': 'my_method_py'
}
## VIASH END

print('Reading input files', flush=True)
input_train = ad.read_h5ad(par['input_train'])
input_test = ad.read_h5ad(par['input_test'])

print('Preprocess data', flush=True)
# ... preprocessing ...

print('Train model', flush=True)
# ... train model ...

print('Generate predictions', flush=True)
# ... generate predictions ...

print("Write output AnnData to file", flush=True)
output = ad.AnnData(
  obs={
    'label_pred': obs_label_pred
  },
  uns={
    'dataset_id': input_train.uns['dataset_id'],
    'normalization_id': input_train.uns['normalization_id'],
    'method_id': meta['functionality_name']
  }
)
output.write_h5ad(par['output'], compression='gzip')
Contents of script.R
library(anndata)

## VIASH START
par <- list(
  input_train = "resources_test/label_projection/pancreas/train.h5ad",
  input_test = "resources_test/label_projection/pancreas/test.h5ad",
  output = "output.h5ad"
)
meta <- list(
  functionality_name = "my_method_r"
)
## VIASH END

cat("Reading input files\n")
input_train <- anndata::read_h5ad(par[["input_train"]])
input_test <- anndata::read_h5ad(par[["input_test"]])

cat("Preprocess data\n")
# ... preprocessing ...

cat("Train model\n")
# ... train model ...

cat("Generate predictions\n")
# ... generate predictions ...

cat("Write output AnnData to file\n")
output <- anndata::AnnData(
  uns = list(
    dataset_id = input_train$uns[["dataset_id"]],
    normalization_id = input_train$uns[["normalization_id"]],
    method_id = meta[["functionality_name"]]
  ),
  obs = list(
    label_pred = obs_label_pred
  )
)
output$write_h5ad(par[["output"]], compression = "gzip")

The required sections are explained here in more detail:

a. Imports and libraries

In the top section of the script you can define which packages/libraries the method needs. If you add a new or different package, also add the dependency to the setup field in config.vsh.yaml (see Step 3).

b. Argument block

The Viash code block is designed to facilitate prototyping: it lets you execute the script directly by running python script.py (or Rscript script.R for R users). Note that anything between “VIASH START” and “VIASH END” will be removed and replaced with a CLI argument parser when the component is built by Viash.

Here, the par dictionary contains all the arguments defined in the config.vsh.yaml file (including those from the file referenced in __merge__). When adding an argument to the par dictionary, also add it to the arguments section of config.vsh.yaml.
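
As an illustration, the sketch below extends the generated argument block with the n_neighbors argument that appears as a commented-out example in the config. It assumes that the corresponding arguments entry has been uncommented in config.vsh.yaml and that the --n_neighbors argument is exposed under the key n_neighbors. The values inside the block are placeholders used only when running the script directly.

```python
## VIASH START
# Placeholder values, used only when running `python script.py` directly.
par = {
  'input_train': 'resources_test/label_projection/pancreas/train.h5ad',
  'input_test': 'resources_test/label_projection/pancreas/test.h5ad',
  'output': 'output.h5ad',
  # Assumed to mirror the `--n_neighbors` example argument in config.vsh.yaml.
  'n_neighbors': 5
}
meta = {
  'functionality_name': 'my_method_py'
}
## VIASH END

# Read the component-specific parameter like any other argument.
n_neighbors = par['n_neighbors']
print(f"Using n_neighbors={n_neighbors}", flush=True)
```

Keeping the placeholder equal to the default declared in the config means a direct run and a built component start from the same settings.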

c. Read input data

This section reads any input AnnData files passed to the component.

d. Generate results

This is the most important section of your script, as it defines the core functionality provided by the component. It processes the input data to create results for the particular task at hand.
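
For a sense of what this section might produce, here is a deliberately simple sketch for label projection: it predicts the most frequent training label for every test cell. This is a stand-in baseline chosen for illustration, not a method from the OpenProblems repository, and the function name predict_majority_label is hypothetical. It operates on plain Python sequences so that it can be run without any input files.

```python
from collections import Counter


def predict_majority_label(train_labels, n_test):
    """Predict the most frequent training label for each of n_test cells."""
    if len(train_labels) == 0:
        raise ValueError("train_labels must not be empty")
    # most_common(1) returns a list holding one (label, count) pair.
    majority_label, _count = Counter(train_labels).most_common(1)[0]
    return [majority_label] * n_test


# Toy data standing in for the training labels and the number of test cells.
toy_train_labels = ["B cell", "T cell", "B cell", "NK cell"]
obs_label_pred = predict_majority_label(toy_train_labels, n_test=3)
print(obs_label_pred)
```

In the generated script, the same function could be called with the label column of input_train.obs and the number of observations in input_test, assuming the training data carries a label column as in the test resources. The result would then fill the obs_label_pred variable that the template writes to the output.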

e. Write output data to file

The output is stored in an AnnData object and then written to an .h5ad file. The format is defined by the API file referenced in the __merge__ field of the config file.

Step 5: Add resources (optional)

It is possible to add additional resources such as a file containing helper functions or other resources. Please visit this page for more information on how to do this.

Step 6: Try component

Your component’s API file contains the necessary unit tests to check whether your component works and the output is in the correct format.

You can test your component by using the following command:

viash test src/tasks/label_projection/methods/my_method_py/config.vsh.yaml
Output
Running tests in temporary directory: '/tmp/viash_test_knn10923016159816649469'
====================================================================
+/tmp/viash_test_knn10923016159816649469/build_executable/knn ---verbosity 6 ---setup cachedbuild
[notice] Building container 'ghcr.io/openproblems-bio/label_projection/methods/knn:test' with Dockerfile
[info] Running 'docker build -t ghcr.io/openproblems-bio/label_projection/methods/knn:test /tmp/viash_test_knn10923016159816649469/build_executable -f /tmp/viash_test_knn10923016159816649469/build_executable/tmp/dockerbuild-knn-YAMR40/Dockerfile'
#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 608B done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 0.0s

#3 [internal] load metadata for ghcr.io/openproblems-bio/base_python:1.0.2
#3 DONE 0.1s

#4 [1/2] FROM ghcr.io/openproblems-bio/base_python:1.0.2@sha256:65a577a3de37665b7a65548cb33c9153b6881742345593d33fe02919c8d66a20
#4 CACHED

#5 [2/2] RUN pip install --upgrade pip &&   pip install --upgrade --no-cache-dir "scikit-learn" "jsonschema"
#5 0.489 Requirement already satisfied: pip in /usr/local/lib/python3.10/site-packages (23.3.1)
#5 0.581 Collecting pip
#5 0.615   Downloading pip-24.0-py3-none-any.whl.metadata (3.6 kB)
#5 0.627 Downloading pip-24.0-py3-none-any.whl (2.1 MB)
#5 0.667    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 55.3 MB/s eta 0:00:00
#5 1.065 Installing collected packages: pip
#5 1.065   Attempting uninstall: pip
#5 1.066     Found existing installation: pip 23.3.1
#5 1.120     Uninstalling pip-23.3.1:
#5 1.132       Successfully uninstalled pip-23.3.1
#5 2.122 Successfully installed pip-24.0
#5 2.122 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#5 2.573 Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/site-packages (1.3.2)
#5 2.701 Collecting scikit-learn
#5 2.735   Downloading scikit_learn-1.4.0-1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
#5 2.737 Requirement already satisfied: jsonschema in /usr/local/lib/python3.10/site-packages (4.19.2)
#5 2.770 Collecting jsonschema
#5 2.774   Downloading jsonschema-4.21.1-py3-none-any.whl.metadata (7.8 kB)
#5 2.802 Requirement already satisfied: numpy<2.0,>=1.19.5 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.26.1)
#5 2.803 Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.11.3)
#5 2.804 Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (1.3.2)
#5 2.804 Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/site-packages (from scikit-learn) (3.2.0)
#5 2.820 Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/site-packages (from jsonschema) (23.1.0)
#5 2.820 Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/site-packages (from jsonschema) (2023.7.1)
#5 2.821 Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/site-packages (from jsonschema) (0.30.2)
#5 2.822 Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/site-packages (from jsonschema) (0.12.0)
#5 2.871 Downloading scikit_learn-1.4.0-1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
#5 2.970    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.1/12.1 MB 178.4 MB/s eta 0:00:00
#5 2.974 Downloading jsonschema-4.21.1-py3-none-any.whl (85 kB)
#5 2.976    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85.5/85.5 kB 243.8 MB/s eta 0:00:00
#5 3.380 Installing collected packages: scikit-learn, jsonschema
#5 3.380   Attempting uninstall: scikit-learn
#5 3.381     Found existing installation: scikit-learn 1.3.2
#5 3.452     Uninstalling scikit-learn-1.3.2:
#5 3.462       Successfully uninstalled scikit-learn-1.3.2
#5 5.057   Attempting uninstall: jsonschema
#5 5.058     Found existing installation: jsonschema 4.19.2
#5 5.063     Uninstalling jsonschema-4.19.2:
#5 5.067       Successfully uninstalled jsonschema-4.19.2
#5 5.138 Successfully installed jsonschema-4.21.1 scikit-learn-1.4.0
#5 5.139 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#5 DONE 5.4s

#6 exporting to image
#6 exporting layers
#6 exporting layers 2.4s done
#6 writing image sha256:3bcccdbbeb26a74700520210020be5cfee8e85f4476cf40bd0a75c29263c6cfb done
#6 naming to ghcr.io/openproblems-bio/label_projection/methods/knn:test done
#6 DONE 2.4s
====================================================================
+/tmp/viash_test_knn10923016159816649469/test_check_method_config/test_executable
Load config data
Check general fields
Check info fields
Check platform fields
All checks succeeded!
====================================================================
+/tmp/viash_test_knn10923016159816649469/test_run_and_check_adata/test_executable
>> Running test 'run'
>> Checking whether input files exist
>> Running script as test
Load input data
Fit to train data
Predict on test data
Write output to file
>> Checking whether output file exists
>> Reading h5ad files and checking formats
Reading and checking input_train
  AnnData object with n_obs × n_vars = 326 × 500
    obs: 'label', 'batch'
    var: 'hvg', 'hvg_score'
    uns: 'dataset_id', 'normalization_id'
    obsm: 'X_pca'
    layers: 'counts', 'normalized'
Reading and checking input_test
  AnnData object with n_obs × n_vars = 174 × 500
    obs: 'batch'
    var: 'hvg', 'hvg_score'
    uns: 'dataset_id', 'normalization_id'
    obsm: 'X_pca'
    layers: 'counts', 'normalized'
Reading and checking output
  AnnData object with n_obs × n_vars = 174 × 500
    obs: 'batch', 'label_pred'
    var: 'hvg', 'hvg_score'
    uns: 'dataset_id', 'method_id', 'normalization_id'
    obsm: 'X_pca'
    layers: 'counts', 'normalized'
All checks succeeded!
====================================================================
SUCCESS! All 2 out of 2 test scripts succeeded!
Cleaning up temporary directory

Visit “Run tests” for more information on running unit tests and how to interpret common error messages.

You can also run your component on local files using the viash run command. For example:

viash run src/tasks/label_projection/methods/my_method_py/config.vsh.yaml -- \
  --input_train resources_test/label_projection/pancreas/train.h5ad \
  --input_test resources_test/label_projection/pancreas/test.h5ad \
  --output output.h5ad

Next steps

If your component works, please create a pull request.