Add a method

OpenProblems has been built with Viash components, including the methods of a task. A Viash component consists of a script and a Viash config. The config defines the documentation, input/output arguments and dependencies of the script. This page describes how to add a method to an existing task.

Tip

Make sure you have followed the Requirements and getting started pages.

Note

This guide will explain how to add a new method for the Label projection task. Every time you encounter the string label_projection, replace it with your task of interest.

Create a new component

Create a new Viash component by running the following command:

viash run src/common/create_skeleton/config.vsh.yaml -- \
  --task label_projection \
  --comp_type method \
  --name my_method \
  --language python

This will create a new folder at src/label_projection/methods/my_method containing a Viash config (config.vsh.yaml) and a script (script.py or script.R).

src/label_projection/methods/my_method
    ├── script.py/R                  method script
    ├── config.vsh.yaml              config file for method
    └── additional files             helper files, e.g. a tsv file or a unit test specific to the method

config.vsh.yaml

Full documentation on the Viash configuration file is available on the Viash documentation site.

Merge

__merge__: ../../api/comp_method.yaml

This file contains metadata that is needed for all the methods. It will contain the required arguments such as the --input files and the --output files that are common among all the methods in the task.

functionality:
  arguments:
    - name: "--input_train"
      __merge__: anndata_train.yaml
    - name: "--input_test"
      __merge__: anndata_test.yaml
    - name: "--output"
      __merge__: anndata_prediction.yaml
      direction: output
  test_resources:
    - path: ../../../../resources_test/label_projection/pancreas
    - type: python_script
      path: generic_test.py
      text: |
        import anndata as ad
        import subprocess
        from os import path
...

You can find more in-depth information here.

Functionality

This section of the configuration file contains the metadata of the script, including script-specific parameters and a list of resource files.

functionality:
  # a unique name for your method, same as what is being output by the script.
  # must match the regex [a-z][a-z0-9_]*
  name: my_method
  namespace: label_projection/methods
  # metadata for your method
  description: A description for your method.
  info:
    type: method
    method_name: My Method
    preferred_normalization: log_cpm
    variants:
      my_method:
      method_variant:


  # component parameters
  arguments:
    # Method-specific parameters. 
    # Change these to expose parameters of your method to Nextflow (optional)
    - name: "--n_neighbors"
      type: "integer"
      default: 5
      description: Number of neighbors to use.

  # files your script needs
  resources:
    # the script itself
    - type: python_script
      path: script.py
    # additional resources your script needs (optional)
    - type: file
      path: weights.pt
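
The naming rule in the config comments above can be checked before you create the component. This is a minimal sketch, assuming the whole name has to match the pattern; the helper is_valid_component_name is hypothetical and not part of Viash.

```python
import re

# Pattern copied from the config comment: [a-z][a-z0-9_]*
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_component_name(name: str) -> bool:
    """Return True if the whole name matches the pattern (hypothetical helper)."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["my_method", "knn", "My Method", "2nn"]:
        print(candidate, is_valid_component_name(candidate))
```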

In this section of the configuration you should focus on updating the following sections:

  1. Description and Info - Information about the method.
  2. Arguments - Each entry here defines a command-line argument for the script. All arguments are passed to the script in a dictionary called par. You only need to add the method-specific parameters. If you need no arguments beyond those provided by the __merge__ file above, you can remove this section entirely.
  3. Resources - This section lists the files that need to be included in your component. For example, if you’d like to add a file containing model weights called weights.pt, add { type: file, path: weights.pt } to the resources. You can then load the additional resource in your script from the path meta['resources_dir'] + '/weights.pt'.
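
As a small illustration of the Resources item, the sketch below builds the path to weights.pt from meta['resources_dir']. The placeholder value '.' and the helper resource_path are assumptions for prototyping; in a built component Viash supplies the real meta values.

```python
import os

# Placeholder for prototyping only (assumption: weights.pt sits next to the script).
# In a built component, Viash provides the real value of resources_dir.
meta = {
    'resources_dir': '.',
}


def resource_path(file_name: str) -> str:
    """Build the path of a bundled resource file inside resources_dir (hypothetical helper)."""
    return os.path.join(meta['resources_dir'], file_name)


weights_path = resource_path('weights.pt')
print(weights_path)
```

Using os.path.join instead of string concatenation avoids doubled or missing path separators.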

Platform

The Platform section defines how the Viash component is run on the various backend platforms.

  1. Docker (docs)
  2. Nextflow (docs)

Python

# target platforms
platforms:

  # By specifying 'docker' platform, viash will build a standalone
  # executable which uses docker in the back end to run your method.
  - type: docker
    # you need to specify a base image that contains at least bash and python
    image: python:3.10
    # You can specify additional dependencies with 'setup'.
    setup:
      # - type: apt
      #   packages:
      #     - bash
      - type: python
        pip:
          - pyyaml
          - anndata>=0.8

  # By specifying a 'nextflow', viash will also build a viash module
  # which uses the docker container built above to also be able to
  # run your method as part of a nextflow pipeline.
  - type: nextflow
    directives:
      label: [ lowmem, lowcpu ]

R

# target platforms
platforms:

  # By specifying 'docker' platform, viash will build a standalone
  # executable which uses docker in the back end to run your method.
  - type: docker
    # you need to specify a base image that contains at least bash and python
    image: eddelbuettel/r2u:22.04
    # You can specify additional dependencies with 'setup'.
    setup:
      - type: apt
        packages: [ libhdf5-dev, libgeos-dev, python3, python3-pip, python3-dev, python-is-python3, git ]
      - type: python
        pip: [ anndata>=0.8, pyyaml ]
      - type: r
        cran: [ anndata ]

  # By specifying a 'nextflow', viash will also build a viash module
  # which uses the docker container built above to also be able to
  # run your method as part of a nextflow pipeline.
  - type: nextflow
    directives:
      label: [ lowmem, lowcpu ]

The most important part of this section to update is the setup definition, which describes the packages that need to be installed in the Docker container for the method to run. The Viash docs describe the many ways to specify these requirements. The general unit test needs the pyyaml package, so add it to the python setup if the image does not already contain it. When writing an R script, also add the anndata>=0.8 package to the python setup.

You can also change the memory and CPU utilization by editing the Nextflow labels section. Available options are [low|med|high] for each of mem and cpu. The corresponding resource values can be found in the /src/wf_utils/labels.config file.

Note

Tip: After making changes to the component's dependencies, you will need to rebuild the Docker container as follows:

$ viash run path/to/method/config.vsh.yaml -- ---setup cachedbuild
[notice] Running 'docker build -t method:dev /tmp/viashsetupdocker-method-tEX78c'

Note

Tip #2: You can view the Dockerfile that Viash generates from the config file using the ---dockerfile argument:

$ viash run path/to/method/config.vsh.yaml -- ---dockerfile

Script file

The script has three main sections: Imports/libraries, Viash block, and Method.

Imports

This section defines which packages the method expects. If you want to import an additional package, add the import statement here and add the dependency to config.vsh.yaml (see above).

Python

import anndata as ad

R

library(anndata, warn.conflicts = FALSE)

Viash block

This optional code block exists to facilitate prototyping so your script can run when called directly by running python script.py (or Rscript script.R for R users).

Python

## VIASH START
# Anything within this block will be removed by `viash` and will be
# replaced with the parameters as specified in your config.vsh.yaml.
par = {
    # Required arguments for the task
    'input_train': 'train.h5ad',
    'input_test': 'test.h5ad',
    'output': 'output.h5ad',
    # Optional method-specific arguments
    'n_neighbors': 5,
}
meta = { 
  'functionality_name': 'foo' 
}
## VIASH END

R

## VIASH START
# Anything within this block will be removed by `viash` and will be
# replaced with the parameters as specified in your config.vsh.yaml.
par <- list(
    # Required arguments for the task
    input_train = 'train.h5ad',
    input_test = 'test.h5ad',
    output = 'output.h5ad',
    # Optional method-specific arguments
    n_neighbors = 5
)
meta <- list(
  functionality_name = 'foo'
)
## VIASH END

Here, the par dictionary contains all the arguments defined in the config.vsh.yaml file, including those from the __merge__ file.

Method

This code block will typically consist of reading the input files, performing some preprocessing, training a model on the train cells, generating predictions for the test cells, and outputting the predictions as an AnnData file.

## Data reader
print('Reading input files', flush=True)

input_train = ad.read_h5ad(par['input_train'])
input_test = ad.read_h5ad(par['input_test'])

print('Processing data', flush=True)
# ... preprocessing ... 
# ... train model ...
# ... generate predictions ...

# write output to file
adata = ad.AnnData(
    X=y_pred,
    uns={
        'dataset_id': input_train.uns['dataset_id'],
        'method_id': meta['functionality_name'],
    },
)

print('Writing to output files', flush=True)
adata.write_h5ad(par['output'], compression='gzip')
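
The placeholder steps in the script above (train model, generate predictions) could look as follows for a toy nearest-neighbour classifier. This is a sketch in pure Python with made-up data, not one of the methods in the task; knn_predict and the sample values are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train_x, train_labels, test_x, n_neighbors=5):
    """Predict a label per test cell by majority vote among its nearest train cells."""
    predictions = []
    for cell in test_x:
        # Euclidean distance from this test cell to every train cell
        distances = [
            (math.dist(cell, other), label)
            for other, label in zip(train_x, train_labels)
        ]
        distances.sort(key=lambda pair: pair[0])
        nearest = [label for _, label in distances[:n_neighbors]]
        predictions.append(Counter(nearest).most_common(1)[0][0])
    return predictions


# Made-up expression values for four train cells and two test cells
train_x = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
train_labels = ["alpha", "alpha", "beta", "beta"]
test_x = [[0.1, 0.1], [5.1, 5.0]]

y_pred = knn_predict(train_x, train_labels, test_x, n_neighbors=1)
print(y_pred)
```

In a real script the features would come from the input AnnData objects, and n_neighbors would be read from par['n_neighbors'].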

Depending on the task, the output is stored in different locations in the AnnData object. For example, for label_projection it is stored in .obs["label_pred"] and for dimensionality_reduction it is stored in .obsm["X_emb"]. You can find this information in the anndata_*.yaml file referenced by the output argument of the comp_method.yaml file.

For label_projection this is anndata_prediction.yaml:

type: file
description: "The prediction file"
example: "prediction.h5ad"
info:
  short_description: "Prediction"
  slots:
    obs:
      - type: string
        name: label_pred
        description: Predicted labels for the test cells.
    uns:
      - type: string
        name: dataset_id
        description: "A unique identifier for the dataset"
        required: true
      - type: string
        name: normalization_id
        description: "Which normalization was used"
        required: true
      - type: string
        name: method_id
        description: "A unique identifier for the method"
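
Before writing the output, it can help to compare it against the slots listed in the file above. The sketch below does this with plain Python containers. The slot names are transcribed by hand from the YAML, and missing_slots is a hypothetical helper, not part of OpenProblems or Viash.

```python
# Slot names transcribed by hand from the YAML above
spec_slots = {
    "obs": ["label_pred"],
    "uns": ["dataset_id", "normalization_id", "method_id"],
}


def missing_slots(spec, available):
    """Return 'slot/name' strings for every spec entry absent from the output."""
    missing = []
    for slot, names in spec.items():
        present = set(available.get(slot, []))
        missing.extend(f"{slot}/{name}" for name in names if name not in present)
    return missing


# Field names of a hypothetical output object,
# e.g. collected from adata.obs.columns and adata.uns.keys()
output_fields = {
    "obs": ["label_pred"],
    "uns": ["dataset_id", "method_id"],
}

print(missing_slots(spec_slots, output_fields))
```

In this example the check reports uns/normalization_id, a field that is easy to forget when writing the output.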

API

For in-depth documentation, see API file formats and API components specs.

In the API directory there are YAML files that describe the AnnData objects. These files are all named anndata_*.yaml.

When developing a method, it can be useful to check these files to see what the AnnData objects contain. For instance, anndata_dataset.yaml describes what each dataset is required to contain.

If a new field needs to be added…

Nextflow

After developing your method, add it to the Nextflow workflow, which can be found at task_name/workflows/run/main.nf.

Run the command below to build the Docker containers and Nextflow modules for the task:

viash ns build -q task_name --parallel --setup cachedbuild

Then import your method in main.nf next to the existing include statements:

// import control methods
include { true_labels } from "$targetDir/label_projection/control_methods/true_labels/main.nf"
include { majority_vote } from "$targetDir/label_projection/control_methods/majority_vote/main.nf"
include { random_labels } from "$targetDir/label_projection/control_methods/random_labels/main.nf"

// import methods
include { knn } from "$targetDir/label_projection/methods/knn/main.nf"
include { mlp } from "$targetDir/label_projection/methods/mlp/main.nf"
include { logistic_regression } from "$targetDir/label_projection/methods/logistic_regression/main.nf"
include { scanvi } from "$targetDir/label_projection/methods/scanvi/main.nf"
include { seurat_transferdata } from "$targetDir/label_projection/methods/seurat_transferdata/main.nf"
include { xgboost } from "$targetDir/label_projection/methods/xgboost/main.nf"

Also add your method, under the name used in its include statement, to the list of methods further down the file.

// construct a map of methods (id -> method_module)
methods = [ true_labels, majority_vote, random_labels, knn, mlp, logistic_regression, scanvi, seurat_transferdata, xgboost ]
  .collectEntries{method ->
    [method.config.functionality.name, method]
  }

Testing

Check out the in-depth documentation here.

Unit test

You can test your method by using the following command:

viash test path/to/method/config.vsh.yaml

This runs the general unit test defined in comp_method.yaml. If you added a method-specific unit test to the test resources of your config, it is run as well.

Depending on the result you will get a notification on how many tests succeeded or failed:

SUCCESS! All 1 out of 1 test scripts succeeded!
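
A method-specific unit test is a script that runs the built component and checks its output. The sketch below shows the command-building part with placeholder values. It assumes the test script knows the path of the built executable (Viash is assumed here to expose it as meta['executable']; check the Viash documentation), and build_command and run_and_check are hypothetical helpers.

```python
import subprocess
from os import path


def build_command(executable, input_train, input_test, output):
    """Assemble the command line that runs the component with the task's arguments."""
    return [
        executable,
        "--input_train", input_train,
        "--input_test", input_test,
        "--output", output,
    ]


def run_and_check(cmd, output):
    """Run the component and verify that it produced the output file."""
    subprocess.run(cmd, check=True)
    assert path.exists(output), "No output was created."


# Placeholder values; a real test would take these from meta and the test resources.
cmd = build_command("./my_method", "train.h5ad", "test.h5ad", "output.h5ad")
print(" ".join(cmd))
```

A real test would call run_and_check(cmd, "output.h5ad") and then read the output file to check the predicted labels.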

Workflow test

The full workflow can be tested with the following command:

task_name/workflows/run/run_test.sh

Final steps

Add yourself to the task_name/api/authors.yaml file.

When you are finished with your component, create a Pull Request according to the instructions here.

Alternative methods

It is also possible to add a control method to the task. Control methods form the baseline against which the methods of the same task are compared. There are two types of control:

  • Negative control: These methods return no prediction or a random prediction, so they should score poorly on the metrics.
  • Positive control: These methods return the ground truth, so they should score well on the metrics.

Control methods are added in mostly the same way as the methods above. The differences are shown below.
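
The two control types can be illustrated with a toy label projection example. This is a sketch with made-up labels in pure Python, not the code of the control methods in the task; all function names are hypothetical.

```python
import random


def negative_control(train_labels, n_test, seed=0):
    """Random prediction: draw a label per test cell from the labels seen in training."""
    rng = random.Random(seed)
    choices = sorted(set(train_labels))
    return [rng.choice(choices) for _ in range(n_test)]


def positive_control(true_test_labels):
    """Ground truth: return the true labels of the test cells unchanged."""
    return list(true_test_labels)


def accuracy(predicted, truth):
    """Fraction of test cells whose predicted label equals the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


train_labels = ["alpha", "beta", "gamma", "alpha"]
true_test_labels = ["alpha", "gamma", "beta", "beta"]

print(accuracy(positive_control(true_test_labels), true_test_labels))
print(accuracy(negative_control(train_labels, len(true_test_labels)), true_test_labels))
```

The positive control always reaches the best possible score, while the negative control only matches the truth by chance, which brackets the range in which real methods are expected to fall.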

Create a new component

Create a new Viash component by running the following command:

viash run src/common/create_skeleton/config.vsh.yaml -- \
  --task label_projection \
  --comp_type negative_control \
  --name my_method \
  --language python

This will create a new folder at src/label_projection/control_methods/my_method. Set the component type to positive_control instead of negative_control if you want to add a positive control.

src/label_projection/control_methods/my_method
    ├── script.py/R                  method script
    ├── config.vsh.yaml              config file for method
    └── additional files             helper files, e.g. a tsv file or a unit test specific to the method

config.vsh.yaml

The main differences:

  1. namespace -> control_methods
  2. info/type: This will be negative_control or positive_control.

__merge__: ../../api/comp_control_method.yaml
functionality:
  name: my_method
  namespace: label_projection/control_methods
  description: A description for your method.
  info:
    type: negative_control
    method_name: My Method
    variants:
      my_method:
    preferred_normalization: counts
  resources:
    - type: python_script
      path: script.py
platforms:
  - type: docker
    image: "python:3.8"
    setup:
      - type: python
        packages:
          - "anndata>=0.8"
  - type: nextflow
    directives: 
      label: [ lowmem, lowcpu ]