Add a method
OpenProblems is built with Viash components, including the methods of a task. A Viash component consists of a script and a Viash config. The config defines the documentation, input/output arguments and dependencies of the script. This page describes how to add a method to an existing task.
Make sure you have followed the Requirements and getting started pages.
This guide will explain how to add a new method for the Label projection task. Every time you encounter the string label_projection, replace it with your task of interest.
Create a new component
Create a new Viash component by running the following command:
viash run src/common/create_skeleton/config.vsh.yaml -- \
--task label_projection \
--comp_type method \
--name my_method \
--language python
This will create a new folder at src/label_projection/methods/my_method containing a Viash config (config.vsh.yaml) and a script (script.py or script.R).
src/label_projection/methods/my_method
├── script.py/R method script
├── config.vsh.yaml config file for method
└── additional files Helper files like e.g. tsv file, unit test specific for method, ...
config.vsh.yaml
Full documentation on the Viash configuration file is available on the Viash documentation site.
Merge
__merge__: ../../api/comp_method.yaml
This file contains metadata that is needed for all the methods. It contains the required arguments, such as the --input files and the --output files, that are common among all the methods in the task.
functionality:
  arguments:
    - name: "--input_train"
      __merge__: anndata_train.yaml
    - name: "--input_test"
      __merge__: anndata_test.yaml
    - name: "--output"
      __merge__: anndata_prediction.yaml
      direction: output
  test_resources:
    - path: ../../../../resources_test/label_projection/pancreas
    - type: python_script
      path: generic_test.py
      text: |
        import anndata as ad
        import subprocess
        from os import path...
You can find more in-depth information here.
Functionality
This section of the configuration file contains the metadata of the component, including script-specific parameters and a list of resource files.
functionality:
  # a unique name for your method, same as what is being output by the script.
  # must match the regex [a-z][a-z0-9_]*
  name: my_method
  namespace: label_projection/methods

  # metadata for your method
  description: A description for your method.
  info:
    type: method
    method_name: My Method
    preferred_normalization: log_cpm
    variants:
      my_method:
      method_variant:

  # component parameters
  arguments:
    # Method-specific parameters.
    # Change these to expose parameters of your method to Nextflow (optional)
    - name: "--n_neighbors"
      type: "integer"
      default: 5
      description: Number of neighbors to use.

  # files your script needs
  resources:
    # the script itself
    - type: python_script
      path: script.py
    # additional resources your script needs (optional)
    - type: file
      path: weights.pt
In this section of the configuration you should focus on updating the following parts:
- Description and Info: Information about the method.
- Arguments: Each entry here defines a command-line argument for the script. These arguments are all passed to the script in the form of a dictionary called par. You only need to add the method-specific parameters. If no arguments are required beyond the ones provided in the __merge__ file above, you can remove this section entirely.
- Resources: This section describes the files that need to be included in your component. For example, if you'd like to add a file containing model weights called weights.pt, add { type: file, path: weights.pt } to the resources. You can then load the additional resource in your script at the path meta['resources_dir'] + '/weights.pt'.
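As a sketch of how this looks from inside the script (the values below are hypothetical stand-ins; Viash injects the real par and meta mappings at runtime based on config.vsh.yaml and the command-line arguments):

```python
import os

# Hypothetical stand-ins: at runtime, viash injects `par` and `meta`
# based on config.vsh.yaml and the command-line arguments.
par = {
    "input_train": "train.h5ad",
    "input_test": "test.h5ad",
    "output": "output.h5ad",
    "n_neighbors": 5,  # the method-specific argument defined above
}
meta = {
    "functionality_name": "my_method",
    "resources_dir": "/viash/resources",  # hypothetical path
}

# An additional resource such as weights.pt is resolved relative to
# the resources directory:
weights_path = os.path.join(meta["resources_dir"], "weights.pt")
print(weights_path)
```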
Platform
The Platform section defines how the Viash component is run on various backend platforms.
# target platforms
platforms:
  # By specifying the 'docker' platform, viash will build a standalone
  # executable which uses docker in the back end to run your method.
  - type: docker
    # you need to specify a base image that contains at least bash and python
    image: python:3.10
    # You can specify additional dependencies with 'setup'.
    setup:
      # - type: apt
      #   packages:
      #     - bash
      - type: python
        pip:
          - pyyaml
          - anndata>=0.8
  # By specifying 'nextflow', viash will also build a viash module
  # which uses the docker container built above to also be able to
  # run your method as part of a nextflow pipeline.
  - type: nextflow
    directives:
      label: [ lowmem, lowcpu ]
# target platforms
platforms:
  # By specifying the 'docker' platform, viash will build a standalone
  # executable which uses docker in the back end to run your method.
  - type: docker
    # you need to specify a base image that contains at least bash and python
    image: eddelbuettel/r2u:22.04
    # You can specify additional dependencies with 'setup'.
    setup:
      - type: apt
        packages: [ libhdf5-dev, libgeos-dev, python3, python3-pip, python3-dev, python-is-python3, git ]
      - type: python
        pip: [ anndata>=0.8, pyyaml ]
      - type: r
        cran: [ anndata ]
  # By specifying 'nextflow', viash will also build a viash module
  # which uses the docker container built above to also be able to
  # run your method as part of a nextflow pipeline.
  - type: nextflow
    directives:
      label: [ lowmem, lowcpu ]
The most important part of this section to update is the setup definition, which describes the packages that need to be installed in the docker container for the method to run. There are many different ways of specifying these requirements, described in the Viash docs. The python setup must include the pyyaml package (if not already in the image), since it is required by the general unit tests. When creating an R script, also add the anndata>=0.8 package to the python setup.
You can also change the memory and CPU utilization by editing the Nextflow labels section. Available options are [low|med|high] for each of mem and cpu. The corresponding resource values can be found in the src/wf_utils/labels.config file.
Tip: After making changes to the component's dependencies, you will need to rebuild the docker container as follows:
$ viash run path/to/method/config.vsh.yaml -- ---setup cachedbuild
[notice] Running 'docker build -t method:dev /tmp/viashsetupdocker-method-tEX78c'
Tip #2: You can view the Dockerfile that Viash generates from the config file using the ---dockerfile argument:
$ viash run path/to/method/config.vsh.yaml -- ---dockerfile
Script file
The script has three main sections: Imports/libraries, Viash block, and Method.
Imports
This section defines which packages the method expects. If you want to import a new package, add the import statement here and add the dependency to config.vsh.yaml (see above).
import anndata as ad
library(anndata, warn.conflicts = FALSE)
Viash block
This optional code block exists to facilitate prototyping, so your script can run when called directly by running python script.py (or Rscript script.R for R users).
## VIASH START
# Anything within this block will be removed by `viash` and will be
# replaced with the parameters as specified in your config.vsh.yaml.
par = {
    # Required arguments for the task
    'input_train': 'train.h5ad',
    'input_test': 'test.h5ad',
    'output': 'output.h5ad',
    # Optional method-specific arguments
    'n_neighbors': 5,
}
meta = {
    'functionality_name': 'foo'
}
## VIASH END
## VIASH START
# Anything within this block will be removed by `viash` and will be
# replaced with the parameters as specified in your config.vsh.yaml.
par <- list(
  # Required arguments for the task
  input_train = 'train.h5ad',
  input_test = 'test.h5ad',
  output = 'output.h5ad',
  # Optional method-specific arguments
  n_neighbors = 5
)
meta <- list(
  functionality_name = 'foo'
)
## VIASH END
Here, the par dictionary contains all the arguments defined in the config.vsh.yaml file, including those from the __merge__ file.
Method
This code block will typically consist of reading the input files, performing some preprocessing, training a model on the train cells, generating predictions for the test cells, and outputting the predictions as an AnnData file.
## Data reader
print('Reading input files', flush=True)
input_train = ad.read_h5ad(par['input_train'])
input_test = ad.read_h5ad(par['input_test'])

print('Processing data', flush=True)
# ... preprocessing ...
# ... train model ...
# ... generate predictions ...

# write output to file
adata = ad.AnnData(
    X=y_pred,
    uns={
        'dataset_id': input_train.uns['dataset_id'],
        'method_id': meta['functionality_name'],
    },
)

print('Writing to output file', flush=True)
adata.write_h5ad(par['output'], compression='gzip')
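To make the elided steps concrete, here is a hypothetical, dependency-free sketch of the "train model / generate predictions" part, using a toy 1-nearest-neighbour classifier; a real method would of course operate on the AnnData matrices:

```python
# Toy training data: expression profiles and their labels (hypothetical).
train_X = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1)]
train_y = ["alpha", "beta", "beta"]
test_X = [(0.8, 0.2), (0.1, 0.9)]

def predict(x):
    # squared Euclidean distance to every training cell
    dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
    # label of the nearest training cell
    return train_y[dists.index(min(dists))]

y_pred = [predict(x) for x in test_X]
print(y_pred)  # → ['beta', 'alpha']
```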
Depending on the task, the output is stored in a different location in the AnnData object: e.g. for label_projection it is located in .obs["label_pred"], and for dimensionality_reduction it is stored in .obsm["X_emb"]. You can find this information in the anndata_*.yaml file referenced by the output argument in the comp_method.yaml file.
For label_projection this will be anndata_prediction.yaml:
type: file
description: "The prediction file"
example: "prediction.h5ad"
info:
  short_description: "Prediction"
  slots:
    obs:
      - type: string
        name: label_pred
        description: Predicted labels for the test cells.
    uns:
      - type: string
        name: dataset_id
        description: "A unique identifier for the dataset"
        required: true
      - type: string
        name: normalization_id
        description: "Which normalization was used"
        required: true
      - type: string
        name: method_id
        description: "A unique identifier for the method"
API
For in-depth documentation see API file formats and API component specs.
In the API directory there are yaml files that contain information about the anndata objects. These files all start with anndata_*.
When developing a method, it can be useful to check these files to see what the anndata objects consist of. For instance, the anndata_dataset.yaml file has information on what the dataset is required to contain.
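As an illustration, here is a small hypothetical helper (not part of the repository) that checks an output against the required slots of such a spec; the spec is inlined as a dict, as if parsed from anndata_prediction.yaml:

```python
# Spec inlined as a dict, as if parsed from anndata_prediction.yaml
# (hypothetical simplification of the 'slots' section).
spec = {
    "obs": [{"name": "label_pred", "required": True}],
    "uns": [
        {"name": "dataset_id", "required": True},
        {"name": "normalization_id", "required": True},
        {"name": "method_id", "required": False},
    ],
}

# Toy output, standing in for the slots of the produced AnnData object.
output = {
    "obs": {"label_pred": ["beta", "alpha"]},
    "uns": {"dataset_id": "pancreas", "normalization_id": "log_cpm"},
}

def missing_slots(spec, output):
    """Return '<struct>/<name>' for every required slot the output lacks."""
    missing = []
    for struct, slots in spec.items():
        for slot in slots:
            if slot.get("required") and slot["name"] not in output.get(struct, {}):
                missing.append(f"{struct}/{slot['name']}")
    return missing

print(missing_slots(spec, output))  # → []
```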
If a new field needs to be added…
Nextflow
After developing your method, add it to the Nextflow workflow that can be found at task_name/workflows/run/main.nf.
Execute the command below to create and build the docker containers and nextflow targets:
viash ns build -q task_name --parallel --setup cachedbuild
// import control methods
include { true_labels } from "$targetDir/label_projection/control_methods/true_labels/main.nf"
include { majority_vote } from "$targetDir/label_projection/control_methods/majority_vote/main.nf"
include { random_labels } from "$targetDir/label_projection/control_methods/random_labels/main.nf"
// import methods
include { knn } from "$targetDir/label_projection/methods/knn/main.nf"
include { mlp } from "$targetDir/label_projection/methods/mlp/main.nf"
include { logistic_regression } from "$targetDir/label_projection/methods/logistic_regression/main.nf"
include { scanvi } from "$targetDir/label_projection/methods/scanvi/main.nf"
include { seurat_transferdata } from "$targetDir/label_projection/methods/seurat_transferdata/main.nf"
include { xgboost } from "$targetDir/label_projection/methods/xgboost/main.nf"
Also add your method further down the file with the include name you have given.
// construct a map of methods (id -> method_module)
methods = [ true_labels, majority_vote, random_labels, knn, mlp, logistic_regression, scanvi, seurat_transferdata, xgboost ]
  .collectEntries{ method ->
    [method.config.functionality.name, method]
  }
Testing
Check out the in-depth documentation here.
Unit test
You can test your method by using the following command:
viash test path/to/method/config.vsh.yaml
There is a general unit test, defined in comp_method.yaml, that will be executed. If you added a unit test specific to your method, it will also be executed, provided it was added correctly.
Depending on the result you will get a notification on how many tests succeeded or failed:
SUCCESS! All 1 out of 1 test scripts succeeded!
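A method-specific unit test typically follows the same pattern as the generic test: build the command line for the component executable, run it, and check its output. A sketch of that pattern (the meta values and file names here are hypothetical stand-ins; viash injects the real meta when the test runs):

```python
import subprocess
from os import path

# Hypothetical stand-in: viash injects `meta` when the test runs.
meta = {"executable": "./my_method", "resources_dir": "resources_test"}

cmd = [
    meta["executable"],
    "--input_train", meta["resources_dir"] + "/label_projection/pancreas/train.h5ad",
    "--input_test", meta["resources_dir"] + "/label_projection/pancreas/test.h5ad",
    "--output", "output.h5ad",
]

# In a real test you would execute the component and assert on its output:
# subprocess.run(cmd, check=True)
# assert path.exists("output.h5ad")
print(" ".join(cmd))
```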
Workflow test
Testing of the full workflow can be done using the following command:
task_name/workflows/run/run_test.sh
Final steps
Add yourself to the task_name/api/authors.yaml file.
When you are finished with your component, create a Pull Request according to the instructions here.
Alternative methods
It is also possible to add a control method to the task. These methods form the baseline against which the methods of the same task are compared, to see how they perform. These controls can be divided into 2 types:
- Negative control: These methods produce no prediction or a random one, so they should score poorly on the metrics.
- Positive control: These methods make use of the ground truth, so they should achieve the best score on the metrics.
Most of the time these methods are added in the same way as the methods above. The differences are shown below.
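As an illustration, here is a hypothetical, dependency-free sketch of the core of a negative control that predicts random labels drawn from the training labels:

```python
import random

# Negative control sketch: ignore the test data entirely and draw
# random labels from the labels seen in training (toy values).
random.seed(0)  # for reproducibility of this toy example

train_labels = ["alpha", "beta", "beta", "delta"]
n_test_cells = 3

label_pred = [random.choice(train_labels) for _ in range(n_test_cells)]
print(label_pred)
```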
Create a new component
Create a new Viash component by running the following command:
viash run src/common/create_skeleton/config.vsh.yaml -- \
--task label_projection \
--comp_type negative_control \
--name my_method \
--language python
This will create a new folder at src/label_projection/control_methods/my_method. You will need to change the --comp_type to positive_control depending on which type you want to add.
src/label_projection/control_methods/my_method
├── script.py/R method script
├── config.vsh.yaml config file for method
└── additional files Helper files like e.g. tsv file, unit test specific for method, ...
config.vsh.yaml
The main differences:
- namespace → control_methods
- info/type: This will be negative_control or positive_control.
__merge__: ../../api/comp_control_method.yaml
functionality:
  name: my_method
  namespace: label_projection/control_methods
  description: A description for your method.
  info:
    type: negative_control
    method_name: My Method
    variants:
      my_method:
    preferred_normalization: counts
  resources:
    - type: python_script
      path: script.py
platforms:
  - type: docker
    image: "python:3.8"
    setup:
      - type: python
        packages:
          - "anndata>=0.8"
  - type: nextflow
    directives:
      label: [ lowmem, lowcpu ]