Dataset processor

In this section, we’ll create the Dataset processor component and generate the first test resources. The Dataset processor component uses one of the pre-defined Common datasets to create the AnnData files used in your task. These AnnData objects can then serve as the first test resources for testing the Method and Metric components.

Why?

The Dataset processor component splits the Common dataset into separate AnnData objects, so that the Method and Metric components can only observe the information specified by their file format specification files. This not only safeguards against data leakage, but also ensures that the Method component can’t accidentally change parts of the data structure that the Metric component relies on.

How?

The Dataset processor, like any component in OpenProblems, is a Viash component, which consists of a config and a script.

Step 1: Create the Viash config

Start by creating the Viash config. Luckily, we already created its file format and argument list in the previous section, so we can just import the interface using the __merge__ field:

src/tasks/<task_id>/process_dataset/config.vsh.yaml
__merge__: ../api/comp_process_dataset.yaml  # (1)
functionality:
  name: "process_dataset"  # (2)
  resources:  # (3)
    - type: python_script  # (4)
      path: script.py
    - path: /src/common/helper_functions/subset_anndata.py  # (5)
platforms:
  - type: docker  # (6)
    image: "python:3.10"
    setup:
      - type: python
        packages:
          - pyyaml
          - "anndata~=0.8.0"
  - type: nextflow  # (7)
    directives:
      label: [ highmem, highcpu ]

1. The interface to use for this component.
2. The name of the component. In this case, this should always be set to process_dataset.
3. Resources are files that provide the logic of the component.
4. We’ll create the Python script in the next step.
5. Optional helper functions you can use in your script.
6. A Docker platform to make the component reproducible.
7. A Nextflow platform to allow turning this component into a Nextflow module (and also a standalone pipeline).
src/tasks/<task_id>/process_dataset/config.vsh.yaml
__merge__: ../api/comp_process_dataset.yaml  # (1)
functionality:
  name: "process_dataset"  # (2)
  resources:  # (3)
    - type: r_script  # (4)
      path: script.R
    - path: /src/common/helper_functions/subset_anndata.R  # (5)
platforms:
  - type: docker  # (6)
    image: "eddelbuettel/r2u:22.04"
    setup:
      - type: r
        cran: [ anndata, tidyverse ]
      - type: apt
        packages: [ libhdf5-dev, libgeos-dev, python3, python3-pip, python3-dev, python-is-python3 ]
      - type: python
        pip: [ anndata~=0.8.0 ]
  - type: nextflow  # (7)
    directives:
      label: [ highmem, highcpu ]

1. The interface to use for this component.
2. The name of the component. In this case, this should always be set to process_dataset.
3. Resources are files that provide the logic of the component.
4. We’ll create the R script in the next step.
5. Optional helper functions you can use in your script.
6. A Docker platform to make the component reproducible.
7. A Nextflow platform to allow turning this component into a Nextflow module (and also a standalone pipeline).
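Once the config is in place, you can inspect the resolved config to verify that the interface from the API file was merged in correctly. The command below uses Viash’s config view subcommand, which prints the parsed config (including the arguments pulled in via __merge__):

viash config view src/tasks/<task_id>/process_dataset/config.vsh.yaml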

Step 2: Create the script

Next, create the script which will split the common dataset into a dataset object (for the methods) and a solution object (for the metrics).

src/tasks/<task_id>/process_dataset/script.py
import anndata as ad

## VIASH START
par = {
    "input": "resources_test/common/pancreas/dataset.h5ad",
    "output_dataset": "dataset.h5ad",
    "output_solution": "solution.h5ad",
}
## VIASH END

print(">> Load common dataset", flush=True)
adata = ad.read_h5ad(par["input"])

# <- implement logic for splitting the common dataset
#    into a dataset and a solution object here.

print(">> Create dataset for methods", flush=True)
output_dataset = ad.AnnData(
  # <- add the data you wish to store in the output_dataset object here
)

print(">> Create solution object for metrics", flush=True)
output_solution = ad.AnnData(
  # <- add the data you wish to store in the output_solution object here
)

print(">> Write to disk", flush=True)
output_dataset.write_h5ad(par["output_dataset"])
output_solution.write_h5ad(par["output_solution"])
src/tasks/<task_id>/process_dataset/script.R
library(anndata)

## VIASH START
par <- list(
  input = "resources_test/common/pancreas/dataset.h5ad",
  output_dataset = "dataset.h5ad",
  output_solution = "solution.h5ad"
)
## VIASH END

cat(">> Load common dataset\n")
adata <- anndata::read_h5ad(par[["input"]])

# <- implement logic for splitting the common dataset
#    into a dataset and a solution object here.

cat(">> Create dataset for methods\n")
output_dataset <- anndata::AnnData(
  # <- add the data you wish to store in the output_dataset object here
)

cat(">> Create solution object for metrics\n")
output_solution <- anndata::AnnData(
  # <- add the data you wish to store in the output_solution object here
)

print(">> Write to disk\n")
output_dataset$write_h5ad(par[["output_dataset"]])
output_solution$write_h5ad(par[["output_solution"]])
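How you populate these two objects depends entirely on your task. As a purely hypothetical illustration (shown in Python, slotting into the script above, and assuming the common dataset contains a counts layer plus batch, celltype and dataset_id annotations, none of which are guaranteed for your task), a split for a label-prediction style task might look like this:

# Hypothetical split logic; the slot names below are illustrative only and
# should be replaced by the slots defined in your own API files.
output_dataset = ad.AnnData(
  X=adata.layers["counts"],                     # data the methods are allowed to see
  obs=adata.obs[["batch"]],                     # assumed public annotation
  var=adata.var[[]],                            # keep the var index, no hidden columns
  uns={"dataset_id": adata.uns["dataset_id"]},
)

output_solution = ad.AnnData(
  obs=adata.obs[["celltype"]],                  # assumed ground truth, hidden from methods
  uns={"dataset_id": adata.uns["dataset_id"]},
)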
Tip

You can use the helper functions read_config_slots_info and subset_anndata defined in src/common/helper_functions/subset_anndata.py to easily subset the AnnData so that it only contains the slots specified by the corresponding API files. Examples of tasks where this is used: Dimensionality reduction, Label projection.
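A rough sketch of how such a helper-based approach could look. The function names come from the tip above, but their exact signatures, as well as the meta["config"] and meta["resources_dir"] fields, are assumptions here; check src/common/helper_functions/subset_anndata.py and the Viash documentation for the real interface:

import sys
import anndata as ad

# subset_anndata.py is listed as a resource in the config, so Viash places it
# next to the script; meta["resources_dir"] is assumed to point there.
sys.path.append(meta["resources_dir"])
from subset_anndata import read_config_slots_info, subset_anndata  # assumed names

adata = ad.read_h5ad(par["input"])

# Assumed behaviour: read the slot definitions for each output argument from
# the component's merged config, then keep only those slots in each output.
slot_info = read_config_slots_info(meta["config"])
output_dataset = subset_anndata(adata, slot_info["output_dataset"])
output_solution = subset_anndata(adata, slot_info["output_solution"])

output_dataset.write_h5ad(par["output_dataset"])
output_solution.write_h5ad(par["output_solution"])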

Step 3: Create test resources

Next, we’ll create the first part of the test resources by writing a test resource script: a Bash script which runs the process_dataset component.

src/tasks/<task_id>/resource_test_script/pancreas.sh
#!/bin/bash

# get the root of the repository
REPO_ROOT=$(git rev-parse --show-toplevel)

# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"

TASK_ID="<task_id>"
RAW_DATA=resources_test/common/pancreas/dataset.h5ad
DATASET_DIR=resources_test/$TASK_ID/pancreas

if [ ! -f $RAW_DATA ]; then
    echo "Error! Could not find raw data"
    exit 1
fi

mkdir -p $DATASET_DIR

# Run the process dataset component
viash run "src/tasks/$TASK_ID/process_dataset/config.vsh.yaml" -- \
    --input "$RAW_DATA" \
    --output_dataset "$DATASET_DIR/dataset.h5ad" \
    --output_solution "$DATASET_DIR/solution.h5ad"

# TODO: uncomment and update this code block when a method component has been added to your task
# # Run one method
# viash run "src/tasks/$TASK_ID/methods/<method_id>/config.vsh.yaml" -- \
#     --input "$DATASET_DIR/dataset.h5ad" \
#     --output "$DATASET_DIR/prediction.h5ad"

# TODO: uncomment and update this code block when a metric component has been added to your task
# # Run one metric
# viash run src/tasks/$TASK_ID/metrics/<metric_id>/config.vsh.yaml -- \
#     --input_prediction $DATASET_DIR/prediction.h5ad \
#     --input_solution $DATASET_DIR/solution.h5ad \
#     --output $DATASET_DIR/score.h5ad
Note

You’ll have to substitute the correct value for <task_id>, and update the output arguments to match those defined in the API file.

Now run the script to generate the first couple of test resources:

chmod +x src/tasks/<task_id>/resource_test_script/pancreas.sh
src/tasks/<task_id>/resource_test_script/pancreas.sh
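To sanity-check the generated files, you can load them in a Python session and inspect their structure. A minimal check, assuming anndata is installed locally (substitute <task_id> as before):

import anndata as ad

# Quick inspection of the generated test resources.
dataset = ad.read_h5ad("resources_test/<task_id>/pancreas/dataset.h5ad")
solution = ad.read_h5ad("resources_test/<task_id>/pancreas/solution.h5ad")

print(dataset)   # should contain only the slots allowed by your dataset file format spec
print(solution)  # should contain the ground-truth slots used by the metrics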

Next steps

Having created the dataset processor and the first test resources, you can now start creating method and metric components. We recommend creating one method component and then one metric component first, so you can use those components to generate the test resources needed for testing all of the components.