Dataset processor

In this section, we’ll create the Dataset processor component and generate the first test resources. The Dataset processor component uses one of the pre-defined Common datasets to create the AnnData files used in your task. These AnnData objects can then serve as the first test resources for testing the Method and Metric components.

Why?

The Dataset processor component splits the Common dataset into separate AnnData objects, so that the Method and Metric components can only observe the information specified by their file format specification files. This not only safeguards against data leakage, but also ensures that the Method component can’t accidentally change parts of the data structure that the Metric component relies on.

How?

The Dataset processor, like any component in OpenProblems, is a Viash component, which consists of a config and a script.

Step 1: Create the Viash config

Start by creating the Viash config. Luckily, we already created its file format and argument list in the previous section, so we can just import the interface using the __merge__ field:

src/tasks/<task_id>/process_dataset/config.vsh.yaml
__merge__: ../api/comp_process_dataset.yaml  # (1)
functionality:
  name: "process_dataset"  # (2)
  resources:  # (3)
    - type: python_script  # (4)
      path: script.py
    - path: /src/common/helper_functions/subset_anndata.py  # (5)
platforms:
  - type: docker  # (6)
    image: "python:3.10"
    setup:
      - type: python
        packages:
          - pyyaml
          - "anndata~=0.8.0"
  - type: nextflow  # (7)
    directives:
      label: [ highmem, highcpu ]

1. The interface to use for this component.
2. The name of the component. In this case, this should always be set to process_dataset.
3. Resources are files that provide the logic of the component.
4. We’ll create the Python script in the next step.
5. Optional helper functions you can use in your script.
6. A Docker platform to make the component reproducible.
7. A Nextflow platform to allow turning this component into a Nextflow module (and also a standalone pipeline).
src/tasks/<task_id>/process_dataset/config.vsh.yaml
__merge__: ../api/comp_process_dataset.yaml  # (1)
functionality:
  name: "process_dataset"  # (2)
  resources:  # (3)
    - type: r_script  # (4)
      path: script.R
    - path: /src/common/helper_functions/subset_anndata.R  # (5)
platforms:
  - type: docker  # (6)
    image: "eddelbuettel/r2u:22.04"
    setup:
      - type: r
        cran: [ anndata, tidyverse ]
      - type: apt
        packages: [ libhdf5-dev, libgeos-dev, python3, python3-pip, python3-dev, python-is-python3 ]
      - type: python
        pip: [ anndata~=0.8.0 ]
  - type: nextflow  # (7)
    directives:
      label: [ highmem, highcpu ]

1. The interface to use for this component.
2. The name of the component. In this case, this should always be set to process_dataset.
3. Resources are files that provide the logic of the component.
4. We’ll create the R script in the next step.
5. Optional helper functions you can use in your script.
6. A Docker platform to make the component reproducible.
7. A Nextflow platform to allow turning this component into a Nextflow module (and also a standalone pipeline).
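Once the config is in place, you can inspect the resolved config to verify that the interface from the API file was merged in correctly. The command below uses Viash’s config view subcommand, which prints the parsed config (including the arguments pulled in via __merge__):

viash config view src/tasks/<task_id>/process_dataset/config.vsh.yaml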

Step 2: Create the script

Next, create the script which will split the common dataset into a dataset object (for the methods) and a solution object (for the metrics).

src/tasks/<task_id>/process_dataset/script.py
import anndata as ad

## VIASH START
par = {
    "input": "resources_test/common/pancreas/dataset.h5ad",
    "output_dataset": "dataset.h5ad",
    "output_solution": "solution.h5ad",
}
## VIASH END

print(">> Load common dataset", flush=True)
adata = ad.read_h5ad(par["input"])

# <- implement logic for splitting the common dataset
#    into a dataset and a solution object here.

print(">> Create dataset for methods", flush=True)
output_dataset = ad.AnnData(
  # <- add the data you wish to store in the output_dataset object here
)

print(">> Create solution object for metrics", flush=True)
output_solution = ad.AnnData(
  # <- add the data you wish to store in the output_solution object here
)

print(">> Write to disk", flush=True)
output_dataset.write_h5ad(par["output_dataset"])
output_solution.write_h5ad(par["output_solution"])
src/tasks/<task_id>/process_dataset/script.R
library(anndata)

## VIASH START
par <- list(
  input = "resources_test/common/pancreas/dataset.h5ad",
  output_dataset = "dataset.h5ad",
  output_solution = "solution.h5ad"
)
## VIASH END

cat(">> Load common dataset\n")
adata <- anndata::read_h5ad(par[["input"]])

# <- implement logic for splitting the common dataset
#    into a dataset and a solution object here.

cat(">> Create dataset for methods\n")
output_dataset <- anndata::AnnData(
  # <- add the data you wish to store in the output_dataset object here
)

cat(">> Create solution object for metrics\n")
output_solution <- anndata::AnnData(
  # <- add the data you wish to store in the output_solution object here
)

print(">> Write to disk\n")
output_dataset$write_h5ad(par[["output_dataset"]])
output_solution$write_h5ad(par[["output_solution"]])
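How you populate these two objects depends entirely on your task. As a purely hypothetical illustration (shown in Python, slotting into the script above, and assuming the common dataset contains a counts layer plus batch, celltype and dataset_id annotations, none of which are guaranteed for your task), a split for a label-prediction style task might look like this:

# Hypothetical split logic; the slot names below are illustrative only and
# should be replaced by the slots defined in your own API files.
output_dataset = ad.AnnData(
  X=adata.layers["counts"],                     # data the methods are allowed to see
  obs=adata.obs[["batch"]],                     # assumed public annotation
  var=adata.var[[]],                            # keep the var index, no hidden columns
  uns={"dataset_id": adata.uns["dataset_id"]},
)

output_solution = ad.AnnData(
  obs=adata.obs[["celltype"]],                  # assumed ground truth, hidden from methods
  uns={"dataset_id": adata.uns["dataset_id"]},
)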
Tip

You can use the helper functions read_config_slots_info and subset_anndata defined in src/common/helper_functions/subset_anndata.py to easily subset the AnnData so that it only contains the slots specified by the corresponding API files. Examples of tasks where this is used: Dimensionality reduction, Label projection.
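A rough sketch of how such a helper-based approach could look. The function names come from the tip above, but their exact signatures, as well as the meta["config"] and meta["resources_dir"] fields, are assumptions here; check src/common/helper_functions/subset_anndata.py and the Viash documentation for the real interface:

import sys
import anndata as ad

# subset_anndata.py is listed as a resource in the config, so Viash places it
# next to the script; meta["resources_dir"] is assumed to point there.
sys.path.append(meta["resources_dir"])
from subset_anndata import read_config_slots_info, subset_anndata  # assumed names

adata = ad.read_h5ad(par["input"])

# Assumed behaviour: read the slot definitions for each output argument from
# the component's merged config, then keep only those slots in each output.
slot_info = read_config_slots_info(meta["config"])
output_dataset = subset_anndata(adata, slot_info["output_dataset"])
output_solution = subset_anndata(adata, slot_info["output_solution"])

output_dataset.write_h5ad(par["output_dataset"])
output_solution.write_h5ad(par["output_solution"])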

Step 3: Create test resources

Next, we’ll create the first part of the test resources by writing a test resource script: a Bash script which runs the process_dataset component.

src/tasks/<task_id>/resource_test_script/pancreas.sh
#!/bin/bash

# get the root of the repository
REPO_ROOT=$(git rev-parse --show-toplevel)

# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"

TASK_ID="<task_id>"
RAW_DATA=resources_test/common/pancreas/dataset.h5ad
DATASET_DIR=resources_test/$TASK_ID/pancreas

if [ ! -f $RAW_DATA ]; then
    echo "Error! Could not find raw data"
    exit 1
fi

mkdir -p $DATASET_DIR

# Run the process dataset component
viash run "src/tasks/$TASK_ID/process_dataset/config.vsh.yaml" -- \
    --input "$RAW_DATA" \
    --output_dataset "$DATASET_DIR/dataset.h5ad" \
    --output_solution "$DATASET_DIR/solution.h5ad"

# TODO: uncomment and update this code block when a method component has been added to your task
# # Run one method
# viash run "src/tasks/$TASK_ID/methods/<method_id>/config.vsh.yaml" -- \
#     --input "$DATASET_DIR/dataset.h5ad" \
#     --output "$DATASET_DIR/prediction.h5ad"

# TODO: uncomment and update this code block when a metric component has been added to your task
# # Run one metric
# viash run src/tasks/$TASK_ID/metrics/<metric_id>/config.vsh.yaml -- \
#     --input_prediction $DATASET_DIR/prediction.h5ad \
#     --input_solution $DATASET_DIR/solution.h5ad \
#     --output $DATASET_DIR/score.h5ad
Note

You’ll have to substitute the correct value for <task_id>, and update the output arguments to match those defined in the API file.

Now run the script to generate the first couple of test resources:

chmod +x src/tasks/<task_id>/resource_test_script/pancreas.sh
src/tasks/<task_id>/resource_test_script/pancreas.sh
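To sanity-check the generated files, you can load them in a Python session and inspect their structure. A minimal check, assuming anndata is installed locally (substitute <task_id> as before):

import anndata as ad

# Quick inspection of the generated test resources.
dataset = ad.read_h5ad("resources_test/<task_id>/pancreas/dataset.h5ad")
solution = ad.read_h5ad("resources_test/<task_id>/pancreas/solution.h5ad")

print(dataset)   # should contain only the slots allowed by your dataset file format spec
print(solution)  # should contain the ground-truth slots used by the metrics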

Next steps

Having created the dataset processor and the first test resources, you can now start creating method and metric components. We recommend creating one method component and then one metric component first, so you can use those components to generate the test resources needed for testing all of the components.