Note: This documentation has recently been migrated. You can find the old documentation here.

Create a dataset loader

A dataset loader generates one or more raw datasets that are processed by the Dataset preprocessing workflow to create a common dataset usable by multiple benchmarking tasks.

Tip

Make sure you have followed the "Getting started" guide.

OpenProblems datasets

Common datasets are created by generating raw datasets with a dataset loader and running them through the preprocessing pipeline (Figure 1). Further task-specific processing then occurs before the task-specific benchmarking workflow.

Figure 1: Flow of data in OpenProblems benchmarks. All datasets are processed by a common processing pipeline. Further task-specific processing can occur prior to the task-specific benchmarking workflow. Legend: Grey rectangles are AnnData .h5ad files, purple rhomboids are Viash components.

See the reference documentation for more information on how each of these steps works.

Step 1: Create a directory for the dataset loader

Create a directory to hold the Viash component for your dataset loader:

mkdir src/datasets/loaders/myloader

Tip

Take a look at the dataset loaders that are already in src/datasets/loaders. There is likely one that already does something similar to what you need.

Step 2: Create a Viash config

Next, create a config for the dataset loader. The Viash config contains metadata of your dataset, which script is used to run it, and the required dependencies. The simplest dataset loader you can create looks as follows.

For more parameter options, refer to the "Parameters" section.

Step 3: Create a script

Next, create a script that generates or loads the dataset. Here we show an example script that generates a random dataset, but check out src/datasets/loaders for real data examples. The script must ensure that the output AnnData object has the format described in the "Format of a raw dataset object" section.
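As a rough illustration of such a script, the sketch below generates a random count matrix and wraps it in an AnnData object. It assumes the script runs as a Viash Python component, where the `par` dictionary between the `## VIASH START` and `## VIASH END` markers is replaced with the actual arguments at runtime. The helper names (`random_counts`, `write_dataset`), the hard-coded dimensions, the cell and gene names, and the choice to store the matrix in `X` are illustrative assumptions, not the required raw dataset format.

```python
import random

## VIASH START
# Placeholder value for running the script directly, e.g. in an IDE.
par = {"output": "mydataset.h5ad"}
## VIASH END

# Hypothetical fixed dimensions for this sketch.
N_OBS = 100
N_VARS = 100


def random_counts(n_obs, n_vars, seed=0):
    """Return an n_obs x n_vars list of lists of random integer counts."""
    rng = random.Random(seed)
    return [[rng.randint(0, 10) for _ in range(n_vars)] for _ in range(n_obs)]


def write_dataset(counts, path):
    """Wrap the counts in an AnnData object and write it to an .h5ad file."""
    # Imported here so the rest of the sketch runs without these packages.
    import anndata as ad
    import numpy as np
    import pandas as pd

    X = np.asarray(counts, dtype="float32")
    obs = pd.DataFrame(index=[f"cell_{i}" for i in range(X.shape[0])])
    var = pd.DataFrame(index=[f"gene_{j}" for j in range(X.shape[1])])
    # Assumption: counts are stored in X. The slots a raw dataset needs are
    # listed in the "Format of a raw dataset object" section.
    adata = ad.AnnData(X=X, obs=obs, var=var)
    adata.write_h5ad(path, compression="gzip")


counts = random_counts(N_OBS, N_VARS)
print(f"Generated {len(counts)} cells x {len(counts[0])} genes")
```

A real loader would finish by calling `write_dataset(counts, par["output"])`. The call is left out here so that the sketch does not write a file when run as-is.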

Step 4: Run the component

Try running your component! You can start off by running your script inside your IDE.

To check whether your component works as a standalone component, run the following commands.

(Re)build the Docker container after changing the platforms section in the Viash config:

viash run src/datasets/loaders/myloader/config.vsh.yaml -- \
  ---setup cachedbuild

Run the component:

viash run src/datasets/loaders/myloader/config.vsh.yaml -- \
  --output mydataset.h5ad

Parameters

You can add arguments to the dataset loader by adding entries to the functionality.arguments section in the config.vsh.yaml. For example:

arguments:
  - name: "--n_obs"
    type: "integer"
    description: "Number of cells to generate."
    default: 100
  - name: "--n_vars"
    type: "integer"
    description: "Number of genes to generate."
    default: 100

The runtime values are then available in the script as the n_obs and n_vars entries of the par object.
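A minimal sketch of reading these parameters in a Python script might look as follows. It assumes that `par` is a dictionary keyed by argument name without the leading dashes, and that an `--output` argument is also defined in the config. The `dataset_shape` helper and its validation are hypothetical additions for illustration.

```python
## VIASH START
# Defaults mirroring the config; Viash substitutes this block at runtime.
par = {"n_obs": 100, "n_vars": 100, "output": "mydataset.h5ad"}
## VIASH END


def dataset_shape(par):
    """Read and validate the dataset dimensions from the parameter dict."""
    n_obs = int(par["n_obs"])
    n_vars = int(par["n_vars"])
    if n_obs <= 0 or n_vars <= 0:
        raise ValueError("n_obs and n_vars must be positive integers")
    return n_obs, n_vars


n_obs, n_vars = dataset_shape(par)
print(f"Generating {n_obs} cells and {n_vars} genes -> {par['output']}")
```

Keeping the placeholder values between the two markers lets the same script run unchanged inside an IDE and as a built component.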

Format of a raw dataset object

The output AnnData object should contain at least the following slots:

## File format: raw.h5ad

NA

Example file: `resources_test/common/pancreas/raw.h5ad`