Design the API

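When creating a new OpenProblems task, it’s essential to design the API for the inputs and outputs of components and their file formats. Concretely, you need to define the file formats of the AnnData files that are passed between components, and the component types that consume and produce those files.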

When put together, a typical task API looks somewhat like this:

graph LR
  common_dataset[Common<br/>dataset]:::anndata
  subgraph task_specific[Task-specific workflow]
    dataset_processor[/Dataset<br/>processor/]:::component
    solution[Solution]:::anndata
    masked_data[Dataset]:::anndata
    method[/Method/]:::component
    control_method[/Control<br/>method/]:::component
    output[Output]:::anndata
    metric[/Metric/]:::component
    score[Score]:::anndata
  end
  common_dataset --- dataset_processor --> masked_data & solution
  masked_data --- method --> output
  masked_data & solution --- control_method --> output
  solution & output --- metric --> score
Figure 1: Overview of a typical benchmarking workflow in an OpenProblems task. Legend: Grey rectangles are AnnData .h5ad files, purple rhomboids are Viash components.

The dimensionality reduction task is an example of an OpenProblems task with this topology: the output is an embedding of the original dataset, and the solution consists of cell annotations, which are used to verify whether the resulting embedding represents the intended biological information.

Do not forget to add the task info YAML file; see Define task.

Why?

Having a formally defined API ensures consistency and interoperability across the different components of your task. This makes it easier for others to contribute and build upon your work. Creating API files also (partially) automates several of the later steps in developing a task.

How?

We’ll need to create API files for each component and AnnData file separately. However, this is actually quite easy to do, as we will show in the following sections.

Step 1: Create task API workflow

Start by creating a workflow similar to what is shown in Figure 1. We recommend drawing the workflow on paper first.

Here are the most common types of components and file formats:

  • Common dataset: OpenProblems offers a standard collection of datasets, which can be used to kickstart a new task. Please check the reference documentation for the common dataset file format.
  • Dataset processor: This component ingests a Common dataset and splits it into one or more task-specific dataset objects. We recommend at least having a Dataset and Solution object, such that a Method component never “sees” the ground-truth information needed by the Metric component.
  • Dataset: The data used by a method to create an output (i.e. prediction).
  • Solution: The ground-truth information needed by a Metric to compare an output against. It’s highly recommended to store the ground-truth information as a separate AnnData object, such that a Method cannot (accidentally) cheat.
  • Method: An algorithm used to make predictions for or process an input dataset in some way.
  • Control method: A quality control for methods, metrics and the pipeline as a whole. A control method can either be a positive control (which uses the ground-truth information from the solution to create a perfect output) or a negative control (which uses random distributions to generate outputs in the correct format).
  • Output: The output generated by a (control) method.
  • Metric: A quantitative measure used to evaluate the performance of a method.
  • Score: An AnnData object containing one or more metric values.
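
The roles listed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: plain dictionaries stand in for AnnData .h5ad files, and every function name, the toy constant-label method and the accuracy metric are hypothetical and not part of OpenProblems.

```python
import random

def dataset_processor(common):
    """Split a common dataset into a Dataset (no labels) and a Solution (labels only)."""
    dataset = {"counts": common["counts"]}
    solution = {"label": common["label"]}
    return dataset, solution

def method(dataset):
    """Toy method: predicts label "A" for every cell. It never sees the solution."""
    return {"label_pred": ["A"] * len(dataset["counts"])}

def positive_control(dataset, solution):
    """Positive control: copies the ground truth to create a perfect output."""
    return {"label_pred": list(solution["label"])}

def negative_control(dataset, solution, seed=0):
    """Negative control: random labels in the correct output format."""
    rng = random.Random(seed)
    choices = sorted(set(solution["label"]))
    return {"label_pred": [rng.choice(choices) for _ in dataset["counts"]]}

def metric(solution, output):
    """Toy metric: fraction of cells whose predicted label matches the ground truth."""
    pairs = zip(solution["label"], output["label_pred"])
    correct = sum(truth == pred for truth, pred in pairs)
    return {"accuracy": correct / len(solution["label"])}

# A tiny made-up common dataset with four cells and two features.
common = {"counts": [[1, 0], [0, 3], [2, 2], [0, 1]], "label": ["A", "B", "A", "B"]}
dataset, solution = dataset_processor(common)

scores = {
    "method": metric(solution, method(dataset)),
    "positive_control": metric(solution, positive_control(dataset, solution)),
    "negative_control": metric(solution, negative_control(dataset, solution)),
}
```

Because the method only receives `dataset`, it cannot read the labels, while both control methods also receive `solution`. The scores of the two controls give reference points against which a real method's score can be compared.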

Figure 2 and Figure 3 are examples of two OpenProblems tasks with slightly different workflow layouts.

graph LR
  common_dataset[Common<br/>dataset]:::anndata
  processor[/Dataset<br/>processor/]:::component
  solution[Solution]:::anndata
  train[Training<br/>dataset]:::anndata
  test[Test<br/>dataset]:::anndata
  method[/Method/]:::component
  control[/Control<br/>method/]:::component
  output[Output]:::anndata
  metric[/Metric/]:::component
  score[Score]:::anndata

  common_dataset --- processor --> train & test & solution
  train & test --- method --> output
  train & test & solution --- control --> output
  solution & output --- metric --> score
Figure 2: Example of a task where a dataset’s cells are split into a training dataset and a test dataset. Example: in the Label projection task, the Training dataset contains both the raw counts and the cell type labels of one set of cells, while the Test dataset and the Solution contain the raw counts and the cell type labels of the remaining set of cells, respectively.
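
The split described in Figure 2 could look roughly like the following sketch. It assumes plain dictionaries in place of AnnData files and a hypothetical `split_train_test` helper that simply takes the first `n_train` cells for training; a real dataset processor would likely choose the cells differently.

```python
def split_train_test(common, n_train):
    """Split cells into a training dataset, a test dataset and a solution."""
    counts, labels = common["counts"], common["label"]
    # Training cells keep both their counts and their labels.
    train = {"counts": counts[:n_train], "label": labels[:n_train]}
    # Test cells expose only counts; their labels are stored in the solution.
    test = {"counts": counts[n_train:]}
    solution = {"label": labels[n_train:]}
    return train, test, solution

# A tiny made-up common dataset with four cells.
common = {"counts": [[1, 0], [0, 3], [2, 2], [0, 1]], "label": ["A", "B", "A", "B"]}
train, test, solution = split_train_test(common, n_train=3)
```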
graph LR
  subgraph common_dataset[Common dataset]
    common_mod1[RNA counts]:::anndata
    common_mod2[ADT counts]:::anndata
  end
  subgraph dataset[Dataset]
    mod1[RNA counts]:::anndata
    mod2[ADT counts]:::anndata
  end
  processor[/Dataset<br/>processor/]:::component
  solution[Solution]:::anndata
  dataset[Dataset]:::anndata
  method[/Method/]:::component
  control[/Control<br/>method/]:::component
  output[Output]:::anndata
  metric[/Metric/]:::component
  score[Score]:::anndata

  common_mod1 & common_mod2 --- processor --> mod1 & mod2 & solution
  mod1 & mod2 --- method --> output
  mod1 & mod2 & solution --- control --> output
  solution & output --- metric --> score
Figure 3: Example of a multimodal task. Example: in the Multimodal data integration task, a dataset consists of both an RNA AnnData and ADT AnnData. However, the RNA and ADT AnnData objects are stored as two separate files.

Step 2: Create file formats

Now that you’ve created the topology of the task workflow, the next step is to translate that information into the required file format specification files.

Let’s start by creating one for the solution object:

src/tasks/<task_id>/api/file_solution.yaml
type: file                                                    # (1)
description: "FILL IN: what this file represents"             # (2)
example: "resources_test/<task_id>/pancreas/solution.h5ad"    # (3)
info:
  label: Solution                                             # (4)
  slots:                                                      # (5)

(1) This YAML file will be used to define the arguments of a Viash component. This must always be set to type: file.
(2) Description of the file, useful for quickly understanding what type of data such a file represents. Used for generating reference documentation.
(3) An example of this file. At this stage, this file does not exist yet, but it will be created later on, as this file is used for unit testing components.
(4) A short label used to represent the file in diagrams in the reference documentation.
(5) The AnnData slots that need to be present in the file. These will be defined in Step 4.

Create a YAML file for each of the other AnnData files in the task workflow. For example, src/tasks/<task_id>/api/file_dataset.yaml, src/tasks/<task_id>/api/file_output.yaml, and so on.
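
When writing several of these files, a small sanity check can catch missing fields early. The sketch below is a hypothetical helper, not part of OpenProblems or Viash. It assumes the YAML file has already been parsed into a Python dictionary, and it treats all fields shown in the example above as expected, which is an assumption apart from `type: file`.

```python
def check_file_spec(spec):
    """Return a list of problems found in a parsed file format specification."""
    problems = []
    if spec.get("type") != "file":
        problems.append("type must be 'file'")
    for key in ("description", "example"):
        if not spec.get(key):
            problems.append(f"missing '{key}'")
    info = spec.get("info") or {}
    if not info.get("label"):
        problems.append("missing 'info.label'")
    # At this stage `slots` may still be empty; only its presence is checked.
    if "slots" not in info:
        problems.append("missing 'info.slots'")
    return problems

# Dictionary equivalent of the file_solution.yaml example (an empty `slots:` parses to None).
solution_spec = {
    "type": "file",
    "description": "FILL IN: what this file represents",
    "example": "resources_test/<task_id>/pancreas/solution.h5ad",
    "info": {"label": "Solution", "slots": None},
}
problems = check_file_spec(solution_spec)
```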

Tip

Each file format specification file is actually a Viash file argument. That’s because these YAML files will be used as arguments in the different component types.

Step 3: Create component types

Next, we will create the API specification files for each of the components (i.e. purple rhomboids) in your diagram.

Start by creating the method component type:

src/tasks/<task_id>/api/comp_method.yaml
functionality:
  namespace: <task_id>/methods    # (1)
  info:                           # (2)
    type: method                  # (3)
    type_info:
      label: "Method"             # (4)
      description: "FILL IN: A description of what this type of component does."    # (5)
  arguments:                      # (6)
    - name: "--input"
      __merge__: file_dataset.yaml
      required: true
    - name: "--output"
      __merge__: file_output.yaml
      required: true
      direction: output

(1) The namespace for the component type, in the format <task_id>/<component_type>. The namespace is used to group similar components together and ensures that they can be easily found and used within the task.
(2) Metadata about the component type.
(3) A unique identifier for the type of component.
(4) A formatted label for the component type.
(5) A description of the component type.
(6) The arguments that the component accepts. Each argument has a name (e.g., --input), a direction (input (default) or output) and a flag stating whether it’s required or not. Note that this information is partially provided by merging the file API YAML file specified earlier, using the __merge__ notation.

Create a YAML file for each of the other component types in the task workflow. For example, src/tasks/<task_id>/api/comp_dataset_processor.yaml, src/tasks/<task_id>/api/comp_metric.yaml, and so on.
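
To get an intuition for what the `__merge__` notation in the component type above achieves, the following sketch fills a file specification into an argument. It is a simplified model and not Viash's actual implementation: it assumes a shallow merge in which the argument's own keys take precedence, and it uses a hypothetical `file_specs` dictionary in place of reading YAML files from disk.

```python
def resolve_merge(argument, file_specs):
    """Return a copy of `argument` with its `__merge__` reference filled in."""
    resolved = {}
    if "__merge__" in argument:
        # Start from the referenced file format specification.
        resolved.update(file_specs[argument["__merge__"]])
    # Keys set directly on the argument override the merged ones (assumption).
    resolved.update({k: v for k, v in argument.items() if k != "__merge__"})
    return resolved

# Hypothetical parsed content of file_dataset.yaml.
file_specs = {
    "file_dataset.yaml": {
        "type": "file",
        "description": "FILL IN: what this file represents",
        "info": {"label": "Dataset"},
    }
}
argument = {"name": "--input", "__merge__": "file_dataset.yaml", "required": True}
resolved = resolve_merge(argument, file_specs)
```

In this model, the resolved argument carries the `type`, `description` and `info` of the file specification alongside its own `name` and `required` fields, which is why the component type only needs to state what is specific to the argument.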

Tip

Again, each component type is formatted as a Viash config file, because they will be used to create components.

Step 4: Add slots to file formats

The last step is to define the actual required and optional slots of each of the file format specifications. Since each of these files is an AnnData HDF5 file, the file format specification is structured analogously to the AnnData data structures: layers, obs, obsm, obsp, var, varm, varp, and uns.

Below is the slot information of the solution AnnData object:

src/tasks/<task_id>/api/file_solution.yaml
type: file
description: "FILL IN: what this file represents"
example: "resources_test/<task_id>/pancreas/solution.h5ad"
info:
  label: Solution
  slots:                # (1)
    layers:             # (2)
      - type: integer
        name: counts
        description: Raw counts
      - type: double
        name: normalized
        description: Normalized counts
    obs:                # (3)
      - type: string
        name: label
        description: Ground truth cell type labels
      - type: string
        name: batch
        description: Batch information
    var:                # (4)
      - type: boolean
        name: hvg
        description: Whether or not the feature is considered to be a 'highly variable gene'
        required: true
      - type: integer
        name: hvg_score
        description: A ranking of the features by hvg.
        required: true
    obsm:               # (5)
      - type: double
        name: X_pca
        description: The resulting PCA embedding.
        required: true
    uns:                # (6)
      - type: string
        name: dataset_id
        description: "A unique identifier for the dataset"
        required: true
      - type: string
        name: normalization_id
        description: "Which normalization was used"
        required: true

(1) The mandatory and optional slots in the AnnData file.
(2) Specification of one or more AnnData layers (matrices).
(3) Specification for cell-level metadata (one or more columns).
(4) Specification for feature-level metadata (one or more columns).
(5) Specification for multi-dimensional cell-level data, such as embeddings.
(6) Specification for unstructured data. Other AnnData slots (obsp, varm, varp) are specified in the same way.

Each required or optional slot in the file format should have the following fields:

  • name: The name of the slot.
  • type: Which data type (string, boolean, integer or double).
  • description: What this data represents.
  • required: Whether or not this slot is required (default: true).

Go through each file format specification file and add the expected slots accordingly.
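
Once the slots are listed, a specification can be used to check whether a file contains everything that is required. The sketch below shows the idea with a hypothetical `missing_slots` helper. It assumes the `slots` section has been parsed into a dictionary and uses nested dictionaries as a stand-in for an AnnData object. The `required: False` entry is set only for illustration.

```python
def missing_slots(data, slots_spec):
    """List the required slots from `slots_spec` that are absent in `data`."""
    missing = []
    for group, entries in slots_spec.items():
        present = data.get(group, {})
        for entry in entries:
            # A slot is treated as required unless stated otherwise (default: true).
            required = entry.get("required", True)
            if required and entry["name"] not in present:
                missing.append(f"{group}['{entry['name']}']")
    return missing

slots_spec = {
    "obs": [
        {"type": "string", "name": "label", "description": "Ground truth cell type labels"},
    ],
    "uns": [
        {"type": "string", "name": "dataset_id", "description": "A unique identifier for the dataset"},
        {"type": "string", "name": "normalization_id", "description": "Which normalization was used", "required": False},
    ],
}

# Stand-in for a solution object that lacks uns['dataset_id'].
solution = {"obs": {"label": ["A", "B"]}, "uns": {}}
missing = missing_slots(solution, slots_spec)
```

Here `missing` contains only the `dataset_id` entry: `label` is present, and `normalization_id` is marked as optional.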

Tip

Look at the Common dataset reference docs to see which slots the common datasets have. The AnnData file at resources_test/common/pancreas/dataset.h5ad is also an example of a Common dataset, though note that this object contains more slots than what is defined by the spec.