Task structure
Before defining a new task in OpenProblems, it’s important to understand the typical structure of an OpenProblems task (Figure 1).

```mermaid
graph LR
  common_dataset[Common<br/>dataset]:::anndata
  subgraph task_specific[Task-specific workflow]
    dataset_processor[/Dataset<br/>processor/]:::component
    solution[Solution]:::anndata
    masked_data[Dataset]:::anndata
    method[/Method/]:::component
    control_method[/Control<br/>method/]:::component
    output[Output]:::anndata
    metric[/Metric/]:::component
    score[Score]:::anndata
  end
  common_dataset --- dataset_processor --> masked_data & solution
  masked_data --- method --> output
  masked_data & solution --- control_method --> output
  solution & output --- metric --> score
```

Figure 1: The typical structure of an OpenProblems task.
A task typically consists of a dataset processor, methods, control methods and metrics. Each component has a well-defined input-output interface, for which the file formats of the resulting AnnData objects are also described.
File and component formats
Path: src/tasks/<task_id>/api
This folder contains YAML specifications for task-specific file formats and component interfaces.
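To make this concrete, below is a minimal Python sketch of the kind of structure such a specification describes: an AnnData object with required slots that downstream components can rely on. The slot names, file name and checks are illustrative assumptions for this guide, not the specification of any real task.

```python
import anndata as ad
import numpy as np

# Build a toy "common dataset" with the kind of slots a file format spec
# might require (slot and field names here are illustrative, not a real task spec).
counts = np.random.poisson(1.0, size=(100, 50)).astype(np.float32)
common = ad.AnnData(X=counts, uns={"dataset_id": "example_dataset"})
common.layers["counts"] = common.X.copy()

# A format spec ultimately translates into checks like these on the written file.
assert "counts" in common.layers, "layer 'counts' is required"
assert "dataset_id" in common.uns, "uns field 'dataset_id' is required"

common.write_h5ad("common_dataset.h5ad")
```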
Dataset processor
Path: src/tasks/<task_id>/process_dataset
This component processes a Common dataset into task-specific dataset objects. In supervised tasks, this component usually outputs a solution, a training dataset and a test dataset. In unsupervised tasks, it usually outputs a solution and a masked dataset.
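As an illustration, here is a minimal sketch of an unsupervised dataset processor, continuing from the toy common dataset above: it hides a fraction of the values in the masked dataset and keeps the complete data as the solution. The file names, masking strategy and masked fraction are assumptions for the example only.

```python
import anndata as ad
import numpy as np

# Read the common dataset (file name follows the sketch above, not the task API).
common = ad.read_h5ad("common_dataset.h5ad")

rng = np.random.default_rng(0)

# The solution keeps the complete data for later use by metrics and control methods.
solution = common.copy()

# The masked dataset hides a random 10% of entries; methods must predict them.
masked = common.copy()
X = np.asarray(masked.X, dtype=float)  # real datasets are often sparse; densified here for brevity
hidden = rng.random(X.shape) < 0.1
X[hidden] = 0.0
masked.X = X
masked.uns["masked_fraction"] = float(hidden.mean())

masked.write_h5ad("masked_dataset.h5ad")
solution.write_h5ad("solution.h5ad")
```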
Methods
Path: src/tasks/<task_id>/methods
This folder contains method components. Each method component outputs a prediction given the training and test datasets (when applicable).
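Continuing the illustrative sketch, a deliberately naive method might fill the masked entries with per-column means of the observed values. The file names and the uns['method_id'] field are assumptions for the example rather than the task API.

```python
import anndata as ad
import numpy as np

# Read the masked dataset written by the dataset processor
# (file name follows the sketch above; it is not part of the task API).
masked = ad.read_h5ad("masked_dataset.h5ad")

# A deliberately simple "method": fill every masked (zero) entry with the
# mean of the observed values in its column.
X = np.asarray(masked.X, dtype=float)
col_means = X.sum(axis=0) / np.maximum((X != 0).sum(axis=0), 1)
prediction = np.where(X == 0, col_means, X)

output = masked.copy()
output.X = prediction
output.uns["method_id"] = "mean_imputation"
output.write_h5ad("output.h5ad")
```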
Control methods
Path: src/tasks/<task_id>/control_methods
This folder contains control components for the task. These components have the same interface as the regular methods but also receive the solution object as input. They serve as a starting point for testing the relative accuracy of new methods in the task, and as a quality control for the metrics defined in the task. A control method is either a positive control or a negative control; together these set an upper and a lower bound on performance, so any new method should perform better than the negative controls and worse than the positive controls.
A positive control is a method for which the expected result is known, and which therefore yields the best possible value for any metric.
A negative control is a simple, naive, or random method that does not rely on any sophisticated techniques or domain knowledge.
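Continuing the illustrative sketch, a positive control might simply copy the solution, while a negative control outputs random values; again, the file names and fields are assumptions for the example, not the actual control-method interface.

```python
import anndata as ad
import numpy as np

# Control methods see the solution in addition to the masked dataset
# (file names follow the sketches above; they are not the task API).
masked = ad.read_h5ad("masked_dataset.h5ad")
solution = ad.read_h5ad("solution.h5ad")

# Positive control: return the solution itself, giving the best achievable score.
positive = masked.copy()
positive.X = solution.X.copy()
positive.uns["method_id"] = "perfect_prediction"
positive.write_h5ad("output_positive_control.h5ad")

# Negative control: random values, giving a baseline any real method should beat.
rng = np.random.default_rng(0)
negative = masked.copy()
negative.X = rng.random(masked.shape)
negative.uns["method_id"] = "random_prediction"
negative.write_h5ad("output_negative_control.h5ad")
```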
Metrics
Path: src/tasks/<task_id>/metrics
This folder contains metric components. Each metric component outputs one or more metric results given a solution object and a method output object.
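To close the illustrative sketch, a metric component might compute an error between the method output and the solution and store it, together with identifiers, in a score object. The RMSE metric and the uns fields shown below are assumptions for the example, not the scoring format prescribed by OpenProblems.

```python
import anndata as ad
import numpy as np

# Compare a method output to the solution
# (file names and score slots are illustrative, not the task API).
solution = ad.read_h5ad("solution.h5ad")
output = ad.read_h5ad("output.h5ad")

# Example metric: root-mean-square error between predicted and true values.
rmse = float(np.sqrt(np.mean((np.asarray(output.X) - np.asarray(solution.X)) ** 2)))

score = ad.AnnData(
    uns={
        "dataset_id": solution.uns.get("dataset_id", "unknown"),
        "method_id": output.uns.get("method_id", "unknown"),
        "metric_ids": ["rmse"],
        "metric_values": [rmse],
    }
)
score.write_h5ad("score.h5ad")
```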
Benchmarking pipeline
Path: src/tasks/<task_id>/workflows
This folder contains a Nextflow pipeline defining the benchmarking workflow for this task.
Resource generation scripts
Path: src/tasks/<task_id>/resources_scripts
This folder contains scripts for generating benchmarking resources required for the task.
Test resource generation scripts
Path: src/tasks/<task_id>/resources_test_scripts
This folder contains scripts for generating test resources for the task.