Glossary

Glossary of terms used in the documentation.

AnnData

AnnData, short for “Annotated Data”, is a file format for handling annotated, high-dimensional biological data (Virshup et al. 2021). It is a standard data format in the single-cell community, and is supported by many single-cell analysis tools, including Scanpy and CellxGene.

AnnData objects have a structured format that includes the main data matrix (X, e.g. gene expression values), annotations of observations (obs, e.g. cell metadata), annotations of variables (var, e.g. gene metadata), and unstructured annotations (uns). This organization makes it easy to work with complex datasets while maintaining data integrity and ensuring a standardized structure across different components.

[Figure: Overview of the different data structures inside an AnnData object.]
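
For illustration, a minimal AnnData object can be constructed in Python from these slots (the values and annotation names below are invented for the example):

  import anndata as ad
  import numpy as np
  import pandas as pd

  # Main data matrix: 3 cells (observations) x 2 genes (variables)
  X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])

  adata = ad.AnnData(
      X=X,
      # Per-cell annotations (one row per observation)
      obs=pd.DataFrame({"cell_type": ["B", "T", "B"]}, index=["c1", "c2", "c3"]),
      # Per-gene annotations (one row per variable)
      var=pd.DataFrame({"gene_name": ["GeneA", "GeneB"]}, index=["g1", "g2"]),
      # Unstructured metadata
      uns={"dataset_id": "example_dataset"},
  )
  print(adata)  # summarizes the populated slots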

Files with the .h5ad extension are AnnData objects stored in an HDF5 file. They can be opened in Python using the anndata.read_h5ad() function and in R using the anndata::read_h5ad() function; since .h5ad files are plain HDF5 files, they can technically be read in any language with an HDF5 library.
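
For example, in Python (the file name is hypothetical):

  import anndata as ad

  # Load an AnnData object from an .h5ad (HDF5) file
  adata = ad.read_h5ad("dataset.h5ad")

  print(adata.X.shape)     # main data matrix
  print(adata.obs.head())  # cell metadata
  print(adata.var.head())  # gene metadata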

AWS

Amazon Web Services (AWS) is a subsidiary of Amazon providing on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis. OpenProblems uses AWS to store and distribute datasets and resources on S3, and to run Nextflow workflows using AWS Batch.

CELLxGENE census

A cloud-based library of single-cell RNA-seq datasets, developed by the Chan Zuckerberg Initiative. It provides a user-friendly interface for exploring and retrieving single-cell RNA-seq data, and is widely used in the single-cell community.

OpenProblems uses CELLxGENE census to retrieve raw datasets, and the CELLxGENE schema to define the file format specification for raw datasets.
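
As a sketch, a slice of the census can be retrieved as an AnnData object with the cellxgene-census Python package (the filter below is a made-up example):

  import cellxgene_census

  # Open the latest census release and fetch a filtered slice as AnnData
  with cellxgene_census.open_soma() as census:
      adata = cellxgene_census.get_anndata(
          census,
          organism="Homo sapiens",
          obs_value_filter='tissue_general == "lung"',
      )
  print(adata)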

Common dataset

An AnnData object that follows the common dataset format. A common dataset is generated by the dataset processing workflow and is used as input for multiple benchmarking tasks.

Common datasets are stored at s3://openproblems-bio/resources/datasets/ and are processed into task-specific AnnData objects by dataset processors and subsequently stored at s3://openproblems-bio/resources/<task_id>.
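
A minimal sketch of listing these datasets with boto3, assuming the bucket allows anonymous (unsigned) read access:

  import boto3
  from botocore import UNSIGNED
  from botocore.config import Config

  # Unsigned requests, assuming the bucket is publicly readable
  s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

  response = s3.list_objects_v2(
      Bucket="openproblems-bio", Prefix="resources/datasets/"
  )
  for obj in response.get("Contents", []):
      print(obj["Key"])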

For a complete list of available common datasets, see the dataset overview page.

Common dataset format

The format of common datasets is based on the CELLxGENE schema, along with metadata specific to OpenProblems (in the .uns slot) and additional outputs generated by our dataset preprocessors (in the .layers, .obsm, .obsp and .varm slots).

Here is what a typical common dataset looks like when printed to the console:

AnnData object
  obs: 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', ...
  var: 'feature_id', 'feature_name', 'soma_joinid', 'hvg', 'hvg_score'
  obsm: 'X_pca'
  obsp: 'knn_distances', 'knn_connectivities'
  varm: 'pca_loadings'
  layers: 'counts', 'normalized'
  uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', ...

Some slots might not be available, depending on the origin of the dataset. Please visit the reference documentation for a detailed description of the available slots and their purpose.
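
Once loaded, the slots from the printout above can be accessed directly (the file name is hypothetical):

  import anndata as ad

  adata = ad.read_h5ad("common_dataset.h5ad")

  raw_counts = adata.layers["counts"]           # raw count matrix
  normalized = adata.layers["normalized"]       # normalized expression values
  pca_embedding = adata.obsm["X_pca"]           # per-cell PCA coordinates
  knn_graph = adata.obsp["knn_connectivities"]  # cell-cell kNN graph
  print(adata.uns["dataset_id"], adata.uns["dataset_name"])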

Component interface

A component interface is a metadata file which describes the expected inputs and outputs of a component. This is used to verify whether a component is valid and to automatically generate documentation for tasks and the components therein. Component interface files are typically stored in src/**/api/comp_*.yaml.

Control method

A control method is used to test the relative performance of all other methods, and also serves as a quality control for the pipeline as a whole. A control method can either be a positive control or a negative control. The positive and negative control methods set a maximum and minimum threshold for performance, so any new method should perform better than the negative control methods and worse than the positive control methods.

Dataset

A dataset is one or more AnnData objects that are used as input for a benchmarking task. To ensure interoperability between components, each component has a strict file format specification for validating whether an input or output AnnData object is valid.

OpenProblems offers a collection of datasets that can be used to test components and run the benchmarking tasks. Raw datasets are generated by dataset loaders and processed into common datasets by a dataset processing workflow. Testing resources are typically subsampled versions of the common datasets.

Testing resources are stored at s3://openproblems-bio/resources_test/, while the processed datasets are stored at s3://openproblems-bio/resources/.

Dataset loader

A Viash component that downloads a dataset and stores it as an AnnData file.

Dataset processing workflow

A workflow that processes a raw dataset into a common dataset. See the reference for more information.

Dataset processor

A Viash component that processes a common dataset into task-specific dataset objects.

Docker

Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package. Because the container bundles these dependencies, the developer can rest assured that the application will run on any other Linux machine, regardless of any customized settings that might differ from the machine used for writing and testing the code.

File format specification

A file format specification is a metadata file which describes the expected structure of an AnnData object. This is used to verify whether an AnnData object is valid and to automatically generate documentation for tasks and the components therein. File format specification files are typically stored in src/**/api/file_*.yaml.
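
Conceptually, a specification drives checks along these lines (a simplified sketch; the required slot names are illustrative and the real validation is handled by the OpenProblems tooling):

  import anndata as ad

  # Illustrative subset of a file format specification
  required_obs = ["dataset_id", "cell_type"]
  required_layers = ["counts"]

  adata = ad.read_h5ad("dataset.h5ad")  # hypothetical file name

  for column in required_obs:
      assert column in adata.obs.columns, f"missing .obs column: {column}"
  for layer in required_layers:
      assert layer in adata.layers, f"missing .layers entry: {layer}"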

GitHub Actions

GitHub Actions is a CI/CD service provided by GitHub that allows developers to automate their software development workflows. It is integrated into GitHub and is commonly used for testing, building, and deploying code.

OpenProblems uses GitHub Actions to automatically run tests and build Docker containers and Nextflow modules for each component.

Method

A method is a computational tool that can be used to solve a specific problem in single-cell omics data analysis.

Metric

A metric is a quantitative measure used to evaluate the performance of different methods in solving a specific problem in single-cell omics data analysis.

Nextflow

A workflow management system that enables the development of portable and reproducible workflows.

Raw dataset

An unprocessed dataset, as generated by a dataset loader. The file format specification for raw datasets is based on the CELLxGENE schema and is stored at src/datasets/api/file_raw.yaml.

Task

A benchmarking task to evaluate the performance of different methods in solving a specific problem when analysing omics data. A task typically consists of a dataset processor, methods, control methods and metrics. Each component has a well-defined input-output interface, for which the file formats of the resulting AnnData objects are also described.

The source code of a task is located in src/tasks/<task_id> and is structured as follows:

  • api/task_info.yaml: Contains metadata about the task.
  • api/comp_*.yaml: Files defining component interfaces.
  • api/file_*.yaml: Files specifying file formats.
  • dataset_processor/: Converts common datasets into task-specific datasets.
  • methods/: Implements methods to solve the task.
  • control_methods/: Tests and controls the quality of other methods.
  • metrics/: Metrics for evaluating method performance.
  • workflows/: Nextflow workflow for benchmarking tasks.
  • resources_scripts/: Scripts to execute workflows.
  • resources_test_scripts/: Scripts to create test resources.

See the reference documentation for more information.

Test resources

The test resources are a set of files located at s3://openproblems-bio/resources_test/ that are used to test components and workflows. After first cloning the repository, the test resources can be downloaded by following the instructions on the Getting Started page. Test resources are typically subsampled versions of the common datasets.

Viash

Viash is a meta-framework for creating modular Nextflow workflows from Viash components (Cannoodt et al. 2021). It allows developers to create reusable, modular, and robust components for OpenProblems, focusing on the specific functionality without having to worry about the chosen pipeline framework.

Specific benefits of Viash include:

  • Reproducible: Viash generates a Docker container for each component, ensuring that the component can be run in a reproducible environment.

  • Modular: Nextflow modules generated by Viash are more reusable and modular than typical Nextflow modules, since default parameter values and default directives can be overwritten by the user at runtime.

  • Robust: Viash allows for easy unit testing of a component's functionality.

  • Less boilerplate: Viash components are easier to write than typical Nextflow modules, since Viash takes care of a lot of the boilerplate Nextflow code (such as parsing and validating input sheets, generating a CLI, and generating documentation).

Viash component

A Viash component is a combination of an R or Python script and a small amount of metadata that makes it easy to generate pipeline modules, facilitating the separation of component functionality from the pipeline workflow (Cannoodt et al. 2021). This enables developers to create reusable, modular, and robust components for OpenProblems, focusing on the specific functionality without having to worry about the chosen pipeline framework.

A Viash component consists of three main parts: a Viash config, a script, and one or more unit tests.

Viash supports robust pipeline development by allowing users to build their component as a standalone executable (with auto-generated CLI), build a Docker container to run the script inside, or turn the component into a standalone Nextflow module.

References

Cannoodt, Robrecht, Hendrik Cannoodt, Eric Van de Kerckhove, Andy Boschmans, Dries De Maeyer, and Toni Verbeiren. 2021. “Viash: From Scripts to Pipelines.” arXiv. https://doi.org/10.48550/ARXIV.2110.11494.
Virshup, Isaac, Sergei Rybakov, Fabian J. Theis, Philipp Angerer, and F. Alexander Wolf. 2021. “anndata: Annotated Data.” bioRxiv. https://doi.org/10.1101/2021.12.16.473007.