AnnData object
obs: 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'organism', 'organism_ontology_term_id', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'batch', 'soma_joinid', 'size_factors'
var: 'feature_id', 'feature_name', 'soma_joinid', 'hvg', 'hvg_score'
obsm: 'X_pca'
obsp: 'knn_distances', 'knn_connectivities'
varm: 'pca_loadings'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'pca_variance', 'knn'
Datasets
To ensure interoperability between components, OpenProblems uses AnnData as the standard data format for both input and output files of components, and strict requirements are imposed on the format of these files.
An AnnData object stores a data matrix (X, e.g. gene expression values), annotations of observations (obs, e.g. cell metadata), annotations of variables (var, e.g. gene metadata), and unstructured annotations (uns). This organization makes it easy to work with complex datasets while maintaining data integrity and ensuring a standardized structure across different components.
File format specifications
Every OpenProblems task contains specifications for the exact format of the H5AD inputs and outputs of each component in its workflow. These specifications list the required and optional fields in the AnnData objects, along with a description of each field. They are used to validate the input and output files of components, and to generate the documentation for the API of each component.
You can find these specifications in the src/tasks/*/api/file_*.yaml files of each task. Here’s an example of such a specification: src/datasets/api/file_raw.yaml.
For more information on how these specifications are formatted, see “Design the API”.
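To illustrate how such a specification can drive validation, here is a minimal sketch. The SPEC dictionary is a hypothetical, trimmed-down stand-in for a file_*.yaml specification (the real files may be laid out differently), and check_against_spec is an illustrative helper, not part of OpenProblems. A SimpleNamespace stands in for the AnnData object so that the sketch runs with the standard library alone.

```python
from types import SimpleNamespace

# Hypothetical, trimmed-down specification. The real YAML files may use a
# different layout; this dict only mirrors the idea "slot -> list of fields".
SPEC = {
    "layers": [{"name": "counts", "required": True}],
    "obs": [
        {"name": "cell_type", "required": False},
        {"name": "batch", "required": False},
    ],
    "uns": [
        {"name": "dataset_id", "required": True},
        {"name": "dataset_name", "required": True},
    ],
}


def check_against_spec(adata, spec):
    """Return (missing_required, missing_optional) as lists of 'slot.field'."""
    missing_required, missing_optional = [], []
    for slot, fields in spec.items():
        container = getattr(adata, slot)
        for field in fields:
            # `in` works for dict-like slots and for DataFrame columns alike.
            if field["name"] in container:
                continue
            target = missing_required if field["required"] else missing_optional
            target.append(f"{slot}.{field['name']}")
    return missing_required, missing_optional


# Stand-in for an AnnData object (an assumption made to keep this runnable
# without extra packages). With the anndata package installed, an object
# loaded with `anndata.read_h5ad(path)` could be passed in instead.
fake_adata = SimpleNamespace(
    layers={"counts": [[1, 0], [0, 3]]},
    obs={"cell_type": ["B cell", "T cell"]},
    uns={"dataset_id": "example"},
)

missing_required, missing_optional = check_against_spec(fake_adata, SPEC)
print("missing required:", missing_required)
print("missing optional:", missing_optional)
```

Reporting required and optional fields separately mirrors the distinction the specifications make: a missing required field should fail validation, while a missing optional field is only worth noting.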
Common datasets
OpenProblems offers a collection of common datasets that can be used to test components and run the benchmarking tasks. These datasets are generated by dataset loaders and processed by a common processing pipeline stored in src/datasets.
graph LR
  normalization:::group
  dataset_processors:::group
  raw_dataset["Raw dataset"]:::anndata
  common_dataset[Common<br/>dataset]:::anndata
  test_dataset[Test<br/>dataset]:::anndata
  dataset_loader[/Dataset<br/>loader/]:::component
  subgraph normalization [Normalization methods]
    log_cp10k[/"Log CP10k"/]:::component
    l1_sqrt[/"L1 sqrt"/]:::component
    log_scran_pooling[/"Log scran<br/>pooling"/]:::component
    sqrt_cp10k[/Sqrt CP10k/]:::component
  end
  subgraph dataset_processors[Dataset processors]
    pca[/PCA/]:::component
    hvg[/HVG/]:::component
    knn[/KNN/]:::component
  end
  dataset_loader --> raw_dataset --> log_cp10k & l1_sqrt & log_scran_pooling & sqrt_cp10k --> pca --> hvg --> knn --> common_dataset
  subset[/Subset/]:::component
  common_dataset --> subset --> test_dataset
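The fan-out in this diagram can be sketched in a few lines of Python. The component names below are the node identifiers from the diagram. The processing_plan helper is purely illustrative, and the reading that each normalization branch passes through PCA, HVG and KNN on its own is an assumption based on the arrows, not a description of the actual workflow code.

```python
# Node identifiers taken from the diagram above.
NORMALIZATION_METHODS = ["log_cp10k", "l1_sqrt", "log_scran_pooling", "sqrt_cp10k"]
DATASET_PROCESSORS = ["pca", "hvg", "knn"]


def processing_plan(loader_name):
    """List the ordered component chain for each normalization branch."""
    return [
        [loader_name, method, *DATASET_PROCESSORS]
        for method in NORMALIZATION_METHODS
    ]


plan = processing_plan("dataset_loader")
for chain in plan:
    # Each chain ends in a common dataset, per the diagram.
    print(" --> ".join(chain + ["common_dataset"]))
```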
File format of common datasets
The format of common datasets is based on the CELLxGENE schema, extended with metadata that is specific to OpenProblems (in the .uns slot) and with additional output generated by our dataset preprocessors (in the .layers, .obsm, .obsp and .varm slots).
A typical common dataset, when printed to the console, looks like the AnnData summary shown at the top of this page.
Some slots might not be available depending on the origin of the dataset. Please visit the reference documentation on the common dataset file format used by OpenProblems for more information on each of the different slots.
In OpenProblems, the X slot in the AnnData objects is typically not defined (None in Python, NULL in R). Instead, the raw counts and normalised expression data are defined as layers.
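In practice this means component code should read expression values from a named layer instead of X. The sketch below shows one way to do that with a clear error message. get_expression is a hypothetical helper, the layer names 'counts' and 'normalized' are taken from the example dataset printout, and a SimpleNamespace stands in for the AnnData object so that the example runs with the standard library alone.

```python
from types import SimpleNamespace


def get_expression(adata, layer="normalized"):
    """Return the requested layer, with a clear error if it is absent."""
    if layer not in adata.layers:
        available = ", ".join(sorted(adata.layers.keys())) or "none"
        raise KeyError(f"layer '{layer}' not found; available layers: {available}")
    return adata.layers[layer]


# Stand-in for an AnnData object whose X slot is undefined (an assumption
# made to keep this runnable without extra packages).
fake_adata = SimpleNamespace(
    X=None,
    layers={
        "counts": [[3, 0], [1, 2]],
        "normalized": [[1.4, 0.0], [0.7, 1.1]],
    },
)

counts = get_expression(fake_adata, layer="counts")
normalized = get_expression(fake_adata)
```

Failing early with the list of available layers makes it easier to spot a dataset that was not run through the expected normalization step.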
Available datasets
Our datasets are stored in s3://openproblems-data/resources/datasets. Please visit the datasets page for more information on each of the available datasets.