src/datasets/

The dataset processing pipeline uses dataset loaders to create raw dataset files (Figure 1). Each raw dataset file is then normalized and passed through the dataset processors (PCA, HVG, KNN) to generate a common dataset file. Common dataset files are used in one or more tasks.

graph LR
  normalization:::group
  dataset_processors:::group
  raw_dataset["Raw dataset"]:::anndata
  common_dataset[Common<br/>dataset]:::anndata
  test_dataset[Test<br/>dataset]:::anndata
  dataset_loader[/Dataset<br/>loader/]:::component
  subgraph normalization [Normalization methods]
    log_cpm[/"Log CPM"/]:::component
    l1_sqrt[/"L1 sqrt"/]:::component
    log_scran_pooling[/"Log scran<br/>pooling"/]:::component
    sqrt_cpm[/Sqrt CPM/]:::component
  end
  subgraph dataset_processors[Dataset processors]
    pca[/PCA/]:::component
    hvg[/HVG/]:::component
    knn[/KNN/]:::component
  end
  dataset_loader --> raw_dataset --> log_cpm & l1_sqrt & log_scran_pooling & sqrt_cpm --> pca --> hvg --> knn --> common_dataset
  subset[/Subset/]:::component
  common_dataset --> subset --> test_dataset
Figure 1: Overview of the dataset processing workflow. Legend: Grey rectangles are AnnData .h5ad files, purple rhomboids are Viash components.
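
The order of the stages in Figure 1 can be sketched in a few lines of Python. This is only an illustration: the stage names are taken from the figure, while `plan_pipeline` and the `"dataset_loader"` placeholder are hypothetical and not part of the repository.

```python
# Stage names are taken from Figure 1; the helper itself is hypothetical.
NORMALIZATIONS = ["log_cpm", "l1_sqrt", "log_scran_pooling", "sqrt_cpm"]
PROCESSORS = ["pca", "hvg", "knn"]  # applied in this order


def plan_pipeline(loader: str, normalization: str) -> list:
    """Return the ordered component names leading to a common dataset."""
    if normalization not in NORMALIZATIONS:
        raise ValueError(f"unknown normalization method: {normalization!r}")
    return [loader, normalization, *PROCESSORS]


# One plan per normalization method, mirroring the fan-out in the figure.
plans = {norm: plan_pipeline("dataset_loader", norm) for norm in NORMALIZATIONS}
for steps in plans.values():
    print(" -> ".join(steps))
```

Each raw dataset fans out into one branch per normalization method, and every branch runs the same processors in the same order.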

Directory structure

  • Dataset file and component formats (src/datasets/api): This folder contains specifications for dataset file formats and component interfaces. This documentation page was generated largely from these files.

  • Dataset loader (src/datasets/loaders): This folder contains components to load and format datasets from various sources.

  • Dataset normalization (src/datasets/normalization): This folder contains various dataset normalization methods.

  • Dataset processors (src/datasets/processors): This folder contains components for processing datasets, such as computing a PCA embedding, selecting HVGs, computing a KNN graph, or subsetting.

  • Resource generation scripts (src/common/resources_scripts): This folder contains scripts for generating the datasets using the dataset loaders, normalization methods and processors.

  • Test resource generation scripts (src/common/resources_test_scripts): This folder contains scripts for generating test resources.

Component type: Dataset loader

Path: src/datasets/loaders

A component which generates a “Raw dataset”.

Arguments:

| Name | Type | Description |
|:---|:---|:---|
| `--output` | `file` | (Output) An unprocessed dataset as output by a dataset loader. |

File format: Raw dataset

An unprocessed dataset as output by a dataset loader.

Example file: resources_test/common/pancreas/raw.h5ad

Description:

NA

Format:

AnnData object
 obs: 'celltype', 'batch', 'tissue'
 layers: 'counts'
 uns: 'dataset_id', 'dataset_name', 'data_url', 'data_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'

Slot description:

| Slot | Type | Description |
|:---|:---|:---|
| `obs["celltype"]` | `string` | (Optional) Cell type information. |
| `obs["batch"]` | `string` | (Optional) Batch information. |
| `obs["tissue"]` | `string` | (Optional) Tissue information. |
| `layers["counts"]` | `integer` | Raw counts. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["dataset_name"]` | `string` | Nicely formatted name. |
| `uns["data_url"]` | `string` | (Optional) Link to the original source of the dataset. |
| `uns["data_reference"]` | `string` | (Optional) Bibtex reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
| `uns["dataset_description"]` | `string` | Long description of the dataset. |
| `uns["dataset_organism"]` | `string` | (Optional) The organism of the sample in the dataset. |
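
As a rough illustration of the slot description above, the following sketch checks a dataset for the required slots (those not marked Optional). It uses a plain dict as a stand-in for an AnnData object so that it runs without extra dependencies; `missing_required_slots` and the example values are hypothetical.

```python
# Required slots of a raw dataset, copied from the slot description above.
# Slots marked (Optional) are deliberately not checked.
REQUIRED = {
    "layers": ["counts"],
    "uns": ["dataset_id", "dataset_name", "dataset_summary", "dataset_description"],
}


def missing_required_slots(dataset: dict) -> list:
    """List required slots absent from a dict that mimics the AnnData layout."""
    missing = []
    for group, keys in REQUIRED.items():
        present = dataset.get(group, {})
        missing.extend(f'{group}["{key}"]' for key in keys if key not in present)
    return missing


# Hypothetical stand-in for a raw dataset; the values are placeholders.
example = {
    "obs": {"batch": ["a", "b"]},
    "layers": {"counts": [[1, 0], [0, 3]]},
    "uns": {"dataset_id": "example", "dataset_name": "Example"},
}
print(missing_required_slots(example))
```

With a real file, the same check could be run against the corresponding attributes of the loaded AnnData object.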

Component type: Dataset normalization

Path: src/datasets/normalization

A normalization method which processes the raw counts into a normalized dataset.

Arguments:

| Name | Type | Description |
|:---|:---|:---|
| `--input` | `file` | An unprocessed dataset as output by a dataset loader. |
| `--output` | `file` | (Output) A normalized dataset. |
| `--normalization_id` | `string` | (Optional) The normalization id to store in the dataset metadata. If not specified, the functionality name will be used. |
| `--layer_output` | `string` | (Optional) The name of the layer in which to store the normalized data. Default: `normalized`. |
| `--obs_size_factors` | `string` | (Optional) In which `.obs` slot to store the size factors (if any). Default: `size_factors`. |
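
To make the roles of `--layer_output` and `--obs_size_factors` concrete, here is a minimal NumPy sketch of a log CPM style normalization. It is not the repository's implementation: the size factor definition (total counts divided by one million) and the dict stand-ins for `layers` and `obs` are assumptions made for illustration.

```python
import numpy as np


def log_cpm(counts: np.ndarray):
    """Return (normalized, size_factors) for a cells x genes count matrix."""
    totals = counts.sum(axis=1, keepdims=True).astype(float)
    totals[totals == 0] = 1.0          # avoid division by zero for empty cells
    size_factors = totals / 1e6        # assumption: scale each cell to one million counts
    normalized = np.log1p(counts / size_factors)
    return normalized, size_factors.ravel()


counts = np.array([[10, 0, 90], [5, 5, 0]])

# Dict stand-ins for the AnnData slots; keys follow the documented defaults.
layers = {"counts": counts}
obs = {}
layers["normalized"], obs["size_factors"] = log_cpm(counts)
```

The raw counts stay in `layers["counts"]`; the normalized values are written to a separate layer, which is why later components take a `--layer_input` argument.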

File format: Normalized dataset

A normalized dataset.

Example file: resources_test/common/pancreas/normalized.h5ad

Description:

NA

Format:

AnnData object
 obs: 'celltype', 'batch', 'tissue', 'size_factors'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'data_url', 'data_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'

Slot description:

| Slot | Type | Description |
|:---|:---|:---|
| `obs["celltype"]` | `string` | (Optional) Cell type information. |
| `obs["batch"]` | `string` | (Optional) Batch information. |
| `obs["tissue"]` | `string` | (Optional) Tissue information. |
| `obs["size_factors"]` | `double` | (Optional) The size factors created by the normalization method, if any. |
| `layers["counts"]` | `integer` | Raw counts. |
| `layers["normalized"]` | `double` | Normalized expression values. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["dataset_name"]` | `string` | Nicely formatted name. |
| `uns["data_url"]` | `string` | (Optional) Link to the original source of the dataset. |
| `uns["data_reference"]` | `string` | (Optional) Bibtex reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
| `uns["dataset_description"]` | `string` | Long description of the dataset. |
| `uns["dataset_organism"]` | `string` | (Optional) The organism of the sample in the dataset. |

Component type: PCA

Path: src/datasets/processors

Computes a PCA embedding of the normalized data.

Arguments:

| Name | Type | Description |
|:---|:---|:---|
| `--input` | `file` | A normalized dataset. |
| `--layer_input` | `string` | (Optional) Which layer to use as input. Default: `normalized`. |
| `--output` | `file` | (Output) A normalized dataset with a PCA embedding. |
| `--obsm_embedding` | `string` | (Optional) In which `.obsm` slot to store the resulting embedding. Default: `X_pca`. |
| `--varm_loadings` | `string` | (Optional) In which `.varm` slot to store the resulting loadings matrix. Default: `pca_loadings`. |
| `--uns_variance` | `string` | (Optional) In which `.uns` slot to store the resulting variance objects. Default: `pca_variance`. |
| `--num_components` | `integer` | (Optional) Number of principal components to compute. Defaults to 50, or to the minimum dimension size of the selected representation minus 1, whichever is smaller. |
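
The three PCA outputs and the `--num_components` default can be illustrated with a small SVD-based sketch in NumPy. This is an assumption-laden illustration rather than the component's actual code: the `pca` function is hypothetical, and the variance object is reduced to a plain array of per-component variances.

```python
from typing import Optional

import numpy as np


def pca(x: np.ndarray, num_components: Optional[int] = None):
    """Return (embedding, loadings, variance) of a cells x genes matrix via SVD."""
    max_components = min(x.shape) - 1
    if num_components is None:
        num_components = min(50, max_components)  # documented default
    centered = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    embedding = u[:, :num_components] * s[:num_components]  # cells x components
    loadings = vt[:num_components].T                        # genes x components
    variance = s[:num_components] ** 2 / (x.shape[0] - 1)   # variance per component
    return embedding, loadings, variance


rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))  # toy "normalized" layer: 20 cells, 8 genes

# Dict stand-ins for the AnnData slots; keys follow the documented defaults.
obsm, varm, uns = {}, {}, {}
obsm["X_pca"], varm["pca_loadings"], uns["pca_variance"] = pca(x)
```

With 20 cells and 8 genes, the default resolves to 7 components, since the smaller dimension minus 1 is below 50.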

Component type: HVG

Path: src/datasets/processors

Computes highly variable gene (HVG) scores.

Arguments:

| Name | Type | Description |
|:---|:---|:---|
| `--input` | `file` | A normalized dataset with a PCA embedding. |
| `--layer_input` | `string` | (Optional) Which layer to use as input. Default: `normalized`. |
| `--output` | `file` | (Output) A normalized dataset with a PCA embedding and HVG selection. |
| `--var_hvg` | `string` | (Optional) In which `.var` slot to store whether a feature is considered to be HVG. Default: `hvg`. |
| `--var_hvg_score` | `string` | (Optional) In which `.var` slot to store the gene variance score (normalized dispersion). Default: `hvg_score`. |
| `--num_features` | `integer` | (Optional) The number of HVG to select. Default: 1000. |
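
The relationship between the score stored in `--var_hvg_score` and the boolean flag stored in `--var_hvg` can be sketched as follows. This is a deliberately simplified stand-in: it z-scores the variance-to-mean ratio across all genes, whereas established HVG methods usually normalize dispersion within bins of mean expression. The `select_hvg` function is hypothetical.

```python
import numpy as np


def select_hvg(x: np.ndarray, num_features: int = 1000):
    """Return (hvg, hvg_score) for a cells x genes matrix of non-negative values."""
    mean = x.mean(axis=0)
    variance = x.var(axis=0, ddof=1)
    # Dispersion = variance / mean; genes with zero mean get a dispersion of 0.
    dispersion = np.divide(variance, mean, out=np.zeros_like(variance), where=mean > 0)
    # Simplification: z-score over all genes instead of within expression bins.
    score = (dispersion - dispersion.mean()) / (dispersion.std() or 1.0)
    num_features = min(num_features, x.shape[1])
    top = np.argsort(score)[::-1][:num_features]
    hvg = np.zeros(x.shape[1], dtype=bool)
    hvg[top] = True
    return hvg, score


# Toy matrix: gene 0 is never expressed, gene 1 is constant, gene 2 varies.
x = np.array([[0.0, 1.0, 5.0], [0.0, 1.0, 0.0], [0.0, 1.0, 7.0], [0.0, 1.0, 0.0]])

# Dict stand-in for the .var slots; keys follow the documented defaults.
var = {}
var["hvg"], var["hvg_score"] = select_hvg(x, num_features=1)
```

Only the varying gene is flagged here, while every gene still receives a score.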

Component type: KNN

Path: src/datasets/processors

Computes the k nearest neighbors of each cell.

Arguments:

| Name | Type | Description |
|:---|:---|:---|
| `--input` | `file` | A normalized dataset with a PCA embedding and HVG selection. |
| `--layer_input` | `string` | (Optional) Which layer to use as input. Default: `normalized`. |
| `--output` | `file` | (Output) A normalized dataset with a PCA embedding, HVG selection and a kNN graph. |
| `--key_added` | `string` | (Optional) The neighbors data is added to `.uns[key_added]`, distances are stored in `.obsp[key_added+'_distances']` and connectivities in `.obsp[key_added+'_connectivities']`. Default: `knn`. |
| `--num_neighbors` | `integer` | (Optional) The size of the local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Default: 15. |

File format: Common dataset

A dataset processed by the common dataset processing pipeline.

Example file: resources_test/common/pancreas/dataset.h5ad

Description:

This dataset contains both raw counts and normalized data matrices, as well as a PCA embedding, HVG selection and a kNN graph.

Format:

AnnData object
 obs: 'celltype', 'batch', 'tissue', 'size_factors'
 var: 'hvg', 'hvg_score'
 obsm: 'X_pca'
 obsp: 'knn_distances', 'knn_connectivities'
 varm: 'pca_loadings'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'data_url', 'data_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'pca_variance', 'knn'

Slot description:

| Slot | Type | Description |
|:---|:---|:---|
| `obs["celltype"]` | `string` | (Optional) Cell type information. |
| `obs["batch"]` | `string` | (Optional) Batch information. |
| `obs["tissue"]` | `string` | (Optional) Tissue information. |
| `obs["size_factors"]` | `double` | (Optional) The size factors created by the normalization method, if any. |
| `var["hvg"]` | `boolean` | Whether or not the feature is considered to be a ‘highly variable gene’. |
| `var["hvg_score"]` | `double` | The gene variance score (normalized dispersion) used to rank the features by HVG. |
| `obsm["X_pca"]` | `double` | The resulting PCA embedding. |
| `obsp["knn_distances"]` | `double` | K nearest neighbors distance matrix. |
| `obsp["knn_connectivities"]` | `double` | K nearest neighbors connectivities matrix. |
| `varm["pca_loadings"]` | `double` | The PCA loadings matrix. |
| `layers["counts"]` | `integer` | Raw counts. |
| `layers["normalized"]` | `double` | Normalized expression values. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["dataset_name"]` | `string` | Nicely formatted name. |
| `uns["data_url"]` | `string` | (Optional) Link to the original source of the dataset. |
| `uns["data_reference"]` | `string` | (Optional) Bibtex reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
| `uns["dataset_description"]` | `string` | Long description of the dataset. |
| `uns["dataset_organism"]` | `string` | (Optional) The organism of the sample in the dataset. |
| `uns["pca_variance"]` | `double` | The PCA variance objects. |
| `uns["knn"]` | `object` | Supplementary K nearest neighbors data. |
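
The slot names in the common dataset follow from the default argument values of the normalization, PCA, HVG and KNN components. The sketch below makes that mapping explicit; the `derived_slots` helper is hypothetical, while the default values are the ones documented in the argument tables.

```python
# Default argument values as documented for each component.
DEFAULTS = {
    "layer_output": "normalized",        # normalization
    "obs_size_factors": "size_factors",  # normalization
    "obsm_embedding": "X_pca",           # PCA
    "varm_loadings": "pca_loadings",     # PCA
    "uns_variance": "pca_variance",      # PCA
    "var_hvg": "hvg",                    # HVG
    "var_hvg_score": "hvg_score",        # HVG
    "key_added": "knn",                  # KNN
}


def derived_slots(args: dict) -> dict:
    """Map component arguments to the slots they add to the dataset."""
    key = args["key_added"]
    return {
        "obs": [args["obs_size_factors"]],
        "var": [args["var_hvg"], args["var_hvg_score"]],
        "obsm": [args["obsm_embedding"]],
        "obsp": [key + "_distances", key + "_connectivities"],
        "varm": [args["varm_loadings"]],
        "layers": [args["layer_output"]],
        "uns": [args["uns_variance"], key],
    }


slots = derived_slots(DEFAULTS)
for group, names in slots.items():
    print(group, names)
```

A file produced with non-default arguments would presumably carry different slot names and no longer match the format listed here.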

Component type: Subset

Path: src/datasets/processors

Randomly samples cells and genes.

Arguments:

| Name | Type | Description |
|:---|:---|:---|
| `--input` | `file` | A dataset processed by the common dataset processing pipeline. |
| `--input_mod2` | `file` | (Optional) A dataset processed by the common dataset processing pipeline. |
| `--output` | `file` | (Output) A dataset processed by the common dataset processing pipeline. |
| `--output_mod2` | `file` | (Optional, Output) A dataset processed by the common dataset processing pipeline. |
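
Random subsetting of cells and genes can be sketched with NumPy as shown below. The `subset` function and its `n_cells`, `n_genes` and `seed` parameters are hypothetical and do not appear in the argument table above; the sketch only illustrates sampling without replacement along both axes of a single matrix.

```python
import numpy as np


def subset(matrix: np.ndarray, n_cells: int, n_genes: int, seed: int = 0):
    """Randomly keep n_cells rows and n_genes columns of a cells x genes matrix."""
    rng = np.random.default_rng(seed)
    n_cells = min(n_cells, matrix.shape[0])
    n_genes = min(n_genes, matrix.shape[1])
    # Sample without replacement, then sort to preserve the original order.
    cells = np.sort(rng.choice(matrix.shape[0], size=n_cells, replace=False))
    genes = np.sort(rng.choice(matrix.shape[1], size=n_genes, replace=False))
    return matrix[np.ix_(cells, genes)], cells, genes


matrix = np.arange(20).reshape(5, 4)  # toy data: 5 cells, 4 genes
small, cells, genes = subset(matrix, n_cells=3, n_genes=2, seed=42)
print(small.shape)
```

Returning the sampled indices makes it possible to apply the same selection to every per-cell and per-gene slot, so that the subset stays internally consistent.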