```mermaid
graph LR
  normalization:::group
  dataset_processors:::group
  raw_dataset["Raw dataset"]:::anndata
  common_dataset[Common<br/>dataset]:::anndata
  test_dataset[Test<br/>dataset]:::anndata
  dataset_loader[/Dataset<br/>loader/]:::component
  subgraph normalization [Normalization methods]
    log_cpm[/"Log CPM"/]:::component
    l1_sqrt[/"L1 sqrt"/]:::component
    log_scran_pooling[/"Log scran<br/>pooling"/]:::component
    sqrt_cpm[/Sqrt CPM/]:::component
  end
  subgraph dataset_processors[Dataset processors]
    pca[/PCA/]:::component
    hvg[/HVG/]:::component
    knn[/KNN/]:::component
  end
  dataset_loader --> raw_dataset --> log_cpm & l1_sqrt & log_scran_pooling & sqrt_cpm --> pca --> hvg --> knn --> common_dataset
  subset[/Subset/]:::component
  common_dataset --> subset --> test_dataset
```
Figure 1: Overview of the dataset processing pipeline.
src/datasets/
The dataset processing pipeline uses dataset loaders to create raw dataset files (Figure 1). The raw dataset files are then processed to generate common dataset files. Common dataset files are used in one or more tasks.
Directory structure
- Dataset file and component formats (src/datasets/api): This folder contains specifications for dataset file formats and component interfaces. This documentation page was mostly generated by reading in these files.
- Dataset loaders (src/datasets/loaders): This folder contains components to load and format datasets from various sources.
- Dataset normalization (src/datasets/normalization): This folder contains various dataset normalization methods.
- Dataset processors (src/datasets/processors): This folder contains components for processing datasets, such as computing a PCA embedding, HVG selection or KNN graph, or subsetting the dataset.
- Resource generation scripts (src/common/resources_scripts): This folder contains scripts for generating the datasets using the dataset loaders, normalization methods and processors.
- Test resource generation scripts (src/common/resources_test_scripts): This folder contains scripts for generating test resources.
Component type: Dataset loader
Path: src/datasets/loaders
A component which generates a “Raw dataset”.
Arguments:
| Name | Type | Description |
|---|---|---|
| `--output` | file | (Output) An unprocessed dataset as output by a dataset loader. |
File format: Raw dataset
An unprocessed dataset as output by a dataset loader.
Example file: resources_test/common/pancreas/raw.h5ad
Description:
NA
Format:
AnnData object
obs: 'celltype', 'batch', 'tissue'
layers: 'counts'
uns: 'dataset_id', 'dataset_name', 'data_url', 'data_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'
Slot description:
| Slot | Type | Description |
|---|---|---|
| `obs["celltype"]` | string | (Optional) Cell type information. |
| `obs["batch"]` | string | (Optional) Batch information. |
| `obs["tissue"]` | string | (Optional) Tissue information. |
| `layers["counts"]` | integer | Raw counts. |
| `uns["dataset_id"]` | string | A unique identifier for the dataset. |
| `uns["dataset_name"]` | string | Nicely formatted name. |
| `uns["data_url"]` | string | (Optional) Link to the original source of the dataset. |
| `uns["data_reference"]` | string | (Optional) BibTeX reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | string | Short description of the dataset. |
| `uns["dataset_description"]` | string | Long description of the dataset. |
| `uns["dataset_organism"]` | string | (Optional) The organism of the sample in the dataset. |
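As a concrete illustration of this format, the sketch below builds a minimal raw dataset with anndata and writes it to disk. The toy counts, metadata values and the output filename are placeholders, not part of the specification.

```python
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy data; a real dataset loader would download and parse an external source.
n_obs, n_var = 100, 500
counts = csr_matrix(np.random.poisson(1.0, size=(n_obs, n_var)).astype(np.int32))

adata = ad.AnnData(
    X=counts,
    obs=pd.DataFrame({
        "celltype": ["unknown"] * n_obs,  # optional
        "batch": ["batch_1"] * n_obs,     # optional
        "tissue": ["pancreas"] * n_obs,   # optional
    }),
    uns={
        "dataset_id": "my_dataset",
        "dataset_name": "My dataset",
        "dataset_summary": "Short description of the dataset.",
        "dataset_description": "Long description of the dataset.",
        "data_url": "https://example.com",       # optional
        "data_reference": "@misc{placeholder}",  # optional
        "dataset_organism": "homo_sapiens",      # optional
    },
)
# The specification stores the raw counts in a layer rather than in .X.
adata.layers["counts"] = adata.X

adata.write_h5ad("raw.h5ad", compression="gzip")
```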
Component type: Dataset normalization
Path: src/datasets/normalization
A normalization method which processes the raw counts into a normalized dataset.
Arguments:
| Name | Type | Description |
|---|---|---|
| `--input` | file | An unprocessed dataset as output by a dataset loader. |
| `--output` | file | (Output) A normalized dataset. |
| `--layer_output` | string | (Optional) The name of the layer in which to store the normalized data. Default: `normalized`. |
| `--obs_size_factors` | string | (Optional) In which `.obs` slot to store the size factors (if any). Default: `size_factors`. |
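A minimal sketch of what a normalization component with this interface might do, using log CPM via Scanpy as the method. The hard-coded file paths and the `par` dict stand in for the component's actual argument parsing, which is not shown here.

```python
import anndata as ad
import scanpy as sc

# Hypothetical argument values; a real component would read these from the CLI.
par = {
    "input": "raw.h5ad",
    "output": "normalized.h5ad",
    "layer_output": "normalized",
    "obs_size_factors": "size_factors",
}

adata = ad.read_h5ad(par["input"])

# Log CPM: scale each cell to one million counts, then log1p-transform.
adata.layers[par["layer_output"]] = adata.layers["counts"].copy()
sc.pp.normalize_total(
    adata,
    target_sum=1e6,
    layer=par["layer_output"],
    key_added=par["obs_size_factors"],
)
sc.pp.log1p(adata, layer=par["layer_output"])

adata.write_h5ad(par["output"], compression="gzip")
```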
File format: Normalized dataset
A normalized dataset.
Example file: resources_test/common/pancreas/normalized.h5ad
Description:
NA
Format:
AnnData object
obs: 'celltype', 'batch', 'tissue', 'size_factors'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_name', 'data_url', 'data_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'
Slot description:
| Slot | Type | Description |
|---|---|---|
| `obs["celltype"]` | string | (Optional) Cell type information. |
| `obs["batch"]` | string | (Optional) Batch information. |
| `obs["tissue"]` | string | (Optional) Tissue information. |
| `obs["size_factors"]` | double | (Optional) The size factors created by the normalization method, if any. |
| `layers["counts"]` | integer | Raw counts. |
| `layers["normalized"]` | double | Normalized expression values. |
| `uns["dataset_id"]` | string | A unique identifier for the dataset. |
| `uns["dataset_name"]` | string | Nicely formatted name. |
| `uns["data_url"]` | string | (Optional) Link to the original source of the dataset. |
| `uns["data_reference"]` | string | (Optional) BibTeX reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | string | Short description of the dataset. |
| `uns["dataset_description"]` | string | Long description of the dataset. |
| `uns["dataset_organism"]` | string | (Optional) The organism of the sample in the dataset. |
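A quick way to check a file against this format is to load it and assert that the required slots are present, as in the sketch below; optional slots are deliberately left out of the checks.

```python
import anndata as ad

adata = ad.read_h5ad("resources_test/common/pancreas/normalized.h5ad")

# Required layers.
assert "counts" in adata.layers
assert "normalized" in adata.layers

# Required dataset metadata.
for key in ["dataset_id", "dataset_name", "dataset_summary", "dataset_description"]:
    assert key in adata.uns, f"missing required .uns field: {key}"

# Optional slots such as obs['celltype'] or obs['size_factors'] may be absent.
print(adata)
```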
Component type: PCA
Path: src/datasets/processors
Computes a PCA embedding of the normalized data.
Arguments:
| Name | Type | Description |
|---|---|---|
| `--input` | file | A normalized dataset. |
| `--layer_input` | string | (Optional) Which layer to use as input. Default: `normalized`. |
| `--output` | file | (Output) A normalized dataset with a PCA embedding. |
| `--obsm_embedding` | string | (Optional) In which `.obsm` slot to store the resulting embedding. Default: `X_pca`. |
| `--varm_loadings` | string | (Optional) In which `.varm` slot to store the resulting loadings matrix. Default: `pca_loadings`. |
| `--uns_variance` | string | (Optional) In which `.uns` slot to store the resulting variance objects. Default: `pca_variance`. |
| `--num_components` | integer | (Optional) Number of principal components to compute. Defaults to 50, or to one less than the smallest dimension of the selected representation if that is lower. |
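The sketch below shows roughly how such a component could be implemented with Scanpy. Scanpy writes its PCA results to `X_pca`, `PCs` and `uns["pca"]`; copying them into the slot names documented above is an assumption about the intended behaviour, and the file paths are placeholders.

```python
import anndata as ad
import scanpy as sc

adata = ad.read_h5ad("normalized.h5ad")  # placeholder input path

# Run PCA on the normalized layer (the default --layer_input).
adata.X = adata.layers["normalized"]
sc.pp.pca(adata, n_comps=50)

# Map Scanpy's default output slots onto the documented slot names.
adata.varm["pca_loadings"] = adata.varm["PCs"]
adata.uns["pca_variance"] = {
    "variance": adata.uns["pca"]["variance"],
    "variance_ratio": adata.uns["pca"]["variance_ratio"],
}

adata.write_h5ad("pca.h5ad", compression="gzip")
```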
Component type: HVG
Path: src/datasets/processors
Computes the highly variable gene scores.
Arguments:
| Name | Type | Description |
|---|---|---|
| `--input` | file | A normalized dataset with a PCA embedding. |
| `--layer_input` | string | (Optional) Which layer to use as input. Default: `normalized`. |
| `--output` | file | (Output) A normalized dataset with a PCA embedding and HVG selection. |
| `--var_hvg` | string | (Optional) In which `.var` slot to store whether a feature is considered to be highly variable. Default: `hvg`. |
| `--var_hvg_score` | string | (Optional) In which `.var` slot to store the gene variance score (normalized dispersion). Default: `hvg_score`. |
| `--num_features` | integer | (Optional) The number of HVGs to select. Default: `1000`. |
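A sketch of how the HVG scores could be computed with Scanpy's `highly_variable_genes`; using the normalized dispersion as the score and copying it into the documented columns is an assumption, and the file paths are placeholders.

```python
import anndata as ad
import scanpy as sc

adata = ad.read_h5ad("pca.h5ad")  # placeholder input path

# Score genes on the normalized layer and flag the top 1000 (default --num_features).
sc.pp.highly_variable_genes(
    adata, layer="normalized", n_top_genes=1000, flavor="cell_ranger"
)

# Copy Scanpy's default columns into the documented slot names.
adata.var["hvg"] = adata.var["highly_variable"]
adata.var["hvg_score"] = adata.var["dispersions_norm"]

adata.write_h5ad("hvg.h5ad", compression="gzip")
```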
Component type: KNN
Path: src/datasets/processors
Computes the k-nearest-neighbours for each cell.
Arguments:
| Name | Type | Description |
|---|---|---|
| `--input` | file | A normalized dataset with a PCA embedding and HVG selection. |
| `--layer_input` | string | (Optional) Which layer to use as input. Default: `normalized`. |
| `--output` | file | (Output) A normalized dataset with a PCA embedding, HVG selection and a kNN graph. |
| `--key_added` | string | (Optional) The neighbors data is added to `.uns[key_added]`, distances are stored in `.obsp[key_added+'_distances']` and connectivities in `.obsp[key_added+'_connectivities']`. Default: `knn`. |
| `--num_neighbors` | integer | (Optional) The size of the local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Default: `15`. |
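Scanpy's `neighbors` function matches this interface closely, since its `key_added` argument controls the same `.uns`/`.obsp` slot names. The sketch below is one plausible implementation, with placeholder file paths.

```python
import anndata as ad
import scanpy as sc

adata = ad.read_h5ad("hvg.h5ad")  # placeholder input path

# Build the kNN graph on the PCA embedding. With key_added="knn", Scanpy stores
# .uns["knn"], .obsp["knn_distances"] and .obsp["knn_connectivities"].
sc.pp.neighbors(adata, n_neighbors=15, use_rep="X_pca", key_added="knn")

adata.write_h5ad("dataset.h5ad", compression="gzip")
```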
File format: Common dataset
A dataset processed by the common dataset processing pipeline.
Example file: resources_test/common/pancreas/dataset.h5ad
Description:
This dataset contains both raw counts and normalized data matrices, as well as a PCA embedding, HVG selection and a kNN graph.
Format:
AnnData object
obs: 'celltype', 'batch', 'tissue', 'size_factors'
var: 'hvg', 'hvg_score'
obsm: 'X_pca'
obsp: 'knn_distances', 'knn_connectivities'
varm: 'pca_loadings'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_name', 'data_url', 'data_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'pca_variance', 'knn'
Slot description:
| Slot | Type | Description |
|---|---|---|
| `obs["celltype"]` | string | (Optional) Cell type information. |
| `obs["batch"]` | string | (Optional) Batch information. |
| `obs["tissue"]` | string | (Optional) Tissue information. |
| `obs["size_factors"]` | double | (Optional) The size factors created by the normalization method, if any. |
| `var["hvg"]` | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
| `var["hvg_score"]` | integer | A ranking of the features by their HVG score. |
| `obsm["X_pca"]` | double | The resulting PCA embedding. |
| `obsp["knn_distances"]` | double | K nearest neighbors distance matrix. |
| `obsp["knn_connectivities"]` | double | K nearest neighbors connectivities matrix. |
| `varm["pca_loadings"]` | double | The PCA loadings matrix. |
| `layers["counts"]` | integer | Raw counts. |
| `layers["normalized"]` | double | Normalized expression values. |
| `uns["dataset_id"]` | string | A unique identifier for the dataset. |
| `uns["dataset_name"]` | string | Nicely formatted name. |
| `uns["data_url"]` | string | (Optional) Link to the original source of the dataset. |
| `uns["data_reference"]` | string | (Optional) BibTeX reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | string | Short description of the dataset. |
| `uns["dataset_description"]` | string | Long description of the dataset. |
| `uns["dataset_organism"]` | string | (Optional) The organism of the sample in the dataset. |
| `uns["pca_variance"]` | double | The PCA variance objects. |
| `uns["knn"]` | object | Supplementary K nearest neighbors data. |
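Downstream tasks can read a common dataset file with anndata and access the documented slots directly, as in this brief sketch using the example file above.

```python
import anndata as ad

adata = ad.read_h5ad("resources_test/common/pancreas/dataset.h5ad")

counts = adata.layers["counts"]                # raw counts
normalized = adata.layers["normalized"]        # normalized expression
embedding = adata.obsm["X_pca"]                # PCA embedding
hvg_mask = adata.var["hvg"].to_numpy()         # highly variable gene selection
knn_graph = adata.obsp["knn_connectivities"]   # kNN connectivities

print(adata.uns["dataset_id"], adata.shape)
```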
Component type: Subset
Path: src/datasets/processors
Samples cells and genes randomly.
Arguments:
| Name | Type | Description |
|---|---|---|
| `--input` | file | A dataset processed by the common dataset processing pipeline. |
| `--output` | file | (Output) A dataset processed by the common dataset processing pipeline. |
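A rough sketch of random subsetting with anndata. The subset sizes and the output path are placeholders, and a real component would likely also recompute or drop derived slots (PCA embedding, kNN graph) that are no longer valid after subsetting; that step is not shown here.

```python
import anndata as ad
import numpy as np

adata = ad.read_h5ad("resources_test/common/pancreas/dataset.h5ad")

# Placeholder subset sizes; the component's actual defaults are not documented here.
rng = np.random.default_rng(0)
keep_obs = rng.choice(adata.n_obs, size=min(500, adata.n_obs), replace=False)
keep_var = rng.choice(adata.n_vars, size=min(1000, adata.n_vars), replace=False)

# Subset cells first, then genes, and materialize the view as a new object.
test_dataset = adata[keep_obs, :][:, keep_var].copy()
test_dataset.write_h5ad("test_dataset.h5ad", compression="gzip")
```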