JSON Sample Sheets

Rough Structure

Overall, a JSON sample sheet looks as follows.

The sheet is described as a JSON object and is given an ID, a title, and a description.

{
    "id": "https://omics.cubi.bihealth.org/experiments/33",
    "title": "Tumor/Normal Study Example",
    "description": "Example biomedical sheet for standard tumor/normal study",

This is followed by a section describing the optional additional fields for each of the objects.

    "extraInfoDefs": {

The extra fields can be defined directly in the sheet, e.g., as in the following example, which refers to the NCBI organism taxonomy.

        "bioEntity": {
            "ncbiTaxon": {
                "docs": "Reference to NCBI taxonomy",
                "key": "taxon",
                "type": "string",
                "pattern": "^NCBITaxon_[1-9][0-9]*$"
            }
        },

Alternatively, they can refer to the built-in standard fields bundled with the biomedsheets module.

        "bioSample": {
            "uberonCellSource": { "$ref": "resource://biomedsheets/data/std_fields.json#/extraInfoDefs/template/uberonCellSource" }
        },

There can be field definitions for each data type.

        "testSample": {},
        "ngsLibrary": {},
        "msProteinPool": {}
    },

Then, the bio entities are given. They are stored in a JSON object/map. The attribute name/key is the secondary ID, which has to be unique within the project. Each BioEntity must have a primary key (pk) and can have extra IDs and additional information (as described in extraInfoDefs above).

    "bioEntities": {
        "BIH_001": {
            "pk": "123001",
            "extraIds": [
                "http://cancer-registry.hospital.de/PAT12345",
                "http://virtual-cuts.pathology.hospital.de/SMPL000021"
            ],
            "extraInfo": {
                "ncbiTaxon": "NCBITaxon_9606"
            },
            "bioSamples": {

Then, each BioEntity can have a number of BioSamples. Note that the secondary ID is given without the prefix of the secondary ID of the containing BioEntity. The BioSample must have a global ID pk and can have extra information attached (and, of course, extra IDs, omitted here).

                "N1": {
                    "pk": "234001",
                    "extraInfo": {
                        "uberonCellSource": "UBERON:0000178"
                    },

Recursively, each BioSample can have a number of TestSamples, each of which can have a number of NGSLibrary and MSProteinPool objects.

                    "testSamples": {
                        "DNA1": {
                            "pk": "345001",
                            "extraInfo": { "extractionType": "DNA" },
                            "ngsLibraries": {
                                "WES1": {
                                    "pk": "567001",
                                    "extraInfo": { "libraryType": "WES" }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
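
The nesting described above can be walked with a short recursive function. The following is a minimal sketch using only the Python standard library, not the biomedsheets API. It assumes the top-level key is named bioEntities (by analogy with bioSamples), and the "-" separator used to join the secondary IDs into a full name is an illustrative choice.

```python
import json

# Trimmed copy of the example sheet; extraInfo and extraIds are omitted.
SHEET_JSON = """
{
    "bioEntities": {
        "BIH_001": {
            "pk": "123001",
            "bioSamples": {
                "N1": {
                    "pk": "234001",
                    "testSamples": {
                        "DNA1": {
                            "pk": "345001",
                            "ngsLibraries": {
                                "WES1": {"pk": "567001"}
                            }
                        }
                    }
                }
            }
        }
    }
}
"""

# Key of the child collection at each level, from the top of the hierarchy down.
LEVELS = ["bioEntities", "bioSamples", "testSamples", "ngsLibraries"]


def iter_ngs_libraries(sheet):
    """Yield (full_name, pk) for every NGS library in the sheet."""

    def walk(node, depth, prefix):
        if depth == len(LEVELS):
            # Reached an NGS library: join the secondary IDs of all ancestors.
            yield "-".join(prefix), node["pk"]
            return
        for name, child in node.get(LEVELS[depth], {}).items():
            yield from walk(child, depth + 1, prefix + [name])

    yield from walk(sheet, 0, [])


libraries = list(iter_ngs_libraries(json.loads(SHEET_JSON)))
print(libraries)
```

Because each level stores only its own short secondary ID, the full name has to be assembled from the path through the hierarchy, which is what the prefix list does here.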

Sheet Validation

Validation of sample sheets has four steps:

  1. the sheet must be valid JSON,
  2. expansion of JSON pointers { "$ref": "<URL>" } is performed,
  3. the sheet must conform to the JSON schema bundled with biomedsheets (in the future it will be versioned and available at some URL),
  4. additional validation based on extraInfoDefs is performed.
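
As an illustration of steps 1 and 2, the following sketch parses a sheet fragment and expands { "$ref": "<URL>" } objects. It is not the code shipped with biomedsheets: expand_refs, the resolve callback, the example URL, and the contents of STD_FIELDS are all hypothetical placeholders, and cyclic references are not handled.

```python
import json


def expand_refs(node, resolve):
    """Return a copy of node with every {"$ref": url} replaced by resolve(url)."""
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            # Expand the target as well, in case it contains further references.
            return expand_refs(resolve(node["$ref"]), resolve)
        return {key: expand_refs(value, resolve) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_refs(item, resolve) for item in node]
    return node


# Hypothetical stand-in for a file of centrally defined fields (made-up content).
STD_FIELDS = {
    "resource://example/std_fields.json#/uberonCellSource": {
        "docs": "Placeholder definition for illustration",
        "type": "string",
    }
}

raw = (
    '{"bioSample": {"uberonCellSource": '
    '{"$ref": "resource://example/std_fields.json#/uberonCellSource"}}}'
)

# Step 1: json.loads raises json.JSONDecodeError if the text is not valid JSON.
sheet = json.loads(raw)
# Step 2: replace each reference by the definition it points to.
expanded = expand_refs(sheet, STD_FIELDS.__getitem__)
print(json.dumps(expanded, indent=4))
```

Passing the lookup in as a callback keeps the traversal independent of how a URL is actually resolved (bundled resource, local file, or remote document).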

Steps 1 and 3 can be performed with standard tools or libraries. Step 2 is relatively simple, and the biomedsheets module ships with code for it (the functionality is also available as a Python program). Step 4 is not implemented yet.
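
To give an idea of what a check based on extraInfoDefs might look like, here is a rough sketch that validates string fields against their "pattern". The function check_extra_info and its rules (for example, reporting fields that have no definition) are assumptions made for illustration, not biomedsheets behavior.

```python
import re

# Field definition taken from the example sheet above.
EXTRA_INFO_DEFS = {
    "ncbiTaxon": {
        "docs": "Reference to NCBI taxonomy",
        "key": "taxon",
        "type": "string",
        "pattern": "^NCBITaxon_[1-9][0-9]*$",
    }
}


def check_extra_info(extra_info, defs):
    """Return a list of problem descriptions; the list is empty if none were found."""
    problems = []
    for name, value in extra_info.items():
        definition = defs.get(name)
        if definition is None:
            # Assumption: a field without a definition is reported as a problem.
            problems.append(f"{name}: no definition in extraInfoDefs")
            continue
        if definition.get("type") == "string":
            if not isinstance(value, str):
                problems.append(f"{name}: expected a string")
            elif "pattern" in definition and not re.search(definition["pattern"], value):
                problems.append(f"{name}: {value!r} does not match pattern")
    return problems


ok = check_extra_info({"ncbiTaxon": "NCBITaxon_9606"}, EXTRA_INFO_DEFS)
bad = check_extra_info({"ncbiTaxon": "NCBITaxon_0"}, EXTRA_INFO_DEFS)
print(ok)
print(bad)
```

Only the string type is handled here; the other simple types would need their own branches.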

On the one hand, custom fields allow for the definition of arbitrary “simple” values. Currently, it is possible to have booleans, numbers, strings, enums, and lists of the atomic types. On the other hand, using JSON pointers, centrally defined field types can be used. This allows for easy sharing of data types and easier computat