Table Fields

Core Table Fields

This section describes the common table fields. Generally, the pk field is an integer primary key that is to be automaticaly generated (i.e. autoincrement in RDBMS). The field secondary_id is an identifier assigned by the “data owner” (e.g., the collaboration partner). This identifier has to be unique within a given project but can be ambiguous globally.

A possible best practice is to enforce the secondary_id to only consist of alphanumeric characters and underscores. Then, they should be constructed as (none of the <Field> values should contain a hyphen itself):

<BioEntity>-<BioSample>-<TestSample>-<NGSLibrary>

(of course only up to “BioSample” for BioSamples etc.).

Examples are:

  • BioEntity secondary ids: 2355, BIH-234
  • BioSample secondary ids:
    • 2355-B1 (first blood sample from patient 2355)
    • BIH_234-N1 (first normal sample from patient BIH-234)
    • BIH_234-T2 (second tumor sample from patient BIH-234)
  • TestSample secondary ids:
    • 2355-B1-DNA1 (first DNA extraction from first blood sample)
    • BIH_234-T1-RNA1 (first RNA extraction from first tumor sample)
    • BIH_234-T2-DNA2 (second DNA extraction from second tumor sample)

Generally, the following are “core fields”.

BioEntity

  • pk: integer
  • secondary_id: string

BioSample

  • pk: integer
  • bio_entity: fk to BioEntity.pk
  • secondary_id: string

TestSample

  • pk: integer
  • bio_sample: fk to BioSample.pk
  • secondary_id: string

NGSLibrary

  • pk: integer
  • test_sample: fk to TestSample.pk
  • secondary_id: string

FlowCell

  • pk: integer
  • machine_name: string
  • flowcell_name: string

NGSLibraryOnFlowCell

  • pk: integer
  • ngs_library: fk to NGSLibrary.pk
  • flowcell: fk to FlowCell
  • lane: int

Common Table Fields

For many major use cases, the following table fields are useful additions to get a list of “common fields”.

For all tables, adding a list of strings with external IDs (e.g., called “external_ids”) is recommendable. This way, external resources can be linked out to. A recommendation is to use URLs for giving reads an unambiguous prefix. These URLs can be pseudo URLs or real entry points in remote REST APIs. Further, each record has a meta_data field for structured data in JSON format.

BioEntity

  • affected: boolean, optional field for specifying the “affected” state in rare disease studies
  • sex: {‘male’, ‘female’, ‘unknown’}, optional field for person’s sex in germline studies
  • father: fk to BioEntity.pk, optional fields for linking to father
  • mother: fk to BioEntity.pk, optional fields for linking to mother

BioSample

  • cell_type: string with controlled vocabulary, optional field for specifying cell type

TestSample

  • extraction_type: controlled vocabulary with extraction type, e.g. {‘DNA’, ‘RNA’} or a superset thereof; optional field for describing extracted data

NGSLibrary

  • library_kind: controlled vocabulary with library preparation type, e.g., {‘WES’, ‘WGS’, ‘RNA-seq’, ‘other’} or a superset thereof; required field for describing library type
  • kit: controlled vocabulary describing kit and version used for targeted sequencing, or RNA amplifcation method

NGSLIbraryOnFC

  • adapter_name: string, optional field describing name of used adapter barcode(s)
  • adapter_seq: string, optional field giving sequence of used adapter barcode(s)