sequence_annotations

This protocol defines annotations on GA4GH genomic sequences. It includes two types of annotations: continuous and discrete hierarchical.

The discrete hierarchical annotations are derived from the Sequence Ontology (SO) and GFF3 work.

The goal is to be able to store annotations using the GFF3 and SO conceptual model, although there is not necessarily a one-to-one mapping between Protobuf messages and GFF3 records.

The minimum requirement is to be able to accurately represent current state-of-the-art annotation data and the full SO model. Feature is the core generic record, which corresponds to a GFF3 record.
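As a rough illustration of how a GFF3 record might map onto the Feature fields listed below, here is a minimal Python sketch. The function name gff3_line_to_feature and the plain-dict representation are hypothetical and not part of the schema. The sketch assumes a single well-formed, tab-separated GFF3 line with one Parent at most, stores feature_type as the raw GFF3 type string rather than an OntologyTerm, and leaves the strand as the raw GFF3 character rather than a Strand value. The coordinate shift reflects that GFF3 positions are 1-based and inclusive, while Feature uses 0-based [start, end) intervals.

```python
def gff3_line_to_feature(line, feature_set_id=""):
    """Map one GFF3 line to a dict keyed by Feature field names (illustrative only)."""
    (seqid, _source, gff_type, gff_start, gff_end,
     score, strand, phase, attr_column) = line.rstrip("\n").split("\t")

    # Parse the ninth column: semicolon-separated key=value pairs.
    attributes = {}
    for pair in attr_column.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = value

    # ID and Parent are fields on Feature, not attributes.
    feature_id = attributes.pop("ID", "")
    parent_id = attributes.pop("Parent", "")  # empty string means "no parent"

    # Score and Phase are GFF3 columns that Feature carries as attributes.
    if score != ".":
        attributes["Score"] = score
    if phase != ".":
        attributes["Phase"] = phase

    return {
        "id": feature_id,
        "parent_id": parent_id,
        "feature_set_id": feature_set_id,
        "reference_name": seqid,
        "start": int(gff_start) - 1,  # 1-based inclusive -> 0-based inclusive
        "end": int(gff_end),          # 1-based inclusive -> 0-based exclusive
        "strand": strand,             # raw GFF3 character; Strand mapping not shown
        "feature_type": gff_type,     # simplification: a real record holds an OntologyTerm
        "attributes": attributes,
    }
```

Note that the interval length is unchanged by the conversion: a GFF3 record covering positions 101 to 200 becomes start 100 and end 200, which is still 100 bases.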

message Feature
Fields:
  • id (string) – Id of this annotation node.
  • name (string) – An optional name to provide for the feature.
  • gene_symbol (string) – The gene symbol the feature occurs on. This field may be replaced with a more generic representation in a future version.
  • parent_id (string) – Parent Id of this node. Set to empty string if node has no parent.
  • child_ids (string) – Ordered array of Child Ids of this node. Since not all child nodes are ordered by genomic coordinates, this can’t always be reconstructed from the parent_id values of the children alone.
  • feature_set_id (string) – Identifier for the containing feature set.
  • reference_name (string) – The reference on which this feature occurs (e.g. chr20 or X).
  • start (long) – The start position at which this feature occurs (0-based). This corresponds to the first base of the string of reference bases. Genomic positions are non-negative integers less than the reference length. Features spanning the join of circular genomes are represented as two features, one on each side of the join (position 0).
  • end (long) – The end position (exclusive), resulting in [start, end) closed-open interval. This is typically calculated by start + reference_bases.length.
  • strand (Strand) – The strand on which the feature is present.
  • feature_type (OntologyTerm) – Feature that is annotated by this region. Normally, this will be a term in the Sequence Ontology.
  • attributes (Attributes) – Name/value attributes of the annotation. Attribute names follow the GFF3 naming convention: reserved names start with an upper-case character, and user-defined names start with a lower-case character. Most GFF3 predefined attributes apply; the exceptions are ID and Parent, which are defined as fields. Additionally, the following attributes are added:
      • Score – the GFF3 score column.
      • Phase – the GFF3 phase column for CDS features.

Node in the annotation graph that annotates a contiguous region of a sequence.
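The rule for circular genomes given under start can be sketched as follows. This is an illustrative Python helper, not part of the schema: the name split_at_circular_join and its parameters are hypothetical, and it assumes the caller knows the feature's length in bases and the reference length. It returns the [start, end) intervals that would each become a separate Feature.

```python
def split_at_circular_join(start, length, reference_length):
    """Return a list of 0-based [start, end) intervals for a feature of
    `length` bases beginning at `start` on a circular reference."""
    if not 0 <= start < reference_length:
        raise ValueError("start must satisfy 0 <= start < reference_length")
    if not 0 < length <= reference_length:
        raise ValueError("length must satisfy 0 < length <= reference_length")

    end = start + length
    if end <= reference_length:
        # The feature does not cross the join, so one interval suffices.
        return [(start, end)]

    # The feature crosses position 0: one piece up to the end of the
    # reference, and one piece starting at position 0.
    return [(start, reference_length), (0, end - reference_length)]
```

For example, on a reference of 100 bases, a 20-base feature starting at position 90 yields the intervals (90, 100) and (0, 10), while the same feature starting at position 10 yields the single interval (10, 30).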

message FeatureSet
Fields:
  • id (string) – The ID of this annotation set.
  • dataset_id (string) – The ID of the dataset this annotation set belongs to.
  • reference_set_id (string) – The ID of the reference set which defines the coordinate-space for this set of annotations.
  • name (string) – The display name for this annotation set.
  • source_uri (string) – The source URI describing the file from which this annotation set was generated, if any.
  • attributes (Attributes) – A map of additional Feature Set information.

A set of sequence feature annotations.