Nomenclature

The following nomenclature may be used to describe both Regulatory Fusions and Chimeric Transcript Fusions, in the context of Categorical Gene Fusions or Assayed Gene Fusions as applicable. The nomenclature components are organized into three categories: Gene Components, Transcript Sequence Components, and Regulatory Nomenclature. Components from these categories may be used interchangeably, in accordance with the General Rules below.

General Rules

All components are joined together by the double-colon (::) operator. Additional rules apply for sub-components of Regulatory Nomenclature.
When describing Chimeric Transcript Fusions, structural components are ordered in 5’ to 3’ orientation with respect to the transcribed gene product.
When describing Regulatory Fusions, the regulatory element is indicated first (e.g. reg_e@IGH::MYC).
When describing Chimeric Transcript Fusions by Junction Components (in lieu of full Transcript Segment Components), the 5’ fusion partner junction must be the first component, and the 3’ fusion partner junction must be the last component.
Throughout the nomenclature components, some information may be provided optionally. In these cases, the optional text is colored orange and may be omitted.
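
As a minimal sketch of the joining and ordering rules above, the following Python functions assemble already-formatted component strings with the :: operator and reject a regulatory element that is not in first position. The function names and the check on the reg_ prefix are illustrative assumptions, not part of the nomenclature itself.

```python
SEPARATOR = "::"


def join_components(components):
    """Join already-formatted component strings with the double-colon operator."""
    for position, component in enumerate(components):
        # Assumption: a regulatory element is recognized by its "reg_" class prefix.
        if component.startswith("reg_") and position != 0:
            raise ValueError("a regulatory element must be the first component")
    return SEPARATOR.join(components)


def split_components(fusion):
    """Split a fusion string back into its top-level components."""
    return fusion.split(SEPARATOR)
```

For example, join_components(["reg_e@IGH", "MYC"]) returns reg_e@IGH::MYC, while passing the same two components in the opposite order raises a ValueError.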

Gene Components

Gene components are used in coarse representation of gene fusions by constituent gene partners, and are generally aligned with previous recommendations on gene-gene fusion nomenclature as provided by HGNC [Bruford2021]. The most common of these is the Specific Gene Component, which is complemented by the Multiple Possible Gene Component (for Categorical Gene Fusions) and the Unknown Gene Component (for Assayed Gene Fusions).

Specific Gene Component

The syntax for a specific gene is as follows:

First use of a gene in a document: <Gene Symbol>(<Gene ID>)

Subsequent use in a document: <Gene Symbol>(<Gene ID>), where the parenthesized Gene ID is optional and may be omitted

An example fusion using two Specific Gene Components:

BCR(hgnc:1014)::ABL1(hgnc:76)

Unknown Gene Component

The syntax for an unknown (typically inferred) gene component (used for Assayed Gene Fusions) is a question mark (?).

An example fusion using an unknown gene component may be inferred from an ALK break-apart assay:

?::ALK(hgnc:427)

Multiple Possible Gene Component

The syntax for a multiple possible gene component (used for Categorical Gene Fusions) is a v.

An example fusion using a multiple possible gene component is the “ALK Fusions” concept as seen in biomedical knowledgebases (e.g. CIViC ALK Fusion, OncoKB ALK Fusions):

v::ALK(hgnc:427)
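
The three gene components can be told apart mechanically. The following Python sketch classifies a single component string as specific, unknown, or multiple possible. The regular expression and the returned dictionary layout are assumptions made for illustration, and the sketch treats the parenthesized Gene ID as optional on the assumption that it is the optional text referred to in the General Rules.

```python
import re

# Hypothetical pattern: a gene symbol followed by an optional parenthesized ID.
GENE_PATTERN = re.compile(r"^(?P<symbol>[A-Za-z0-9-]+)(?:\((?P<gene_id>[^()]+)\))?$")


def parse_gene_component(text):
    """Classify one gene component string and extract its parts."""
    if text == "?":
        return {"type": "unknown"}
    if text == "v":
        return {"type": "multiple_possible"}
    match = GENE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a gene component: {text!r}")
    # gene_id is None when the parenthesized ID was omitted.
    return {"type": "specific", "symbol": match["symbol"], "gene_id": match["gene_id"]}
```

Applied to each side of v::ALK(hgnc:427), this yields a multiple possible component followed by a specific component with symbol ALK and ID hgnc:427.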

Transcript Sequence Components

Transcript sequence components are used in precise representation of gene fusions by sequence representations, and are designed for compatibility with the Human Genome Variation Society (HGVS) variant nomenclature. Primary among these components is the Transcript Segment Component, and the closely related 5-prime and 3-prime Junction Components. Additional components are used to represent intervening sequences, provided as a stand-alone literal sequence (Linker Sequence Component) or as a sequence derived from a Genomic Location (Templated Linker Sequence Component).

Transcript Segment Component

The Transcript Segment Component explicitly describes a segment of transcript sequence by its start and end exons, and is represented using the following syntax:

<Transcript ID>(<Gene Symbol>):e.<start exon><+/- offset>_<end exon><+/- offset>

An omitted offset indicates that there is no offset from the segment boundary (which is often the case in gene fusions). For a full description of the use of exon coordinates and offsets, see Structural Elements.

Transcript segment components would be used, for example, to represent COSMIC Fusion 165 (COSF165) under the gene fusion nomenclature as follows:

ENST00000397938.6(EWSR1):e.1_7::ENST00000527786.6(FLI1):e.6_9
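
The syntax above can be sketched as a small Python formatter. The function and parameter names are hypothetical, and representing "no offset" as the integer 0 is an assumption of this sketch rather than something the nomenclature prescribes.

```python
def format_offset(offset):
    """Render a signed offset; an offset of 0 is omitted entirely."""
    if offset == 0:
        return ""
    return f"{offset:+d}"


def transcript_segment(transcript_id, gene_symbol, start_exon, end_exon,
                       start_offset=0, end_offset=0):
    """Format a Transcript Segment Component from its parts."""
    return (
        f"{transcript_id}({gene_symbol}):e."
        f"{start_exon}{format_offset(start_offset)}"
        f"_{end_exon}{format_offset(end_offset)}"
    )
```

Calling transcript_segment("ENST00000397938.6", "EWSR1", 1, 7) reproduces the first component of the COSF165 example, ENST00000397938.6(EWSR1):e.1_7.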

Junction Components

The 5-prime and 3-prime Junction Components represent only 5-prime and 3-prime junction locations, respectively, for Chimeric Transcript Fusions. These components contrast with the Transcript Segment Component which represents a full segment. As noted in the General Rules, these components must be used only as the beginning or ending components, respectively, for a fusion.

The syntax for these components follows:

5-prime Junction Component: <Transcript ID>(<Gene Symbol>):e.<end exon><+/- offset>

3-prime Junction Component: <Transcript ID>(<Gene Symbol>):e.<start exon><+/- offset>

Optional use of offsets has the same meaning as in the Transcript Segment Component.
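
As an illustration of the junction syntax, the following Python sketch parses a single Junction Component into its parts. The regular expression, the function name, and the returned dictionary keys are assumptions of this sketch. Because the 5-prime and 3-prime forms share the same shape, the parser does not decide which one it is reading; that follows from the component's position in the fusion.

```python
import re

# Hypothetical pattern: <Transcript ID>(<Gene Symbol>):e.<exon><+/- offset>
JUNCTION_PATTERN = re.compile(
    r"^(?P<transcript>[^()\s:]+)\((?P<gene>[^()]+)\):e\."
    r"(?P<exon>\d+)(?P<offset>[+-]\d+)?$"
)


def parse_junction(text):
    """Parse one Junction Component; a missing offset is reported as 0."""
    match = JUNCTION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a junction component: {text!r}")
    offset = match["offset"]
    return {
        "transcript_id": match["transcript"],
        "gene_symbol": match["gene"],
        "exon": int(match["exon"]),
        "offset": int(offset) if offset is not None else 0,
    }
```

A full Transcript Segment Component (with a start and an end exon joined by an underscore) does not match this pattern and raises a ValueError.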

Linker Sequence Component

The Linker Sequence Component is represented literally by DNA characters (A, C, G, T).

Linker Sequence Components would be used, for example, to represent COSMIC Fusion 1780 (COSF1780) under the gene fusion nomenclature as follows:

Using Transcript Segment Component: ENST00000305877.12(BCR):e.1_2::ACTAAAGCG::ENST00000318560.5(ABL1):e.2_11

Using Junction Components: ENST00000305877.12(BCR):e.2::ACTAAAGCG::ENST00000318560.5(ABL1):e.2
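
A literal linker can be recognized by checking that a component consists only of the four DNA characters. The following Python sketch does this for each component of a fusion string; the function names are hypothetical. Note the limitation of this simple check: a bare gene symbol written without a Gene ID and spelled only with these four letters (CAT, for instance) would be indistinguishable from a linker.

```python
import re

# A linker is one or more DNA characters and nothing else.
LINKER_PATTERN = re.compile(r"[ACGT]+")


def is_linker_sequence(component):
    """Return True if the component is a literal DNA linker sequence."""
    return LINKER_PATTERN.fullmatch(component) is not None


def find_linkers(fusion):
    """Return the literal linker sequences found in a fusion string."""
    return [part for part in fusion.split("::") if is_linker_sequence(part)]
```

Applied to either form of the COSF1780 example above, find_linkers returns a list containing the single linker ACTAAAGCG.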

Templated Linker Sequence Component

The Templated Linker Sequence Component is represented by a genomic location and strand using the following syntax:

<Chromosome ID>(chr <1-22, X, Y>):g.<start coordinate>_<end coordinate>(+/-)
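
A minimal Python sketch of this syntax follows. The function and parameter names are hypothetical, and the output reproduces the spacing of the syntax line above literally, including the space after chr.

```python
# Chromosome labels permitted by the syntax: 1-22, X, Y.
VALID_CHROMOSOMES = {str(number) for number in range(1, 23)} | {"X", "Y"}


def templated_linker(chromosome_id, chromosome, start, end, strand):
    """Format a Templated Linker Sequence Component from its parts."""
    if chromosome not in VALID_CHROMOSOMES:
        raise ValueError(f"chromosome must be 1-22, X, or Y: {chromosome!r}")
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-': {strand!r}")
    return f"{chromosome_id}(chr {chromosome}):g.{start}_{end}({strand})"
```

With the placeholder coordinates 100 and 200, templated_linker("NC_000022.11", "22", 100, 200, "+") returns NC_000022.11(chr 22):g.100_200(+). The coordinates here are illustrative only and do not describe a real fusion.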

Regulatory Nomenclature

In the description of gene fusions, at most one regulatory element component may be used to describe the fusion, and it must be designated first (see General Rules). However, regulatory components are complex data objects themselves, and may be composed of multiple subcomponents which collectively describe the regulatory element of interest. This section specifies the nomenclature for defining regulatory elements, which may be used as a component in the broader description of Regulatory Fusions.

Class Subcomponent

Every regulatory element component begins with a description of the regulatory element class, which is typically an enhancer or promoter. This is designated as reg_e or reg_p, respectively. In rare cases, it may be necessary to represent other classes of regulatory elements within the INSDC regulatory class vocabulary, which may be specified using this syntax by appending the regulatory class name to reg_ as applicable (e.g. reg_response_element).

Feature ID Subcomponent

A regulatory element may be described by reference to a registered identifier, such as the registered cis-regulatory elements from ENCODE. These are represented using the syntax:

_<reference id>

An example registered enhancer element is reg_e_EH38E1516972.

Only one of a Feature ID OR a Feature location subcomponent may be specified.

Feature Location Subcomponent

A regulatory element may be described by reference to a Genomic Location. These are represented using the syntax:

<Chromosome ID>(chr <1-22, X, Y>):g.<start coordinate>_<end coordinate>

Only one of a Feature Location OR a Feature ID subcomponent may be specified.

Associated Gene Subcomponent

A regulatory element may also be described by reference to an associated gene. An associated gene is represented using the syntax:

First use of a gene in a document: @<associated gene symbol>(<associated gene ID>)

Subsequent use in a document: @<associated gene symbol>(<associated gene ID>), where the parenthesized associated gene ID is optional and may be omitted

An associated gene may be indicated in addition to, or in lieu of, a Feature ID subcomponent or Feature location subcomponent. If representing a regulatory element without an associated feature ID or feature location subcomponent, an associated gene subcomponent MUST be used. The associated gene subcomponent is always placed at the end of the regulatory element description.
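
The subcomponent rules above can be sketched in Python as follows. The function and parameter names are hypothetical. The sketch covers only the class, Feature ID, and associated gene subcomponents; the Feature location subcomponent is deliberately left out, since this section does not spell out how it is attached to the class subcomponent.

```python
# Short forms for the two common classes; other INSDC classes use their full name.
CLASS_ABBREVIATIONS = {"enhancer": "e", "promoter": "p"}


def regulatory_element(reg_class, feature_id=None, gene_symbol=None, gene_id=None):
    """Format a regulatory element from a class, a Feature ID, and/or a gene."""
    # Without a Feature ID (locations are not handled here), a gene is required.
    if feature_id is None and gene_symbol is None:
        raise ValueError("a Feature ID or an associated gene is required")
    text = "reg_" + CLASS_ABBREVIATIONS.get(reg_class, reg_class)
    if feature_id is not None:
        text += f"_{feature_id}"
    # The associated gene subcomponent always comes last.
    if gene_symbol is not None:
        text += f"@{gene_symbol}"
        if gene_id is not None:
            text += f"({gene_id})"
    return text
```

For example, regulatory_element("enhancer", feature_id="EH38E1516972") returns reg_e_EH38E1516972, and regulatory_element("enhancer", gene_symbol="IGH") returns reg_e@IGH.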

References

Bruford2021: Bruford EA, et al., HUGO Gene Nomenclature Committee (HGNC) recommendations for the designation of gene fusions. Leukemia (October 2021). doi:10.1038/s41375-021-01436-6