Language Tagging

Languages in Rev79 are tagged with BCP 47 language tags. This page provides an overview of how Rev79 understands language tags, and how relationships between languages are inferred by comparing language tags.

Structure of Language Tags

Language tags consist of a series of "subtags" providing information about the language:

en-Latn-AU
^^-------- Primary language subtag
   ^^^^--- Script subtag
        ^^ Region subtag

The syntax and meaning of subtags are defined in RFC 5646. Rev79 supports the subset of language tags defined by the following ABNF (adapted from the RFC):

langtag    = language ["-" script] ["-" region] *("-" variant) ["-" privateuse]
           / privateuse

language   = 2*3ALPHA           ; shortest ISO 639 code
             ["-" extlang]      ; sometimes followed by
                                ; extended language subtags
           / 4ALPHA             ; or reserved for future use
           / 5*8ALPHA           ; or registered language subtag

extlang    = 3ALPHA             ; selected ISO 639 codes
             *2("-" 3ALPHA)     ; permanently reserved

script     = 4ALPHA             ; ISO 15924 code

region     = 2ALPHA             ; ISO 3166-1 code
           / 3DIGIT             ; UN M.49 code

variant    = 5*8(ALPHA / DIGIT) ; registered variants
           / (DIGIT 3(ALPHA / DIGIT))

privateuse = "x" 1*("-" (1*8(ALPHA / DIGIT)))

Rev79 treats language tags purely syntactically, and does not attempt to normalise or validate tags using the IANA registry. Nonetheless, to aid interoperability with other systems, care should be taken to ensure valid language/script/region subtags according to their relevant standards.
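As an illustration of what purely syntactic handling can look like, the following is a minimal Python sketch that transcribes the ABNF above into a regular expression and splits a tag into its fields. The names LANGTAG_RE and parse_tag are hypothetical and are not part of Rev79; like the grammar, the sketch checks only the shape of each subtag and does not consult the IANA registry.

```python
import re

# Regex transcription of the subset ABNF above (illustrative sketch only).
LANGTAG_RE = re.compile(
    r"""
    (?:
        (?P<language>
            [A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}   # ISO 639 code plus optional extlang
          | [A-Za-z]{4}                          # reserved for future use
          | [A-Za-z]{5,8}                        # registered language subtag
        )
        (?:-(?P<script>[A-Za-z]{4}))?
        (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?
        (?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)
        (?:-x(?P<privateuse>(?:-[A-Za-z0-9]{1,8})+))?
      |
        x(?P<privateuse_only>(?:-[A-Za-z0-9]{1,8})+)
    )
    """,
    re.VERBOSE,
)


def parse_tag(tag):
    """Return the tag's fields as a dict, or None if the tag does not fit the grammar."""
    m = LANGTAG_RE.fullmatch(tag)
    if m is None:
        return None
    private = m.group("privateuse") or m.group("privateuse_only") or ""
    variants = m.group("variants") or ""
    return {
        "language": m.group("language"),
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": [v for v in variants.split("-") if v],
        "privateuse": [p for p in private.split("-") if p],
    }


print(parse_tag("en-Latn-AU"))
print(parse_tag("kmn-x-HIS01234"))
print(parse_tag("en_AU"))  # None: underscores are not allowed by the grammar
```

Because the check is syntactic, a well-formed but unregistered tag is accepted by this sketch, which is why the subtags themselves should still be chosen according to their relevant standards.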

Private use subtag: HIS (ROLV codes)

Rev79 supports ROLV codes using the Harvest Information System private use subtag defined by Global Recordings Network. These subtags must consist of the characters HIS followed by exactly five digits (padded with leading zeroes where necessary).

kmn-x-HIS01234
^^^----------- Awtuw language
    ^--------- private use marker
      ^^^^^^^^ Kamnam variety
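
The HIS format can be sketched in Python as follows. The helper names his_subtag and is_his_subtag are hypothetical and not part of Rev79, and the sketch assumes the letters HIS are matched exactly as written above (upper case), which the text does not state explicitly.

```python
import re

# "HIS" followed by exactly five ASCII digits.
# Assumption: the letters are matched case-sensitively, as shown in the example tag.
HIS_RE = re.compile(r"HIS[0-9]{5}")


def his_subtag(rolv_code: int) -> str:
    """Build a HIS subtag from a numeric ROLV code, padding with leading zeroes."""
    if not 0 <= rolv_code <= 99999:
        raise ValueError("ROLV code must fit in five digits")
    return f"HIS{rolv_code:05d}"


def is_his_subtag(subtag: str) -> bool:
    """Check that a subtag has the HIS + five digits shape."""
    return HIS_RE.fullmatch(subtag) is not None


print(f"kmn-x-{his_subtag(1234)}")  # kmn-x-HIS01234
print(is_his_subtag("HIS1234"))     # False: only four digits
```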

Inferring Relationships from Language Tags

In some circumstances, Rev79 will infer a relationship between two languages, so that work done on a specific language tag (e.g. kmn-x-HIS01234) can be considered as work towards a more general language tag (e.g. kmn). This inference is syntactic, based on the contents of the language tag.

A language A is considered "more specific" than another language B if language A's language tag matches language B's language tag, according to the "extended filtering" process defined in RFC 4647. Broadly speaking, this considers each subtag of the language tag in turn. A subtag matches if A and B have the same value for it, or if B does not specify it. For the variant and private use sets, B's values must be a subset of A's.

Some examples:

  • kmn-x-HIS01234 is more specific than kmn
  • kmn-PG-x-HIS01234 is more specific than kmn
  • kmn-x-HIS01234 is not more specific than kmn-PG
  • en is not more specific than kmn
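
The comparison described above can be sketched in Python as follows. This follows the broad description on this page rather than the full RFC 4647 algorithm, and it is not Rev79's actual implementation. The names split_tag and is_more_specific are hypothetical, the splitter is simplified (it ignores extended language subtags), and comparing case-insensitively is an assumption borrowed from RFC 4647.

```python
def split_tag(tag: str) -> dict:
    """Split a tag into fields by subtag shape (simplified: no extlang handling)."""
    subtags = tag.lower().split("-")  # assumption: case-insensitive comparison
    fields = {"language": None, "script": None, "region": None,
              "variants": set(), "privateuse": set()}
    if "x" in subtags:
        # Everything after the "x" marker is private use.
        i = subtags.index("x")
        fields["privateuse"] = set(subtags[i + 1:])
        subtags = subtags[:i]
    if subtags:
        fields["language"] = subtags.pop(0)
    if subtags and len(subtags[0]) == 4 and subtags[0].isalpha():
        fields["script"] = subtags.pop(0)
    if subtags and ((len(subtags[0]) == 2 and subtags[0].isalpha())
                    or (len(subtags[0]) == 3 and subtags[0].isdigit())):
        fields["region"] = subtags.pop(0)
    fields["variants"] = set(subtags)  # whatever remains is treated as variants
    return fields


def is_more_specific(a: str, b: str) -> bool:
    """True if tag a matches tag b under the rules sketched above."""
    fa, fb = split_tag(a), split_tag(b)
    for key in ("language", "script", "region"):
        # A subtag that b specifies must have the same value in a.
        if fb[key] is not None and fb[key] != fa[key]:
            return False
    # b's variant and private use sets must be subsets of a's.
    return fb["variants"] <= fa["variants"] and fb["privateuse"] <= fa["privateuse"]


print(is_more_specific("kmn-x-HIS01234", "kmn"))     # True
print(is_more_specific("kmn-PG-x-HIS01234", "kmn"))  # True
print(is_more_specific("kmn-x-HIS01234", "kmn-PG"))  # False
print(is_more_specific("en", "kmn"))                 # False
```

Under this sketch a tag also matches itself. Whether Rev79 treats identical tags as "more specific" is not stated above.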