Internet Architecture Board

RFC2850

IAB Statement: “The interpretation of rules in the ICANN gTLD Applicant Guidebook.”

Executive Summary

Recently, concerns about potential conflicts between interpretations of RFC 1123 and real-world practice with provisioning of labels in the root zone of the Domain Name System (DNS) have come to the attention of the IAB. These concerns arise because some of the RFCs referenced within the ICANN gTLD Applicant Guidebook have not yet been updated as efforts to support Internet internationalization have progressed.

While there is work in progress relating to this issue, additional time and effort will be required before updated RFCs can be published, definitively addressing the open questions. For example, to provide input to the development of these specifications, it is expected that documentation of existing implementations as well as user and application interfaces will be required.

While the IAB and IESG consider how to further the development of of definitive specifications, interim guidance is needed. So as not to block progress toward the provisioning of additional generic TLDs, the IAB is publishing this statement.

Since once a Top Level Domain (TLD) is allocated it will be very difficult to subsequently withdraw it, and since allocation of a problematic TLD would make it difficult to deny other problematic requests, the IAB strongly recommends that DNS registries, and ICANN in particular follow the most conservative interpretation of ”will be alphabetic” (as described in the Recommendation section below), until definitive specifications can be developed. This statement will remain in force until the necessary RFCs have been published.

Introduction

While to a substantial degree policy matters related to the DNS lie with ICANN, the base syntax constraint for TLD and Second Level Domain (SLD) labels, which represent a baseline upon which other DNS policy builds, is a product of the IETF. However, until the IETF develops updated specifications, the IAB provides the following interpretation of RFCs referenced in the ICANN gTLD Applicant Guidebook. The IAB observes that section 1.3 of the ICANN gTLD Applicant Guidebook (version 2012-01-11) contains requirements for Internationalized Domain Names (IDNs), including:

    "The prefix and string together must conform to all
     requirements for a label that can be stored in the DNS
     including conformance to the LDH (host name) rule
     described in RFC 1034, RFC 1123, and elsewhere."

The IAB interprets this requirement as follows:

Top-Level and Second-Level Domain (TLD and SLD) labels in the Domain Name System (DNS) [RFC1034] [RFC1035] are not required by the DNS protocols to adhere to any syntax beyond that for any other DNS label: e.g. any binary string whatsoever can be used at any point in the DNS [RFC2181]. However there are good engineering reasons, some of them embedded with application protocol implementations, for requiring the more restricted Letter-Digit-Hyphen(LDH) syntax. This means that while any label is legal from a DNS perspective, non-LDH labels can result in DNS names that are in practice unusable within applications (e.g., email, web, etc.)

In conversations with subject matter specialists, we have observed different interpretations of one of the specific TLD requirements set by [RFC1123] (see the ‘will be alphabetic’ discussion below). In interpreting that [RFC1123] requirement, it should be understood that it reflects policy choices made by the IAB in pre-ICANN times, as well as application-layer considerations relating to human factors.

Technical Detail

[RFC1034] provides the definition of a preferred name syntax in section 3.5 with the following motivation:

  "However, when assigning a domain name for an object, the prudent
   user will select a name which satisfies both the rules of the
   domain system and any existing rules for the object, whether these
   rules are published or implied by existing programs."

The document continues with an example referring to names as used by the Internet Mail application and the rules for the old HOSTS.TXT file (see [RFC952] for context). The section finishes with the specification for LDH in Abstract Syntax notation. [RFC1123] reaffirms this definition, but makes one change to the syntax:

 "The syntax of a legal Internet host name was specified in RFC-952
  [DNS:4].  One aspect of host name syntax is hereby changed: the
  restriction on the first character is relaxed to allow either a
  letter or a digit.  Host software MUST support this more liberal
  syntax."  [Section 2.1]

In addition, the DISCUSSION section of Section 2.1 says:

 "However, a valid host name can never have the dotted-decimal form
  #.#.#.#, since at least the highest-level component label will be
  alphabetic."  [Section 2.1]

While neither [RFC0952] nor [RFC1123] explicitly states the reasons for these restrictions, avoiding confusion for humans is an important and prudent engineering consideration. For instance, [RFC1123] suggests that one of the reasons was to prevent confusion between dotted-decimal IPv4 addresses and host domain names.

That decision has been interpreted as normative and is known to have found its way into specifications, implementations, or both. For instance [RFC1738], in its description of the “host” part of a Uniform Resource Locator, declares:

 'The rightmost domain label will never start with a digit, though,
 which syntactically distinguishes all domain names from the IP
 addresses.'  [Section 3.1]

Summarizing, the restriction that ‘the highest-level component label will be alphabetic’ has been assumed in some deployed software, and therefore caution should be exercised when contemplating any changes to that rule.

The strictest interpretation of ‘will be alphabetic’ would exclude valid A-labels in the root and should be expanded to permit the xn-- type labels.

Specification

We believe that a normative interpretation of “will be alphabetic” needs to be specified. There is at least one work in progress
[draft-liman-tld-names] attempting to provide such a specification.

We assert that the following requirements should apply to any such specification:

  • 1. TLD labels used in DNS names MUST be clearly distinguishable from IPv4 and IPv6 address literals, both in wire and presentation format. This is the most liberal requirement.
  • 2. Internationalized strings encoded according to the IDNA2008 specification [RFC5891] [RFC5892] MUST be permitted in general, but the set of Unicode characters which are permissible in TLD U-labels MAY be constrained in order to satisfy other requirements. The most conservative interpretation of this constraint is the “only alphabetic requirement”.
  • 3. Currently existing TLD labels MUST be permitted.
  • 4. The syntax definition SHOULD be compatible with existing assumptions by application developers and users, to the extent that those assumptions can be determined.

Indeed, when translating the [RFC1123] requirement to a specification, pure technical as well as human factors (requirement 4) need to be taken into account. The resulting specification is likely to lie between two extremes: the most conservative interpretation, based on the IDNA specifications; and the most liberal interpretation based only on known technical limitations, ignoring human factors.

The most conservative interpretation of the constraints set by [RFC1123] is that a TLD is ‘alphabetic’ only. The specification of this interpretation in the IDNA context would not allow numbers, symbols such as punctuation marks, or diacritics in either A-labels or U-labels, except in A-labels that begin with ‘xn--’.

The most liberal interpretation of [RFC1123] would be a specification whereby requirement 4 does not introduce limitations that are not already set by the other three requirements. Such a specification would define the input values in such a way that the final label cannot contain an element that can be interpreted as part of an IP address. Examples of those are labels consisting of strings that represent integers; this includes labels that contain numbers expressed in decimal, octal, or hexadecimal but also U-labels that are likely to be interpreted as numbers by OS or application libraries when entered in the user interface.

To provide the specification for the most liberal interpretation, research is needed as to how user input is eventually interpreted as purely numeric – and therefore possibly as an IP address — in legacy network stacks and applications.

A final specification is likely to end up between these two extremes as it also takes human factors and implementation issues from requirement 4 into account. For example, combining Unicode characters from the general letter category with the general mark category (specifically the ‘Mark, spacing combined’ category [Unicode60]) might cause script dependencies in rendering libraries which creates complexities that might eventually result in increased likelihood of confusion among strings that would not be confused if rendered correctly. It is not a given that these problems will happen but is currently a subject of study among implementers and subject matter specialists.

Recommendation

While a detailed specification is pending, we recommend DNS registries, and ICANN in particular, follow the most conservative approach with respect to the constraints set by other requirements. That is, be conservative regarding issues related to usability, confusability, stability, etc. The most conservative approach is a specification that reflects the ‘will be alphabetic’ text in RFC 1123 within an internationalized context:

  • 1. the DNS-label is a IDNA-valid string according to [RFC5890];
  • 2. the derived property value of all code points, as defined by [RFC5890], is PVALID;
  • 3. the general category of all code points, is one of {Ll, Lo, Lm}. [Unicode60]

This conservative constraint does not prevent useful mnemonics. However, it is clear that not including the Mn and Mc general capacities at this time limits the possibility of full expression for certain South and Southeast Asian scripts. Nevertheless, introducing labels that can cause interoperability issues and/or set false expectations from users of the Internet will be more damaging than following a conservative approach as the Internet infrastructure does not have an ‘undo’ button.

The IAB, February 8, 2012.

References

[draft-liman-tld-names]: L-J. Liman and J. Abley “Top Level Domain Name Specification“, http://tools.ietf.org/html/draft-liman-tld-names-06 (work in progress, last checked Jan 24, 2012)

[RFC1034] Mockapetris, P., “Domain names – concepts and facilities“, STD 13, RFC 1034, November 1987.

[RFC1035] Mockapetris, P., “Domain names – implementation and specification“, STD 13, RFC 1035, November 1987.

[RFC1123] Braden, R., “Requirements for Internet Hosts – Application and Support”, STD 3, RFC 1123, October 1989.

[RFC2181] Elz, R. and R. Bush, “Clarifications to the DNS Specification“, RFC 2181, July 1997.

[RFC5890] Klensin, J., “Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework“,
RFC 5890, August 2010.

[RFC5891] Klensin, J., “Internationalized Domain Names in Applications (IDNA): Protocol“, RFC 5891, August 2010.

[RFC5892] Faltstrom, P., “The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)“, RFC 5892, August 2010.

[Unicode60] The Unicode Consortium. The Unicode Standard, Version 6.0.0, defined by: “The Unicode Standard, Version 6.0.0“, (Mountain View, CA: The Unicode Consortium, 2011. ISBN 978-1-936213-01-6).