TIP 573: Surrogates are invalid

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Author:         Jan Nijtmans <[email protected]>
State:          Draft
Type:           Project
Tcl-Version:    8.7
Tcl-Branch:     tip-573

Abstract

In Tcl 8.5 and 8.6, it is possible to specify "\uD800" as valid character, and do reasonable things with it. Actually, U+D800 is not a valid Unicode code point, so any operation on such value is "undefined behavior". Surrogates are only allowed when Unicode characters are stored or transmitted as UTF16, and then a single Unicode character behaves as 2 characters in storage or on-line. However, when translating back to UTF-8, those 2 characters have to be merged back to a single Unicode character. They are unseparatable.

Tcl has always allowed this, but it's wrong. The UtfToUtf encoder should translate all characters in the range \uD800 - \uDFFF to the replacement character \uFFFD. Also using the form \uD800 in scripts should always behave the same as if \uFFFD was specified. This way, any possible 'hack' attempt, trying to insert invalid Unicode in Tcl, can be mitigated rapidly.

One exception to this has to be made. When using the escace sequence \uD800\uDC00, so a valid combination of an upper and a lower surrogate, in a script, there is no harm in translating that to the intended Unicode code point. In Tcl 8.6 and 8.7 there is no other way than that for specifying Emoji. Allowing this, provides a upgrade path for existing scripts handling Emoji. Starting with Tcl 8.7, the "\UXXXXXX" escape sequence should be used for this.

Specification

This TIP proposes:

Rationale

Since the code-points for surrogates are marked "invalid" in the Unicode standard, it should not be possible to use them in Tcl.

Implementation

Implementation is in branch tip-573

Discussion

See ticket 5dabd088ee.

Copyright

This document has been placed in the public domain.