TIP 573: Surrogates are invalid

Author:         Jan Nijtmans <[email protected]>
State:          Withdrawn
Type:           Project
Tcl-Version:    8.7
Tcl-Branch:     tip-573


In Tcl 8.5 and 8.6, it is possible to specify "\uD800" as valid character, and do reasonable things with it. Actually, U+D800 is not a valid Unicode code point, so any operation on such value is "undefined behavior". Surrogates are only allowed when Unicode characters are stored or transmitted as UTF16, and then a single Unicode character behaves as 2 characters in storage or on-line. However, when translating back to UTF-8, those 2 characters have to be merged back to a single Unicode character. They are inseparable.

Tcl has always allowed this, but it's wrong. The UtfToUtf encoder should translate all characters in the range \uD800 - \uDFFF to the replacement character \uFFFD. Also using the form \uD800 in scripts should always behave the same as if \uFFFD was specified. This way, any possible 'hack' attempt, trying to insert invalid Unicode in Tcl, can be mitigated rapidly.

One exception to this has to be made. When using the escape sequence \uD800\uDC00, so a valid combination of an upper and a lower surrogate, in a script, there is no harm in translating that to the intended Unicode code point. In Tcl 8.6 and 8.7 there is no other way than that for specifying Emoji. Allowing this, provides a upgrade path for existing scripts handling Emoji. Starting with Tcl 8.7, the "\UXXXXXX" escape sequence should be used for this.


This TIP proposes:


Since the code-points for surrogates are marked "invalid" in the Unicode standard, it should not be possible to use them in Tcl.


Implementation is in branch tip-573


See ticket 5dabd088ee.

This TIP is now withdrawn, in favor of TIP #601.


This document has been placed in the public domain.