Author:         Jan Nijtmans <[email protected]>
State:          Draft
Type:           Informative
Vote:		No voting

Abstract

This document explains how strings are represented in Tcl 8.6 resp. 8.7 (and 9.0), and how that affects the Tcl language. When migrating Tcl scripts from 8.6 to 8.7 (or 9.0), some commands slightly change behavior.

This migration guide is based on TIP's #389, #497, #542, #619 ...

This TIP is not separately voted on. It will be finalized after Tcl 9.0 is officially released.

Dual value representation in Tcl.

Tcl strings can internally be represented in two (or three) ways, as a sequence of bytes or as a "Unicode string" (a sequence of Tcl_UniChar's). This internal representation differs in the various Tcl versions:

Tcl 8.5, bytes: UTF-8 (up to U+FFFF), string: UCS-2
Tcl 8.6, bytes: CESU-8, string: UTF-16
Tcl 8.7, bytes: WTF-8, utf32string: UTF-32, string: UTF-16
Tcl 9.0, bytes: WTF-8, string: UTF-32

The difference in behavior, comparing the different Tcl versions, can be explained by this difference in internal representation of strings.

Remark: Actually, Tcl uses a variation of UTF-8/CESU-8/WTF-8 in which the NULL character is represented as two bytes 0xC0 0x80. This variation is known as Modified UTF-8. Tcl uses the same modification for CESU-8 and WTF-8 as well. Since this modification holds for all Tcl versions, it's not further handled in this document.

Remark 2: Since utf-8 is the system encoding on most modern UNIX (and MacOS) systems, the examples below can only be used on UNIX and MacOS, not when using tclsh interactive on Windows (for example).

encoding "utf-8"

In Tcl 8.6, Emoji are represented internally as two codepoints. So

$ tclsh8.6
% encoding convertfrom utf-8 \xF0\x9F\xA4\x9D
🤝
% string length [encoding convertfrom utf-8 \xF0\x9F\xA4\x9D]
2

Since Tcl 8.7/9.0 switches to UTF-32 for counting the string length:

$ tclsh8.7
% encoding convertfrom utf-8 xF0x9FxA4x9D
🤝
% string length [encoding convertfrom utf-8 \xF0\x9F\xA4\x9D]
1

Tcl 8.5 (and also Tcl 8.6 earlier than 8.6.10) was not able to handle this at all:

$ tclsh8.5
% encoding convertfrom utf-8 \xF0\x9F\xA4\x9D
ð¤ (control characters \x9F and \x9D are not printable)
% string length [encoding convertfrom utf-8 \xF0\x9F\xA4\x9D]
4

Any 4-byte utf-8 sequence was simply converted to those separate 4 bytes. This made the use of Emoji practially impossible when using Tcl 8.6.9 or earlier.

escaping

In Tcl 8.6, using a system encoding different from UTF-8, you cannot use Emoji directly in scripts. The only portable way to use them is the \u???? construct:

$ tclsh8.6
% puts \uD83E\uDD1D
🤝
% string length \uD83E\uDD1D
2

In Tcl 8.7 this is still supported:

$ tclsh8.7
% puts \uD83E\uDD1D
🤝
% string length \uD83E\uDD1D
1

Note that this escape sequence appears to produce 2 symbols, a higher and a lower surrogate. But surrogate pairs are non-conforming in WTF-8, so they are joined into a single 4-byte sequence right from the start in Tcl8.7.

In Tcl 9.0 it's not possible to do this any more:

$ tclsh9.0
% puts \uD83E\uDD1D
error writing "stdout": illegal byte sequence

Better is to use the 🤝 character directly:

$ tclsh8.7
% puts 🤝
🤝
% string length 🤝
1

string compare

Since in Tcl 8.6, Emoji are represented internally as two codepoints:

$ tclsh8.6
% string compare 🤝 豈
-1

But in Tcl 8.7 and 9.0:

$ tclsh8.7
% string compare 🤝 豈
1

The reason for this is that 🤝 (U+1F91D) is internally represented as two code-points (U+D83E U+DD1D) while 豈 is represented as a single code point (U+F900). The "string compare" simply compares all code points from left to right, and concludes that 🤝 is smaller than 豈, which - in unicode sense - (U+1F91D > U+F900) is not correct. This is corrected in Tcl 8.7 and 9.0.

Conclusion: When Tcl8.6 strings contain both Emoji and characters between U+E000 and U+FFFF (mostly Private Use, but also CJK Compatibility Ideographs, Alphabetic presentation forms, Arabic presentation forms, Variation selectors, Vertical forms, Combining half-marks, CJK Compatibility forms, Small form variants, Halfwidth and Fullwith forms, Specials) string comparison might not give what you expect.

string index / string length

In Tcl 8.6:

$ tclsh8.6
% string length 🤝
2

But in Tcl 8.7/9.0:

$ tclsh8.7
% string length 🤝
1

Since the "string length" and the "string index" command are related, we cannot change one without taking the other into account. Therefore, "string index" behaves differently in the different Tcl version. For example:

In Tcl 8.6

$ tclsh8.6
% string index 🤝🤡 0
� (U+D83E)
% string index 🤝🤡 1
� (U+DD1D)

In Tcl 8.7

$ tclsh8.7
% string index 🤝🤡 0
🤝
% string index 🤝🤡 1
🤡

This allows looping through the string using "string length" in combination with "string index".

In Tcl 8.7/9.0 all is OK: Since "string length 🤝" is 1, no special handling is needed when indexing strings.

split

Since Emoji are not supposed to be split into surrogates:

$ tclsh8.6 (at least 8.6.11)
% split 🤝🤡 {}
🤝 🤡

Earlier Tcl versions (even up to 8.6.9):

$ tclsh8.5
% split 🤝🤡 {}
ð  ¤  ð  ¤ ¡

In Tcl 8.6.10 it was partially fixed:

$ tclsh8.6
% split 🤝🤡 {}
� � � � (U+D83E U+DD1D U+D83E U+DD21)

This means that - starting with Tcl 8.6.11 - "split" can be used to iterate over a string, respecting correct border for Emoji. But it could have unexpected effects. For example the "tcl-telegram" app has the following function to convert a Tcl string to json form:

# Convert TCL string to proper JSON string
proc jString {str} {
	set result ""
	# json::write does escaping for 8-bit characters and adds quotes, but doesn't handle unicode
	set str [json::write string [subst -nocommands -novariables $str]]
	# Convert everything non 8-bit to \uXXXX sequences
	foreach char [split $str {}] {
		scan $char %c code
		if {$code > 127} {
			append result [format "\\u%04.4x" $code]
		} else {
			append result $char
		}
	}
	return $result
}

Originally this function couldn't handle Emoji (since Tcl up to 8.6.9 couldn't). In Tcl 8.6.10 it started working for Emoji (since json expects Emoji to be converted to surrogate-pairs first). Starting with Tcl 8.6.11 it should (finally) be written as follows:

# Convert TCL string to proper JSON string
proc jString {str} {
	set result ""
	# json::write does escaping for 8-bit characters and adds quotes, but doesn't handle unicode
	set str [json::write string [subst -nocommands -novariables $str]]
	# Convert everything non 8-bit to \uXXXX sequences
	foreach char [split $str {}] {
		scan $char %c code
		if {$code > 65535} {
			# split $code into surrogates first
			append result [format "\\u%04.4x\\u%04.4x" \
				[expr {(($code-0x10000)>>10)+0xD800}] [expr {(($code-0x10000)&0x3FF)+0xDC00}]]
		} elif {$code > 127} {
			append result [format "\\u%04.4x" $code]
		} else {
			append result $char
		}
	}
	return $result
}

This version works for Tcl 8.6.10 too, and it will continue to work for Tcl 8.7 and 9.0.

Copyright

This document has been placed in the public domain.

TIP 600: Migration guide for Tcl 8.6/8.7/9.0