Tcl Improvement Proposals: Artifact [866ff5b79f]

Artifact 866ff5b79f386e66333c0fe0f46365150a61548a6609faa874016ba55a93e691:

File tip/389.tip — part of check-in [abc545f77b] at 2016-03-17 14:06:22 on branch trunk — Updated title to reflect external standard changes (user: dkf size: 4230)
TIP:            389
Title:          Full support for Unicode 8.0 and later
Version:        $Revision: 1.4 $
Author:         Jan Nijtmans <[email protected]>
Author:         Jan Nijtmans <[email protected]>
State:          Draft
Type:           Project
Vote:           Pending
Created:        23-Aug-2011
Post-History:   
Discussions-To: Tcl Core list
Keywords:       Tcl
Tcl-Version:    8.7

~ Abstract

This TIP proposes to add full support for all characters in Unicode 8.0+,
inclusive the characters >= '''U+010000'''.

~ Summary

In order to extend the range of the characters to more than 16 bits, the type
''Tcl_UniChar'' is not big enough any more to hold all possible characters.
Changing the type of ''Tcl_UniChar'' to a 32-bit quantity is not an option, as
it will result in a binary API incompatibility.

The solution proposed in this TIP is to keep ''Tcl_UniChar'' a 16-bit
quantity, but to increase the value of TCL_UTF_MAX to 4 (from 3). Any
conversions from UTF-8 to ''Tcl_UniChar'' will convert any valid 4-byte UTF-8
sequence to a sequence of two Surrogate characters. All conversions from
UTF-16 to UTF-8 will make sure that any High Surrogate immediately followed by
a Low Surrogate will result in a single 4-byte UTF-8 character.

This can be done in a binary compatible way: No source code of existing
extensions need to be modified. As long as no characters >= '''U+010000''' or
Surrogates are used, all functions provided by the Tcl library will function
as before. There are few functions which currently return a value of type
''Tcl_UniChar'', those will be modified to return an int in stead.

~ Rationale

As Unicode 8.0, and future Unicode versions, will supply more and more
characters outside the 16-bit range, it would be useful if Tcl supports
that as well.

~ Specification

This document proposes:

 * Change the functions ''Tcl_UniCharToUtf'' and ''UnicodeToUtfProc'' such
   that when they are fed with a valid High Surrogate followed by a Low
   Surrogate, the result will be a single 4-byte UTF-8 character.

 * Change the functions ''Tcl_UtfToUniChar'' and ''UtfToUnicodeProc'' such
   that when they are fed with a valid 4-byte UTF-8 character, the first call
   will return a High Surrogate character, and the next call will return a Low
   Surrogate character.

 * The following functions, which currently return a Tcl_UniChar, will be
   changed to return an int instead:

 > * ''Tcl_UniCharAtIndex''

 > * ''Tcl_UniCharToLower''

 > * ''Tcl_UniCharToTitle''

 > * ''Tcl_UniCharToUpper''

 > * ''Tcl_GetUniChar''

 * Extend tclUniData.c to include all Unicode 8.0 characters up to
   '''U+02FA20'''.  A special case will be made for the functions
   ''Tcl_UniCharIsGraph'' and ''Tcl_UniCharIsPrint'' for the characters in the
   range '''U+0E0100''' - '''U+0E01EF''', otherwise it would almost double the
   Unicode table size.

~ Compatibility

As long as no Surrogates or characters >= '''U+010000''' are used, all
functions behave exactly the same as before. The only way that
''Tcl_UniCharToUtf'' can produce a 4-byte output is when Surrogates or
characters >= '''U+010000''' are used.

Extension that want to be compatible with any Tcl version, can include tcl.h
as follows:

|#define TCL_UTF_MAX 4
|#include <tcl.h>

or they can call the C compiler with the additional argument -DTCL_UTF_MAX=4,
in order to be sure that UTF-8 representations of length 4 can be
handled. This way, the extension can be used with any Tcl version, whether it
supports Surrogates or not.

Apart from this, it is advisable to initialize the variable where the chPtr
argument from ''Tcl_UtfToUniChar'' points to, as this location is used to
remember whether the High Surrogate is already produced or not. Not doing so
when the first character of a string is a character > '''U+010000''' might
result in a Low Surrogate character only. This danger, however unlikely, only
exists for the first character in a string, and it only occurs when the
(random) value is exactly equal to the expected High Surrogate.

~ Reference Implementation

A reference implementation is available at http://core.tcl.tk/tcl in branch
tip-389-impl

~ Copyright

This document has been placed in the public domain.