# TIP 652: Remove "string is unicode" and "Tcl_UniCharIsUnicode"
	Author:		Nathan Coulter <[email protected]>
	State:		Draft
	Type:		Project
	Vote:		Pending
	Created:	26-Dec-2022
	Tcl-Version:	8.7
	Tcl-branch:
	Vote-Summary:	
	Votes-For:	
	Votes-Against:	
	Votes-Present:	
-----

# Abstract and Specification

Remove `string is unicode`, introduced in [`TIP
597`](https://core.tcl-lang.org/tips/doc/trunk/tip/597.md), and add `encoding
encodable? ...`, which returns `1` if the corresponding `encoding convertto ...`
command would succeed, and `0` otherwise.

Remove `Tcl_UniCharIsUnicode()`.



# Rationale

The only use of `[string is unicode]` is to determine whether a string can be
encoded into a Unicode transformation format: utf-8, utf-16, or utf-32.
Tcl has never needed a `[string is big5]`, `[string is shiftjis]`, or any
other `[string is someencodinghere]`.  There is also no need for `[string is
unicode]`.  To determine whether a given string can be encoded into a given
encoding, it is sufficient to attempt to perform the encoding without doing the
extra work to return the encoded value.
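
As an illustration of that approach, a script-level stand-in for such a check could wrap `encoding convertto` in `catch`. This is only a sketch: the proc name `encodable` is hypothetical, it assumes Tcl 9.0 or later, where `encoding convertto` accepts `-profile strict` and raises an error on data that cannot be encoded, and unlike a built-in command it still builds the encoded value before discarding it.

```tcl
# Hypothetical helper, not part of Tcl: return 1 if $data can be converted to
# $encoding, 0 otherwise.  Assumes Tcl 9.0 or later, where the strict profile
# makes [encoding convertto] raise an error instead of substituting a
# replacement character.
proc encodable {encoding data} {
    expr {![catch {encoding convertto -profile strict $encoding $data}]}
}

puts [encodable ascii "abc"]       ;# -> 1
puts [encodable ascii "\u00e9"]    ;# -> 0, U+00E9 has no ASCII representation
```

The same proc works for any encoding name listed by `encoding names`, which is why no per-encoding `string is` class is needed.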

`[string is unicode]` fails in its stated purpose.  According to TIP 597,

> The string is unicode command can be used to check if the "utf-8"/"utf-16"
> encodings would deliver valid output, ...

This is not true:

<pre>
set text \U03fffe
string is unicode $text;# -> 0
binary scan [encoding convertto utf-16 $text] H* hex
set hex ;# -> fdff
</pre>

The problem is that, according to TIP 597, the return value is `0` not only for
the surrogate code points but also for the 66 noncharacters, U+??FFFE -
U+??FFFF and U+FDD0 - U+FDEF.  This means that `string is unicode` and
`Tcl_UniCharIsUnicode` cannot be used to check whether the data could be
encoded into one of the Unicode encoding forms.
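
The figure of 66 can be checked by construction. The following is a small sketch in plain Tcl. The proc name `noncharacters` is hypothetical and is not part of Tcl or of TIP 597. It works on integer code points only, so it does not depend on how a particular Tcl version stores these code points in strings.

```tcl
# Hypothetical helper: return the list of all Unicode noncharacter code points
# as integers.
proc noncharacters {} {
    set result {}
    # The contiguous range U+FDD0 - U+FDEF (32 code points).
    for {set i 0} {$i < 32} {incr i} {
        lappend result [expr {0xFDD0 + $i}]
    }
    # The last two code points of each of the 17 planes, U+??FFFE and U+??FFFF
    # (34 code points).
    for {set plane 0} {$plane <= 16} {incr plane} {
        lappend result [expr {$plane * 0x10000 + 0xFFFE}]
        lappend result [expr {$plane * 0x10000 + 0xFFFF}]
    }
    return $result
}

puts [llength [noncharacters]]                        ;# -> 66
puts [format U+%04X [lindex [noncharacters] end]]     ;# -> U+10FFFF
```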

The Unicode specification makes it clear that noncharacters may be encoded into
an encoding form.  First, there is definition 79:

> A Unicode encoding form assigns each Unicode scalar value to a unique code unit
> sequence.

The specification then declares:

> To ensure that the mapping for a Unicode encoding form is one-to-one, all
> Unicode scalar values, including those corresponding to noncharacter code
> points and unassigned code points, must be mapped to unique code unit
> sequences. Note that this requirement does not extend to high-surrogate and
> low-surrogate code points, which are excluded by definition from the set of
> Unicode scalar values.
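
The distinction drawn in this passage can be stated as a predicate on integer code points. A minimal sketch follows. `isScalarValue` is a hypothetical name, not an existing Tcl command.

```tcl
# Hypothetical helper: true if the integer cp is a Unicode scalar value, i.e.
# a code point in U+0000 - U+10FFFF that is not a surrogate (U+D800 - U+DFFF).
proc isScalarValue {cp} {
    expr {$cp >= 0 && $cp <= 0x10FFFF && !($cp >= 0xD800 && $cp <= 0xDFFF)}
}

puts [isScalarValue 0xFFFE]      ;# noncharacter          -> 1
puts [isScalarValue 0x3FFFE]     ;# noncharacter          -> 1
puts [isScalarValue 0xD800]      ;# high-surrogate        -> 0
puts [isScalarValue 0x110000]    ;# beyond the code space -> 0
```

Noncharacters pass this test and surrogates do not, which matches the set of values that an encoding form is required to map.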


The [Private-Use Characters, Noncharacters & Sentinels FAQ](https://www.unicode.org/faq/private_use.html#nonchar6)
states,

>Q: Are noncharacters invalid in Unicode strings and UTFs?

>Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in
>any UTF. This can be seen explicitly in the table above, where every
>noncharacter code point has a well-formed representation in UTF-32, in UTF-16,
>and in UTF-8. An implementation which converts noncharacter code points between
>one UTF representation and another must preserve these values correctly. The
>fact that they are called "noncharacters" and are not intended for open
>interchange does not mean that they are somehow illegal or invalid code points
>which make strings containing them invalid.


In response to issue [Tcl 9: "illegal byte
sequence" ?!](https://core.tcl-lang.org/tcl/info/17a1cb8d6e2a51bd), all checks
for noncharacters were removed in [commit
cbaa5e70167db75b](https://core.tcl-lang.org/tcl/info/cbaa5e70167db75b).
Therefore, `[string is unicode]` is already obsolete, having fallen behind the
reality of the implementation.  The only thing `string is unicode` now does is
check for surrogate code points.


`Tcl_UniCharIsUnicode()` is also not useful.  If some encoding functionality is
to be exposed at the C level, the equivalent of `encoding convertto` could be
provided.



# Implementation

An implementation will be provided as the Unicode capabilities of Tcl are
further refined.


# Copyright

Copyright © 2023, Nathan Coulter.  All rights reserved.


# Support

The author of this TIP requests financial support for this and other free
software works.  Contact and payment information are available at:

> https://wiki.tcl-lang.org/page/Poor+Yorick