D 2024-06-18T17:09:38.706
L Unsupported\sicu\scommand
N text/x-markdown
P 84fde19d5e65bdbd25b2fcabf7f1dc9d8b5f24a6025f1386a577be76a52bc7ce
U apnadkarni
W 3745
This is a proposal to add a `tcl::unsupported::icu` command to Tcl 9.
The primary goal is to help with migration of Tcl 8 scripts to Tcl 9.
A (much) lesser benefit is to be able to experiment with ICU functions
for handling strings and encodings.

## Motivation

There are two changes in Tcl 9 that together can prevent even sourcing
Tcl 8 scripts.

- The `source` command now reads scripts as UTF-8 by default.

- The introduction of the `strict` encoding profile for (most) I/O.

The two changes above mean that even a single copyright character (U+00A9)
in an otherwise ASCII file will prevent the script from being loaded if
the file is in ISO8859-1 encoding, as is common: the lone byte 0xA9 is not
valid UTF-8 and the `strict` profile rejects it.
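
Where the encoding of a legacy script is already known, the failure can be avoided without any detection by naming the encoding explicitly. A minimal sketch, assuming a hypothetical helper name `sourceLegacy` and an ISO8859-1 default chosen purely for illustration:

```tcl
# Hypothetical helper: source a script saved in a known legacy encoding.
# The default of iso8859-1 is an assumption made for this example only.
proc sourceLegacy {path {encoding iso8859-1}} {
    # Evaluate in the caller's scope, as a plain [source] would.
    uplevel 1 [list source -encoding $encoding $path]
}
```

This only helps when the encoding is known in advance, which is exactly what cannot be assumed when migrating a large body of scripts.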

In addition, text files containing non-ASCII characters that were written
in `[encoding system]` in Tcl 8 may not be readable in Tcl 9 without an
explicit `fconfigure -encoding` invocation. This is a particular problem
on Windows as `[encoding system]` in Tcl 9 returns `utf-8` irrespective
of the user's code page setting that was used in Tcl 8.
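
For data files, the explicit invocation mentioned above might look like the following sketch. The helper name `readLegacyFile` is hypothetical, and the caller is assumed to know the encoding the file was written in:

```tcl
# Hypothetical helper: read a text file written in a known legacy encoding.
proc readLegacyFile {path encoding} {
    set fd [open $path r]
    try {
        # Override the channel default with the encoding used by the writer.
        fconfigure $fd -encoding $encoding
        return [read $fd]
    } finally {
        close $fd
    }
}
```

For example, `readLegacyFile notes.txt cp1252` would read a file written on a Windows system whose code page was 1252 (the file name is illustrative).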

The [tcl9migrate](https://github.com/apnadkarni/tcl9-migrate) tool is
an ongoing (as in do not try it yet!) effort to help users with the above
as well as other incompatibilities, such as the handling of octal numbers
and tilde expansion. It includes both a static checker based on Nagelfar
and a runtime shim to help with the data file encoding issues. However,
it requires some functionality from ICU for this purpose.

Hence this proposal.

## Specification

The `tcl::unsupported::icu` command ensemble includes the following subcommands:

- `detect` to guess file encodings
- `icuToTcl` and `tclToIcu` to map ICU names to Tcl encoding names and vice versa

```
icu detect ?DATA ?-all??
icu icuToTcl ICUENCODINGNAME
icu tclToIcu TCLENCODINGNAME
```

With no arguments, `icu detect` returns the names of all encodings known to ICU.
If `DATA` is specified, it should be a binary string and the command
returns the ICU name of the most likely encoding.
If the `-all` option is specified, it returns a list of possible ICU encoding
names ordered by their confidence score (highest to lowest).
An empty string is returned if nothing matches.
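
As an illustration of the intended use, the following sketch reads a file as binary data and asks for all candidate encodings. It assumes the command behaves as specified in this proposal; the helper name `guessEncodings` is hypothetical, and the call is wrapped in `catch` because the command or the ICU libraries may be absent:

```tcl
# Hypothetical helper: return ICU's candidate encodings for a file,
# best guess first, or an empty list if detection is unavailable.
proc guessEncodings {path} {
    # Read raw bytes; detection must see the data before any decoding.
    set fd [open $path rb]
    try {
        set data [read $fd]
    } finally {
        close $fd
    }
    # Syntax per the proposal: icu detect ?DATA ?-all??
    if {[catch {tcl::unsupported::icu detect $data -all} candidates]} {
        return {}
    }
    return $candidates
}
```

The names returned are ICU names, so they would still need to go through `icu icuToTcl` before being passed to `fconfigure -encoding`.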

Some caveats:

- Detection is unreliable for short strings, which give the detector too
little data to work with.

- Results are best if the caller first checks for UTF-8 and only then calls
`icu detect`, since a false positive is very unlikely for UTF-8 while ICU
often mistakes UTF-8 for ISO8859-1. This is the strategy followed by the
`tcl9migrate` package.

- Additionally, when dealing with Windows file content, it is better to check
for UTF-8 first, then the current code page, and only then call `icu detect`.
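
The UTF-8-first strategy described in the caveats can be sketched as follows. This is not the `tcl9migrate` implementation; the proc names `isUtf8` and `guessTclEncoding` are hypothetical, and the UTF-8 check uses a byte-level regular expression so that the sketch does not depend on a particular Tcl version. The code page step for Windows content is omitted and would slot in between the two checks.

```tcl
# Hypothetical helper: true if the binary string is well-formed UTF-8.
proc isUtf8 {data} {
    regexp -expanded {
        ^(?:
            [\x00-\x7F]                         # ASCII
          | [\xC2-\xDF][\x80-\xBF]              # 2-byte sequences
          | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, no overlongs
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # 3-byte, general
          | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, no surrogates
          | \xF0[\x90-\xBF][\x80-\xBF]{2}       # 4-byte, no overlongs
          | [\xF1-\xF3][\x80-\xBF]{3}           # 4-byte, general
          | \xF4[\x80-\x8F][\x80-\xBF]{2}       # 4-byte, up to U+10FFFF
        )*$
    } $data
}

# Hypothetical helper: return a Tcl encoding name for the data, or an
# empty string if no guess could be made.
proc guessTclEncoding {data} {
    # Step 1: UTF-8 is checked first because false positives are rare.
    if {[isUtf8 $data]} {
        return utf-8
    }
    # Step 2: fall back to ICU, assuming the syntax given in this proposal.
    if {[catch {tcl::unsupported::icu detect $data} icuName] || $icuName eq ""} {
        return ""
    }
    # Step 3: map the ICU name to the name Tcl uses for the encoding.
    if {[catch {tcl::unsupported::icu icuToTcl $icuName} tclName]} {
        return ""
    }
    return $tclName
}
```

Note that pure ASCII data also passes the UTF-8 check, which is harmless since ASCII decodes identically under UTF-8.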

The above commands wrap the lower-level ICU API, which is for internal use
only at the moment and is not documented here.

The command will be auto-loaded on first use.

The ICU libraries will not be shipped with Tcl and the functionality
will not be available on systems that do not have them installed.
This is acceptable because the intended use is on developer systems
used for porting scripts. On recent versions of Windows 10 and later,
ICU is already present as part of the system libraries.

## Implementation

The implementation follows the one already present in Tk, except that it
loads different libraries because it uses different ICU APIs.
Combining the two is left for a later time.

As with Tk, the build system is unaffected because the ICU libraries are
loaded only at runtime. This is intentional, although there is a performance
cost associated with searching for the library. The cost is acceptable
as it is incurred only when the `icu` command is used.

Code is in branch [apn-experiment-chardet](https://core.tcl-lang.org/tcl/timeline?r=apn-experiment-chardet).




Z 1d5e4f859508000b2d30c889fc2b5325