D 2024-06-18T17:09:38.706
L Unsupported\sicu\scommand
N text/x-markdown
P 84fde19d5e65bdbd25b2fcabf7f1dc9d8b5f24a6025f1386a577be76a52bc7ce
U apnadkarni
W 3745
This is a proposal to add a `tcl::unsupported::icu` command to Tcl 9. The primary goal is to help with migration of Tcl 8 scripts to Tcl 9. A (much) lesser benefit is the ability to experiment with ICU functions for handling strings and encodings.

## Motivation

There are two changes in Tcl 9 that together can prevent even sourcing Tcl 8 scripts.

- The `source` command now uses UTF-8 encoding by default.
- The `strict` encoding profile has been introduced for (most) I/O.

Together, these changes mean that even a single copyright (U+00A9) character in an otherwise ASCII file will prevent the script from being loaded if the file is in ISO8859-1 encoding, as is common.

In addition, text files containing non-ASCII characters that were written in `[encoding system]` in Tcl 8 may not be readable in Tcl 9 without an explicit `fconfigure -encoding` invocation. This is a particular problem on Windows, as `[encoding system]` in Tcl 9 returns `utf-8` irrespective of the user's code page setting that was used in Tcl 8.

The [tcl9migrate](https://github.com/apnadkarni/tcl9-migrate) tool is an ongoing (as in do not try it yet!) effort to help users with the above as well as with other incompatibilities like octal and tilde expansion. It includes both a static checker based on Nagelfar and a runtime shim to help with the data file encoding issues. However, it requires some functionality from ICU for this purpose. Hence this proposal.

## Specification

The `tcl::unsupported::icu` command ensemble includes the following subcommands:

- `detect` to guess file encodings
- `icuToTcl` and `tclToIcu` to map ICU names to Tcl encoding names and vice versa

```
icu detect ?DATA ?-all??
icu icuToTcl ICUENCODINGNAME
icu tclToIcu TCLENCODINGNAME
```

With no arguments, `icu detect` returns the names of all encodings known to ICU.

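
As a rough illustration of the no-argument form and of the name mapping, the sketch below lists a few ICU encoding names together with their Tcl equivalents. It assumes a Tcl 9 build that includes the proposed command and a system with the ICU libraries installed. The helper name `icuEncodingMap` is hypothetical, and since the proposal does not say what `icuToTcl` returns for a name with no Tcl equivalent, the sketch records an empty string if the call fails.

```tcl
# Hypothetical helper, not part of the proposal: returns a flat list of
# ICU-name/Tcl-name pairs for the first $limit encodings reported by ICU.
proc icuEncodingMap {{limit 5}} {
    # If the command or the ICU libraries are unavailable, the call fails;
    # in that case report no encodings rather than raising an error.
    if {[catch {tcl::unsupported::icu detect} icuNames]} {
        return {}
    }
    set map {}
    foreach icuName [lrange $icuNames 0 [expr {$limit - 1}]] {
        # Assumption: a failed mapping is recorded as an empty Tcl name.
        if {[catch {tcl::unsupported::icu icuToTcl $icuName} tclName]} {
            set tclName ""
        }
        lappend map $icuName $tclName
    }
    return $map
}

foreach {icuName tclName} [icuEncodingMap] {
    puts "$icuName -> $tclName"
}
```

On a system without ICU the loop simply prints nothing, which matches the proposal's position that the functionality is optional.
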
If `DATA` is specified, it should be a binary string, and the command returns the ICU name of the most likely encoding. If the `-all` option is specified, it returns a list of possible ICU encoding names ordered by their confidence score (highest to lowest). An empty string is returned if nothing matches.

Some caveats:

- Results are poor for short strings, since there is little data to analyze.
- Results are best if the caller first checks for UTF-8 and only then calls `icu detect`, since a false positive is very unlikely for UTF-8 while ICU often mistakes UTF-8 for ISO8859-1. This is the strategy followed by the `tcl9migrate` package.
- Additionally, when dealing with Windows file content, it is better to check for UTF-8 first, then the current code page, and only then call `icu detect`.

The above commands wrap the lower-level ICU API, which is for internal use only at the moment and is not documented here.

The command will be auto-loaded on first use.

The ICU libraries will not be shipped with Tcl, and the functionality will not be available on systems that do not have them installed. This is acceptable because the intended use is on developer systems used for porting scripts. On recent versions of Windows 10 and later, ICU is already present as part of the system libraries.

## Implementation

The implementation follows the one already present in Tk, except for differences in the libraries loaded, which stem from the different APIs used. Combining the two is left for a later time.

As in Tk, the build system is unaffected because the ICU libraries are loaded only at runtime. This is intentional, though there is a performance cost associated with searching for the library. That cost is acceptable as it is incurred only on use of the `icu` command.

The code is in the branch [apn-experiment-chardet](https://core.tcl-lang.org/tcl/timeline?r=apn-experiment-chardet).
Z 1d5e4f859508000b2d30c889fc2b5325