Author: Ashok P. Nadkarni <[email protected]>
Tcl-Version: 9.0.2
State: Draft
Type: Project
Created: 2025-04-11
Abstract
This TIP proposes to remove the activeCodePage=UTF-8 entry from the tclsh and wish Windows manifests and to implement the equivalent behavior internally in Tcl. A new command and a new option are additionally proposed, but existing Tcl 9 behavior is not changed.
Specification
The activeCodePage setting will be removed from tclsh.exe.manifest.in and wish.exe.manifest.in.
The Tcl_GetEncodingNameFromEnvironment function will be modified to return the encoding as per the GetACP call for Windows builds prior to 18362, and utf-8 for builds 18362 and later. This preserves compatibility with Tcl 9.0/9.0.1 despite the changes to the manifests. Non-Windows platforms are unaffected.
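The build-number rule can be pictured as a small decision function. The following C sketch is only a model of that rule, not the code in the tip-716 branch: the function name, its parameters, and the choice to pass the build number and the candidate encoding names in as arguments are all assumptions made so that the example compiles on any platform.

```c
#include <stddef.h>

/* Build number at which the specification switches to utf-8. */
#define MODEL_UTF8_BUILD 18362u

/*
 * Hypothetical model of the proposed rule (not the real Tcl function).
 *   isWindows   - nonzero when modelling a Windows build of Tcl
 *   buildNumber - Windows build number (ignored on other platforms)
 *   acpName     - encoding name derived from GetACP(), e.g. "cp1252"
 *   localeName  - encoding name derived from the locale elsewhere
 */
const char *
ModelEncodingNameFromEnvironment(int isWindows, unsigned buildNumber,
        const char *acpName, const char *localeName)
{
    if (!isWindows) {
        return localeName;      /* non-Windows platforms are unaffected */
    }
    if (buildNumber < MODEL_UTF8_BUILD) {
        return acpName;         /* older builds follow the code page */
    }
    return "utf-8";             /* same result as Tcl 9.0/9.0.1 */
}
```

With these assumptions, a user whose code page is 1252 would still see utf-8 from encoding system on a current Windows build, exactly as in 9.0.1, even though the manifest no longer forces it.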
A new API Tcl_GetEncodingNameForUser will be added. On Windows platforms, it will always return the encoding as per the GetACP call irrespective of the Windows build number. On non-Windows platforms it maps to Tcl_GetEncodingNameFromEnvironment. Correspondingly, a new command encoding user taking no arguments will be added on all platforms and will return the result of Tcl_GetEncodingNameForUser.
const char *Tcl_GetEncodingNameForUser(Tcl_DString *bufPtr);
encoding user
For compatibility reasons, the Tcl_GetEncodingNameForUser function will not be public via stubs in 9.0.2 but will be public in 9.1.
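To make "the encoding as per the GetACP call" concrete, the sketch below turns a numeric code page into an encoding name. It is an illustration under stated assumptions: the helper name is hypothetical, the code page is passed in as a parameter instead of being obtained from GetACP so that the example runs anywhere, and the naming scheme (cpNNNN, with code page 65001 reported as utf-8) is assumed here and is not taken from this TIP.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a Windows code page number as a Tcl-style
 * encoding name. The "cp%u" scheme and the special case for 65001 (the
 * Windows UTF-8 code page) are assumptions for illustration only.
 * Writes into buf and returns buf.
 */
const char *
ModelCodePageToEncodingName(unsigned codePage, char *buf, size_t bufSize)
{
    if (codePage == 65001u) {
        snprintf(buf, bufSize, "utf-8");
    } else {
        snprintf(buf, bufSize, "cp%u", codePage);
    }
    return buf;
}
```

Under this model, a Western European Windows installation (code page 1252) would yield cp1252 from encoding user regardless of the Windows build number.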
The exec command will get a new option -encoding that allows the caller to specify the encoding to be used for the result of the command. If the option is not specified, the value of the TCL_EXEC_ENCODING environment variable is used. The value of this variable should be either the name of a Tcl encoding or user. In the latter case, the encoding used is that returned by Tcl_GetEncodingNameForUser. If TCL_EXEC_ENCODING is not defined in the environment, the option defaults to the encoding returned by encoding system. Defaulting to encoding user actually works much better in practice with Windows programs, as they generally use the user code page and not UTF-8. However, that would break compatibility with 9.0.
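The precedence described above (explicit option, then environment variable, then the system encoding) can be summarized in a few lines of C. This is a sketch of the rule only, not the implementation: the function name and parameters are hypothetical, and the values that encoding system and encoding user would return are passed in as plain strings. How an empty TCL_EXEC_ENCODING value would be treated is not stated in the TIP and is not modelled.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of how exec would pick the encoding for its result.
 *   optionValue - argument of -encoding, or NULL if the option is absent
 *   envValue    - value of TCL_EXEC_ENCODING, or NULL if it is not set
 *   systemName  - what "encoding system" would return
 *   userName    - what "encoding user" would return
 */
const char *
ModelExecEncoding(const char *optionValue, const char *envValue,
        const char *systemName, const char *userName)
{
    if (optionValue != NULL) {
        return optionValue;     /* an explicit -encoding always wins */
    }
    if (envValue != NULL) {
        /* The special value "user" selects the user code page encoding. */
        return (strcmp(envValue, "user") == 0) ? userName : envValue;
    }
    return systemName;          /* 9.0-compatible default */
}
```

The last branch is what keeps existing 9.0 scripts unchanged: with neither the option nor the environment variable present, exec decodes output exactly as before.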
Note: The -encoding option only applies to the result of the exec command and not to any redirections, in particular the << redirection.
Question: is the TCL_EXEC_ENCODING environment variable desirable? It allows the user to not have to constantly specify -encoding at the command line. On the other hand, it is one more configuration knob.
Question: if the answer to the above question is Yes, should it also apply to the command pipes created with open? My inclination is that open should be left out of this TIP, as one can already configure the encoding for opened pipes with fconfigure.
In order to avoid issues with ANSI encodings, Tcl and extensions should continue to use the Unicode Windows API's.
Rationale
Why remove the activeCodePage manifest setting
The presence of this setting breaks
- binary Tcl extensions using shared libraries, including some Windows API's
- common uses of exec
- MinGW runtime compatibility
- application data sharing
There is no workaround for some of these issues without recompiling the Tcl shells.
See the Background section for the details.
Moreover, the motivation for adding this setting is not clear as it adds no benefit at all to either tclsh or wish. The setting is intended for applications that use the narrow character (ANSI) Windows API's. Since Tcl/Tk uses the wide character Unicode API's, the setting provides no benefit to Tcl itself while breaking shared libraries that use ANSI API's but do not support UTF-8.
It is also to be noted that, so far as I can see, the manifest changes the behavior of a public API and command to ignore the user's system configuration of encodings without any TIP. (Apologies if it was in fact TIP'ed).
Removal of the manifest setting will fix the shared libraries using ANSI API's.
Why change Tcl_GetEncodingNameFromEnvironment
Removing the manifest setting reverts Tcl behaviour (on the affected Windows builds) to be consistent with prior Windows builds, other platforms and Tcl 8, as encoding system will reflect the user's registry setting for encodings.
However, since 9.0.0 and 9.0.1 have already shipped, reverting to user settings in 9.0.2 would mean that non-ASCII files written by 9.0.0 or 9.0.1 would not be readable in 9.0.2.
One possibility is to ignore this issue and document the incompatibility. This does not seem very friendly. Furthermore, at some point in the future, when Tcl makes UTF-8 the default encoding on all platforms (see Discussion), compatibility will be broken again which is really unpalatable.
The TIP therefore proposes that Tcl_GetEncodingNameFromEnvironment always return utf-8 on the affected Windows builds. The key difference with respect to the current implementation is that this does not impact extensions that call GetACP or that are built with the MinGW msvcrt runtime, which solves the first issue listed above. However, the issues with exec and application data sharing remain, which leads to ...
Why add -encoding option to exec
Hard coding Tcl_GetEncodingNameFromEnvironment to utf-8 does not resolve the problem with exec of programs that use the user's code page settings. Further, asking users to switch from the convenience of exec to open so the encodings can be fconfigure-ed is not very friendly.
Adding an -encoding option to exec that allows the user to specify the encoding used in pipes makes this a little easier.
Why add Tcl_GetEncodingNameForUser and encoding user
Adding the -encoding option to exec entails the user / application knowing the user's settings. Expecting the user to look up the Windows registry and then map the numeric value to the appropriate Tcl encoding name (a mapping which is undocumented) is unreasonable. The encoding user command would encapsulate this so that the user could just exec -encoding [encoding user] instead.
Discussion
Background
Agreement on the encoding used is necessary any time that text data is shared irrespective of whether the sharing parties are different applications, the application and the system, or even different components of a single application. The manner of sharing may be through file content, network, function call arguments, clipboard, COM etc.
In some cases, for example (relatively) modern protocols like HTTP, the encoding in use is either explicitly passed as part of the protocol, or is defined in the protocol specification. There is no ambiguity in such cases.
In many cases however, there is no such explicit specification and encoding of shared text data depends on platform convention.
On Unix platforms, applications take locale information from the environment variables LC_ALL, LANG and friends. When storing a file under a UTF-8 encoded name sent over HTTP, for example, the name is encoded as a byte sequence using the encoding specified by this locale. On most modern Linux systems, the locale defaults to UTF-8.
On Windows systems, the situation is a little more complicated. The locale preferences are stored in the Windows registry and include both system-wide and user settings. The encoding code page can be retrieved with the GetACP Windows API call. A further complication is that the Windows API comes in two forms: narrow (ANSI) and wide (Unicode). The latter expects data passed to it, such as file names, to be encoded in UTF-16. There is no ambiguity as such.
The ANSI API expects any data passed to it to be in the encoding specified by the user code page. In the HTTP example, the UTF-8 encoded file name must be converted to UTF-16 if passed to the Unicode CreateFileW API, or to the encoding returned by GetACP if passed to the ANSI CreateFileA API.
Note the code page setting has no impact on applications that use the Unicode Windows API.
For data to be sensibly shared when there is no explicit mechanism to negotiate or otherwise determine the encoding, this platform convention needs to be adhered to.
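What goes wrong when the convention is not adhered to can be shown with two bytes. The C sketch below is illustrative only and uses a hypothetical helper name: it takes bytes that a writer produced as UTF-8 and decodes them the way a reader assuming a single-byte code page would, one character per byte, then re-encodes the result as UTF-8 so it can be inspected. ISO 8859-1 is used as the single-byte encoding, which agrees with cp1252 for the bytes in the example.

```c
#include <stddef.h>

/*
 * Re-encode bytes as UTF-8 while (wrongly) treating every input byte as
 * one ISO 8859-1 character. Returns the number of bytes written, not
 * counting the terminating NUL. The caller must supply an output buffer
 * of at least 2*inLen + 1 bytes.
 */
size_t
ModelMisdecodeAsLatin1(const unsigned char *in, size_t inLen,
        unsigned char *out)
{
    size_t i, n = 0;

    for (i = 0; i < inLen; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];   /* ASCII passes through unchanged */
        } else {
            /* U+0080..U+00FF take two bytes in UTF-8. */
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding it the UTF-8 bytes C3 A9 (the letter é) yields C3 83 C2 A9, which displays as "Ã©": the data is intact at the byte level, but the two parties disagree on what it means.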
The activeCodePage manifest entry
Microsoft introduced the activeCodePage setting in Windows executable manifests as an aid to applications that use ANSI API's. The presence of this setting results in the GetACP call always returning UTF-8 irrespective of the user's actual code page setting. The intent was to make it easier for applications that use ANSI API's to support the full Unicode range. It is not useful for applications, like tclsh and wish, which use the Unicode API's. Further, Microsoft warns that not all Windows API's support this UTF-8 code page.
The purpose of adding this manifest entry for tclsh and wish, given that they use Unicode API's, was not TIP'ed and is unclear. However, it causes breakage as detailed next.
As a point of interest, on my two Windows 10 and 11 systems with most major applications installed, strings shows exactly two programs with this setting: tclsh and wish. The reader may interpret this data point as they wish. (The latest version of R, which I do not have, apparently does use this setting. More on that later.)
So what is the problem ...
The manifest entries in tclsh and wish cause the following failures and issues stemming from two root causes:
- Components loaded into tclsh/wish (e.g. extensions) that use ANSI API's see UTF-8 as the code page because of the manifest, but cannot actually handle UTF-8 encodings, often due to assumptions about maximum multibyte encoding lengths. An example is components built with MinGW64 gcc using the msvcrt runtime. Other cases include DLL's that access registry values using ANSI API's, display dialogs using Windows GDI (which is not UTF-8 compatible without an experimental registry flag), etc.
- External applications fail to exchange data with tclsh due to mismatched encodings, because the application uses the user code page while Tcl hardcodes UTF-8. While this includes data exchange over pipes (e.g. via exec), more serious cases are extensions using ANSI API's.
The TPC benchmarking failure reported in the core mailing list after updating Tcl 8.6 to 9.0 stemmed from mismatched code pages.
This latter (DB2-like) failure is particularly treacherous as the cause of failure is not apparent: the same DLL works fine when called in exactly the same way in other processes, but not in tclsh. Further, there is no workaround, not even using encoding system to configure Tcl, as the driver is oblivious to Tcl. It simply uses the GetACP call, which has been subverted by the presence of the manifest. The only solution is to build a custom tclsh.
This TIP addresses the first cause by removing the manifest while preserving Tcl 9.0 compatibility by implementing equivalent functionality within the Tcl core to have encoding system return UTF-8. Extensions loaded into Tcl will see the user's code page setting.
For the second root cause, which cannot be fixed without breaking 9.0.0/1 compatibility, the TIP proposes encoding user and exec -encoding as workarounds.
It has been confirmed that a TIP 716 build fixes the HammerDB/TPC/DB2 failures.
Note there are other compatibility issues listed in the original core mailing list thread as a result of forcing UTF-8 as the system encoding on Windows. However, those are not addressed by this TIP as there is no way to fix them without breaking 9.0.0/1 compatibility.
Other languages
This is included only because it was one of the points raised in the mailing list discussion. The only language I found that includes the manifest setting was R. Python is transitioning to UTF-8 on Windows only in October 2026 (3.15), and not via the manifest in any case. Lua only deals with bytes. Ruby's Encoding.default_external setting follows the Windows code page. Raku (Perl 6) uses UTF-8 across all platforms but does not use the manifest setting. Java transitioned to UTF-8 in Java 18, but again does not use the manifest.
The R case is interesting. Their experience is blogged in a post. In summary, their motivation was a large, monolithic (with static linking, including some 19000 extensions) code base that, unlike Tcl, primarily used the ANSI version of Windows API's. It was deemed too difficult to transition to the Unicode API's and they chose to force the UTF-8 code page through the manifest instead. The effort took several years. Interested folks can read the section Active code page and consequences, in particular the one titled What nightmares are made of ;-)
Implementation
Implementation is in the tip-716 branch.
Copyright
This document has been placed in the public domain.