TIP 716: New command "encoding user", remove UTF-8 manifest setting on Windows

Author:         Ashok P. Nadkarni <[email protected]>
Tcl-Version:    9.0.2
State:          Draft
Type:           Project
Created:        2025-04-11

Abstract

This TIP proposes to remove the activeCodePage=UTF-8 entry from the tclsh and wish Windows manifests and to implement the equivalent behavior internally in Tcl. A new command and a new option are additionally proposed, but existing Tcl 9 behavior is not changed.

Specification

The activeCodePage setting will be removed from tclsh.exe.manifest.in and wish.exe.manifest.in.

The Tcl_GetEncodingNameFromEnvironment function will be modified to return the encoding reported by the GetACP call on Windows builds prior to 18362 and utf-8 on later builds. This preserves compatibility with Tcl 9.0/9.0.1 despite the changes to the manifests. Non-Windows platforms are unaffected.

A new API Tcl_GetEncodingNameForUser will be added. On Windows platforms, it will always return the encoding as per the GetACP call irrespective of the Windows build number. On non-Windows platforms it maps to Tcl_GetEncodingNameFromEnvironment. Correspondingly, a new command encoding user taking no arguments will be added on all platforms and will return the result of Tcl_GetEncodingNameForUser.

const char *Tcl_GetEncodingNameForUser(Tcl_DString *bufPtr);
encoding user
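
The selection rule for the two functions can be modelled in a few lines of portable C. This is only a sketch under stated assumptions, not the code in the tip-716 branch: the Model* function names and the macros are invented for illustration, the build number and code page are passed in as parameters instead of being queried from Windows, and the cpNNNN naming of code page encodings is an assumption. It also assumes that build 18362 itself falls on the utf-8 side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build number taken from the TIP text; the macro name is hypothetical. */
#define UTF8_MIN_BUILD 18362u
/* Windows code page identifier for UTF-8 (CP_UTF8 in the Windows SDK). */
#define CODEPAGE_UTF8 65001u

/*
 * Hypothetical model of Tcl_GetEncodingNameForUser on Windows: the name
 * always follows the code page that GetACP would report. The "cpNNNN"
 * naming is an assumption of this sketch.
 */
static const char *
ModelEncodingNameForUser(unsigned acp, char *buf, size_t bufSize)
{
    assert(buf != NULL && bufSize >= 16);
    if (acp == CODEPAGE_UTF8) {
        strcpy(buf, "utf-8");
    } else {
        snprintf(buf, bufSize, "cp%u", acp);
    }
    return buf;
}

/*
 * Hypothetical model of the modified Tcl_GetEncodingNameFromEnvironment:
 * utf-8 on builds at or above UTF8_MIN_BUILD, the code page name on
 * earlier builds.
 */
static const char *
ModelEncodingNameFromEnvironment(unsigned build, unsigned acp,
                                 char *buf, size_t bufSize)
{
    if (build >= UTF8_MIN_BUILD) {
        return ModelEncodingNameForUser(CODEPAGE_UTF8, buf, bufSize);
    }
    return ModelEncodingNameForUser(acp, buf, bufSize);
}
```

With a user code page of 1252, the first function yields cp1252 on every build, while the second yields utf-8 on newer builds and cp1252 on older ones. The difference between the two results is what encoding user is meant to expose.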

For compatibility reasons, the Tcl_GetEncodingNameForUser function will not be public via stubs in 9.0.2 but will be public in 9.1.

The exec command will get a new option -encoding that allows the caller to specify the encoding to be used for the result of the command. If the option is not specified, the value of the TCL_EXEC_ENCODING environment variable is used. The value of this variable should be either the name of a Tcl encoding or user. In the latter case, the encoding used is that returned by Tcl_GetEncodingNameForUser. If TCL_EXEC_ENCODING is not defined in the environment, the option defaults to the encoding returned by encoding system. Defaulting to encoding user actually works much better in practice with Windows programs, as they generally use the user code page and not UTF-8. However, that would break compatibility with 9.0.

Note: The -encoding option only applies to the result of the exec command and not any redirections, in particular the << redirection.
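
The precedence among the -encoding option, TCL_EXEC_ENCODING and encoding system can be summarized as a small C function. This is a sketch only: ResolveExecEncoding and its parameters are hypothetical names, and the actual implementation may be structured quite differently. It assumes that the special value user is recognized only in the environment variable, as the text above describes, and that an explicit -encoding value is passed through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper modelling how exec could pick its result encoding.
 *   optionValue  value given to -encoding, or NULL if the option is absent
 *   envValue     value of TCL_EXEC_ENCODING, or NULL if it is not set
 *   systemName   the name that "encoding system" returns
 *   userName     the name that Tcl_GetEncodingNameForUser returns
 */
static const char *
ResolveExecEncoding(const char *optionValue, const char *envValue,
                    const char *systemName, const char *userName)
{
    assert(systemName != NULL && userName != NULL);
    if (optionValue != NULL) {
        return optionValue;          /* an explicit -encoding always wins */
    }
    if (envValue != NULL) {
        if (strcmp(envValue, "user") == 0) {
            return userName;         /* "user" maps to the user's encoding */
        }
        return envValue;             /* otherwise a Tcl encoding name */
    }
    return systemName;               /* default keeps 9.0 behavior */
}
```

A caller would typically pass getenv("TCL_EXEC_ENCODING") as envValue. Keeping systemName as the final fallback is what preserves compatibility with 9.0.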

Question: is the TCL_EXEC_ENCODING environment variable desirable? It allows the user to not have to constantly specify -encoding at the command line. On the other hand, it is one more configuration knob.

Question: if the answer to the above question is Yes, should it also apply to the command pipes created with open? My inclination is open should be left out of this TIP as one can already configure encoding for opened pipes with fconfigure.

In order to avoid issues with ANSI encodings, Tcl and extensions should continue to use the Unicode Windows API's.

Rationale

Why remove the activeCodePage manifest setting

The presence of this setting breaks

There is no workaround for some of these issues without recompiling the Tcl shells.

See the Background section for the details.

Moreover, the motivation for adding this setting is not clear as it adds no benefit at all to either tclsh or wish. That setting is intended for applications that use the narrow character (ANSI) Windows API's. Since Tcl/Tk uses the wide character Unicode API's, the setting provides no benefit to Tcl itself while breaking shared libraries that use ANSI API's but do not support UTF-8.

It is also to be noted that, so far as I can see, the manifest changes the behavior of a public API and command to ignore the user's system configuration of encodings without any TIP. (Apologies if it was in fact TIP'ed).

Removal of the manifest setting will fix the shared libraries using ANSI API's.

Why change Tcl_GetEncodingNameFromEnvironment

Removing the manifest setting reverts Tcl behaviour (on the affected Windows builds) to be consistent with prior Windows builds, other platforms and Tcl 8 as encoding system will reflect the user's registry setting for encodings.

However, since 9.0.0 and 9.0.1 have already shipped, reverting to user settings in 9.0.2 would mean that non-ASCII files written in 9.0.0 or 9.0.1 would not be readable in 9.0.2.

One possibility is to ignore this issue and document the incompatibility. This does not seem very friendly. Furthermore, at some point in the future, when Tcl makes UTF-8 the default encoding on all platforms (see Discussion), compatibility will be broken again which is really unpalatable.

The TIP therefore proposes that Tcl_GetEncodingNameFromEnvironment always return utf-8 on the affected Windows builds.

The key difference with respect to the current implementation is that this does not impact extensions that call GetACP, or those using MinGW msvcrt builds, which solves the first issue listed above.

However, the issues with exec and application data sharing remain, which leads to ...

Why add -encoding option to exec

Hard coding Tcl_GetEncodingNameFromEnvironment to utf-8 does not resolve the problem with exec of programs that use the user's code page settings. Further, asking users to switch from the convenience of exec to open so the encodings can be fconfigure-ed is not very friendly.

Adding an -encoding option to exec that allows the user to specify the encoding used in pipes makes this a little easier.

Why add Tcl_GetEncodingNameForUser and encoding user

Adding the -encoding option to exec entails the user or application knowing the user's settings. Expecting the user to look up the Windows registry and then map the numeric value to the appropriate Tcl encoding name (a mapping which is undocumented) is unreasonable. The encoding user command would encapsulate this so the user could simply write exec -encoding [encoding user] instead.

Discussion

Background

Agreement on the encoding used is necessary any time that text data is shared irrespective of whether the sharing parties are different applications, the application and the system, or even different components of a single application. The manner of sharing may be through file content, network, function call arguments, clipboard, COM etc.

In some cases, for example (relatively) modern protocols like HTTP, the encoding in use is either explicitly passed as part of the protocol, or is defined in the protocol specification. There is no ambiguity in such cases.

In many cases however, there is no such explicit specification and encoding of shared text data depends on platform convention.

On Unix platforms, applications take locale information from the environment variables LC_ALL, LANG and friends. When storing a UTF-8 encoded file name sent over HTTP for example, the name is encoded as a byte sequence using the encoding specified by this locale. On most modern Linux systems, the locale defaults to UTF-8.

On Windows systems, the situation is a little more complicated. The locale preferences are stored in the Windows registry and include both system-wide and user settings. The encoding code page can be retrieved with the GetACP Windows API call. A further complication is that the Windows API comes in two forms: narrow (ANSI) and wide (Unicode). The latter expects data, such as file names, passed to it to be encoded in UTF-16. There is no ambiguity as such. The ANSI API expects any data passed to it to be in the encoding specified by the user code page. In the HTTP example, the UTF-8 encoded file name must be converted to UTF-16 if passed to the Unicode CreateFileW API, or to the encoding returned by GetACP if passed to the ANSI CreateFileA API.
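
To make the HTTP file name example concrete, the following portable C sketch converts a UTF-8 byte sequence into UTF-16 code units, which is the form the Unicode API expects. It is illustrative only: Utf8ToUtf16 is a hypothetical helper written for this note, it does not reject overlong forms, and real Windows code would normally call MultiByteToWideChar with CP_UTF8 instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Minimal UTF-8 to UTF-16 converter (hypothetical helper). Returns the
 * number of UTF-16 code units written, or (size_t)-1 on malformed input
 * or insufficient space in dst.
 */
static size_t
Utf8ToUtf16(const unsigned char *src, size_t srcLen,
            uint16_t *dst, size_t dstCap)
{
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (in < srcLen) {
        uint32_t cp;
        size_t extra, i;
        unsigned char b = src[in++];

        /* The lead byte determines how many continuation bytes follow. */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;      /* stray continuation or invalid lead */

        if (srcLen - in < extra) return (size_t)-1;   /* truncated input */
        for (i = 0; i < extra; i++) {
            unsigned char c = src[in++];
            if ((c & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }
        /* Reject values beyond Unicode and encoded surrogates. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return (size_t)-1;

        if (cp < 0x10000) {
            if (out + 1 > dstCap) return (size_t)-1;
            dst[out++] = (uint16_t)cp;
        } else {
            /* Code points above the BMP need a surrogate pair. */
            if (out + 2 > dstCap) return (size_t)-1;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

For a name beginning with é, the UTF-8 input holds the two bytes C3 A9, the UTF-16 form holds the single unit 00E9, and an ANSI call under code page 1252 would expect the single byte E9. The three byte sequences differ, which is why the parties sharing the name must agree on the encoding.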

Note the code page setting has no impact on applications that use the Unicode Windows API.

For data to be sensibly shared when there is no explicit mechanism to negotiate or otherwise determine the encoding, this platform convention needs to be adhered to.

The activeCodePage manifest entry

Microsoft introduced the activeCodePage setting in Windows executable manifests as an aid to applications that use ANSI API's. The presence of this setting results in the GetACP call always returning UTF-8 irrespective of the user's actual code page setting. The intent was to make it easier for applications that use ANSI API's to support the full Unicode range. It is not useful for applications, like tclsh and wish, which use the Unicode API's. Further, Microsoft warns that not all Windows API's support this UTF-8 code page.

The purpose of adding this manifest entry for tclsh and wish, given that they use Unicode API's, was not TIP'ed and is unclear. However, it causes breakage as detailed next.

As a point of interest, on my two Windows 10 and 11 systems with most major applications installed, strings shows exactly two programs with this setting - tclsh and wish. The reader may interpret this data point as they wish. (The latest version of R, which I do not have, apparently does use this setting. More on that later.)

So what is the problem ...

The manifest entries in tclsh and wish cause the following failures and issues stemming from two root causes:

The TPC benchmarking failure reported in the core mailing list after updating Tcl 8.6 to 9.0 stemmed from mismatched code pages. This latter (DB2-like) failure is particularly treacherous because the cause of failure is not apparent: the same DLL works fine when called in exactly the same way in other processes, but not in tclsh. Further, there is no workaround, not even using encoding system to configure Tcl, as the driver is oblivious to Tcl. It simply uses the GetACP call, which has been subverted by the presence of the manifest. The only solution is to build a custom tclsh.

This TIP addresses the first cause by removing the manifest while preserving Tcl 9.0 compatibility by implementing equivalent functionality within the Tcl core to have encoding system return UTF-8. Extensions loaded into Tcl will see the user's code page setting.

For the second root cause, which cannot be fixed without breaking 9.0.0/1 compatibility, the TIP proposes encoding user and exec -encoding as workarounds.

It has been confirmed that a TIP 716 build fixes the HammerDB/TPC/DB2 failures.

Note there are other compatibility issues listed in the original core mailing list thread as a result of forcing UTF-8 as the system encoding on Windows. However, those are not addressed by this TIP as there is no way to fix them without breaking 9.0.0/1 compatibility.

Other languages

Only including this because this was one of the points raised in the mailing list discussion. The only language I found that includes a manifest was R. Python is transitioning to UTF-8 on Windows only in October 2026 (3.15) and not via the manifest in any case. Lua only deals with bytes. Ruby's Encoding.default_external setting follows the Windows code page. Raku (Perl 6) uses UTF-8 across all platforms but does not use the manifest setting. Java transitioned to UTF-8 in Java 18, but again does not use the manifest.

The R case is interesting. Their experience is blogged in a post. In summary, their motivation was a large, monolithic (with static linking including some 19000 extensions) code base that, unlike Tcl, primarily used the ANSI version of Windows API's. It was deemed too difficult to transition to the Unicode API's and they chose to force the UTF-8 code page through the manifest instead. The effort took several years. Interested folks can read the section Active code page and consequences, in particular the one titled What nightmares are made of ;-)

Implementation

Implementation is in the tip-716 branch.

Copyright

This document has been placed in the public domain.