Author: Jan Nijtmans <[email protected]>
State: Final
Type: Project
Vote: Done
Vote-Summary: Accepted 9/0/0
Votes-For: BG, DGP, DKF, FV, JN, KBK, KW, MC, SL
Votes-Against: none
Votes-Present: none
Created: 20-Sept-2020
Post-History:
Keywords: Tcl source
Tcl-Version: 9.0
Tcl-Branch: tip-587
Abstract
This TIP proposes to make the default encoding for the "source" command utf-8.
Rationale
The "utf-8" encoding is the most universal encoding available, more and more systems use it as system encoding. Introducing this change means that Tcl-9 (or even 8.7) scripts can start using any Unicode character without needing to use the escaped \u???? or \U?????? forms.
Starting with TIP #389 (Tcl 8.7), invalid bytes 0x80 up to 0x9F are interpreted as the cp1252 characters € up to Ÿ. This means that any script which was written in the "cp1252" encoding (most common on Windows) will get the expected outcome using Tcl's "utf-8" decoder. Since "cp1252" is a superset of "iso8859-1" (most common on older UNIX'es), the same holds for "iso8859-1".
Tcl's current "utf-8" decoder strips the first BOM (Byte Order Mark) from the stream. This means that even scripts using the Windows BOM-prepended "utf-8" files (not recommended, but still can be produced by Notepad) will work fine.
If Tcl is not fully initialized (so just Tcl_CreateInterp()
, not Tcl_Main()
),
it doesn't get a chance to determine the system encoding from the environment.
On Windows, it falls back to "cp1252" in that case, on MacOS to "utf-8" and
on other systems to "iso8859-1". It's unlikely really being noticed, but this
TIP proposes to change that to "utf-8" in all cases. Since this is a
potential incompatibility for external applications embedding Tcl, this
change should be made for Tcl 9.0 only.
Specification
- The
source
command without-encoding
will assume "utf-8" as default in stead of the system encoding. - Scripts started from the "tclsh" commandline without
-encoding
will read the file using Tcl's "utf-8" decoder. - The fallback encoding for embedded build information (TIP #59) will change from "cp1252" (windows)/"iso-8859-1"(UNIX) to "utf-8".
- The
open
command and other channel functionalities are kept as they are now. - The fallback for the system encoding will change to "utf-8" on all platforms. Tcl 9.0 only.
This TIP is basically for 9.0. But all changes can be made on 8.7 with almost no risk, except for the last one. Therefore, there will be separate votes for bringing this to 9.0 and bringing it (except for the last point) to 8.7 too.
Compatibility
For scripts using ASCII characters only (as was the only portable way to write scripts in Tcl 8.x) this is fully upwards compatible. Scripts assuming a different system encoding than "ascii", "utf-8", "cp1252" or "iso-8859-1" will break. On MacOS and most modern UNIX-like systems (which already have "utf-8" as system encoding) it's 100% compatible.
Implementation
See the tip-587
branch.
Tcl 8.7 part: tip-587-for-8.7
branch.
A Tk modified demo showing how this eliminates the need to use Unicode
escape sequences is here: tip-587
branch.
(This demo works with Tk 8.6 too, even without this TIP, since the "widget"
demo is adapted to source all other demos using -encoding utf-8
explicitly)
Copyright
This document has been placed in the public domain.