TIP 587: Default utf-8 for source command

Login
Author:         Jan Nijtmans <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Vote-Summary:   Accepted 9/0/0
Votes-For:      BG, DGP, DKF, FV, JN, KBK, KW, MC, SL
Votes-Against:  none
Votes-Present:  none
Created:        20-Sept-2020
Post-History:
Keywords:       Tcl source
Tcl-Version:    9.0
Tcl-Branch:     tip-587

Abstract

This TIP proposes to make the default encoding for the "source" command utf-8.

Rationale

The "utf-8" encoding is the most universal encoding available, more and more systems use it as system encoding. Introducing this change means that Tcl-9 (or even 8.7) scripts can start using any Unicode character without needing to use the escaped \u???? or \U?????? forms.

Starting with TIP #389 (Tcl 8.7), invalid bytes 0x80 up to 0x9F are interpreted as the cp1252 characters € up to Ÿ. This means that any script which was written in the "cp1252" encoding (most common on Windows) will get the expected outcome using Tcl's "utf-8" decoder. Since "cp1252" is a superset of "iso8859-1" (most common on older UNIX'es), the same holds for "iso8859-1".

Tcl's current "utf-8" decoder strips the first BOM (Byte Order Mark) from the stream. This means that even scripts using the Windows BOM-prepended "utf-8" files (not recommended, but still can be produced by Notepad) will work fine.

If Tcl is not fully initialized (so just Tcl_CreateInterp(), not Tcl_Main()), it doesn't get a chance to determine the system encoding from the environment. On Windows, it falls back to "cp1252" in that case, on MacOS to "utf-8" and on other systems to "iso8859-1". It's unlikely really being noticed, but this TIP proposes to change that to "utf-8" in all cases. Since this is a potential incompatibility for external applications embedding Tcl, this change should be made for Tcl 9.0 only.

Specification

This TIP is basically for 9.0. But all changes can be made on 8.7 with almost no risk, except for the last one. Therefore, there will be separate votes for bringing this to 9.0 and bringing it (except for the last point) to 8.7 too.

Compatibility

For scripts using ASCII characters only (as was the only portable way to write scripts in Tcl 8.x) this is fully upwards compatible. Scripts assuming a different system encoding than "ascii", "utf-8", "cp1252" or "iso-8859-1" will break. On MacOS and most modern UNIX-like systems (which already have "utf-8" as system encoding) it's 100% compatible.

Implementation

See the tip-587 branch. Tcl 8.7 part: tip-587-for-8.7 branch.

A Tk modified demo showing how this eliminates the need to use Unicode escape sequences is here: tip-587 branch. (This demo works with Tk 8.6 too, even without this TIP, since the "widget" demo is adapted to source all other demos using -encoding utf-8 explicitly)

Copyright

This document has been placed in the public domain.