TIP 547: New encodings: UTF-16, UCS-2

Login
Author:         Jan Nijtmans <[email protected]>
Author:         Jan Nijtmans <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        31-May-2019
Post-History:   
Discussions-To: Tcl Core list
Keywords:       Tcl
Tcl-Version:    8.7
Tcl-Branch:     tip-547

Abstract

This TIP proposes to add more encodings for handling utf-16 and ucs-2.

Rationale

Currently, Tcl only has one multi-byte Utf encoding named "unicode". Depending on how Tcl is compiled, this could be 16-bit or 32-bit. If 16-bit, then it's currently not clear whether surrogates are handled or not. Also, those encodings always use the platform-endian mode. There is no way to force little- or big-endianess.

Therefore this TIP proposes to clear up the ambiguity: Make clear that those encodings are always 16-bit, and provide different encodings for little- and big-endian. The "utf-16" variant handles surrogates while the "ucs-2" variant does not.

Specification

This document proposes:

Use case

Tk defines it's own "ucs-2be" encoding when compiled on little-endian machines. So, this TIP means that Tk no longer needs to provide this encoding any more.

Compatibility

This is fully upwards compatible, except when Tcl is compiled with -DTCL_UTF_MAX=6 (which is - actually - not supported).

Reference Implementation

A reference implementation is available in the tip-547 branch. https://core.tcl-lang.org/tcl/timeline?r=tip-547

Copyright

This document has been placed in the public domain.