Tcl Source Code

Check-in [28ec05e046]
Overview
Comment: WIP
SHA3-256: 28ec05e046d806a2bf61aaf534f0fe037e427245f4edd447b33414af7d9ee0cd
User & Date: dgp 2020-02-12 17:54:41.138
Context
2020-02-12
18:16  WIP check-in: 7baffdc778 user: dgp tags: dgp-review
17:54  WIP check-in: 28ec05e046 user: dgp tags: dgp-review
17:13  WIP check-in: 3a7d78ac06 user: dgp tags: dgp-review
Changes
Changes to doc/dev/value-history.md.

Lines 45-80:

works about Tcl from that time probably refer to a Tcl string as
a "sequence of characters".  In later developments the terms "byte"
and "character" have diverged in meaning. There is no defined limit on the
length of this string representation; the only limit is available memory.

There is a one-to-one connection between stored memory patterns and the
abstract notion of valid Tcl strings.  The Tcl string "cat" is always
represented by a 4-byte chunk of memory storing
(**0x63**, **0x61**, **0x74**, **0x00**).
Any different bytes in memory represent different strings. This means
byte comparisons are string comparisons and byte array size is string length.
The byte values themselves are what Tcl commands operate on. From early on,
the documentation has claimed complete freedom for each command to interpret
the bytes that are passed to it as arguments (or "words"):

>	The first word is used to locate a command procedure to carry out
>	the command, then all of the words of the command are passed to
>	the command procedure.  The command procedure  is  free to
>	interpret each of its words in any way it likes, such as an integer,
>	variable name, list, or Tcl script.  Different commands interpret
>	their words differently.

In practice though, it has been expected that where interpretation of a
string element value as a member of a charset matters, the ASCII encoding
is presumed for the byte values 1..127. This is in agreement with the
representation of C string literals in all C compilers, and it anchors the
character definitions that are important to the syntax of Tcl itself. For
instance, the newline character that terminates a command in a script is
the byte value **0x0A**. No command purporting to accept and evaluate
an argument as a Tcl script would be free to choose something else.  The
handling of byte values 128..255 showed more variation among commands that
took any particular note of them.  Tcl provided built-in commands
__format__ and __scan__, as well as the backslash encoding forms of
__\\xHH__ and __\\ooo__, to manage the transformation between string element
values and the corresponding numeric values.
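
For illustration, here is a minimal sketch of those conversions as they look
in a Tcl session (the variable names are arbitrary, and the results shown in
the comments assume an ASCII-compatible build):

```tcl
# Backslash forms turn numeric values into string elements.
set s "\x63\x61\x74"    ;# hex escapes   -> the string "cat"
set t "\143\141\164"    ;# octal escapes -> the string "cat"

# format maps a numeric value to a one-element string ...
format %c 99            ;# 99 is 0x63    -> "c"

# ... and scan maps a string element back to its numeric value.
scan c %c code          ;# stores 99 in $code

# The byte 0x0A is the newline that terminates a command, so this
# string evaluates as a two-command script.
eval "set x 1\x0Aset y 2"
```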

659
660
661
662
663
664
665



666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682













683
684
685
686
687
688
689
690
691
692
693
694

Lines 660-710:

support for standards always lags their publication, software written in
conformance to one version of Unicode is likely to encounter data produced
in conformance to a later version.  In light of this, the best practice is
to accommodate and preserve unassigned codepoints as much as possible.
Software written to support Unicode 1.1 can then accept Unicode 2 data
streams, pass them through, and output them again undamaged. In this way
middleware written to an obsolete Unicode standard can still support
providers and clients that seek to use characters assigned only in a later
standard.  The middleware would fail to properly answer questions about the
later-assigned characters, such as a **string is alpha** test, but it would
not be immediately useless.

Unicode 1.1 left open the possibility that any codepoint in UCS-2 might
one day be assigned. Tcl 8.1 imposes no conditions on the encoding of any
**Tcl_UniChar** value at all.

The standards specifying text encodings published in the mid-1990s were quite
clear and explicit about the right way to do things. They were often less
demanding and specific about how to respond in the presence of errors. The
spirit of the Postel Robustness Principle

>	*Be liberal in what you accept, and conservative in what you send.*

held considerable influence at the time. Many implementations chose to
accommodate input errors, especially when that was the natural result
of laziness. For example, the specification of FSS-UTF and all specifications
for UTF-8 are completely clear and explicit that the proper encoding
of the codepoint **U+0000** is the byte value **0x00** (our old friend
the **NUL** byte!). However, inspection of the FSS-UTF encoding rules
quickly reveals that the multi-byte encoding strategies have room to
encode all the codepoints that already have a proper encoding as a single
byte.  Unless we take active steps to prevent it, a decoding procedure
will happily decode the byte sequences (**0xC0**, **0x80**) or
(**0xE0**, **0x80**, **0x80**) as the codepoint **U+0000**.  Both
Unicode 1.1 and Unicode 2.0 include sample decoder routines that do
exactly that. The text of Appendix A.2 of The Unicode Standard 2.0
(which contains one definition of UTF-8 from 1996) explicitly states,

>	When converting from UTF-8 to Unicode values, however,
>	implementations do not need to check that the shortest encoding
>	is being used, which simplifies the conversion algorithm.
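
To make the pitfall concrete, here is a minimal sketch (the procedure is
illustrative, not code from Tcl or from the Unicode sample routines) of a
two-byte decode step written in that permissive style: it masks and combines
the payload bits without ever checking whether a shorter encoding exists.

```tcl
# A lenient two-byte UTF-8 decode step: take 5 payload bits from the lead
# byte and 6 from the trail byte, with no shortest-form check.
proc decode2 {lead trail} {
    expr {(($lead & 0x1F) << 6) | ($trail & 0x3F)}
}

format U+%04X [decode2 0xC3 0xA9]   ;# U+00E9 - the intended use of the form
format U+%04X [decode2 0xC0 0x80]   ;# U+0000 - an overlong form, accepted anyway
```

A strict decoder would instead reject any two-byte sequence whose decoded
value falls below **0x80**, since those codepoints already have a proper
one-byte encoding.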







UTF-16 and surrogate pairs

Compat with 8.0