Overview
Comment: WIP
SHA3-256: 28ec05e046d806a2bf61aaf534f0fe03
User & Date: dgp 2020-02-12 17:54:41.138
Context
2020-02-12
18:16 WIP check-in: 7baffdc778 (user: dgp, tags: dgp-review)
17:54 WIP check-in: 28ec05e046 (user: dgp, tags: dgp-review)
17:13 WIP check-in: 3a7d78ac06 (user: dgp, tags: dgp-review)
Changes
Changes to doc/dev/value-history.md.
︙
works about Tcl from that time probably refer to a Tcl string as a
"sequence of characters". In later developments the terms "byte" and
"character" have diverged in meaning. There is no defined limit on the
length of this string representation; the only limit is available memory.

There is a one-to-one connection between stored memory patterns and the
abstract notion of valid Tcl strings. The Tcl string "cat" is always
represented by a 4-byte chunk of memory storing (**0x63**, **0x61**,
**0x74**, **0x00**). Any different bytes in memory represent different
strings. This means byte comparisons are string comparisons and byte
array size is string length. The byte values themselves are what Tcl
commands operate on.

From early on, the documentation has claimed complete freedom for each
command to interpret the bytes that are passed to it as arguments (or
"words"):

> The first word is used to locate a command procedure to carry out
> the command, then all of the words of the command are passed to
> the command procedure. The command procedure is free to
> interpret each of its words in any way it likes, such as an integer,
> variable name, list, or Tcl script. Different commands interpret
> their words differently.

In practice, though, it has been expected that where interpretation of a
string element value as a member of a charset matters, the ASCII encoding
is presumed for the byte values 1..127. This is in agreement with the
representation of C string literals in all C compilers, and it anchors
the character definitions that are important to the syntax of Tcl itself.
For instance, the newline character that terminates a command in a script
is the byte value **0x0A**. No command purporting to accept and evaluate
an argument as a Tcl script would be free to choose something else.

The handling of byte values 128..255 showed more variation among commands
that took any particular note of them. Tcl provided built-in commands
__format__ and __scan__, as well as the backslash encoding forms of
__\\xHH__ and __\\ooo__ to manage the transformation between string
element values and the corresponding numeric values.
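To make the byte-identity claim above concrete, here is a minimal C
sketch (an illustration written for this review, not code from Tcl's
sources): in the Tcl 8.0-era model a string is exactly its
NUL-terminated byte array, so the byte-level routines of the C library
are, by definition, string operations.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *s = "cat";

    /* The literal "cat" occupies 4 bytes: 0x63, 0x61, 0x74, 0x00. */
    assert(memcmp(s, "\x63\x61\x74", 4) == 0);

    /* Byte comparison is string comparison... */
    assert(strcmp(s, "cat") == 0);

    /* ...and the byte count up to the terminating NUL is the
     * string length. */
    assert(strlen(s) == 3);

    printf("bytes: 0x%02X 0x%02X 0x%02X 0x%02X\n",
            (unsigned char) s[0], (unsigned char) s[1],
            (unsigned char) s[2], (unsigned char) s[3]);
    return 0;
}
```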
︙
support for standards always lags their publication, software written in
conformance to one version of Unicode is likely to encounter data
produced in conformance to a later version. In light of this, the best
practice is to accommodate and preserve unassigned codepoints as much as
possible. Software written to support Unicode 1.1 can then accept
Unicode 2 data streams, pass them through, and output them again
undamaged. In this way a middleware written to an obsolete Unicode
standard can still support providers and clients that seek to use
characters assigned only in a later standard. The middleware would fail
to properly answer any questions about the later-assigned characters,
such as a **string is alpha** test, but would still not be immediately
useless.

Unicode 1.1 left open the possibility that any codepoint in UCS-2 might
one day be assigned. Tcl 8.1 imposes no conditions on the encoding of
any **Tcl_UniChar** value at all.

The standards specifying text encodings published in the mid-1990s were
quite clear and explicit about the right way to do things. They were
often less demanding and specific about how to respond in the presence
of errors. The spirit of the Postel Robustness Principle

> *Be liberal in what you accept, and conservative in what you send.*

held considerable influence at the time. Many implementations chose to
accommodate input errors, especially when that was the natural result of
laziness.

For example, the specification of FSS-UTF and all specifications for
UTF-8 are completely clear and explicit that the proper encoding of the
codepoint **U+0000** is the byte value **0x00** (our old friend the
**NUL** byte!). However, inspection of the FSS-UTF encoding rules
quickly reveals that the multi-byte encoding strategies have room to
encode all the codepoints that already have a proper encoding as a
single byte. Unless we take active steps to prevent it, a decoding
procedure will happily decode the byte sequences (**0xC0**, **0x80**) or
(**0xE0**, **0x80**, **0x80**) as the codepoint **U+0000**. Both
Unicode 1.1 and Unicode 2.0 include sample decoder routines that do
exactly that. The text of Appendix A.2 of The Unicode Standard 2.0
(which contains one definition of UTF-8 from 1996) explicitly states,

> When converting from UTF-8 to Unicode values, however,
> implementations do not need to check that the shortest encoding
> is being used, which simplifies the conversion algorithm.
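A minimal C sketch of the overlong-form hazard described above (the
**DecodeUtf8** helper and its **reject_overlong** flag are hypothetical
names invented for this illustration, not Tcl's actual decoder): a
decoder that skips the shortest-encoding check, as the 1996-era sample
routines were permitted to, happily yields **U+0000** for the two-byte
sequence (**0xC0**, **0x80**).

```c
#include <stdio.h>

/* Decode one UTF-8 sequence of at most three bytes (the UCS-2 range).
 * Returns the codepoint, or -1 on a malformed sequence.  When
 * reject_overlong is nonzero, a sequence encoding a codepoint that
 * already has a shorter encoding is treated as an error; the sample
 * decoders in Unicode 1.1 and 2.0 did not make this check. */
static int
DecodeUtf8(const unsigned char *src, int reject_overlong)
{
    if (src[0] < 0x80) {                    /* 0xxxxxxx */
        return src[0];
    }
    if ((src[0] & 0xE0) == 0xC0) {          /* 110xxxxx 10xxxxxx */
        if ((src[1] & 0xC0) != 0x80) return -1;
        int cp = ((src[0] & 0x1F) << 6) | (src[1] & 0x3F);
        if (reject_overlong && cp < 0x80) return -1;
        return cp;
    }
    if ((src[0] & 0xF0) == 0xE0) {          /* 1110xxxx 10xxxxxx 10xxxxxx */
        if ((src[1] & 0xC0) != 0x80 || (src[2] & 0xC0) != 0x80) return -1;
        int cp = ((src[0] & 0x0F) << 12)
                | ((src[1] & 0x3F) << 6) | (src[2] & 0x3F);
        if (reject_overlong && cp < 0x800) return -1;
        return cp;
    }
    return -1;          /* stray continuation byte or longer sequence */
}

int
main(void)
{
    const unsigned char overlong[] = { 0xC0, 0x80 };

    /* Permissive decoding, as the spec text quoted above allows. */
    printf("permissive: U+%04X\n",
            (unsigned) DecodeUtf8(overlong, 0));    /* U+0000 */

    /* The strict check rejects the same bytes as malformed. */
    printf("strict:     %d\n",
            DecodeUtf8(overlong, 1));               /* -1 */
    return 0;
}
```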
UTF-16 and surrogate pairs

Compat with 8.0

︙