Overview
Comment: | edit |
---|---|
SHA3-256: |
a93193cf6f39caa1f038f82fcc22eff0 |
User & Date: | dgp 2020-03-09 23:51:36.952 |
Context
2020-03-10
00:23 | demo check-in: 16a3b9a8df user: dgp tags: trunk |
2020-03-09
23:51 | edit check-in: a93193cf6f user: dgp tags: trunk |
23:37 | WIP on TIP 568 check-in: 6aac90669d user: dgp tags: trunk |
Changes
Changes to tip/568.md.
︙ | ︙ | |||
the existing specification of the routine **Tcl_GetByteArrayFromObj**
fails that test.

# History and Rationale

Starting with release 8.1, Tcl string values were composed of characters from an international character set. Each string was conceived as a sequence of characters from UCS-2. Each routine that accepts a string as a *char* array, and the string representation of a *Tcl_Obj*, both expect to store a UCS-2 sequence in a Modified UTF-8 encoding. (For release 8.7, we are working on extending the Tcl character set from UCS-2 to all of Unicode, but that will not change the two facts that are important here: (1) For reliable binary transfer, we can no longer simply write arbitrary bytes to a string representation; (2) General Tcl strings contain characters outside the byte range.)

The new rules for encoding string values created the need for a new mechanism to accept, transmit, store, and produce arbitrary binary values, preferably while minimizing the need to convert to other representations. The _bytearray_ **Tcl_ObjType** was created to address this need. The routine **Tcl_NewByteArrayObj** stores an arbitrary byte sequence in a *Tcl_Obj*. The routine **Tcl_GetByteArrayFromObj** can then retrieve that same sequence.
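To make fact (1) concrete, here is a minimal sketch of how a single UCS-2 codepoint is laid out in Modified UTF-8. The helper name `encode_modified_utf8` is hypothetical and is not part of the Tcl API. It is a simplified model that assumes the usual Modified UTF-8 layout, including the two-byte form for U+0000, and ignores any special treatment of surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a Tcl routine: encode one UCS-2 codepoint
 * as Modified UTF-8. Writes 1 to 3 bytes into out and returns the
 * number of bytes written.
 */
static size_t encode_modified_utf8(uint16_t cp, unsigned char out[3])
{
    if (cp != 0 && cp < 0x80) {
        /* U+0001 - U+007F: one byte, identical to ASCII. */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        /*
         * U+0080 - U+07FF: two bytes. U+0000 also lands here, so it
         * is encoded as 0xC0 0x80 and never as a bare NUL byte.
         */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    /* U+0800 - U+FFFF: three bytes. */
    out[0] = (unsigned char) (0xE0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return 3;
}
```

Under this model the byte value 0xFF, taken as the codepoint U+00FF, occupies the two bytes 0xC3 0xBF in a string representation, and the byte value 0x00 occupies 0xC0 0x80. A raw binary buffer copied directly into a string representation would generally not be a valid encoding of anything.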
When the string representation of such a value is needed, each byte (with value from 0-255) in the sequence is treated as the corresponding UCS-2 codepoint (U+0000 - U+00FF), and that UCS-2 sequence is encoded in Modified UTF-8 in the usual way. This strategy permits all byte sequences to be encoded in a subset of Tcl string values.

When **Tcl_GetByteArrayFromObj** is called on a value where no byte sequence has been stored, a byte sequence is generated from the string representation. When the string is in the subset of strings that can be produced by encoding byte sequences, the decoding is clear. For other string values, those that contain at least one codepoint greater than U+00FF, it was decided that any larger codepoint in the string value would have its high bits stripped away, and be decoded based on the contents of the low 8 bits it contained. Given this decision, all strings produce a byte sequence, and **Tcl_GetByteArrayFromObj** would always return a result. It did not need to provide for raising errors.

This decision is the source of all the trouble. When a caller of **Tcl_GetByteArrayFromObj** receives access to a byte sequence, it does not know whether this is a sequence originally stored, or one generated by transforming and possibly truncating characters from a general Tcl string value. This means the contents of the byte sequence do not reliably reveal all about the value. We could not say, for example, what the 3rd character in the value is. At best we could say what is left when all high bits are stripped off that character. It is not a common need to treat all string values according to equivalence classes set by examining only the low-bytes of every character.

If we supplement the call to **Tcl_GetByteArrayFromObj** with a call to **Tcl_HasStringRep**, we might learn that the value does not have a string representation stored within it. In that case, we have what we have come to call a _pure_ bytearray value.
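The loss of information described above can be modeled in a few lines of C. This is a sketch only: it works on arrays of UCS-2 codepoints instead of real *Tcl_Obj* values, and both function names are hypothetical. The truncation count returned by `codepoints_to_bytes` exists only in this model; the point of the discussion is that a caller of **Tcl_GetByteArrayFromObj** receives no such signal.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of generating a string value from a stored byte
 * sequence: each byte b becomes the codepoint with the same value.
 */
static void bytes_to_codepoints(const unsigned char *bytes, size_t n,
                                uint16_t *cps)
{
    for (size_t i = 0; i < n; i++) {
        cps[i] = bytes[i];
    }
}

/*
 * Hypothetical model of the legacy decoding rule: keep only the low
 * 8 bits of every codepoint. Returns how many codepoints were above
 * U+00FF and therefore lost information.
 */
static size_t codepoints_to_bytes(const uint16_t *cps, size_t n,
                                  unsigned char *bytes)
{
    size_t truncated = 0;

    for (size_t i = 0; i < n; i++) {
        bytes[i] = (unsigned char) (cps[i] & 0xFF);
        if (cps[i] > 0xFF) {
            truncated++;
        }
    }
    return truncated;
}
```

In this model the one-character strings U+0041 and U+0141 both decode to the single byte 0x41, so the byte sequence alone cannot tell them apart. A sequence that started out as bytes, by contrast, survives `bytes_to_codepoints` followed by `codepoints_to_bytes` unchanged.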
With a pure bytearray value, we can be sure the byte sequence is an original one. It did not come from a string since there is no string, so we can use the byte sequence as the definitive value. This helps, but it only goes so far. If anything causes the string representation to be generated, we lose this supplementary test, and we are back to being unable to use the byte sequence at all.

Bug examples.

# Specification

# Compatibility
︙ | ︙ |