Check-in [f86fa8720b]

Overview
Comment:WIP
SHA3-256: f86fa8720b760b6999fb786e8fa6aaa5658761bee3aaac127d41ed24d0e6d396
User & Date: dgp 2020-03-11 14:01:45
Context
2020-03-11
16:55
WIP check-in: d302346283 user: dgp tags: trunk
14:01
WIP check-in: f86fa8720b user: dgp tags: trunk
2020-03-10
00:26
format check-in: 10ca9bc3e9 user: dgp tags: trunk
Changes

Changes to tip/568.md.

Old version (tip/568.md, lines 69-169):

an array stored within Tcl's own structures.  It is implicit in the
structure of the interface that it is presumed this extraction cannot
fail. Thus there is no need to arrange for an interpreter to receive
error messages or codes. The caller is instructed that the return value
need not be checked for **NULL**. The final output argument is a pointer to
space where the length of the *foo* array returned may be written.
Documentation must be consulted to determine the conditions under which
the caller my read, or even write, to that array. This pattern is
followed in the cases where **Foo** is **String**, **Unicode**,
or **ByteArray**.

Again it must be understood that all of these routines have value only
so far as they produce a *foo* value that caller can use in the place
of operating directly on the *Tcl_Obj*.  Experience has proven that
the existing specification of the routine **Tcl_GetByteArrayFromObj**
fails that test.

# History and Rationale

Starting with release 8.1, Tcl string values were composed of
................................................................................
as a sequence of characters from UCS-2.  Each routine that accepts a string
as a *char* array, and the string representation of a *Tcl_Obj* both expect
to store a UCS-2 sequence in a Modified UTF-8 encoding.  (For release 8.7,
we are working on extending the Tcl character set from UCS-2 to all of
Unicode, but that will not change the two facts that are important here:
(1) For reliable binary transfer, we can no longer simply write arbitrary bytes
to a string representation; (2) General Tcl strings contain characters
outside the byte range.) The new rules for encoding string values created
the need for a new mechanism to accept, transmit, store, and produce
arbitrary binary values, preferably while minimizing the need to convert to
other representations.

The _bytearray_ **Tcl_ObjType** was created to address this need. The
routine **Tcl_NewByteArrayObj** stores an arbitrary byte sequence in
a *Tcl_Obj*.  The routine **Tcl_GetByteArrayFromObj** can then retrieve
that same sequence.  
When the string representation of such a value is
needed, each byte (with value from 0-255) in the sequence is treated
as the corresponding UCS-2 codepoint (U+0000 - U+00FF), and that
UCS-2 sequence is encoded in Modified UTF-8 in the usual way.  This
strategy permits all byte sequences to be encoded in a subset of
Tcl string values. 

................................................................................
has been stored, a byte sequence is generated from the string representation.
When the string is in the subset of strings that can be produced by encoding
byte sequences, the decoding is clear. For other string values, those that
contain at least one codepoint greater than U+00FF, it was decided
that any larger codepoint in the string value would have its high
bits stripped away, and be decoded based on the contents of the low 8 bits
it contained.  Given this decision, all strings produce a byte sequence, and
**Tcl_GetByteArrayFromObj** would always return a result. It did not need
to provide for raising errors.  This decision is the source of all the trouble.

When a caller of **Tcl_GetByteArrayFromObj** receives access to a byte
sequence, it does not know whether this is a sequence originally stored,
or one generated by transforming and possibly truncating characters from
a general Tcl string value.  This means the contents of the byte sequence
do not reliably reveal all about the value.  We could not say, for example,
what the 3rd character in the value is. At best we could say what is left
when all high bits are stripped off that character.  It is not a common
need to treat all string values according to equivalence classes set by
examining only the low-bytes of every character.

If we supplement the call to **Tcl_GetByteArrayFromObj** with a call
to **Tcl_HasStringRep**, we might learn that the value does not have
a string representation stored within it. In that case, we have what
we have come to call a _pure_ bytearray value. We can then be sure the byte
sequence is an original one. It did not come from a string since there is
no string.  We can then use the byte sequence as the definitive value.
This helps, but it only goes so far.
If anything causes the string representation to be generated, we lose this
supplementary test, and we are back to being unable to use the byte sequence
at all.

This is tricky enough that callers of **Tcl_GetByteArrayFromObj** get it wrong,
even the callers within Tcl itself.  See Tcl bugs [0e92c404f1] and
[2637173] and this interactive demonstration:

<pre>
```
>	% info patch
>	8.5.15
>	% set s \u0141
>	Ł
>	% binary scan $s c x
>	1
>	% string index $s 0
>	A
```
</pre>

# Specification

# Compatibility

# Scope

This TIP proposes a change to **Tcl_GetByteArrayFromObj** that is an
incompatibility in its return value.  There is also interest (see [TIP 481])






New version (tip/568.md, lines 69-201):

an array stored within Tcl's own structures.  It is implicit in the
structure of the interface that it is presumed this extraction cannot
fail. Thus there is no need to arrange for an interpreter to receive
error messages or codes. The caller is instructed that the return value
need not be checked for **NULL**. The final output argument is a pointer to
space where the length of the *foo* array returned may be written.
Documentation must be consulted to determine the conditions under which
the caller may read, or even write to that array. This pattern is
followed in the cases where **Foo** is **String**, **Unicode**,
or **ByteArray**.

Again it must be understood that all of these routines have value only
so far as they produce a *foo* value that the caller can use in the place
of operating directly on the *Tcl_Obj*.  Experience has proven that
the existing specification of the routine **Tcl_GetByteArrayFromObj**
fails that test.

# History and Rationale

Starting with release 8.1, Tcl string values were composed of
................................................................................
as a sequence of characters from UCS-2.  Each routine that accepts a string
as a *char* array, and the string representation of a *Tcl_Obj* both expect
to store a UCS-2 sequence in a Modified UTF-8 encoding.  (For release 8.7,
we are working on extending the Tcl character set from UCS-2 to all of
Unicode, but that will not change the two facts that are important here:
(1) For reliable binary transfer, we can no longer simply write arbitrary bytes
to a string representation; (2) General Tcl strings contain characters
with codepoints outside the byte range.) The new rules for encoding string
values created the need for a new mechanism to accept, transmit, store, and
produce arbitrary binary values, preferably while minimizing the need to
convert to other representations.

The _bytearray_ **Tcl_ObjType** was created to address this need. The
routine **Tcl_NewByteArrayObj** stores an arbitrary byte sequence in
a *Tcl_Obj*.  The routine **Tcl_GetByteArrayFromObj** can then retrieve
that same byte sequence.  
When the string representation of such a value is
needed, each byte (with value from 0-255) in the sequence is treated
as the corresponding UCS-2 codepoint (U+0000 - U+00FF), and that
UCS-2 sequence is encoded in Modified UTF-8 in the usual way.  This
strategy permits all byte sequences to be encoded in a subset of
Tcl string values. 
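
The byte-to-string mapping just described can be sketched in plain C.
This is only an illustration under stated assumptions, not Tcl's own code:
the helper name `bytes_to_modified_utf8` is hypothetical, and the sketch
assumes the usual Modified UTF-8 rule that U+0000 is written as the
two-byte sequence 0xC0 0x80.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the Tcl API): treat each input byte
 * as the codepoint of the same value (U+0000 - U+00FF) and write its
 * Modified UTF-8 encoding to dst.  Returns the number of bytes written.
 * dst must have room for 2 * len bytes.
 */
size_t bytes_to_modified_utf8(const unsigned char *src, size_t len,
                              unsigned char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char b = src[i];

        if (b != 0 && b < 0x80) {
            /* U+0001 - U+007F: one byte, unchanged. */
            dst[n++] = b;
        } else {
            /* U+0000 and U+0080 - U+00FF: two bytes. */
            dst[n++] = (unsigned char)(0xC0 | (b >> 6));
            dst[n++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return n;
}
```

Every byte sequence has exactly one such encoding, which is why the set of
byte sequences maps onto a subset of Tcl string values.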

................................................................................
has been stored, a byte sequence is generated from the string representation.
When the string is in the subset of strings that can be produced by encoding
byte sequences, the decoding is clear. For other string values, those that
contain at least one codepoint greater than U+00FF, it was decided
that any larger codepoint in the string value would have its high
bits stripped away, and be decoded based on the contents of the low 8 bits
it contained.  Given this decision, all strings produce a byte sequence, and
**Tcl_GetByteArrayFromObj** would always return a non-NULL pointer to
a byte sequence. The routine did not need to specify any mechanism
for raising errors.  This decision is the source of all the trouble.

When a caller of **Tcl_GetByteArrayFromObj** receives access to a byte
sequence, it does not know whether this is a sequence originally stored,
or one generated by transforming and possibly truncating characters from
a general Tcl string value.  This means the contents of the byte sequence
do not reliably reveal information about the value.  We could not say,
for example, what the 3rd character in the value is. At best we could
say what is left when all high bits are stripped off that character.
It is not a common need to treat all string values according to
the equivalence classes established by examining only the low-bytes of
every character.
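
The loss of information can be seen in a small model of the truncating
conversion. This is a sketch, not the Tcl implementation: the function name
`low_bytes` is hypothetical, and characters are modelled as 16-bit UCS-2
codepoints in an `unsigned short` array.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the historical decoding rule: every character
 * contributes only its low 8 bits to the byte sequence.  The high bits
 * are discarded without any error being reported.
 */
void low_bytes(const unsigned short *chars, size_t len,
               unsigned char *bytes)
{
    assert(chars != NULL && bytes != NULL);
    for (size_t i = 0; i < len; i++) {
        bytes[i] = (unsigned char)(chars[i] & 0xFF);
    }
}
```

Under this rule the one-character strings U+0141 and U+0041 both yield the
single byte 0x41, so a caller holding only the byte cannot tell which
string it came from.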

If we supplement the call to **Tcl_GetByteArrayFromObj** with a call
to **Tcl_HasStringRep**, we might learn that the value does not have
a string representation stored within it. In that case, the value is what
we have come to call a _pure_ bytearray value. We can then be sure the byte
sequence is an original one. It did not come from a string since there is
no string.  We can then use the byte sequence as the definitive value.
This helps, but only so much.
If anything causes the string representation to be generated, we lose this
supplementary test, and we are back to being unable to use the byte sequence
at all.
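
The supplementary test and its limitation can be modelled as follows. This
is a toy sketch only: `ToyValue` and `toy_get_pure_bytes` are hypothetical
stand-ins, not the real *Tcl_Obj* layout or any Tcl routine, and the
`has_string_rep` flag plays the role of the answer **Tcl_HasStringRep**
would give.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Tcl value; not the real Tcl_Obj layout. */
typedef struct {
    const unsigned char *bytes;  /* byte sequence held by the value */
    size_t length;               /* number of bytes in the sequence */
    int has_string_rep;          /* nonzero once a string rep exists */
} ToyValue;

/*
 * Return the byte sequence only when the value is "pure", that is, when
 * no string representation is stored.  Otherwise return NULL, because
 * the bytes can no longer be distinguished from bytes produced by
 * truncating a general string.
 */
const unsigned char *toy_get_pure_bytes(const ToyValue *v,
                                        size_t *length_out)
{
    assert(v != NULL && length_out != NULL);
    if (v->has_string_rep) {
        return NULL;
    }
    *length_out = v->length;
    return v->bytes;
}
```

In this model the same stored bytes become unusable as soon as
`has_string_rep` is set, even though they have not changed.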

This is tricky enough that callers of **Tcl_GetByteArrayFromObj** get it wrong,
even the callers within Tcl itself.  See Tcl
bugs [0e92c404f1](https://core.tcl-lang.org/tcl/info/0e92c404f1)
and [2637173](https://core.tcl-lang.org/tcl/info/2637173) and this
interactive demonstration:

<pre>
```
>	% info patchlevel
>	8.5.15
>	% set s \u0141
>	Ł
>	% binary scan $s c x
>	1
>	% string index $s 0
>	A
```
</pre>

In Tcl 8.5.16+ these bugs are fixed. In Tcl 8.5 and 8.6, the testing for
pure bytearray values is performed, so the wrong results are no longer
produced. However, this approach still loses all utility of byte array
internal representations as soon as a string representation is generated.
In Tcl 8.7, the internals have been reworked so that two distinct
**Tcl_ObjType**s are in use. One of these implements a proper byte array
**Tcl_ObjType** that rejects string values that contain any non-byte 
character. The other tolerates such strings, so that the behavior of
**Tcl_GetByteArrayFromObj** can be continued without any incompatibility.

The next evolutionary step is to incompatibly revise the documented
interface for **Tcl_GetByteArrayFromObj** so that it can return **NULL**
and so that each caller is burdened with handling that possibility.
The routine **Tcl_SetByteArrayLength** has the same concerns and should
change in the same way.

With the plan to allow extraction of bytes from a value to fail, it
becomes clear in hindsight that a routine following the second pattern
would be more appropriate, so that Tcl can offer the service of generating
a standard error message and error code.
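
An extraction that is allowed to fail might behave like the following
model. It is a sketch under assumptions, not the routine this TIP
specifies: the name `strict_bytes`, the `bad_index` output, and the use of
an `unsigned short` array for UCS-2 characters are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a strict extraction.  Bytes are produced only
 * if every character is in the byte range U+0000 - U+00FF.  Returns 1
 * on success.  On failure returns 0, leaves bytes untouched, and stores
 * the index of the first offending character in *bad_index so that a
 * caller could build an error message from it.
 */
int strict_bytes(const unsigned short *chars, size_t len,
                 unsigned char *bytes, size_t *bad_index)
{
    assert(chars != NULL && bytes != NULL && bad_index != NULL);

    /* First pass: reject the whole value if any character is too large. */
    for (size_t i = 0; i < len; i++) {
        if (chars[i] > 0xFF) {
            *bad_index = i;
            return 0;
        }
    }
    /* Second pass: the conversion is now lossless. */
    for (size_t i = 0; i < len; i++) {
        bytes[i] = (unsigned char)chars[i];
    }
    return 1;
}
```

With this shape, a caller that receives a failure knows the value is not a
byte sequence at all, instead of silently receiving truncated data.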

# Specification

In Tcl 8.7, create a new public routine,

> _unsigned char *_ **Tcl_GetBytesFromObj**(*Tcl_Interp* _*_, *Tcl_Obj* _*_,  _int *_);

which attempts to extract a byte sequence format of the value returns a pointer to a byte sequence


# Compatibility

# Scope

This TIP proposes a change to **Tcl_GetByteArrayFromObj** that is an
incompatibility in its return value.  There is also interest (see [TIP 481])