Check-in [a93193cf6f]


Overview
Comment:edit
SHA3-256: a93193cf6f39caa1f038f82fcc22eff0b0de5ad163f2417d466f3e7ce5758b85
User & Date: dgp 2020-03-09 23:51:36
Context
2020-03-10
00:23
demo check-in: 16a3b9a8df user: dgp tags: trunk
2020-03-09
23:51
edit check-in: a93193cf6f user: dgp tags: trunk
23:37
WIP on TIP 568 check-in: 6aac90669d user: dgp tags: trunk
Changes

Changes to tip/568.md.

Old version of tip/568.md (lines 83-149):

the existing specification of the routine **Tcl_GetByteArrayFromObj**
fails that test.

# History and Rationale

Starting with release 8.1, Tcl string values were composed of
characters from an international character set. Each string was conceived
as a sequence of characters from UCS-2.  Each routine accepting a string
as a *char* array, and the string representation of each *Tcl_Obj* expected
to store a UCS-2 sequence in a Modified UTF-8 encoding.  (We are in
progress extending the Tcl character set from UCS-2 to all of Unicode,
but that will not change the important matters here. For reliable
binary transfer, we can no longer simply write arbitrary bytes to
a string representation, and Tcl strings in general contain characters
outisde the byte range.) This change created the need for a new mechanism
to accept, transmit, store, and produce arbitrary binary values, preferably
while minimizing the need to convert to other representations.


The _bytearray_ **Tcl_ObjType** was created to address this need. The
routine **Tcl_NewByteArrayObj** stores an arbitrary byte sequence in
a *Tcl_Obj*.  The routine **Tcl_GetByteArrayFromObj** can then retrieve
that same sequence.  

When the string representation of the value is
needed each byte (with value from 0-255) in the sequence is treated
as the corresponding UCS-2 codepoint (U+0000 - U+00FF), and that
UCS-2 sequence is encoded in Modified UTF-8 in the usual way.  This
strategy permits all byte sequences to be encoded in a subset of
Tcl string values. 

When **Tcl_GetByteArrayFromObj** is called on a value where no byte sequence
has been stored, one is generated from the string representation. When
the string is one from the subset produced by encoding byte sequences,
the decoding is clear, but what about other string values, those that
contain at least one codepoint greater than U+00FF?  It was decided
that any larger codepoint in the string value would have its high
bits stripped away, and be decoded based on the low 8 bits it contained.
Given this decision, all strings produce a byte sequence, and
**Tcl_GetByteArrayFromObj** would always return a result. It did not need
to provide for raising errors.  This decision is the source of all the trouble.

When a caller of **Tcl_GetByteArrayFromObj** receives access to a byte
sequence, it does not know whether this is a sequence originally stored,
or one generated by transforming and possibly truncating characters from
a general Tcl string value.  This means the contents of the byte sequence
do not reliably reveal much about the value.  We could not say, for example,
what the 3rd character in the value is. At best we could say what is left
when all high bits are stripped off that character.  It is not a common
need to treat all string values according to equivalence classes set by
examining only the low-bytes of every character.

If we supplement the call to **Tcl_GetByteArrayFromObj** with a call
to **Tcl_HasStringRep**, we might learn that the value does not have
a string representation stored within it. In that case, we have what
we have come to call a _pure_ bytearray value, and we could use the
byte sequence as the definitive value. Note however that this is fragile.
If anything causes the string representation to be generated, we lose this
supplementary test, and we are back to being unable to use the byte sequence
at all.  Testing for pure bytearrays can help, but it cannot solve the total
problem.


Bug examples.


# Specification

# Compatibility






New version of tip/568.md (lines 83-150):

the existing specification of the routine **Tcl_GetByteArrayFromObj**
fails that test.

# History and Rationale

Starting with release 8.1, Tcl string values were composed of
characters from an international character set. Each string was conceived
as a sequence of characters from UCS-2.  Each routine that accepts a string
as a *char* array, and the string representation of a *Tcl_Obj* both expect
to store a UCS-2 sequence in a Modified UTF-8 encoding.  (For release 8.7,
we are working on extending the Tcl character set from UCS-2 to all of
Unicode, but that will not change the two facts that are important here:
(1) For reliable binary transfer, we can no longer simply write arbitrary bytes
to a string representation; (2) General Tcl strings contain characters
outside the byte range.) The new rules for encoding string values created
the need for a new mechanism to accept, transmit, store, and produce
arbitrary binary values, preferably while minimizing the need to convert to
other representations.

The _bytearray_ **Tcl_ObjType** was created to address this need. The
routine **Tcl_NewByteArrayObj** stores an arbitrary byte sequence in
a *Tcl_Obj*.  The routine **Tcl_GetByteArrayFromObj** can then retrieve
that same sequence.  

When the string representation of such a value is
needed, each byte (with value from 0-255) in the sequence is treated
as the corresponding UCS-2 codepoint (U+0000 - U+00FF), and that
UCS-2 sequence is encoded in Modified UTF-8 in the usual way.  This
strategy permits all byte sequences to be encoded in a subset of
Tcl string values. 
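
As a rough illustration of the encoding step described above, the following standalone C sketch maps each byte to the codepoint of the same value and writes it out in Modified UTF-8. It is not taken from the Tcl sources; the function name `bytes_to_modified_utf8` and its signature are made up for this example, and the only assumptions are the standard Modified UTF-8 rules (U+0000 written as the two bytes C0 80, U+0001 - U+007F as one byte, U+0080 - U+00FF as two bytes).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only, not Tcl's implementation.
 * Treat each byte of src as the codepoint U+0000 - U+00FF with the same
 * value and write its Modified UTF-8 encoding to dst.
 * dst must have room for at least 2 * len bytes.
 * Returns the number of bytes written to dst.
 */
size_t bytes_to_modified_utf8(const unsigned char *src, size_t len,
                              unsigned char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char b = src[i];

        if (b != 0 && b < 0x80) {
            /* U+0001 - U+007F: a single byte, unchanged. */
            dst[n++] = b;
        } else {
            /* U+0000 (written as C0 80) and U+0080 - U+00FF: two bytes. */
            dst[n++] = (unsigned char)(0xC0 | (b >> 6));
            dst[n++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return n;
}
```

Every input byte has exactly one encoding here, which is why the set of strings produced this way can be decoded back to bytes without ambiguity.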

When **Tcl_GetByteArrayFromObj** is called on a value where no byte sequence
has been stored, a byte sequence is generated from the string representation.
When the string is in the subset of strings that can be produced by encoding
byte sequences, the decoding is clear. For other string values, those that
contain at least one codepoint greater than U+00FF, it was decided
that any larger codepoint in the string value would have its high
bits stripped away, and be decoded based on the contents of the low 8 bits
it contained.  Given this decision, all strings produce a byte sequence, and
**Tcl_GetByteArrayFromObj** would always return a result. It did not need
to provide for raising errors.  This decision is the source of all the trouble.

When a caller of **Tcl_GetByteArrayFromObj** receives access to a byte
sequence, it does not know whether this is a sequence originally stored,
or one generated by transforming and possibly truncating characters from
a general Tcl string value.  This means the contents of the byte sequence
do not reliably reveal all about the value.  We could not say, for example,
what the 3rd character in the value is. At best we could say what is left
when all high bits are stripped off that character.  It is not a common
need to treat all string values according to equivalence classes set by
examining only the low-bytes of every character.
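
The loss of information can be seen in a small standalone model. The sketch below is not the Tcl implementation; `truncate_to_bytes` is a hypothetical helper that only mimics the rule described above, keeping the low 8 bits of each UCS-2 codepoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the truncation rule, not Tcl's code:
 * each UCS-2 codepoint contributes only its low 8 bits to the result.
 * bytes must have room for len elements.
 */
void truncate_to_bytes(const uint16_t *chars, size_t len,
                       unsigned char *bytes)
{
    assert(chars != NULL && bytes != NULL);
    for (size_t i = 0; i < len; i++) {
        bytes[i] = (unsigned char)(chars[i] & 0xFF);
    }
}
```

Under this model the two-character strings U+0041 U+0042 and U+0141 U+2042 both yield the bytes 0x41 0x42, so a caller holding only the byte sequence cannot tell which string it started from.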

If we supplement the call to **Tcl_GetByteArrayFromObj** with a call
to **Tcl_HasStringRep**, we might learn that the value does not have
a string representation stored within it. In that case, we have what
we have come to call a _pure_ bytearray value. We can then be sure the byte
sequence is an original one. It did not come from a string since there is
no string.  We can then use the byte sequence as the definitive value.
This helps, but it only goes so far.
If anything causes the string representation to be generated, we lose this
supplementary test, and we are back to being unable to use the byte sequence
at all.

Bug examples.


# Specification

# Compatibility