Check-in [6aac90669d]

Overview
Comment:WIP on TIP 568
SHA3-256: 6aac90669d074379adf6a69d216f5464bd9c0b7ba680cf2a38f48f2a5c532f4b
User & Date: dgp 2020-03-09 23:37:04
Context
2020-03-09
23:51
edit check-in: a93193cf6f user: dgp tags: trunk
23:37
WIP on TIP 568 check-in: 6aac90669d user: dgp tags: trunk
2020-03-06
16:25
CFV on TIP 569 check-in: 807759b05d user: dgp tags: trunk
Changes

Changes to tip/568.md.

Old version (lines 37-128):

The _int_ return allows for returning **TCL_ERROR** when no valid *foo*
can be extracted. The *Tcl_Interp* is provided to receive standardized
error messages and codes on failure.  The final output argument is a pointer
to space where the extracted *foo* value may be written. The value written
is now the possession of the caller, generated by the routine, possibly by
making a copy of something stored in the *Tcl_Obj*.  This pattern is followed
in the cases where **Foo** is **Boolean**, **Bignum**, **Double**,
**Encoding**, **Index**, **Int**, **Long**, or **WideInt**.



The second pattern,

> _foo_ **Tcl_GetFooFromObj**(*Tcl_Interp* _*_, *Tcl_Obj* _*_, ...);

is used when the *foo* representation is a token that can take on
the value **NULL**.  In this case, a return of **NULL** by the routine
................................................................................
the first pattern, a *Tcl_Interp* is provided to receive standardized
error messages and codes on failure. Documentation must be consulted
to determine any constraints on the use of the returned token value by
the caller.  It is likely to rely on information stored within the internal
structures of Tcl, which may need management with reference counting,
memory preservation, and/or maintaining a claim on the original *Tcl_Obj*.
This pattern is followed in the cases where **Foo** is **RegExp**
or **Command**.

The final pattern,

> _foo_ **Tcl_GetFooFromObj**(*Tcl_Obj* _*_, _int *_);

is used when the returned *foo* is a pointer value pointing into
an array stored within Tcl's own structures.  It is implicit in the
................................................................................
so far as they produce a *foo* value that caller can use in the place
of operating directly on the *Tcl_Obj*.  Experience has proven that
the existing specification of the routine **Tcl_GetByteArrayFromObj**
fails that test.

# History and Rationale

Starting with release 8.1, Tcl string values have been composed of
characters from an international character set. Each string is conceived
as a sequence of characters from UCS-2.  Each routine accepting a string
as a *char* array, and the string representation of each *Tcl_Obj* expect
to store a UCS-2 sequence in a Modified UTF-8 encoding.  This change
created the need for a new mechanism to accept, transmit, store, and
produce arbitrary binary values, preferably while minimizing the need to
convert to other representations.

The _bytearray_ **Tcl_ObjType** was created to address this need. The
routine **Tcl_NewByteArrayObj** stores an arbitrary byte sequence in
a *Tcl_Obj*.  The routine **Tcl_GetByteArrayFromObj** can then retrieve
that same sequence.  

When the string representation of the value is
needed each byte (with value from 0-255) in the sequence is treated
as the corresponding UCS-2 codepoint (U+0000 - U+00FF), and that
UCS-2 sequence is encoded in Modified UTF-8 in the usual way.  This
strategy permits all byte sequences to be encoded in a subset of
Tcl string values. 

When a byte sequence is not present, but one is
needed, it may be generated from the string representation. When
the string is one from the subset produced by encoding byte sequences,
the decoding is clear, but what about other string values, those that
contain at least one codepoint greater than U+00FF?  It was decided
that any larger codepoint in the string value would have its high
bits stripped away, and be decoded based on the low 8 bits it contained.
Given this decision, all strings produce a byte sequence, and
**Tcl_GetByteArrayFromObj** could be defined using the patterns that
never produces errors.  This decision is the source of all the trouble.

# Specification

# Compatibility

# Scope






New version (lines 37-151):

The _int_ return allows for returning **TCL_ERROR** when no valid *foo*
can be extracted. The *Tcl_Interp* is provided to receive standardized
error messages and codes on failure.  The final output argument is a pointer
to space where the extracted *foo* value may be written. The value written
is now the possession of the caller, generated by the routine, possibly by
making a copy of something stored in the *Tcl_Obj*.  This pattern is followed
in the cases where **Foo** is **Boolean**, **Bignum**, **Double**,
**Encoding**, **Index**, **Int**, **Long**, or **WideInt**. (See also
private routines where **Foo** is **Channel**, **Number**, **CompletionCode**,
**WideBits**, or **Namespace**.)

The second pattern,

> _foo_ **Tcl_GetFooFromObj**(*Tcl_Interp* _*_, *Tcl_Obj* _*_, ...);

is used when the *foo* representation is a token that can take on
the value **NULL**.  In this case, a return of **NULL** by the routine
................................................................................
the first pattern, a *Tcl_Interp* is provided to receive standardized
error messages and codes on failure. Documentation must be consulted
to determine any constraints on the use of the returned token value by
the caller.  It is likely to rely on information stored within the internal
structures of Tcl, which may need management with reference counting,
memory preservation, and/or maintaining a claim on the original *Tcl_Obj*.
This pattern is followed in the cases where **Foo** is **RegExp**
or **Command**. (See also private routine where **Foo** is **Lambda**.)

The final pattern,

> _foo_ **Tcl_GetFooFromObj**(*Tcl_Obj* _*_, _int *_);

is used when the returned *foo* is a pointer value pointing into
an array stored within Tcl's own structures.  It is implicit in the
................................................................................
so far as they produce a *foo* value that caller can use in the place
of operating directly on the *Tcl_Obj*.  Experience has proven that
the existing specification of the routine **Tcl_GetByteArrayFromObj**
fails that test.

# History and Rationale

Starting with release 8.1, Tcl string values were composed of
characters from an international character set. Each string was conceived
as a sequence of characters from UCS-2.  Each routine accepting a string
as a *char* array, and the string representation of each *Tcl_Obj* expected
to store a UCS-2 sequence in a Modified UTF-8 encoding.  (We are in the
process of extending the Tcl character set from UCS-2 to all of Unicode,
but that will not change the important matters here. For reliable
binary transfer, we can no longer simply write arbitrary bytes to
a string representation, and Tcl strings in general contain characters
outside the byte range.) This change created the need for a new mechanism
to accept, transmit, store, and produce arbitrary binary values, preferably
while minimizing the need to convert to other representations.

The _bytearray_ **Tcl_ObjType** was created to address this need. The
routine **Tcl_NewByteArrayObj** stores an arbitrary byte sequence in
a *Tcl_Obj*.  The routine **Tcl_GetByteArrayFromObj** can then retrieve
that same sequence.  

When the string representation of the value is
needed each byte (with value from 0-255) in the sequence is treated
as the corresponding UCS-2 codepoint (U+0000 - U+00FF), and that
UCS-2 sequence is encoded in Modified UTF-8 in the usual way.  This
strategy permits all byte sequences to be encoded in a subset of
Tcl string values. 
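
The encoding step described above can be sketched in a few lines of C. This is
an illustration only, not code from Tcl: the function name
`bytes_to_modified_utf8` and its signature are hypothetical. It assumes the
usual Modified UTF-8 rule that U+0000 is written as the two bytes 0xC0 0x80,
so that no encoded string contains an embedded NUL byte.

```c
#include <stddef.h>

/*
 * Sketch only (hypothetical helper, not part of the Tcl API).
 * Treat each input byte as the code point of the same value
 * (U+0000 - U+00FF) and write it out in Modified UTF-8.
 * dst must have room for 2 * n bytes.
 * Returns the number of bytes written to dst.
 */
size_t bytes_to_modified_utf8(const unsigned char *src, size_t n,
                              unsigned char *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char b = src[i];

        if (b >= 0x01 && b <= 0x7F) {
            /* U+0001 - U+007F are stored as a single byte. */
            dst[out++] = b;
        } else {
            /* U+0000 and U+0080 - U+00FF take two bytes each. */
            dst[out++] = (unsigned char)(0xC0 | (b >> 6));
            dst[out++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

Every byte sequence maps to a distinct encoded string under this rule, which
is why the byte sequences occupy only a subset of all Tcl string values.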

When **Tcl_GetByteArrayFromObj** is called on a value where no byte sequence
has been stored, one is generated from the string representation. When
the string is one from the subset produced by encoding byte sequences,
the decoding is clear, but what about other string values, those that
contain at least one codepoint greater than U+00FF?  It was decided
that any larger codepoint in the string value would have its high
bits stripped away, and be decoded based on the low 8 bits it contained.
Given this decision, all strings produce a byte sequence, and
**Tcl_GetByteArrayFromObj** would always return a result. It did not need
to provide for raising errors.  This decision is the source of all the trouble.
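
To make the consequence of that decision concrete, here is a minimal model of
the truncation in C. It is a sketch under stated assumptions, not the Tcl
implementation: the name `codepoints_to_bytes` is hypothetical, and the string
is modeled as an array of UCS-2 code points rather than as Modified UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch only (hypothetical helper, not part of the Tcl API).
 * Model of the decoding rule described above: each code point
 * contributes exactly one byte, made of its low 8 bits.
 * dst must have room for n bytes.
 */
void codepoints_to_bytes(const uint16_t *cp, size_t n, unsigned char *dst)
{
    for (size_t i = 0; i < n; i++) {
        /* High bits are silently discarded; no error is possible. */
        dst[i] = (unsigned char)(cp[i] & 0xFF);
    }
}
```

Under this rule the one-character strings U+0041 and U+0141 both yield the
single byte 0x41, so the byte sequence alone cannot tell the two apart.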

When a caller of **Tcl_GetByteArrayFromObj** receives access to a byte
sequence, it does not know whether this is a sequence originally stored,
or one generated by transforming and possibly truncating characters from
a general Tcl string value.  This means the contents of the byte sequence
do not reliably reveal much about the value.  We could not say, for example,
what the 3rd character in the value is. At best we could say what is left
when all high bits are stripped off that character.  It is not a common
need to treat all string values according to equivalence classes set by
examining only the low-bytes of every character.

If we supplement the call to **Tcl_GetByteArrayFromObj** with a call
to **Tcl_HasStringRep**, we might learn that the value does not have
a string representation stored within it. In that case, we have what
we have come to call a _pure_ bytearray value, and we could use the
byte sequence as the definitive value. Note however that this is fragile.
If anything causes the string representation to be generated, we lose this
supplementary test, and we are back to being unable to use the byte sequence
at all.  Testing for pure bytearrays can help, but it cannot solve the total
problem.
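
The pure-bytearray test can be modeled as follows. This is a sketch with a
hypothetical stand-in type, `ModelValue`, which is not the real *Tcl_Obj*
layout; the function name `model_pure_bytes` is likewise an assumption made
for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for a Tcl value; not the real Tcl_Obj layout. */
typedef struct {
    const char *string_rep;         /* NULL when no string rep is stored */
    const unsigned char *byte_rep;  /* NULL when no byte sequence is stored */
    size_t byte_len;                /* length of byte_rep in bytes */
} ModelValue;

/*
 * Return the stored bytes only when they are the definitive value,
 * that is, when a byte sequence is stored and no string rep exists.
 * Otherwise return NULL and leave *len untouched.
 */
const unsigned char *model_pure_bytes(const ModelValue *v, size_t *len)
{
    if (v->byte_rep == NULL || v->string_rep != NULL) {
        return NULL;
    }
    *len = v->byte_len;
    return v->byte_rep;
}
```

The fragility is visible in the model: as soon as anything fills in
`string_rep`, the same value stops passing the test even though its bytes
have not changed.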



Bug examples.


# Specification

# Compatibility

# Scope