TIP 568: Revise ByteArray Routines To Support Proper Value Extraction

Author:         Don Porter <[email protected]>
State:          Draft
Type:           Project
Vote:           Pending
Created:        4-Mar-2020
Post-History:
Keywords:       bytearray
Tcl-Version:    9.0
Tcl-Branch:     dgp-properbytearray

Abstract

PLEASE HOLD YOUR COMMENTS WHILE UNDER CONSTRUCTION.

This TIP revises the public routines Tcl_GetByteArrayFromObj and Tcl_SetByteArrayLength to signal an error when they are passed an objPtr that does not contain a valid byte sequence.

Background

It is a convention of the Tcl C interface to have routines with names like Tcl_GetFooFromObj. Their purpose is to extract a value in the foo representation from a (Tcl_Obj *) argument. The caller should then be able to process the extracted foo value confident that it represents the original Tcl_Obj value. Each one of these routines is associated with one or more Tcl_ObjTypes such that multiple calls to Tcl_GetFooFromObj can benefit from internal representation caching.

These routines follow one of three patterns. The first,

int Tcl_GetFooFromObj(Tcl_Interp *, Tcl_Obj *, ..., foo *);

is used when the foo representation is of fixed size that is not too large. The int return allows for returning TCL_ERROR when no valid foo can be extracted. The Tcl_Interp is provided to receive standardized error messages and codes on failure. The final output argument is a pointer to space where the extracted foo value may be written. The value written is now the possession of the caller, generated by the routine, possibly by making a copy of something stored in the Tcl_Obj. This pattern is followed in the cases where Foo is Boolean, Bignum, Double, Encoding, Index, Int, Long, or WideInt. (See also private routines where Foo is Channel, Number, CompletionCode, WideBits, or Namespace.)

The second pattern,

foo Tcl_GetFooFromObj(Tcl_Interp *, Tcl_Obj *, ...);

is used when the foo representation is a token that can take on the value NULL. In this case, a return of NULL by the routine signals the circumstance that no valid foo can be extracted. As in the first pattern, a Tcl_Interp is provided to receive standardized error messages and codes on failure. Documentation must be consulted to determine any constraints on the use of the returned token value by the caller. It is likely to rely on information stored within the internal structures of Tcl, which may need management with reference counting, memory preservation, and/or maintaining a claim on the original Tcl_Obj. This pattern is followed in the cases where Foo is RegExp or Command. (See also private routine where Foo is Lambda.)

The final pattern,

foo Tcl_GetFooFromObj(Tcl_Obj *, int *);

is used when the returned foo is a pointer into an array stored within Tcl's own structures. The structure of the interface implicitly presumes that this extraction cannot fail. Thus there is no need to arrange for an interpreter to receive error messages or codes. The caller is instructed that the return value need not be checked for NULL. The final output argument is a pointer to space where the length of the returned foo array may be written. Documentation must be consulted to determine the conditions under which the caller may read, or even write to, that array. This pattern is followed in the cases where Foo is String, Unicode, or ByteArray.
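The three calling conventions can be compared side by side in a small standalone sketch. Nothing here is Tcl API: ToyValue and ToyInterp are hypothetical stand-ins for Tcl_Obj and Tcl_Interp, and the three ToyGet... routines are invented for illustration only. The status codes 0 and 1 play the roles of TCL_OK and TCL_ERROR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for Tcl_Obj and Tcl_Interp; not Tcl types. */
typedef struct { const char *text; } ToyValue;
typedef struct { const char *errorMsg; } ToyInterp;

/* Pattern 1: status code returned, fixed-size result written via pointer. */
int ToyGetLongFromValue(ToyInterp *interp, const ToyValue *v, long *outPtr)
{
    char *end;
    long n;

    assert(v != NULL && outPtr != NULL);
    errno = 0;
    n = strtol(v->text, &end, 10);
    if (errno != 0 || end == v->text || *end != '\0') {
        if (interp != NULL) {
            interp->errorMsg = "expected integer";
        }
        return 1;       /* plays the role of TCL_ERROR */
    }
    *outPtr = n;        /* the caller now owns this copy */
    return 0;           /* plays the role of TCL_OK */
}

/* Pattern 2: token returned, NULL signals that extraction failed. */
const char *ToyGetModeFromValue(ToyInterp *interp, const ToyValue *v)
{
    static const char *const modes[] = { "read", "write" };
    size_t i;

    assert(v != NULL);
    for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
        if (strcmp(v->text, modes[i]) == 0) {
            return modes[i];    /* points into storage the caller does not own */
        }
    }
    if (interp != NULL) {
        interp->errorMsg = "unknown mode";
    }
    return NULL;
}

/* Pattern 3: pointer into existing storage plus a length; cannot fail. */
const char *ToyGetStringFromValue(const ToyValue *v, size_t *lengthPtr)
{
    assert(v != NULL);
    if (lengthPtr != NULL) {
        *lengthPtr = strlen(v->text);
    }
    return v->text;
}
```

Only the first two shapes give the callee a way to say "no valid foo here". The third shape has no such channel, which is the constraint the rest of this TIP runs into for ByteArray.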

Again, it must be understood that all of these routines have value only insofar as they produce a foo value that the caller can use in place of operating directly on the Tcl_Obj. Experience has proven that the existing specification of the routine Tcl_GetByteArrayFromObj fails that test.

History and Rationale

Starting with release 8.1, Tcl string values were composed of characters from an international character set. Each string was conceived as a sequence of characters from UCS-2. Both the routines that accept a string as a char array and the string representation of a Tcl_Obj expect a UCS-2 sequence stored in a Modified UTF-8 encoding. (For release 8.7, we are working on extending the Tcl character set from UCS-2 to all of Unicode, but that will not change the two facts that are important here: (1) For reliable binary transfer, we can no longer simply write arbitrary bytes to a string representation; (2) General Tcl strings contain characters with codepoints outside the byte range.)

The rules for encoding string values in Tcl 8.1+ created the need for a new mechanism to accept, transmit, store, and produce arbitrary binary values, preferably while minimizing the need to convert to other representations. The bytearray Tcl_ObjType was created to address this need. The routine Tcl_NewByteArrayObj stores an arbitrary byte sequence in a Tcl_Obj. The routine Tcl_GetByteArrayFromObj can then retrieve that same byte sequence. When the string representation of such a value is needed, each byte (with value from 0-255) in the sequence is treated as the corresponding UCS-2 codepoint (U+0000 - U+00FF), and that UCS-2 sequence is encoded in Modified UTF-8 in the usual way. This strategy permits all byte sequences to be encoded in a subset of Tcl string values.
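The byte-to-string mapping described above can be sketched in standalone C. This is an illustration, not Tcl's implementation: the name ToyBytesToModifiedUtf8 is hypothetical, and the sketch assumes the usual Modified UTF-8 convention in which U+0000 is written as the two bytes C0 80 so that the encoded string never contains an embedded NUL.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode n arbitrary bytes as the Modified UTF-8 form of the code points
 * U+0000..U+00FF. dst must have room for 2*n + 1 chars. Returns the number
 * of chars written, not counting the terminating NUL.
 * (Hypothetical helper for illustration; not part of the Tcl API.)
 */
size_t ToyBytesToModifiedUtf8(const unsigned char *src, size_t n, char *dst)
{
    size_t i, w = 0;

    assert(dst != NULL);
    assert(src != NULL || n == 0);
    for (i = 0; i < n; i++) {
        unsigned char b = src[i];

        if (b != 0 && b < 0x80) {
            /* U+0001..U+007F: one byte, unchanged. */
            dst[w++] = (char) b;
        } else {
            /* U+0000 (as C0 80) and U+0080..U+00FF: two bytes. */
            dst[w++] = (char) (0xC0 | (b >> 6));
            dst[w++] = (char) (0x80 | (b & 0x3F));
        }
    }
    dst[w] = '\0';
    return w;
}
```

Every byte sequence has exactly one image under this mapping, which is why the set of strings that encode byte sequences is a well-defined subset of all Tcl strings.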

When Tcl_GetByteArrayFromObj is called on a value where no byte sequence has been stored, a byte sequence is generated from the string representation. When the string is in the subset of strings that can be produced by encoding byte sequences, the decoding is clear. For other string values, those that contain at least one codepoint greater than U+00FF, it was decided that any larger codepoint in the string value would have its high bits stripped away, and be decoded based on the contents of the low 8 bits it contained. Given this decision, all strings produce a byte sequence, and Tcl_GetByteArrayFromObj would always return a non-NULL pointer to a byte sequence. The routine did not need to specify any mechanism for raising errors. This decision is the source of all the trouble.
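The effect of stripping high bits can be seen in a minimal model that works on an array of code points rather than on a real Tcl_Obj. The routine name ToyLossyBytesFromCodepoints is hypothetical; it only mimics the decision described above and is not the Tcl source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the legacy conversion: every code point yields one byte made of
 * its low 8 bits, so the conversion can never fail.
 * (Hypothetical helper for illustration; not part of the Tcl API.)
 */
void ToyLossyBytesFromCodepoints(const uint32_t *codepoints, size_t count,
        unsigned char *out)
{
    size_t i;

    assert(out != NULL);
    assert(codepoints != NULL || count == 0);
    for (i = 0; i < count; i++) {
        out[i] = (unsigned char) (codepoints[i] & 0xFF);  /* high bits lost */
    }
}
```

Under this model U+0141 and U+0041 both produce the byte 0x41, so two different strings become indistinguishable once only the byte sequence is examined.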

When a caller of Tcl_GetByteArrayFromObj receives access to a byte sequence, it does not know whether this is a sequence originally stored, or one generated by transforming and possibly truncating characters from a general Tcl string value. This means the contents of the byte sequence do not reliably reveal information about the value. We could not say, for example, what the 3rd character in the value is. At best we could say what is left when all high bits are stripped off that character. It is not a common need to treat all string values according to the equivalence classes established by examining only the low bytes of every character.

If we supplement the call to Tcl_GetByteArrayFromObj with a call to Tcl_HasStringRep, we might learn that the value does not have a string representation stored within it. In that case, the value is what we have come to call a pure bytearray value. We can then be sure the byte sequence is an original one. It did not come from a string since there is no string. We can then use the byte sequence as the definitive value. This helps, but only so much. If anything causes the string representation to be generated, we lose this supplementary test, and we are back to being unable to use the byte sequence at all.

This is tricky enough that callers of Tcl_GetByteArrayFromObj get it wrong, even the callers within Tcl itself. See Tcl bugs 0e92c404f1 and 2637173 and this interactive demonstration:

```
% info patchlevel
8.5.15
% set s \u0141
Ł
% binary scan $s c x
1
% string index $s 0
A
```

In Tcl 8.5.16+ these bugs are fixed. In Tcl 8.5 and 8.6, the testing for pure bytearray values is performed, so the wrong results are no longer produced. However, this approach still loses all utility of byte array internal representations as soon as a string representation is generated. In Tcl 8.7, the internals have been reworked so that two distinct Tcl_ObjTypes are in use. One of these implements a proper byte array Tcl_ObjType that rejects string values that contain any non-byte character. The other tolerates such strings, so that the behavior of Tcl_GetByteArrayFromObj can be continued without any incompatibility.

The next evolutionary step is to incompatibly revise the documented interface for Tcl_GetByteArrayFromObj so that it can return NULL and so that each caller is burdened with handling that possibility. The routine Tcl_SetByteArrayLength has the same concerns and should change in the same way.

With the plan to allow extraction of bytes from a value to fail, it becomes clear in hindsight that a routine following the second pattern would be more appropriate, so that Tcl can offer the service of generating a standard error message and error code.

Specification

In Tcl 8.7, create a new public routine,

unsigned char * Tcl_GetBytesFromObj(Tcl_Interp *, Tcl_Obj *, int *);

which attempts to extract a byte sequence representation of the value. If the value contains any character outside the byte range, it does not have a valid representation as a byte sequence. In that circumstance, the value NULL is returned, and if the interpreter argument is not NULL, standard error messages and codes are left in it. Otherwise, a pointer to the byte sequence representation is returned (after creating it, if necessary) and the number of bytes in the sequence is written to the storage at the output pointer, if it is not NULL.
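The contract above can be modeled in standalone C to make the failure path concrete. This is a sketch under stated assumptions, not the reference implementation: ToyGetBytesFromCodepoints is a hypothetical name, it takes an array of code points in place of a Tcl_Obj, a `const char **` plays the role of the interpreter, and the error text is a placeholder rather than Tcl's actual message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Strict extraction: succeed only if every code point fits in a byte.
 * On success, fill buffer, write the count to *lengthPtr if it is not NULL,
 * and return buffer. On failure, return NULL, leave *lengthPtr untouched,
 * and store a message in *errorMsgPtr if it is not NULL.
 * (Hypothetical helper for illustration; not part of the Tcl API.)
 */
unsigned char *ToyGetBytesFromCodepoints(const char **errorMsgPtr,
        const uint32_t *codepoints, size_t count,
        unsigned char *buffer, size_t *lengthPtr)
{
    size_t i;

    assert(buffer != NULL);
    assert(codepoints != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (codepoints[i] > 0xFF) {
            if (errorMsgPtr != NULL) {
                /* Placeholder text; not the message Tcl produces. */
                *errorMsgPtr = "value contains a character outside the byte range";
            }
            return NULL;
        }
    }
    for (i = 0; i < count; i++) {
        buffer[i] = (unsigned char) codepoints[i];
    }
    if (lengthPtr != NULL) {
        *lengthPtr = count;
    }
    return buffer;
}
```

A caller of a routine with this shape checks the return value for NULL before touching the bytes, and propagates the failure (TCL_ERROR in a real command procedure) instead of operating on truncated data.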

[binary scan] [binary encode]

Compatibility

Scope

This TIP proposes a change to Tcl_GetByteArrayFromObj that is an incompatibility in its return value. There is also interest (see [TIP 481]) in making another incompatible change to the same routine in the type of its final output argument. It is out of scope to merge these concerns. They can be accomplished as orthogonal matters in either sequence.

Reference Implementation

See branches dgp-properbytearray (9.0) and tip-568 (8.7).

Copyright

This document has been placed in the public domain.