TIP 568: Revise ByteArray Routines To Support Proper Value Extraction

Author:         Don Porter <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        4-Mar-2020
Post-History:
Keywords:       bytearray
Tcl-Version:    9.0
Tcl-Branch:     dgp-properbytearray
Tcl-Branch:     tip-568
Vote-Summary:   Accepted 6/0/1
Votes-For:      DP, KK, JN, SL, AK, KW
Votes-Against:  none
Votes-Present:  FV

Abstract

This TIP proposes revision of the public routines Tcl_GetByteArrayFromObj and Tcl_SetByteArrayLength to signal an error when they are passed an objPtr that does not contain a valid byte sequence, starting with Tcl 9.0. It also proposes a new routine Tcl_GetBytesFromObj to provide robust bytearray use in Tcl 8.7.

Background

It is a convention of the Tcl C interface to have routines with names like Tcl_GetFooFromObj. Their purpose is to extract a value in the foo representation from a (Tcl_Obj *) argument. The caller should then be able to process the extracted foo value confident that it represents the original Tcl_Obj value. Each one of these routines is associated with one or more Tcl_ObjTypes such that multiple calls to Tcl_GetFooFromObj can benefit from internal representation caching.

These routines follow one of three patterns. The first,

int Tcl_GetFooFromObj(Tcl_Interp *, Tcl_Obj *, ..., foo *);

is used when the foo representation is of fixed size that is not too large. The int return allows for returning TCL_ERROR when no valid foo can be extracted. The Tcl_Interp is provided to receive standardized error messages and codes on failure. The final output argument is a pointer to space where the extracted foo value may be written. The value written is now in the possession of the caller, generated by the routine, possibly by making a copy of something stored in the Tcl_Obj. This pattern is followed in the cases where Foo is Boolean, Bignum, Double, Encoding, Index, Int, Long, or WideInt. (See also private routines where Foo is Channel, Number, CompletionCode, WideBits, or Namespace.)

The second pattern,

foo Tcl_GetFooFromObj(Tcl_Interp *, Tcl_Obj *, ...);

is used when the foo representation is a token that can take on the value NULL. In this case, a return of NULL by the routine signals the circumstance that no valid foo can be extracted. As in the first pattern, a Tcl_Interp is provided to receive standardized error messages and codes on failure. Documentation must be consulted to determine any constraints on the use of the returned token value by the caller. It is likely to rely on information stored within the internal structures of Tcl, which may need management with reference counting, memory preservation, and/or maintaining a claim on the original Tcl_Obj. This pattern is followed in the cases where Foo is RegExp or Command. (See also private routine where Foo is Lambda.)

The final pattern,

foo Tcl_GetFooFromObj(Tcl_Obj *, SIZE *);

is used when the returned foo is a pointer value pointing into an array stored within Tcl's own structures. The structure of the interface implicitly presumes that this extraction cannot fail. Thus there is no need to arrange for an interpreter to receive error messages or codes. The caller is instructed that the return value need not be checked for NULL. The final output argument is a pointer to space where the length of the foo array returned may be written. The length value written is of type SIZE, which may be either int or Tcl_Size (see [481]). Documentation must be consulted to determine the conditions under which the caller may read, or even write to, that array. This pattern is followed in the cases where Foo is String, Unicode, or ByteArray.

Again it must be understood that all of these routines have value only so far as they provide a foo value that the caller can use in the place of operating directly on the Tcl_Obj. Experience has proven that the existing specification of the routine Tcl_GetByteArrayFromObj fails that test.

History and Rationale

In Tcl 8.0, every Tcl string value was equivalent to an arbitrary sequence of bytes. Starting with release 8.1, the storage and transmission of Tcl string values were revised to use a Modified UTF-8 encoding so that a larger number of international character sets could be supported. This created two important facts pertaining to arbitrary binary data: (1) For reliable binary transfer, we could no longer simply write arbitrary bytes to a string representation; (2) General Tcl strings may contain characters with codepoints outside the byte range (U+0000 - U+00FF).

The rules for encoding string values in Tcl 8.1+ created the need for a new mechanism to accept, transmit, store, and produce arbitrary binary values, preferably while minimizing the need to convert to other representations. The bytearray Tcl_ObjType was created to address this need. The routine Tcl_NewByteArrayObj stores an arbitrary byte sequence in a Tcl_Obj. The routine Tcl_GetByteArrayFromObj can then retrieve that same byte sequence. When the string representation of such a value is needed, each byte (with value from 0-255) in the sequence is encoded as the corresponding codepoint (U+0000 - U+00FF), and that codepoint sequence is encoded in Modified UTF-8 in the usual way. This strategy permits all byte sequences to be encoded in a subset of Tcl string values.

When Tcl_GetByteArrayFromObj is called on a value where no byte sequence has been stored, a byte sequence is generated from the string representation. When the string is in the subset of strings that can be produced by encoding byte sequences, the decoding is clear. For other string values, those that contain at least one codepoint greater than U+00FF, it was decided that any larger codepoint in the string value would have its high bits stripped away, and be decoded based on the contents of the low 8 bits it contained. Given this decision, all strings produce a byte sequence, and Tcl_GetByteArrayFromObj would always return a non-NULL pointer to a byte sequence. The routine did not need to specify any mechanism for raising errors. This decision is the source of all the trouble.
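
The decoding rule described above can be modeled in a few lines of C. This is a minimal sketch for illustration only, not Tcl source code: the function name lossy_bytes_from_codepoints and the use of a plain codepoint array as input are assumptions, whereas the real routine works from the Modified UTF-8 string representation held in a Tcl_Obj.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the historical decoding rule: every codepoint,
 * whatever its value, contributes only its low 8 bits to the byte
 * sequence. The name and the codepoint-array input are illustrative
 * assumptions, not part of the Tcl API.
 */
size_t
lossy_bytes_from_codepoints(const unsigned int *cps, size_t n,
                            unsigned char *out)
{
    size_t i;

    assert(cps != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[i] = (unsigned char)(cps[i] & 0xFFu);  /* high bits discarded */
    }
    return n;  /* the conversion cannot fail, so there is no error path */
}
```

In this model, U+0141 and U+0041 both yield the byte 0x41, so a caller holding only the byte sequence cannot tell which character the original value contained.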

When a caller of Tcl_GetByteArrayFromObj receives access to a byte sequence, it does not know whether this is a sequence originally stored, or one generated by transforming and possibly truncating characters from a general Tcl string value. This means the contents of the byte sequence do not reliably reveal everything about the value. We could not say, for example, what the 3rd character in the value is. At best we could say what is left when all high bits are stripped off that character. It is not a common need to treat all string values according to the equivalence classes established by examining only the low byte of every character.

If we supplement the call to Tcl_GetByteArrayFromObj with a call to Tcl_HasStringRep, we might learn that the value does not have a string representation stored within it. In that case, the value is what we have come to call a pure bytearray value. We can then be sure the byte sequence is an original one. It did not come from a string since there is no string. We can then use the byte sequence as the definitive value. This helps, but only so much. If anything causes the string representation to be generated, we lose this supplementary test, and we are back to being unable to use the byte sequence at all, unless our usage is exactly to operate on the value according to the equivalence classes mentioned before.

This is tricky enough that many callers of Tcl_GetByteArrayFromObj get it wrong, even the callers within Tcl itself. See Tcl bugs 0e92c404f1 and 2637173 and this interactive demonstration:

```
% info patchlevel
8.5.15
% set s \u0141
Ł
% binary scan $s c x
1
% string index $s 0
A
```

In Tcl 8.5.16+ these bugs are fixed. In Tcl 8.5 and 8.6, the testing for pure bytearray values is performed, so the wrong results are no longer produced. However, this approach still loses all utility of byte array internal representations as soon as a string representation is generated. In Tcl 8.7, the internals have been reworked so that two distinct Tcl_ObjTypes are in use. One of these implements a proper byte array Tcl_ObjType that rejects string values that contain any non-byte character. The other tolerates such strings, so that the behavior of Tcl_GetByteArrayFromObj can be continued without any incompatibility. This preserves as much safe utility of the type as can be achieved without an interface incompatibility.

The next evolutionary step is to incompatibly revise the documented interface for Tcl_GetByteArrayFromObj so that it can return NULL and so that each caller is burdened with handling that possibility. The routine Tcl_SetByteArrayLength has the same concerns and should change in the same way.

With the plan to allow extraction of bytes from a value to fail, it becomes clear in hindsight that a routine following the second pattern would be more appropriate, so that Tcl can offer the service of generating a standard error message and error code.

Specification

In Tcl 8.7, create a new public routine,

unsigned char * Tcl_GetBytesFromObj(Tcl_Interp *, Tcl_Obj *, SIZE *);

which attempts to extract a byte sequence representation of the value. If the value contains any character outside the byte range, it does not have a valid representation as a byte sequence. In that circumstance, the value NULL is returned, and if the interpreter argument is not NULL, standard error messages and codes are left in it. Otherwise, a pointer to the byte sequence representation is returned (after creating it, if necessary) and the number of bytes in the sequence is written to the storage at the output pointer, if it is not NULL. Using the techniques from [481], the SIZE type holding the number of bytes may be either int or Tcl_Size.
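
The success and failure rules in this paragraph can be sketched as a small self-contained model. It is a hedged illustration, not the implementation of Tcl_GetBytesFromObj: the function name strict_bytes_from_codepoints, the codepoint-array input, and the caller-supplied output buffer are all assumptions, and the interpreter error reporting of the real routine is left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the strict extraction rule: the call fails
 * (returns NULL) if any codepoint lies outside U+0000 - U+00FF.
 * On success the bytes are written to out, and the length is stored
 * through lenPtr only when lenPtr is not NULL.
 */
unsigned char *
strict_bytes_from_codepoints(const unsigned int *cps, size_t n,
                             unsigned char *out, size_t *lenPtr)
{
    size_t i;

    assert(cps != NULL && out != NULL);

    /* First pass: reject the whole value if any character is not a byte. */
    for (i = 0; i < n; i++) {
        if (cps[i] > 0xFFu) {
            return NULL;
        }
    }

    /* Second pass: every codepoint fits in a byte, so copy without loss. */
    for (i = 0; i < n; i++) {
        out[i] = (unsigned char)cps[i];
    }
    if (lenPtr != NULL) {
        *lenPtr = n;
    }
    return out;
}
```

A caller of such a routine checks the return value for NULL before touching the bytes or the length, which is the same obligation the paragraph above places on callers of Tcl_GetBytesFromObj.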

In Tcl 9.0, revise routines Tcl_GetByteArrayFromObj and Tcl_SetByteArrayLength so that they return NULL whenever the value is not a proper binary sequence. Most callers will prefer to use the routine Tcl_GetBytesFromObj instead to benefit from error message and error code generation.

In Tcl 9.0, building on these changes in the internal support for arbitrary binary data, the commands binary scan and binary encode are revised to reject arguments that are not valid binary data. When such a situation arises, it indicates a programming error. Such errors are better detected and fixed than accommodated.

In Tcl 9.0, end registration of a Tcl_ObjType with the name bytearray. Callers of Tcl_GetObjType("bytearray") must be prepared to receive a NULL return value.

Compatibility

Most callers of Tcl_GetByteArrayFromObj are probably buggy, at least so far as they are not robust when operating on a value that is not valid binary data. If the programs are written so that invalid data never reaches these calls, they will not experience any incompatibility. Otherwise, the incompatibilities provide the tools needed to make the code robust in that situation.

Most callers will want to switch to Tcl_GetBytesFromObj as soon as they can start to require Tcl 8.7. It is a better design for the job, and it will not have further incompatibilities when bridging into Tcl 9.

Any callers who cannot or will not adapt their code to use Tcl_GetBytesFromObj will continue to have a Tcl_GetByteArrayFromObj that is substandard in exactly the same way it is now through all remaining Tcl 8 releases. Only the adoption of Tcl 9 will force them to change their code, and that change will be nothing more than checking a return value for NULL.

The incompatibilities in the values acceptable to binary scan and binary encode in Tcl 9 will cause no trouble in well-programmed code. Only code that allows arbitrary strings to be passed as an argument runs a risk of new errors, and such code is better discovered and improved.

Addendum

After TIP #660 was accepted, a lot of functions changed from using size_t to ptrdiff_t parameters. In order to prevent confusion, this change has been applied to the TIP text above as well.

Reference Implementation

See branches dgp-properbytearray (9.0) and tip-568 (8.7).

Copyright

This document has been placed in the public domain.