TIP 636: Expand List Flexibility with Abstract List Tcl_Obj Type

Login
Author:         Brian Griffin <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Votes-Summary:  4/0/0
Votes-For:      SL, DF, KW, BG
Votes-Against:  none
Votes-Present:  none
Created:        1-Sep-2022
Tcl-Version:    9.0
Tcl-Branch:     tip-636-tcl9-644

Abstract

Extend Tcl_ObjType to support native List operation on non-List value types. Functions provided by a custom ObjType will be use in place of the internal List operation, avoiding runtime ObjType conversion.

Rationale

The most frustrating part of implementing custom Tcl_Obj types is that they will be converted to a built-in Tcl type when performing typical Tcl operations on the value. This removes the advantages of having a custom type since the custom extension will have first convert the value back to its original form before it can be used. (This is referred to as "shimmering") The reason for having a custom type is usually due to either computational advantage, space advantage, or both. Maintaining this advantage is a worthy goal. Having access to built-in Tcl list operations makes the custom type even more worthwhile.

Specification

The Tcl_Obj type will be extended by adding functions that implement List operations on the given value. The functions mimic the List command behaviors without changing the Obj Type representation. By providing these functions, some or all of the script level List operations, such as lindex, llength, and foreach, would be supported for a custom type without causing the obj value to shimmer to the List type.

Here is a simple example of how this works. A simple scalar value, like a double numeric has an internal representation of "double". However, if the value is passed to [llength] command, the type will change to a List type:

    % set n [expr {15.3}]
    15.3
    % representation $n
    value is a double with a refcount of 2, object pointer at
    0x7fa3b512adc0, internal representation 0x402e99999999999a:0x0,
    string representation "15.3"

    % llength $n
    1
    % representation $n
    value is a list with a refcount of 2, object pointer at
    0x7fa3b512adc0, internal representation 0x7fa3b5864a50:0x0, string
    representation "15.3"

If the "double" type adds a lengthProc function to the type definition, then the [llength] command will correctly return "1" without having to shimmer the value to a "list".

This Tcl_ObjType is extended with slots for the following function pointers:

    /* Abstract List functions */
    Tcl_Size    (*lengthProc)(obj)
    int         (*indexProc)(interp, obj, index, elemObjPtr)
    int         (*sliceProc)(interp, obj, from, to, newObjPtr)
    int         (*reverseProc)(interp, obj, newObjPtr)
    int         (*getElementsProc)(interp, obj, start, count, objcPtr, objvPtr)
    int         (*replaceProc)(interp, obj, first, numToDelete, numToInsert, insertObjs)
    Tcl_Obj*    (*setElementProc)(interp, listPtr, indicies, valueObj)

These functions are optional and when absent, the List operation will revert to the base List operation behavior. For example, if the reverseProc is NULL, the [lreverse] operation will first shimmer the value to a List, then perform the operation.

The lengthProc is necessary if any other List function is also provided. The indexProc function is also used in some fallback operations.

Hopefully the definition of each of these functions is clear by their respective names.

Implementation Note On The Tcl_ObjType struct

TIP-644 added a version field to the Tcl_ObjType structure to allow for extending the struct with new features. The implementation of Abstract Lists makes use of this with the intent of making the interface as simple to use and maintain as possible, both internally and externally. The version field is defined as Tcl_Size which will be a monotonically increasing integer, starting at 0, for each future definition change.

The Seven function pointers added to the Tcl_ObjType give custom types control over the set of script and C API level List operations. Macros are provided for ease of use and backward compatibility. Existing custom ObjTypes can use the 'TCL_OBJTYPE_V0' macro for existing (pre 9.0) definitions. Any type that is not intended to be treated as a list, and wants to prevent converting to a list, e.g. Tcl_WideInt, can use the 'TCL_OBJTYPE_V1(a)' macro and provide an lengthProc that always returns

  1. An Abstract List implementation uses the 'TCL_OBJTYPE_V2(...)' macro, and defines at least 2 of the complete set of functions.

Incompatible Changes

One outcome of Abstract Lists is that an abstraction is no longer required to retain a copy of a list element Tcl_Obj. This is a significant change in that the current List type always retains a copy of a list element Tcl_Obj, and consequently, most all calls to retrieve an element fail to manage the reference count of the Tcl_Obj, implicitly relying on the List to manage it. An abstract list, on the other hand, can generate elements on-the-fly, and may not need to retain the Tcl_Obj. This leaves the responsibility of Tcl_Obj memory management up to the caller. To be fair, this is always the case for reference counted managed memory.

Extensions for Tcl and applications that embed Tcl, will need to check and add appropriate RefCount management to the implementation if necessary. This has been done for the Tcl core tip branch.

Here is an example of a code snippet that illustrates the problem (randomly chosen from the tcl core.) Take note of the treatment of itemObj below:

    Tcl_ListObjLength(NULL, zshPtr->outData, &listLen);
    if (count < 0) {
        count = 0;
        for (i=0; i<listLen; i++) {
            Tcl_ListObjIndex(NULL, zshPtr->outData, i, &itemObj);
            (void) Tcl_GetByteArrayFromObj(itemObj, &itemLen);
            if (i == 0) {
                count += itemLen - zshPtr->outPos;
            } else {
                count += itemLen;
            }
        }
    }

The above code inspects itemObj, but never checks or modifies the refCount. This is ok for Lists prior to this TIP. For Abstract Lists, itemObj could be newly created and have a refCount of 0. Completely safe code should traditionally be written as:

    Tcl_ListObjLength(NULL, zshPtr->outData, &listLen);
    if (count < 0) {
        count = 0;
        for (i=0; i<listLen; i++) {
            Tcl_ListObjIndex(NULL, zshPtr->outData, i, &itemObj);

            Tcl_IncrRefCount(itemObj);

            (void) Tcl_GetByteArrayFromObj(itemObj, &itemLen);
            if (i == 0) {
                count += itemLen - zshPtr->outPos;
            } else {
                count += itemLen;
            }

            Tcl_DecrRefCount(itemObj);

        }
    }

For a traditional List, these 2 additional calls will not effect the results. For an abstract list, a new Obj will be appropriately free'd.

The Incr/Decr calls in this example introduce an inefficiency that justified making an implied assumption about List elements. Since Abstract Lists changes the basis of this assumption, a different approach to improve efficiency is needed. In these cases, deferring reference count updates can be employed. It is for this reason that as part of the Abstract List implementation, a new refCount function has been added for this scenario:

    Tcl_BumpObj(objPtr)

It would be used like this:

    Tcl_ListObjLength(NULL, zshPtr->outData, &listLen);
        if (count < 0) {
        count = 0;
        for (i=0; i<listLen; i++) {
            Tcl_ListObjIndex(NULL, zshPtr->outData, i, &itemObj);
            (void) Tcl_GetByteArrayFromObj(itemObj, &itemLen);
            if (i == 0) {
                count += itemLen - zshPtr->outPos;
            } else {
                count += itemLen;
            }

            Tcl_BumpObj(itemObj); // Deferred refCount update

        }
    }

This function effectively executes:

    Tcl_IncrRefCount(objPtr);
    Tcl_DecrRefCount(objPtr);

But actually implemented as:

    if (objPtr) {
        if ((objPtr)->refCount == 0) {
            Tcl_DecrRefCount(objPtr);
        }
    }

For more information on List associated refCount issues, see related TIP-192.

And there is this rabbit hole: minimizing reference count updates .

Requirements and Caveats

In all cases, the string representation for an Abstract List value must always be a valid List as defined by the Tcl language syntax and the [list] command. An Abstract List will always successfully "SetListFromAny" to a List type, memory limits aside. It is not, however, a requirement that a string representation of a list be recognized by a particular Abstract List Type; there are some cases where this is not reasonably possible. For example, the ArithSeries, used in the lseq command, has only start, step, and count stored in the internal representation and uses math to compute indexed values. Taking a list of number and computing the start, step, and count and then validate, although possible, may not be worth the effort if vary large lists are involved.

Examples

An existing example of this concept is the new lseq command (TIP-629), which implements a custom ArithSeries type that contains 4 values that represent a list. The length and index values are computed on demand based on 3 of the 4 numeric values by simple arithmetic. It does not compute and fill a complete list up front. Shimmering is thus prevented in many cases. See branch tip-636-tcl9 which (re)implements the ArithSeries using AbstractLists. Branch tip-636-tcl9-644 implements the in-core approach.

Other examples of possible Abstract Lists

It turns out that [foreach] command already implements special treatment for the dict type in order to avoid shimmering of a dict type value to a list type. It would be possible to avoid many, if not all, dict to list shimmering by using the Abstract List approach.

Other suggested examples:

Implementation

See branch tip-636-tcl9-644

Impact

On entry to the various list operations, both in the command implementation and in the bytecode engine, an initial test is made to see if the Obj is an AbstractList. A test is already made to see if the Obj is a List before calling SetListFromAny. The impact should be minimal. Note also that a similar test already exists in some limited cases for the Dict type.

It is conceivable that the traditional List implementation be (re)implemented as an AbstractList as well, then the additional overhead is simplified.

Alternatives

There is an "internal only" (partial) implementation on branch, and main (currently) internal-abstract-list

There is also a work-in-progress with a variation of abstract lists (or abstract types?) at pyk-objinterface.

Acknowledgment

Note: this is work that germinated from TIP-225. Salvatore Sanfilippo and Miguel Sofer should be recognized for their contribution.