Tcl Source Code

Hash Tables
Login

Source Files

Public Interface

Private Interface

Directly Depends On Public Interface

Directly Depends On Private Interface of

Discussion

The key feature of the interface of this module is that the central struct definitions are open. This severely limits the freedom to further develop the functionality. Opaque structs tend to make for a better encapsulation of private vs public concerns.

The open structs do offer some features that opaque structs would not permit. The caller of these routines may place their Tcl_HashTable in static storage and eliminate whatever cost is paid in dynamic allocation of the storage, as well as the housekeeping burden of arranging for the release of that storage. Also, many "functions" in the interface may be implemented as macros that directly access the known fields of the open struct, and that might have some performance benefit.

It is possible to still have evolution with open struct definitions, but it requires significant effort to maintain all the necessary compatibilities. The Tcl_HashKeyType struct possesses the first enabler of such a scheme -- a version field, but there's been either no need or no ability to revise that struct. In contrast, the Tcl_HashEntry struct has effectively had its bucketPtr field replaced by a hash field but without a version field in place. So long as callers do not directly touch either of these fields, but leave them solely for the Tcl routines to operate on, then the Tcl build-time configurability to use either implementation by setting of TCL_HASH_KEY_STORE_HASH to 1 or not continues to be an option. But that just underlines the point that these fields are functionally meant to be private and accessed only by Tcl. Better to have form and function in agreement. There's evidence that at some point in the history the Tcl_HashTable struct has also been extended (to grow the typePtr field). That change also took place without benefit of a version field, so some difficult to explain and maintain code is in Tcl_InitCustomHashTable.

There is no thread safety built into the Tcl_HashTable system. Each Tcl_HashTable either has to be kept in thread isolation, or has to have operations on it contained within mutex protection by the callers.

The customizable hash function is defined to return an unsigned int, so only 2^32 distinct hashes can be used to sift data. According to the internal logic of the hash table implementation, this means the existing architecture is suitable for holding at most 3 * 2^32 entries. If we aim to store more than that, the Tcl_HashKeyType will need to evolve.

In practice, that's not a problem because before we reach those sizes we run into a fatal bug in RebuildTable, where the combination of the limited range of ckalloc() and a failure to manage overflow of the numBuckets value brings the program crashing down.

The bucket array only grows, never shrinks.

There has been some interest in addressing the potential vulnerability of hashing collision attacks. The difficulty is that the existing hashing regime has a record of performing very well, and having a large impact on overall Tcl performance. Solutions that truly reduce the hazards of collision attacks tend to be costly in performance. It has been difficult to generate evidence to support the claims of any alternatives to simultaneously perform well and offer protection.

Other hashing criticisms have been raised occasionally that on some input sets, the existing hashes do not randomize into the full 32 bits of hash value as well as they should for best operations.

There are runtime configuration options for these routines that do away with the dependence on Allocation as a runtime concern. This module can be self-contained in that sense.