TIP 534: Faster Hashing of Small Integers

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Tcl 2019 Conference, Houston/TX, US, Nov 4-8
Send your abstracts to [email protected]
or submit via the online form by Sep 9.
Author:         Donal K. Fellows <[email protected]>
State:          Draft
Type:           Project
Vote:           Pending
Created:        02-March-2019
Post-History:   
Keywords:       Tcl
Tcl-Version:	8.7
Tcl-Branch:     dkf-experimental-fast-number-hash

Abstract

This TIP proposes to change the Tcl hashing algorithm slightly to allow integers to be used as hash keys without computing their string representation.

Rationale

Tcl uses hashes of values in quite a few places. The ones that this TIP is concerned with are where we have associative arrays and dictionaries (where keys can be arbitrary Tcl_Obj* values). Sometimes in user code, we want to use “small” integers as keys (where “small” means that the value fits in a native 64-bit integer); this is highly useful for some algorithms where we have things indexed by number but don't have a dense collection of indices.

Right now, that means we have to calculate the string representation of that integer before we can hash it, a non-trivial cost (in both time and space) when many of these keys are in use. However, it is possible to create a hash function that can be computed from an integer in much less cost, so accelerating all code which uses normal integer keys in associative arrays and dictionaries.

Specification

Because it is critical that the hashcode of a value be the same whether it is computed from the string version of the key or from the integer directly, we need to alter the computing of all hashcodes from Tcl_Obj * values. Fortunately, the adjustment is actually very trivial when applied to strings: we simply use the bytes of the string as we do now, but in reverse order. By doing this, we can avoid calculating the string representation for an integer and replace it instead with a series of calculations mod 10 that yield the digits of the number, which are trivial to map to the character representations they would have had had the string been created. This turns out to be an overall win in normal use.

Obviously, there's also a need to be a bit careful with the equality comparison used when checking the entries in the hash chain, but that is not semantics-altering. Indeed, the only way in which this alters the visible semantics of Tcl is in changing the iteration order of associative arrays, and other users of Tcl_InitObjHashTable() (of which only encoding names and the method introspection capabilities of TclOO should be user exposed at all; neither of those have ever made any order guarantees). Dictionary ordering is entirely unaffected; that uses a separate mechanism.

Mixing integer keys and other value-type keys (or integers with string representations) in the same hash table may result in the benefits being lessened, but only at most back to our current behaviour and cost of operation.

Note that string representations, if present, must take precedence because we currently require, for example, 255 and 0xFF to be different keys. String representations maintain that behaviour; the second value cannot be an integer-without-string-representation in Tcl (because then it would be 255, and also equivalent to the syntactically different 0xff, 0o377, ...).

Implementation

See the dkf-experimental-fast-number-hash branch.

The additional cost to non-integer keys is exactly one extra comparison (of a pointer) against a constant. The additional cost to integer keys with a string representation is two extra comparisons (once to determine that they are integers and once to determine that they have a string representation). These are comparatively cheap tests.

Copyright

This document has been placed in the public domain.