Tcl Improvement Proposals: 69.md at [151b2972a9]

File tip/69.md artifact 427547b11e part of check-in 151b2972a9

# TIP 69: Improvements for the Tcl Hash Table
	Author:         George A. Howlett <[email protected]>
	Author:         Don Porter <[email protected]>
	Author:         Donal K. Fellows <[email protected]>
	State:          Draft
	Type:           Project
	Vote:           Pending
	Created:        16-Oct-2001
	Post-History:   
	Discussions-To: news:comp.lang.tcl
	Tcl-Version:    9.1
-----

# Abstract

This document describes various improvements to the existing Tcl hash
table.  They include support for 64 bit platforms, better memory
performance, and improved array hashing.  The goal is a hash table
that improves Tcl/Tk, but also can be used in industrial strength
applications.

# Introduction

A strength of Tcl that has not diminished in the advance of other
scripting languages \(Perl, Python, Ruby, etc.\) is the easy way its
command language can be extended with C/C\+\+ code.  For example, the
prominence of Tcl in Electronic Design Automation \(EDA\) tools is
striking.  It's hard to find EDA tools that do not use Tcl to some
degree.  At the same time, there is a current trend toward 64-bit
computing platforms.  The impetus has been from industry \(like EDA\)
rather than office or home users, wanting to solve bigger problems,
faster.  If Tcl applications are to operate on 64-bit platforms, a big
first step towards this goal will be a 64-bit version of the Tcl hash
table.

The current Tcl hash table performs well on 32-bit platforms.  It has
been tuned and wrung out handling internal Tcl/Tk code.  But its one
word hash function

	 #define RANDOM_INDEX(tablePtr, i) \
	    (((((long) (i))*1103515245) >> (tablePtr)->downShift) & (tablePtr)->mask)

can not hash the longer 64-bit addresses properly.  

Example:

	    Tcl_HashTable t;
	    unsigned long i;
	    char *base, *addr;	
	    int isNew;
	    char *mesg;
	
	    Tcl_InitHashTable(&t, TCL_ONE_WORD_KEYS);
	    base = 0xFFFFFF000000000UL;
	    for (i = 0; i < 100; i++) {
	        addr = base + i * 0x100000000;
	        hPtr = Tcl_CreateHashEntry(&t, addr, &isNew);
	    }
	    mesg = Tcl_HashStats(&t);
	    fprintf(stderr, "Stats\n%s\n", mesg);
	    free((char *)mesg;

Note that the keys all have zeros in the lower 32 bits.  All 100
entries will hash to the same value.  Driving the need for 64-bit
systems is the ability to address more memory.  So it's imperative
that Tcl hash table be able to hash large virtual addresses.

Building upon the current hash table implementation, the following
sections describe specific areas for improvement:

	* improved array/structure hashing

	* better memory performance 

	* support for 64-bit platforms

The goal is an improved Tcl hash table for internal Tcl and Tk code,
but also high performance applications.

# Improved Array/Structure Hashing

The Tcl hash table handles three types of hash keys: string keys, one
word keys, and multi-word keys.  Each key type has its own hash
function associated with it.  The benefit of this approach is that
better hash functions can be used for the specific types, than one
general function for all types.  The string and one word hash
functions are very good for typical keys.  The multi-word or array
hash is not as good.

The array hash sums the each word of the array and then randomizes
the result.  

	    for (index = 0, count = tablePtr->keyType, iPtr1 = arrayPtr;
		    count > 0; count--, iPtr1++) {
		index += *iPtr1;
	    }
	    index = RANDOM_INDEX(tablePtr, index);

This works poorly for many types of hash keys.  For an contrived
example of hashing 1 million 3D coordinates,

	    typedef struct {
	       double x, y, z;
	    }  Double3;
	    double d3;
	
	    Tcl_InitHashTable(&t, sizeof(Double3) / sizeof(int));
		
	    for (i = 0; i < 100; i++) {
		for (j = 0; j < 100; j++) {
		    for (k = 0; k < 100; k++) {
			d3.x = (double)i;
			d3.y = (double)j;
			d3.z = (double)k;
			hPtr = Tcl_CreateHashEntry(&t, (char *)&d3, &isNew);
		    }
		}
	    }

we get a hash table with an average search distance of 1082.3.  The
maximum distance is 3324!  Replacing the hash function with Bob
Jenkins' <http://burtleburtle.net/bob>  32-bit mixing function

	   #define MIX32(a,b,c) \
		a -= b, a -= c, a ^= (c >> 13), \
		b -= c, b -= a, b ^= (a <<  8), \
		c -= a, c -= b, c ^= (b >> 13), \
		a -= b, a -= c, a ^= (c >> 12), \
		b -= c, b -= a, b ^= (a << 16), \
		c -= a, c -= b, c ^= (b >>  5), \
		a -= b, a -= c, a ^= (c >>  3), \
		b -= c, b -= a, b ^= (a << 10), \
		c -= a, c -= b, c ^= (b >> 15)
	
	   int a, b, c, len;
	
	   len = length;
	   a = b = GOLDEN_RATIO32;	/* An arbitrary value */
	   c = 0;			/* Previous hash value */
	
	   while (len >= 3) {		/* Handle most of the key */
		a += key[0];
		b += key[1];
		c += key[2];
		MIX32(a, b, c);
		key += 3; len -= 3;
	    }
	    c += length;		
	    switch(len) {
	    case 2 : b += key[1];
	    case 1 : a += key[0];
	    }
	    MIX32(a, b, c);
	    return c;

yields a table with an average search distance of 1.48.  The maximum
distance is 8.  The Jenkins' hash function provides good results for
many different types of arrays and structures.  The disadvantage is
that the hash function is slightly more expensive to compute.  

# Improving RebuildTable.

The cost of computing a hash function is especially felt each time
table is rebuilt as new entries are added.  The _RebuildTable_ function
calls the hash function of each entry to recompute its new location in
the bigger table.

	    for (oldChainPtr = oldBuckets; oldSize > 0; oldSize--, oldChainPtr++) {
		for (hPtr = *oldChainPtr; hPtr != NULL; hPtr = *oldChainPtr) {
		    *oldChainPtr = hPtr->nextPtr;
		    if (tablePtr->keyType == TCL_STRING_KEYS) {
			index = HashString(hPtr->key.string) & tablePtr->mask;
		    } else if (tablePtr->keyType == TCL_ONE_WORD_KEYS) {
			index = RANDOM_INDEX(tablePtr, hPtr->key.oneWordValue);
		    } else {
			index = HashArray(hPtr->key.words, tablePtr->keyType) &
				tablePtr->mask;
		    }
		    hPtr->bucketPtr = &(tablePtr->buckets[index]);
		    hPtr->nextPtr = *hPtr->bucketPtr;
		    *hPtr->bucketPtr = hPtr;
		}
	    }

The new bucket location is then stored in the hash entry.

Except for one word keys, the hash value is invariant of the table
size.  If the hash value was stored with each entry, then it would not
need to be recomputed each time the table is rebuilt.

	    for (oldChainPtr = oldBuckets; oldSize > 0; oldSize--, oldChainPtr++) {
		for (hPtr = *oldChainPtr; hPtr != NULL; hPtr = *oldChainPtr) {
		    *oldChainPtr = hPtr->nextPtr;
		    if (tablePtr->keyType == TCL_ONE_WORD_KEYS) {
			index = RANDOM_INDEX(tablePtr, hPtr->key.oneWordValue);
		    } else {
			index = hPtr->hval & tablePtr->mask;
		    }
		    hPtr->bucketPtr = &(tablePtr->buckets[index]);
		    hPtr->nextPtr = *hPtr->bucketPtr;
		    *hPtr->bucketPtr = hPtr;
		}
	    }

This would increase size of an hash entry, except that the pointer to
the hash bucket is now redundant, since it can cheaply be computed.

	    bucketPtr = tablePtr->buckets + (hPtr->hval & tablePtr->mask);

An added benefit is that hash table lookups become faster and easier
to perform.  If there is more than one hash entry in a bucket, you
don't need to examine the key unless the entry has the same hash
value.

	    for (hPtr = tablePtr->buckets[hindex]; hPtr != NULL;
		    hPtr = hPtr->nextPtr) {
	       /* Don't look at entry unless the hash value is the same. */
		if (hPtr->hval == hval) { 
		    register int *iPtr1, *iPtr2;
		    int count;
	
		    for (iPtr1 = arrayPtr, iPtr2 = (int *)hPtr->key.words,
			     count = tablePtr->keyType; ; count--, iPtr1++, iPtr2++) {
			if (count == 0) {
			    return hPtr;
			}
			if (*iPtr1 != *iPtr2) {
			    break;
			}
		    }
		}
	    }

_Don Porter <[email protected]>_

 > _It appears that the recommendations of this section have already
   been implemented in Tcl 8.4.  In particular, when the symbol
   TCL\_HASH\_KEY\_STORE\_HASH == 1 \(as it does by default\), then the
   hash value is stored in each entry instead of the bucketPtr._

 > _If that is correct, then I recommend this section of the TIP be
   removed.  If not, more detail about how this proposal differs
   from the 8.4 implementation, and an argument why the proposal
   is superior are in order._

# Better Memory Performance 

One enduring complaint of the Tcl hash table on comp.lang.tcl is its
unexpected memory costs.  A table of 1 million one word key entries
uses over 36 Megabytes.

A hash entry is 20 bytes long.

	 typedef struct Tcl_HashEntry {
	    struct Tcl_HashEntry *nextPtr;	/* Pointer to next entry in this
						 * hash bucket, or NULL for end of
						 * chain. */
	    struct Tcl_HashTable *tablePtr;	/* Pointer to table containing entry. */
	    struct Tcl_HashEntry **bucketPtr;	/* Pointer to bucket that points to
						 * first entry in this entry's chain:
						 * used for deleting the entry. */
	    ClientData clientData;		/* Application stores something here
						 * with Tcl_SetHashValue. */
	    union {				/* Key has one of these forms: */
		char *oneWordValue;		/* One-word value for key. */
		int words[1];			/* Multiple integer words for key.
						 * The actual size will be as large
						 * as necessary for this table's
						 * keys. */
		char string[4];			/* String for key.  The actual size
						 * will be as large as needed to hold
						 * the key. */
	    } key;				/* MUST BE LAST FIELD IN RECORD!! */
	 } Tcl_HashEntry;

Each entry stores a pointer to its hash table.  This field is used
only for deleting a hash entry.  But if the hash table is passed to
_Tcl\_DeleteHashEntry_, the hash entry can be reduced to 16 bytes.
Inspecting Tcl/Tk code, I could not find a case where the hash table
was not easily available to pass as a parameter.

Each hash entry is allocated using _malloc_.  System memory
allocators typically add 8-16 bytes overhead for each allocation.
Worse, calls to _malloc_ and _free_ tend to dominate the cost of
large hash tables.  _Tcl\_DeleteHashTable_ becomes very slow, freeing
hash entries scattered across pages of virtual memory.

For large hash tables, a pool allocation scheme can improve both
reduce the amount of memory used and improve memory performance.  By
allocating memory in larger chunks, the number of _malloc_ and
_free_ calls is dramatically reduced.  Fixed size allocators \(one
word keys and array keys\) can also reclaim and reuse memory from
deleted entries.

The disadvantage of pool allocation is that memory is not released
until the hash table is deleted.  This is less of an issue for large
tables which tend to grow to a steady-state size.  Both Tcl and Tk use
hash tables to keep track of small amounts of information that
probably don't pool allocation.

So to retain compatibility, a new specialized initialization routine
can be used to indicate when to use pool-based allocation.

	    Tcl_InitHashTableWithPool(&table, TCL_ONE_WORD_KEYS);

The standard _Tcl\_InitHashTable_ call

	    Tcl_InitHashTable(&table, TCL_ONE_WORD_KEYS);

will still use _malloc_ and _free_.

# Support for 64-bit Platforms.

While the C language makes no guarantees of a type's size or its
relation to other types, current programming practice assumes that
integers, longs, and pointers are all 32 bits long.  This, of course,
changes with 64-bit systems where pointers are 64-bits wide.
Depending upon the programming model, longs and ints may or may not be
64 bits too.

		 Datatype      LP64    ILP64   LLP64   ILP32   LP32
		  char           8       8       8       8       8
		  short         16      16      16      16      16
		  _int32                        32
		 int            32      64      32      32      16
		 long           64      64      32      32      32
		 long long                      64
		 pointer        64      64      64      32      32

ILP32 is typical for 32 bit systems.  Windows 3.1 was a LP32 model.

In the LP64 model, pointers and longs are 64 bits, but ints remain 32
bits wide.  The LLP model retains the 32-bits for ints and longs, but
adds a 64-bit "long long" type.  Most 64-bit Unix systems \(Solaris,
HP-UX, Tru64, AIX\) are LP64.  I believe that Win64 is LLP.

The first problem is that addresses are now 64-bits, not 32.  This
means that existing code such as

	    Tcl_InitHashTable(&table, TCL_ONE_WORD_KEYS);
	
	    ptr = CreateSomeObject();
	    hPtr = Tcl_CreateHashEntry(&table, (void *)ptr, &isNew);

can possibly fail because the 32-bit one word hash function can't
properly hash the 64-bit pointer address.

_Don Porter <[email protected]>_

 > _Pardon the interruption, but I do not understand what is meant
   by the assertion that hashing of 64-bit pointers "can possibly
   fail".  I've used Tcl on a 64-bit Alpha system for years, hashing
   64-bit pointers the whole time.  What failures should I be seeing?_

	 #define RANDOM_INDEX(tablePtr, i) \
	    (((((long) (i))*1103515245) >> (tablePtr)->downShift) & (tablePtr)->mask)

The above one word hash function can be replaced with a 64-bit version
of Donald Knuth's multiplicative hash function.

	    ((key * GOLDEN_RATIO64) >> downShift) & tablePtr->mask)

where downShift is 64 - log2\(tableSize\) and the GOLDEN\_RATION64 is a
prime approximately equal to \(sqrt\(64\) - 1\) / 2.  

The 64-bit array function is again from Bob Jenkins.  This time it's a
64-bit mixing function.

	 #define MIX64(a,b,c) \
	 	a -= b, a -= c, a ^= (c >> 43), \
	 	b -= c, b -= a, b ^= (a <<  9), \
	 	c -= a, c -= b, c ^= (b >>  8), \
	 	a -= b, a -= c, a ^= (c >> 38), \
	 	b -= c, b -= a, b ^= (a << 23), \
	 	c -= a, c -= b, c ^= (b >>  5), \
	 	a -= b, a -= c, a ^= (c >> 35), \
	 	b -= c, b -= a, b ^= (a << 49), \
	 	c -= a, c -= b, c ^= (b >> 11), \
	 	a -= b, a -= c, a ^= (c >> 12), \
	 	b -= c, b -= a, b ^= (a << 18), \
	 	c -= a, c -= b, c ^= (b >> 22)
	 
	     uint64_t a, b, c, len;
	 
	     len = length;
	     a = b = GOLDEN_RATIO64;	/* An arbitrary value */
	     c = 0;			/* Previous hash value */
	 
	     while (len >= 3) {	/* Handle most of the key */
	 	a += key[0];
	 	b += key[1];
	 	c += key[2];
	 	MIX64(a,b,c);
	 	key += 3; len -= 3;
	     }
	     c += length;		
	     switch(len) {
	     case 2 : b += key[1];
	     case 1 : a += key[0];
	     }
	     MIX64(a,b,c);
	     return c;

Note that it also takes advantage of the 64-bit word size.

# Summary

The following improvements to the current Tcl hash table have been
suggested.  

 * Replace the current array hash function.

 * Replace the bucket pointer in the hash entry with its hash value.
   This allows the table to be rebuilt without rehashing each entry.
   It also speeds bucket searches.

 * Remove the tablePtr from the hash entry, a 20% savings.  This
   requires that callers of _Tcl\_DeleteHashEntry_ pass the hash
   table as a parameter.

 * Allow the hash table to use fixed or variable size pool allocation
   since _malloc_ and _free_ costs dominate large tables.  Pool
   allocation substantial speeds large tables while also saving 8-16
   bytes per entry.  This can be done while still providing the normal
   _malloc_/_free_ versions.

 * Support 64-bit platforms.  This requires 64-bit versions of one
   word and array hash functions.

The suggested changes are nothing new and can be found in most hash
table implementations.  This work builds on the already solid
foundation of the current hash table.  With the above improvements,
the Tcl hash table can be used in high performance applications.  It
also adds a useful piece to the 64-bit Tcl/Tk port.

I've created and tested a new hash table implementation under the
following systems.

		System			32     64
		linux-ix86-gcc		x	
		Solaris-v9-cc		x	x
		Solaris-v9-gcc		x	x
		HPUX-11-cc		x	x
	 	HPUX-11-gcc		x
		Win2k			x

It will be made publicly available on SourceForge.

# Hashing of Malicious Strings

_Donal K. Fellows adds:_

In 2003 a possible denial-of-service attack on hash tables was published
that operated by making a majority of keys map to the same bucket.  While
this would not make the hashes function incorrectly - there would be no
extra memory consumed or incorrect accesses to memory - it still permits
an attacker to escalate the cost of hash accesses from O\(1\) to O\(n\) in
the normal case \(and with obvious knock-on effects for the order of other
algorithms\) and so mount an attack out-of-scale with the effort required
to set the attack up.

The way to fix this is to use a different hashing function for string
hashing that varies the exact hashing algorithm on a table-by-table
basis, and to base that algorithm on a hashing function with better
spectral properties than Tcl's current \(extremely simple\) one.  An
algorithm that might be suitable for such uses is described online
<http://burtleburtle.net/bob/hash/evahash.html>  though the code would
need substantial adaption \(including the addition of a fairly strong
random number generator\) before being placed in the core.

# Copyright

This document has been placed in the public domain.