TIP: 72
Title: 64-Bit Value Support for Tcl on 32-Bit Platforms
Version: $Revision: 1.10 $
Author: Donal K. Fellows <[email protected]>
State: Final
Type: Project
Vote: Done
Created: 05-Nov-2001
Post-History:
Tcl-Version: 8.4
~ Abstract
This TIP adds the capability to perform computations on values that
are (at least) 64 bits wide even on 32-bit platforms. It also adds
support for handling files that are larger than 2GB on those
platforms (where supported by the underlying platform and filing
system).
~ Rationale
There have been a number of requests, from a whole range of
application areas, for Tcl to be enhanced to handle 64-bit values even
on platforms where that is larger than the native word size. The
vast majority of C compilers support a large enough arithmetic type
(often called ''long long'', though other names are common on the
Windows platform). Such areas include:
* ''large-file support'' for people working with lots of data. Note
that at the moment Tcl cannot even report the file type for a file
that is larger than 2GB in size.
* ''large value support'' for people working with network addresses
(this is likely to come up more in the future with IPv6.)
However, a number of existing algorithms assume that integer
arithmetic operations wrap at 32 bits (demonstrating the need for
''semantic backward-compatibility'', so termed because a recompile of
the C portions of the relevant code will not fix the problem), and
there are many existing extensions that assume a particular word size
too (requiring ''syntactic backward-compatibility'', so termed because
recompilation will probably cure the problem). Hence any upgrade of
Tcl's functionality must be done carefully so as to preserve as much
backward compatibility as possible.
~ Proposed Changes
To resolve this problem, I will introduce:
1. A new pair of types at the C level to represent signed and
unsigned values with a width of at least 64 bits. These types
will be called ''Tcl_WideInt'' and ''Tcl_WideUInt'' respectively.
On 64-bit platforms (and 32-bit platforms where there is no
compiler support for 64-bit arithmetic types) these will be
typedef'ed to ''long'' to preserve as much inter-platform
compatibility as possible.
> The type names are based on the term ''Wide'' as opposed to either
''Long'' or ''LongLong'' because the first clashes with
existing Tcl APIs (''Tcl_GetLongFromObj'' for example) and the
second is both longer and less mnemonic. Not all Tcl
platforms are built with compilers that understand ''long long''
in the first place, and the major factor in its favour at the C
level was almost certainly the fact that it did not introduce any
new reserved words into the C syntax which would have had a major
backward-compatibility impact - we are not bound by such things
and can choose to suit ourselves.
2. A new field of type ''Tcl_WideInt'' in the internalRep union of
the ''Tcl_Obj'' type. Note that this is 100% backward compatible
since the union already contains a field that is a pair of
pointers (each of which I assume to be at least 32 bits wide.)
3. A new object type for 64-bit wide values, together with accessor
functions called ''Tcl_NewWideIntObj'', ''Tcl_SetWideIntObj'' and
''Tcl_GetWideIntFromObj'' to create, modify and retrieve values
from objects of that type (on platforms where ''Tcl_WideInt'' is
not distinct from ''long'', these will all be redirected to the
previously existing integer type.)
4. The [[expr]] command shall be reworked so that:
> * If a constant looks like a signed integer (i.e. it lies between
INT_MIN and INT_MAX inclusive) it is treated as such. Otherwise
if it looks like an integer of any size, an attempt will be made
to treat it like a wide integer, and if that fails or it doesn't
look like an integer at all, it will be treated as a double.
''Note'' that this is a potential source of backward
incompatibility for scripts that include values that are meant
to be unsigned integers.
> * With arithmetic operations, the output will be a double if at
least one of the operands is a double, a wide integer if at
least one of the operands is a wide integer, and a normal
integer otherwise. (The main exception to this will be the left
and right shift operations where the type of the second operand
will not affect the type of the result.)
> * The ''int()'' pseudo-function will always return a non-wide
integer (converting by dropping the high bits) and the new
pseudo-function ''wide()'' will always return a wide integer
(converting by sign-extending.) On platforms without a distinct
64-bit type, these operations will behave identically.
> * User-defined functions will be able to gain access to the wide
integer through an extra ''wideValue'' field in the
''Tcl_Value'' structure and TCL_WIDE_INT (which will be the same
as TCL_INT on platforms without a distinct 64-bit type) value in
the ''Tcl_ValueType'' enumeration.
5. The [[incr]] command will be able to increment variables
containing 64-bit values correctly, but will only accept 32-bit
values as amounts to increment by.
6. ''Tcl_Seek'' and ''Tcl_Tell'' (together with all channel drivers)
will be updated to use the new 64-bit type for offsets (which will
be reflected at the Tcl level in the [[seek]] and [[tell]] commands).
A compatibility interface will be maintained for old extensions
that do not supply a channel driver, though the size of offset
reportable through that interface will naturally be limited.
7. ''Tcl_FSStat'' and ''Tcl_FSLstat'' will both be
updated to use a stat structure reference that can contain 64-bit
wide values. This will enable various [[file]] subcommands (and
[[glob]] with some options) to work correctly with files over 2GB
in size. Note that there is no neat way to make this change
backward compatible as there is currently no guarantee on which fields
will actually be present in the structure, but those functions
have never been available outside an alpha...
> Because the name of a suitable structure varies considerably
between platforms, a new type, ''Tcl_StatBuf'', will be declared
as the type of the structure to which a pointer should be
passed to the stat-related functions. A new function,
''Tcl_AllocStatBuf'', will be provided to allow extensions to
allocate a buffer of the correct size whatever the platform.
> Note that ''Tcl_Stat'' will be written to contain
backward-compatibility code so that code that references it will
work unchanged.
8. In the ''format'' and ''scan'' commands, the ''l'' modifier to
the integer-handling conversion specifiers (d, u, i, o and x) will
become significant: it will tell them to work with 64-bit values
(if those are not the default for the platform anyway.)
9. The ''binary'' command will gain new ''w'' and ''W'' specifiers
for its ''format'' and ''scan'' subcommands. These will operate
on 64-bit wide values in a fashion analogous to the existing ''i''
and ''I'' specifiers (i.e. smallest byte to largest, and largest
byte to smallest respectively.)
10. New compatibility functions will also be provided, because not
all platforms have convenient equivalent functions to ''strtoll''
and ''strtoull''.
11. ''Tcl_LinkVar'' will be extended with the ability to link
with a wide C variable (via a TCL_LINK_WIDE_INT flag).
12. The ''tcl_platform'' array will gain a new member, ''wordSize'',
which will give the native size of machine words on the host
platform (actually whatever ''sizeof(long)'' returns.)
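
The two byte orders described for the ''w'' and ''W'' specifiers in item 9 can be sketched in plain C. This is only an illustration and not the code used by the ''binary'' command: the helper names ''put_wide_le'', ''put_wide_be'' and ''get_wide_le'' are hypothetical, and ''uint64_t'' stands in for ''Tcl_WideUInt''.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a 64-bit value smallest byte first,
 * the order item 9 describes for the 'w' specifier. */
static void put_wide_le(uint64_t value, unsigned char out[8])
{
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++) {
        out[i] = (unsigned char)(value >> (8 * i));
    }
}

/* Hypothetical helper: store a 64-bit value largest byte first,
 * the order item 9 describes for the 'W' specifier. */
static void put_wide_be(uint64_t value, unsigned char out[8])
{
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++) {
        out[i] = (unsigned char)(value >> (8 * (7 - i)));
    }
}

/* Hypothetical helper: rebuild the value from smallest-byte-first
 * storage, the scanning direction that corresponds to 'w'. */
static uint64_t get_wide_le(const unsigned char in[8])
{
    uint64_t value = 0;
    assert(in != NULL);
    for (size_t i = 0; i < 8; i++) {
        value |= (uint64_t)in[i] << (8 * i);
    }
    return value;
}
```

With the value 0x0102030405060708, ''put_wide_le'' stores the byte 0x08 first and ''put_wide_be'' stores the byte 0x01 first, which mirrors the relationship between the existing ''i'' and ''I'' specifiers at 32 bits.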
~ Summary of Incompatibilities and Fixes
The behaviour of expressions containing constants that appear positive
but which have a negative internal representation will change, as
these will now usually be interpreted as wide integers. This is
always fixable by replacing the constant with ''int(''constant'')''.
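
The change in how such constants are read, and the effect of wrapping them in ''int()'', can be modelled in plain C. The following is a rough sketch under stated assumptions rather than the parser used by [[expr]]: the names ''classify_constant'', ''demo_int'' and the ''CONST_*'' tags are hypothetical, ''int64_t'' stands in for ''Tcl_WideInt'', and ''int'' is assumed to be 32 bits wide.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tags for the three ways a constant can be treated. */
typedef enum { CONST_INT, CONST_WIDE, CONST_DOUBLE } ConstKind;

typedef struct {
    ConstKind kind;
    int64_t wide;   /* meaningful for CONST_INT and CONST_WIDE */
    double dbl;     /* meaningful for CONST_DOUBLE */
} ConstValue;

/* Sketch of the rule in item 4: a normal integer if the value lies
 * between INT_MIN and INT_MAX, else a wide integer if it parses as
 * an integer at all, else a double. */
static ConstValue classify_constant(const char *text)
{
    ConstValue result = { CONST_DOUBLE, 0, 0.0 };
    char *end = NULL;
    long long parsed;

    assert(text != NULL);
    errno = 0;
    parsed = strtoll(text, &end, 0);
    if (end != text && *end == '\0' && errno != ERANGE) {
        result.kind = (parsed >= INT_MIN && parsed <= INT_MAX)
                ? CONST_INT : CONST_WIDE;
        result.wide = (int64_t)parsed;
        return result;
    }
    /* Not an integer, or too large even for 64 bits. */
    result.dbl = strtod(text, NULL);
    return result;
}

/* Sketch of int(): keep only the low 32 bits of a wide value. */
static int32_t demo_int(int64_t wide)
{
    uint32_t low = (uint32_t)wide;  /* reduction modulo 2^32 */
    int32_t narrow;
    memcpy(&narrow, &low, sizeof narrow);
    return narrow;
}
```

Under this model the constant 0xffffffff is classified as a wide integer with value 4294967295, and passing that value through ''demo_int'' gives -1, the value a 32-bit reading of the same constant would have produced.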
Extensions creating new channel types will need
to be altered as different types are now in use in that area. The
change to the declarations of ''Tcl_FSStat'' and ''Tcl_FSLstat'' (which
are the new preferred API in any case) is less serious as no
non-alpha releases have been made yet with those API functions.
Scripts that are lax about the use of the ''l'' modifier in ''format''
and ''scan'' will probably need to be rewritten. This should be very
uncommon though as previously it had absolutely no effect.
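
To show why a stray ''l'' now matters, here is a small C model of a decimal conversion with and without the modifier. It is a sketch only: ''demo_format_decimal'' is a hypothetical name, and the assumption that a conversion without ''l'' narrows its argument to 32 bits is an illustration that this TIP does not spell out.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of a "%d" versus "%ld" conversion: with the
 * modifier the full 64-bit value is printed; without it the value
 * is ASSUMED to be narrowed to its low 32 bits first. */
static int demo_format_decimal(char *buf, size_t size,
                               int64_t value, int has_l_modifier)
{
    uint32_t low;
    int32_t narrow;

    assert(buf != NULL && size > 0);
    if (has_l_modifier) {
        return snprintf(buf, size, "%" PRId64, value);
    }
    low = (uint32_t)value;                 /* reduction modulo 2^32 */
    memcpy(&narrow, &low, sizeof narrow);  /* reinterpret as signed */
    return snprintf(buf, size, "%" PRId32, narrow);
}
```

In this model the value 4294967296 prints as 4294967296 with the modifier and as 0 without it, while values that already fit in 32 bits print the same either way.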
Extensions that create new math functions that take more than one
argument will need to be recompiled (the size of ''Tcl_Value''
changes), and functions that accept arguments of any type
(''TCL_EITHER'') will need to be rewritten to handle wide integer
values. (I do not expect this to affect many extensions at all.)
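
The extra case that an any-type math function needs can be sketched without the Tcl headers. The types below are stand-ins loosely modelled on ''Tcl_Value'' and ''Tcl_ValueType'' as described in this TIP; ''DemoValue'', ''DemoValueType'' and ''demo_sign'' are hypothetical names, and the function does not use the real math-function callback signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Tcl_ValueType, with the new wide integer tag. */
typedef enum { DEMO_INT, DEMO_DOUBLE, DEMO_WIDE_INT } DemoValueType;

/* Stand-in for Tcl_Value; wideValue models the field this TIP adds. */
typedef struct {
    DemoValueType type;
    long intValue;
    double doubleValue;
    int64_t wideValue;
} DemoValue;

/* Hypothetical math function returning the sign of its argument,
 * whatever its numeric type. Returns 0 on success and -1 if the
 * argument carries a type tag it does not know about. */
static int demo_sign(const DemoValue *arg, DemoValue *result)
{
    assert(arg != NULL && result != NULL);
    result->type = DEMO_INT;
    switch (arg->type) {
    case DEMO_INT:
        result->intValue = (arg->intValue > 0) - (arg->intValue < 0);
        return 0;
    case DEMO_DOUBLE:
        result->intValue =
                (arg->doubleValue > 0.0) - (arg->doubleValue < 0.0);
        return 0;
    case DEMO_WIDE_INT:
        /* The case an existing any-type function would be missing. */
        result->intValue = (arg->wideValue > 0) - (arg->wideValue < 0);
        return 0;
    }
    return -1;
}
```

A function written before this change would have only the first two cases, so a wide argument such as -5000000000 would fall through to the error return instead of yielding -1.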
~ Why Tcl_WideInt?
I chose the name ''Tcl_WideInt'' for the type because it represents a
wider-than-normal integer. Alternatives that were considered and
rejected were:
Tcl_LongLong: This takes its name from the name of the underlying C
type used in many UNIX compilers, but that in turn was chosen
because it meant that no new keywords would be added to the
language, and not out of any feeling that the type name itself is
of any wider merit. Seeing as Tcl is a keyword-less language, there
is no particular reason for going down this route (which would
lead to things like a ''longlong()'' type conversion function added
to the [[expr]] command, which is really very ugly indeed...) It
is also not universally the name of the underlying type; the Windows
world is different (as usual.)
Tcl_Int64: This name, by contrast, comes more from the Windows world.
Its major problem is that it specifies eternally what the size of
the type is, whereas at some point in the future (when 64-bit words
are the norm) we may want to support something wider still (though
I do not yet know what uses we would put 128-bit integers to.) I
believe that the name of a type is part of its specification, but
that the size of the type is less so. ''Tcl_Int64'' is also ugly
when it comes to derivations of the name for things like the type
converter in [[expr]] (again) and the names of variables containing
values of the type (internally, as formal parameters, and as fields
of structures) and may well clash on systems where the C compiler
gives real meaning to ''int64'' by default. By contrast,
''Tcl_WideInt'' lends itself well to generating variable names
(''wideValue'', ''widePtr'', etc., and even just plain ''w'' in the
implementation of the bytecode execution engine) which, as the
person implementing the changes, was a major consideration.
~ Copyright
This document has been placed in the public domain.