File tip/249.tip — part of check-in [850a09f04e] at 2007-01-16 17:46:10 on branch trunk — Revision through web from David S. Cargo <[email protected]> Two minor changes to change "is" to "shall be" (which also correct a subject-verb disagreement) and "non-empty" to "nonempty." (user: tclhttpd size: 11281)
TIP:            249
Title:          Unification of Tcl's Parsing of Numbers
Version:        $Revision: 1.8 $
Author:         Kevin B. Kenny <[email protected]>
Author:         David S. Cargo <[email protected]>
Author:         Don Porter <[email protected]>
State:          Draft
Type:           Informative
Vote:           Pending
Created:        13-Jun-2005
Post-History:   

~ Abstract

This TIP proposes to unify the recognition of all of Tcl's "numeric"
objects into a single parser.  The intended effect is to improve
performance by eliminating a number of cases where a cached numeric
representation may be discarded, and to restore (more accurately,
to establish) the "everything is a string" principle in dealing with
numbers.

~ Rationale

Tcl's handling of numbers has always been problematic and ambiguous.
Even in the earliest releases of the '''expr''' command, there were
issues with the unexpected demotion of floating point numbers to
integers, causing subsequent divisions to be interpreted as integer
division with incorrect results.

Another trouble spot has been the interpretation of constants with
leading zeroes.  When these are interpreted as integers, they are
octal numbers.  They can also be interpreted as floating point
constants (at least with ''Tcl_GetDoubleFromObj''), in which case they
are decimal.  Because of this ambiguity, the '''expr''' system cannot
make effective use of the internal representation of a floating point
number; it needs to refer back to the string to make sure that the
number is not an octal integer to which ''Tcl_GetDoubleFromObj'' has
been applied.

Even more confusing is the treatment of numbers that have leading
zeroes but contain the digits 8 or 9.  These are rejected by the
'''expr''' parser as invalid octal but are accepted by
''Tcl_GetDoubleFromObj''.
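
The same ambiguity can be reproduced outside Tcl with the standard C library, whose ''strtol'' (with base 0) and ''strtod'' follow the two conventions described above.  The following is a minimal sketch and not code from the Tcl sources; the helper names ''as_integer'' and ''as_double'' are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse all of s as an integer using C's base-0 rules (leading 0 => octal).
 * Returns 1 and stores the value on success, 0 if characters are left over. */
int as_integer(const char *s, long *out)
{
    char *end;

    assert(s != NULL && out != NULL);
    *out = strtol(s, &end, 0);
    return end != s && *end == '\0';
}

/* Parse all of s as a floating point number.  Leading zeroes carry no
 * special meaning here: the digits are always read in decimal. */
int as_double(const char *s, double *out)
{
    char *end;

    assert(s != NULL && out != NULL);
    *out = strtod(s, &end);
    return end != s && *end == '\0';
}
```

With these helpers, "010" yields 8 as an integer but 10.0 as a double, and "09" is rejected as an integer yet accepted as the double 9.0.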

~ Proposal

This TIP proposes a strict "everything is a string" interpretation
for strings as numeric values.  The set of strings that can be
interpreted as numbers shall be partitioned into disjoint subsets,
with a single "canonical" representation for each.

This change means that a few C calls will behave incompatibly.
In particular, ''Tcl_GetDoubleFromObj'' may leave an integer internal
representation in the object, despite the documentation's assertion
that the object will shimmer.  Similarly, ''Tcl_GetDoubleFromObj''
will no longer interpret octal integers as decimal; this feature
causes only surprise and consternation.

The ''Tcl_ConvertToType'' call will also no longer force conversion
to a specific numeric type.  Since it does not do so, it is not
reasonable for extensions to use it on the numeric types.  For this
reason, the numeric types ''shall not be registered'';
''Tcl_GetObjType'' will fail when presented with one of their names.

When one of the conversion procedures ''Tcl_GetIntFromObj'',
''Tcl_GetWideIntFromObj'',  ''Tcl_GetBignumFromObj'' (assuming the
eventual approval of [237]), or ''Tcl_GetDoubleFromObj'' is called,
it will cast any pre-existing numeric internal representation that
it finds to the appropriate return type (throwing an error if the
number is too large to represent, or a double is used in an integer
context).  If the procedure finds no pre-existing numeric internal
representation, it will extract the string representation, determine
its canonical representation as a number, and store that.

The easiest way to visualize the specific sets of strings that are
recognized as numbers is with a diagram of the state machine that
implements them.

#image:249statemachine State machine that recognizes numbers.

In the diagram, "Start" represents the start state of the machine.
The leading and trailing whitespace that is allowed for all numbers
is not diagrammed, for clarity.

Intermediate states of the machine are represented by small ovals.
Large rectangles represent final states, and are labeled with
the type of number that will result.  Note that any number can
optionally begin with a '+' or '-' character, which will not
be mentioned further.  Each of the accepting states, however,
merits further discussion.

 1. The string "0" shall always represent an integer of the smallest
    type available ('''tclIntType'''). It shall never represent a
    floating point value.

 2. A leading zero followed by a string of octal digits shall be
    interpreted as an octal integer.  The integer shall be stored in
    the smallest of '''tclIntType''', '''tclWideIntType''' and
    '''tclBignumType''' that will hold it.  (Note that storing
    '''tclBignumType''' is possible without accepting [237], provided
    that the ''Tcl_Get*FromObj'' routines recognize it and convert its
    value as needed.)  The interpretation as an octal integer shall
    hold even if the string is presented to ''Tcl_GetDoubleFromObj'',
    which today interprets it as decimal.

 3. A leading zero, followed by the letter 'X' (case insensitive) and a
    string of hexadecimal digits shall be interpreted as a hexadecimal
    integer.  Again, the smallest representation needed is chosen.

 4. A string of decimal digits beginning with a nonzero digit shall be
    interpreted as a decimal integer and stored in the smallest
    suitable internal representation.

 5. A string of digits beginning with a zero but containing the digits
    '''8''' or '''9''' is an error; it appears to be bad octal.  It
    would be possible to allow this case in ''Tcl_GetDoubleFromObj'',
    but it seems unwise, since the consequence would be that '''string
    is double''' would accept "double" strings that will fail in
    '''expr'''.

 6. A string consisting of a nonempty sequence of decimal digits and
    a single period (which may appear anywhere within the string) is a
    valid floating point constant in 'F' format, even if it begins
    with '0'.  It is interpreted in decimal and stored in a
    '''tclDoubleType'''.  If the input number is too small to
    represent, an appropriately signed zero is stored.  If the input
    number is too large to represent, an appropriately signed infinity
    is stored.

 7. Floating point numbers in the usual 'E' format are accepted and
    interpreted in decimal. Once again, they are stored in
    '''tclDoubleType''' and are replaced with zero or infinity if they
    are too small or large.

 8. The constants "Inf" and "Infinity" (optionally with a leading
    sign) are interpreted as infinities.  Infinity is represented as
    '''tclDoubleType'''.

 9. The constant "NaN" is the IEEE "Not a Number" value.  It is
    specifically permitted in the parser so that '''binary format q
    NaN''' and similar calls can produce NaN on an external medium.
    The presence of NaN in expressions, or in
    ''Tcl_GetDoubleFromObj'', signals an error.  NaN is represented as
    '''tclDoubleType'''.

 10. IEEE floating point does not have a single unique NaN value, so a
     NaN may be augmented by a parenthesized string of hexadecimal
     digits, which will be stored in its least significant bits.  It
     shall not be possible to construct a signalling NaN by this route;
     only quiet NaNs will be supported.  NaN is represented as
     '''tclDoubleType'''.
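
The accepting states listed above can be approximated by a small string classifier.  The sketch below is a simplification under stated assumptions, not the reference implementation: it ignores leading and trailing whitespace, matches only the spellings "Inf", "Infinity" and "NaN" exactly as written in this document, and reports only which state accepts the string without computing a value.  The names ''NumClass'', ''hex_run'' and ''classify'' are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum {
    K_NONE, K_ZERO, K_OCTAL, K_HEX, K_DECIMAL, K_BAD_OCTAL,
    K_FLOAT_F, K_FLOAT_E, K_INF, K_NAN
} NumClass;

/* Length of the run of hexadecimal digits at the start of s. */
static size_t hex_run(const char *s)
{
    size_t n = 0;
    while (isxdigit((unsigned char) s[n])) n++;
    return n;
}

NumClass classify(const char *s)
{
    size_t i, n, digits = 0, dots = 0;
    int has89 = 0;

    assert(s != NULL);
    if (*s == '+' || *s == '-') s++;                 /* optional sign */

    /* States 8-10: named constants; NaN may carry "(hexdigits)". */
    if (strcmp(s, "Inf") == 0 || strcmp(s, "Infinity") == 0) return K_INF;
    if (strncmp(s, "NaN", 3) == 0) {
        if (s[3] == '\0') return K_NAN;
        n = (s[3] == '(') ? hex_run(s + 4) : 0;
        return (n > 0 && strcmp(s + 4 + n, ")") == 0) ? K_NAN : K_NONE;
    }

    /* State 3: "0x" or "0X" followed by hexadecimal digits. */
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        n = hex_run(s + 2);
        return (n > 0 && s[2 + n] == '\0') ? K_HEX : K_NONE;
    }

    /* Scan the digits and periods with which every remaining form starts. */
    for (i = 0; isdigit((unsigned char) s[i]) || s[i] == '.'; i++) {
        if (s[i] == '.') dots++;
        else { digits++; if (s[i] >= '8') has89 = 1; }
    }
    if (digits == 0 || dots > 1) return K_NONE;

    /* State 7: 'E' format, an exponent with optional sign and digits. */
    if (s[i] == 'e' || s[i] == 'E') {
        i++;
        if (s[i] == '+' || s[i] == '-') i++;
        if (!isdigit((unsigned char) s[i])) return K_NONE;
        while (isdigit((unsigned char) s[i])) i++;
        return s[i] == '\0' ? K_FLOAT_E : K_NONE;
    }
    if (s[i] != '\0') return K_NONE;                 /* trailing characters */

    if (dots == 1) return K_FLOAT_F;                 /* state 6 */
    if (s[0] != '0') return K_DECIMAL;               /* state 4 */
    if (digits == 1) return K_ZERO;                  /* state 1 */
    return has89 ? K_BAD_OCTAL : K_OCTAL;            /* states 5 and 2 */
}
```

Under this sketch "0755" is classified as octal, "089" as bad octal, and "08.5" as an 'F' format double, matching items 2, 5 and 6.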

~ Additions

In addition to the base state machine detailed above,
the state machine of the reference implementation
contains additional states to parse integer values
beginning with the '''0b''' or '''0o''' prefixes as
originally proposed in [114].  Getting these prefixes
recognized in Tcl 8.5 is an important step toward
migrating to whatever version of Tcl drops
the "leading '''0''' implies octal format" rule.
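
As an illustration of the two prefixes, the following sketch converts a string written with '''0b''' or '''0o''' to an unsigned value.  It is a simplification and not the reference implementation: the function name ''parse_prefixed'' is hypothetical, only lowercase prefixes are handled, and neither signs nor overflow are checked.

```c
#include <assert.h>
#include <stddef.h>

/* Parse an unsigned integer written with a "0b" (binary) or "0o" (octal)
 * prefix.  Returns 1 and stores the value on success, 0 otherwise.
 * Overflow is not detected in this sketch. */
int parse_prefixed(const char *s, unsigned long *out)
{
    unsigned long v = 0;
    int base;

    assert(s != NULL && out != NULL);
    if (s[0] != '0') return 0;
    if (s[1] == 'b') base = 2;
    else if (s[1] == 'o') base = 8;
    else return 0;

    s += 2;
    if (*s == '\0') return 0;                 /* prefix with no digits */
    for (; *s != '\0'; s++) {
        int d = *s - '0';
        if (d < 0 || d >= base) return 0;     /* not a digit of this base */
        v = v * (unsigned long) base + (unsigned long) d;
    }
    *out = v;
    return 1;
}
```

Here "0b101" yields 5 and "0o17" yields 15, while "0b102" and a bare "0o" are rejected.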

In addition, the parsing routine will accept a
''flags'' value containing the flag bits below that
exert finer control on the parsing.  These extra
controls were found to be required to permit the
[[scan]] command to use the same parser.

   * '''TCL_PARSE_INTEGER_ONLY''' -- accept only integer values;
     reject strings that denote floating point values (or
     accept only the leading portion of such a string that is an
     integer value).

   * '''TCL_PARSE_SCAN_PREFIXES''' -- ignore the prefixes '''0b'''
     and '''0o''', which are not part of the [[scan]] command's
     vocabulary.  Use only in combination with
     '''TCL_PARSE_INTEGER_ONLY'''.

   * '''TCL_PARSE_OCTAL_ONLY''' -- parse only in the octal format,
     whether or not a prefix is present that would lead to octal
     parsing.  Use only in combination with
     '''TCL_PARSE_INTEGER_ONLY'''.

   * '''TCL_PARSE_HEXADECIMAL_ONLY''' -- parse only in the
     hexadecimal format, whether or not a prefix is present that
     would lead to hexadecimal parsing.  Use only in combination
     with '''TCL_PARSE_INTEGER_ONLY'''.

   * '''TCL_PARSE_DECIMAL_ONLY''' -- parse only in the decimal
     format, no matter whether a '''0''' prefix would normally
     force a different base.
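
The "use only in combination with" constraints on these flags can be expressed as a simple bit-mask check.  The sketch below is illustrative only: the bit values are assumptions (this document does not specify them), the constant names deliberately omit the '''TCL_''' prefix to mark them as stand-ins, and ''flags_are_consistent'' is a hypothetical helper.

```c
#include <assert.h>

/* Illustrative bit values only; the text above does not specify them. */
enum {
    PARSE_INTEGER_ONLY     = 1 << 0,
    PARSE_SCAN_PREFIXES    = 1 << 1,
    PARSE_OCTAL_ONLY       = 1 << 2,
    PARSE_HEXADECIMAL_ONLY = 1 << 3,
    PARSE_DECIMAL_ONLY     = 1 << 4
};

/* Returns 1 if the combination respects the rule that three of the flags
 * may be used only together with PARSE_INTEGER_ONLY, and 0 otherwise. */
int flags_are_consistent(int flags)
{
    const int need_integer =
        PARSE_SCAN_PREFIXES | PARSE_OCTAL_ONLY | PARSE_HEXADECIMAL_ONLY;

    assert(flags >= 0);
    if ((flags & need_integer) && !(flags & PARSE_INTEGER_ONLY)) return 0;
    return 1;
}
```

For example, the octal-only flag on its own is rejected by this check, whereas combining it with the integer-only flag is accepted.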

~ Incompatibilities

With the change described, the Tcl and Tk test suites run with
unexpected results only in the detailed format of error messages
for integer overflow and in the types reported when the
'''testobj''' command (not part of the usual distribution) is used
to introspect objects.  Despite this reassurance, several potential
incompatibilities are identified.

First, as mentioned above, C extensions will no longer have fine
control over Tcl's built-in numeric types, because the types
will not be registered and hence will be unavailable for use
with ''Tcl_ConvertToType''.  This is actually a
good thing, since it means that the rest of Tcl can assume that
they are well-behaved, resulting in a considerable simplification.
Most of the Tcl Core Team believes that ''Tcl_ConvertToType''
has no legitimate use in any case.

Second, it will no longer be correct to assume that
''Tcl_Get*FromObj'' will leave an internal representation of
precisely the requested type.  It is, in any case, a highly
questionable practice for callers to assume a specific internal
representation (with the possible exception of ''Tcl_Set*Obj'' and
''Tcl_New*Obj'').  There will no doubt be a few extensions that run
afoul of this change, but they can be fixed easily in such a way
that they will continue to compile and run on earlier versions
of Tcl.

Third, ''Tcl_GetDoubleFromObj'' will be both more and less permissive
than before.  It will no longer accept constants that have a leading
zero and no decimal point or 'E' but are invalid octal numbers.
On the other hand, it will accept constants that are too large to fit
in a '''Tcl_WideInt'''; somewhat surprisingly, the result of
'''string repeat 9 50''' cannot today be interpreted as
a double.  '''string is double''' will follow ''Tcl_GetDoubleFromObj''
in what it considers acceptable.  Any string that is accepted as
either an integer or a double by '''expr''' will be accepted in
''Tcl_GetDoubleFromObj'', and only those strings will be accepted.
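
The point about very large constants can be checked with the standard C library alone: a string of fifty '9' digits overflows a 64-bit integer conversion but is well within the range of a double.  The sketch below is not Tcl code; the helpers ''nines'', ''overflows_wide'' and ''read_double'' are hypothetical, and ''long long'' stands in for '''Tcl_WideInt'''.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Fill buf with n copies of the digit '9' (buf must hold n + 1 bytes). */
void nines(char *buf, size_t n)
{
    assert(buf != NULL);
    memset(buf, '9', n);
    buf[n] = '\0';
}

/* Returns 1 if s is out of range for long long when read in decimal. */
int overflows_wide(const char *s)
{
    assert(s != NULL);
    errno = 0;
    (void) strtoll(s, NULL, 10);
    return errno == ERANGE;
}

/* Reads s as a double; a 50-digit value is far below the largest double. */
double read_double(const char *s)
{
    assert(s != NULL);
    return strtod(s, NULL);
}
```

Filling a 51-byte buffer with ''nines(buf, 50)'' gives a string for which ''overflows_wide'' reports overflow, while ''read_double'' returns a finite value close to 1e50.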

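Fourth, the recognition of '''0b''' and '''0o''' as valid
prefixes for integer values is itself an incompatibility: strings
beginning with those prefixes that were previously not numbers
will now be accepted as integers.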

~ Reference Implementation

See [237] for more implementation details.

~ Copyright

Copyright (c) 2005 by Kevin B. Kenny.  All rights reserved.  

This document may be distributed subject to the terms and conditions
set forth in the Open Publication License, version 1.0
[http://www.opencontent.org/openpub/].