Tcl Improvement Proposals: Artifact [218262106e]

Artifact 218262106e393a557410332e67e473a6aaafb99e785a3c5655e93114e82f4e5d:

File tip/122.tip — part of check-in [9a48d10869] at 2006-12-05 09:54:11 on branch trunk — Feature set of 8.5 is now closed to new TIPs without prior agreement of the TCT; TIPs targetting 8.5 now redirected to 8.6 (user: dkf size: 3809)
TIP:            122
Title:          Use tcl_{non,}wordchars Throughout Tcl/Tk
Version:        $Revision: 1.7 $
Author:         Martin Weber <[email protected]>
Author:         Martin Weber <[email protected]>
Author:         Vince Darley <[email protected]>
State:          Draft
Type:           Project
Vote:           Pending
Created:        12-Dec-2002
Post-History:   
Tcl-Version:    8.6

~ Abstract

This TIP shall bring flexible management of word and non-word chars to
Tcl, to be used throughout the Tcl realm in e.g. [[regexp]]'s \w \W,
Tk's [[textwidget]], [[string]] wordstart/wordend etc.

~Specification

Assignment to ''tcl_{non,}wordchars'' shall influence any place in Tcl
which decides whether something is a word character or not, including
detection of word boundaries in e.g. regular expressions, Tk's text
widget and so on.

For this there shall be no hard-coding of lists of values which are
word and non-word characters, and neither shall the language rely on
the language of implementation (i.e. C's ''is*()'' functions), as this
disallows dynamic changing of ''tcl_{non,}wordchars''.

Rather shall the value(s) of ''tcl_{non,}wordchars'' be used to
determine whether a given character is part of a word or not.

~Rationale

Currently in Tcl there are different hard-coded ways to decide whether
a certain character is a word character or a non word character.
Different hard-coded ways also imply that changes on one side might
not get over to the other side, so there soon are different hard-coded
ways which yield different hard-coded results.  As a inference of it
being hard-coded, this also means that there is no way to change or
fix that potentially broken behavior.  Having Tcl lookup the values of
those variables at runtime allows for the needed flexibility, both
when dealing with nonstandard demands and nonstandard character sets.

As an example of the breakage, you can assign a regular expression
to tcl_{,non}wordchars, and the double click binding in the textwidget
will regard that pattern when marking a "whole word". When you try
to ask the text widget to deliver the data under a certain coordinate
with the indices 'wortstart' and 'wordend', the value of
tcl_{non,}wordchars is not used though.

There may be a problem with the performance of the lookup, but on the
other hand are C's ''is*()'' functions also implemented via a table
lookup.  An installation of a caching static character table could
guarantee the needed performance.

~Example of current word confusion

Tk's text widget uses "word" in several ways:

1. selection by word (double-click + drag),

2. movement by word ('insert wordstart'),

3. regexp searching with \m\M wordmatching.

4. line breaks when wrapping (-wrap word)

It is not at all clear from reading Tcl or Tk's documentation what the
behaviour of the above options will be.  It turns out that:

1. after a convoluted call-chain, ends up calling
tcl_wordBreakAfter/Before which use tcl_wordchars and tcl_nonwordchars (which actually are defined differently on Windows vs Unix/MacOS!!).

2. uses 'isalnum(char)' or '_' to define a word (hard-coded in Tk's
tkTextIndex.c) (in Tk8.5a0 this has been fixed to use ''Tcl_UniCharIsWordChar'')

3. uses Tcl's regexp engine's definition of a word (this ought to be the same as that used in (2)).

4. Anything separated by white-space from something else, used with '-wrap word' to define line-wrapping in text widgets (and canvases).

It is quite likely that most of the above are different under some
circumstances or some platforms/locales, and certainly if the
user/developer wants to create a text widget with a different word
definition, they basically can't in any consistent way.

~Implementation

None yet.

~Copyright

This document is placed in the public domain.