Artifact 02a93ac0c2406c577a01c4d0cc7ff38b5a10936ecabbc026f7e1029f8c324bc3:

File tip/75.tip — part of check-in [1ed50dd47a] at 2003-12-14 18:34:23 on branch trunk — Implemented by DKF (user: dkf size: 5290)
TIP:            75
Title:          Refer to Sub-RegExps Inside 'switch -regexp' Bodies
Version:        $Revision: 1.14 $
Author:         Donal K. Fellows <[email protected]>
Author:         János Holányi <[email protected]>
Author:         Salvatore Sanfilippo <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        28-Nov-2001
Post-History:   
Discussions-To: http://purl.org/mini/cgi-bin/chat.cgi
Keywords:       switch,regexp,parentheses
Tcl-Version:    8.5

~ Abstract

Currently, a script that uses [[switch -regexp]] must match the
regular expression against the string a second time in order to get
the sub-expressions out of the matched string.  This TIP alters that
so that those sub-expressions are made available directly to the body
of the script to be executed.

~Rationale

Similarly to the

|   regexp -- <RE> $string matchvar submatchvar ...

of Tcl and the

|   interact -re <RE> {
|      set matches "$interact_out(0,string) $interact_out(1,string) ..."
|   }

of Tcl/Expect, it would be very helpful, and would also make Tcl more
consistent, if the [[switch]] command supported references from the
body associated with each pattern to the parenthesized
sub-expressions of that pattern.  At present, the regular expression
must be matched against the string a second time to obtain this
information.
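
As an illustration of the double match described above, the following
is a minimal sketch that runs on a Tcl without this TIP.  The input
line and the variable names (''line'', ''user'', ''host'', ''report'')
are invented for the example.

|set line "MAIL FROM: <alice@example.org>"
|switch -regexp -- $line {
|   {^MAIL FROM: <(.*)@(.*)>$} {
|      # switch has already matched, but the captures are lost, so
|      # the same RE has to be run a second time to recover them.
|      regexp -- {^MAIL FROM: <(.*)@(.*)>$} $line -> user host
|      set report "Mail from $user at $host"
|   }
|   default {
|      set report "unrecognised line"
|   }
|}
|puts $report

The pattern has to be written out (or stored in a variable) twice,
and the RE engine does the same work twice.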

~Specification

The easiest way to get the information is to place it into a variable.
All that remains is a way to specify which variable should receive the
information.  This is done by a new option to the [[switch]] command:
''-matchvar''.  The argument to this option gives the name of a
variable in which will be placed a Tcl list of the matches discovered
by the RE engine, such that the part of the string that was matched is
given by [[lindex $var 0]], the first parenthesis by [[lindex $var
1]], etc.  The alternative to this is to use the name of an array, but
this is more expensive.
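
A minimal sketch of how ''-matchvar'' might be used, assuming a Tcl
that implements this TIP.  The input line and the names ''m'' and
''report'' are illustrative only.

|set line "EHLO mail.example.org"
|switch -regexp -matchvar m -- $line {
|   {^EHLO (.*)$} {
|      # [lindex $m 0] is the whole match, [lindex $m 1] the first group
|      set report "Helo [lindex $m 1]"
|   }
|   default {
|      set report "unrecognised line"
|   }
|}
|puts $report

Because the result is an ordinary list, any list command can be
applied to it without a second call to the RE engine.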

The indices at which the match occurred can also sometimes be useful.
Therefore, a new option ''-indexvar'' will also be provided, which
names a variable into which a list of match indices (each a two-item
list of values, as computed by [[regexp -indices]]) will be placed.
It will be legal for both -matchvar and -indexvar to be specified in
the same [[switch]] command, but either option is legal only if the
matching mode is -regexp.  (The other kinds of match modes always
match against the whole string anyway.)
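
The following sketch, again assuming a Tcl that implements this TIP,
reads a start index out of the ''-indexvar'' list and then checks that
the option is refused outside -regexp mode.  The names ''idx'',
''report'' and ''failed'' are invented for the example, and only start
indices are used so that nothing here depends on how the end index is
reported.

|set string "some long complicated message"
|switch -regexp -indexvar idx -- $string {
|   {(l)ong} {
|      # Element 0 describes the whole match; its first item is the start.
|      set report "'long' starts at index [lindex $idx 0 0]"
|   }
|   default {
|      set report "no match"
|   }
|}
|puts $report
|
|# Outside -regexp mode the option should be rejected with an error.
|set failed [catch {switch -exact -matchvar m -- abc {abc {}}}]
|puts "rejected without -regexp: $failed"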

Both variables (if specified, of course) will contain the empty list
if the ''default'' branch is taken.
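
A short sketch of the ''default'' case, assuming a Tcl that implements
this TIP; the names ''m'', ''i'' and ''report'' are illustrative.

|switch -regexp -matchvar m -indexvar i -- "xyz" {
|   {a(b)c} {
|      set report "matched"
|   }
|   default {
|      # No pattern matched, so both variables hold the empty list.
|      set report "default: [llength $m] matches, [llength $i] index pairs"
|   }
|}
|puts $report

A body can therefore always read the variables safely, even when it is
shared between a pattern and the ''default'' branch.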

~Example

|set string "some long complicated message"
|switch -matchvar foo -indexvar bar -regexp -- $string {
|   {\w*(e)\w*} {
|      puts "matched [lindex $foo 0] with 'e' at [lindex $bar 1 0]"
|   }
|   default {
|      puts "no words containing a letter 'e' at all"
|   }
|}

~Alternatives

Strictly speaking, no new syntax is needed to provide this ability.
The solution could instead adopt the behavior of [[regsub]]
''(description taken from regsub(n))'':

 > If subSpec contains a `&' or `\0', then it is replaced in the
   substitution with the portion of string that matched exp.  If
   subSpec contains a `\''n''', where ''n'' is a digit between 1 and
   9, then it is replaced in the substitution with the portion of
   string that matched the ''n''-th parenthesized subexpression of
   exp.  Additional backslashes may be used in subSpec to prevent
   special interpretation of `&' or `\0' or `\n' or backslash.

This has the disadvantage of being incompatible with existing code
that makes use of the -regexp option to [[switch]] and which may well
have characters matching the above sequences inside already.
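
For reference, this is how the quoted substitution rules behave in
[[regsub]] itself.  It is a sketch with invented data, not part of any
proposed [[switch]] syntax.

|set addr "alice@example"
|# \1 and \2 are the parenthesized parts, & is the whole match.
|set out [regsub {(\w+)@(\w+)} $addr {\2 has user \1 (&)}]
|puts $out

If the same rules were applied to [[switch]] bodies, a body that
already contains something like ''if {$a && $b} ...'' would have its
ampersands treated as references to the match, which is the kind of
breakage described above.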

Another alternative is to specify either -submatches or -subindexes
and to use three elements for every switch case: first the regexp,
then the list of variables as in the [[regexp]] command, and finally
the script to execute.

|set string [getSomeComplexProtocolLine]
|switch -regexp -submatches -- $string {
|    {EHLO (.*)} {match heloarg} {
|       puts "Helo $heloarg"
|    }
|    {MAIL FROM: <(.*)@(.*)>} {match user host} {
|       puts "Mail from $user at $host"
|    }
|    {QUIT} {} {
|       exit
|    }
|    default {} {
|       puts "What a strange SMTP command!"
|    }
|}

Submatches usually have quite logical names, so referring to them by
name may be more comfortable than using [[lindex]].  Another minor
advantage is that the variable names sit right next to the script, so
it should not be hard to follow what the script is doing.

On the other hand, this changes the well-known fact that [[switch]]
takes two elements for every case; the main proposal of this TIP has
the advantage of leaving that feature of the [[switch]] command
invariant.  This makes the overall implementation of the feature
easier, and also makes the feature easier to explain to people.  It
also makes it trivial to obtain both the matched string and the range
of the input string that matched.  Of course, the alternative could
simply take four values for each entry, but that is getting baroque.
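
Named sub-matches remain within reach under the main proposal.  The
sketch below assumes a Tcl that provides both ''-matchvar'' and the
[[lassign]] command (neither is available before 8.5), and the input
line and variable names are invented for the example.

|set line "MAIL FROM: <alice@example.org>"
|switch -regexp -matchvar m -- $line {
|   {^MAIL FROM: <(.*)@(.*)>$} {
|      # Unpack the match list into named variables in one step.
|      lassign $m match user host
|      set report "Mail from $user at $host"
|   }
|   default {
|      set report "What a strange SMTP command!"
|   }
|}
|puts $report

This keeps the two-elements-per-case shape of [[switch]] while still
placing the variable names next to the script that uses them.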

~Reference Implementation

http://sf.net/tracker/?func=detail&aid=848578&group_id=10894&atid=310894

~Copyright

This document has been placed in the public domain.