Tcl Source Code

View Ticket
Login
2015-09-21
19:04
[1115587][0e0e150e49] Major fix for regexp handling of quantified backrefs. Contributed by Tom Lane ... check-in: c8dfe06653 user: dgp tags: trunk
19:03 Closed ticket [1115587fff]: Regexp backreference fail with a * closure plus 7 other changes artifact: e905141b9d user: dgp
2015-09-19
16:01 Ticket [1115587fff]: 7 changes artifact: c1c403cdb8 user: tgl
2012-07-10
23:53 Ticket [1115587fff]: 4 changes artifact: b822797e2a user: andreas_kupries
23:44 Ticket [1115587fff]: 4 changes artifact: 7fbc21918f user: dgp
2012-02-24
04:14 Ticket [1115587fff]: 4 changes artifact: 25e2ea7786 user: dkf
2012-02-14
22:56 Ticket [1115587fff]: 4 changes artifact: 287608c789 user: tgl
17:07 Ticket [1115587fff]: 4 changes artifact: bf9f36cc15 user: dkf
11:35 Ticket [1115587fff]: 4 changes artifact: 344daa0ec3 user: tgl
09:18 Ticket [1115587fff]: 4 changes artifact: caef4df93d user: tgl
09:06 Ticket [1115587fff]: 4 changes artifact: 2ffd8b0e9a user: tgl
2009-05-24
02:56 Ticket [1115587fff]: 5 changes artifact: d2395db304 user: dkf
2007-10-02
02:04 Ticket [1115587fff]: 4 changes artifact: 6820a2f8a5 user: erh
2005-09-20
03:26 Ticket [1115587fff]: 4 changes artifact: 664665662d user: dkf
03:24 Ticket [1115587fff]: 4 changes artifact: 2d91195c24 user: dkf
2005-09-19
04:01 Ticket [1115587fff]: 4 changes artifact: bcbb9dfd50 user: wielemaker
2005-02-03
17:28 New ticket [1115587fff]. artifact: e6921d3161 user: jkbonfield

Ticket UUID: 1115587
Title: Regexp backreference fail with a * closure
Type: Bug Version: obsolete: 8.4.9
Submitter: jkbonfield Created on: 2005-02-03 17:28:11
Subsystem: 43. Regexp Assigned To: dgp
Priority: 8 Severity: Minor
Status: Closed Last Modified: 2015-09-21 19:03:07
Resolution: Fixed Closed By: dgp
    Closed on: 2015-09-21 19:03:07
Description:
The documentation has an example:

"([bc])\1 matches bb or cc but not `bc'". So I tried:

% regexp {^([bc])\1$} bb
1
% regexp {^([bc])\1$} cc
1
% regexp {^([bc])\1$} bc
0

All well and good. But what if we want b* or c* but not
[bc]* (ie disallow bbcb). The obvious solution is \1*:

% regexp {^([bc])\1*$} bbb
1
% regexp {^([bc])\1*$} bcb
1

bcb should not match!

Oddly, using \1+ works (but means of course that "b" or
"c" solo does not match).

James
User Comments: dgp added on 2015-09-21 19:03:07:
Fix accepted for 8.6.5.

tgl added on 2015-09-19 16:01:47:
See ticket http://core.tcl.tk/tcl/tktview?name=0e0e150e49479e3f for a proposed solution.

andreas_kupries added on 2012-07-10 23:53:20:
A side note, commenting on "But there are two significant regex features that DFA can't handle: capturing parentheses and back-references."

While I agree with the above on the back-references the capturing parentheses can actually be handled by a DFA style RE matcher. Russ Cox, who wrote about the performance differences between automata and backtracking RE matchers at the beginning of 2007 wrote two followup articles about this a few years later, as I found out last week.

His article describing how to handle capturing parens is at http://swtch.com/~rsc/regexp/regexp2.html, written at the end of 2009.

dgp added on 2012-07-10 23:44:09:
CLT reports the PostgresQL project is committing fixes to
the Spencer RE code.

http://git.postgresql.org/gitweb/?p=postgresql.git&a=search&h=HEAD&st=commit&s=regex

Someone interested might check if this bug (or others) have fixes to borrow.

dkf added on 2012-02-24 04:14:26:
I've committed tests (14.21 through 14.23 in reg.test) to the core-8-5-branch and trunk that characterize this bug. The one that _really_ characterizes the bug is marked with knownBug.

The best place to start committing will be off the core-8-5-branch tip since we prefer to "merge forward"; contact me if you need any account setting up on core.tcl.tk

tgl added on 2012-02-14 22:56:04:
Yeah, that seems more reliable than trying to pass patches through this unfriendly tracker.  What branch should I use?

dkf added on 2012-02-14 17:07:34:
You comprehend this far more deeply than I do, but yes, if it works we'll accept it. Do you want commit privileges to the repository so you can work on a public branch?

tgl added on 2012-02-14 11:35:59:
Some notes about fixing the more general issue ...

For context, realize that Tcl's regex engine is a hybrid DFA/NFA
implementation.  Mostly it relies on the DFA logic, which has fast and
stable performance because it never has to backtrack.  But there are two
significant regex features that DFA can't handle: capturing parentheses and
back-references.  To deal with those things, the regex compilation code
builds a tree of "subRE"s, which will be searched at runtime in NFA style.
The existing types of subREs are:

* plain: matches any string described by a DFA.
* backref: matches any string that is equal to a string identified by a
previous capture (or actually, any string that is N repetitions of such
a string, but I digress).
* capture: matches any string that matches its child subRE, then saves the
match for later output or backref-matching.
* concatenation: matches any string that matches the concatenation of its
two child subREs.
* alternation: matches any string that matches either of its two child
subREs.

concatenation implies searching, since it has to consider each possible
division of the presented string into two parts to see if they can match
its two child subREs.  Similarly, alternation requires searching, although
it has only two choices to consider not N.  To reduce unnecessary
searching, Henry Spencer made the compilation code create a DFA for each
subRE node, not only the "plain" nodes.  The DFA describes what could
possibly match the subRE in as much detail as DFA can manage (which is
actually everything except the effects of back-refs).  At runtime, the
subRE tree searching code first executes the associated DFA, and gives up
immediately if it finds no match.  Otherwise it starts working through the
possible match candidates.

Now, if you're a programmer, it probably struck you as odd that the subRE
types listed above don't include any iteration capability.  How did Henry
expect to deal with quantified regexes?  The answer appears in the "it's
quantifier time" portion of parseqatom().  He transforms the general case
x{m,n} as follows:

* first, turn x{0,...} into x{1,...}|empty

* next, knowing we have m>0 and n>0, turn x{m,n} into x{m-1,n-1}x, with
  capturing parens in only the second x

The reference to capturing parens is telling --- he wasn't thinking about
back-refs when he wrote this.  What the code is actually doing is pushing
all the "messy" stuff (that is, subREs possibly involving capture or
backref) into the second child of the concatenation subRE it creates here.
The first child subRE, for "x{m-1,n-1}", is always implemented as a plain
DFA node.  Now, that's fine for messiness that consists of capturing
parens; capturing parens don't affect match behavior, so there's no need
to consider them in the first child subRE.  (What this implementation
effectively says is that if you write capturing parens inside a quantifier,
say "(x)+", what will be captured is the text that matched the last
iteration of the quantifier.  No objection from me, though I do not see
this behavioral detail documented in the re_syntax man page ... and come to
think of it, I wonder whether the POSIX regex spec has anything to say
about the subject.)

BUT: that logic falls down completely when you consider back-refs.
A back-ref absolutely does affect match behavior, but what the first child
subRE has to work with is only a copy of the DFA for the back-ref's
referent expression.  That is, if you write something like
(\w+)(\s+\1)+
(intended to match 2 or more repetitions of the same word), the actual
behavior you get is equivalent to
(\w+)(\s+\w+)*(\s+\1)
ie, the last word is checked to be equal to the first, but any words in the
middle are only checked to be words, because the subRE for "x{m-1,n-1}" is
only a DFA for "(\s+\w+)*".

Note that this only comes into play when the back-ref is a part of a larger
quantified expression --- a quantifier attached directly to a back-ref uses
a different code path, which has a problem with "\n*" as I documented
earlier, but not this more general bug.  If you work through the examples
mentioned so far, I think they are all explained by one issue or the other.

I'm inclined to think that the way to attack this is to invent an
"iteration" subRE type, drop the two compilation-time transforms mentioned
above in favor of just plopping an iteration subRE atop any messy
expression that has to be quantified, and do the search work
straightforwardly at runtime.  However, this looks like a sizable amount of
work that will result in a very nontrivial patch.  So before starting, I'm
soliciting some feedback about whether this seems like a sane solution that
would get accepted.

tgl added on 2012-02-14 09:18:11:
Sigh, that doesn't look very nice either.  If you can't extract the patch from patch item 3487443, email me at tgl at sss dot pgh dot pa dot us, and I'll send you a clean copy.

tgl added on 2012-02-14 09:06:37:
I tried to upload this as a patch, but that didn't seem to work amazingly well, so let's see about pasting it right into the comment box ...


The attached patch fixes the case of "\n*", that is a backref with *
quantification directly attached to it.

The core of the problem is in regcomp.c, which first changes "\n*" to
"\n+|" (lines 1167ff as of 8.5.11) and then applies repeat() to the NFA
representing the backref atom (line 1198).  repeat() thinks that any arc
leading into its "rp" argument is part of the sub-NFA to be repeated ---
see moveins() call at line 1363.  Unfortunately, since line 1170 already
created the arc that was intended to represent the empty bypass around
"\n+", this arc gets moved too, so that it now leads into the state loop
created by repeat().  Thus, what was supposed to be an "empty" bypass gets
turned into something that represents zero or more repetitions of the NFA
representing the backref atom.  In the original example, in place of
^([bc])\1*$
we now have something that acts like
^([bc])(\1+|[bc]*)$
At runtime, the branch involving the actual backref fails, as it's supposed
to, but then the other branch succeeds anyway.

We could no doubt fix this by some rearrangement of the operations in
parseqatom(), but that code is plenty ugly already.  What I propose instead
is to allow the m == 0 case to be handled at runtime, so that there is no
need to install the empty-alternative branch at all for the backref case.
This makes the patch in regcomp.c a one-liner, at the cost of having to
tweak cbrdissect() a little.  In the attached patch I went a bit further
than that and rewrote cbrdissect() to check all the string-length-related
conditions before it starts comparing characters.  It seems a bit stupid to
possibly iterate through many many copies of an n-character backreference,
only to fail at the end because the target string's length isn't a multiple
of n --- we could have found that out before starting.  The existing coding
could only be a win if integer division is hugely expensive compared to
character comparison, but I don't know of any modern machine where that
might be true.

The patch also corrects an obviously bogus comment at regexec.c:802.
Perhaps at one time it was valid, but now this is the only caller of
cbrdissect() ...

This does not fix all the problems with quantified back-references.  In
particular, the code is completely broken for back-references that appear
within a larger expression that is quantified, so several of the cases
shown in the comments to this bug entry still don't work.  I do not have
a solution for that yet; it looks to me like it may require major surgery.
However, the changes I propose now seem appropriate to make anyway.  Doing
direct handling of quantification in a simple back-ref is a performance win
compared to anything we can possibly do at a higher level, so this code
path seems worth preserving even if we need another solution for the more
general case.


diff -cr tcl8.5.11.orig/generic/regcomp.c tcl8.5.11/generic/regcomp.c
*** tcl8.5.11.orig/generic/regcomp.cTue Nov  1 10:04:51 2011
--- tcl8.5.11/generic/regcomp.cMon Feb 13 16:03:05 2012
***************
*** 1161,1170 ****
      }
  
      /*
!      * It's quantifier time; first, turn x{0,...} into x{1,...}|empty
       */
  
!     if (m == 0) {
  EMPTYARC(s2, atom->end);/* the bypass */
  assert(PREF(qprefer) != 0);
  f = COMBINE(qprefer, atom->flags);
--- 1161,1172 ----
      }
  
      /*
!      * It's quantifier time.  If the atom is just a BACKREF, we'll let it deal
!      * with quantifiers internally.  Otherwise, the first step is to turn
!      * x{0,...} into x{1,...}|empty
       */
  
!     if (m == 0 && atomtype != BACKREF) {
  EMPTYARC(s2, atom->end);/* the bypass */
  assert(PREF(qprefer) != 0);
  f = COMBINE(qprefer, atom->flags);
diff -cr tcl8.5.11.orig/generic/regexec.c tcl8.5.11/generic/regexec.c
*** tcl8.5.11.orig/generic/regexec.cTue Nov  1 10:04:51 2011
--- tcl8.5.11/generic/regexec.cMon Feb 13 16:03:11 2012
***************
*** 799,805 ****
      case '|':/* alternation */
  assert(t->left != NULL);
  return caltdissect(v, t, begin, end);
!     case 'b':/* back ref -- shouldn't be calling us! */
  assert(t->left == NULL && t->right == NULL);
  return cbrdissect(v, t, begin, end);
      case '.':/* concatenation */
--- 799,805 ----
      case '|':/* alternation */
  assert(t->left != NULL);
  return caltdissect(v, t, begin, end);
!     case 'b':/* back reference */
  assert(t->left == NULL && t->right == NULL);
  return cbrdissect(v, t, begin, end);
      case '.':/* concatenation */
***************
*** 1065,1076 ****
      chr *begin,/* beginning of relevant substring */
      chr *end)/* end of same */
  {
-     int i;
      int n = t->subno;
!     size_t len;
!     chr *paren;
      chr *p;
-     chr *stop;
      int min = t->min;
      int max = t->max;
  
--- 1065,1076 ----
      chr *begin,/* beginning of relevant substring */
      chr *end)/* end of same */
  {
      int n = t->subno;
!     size_t numreps;
!     size_t tlen;
!     size_t brlen;
!     chr *brstring;
      chr *p;
      int min = t->min;
      int max = t->max;
  
***************
*** 1081,1091 ****
  
      MDEBUG(("cbackref n%d %d{%d-%d}\n", t->retry, n, min, max));
  
      if (v->pmatch[n].rm_so == -1) {
  return REG_NOMATCH;
      }
!     paren = v->start + v->pmatch[n].rm_so;
!     len = v->pmatch[n].rm_eo - v->pmatch[n].rm_so;
  
      /*
       * No room to maneuver -- retries are pointless.
--- 1081,1092 ----
  
      MDEBUG(("cbackref n%d %d{%d-%d}\n", t->retry, n, min, max));
  
+     /* Get the backreferenced string */
      if (v->pmatch[n].rm_so == -1) {
  return REG_NOMATCH;
      }
!     brstring = v->start + v->pmatch[n].rm_so;
!     brlen = v->pmatch[n].rm_eo - v->pmatch[n].rm_so;
  
      /*
       * No room to maneuver -- retries are pointless.
***************
*** 1096,1146 ****
      }
      v->mem[t->retry] = 1;
  
!     /*
!      * Special-case zero-length string.
!      */
! 
!     if (len == 0) {
! if (begin == end) {
      return REG_OKAY;
  }
  return REG_NOMATCH;
      }
! 
!     /*
!      * And too-short string.
!      */
! 
!     assert(end >= begin);
!     if ((size_t)(end - begin) < len) {
  return REG_NOMATCH;
      }
-     stop = end - len;
  
      /*
!      * Count occurrences.
       */
  
!     i = 0;
!     for (p = begin; p <= stop && (i < max || max == INFINITY); p += len) {
! if ((*v->g->compare)(paren, p, len) != 0) {
!     break;
! }
! i++;
      }
-     MDEBUG(("cbackref found %d\n", i));
- 
-     /*
-      * And sort it out.
-      */
  
!     if (p != end) {/* didn't consume all of it */
! return REG_NOMATCH;
!     }
!     if (min <= i && (i <= max || max == INFINITY)) {
! return REG_OKAY;
!     }
!     return REG_NOMATCH;/* out of range */
  }
  
  /*
--- 1097,1145 ----
      }
      v->mem[t->retry] = 1;
  
!     /* Special cases for zero-length strings */
!     if (brlen == 0) {
! /*
!  * Matches only if target is zero length, but any number of
!  * repetitions can be considered to be present
!  */
! if (begin == end && min <= max) {
!     MDEBUG(("cbackref matched trivially\n"));
      return REG_OKAY;
  }
  return REG_NOMATCH;
      }
!     if (begin == end) {
! /* Matches only if zero repetitions are okay */
! if (min == 0) {
!     MDEBUG(("cbackref matched trivially\n"));
!     return REG_OKAY;
! }
  return REG_NOMATCH;
      }
  
      /*
!      * Check target length to see if it could possibly be an allowed number of
!      * repetitions of brstring
       */
+     assert(end > begin);
+     tlen = end - begin;
+     if (tlen % brlen != 0)
+ return REG_NOMATCH;
+     numreps = tlen / brlen;
+     if (numreps < min || (numreps > max && max != INFINITY))
+ return REG_NOMATCH;
  
!     /* Okay, compare the actual string contents */
!     p = begin;
!     while (numreps-- > 0) {
! if ((*v->g->compare) (brstring, p, brlen) != 0)
!     return REG_NOMATCH;
! p += brlen;
      }
  
!     MDEBUG(("cbackref matched\n"));
!     return REG_OKAY;
  }
  
  /*

dkf added on 2009-05-24 02:56:30:
There are still serious problems here with the HEAD.

$ make shell
DYLD_LIBRARY_PATH=`pwd`: TCLLIBPATH="/Users/dkf/Documents/software/tcl/unix/pkgs" TCL_LIBRARY="/Users/dkf/Documents/software/tcl/library" ./tclsh 
% info patch
8.6b1.1
% regexp -all -inline {(.)\1*} aaabbbcd
aaabbbcd a
% regexp -all -inline {(.)\1+) aaabbbcd
^Cmake: *** [shell] Interrupt

All I was after was an expression that would allow me to get all the runs of any character (as a first stage to writing a run-length encoder). With a “*”, it matched the wrong thing. With a “+” it hangs! (Or at least it wasn't producing an answer in a reasonable amount of time so I killed it...) Elevating priority because of the “+” problem.

erh added on 2007-10-02 02:04:01:
Logged In: YES 
user_id=32759
Originator: NO

A slightly simpler set of examples is:

% regexp {^(.)\1*$} abc
1
% regexp {^(a)\1*$} abc
0
% regexp {^(.)(\1)*$} abc
0

All of these should return 0.

dkf added on 2005-09-20 03:26:59:
Logged In: YES 
user_id=79902

Oops, +? means non-greedy as opposed to the somewhat unusual
thing I wanted. Need more clutter to get that I suppose...
  % regexp -inline {^([bc])(?:\1+)?$} b
  b b
  % regexp -inline {^([bc])(?:\1+)?$} bc
  % regexp -inline {^([bc])(?:\1+)?$} bb
  bb b
  % regexp -inline {^([bc])(?:\1+)?$} bbbbbb
  bbbbbb b

dkf added on 2005-09-20 03:24:05:
Logged In: YES 
user_id=79902

There are comments in the RE engine that describe backrefs
as "the feature from the black lagoon".  I feel that's
instructive (if scary).

FWIW (and the -inline flag is great for this):
  % regexp -inline {^([bc])\1+?$} bbb
  bbb b
  % regexp -inline {^([bc])\1+?$} bcb
  % info patch
  8.4.7

wielemaker added on 2005-09-19 04:01:53:
Logged In: YES 
user_id=1222641

Looks related to what I found:

% set x {cxdaxa}
cxdaxa
% regexp {((.)x\2)+} $x match; set match
cxdaxa
% set x {cydaxa}
cydaxa
% regexp {((.)x\2)+} $x match; set match
axa

TCL version 8.4.7, SuSE Linux 9.2