Tcl Source Code

View Ticket
Login
Bounty program for improvements to Tcl and certain Tcl packages.
Ticket UUID: a4a7f2abd376b2806073d2865f07b7d910b04574
Title: Tcl [string match] and [glob] are exponential complexity for bad patterns, could be linear
Type: Bug Version: 8.6.1
Submitter: aku Created on: 2017-04-24 19:04:42
Subsystem: 16. Commands A-H Assigned To: nobody
Priority: 5 Medium Severity: Severe
Status: Open Last Modified: 2017-05-22 11:26:26
Resolution: None Closed By: nobody
    Closed on:
Description:

See https://research.swtch.com/glob about a description of the general problem (backtracking in many glob-matching implementations) and solution.

Tcl 8.6.1 is unfortunately in the group with exponential behaviour (in contrast to regexes, where Tcl is part of the nice group).

Ranking this as severe (as a possible DOS attack vector), but not critical (not a crash, although that may be possible for the larger patterns).

User Comments: aspect added on 2017-05-22 11:26:26:
I've tidied up the Tcl_StringCaseMatch implementation and improved the comments to hopefully explain the difference with Cox's strategy.  In the process found a sneaky bug ([c84d53f778]).

aku added on 2017-05-06 17:41:01:
> This deserves some commentary in the code:

Definitely needed, IMHO

> > You have (a)
> > > pnext = pattern - 1;
> > > snext = str;

> > where Russ has
> > > nextPx = px
> > > nextNx = nx + 1

> pnext = pattern - 1 because the while loop immmediately after case '*'
> increments pattern to the character after the final *.

Can you give me line numbers, because I really do not see that.
To me (a) is at lines 1991/2. Line 1994 is the goto to line 1929, which is the beginning of the outer loop, and immediately goes into the switch again. In see incrementing for case '*', and for the matchLiteral case. The other cases I don't really see.

> snext I moved the increment to the backtrack point (matchFail) to save
> counting unichars and keep the check for end-of-string out of the main
> logic flow.

That makes sense.

> I would really like to factor some of the repetition within the function
> (all the UCHAR bits), possibly move []-handling into a static function to
> keep the main loop clearer ..

I suspect we would need benchmarks to see the impact of moving all this into a separate function.

> and if we could get rid of having almost
> identical code in ByteArrayMatch, UniCharMatch and UniCharCaseMatch that
> would be a big win imo.

As much as I would like that I currently do not see it, without negatively impacting perf. The framework is the same, yes. The exact fiddly bits (mainly around "go to next char" and "nocase"-handling) OTOH not so much.

I would consider this type of cleanup to be out of scope for this ticket.
Should make a separate ticket for that.

> Thanks for the review/feedback!

aspect added on 2017-05-06 07:00:01:
This deserves some commentary in the code:

> You have (a)
> > pnext = pattern - 1;
> > snext = str;

> where Russ has
> > nextPx = px
> > nextNx = nx + 1

pnext = pattern - 1 because the while loop immmediately after case '*' increments pattern to the character after the final *.

snext I moved the increment to the backtrack point (matchFail) to save counting unichars and keep the check for end-of-string out of the main logic flow.


I would really like to factor some of the repetition within the function (all the UCHAR bits), possibly move []-handling into a static function to keep the main loop clearer .. and if we could get rid of having almost identical code in ByteArrayMatch, UniCharMatch and UniCharCaseMatch that would be a big win imo.

Thanks for the review/feedback!

aku added on 2017-05-06 03:36:22:
Questions from reading the changes (the switches(sic) between if/switch/if/switch made it a bit hard (*)).

You have (a)
> pnext = pattern - 1;
> snext = str;

where Russ has
> nextPx = px
> nextNx = nx + 1

So, you are storing -1 relative to what he does.

Later, where the next-values are used to restart I see partial compensation for this -1, i.e. (b)
> pattern = pnext;
> str = snext + TclUtfToUniChar(str, &sch);

str compensates, pattern does not seem to.

For completeness of comparison, Russ' equivalent code is
> px = nextPx
> nx = nextNx

Can you explain ?
Also, would getting rid of both the -1 in (a) and the compensation i (b) still work ?
If yes I would consider that simpler, especially as we would loose the (semi-expensive) TclUtfToUniChar needed in (b)


(*) In the end I used fossil's web diff where you can click on the (*)-marks of two revisions to get a diff between them, to see only one change if/switch.

aku added on 2017-05-06 03:10:11:
> Probably should have been one or two commits rather than three

An idea I had and got distracted from by work (and ensuing apathy for coding in the evenings) was to pull the simplest variant (ByteMatch) out of the core into a separate mini-package (critcl would make it easy), ditto the tests. Then working on the code would not incur possible breakage of the core. And allow you to chose what to commit, and/or how much.

Thanks for the work.

aspect added on 2017-05-06 00:48:05:
Tcl_StringCaseMatch changes pushed to [aspect-string-match] branch.  I was hoping to get it a bit prettier before pushing, but it didn't happen.  Probably should have been one or two commits rather than three, since I went around in a circle on if vs switch :).

The comments should at least be brought into style before applying this same transform to the other cases.

I think the behaviour change in the first commit should be acceptable - note that \ at end is tested by 11.42 and 11.50.

aku added on 2017-05-02 14:00:13:
Agree regarding #3.
Regarding the other I plan to read glob(3) first before having a strong opinion. My weak opinion is to follow glob(3).
(Want to see if they give a reason for `malformed` => match as literal)
(And see how strong `string match` already follows/deviates from this).
The original blog article which brought this to my attention unfortunately is not looking at handling `[...]` at all.

As a side note my first slap-dash attempt at the fix yesterday late night went sideways, breaking `glob`, I believe (`package require tcltest` failed, wrongly not finding the package anymore). Glad to see that I caught your interest.

Note, two of the matcher variants (TclUniChar(Case)Match) are in tclUtf.c.
The overall structure of all variants is the same, with differences in (1) the character type used and associated increment operations, and (2) (non)handling of `nocase`.

aspect added on 2017-05-02 11:12:40:
Trying to simplify Tcl_StringCaseMatch so I can attack this, a bunch of strange edge cases come up:

% string match {[a} a
1
% string match {[a-} a
0
% string match {[a-z} a
1
% string match {[a-z} b
1
% string match {[a-]} a
1
% string match {[a-]} b
0
% string match {[a-]} -
0
% string match {[-a]} a
1
% string match {[-a]} b
0
% string match {[-a]} -
1

Further, any pattern ending with an odd number of backslashes will never match.

There are three special cases to consider:

#1 - malformed pattern: ends with single \
#2 - malformed pattern: unclosed [ sequence
#3 - [a-] (and [-a])

I think #3 should match exactly "a" and "-", as [-a] does presently.  glob(3) agrees.  The wording of string(n) implies this is what will happen.

for #1, #2 there are three arguments:

 a: malformed patterns should be an error.

 b: agree with glob(3) and treat them as literals.

 c: malformed patterns match nothing.

Since Tcl_StringMatch has no way to report errors, (a) is off the table.  I lean slightly to option (c) because it keeps the code simpler, and is closest to what Tcl_StringMatch seems to want to do.

Attachments: