Ticket UUID: fbc56b259e989230e54a4053feeecf7aa765f61d
Title: Add support for \z in regex as end-of-string anchor
Type: Patch
Version:
Submitter: david-a-wheeler
Created on: 2025-05-05 18:34:36
Subsystem: 43. Regexp
Assigned To: nobody
Priority: 5 Medium
Severity: Important
Status: Open
Last Modified: 2025-05-14 10:59:10
Resolution: None
Closed By: nobody
Closed on:
Description:
I propose that \z be accepted in regular expressions as a synonym for \Z. Below is a rationale, the one-line patch that implements it, and a few more patch lines to modify the documentation.

First, the rationale. A common use for regexes in security is to check inputs, which needs anchors for "beginning of string" and "end of string" (regardless of multi-line mode, etc.). Different regex engines notate this differently, making guidance complicated. See: https://best.openssf.org/Correctly-Using-Regular-Expressions.html

People often copy regular expressions from one engine to another. 46% of developers incorrectly believe that regexes are the same everywhere per Wang et al.’s “An Empirical Study on Regular Expression Bugs” (2020). It's going to get *worse* with LLMs, which don't always note context.

The good news is that there *is* a notation everyone can support: \A ... \z. This is supported by: Java, .NET/C#, Perl, PCRE, PHP (using PCRE), Ruby, RE2 which is widely used by Go, and Rust crate regex which is widely used by Rust.

The phrase "\Z" cannot become a generally-agreed symbol for end-of-string, because it's already used for "optional newline followed by end of string" on many platforms, including Java, .NET/C#, Perl, PCRE, PHP (using PCRE), and Ruby. Citations:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-or-before-ending-newline-z
https://www.pcre.org/original/doc/html/pcrepattern.html#SEC5
https://ruby-doc.org/core-2.5.8/Regexp.html

The Austin Group has decided to add \A and \z (not \Z) to POSIX EREs and encourage them for POSIX BREs: https://www.austingroupbugs.net/view.php?id=1919

Python already supports \A; they've decided to add support for \z for "end-of-string" and are discussing deprecating \Z for end-of-string. https://github.com/python/cpython/issues/133306

Anyway, I hope that \A ... \z would be supported here as well.
It'd be nice to be able to say "this works practically everywhere" instead of the complicated explanations that must be done today. Thanks!

diff --git a/generic/regc_lex.c b/generic/regc_lex.c
index bf936ca607..cf791bacf1 100644
--- a/generic/regc_lex.c
+++ b/generic/regc_lex.c
@@ -872,6 +872,7 @@ lexescape(
         NOTE(REG_ULOCALE);
         RETV(NWBDRY, 0);
         break;
+    case CHR('z'):
     case CHR('Z'):
         RETV(SEND, 0);
         break;
diff --git a/doc/re_syntax.n b/doc/re_syntax.n
index b2143493a1..c1c8971e53 100644
--- a/doc/re_syntax.n
+++ b/doc/re_syntax.n
@@ -136,6 +136,7 @@ below).
 .PP
 The difference between string and line matching modes is immaterial
 when the string does not contain a newline character. The \fB\eA\fR
+and \fB\ez\fR and
 and \fB\eZ\fR constraint escapes have a similar purpose but are always
 constraints for the overall string.
 .PP
@@ -452,6 +453,9 @@ matches only at the end of a word
 matches only at the beginning or end of a word
 .IP \fB\eY\fR 6
 matches only at a point that is not the beginning or end of a word
+.IP \fB\ez\fR 6
+matches only at the end of the string (see \fBMATCHING\fR, below, for
+how this differs from
 .IP \fB\eZ\fR 6
 matches only at the end of the string (see \fBMATCHING\fR, below, for
 how this differs from
@@ -707,7 +711,8 @@ expressions using \fB^\fR will never match the newline character (so
 that matches will never cross newlines unless the RE explicitly
 arranges it) and \fB^\fR and \fB$\fR will match the empty string after
 and before a newline respectively, in addition to matching at
-beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR
+beginning and end of string respectively. ARE \fB\eA\fR and
+\fB\ez\fR and \fB\eZ\fR
 continue to match beginning or end of string \fIonly\fR.
 .PP
 If partial newline-sensitive matching is specified, this affects
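
To make the input-validation concern in the rationale concrete, here is a minimal sketch, separate from the patch. It is written in Python only because the ticket notes that Python's \Z currently means strict end-of-string; in engines such as Perl or PCRE the equivalent strict anchor is \z. The pattern names and the `accepts` helper are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical validators for a lowercase alphanumeric identifier.
# '$' (without re.MULTILINE) matches at the end of the string AND just
# before a single trailing newline, so it lets "abc123\n" through.
LINE_ANCHORED = re.compile(r"^[a-z0-9]+$")

# \A and Python's \Z match only at the very start and very end of the
# string, which is the behaviour the ticket wants \A ... \z to have.
STRING_ANCHORED = re.compile(r"\A[a-z0-9]+\Z")


def accepts(pattern, text):
    """Return True if the compiled pattern finds a match in text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for candidate in ("abc123", "abc123\n"):
        print(
            repr(candidate),
            "line-anchored:", accepts(LINE_ANCHORED, candidate),
            "string-anchored:", accepts(STRING_ANCHORED, candidate),
        )
```

With the trailing-newline input, the `^...$` form accepts the value while the `\A...\Z` form rejects it, which is the kind of gap a portable end-of-string anchor is meant to close.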
User Comments:
jan.nijtmans added on 2025-05-14 10:59:10:
Put in a branch now: [31c03110d4deee2d]

jan.nijtmans added on 2025-05-07 13:37:24:
I don't have a problem with this change. Pity that it didn't come in a few months earlier. I'll discuss it

david-a-wheeler added on 2025-05-06 19:30:40:
(Quick note: The proposal is simply to *add* functionality, so this is backwards-compatible with existing Tcl programs.)

serhiy.storchaka added on 2025-05-06 18:18:36:
I should add that the reason why
