Bug 11765 – std.regex: Negation of character class is not applied to base class first

Status
NEW
Severity
normal
Priority
P3
Component
phobos
Product
D
Version
D2
Platform
All
OS
All
Creation time
2013-12-18T11:37:06Z
Last change time
2024-12-01T16:19:37Z
Assigned to
Dmitry Olshansky
Creator
Andrej Mitrovic
Moved to GitHub: phobos#10021

Comments

Comment #0 by andrej.mitrovich — 2013-12-18T11:37:06Z
-----
import std.regex;
import std.stdio;

void main()
{
    // expected: [["3"]] - but got: [["2"]]
    writeln("123456789".match("[^1--[2]]"));

    // the above is *currently* equivalent to:
    writeln("123456789".match("[^[1--[2]]]"));
    // which means: subtract "1 - 2" (equals 1),
    // and then negate it (so "2" will match first in the string)

    // but I expect the first case to be equivalent to:
    writeln("123456789".match("[[^1]--[2]]"));
    // which means: negate 1 (for discussion assume 2-9 range),
    // subtract 2 and you get 3-9, which means "3" will match first.
}
-----

I'm not sure whether this is just how ECMAScript does it (since std.regex references it), but e.g. .NET does negation on the base class first (the "1" class above) and *then* it does subtraction with another class. You can test this behavior here: http://refiddle.com/

Using .NET syntax:

[^01-[2]]
0123456789

It matches "3".

Either way, if this report is invalid (e.g. expected behavior), then I think we should update the docs so they state the precedence of the negation.
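The two readings described in comment #0 can be modelled with plain sets, without going through std.regex at all. The following is a minimal sketch under stated assumptions: the "universe" is limited to the digits 1-9 so that negation stays finite, and `universe`, `negate`, `subtract` and `firstMatch` are hypothetical helpers written for this illustration, not Phobos functions.

```d
import std.algorithm : canFind;
import std.stdio : writeln;

// Assumption for illustration: only the digits of the haystack exist,
// so the complement of a class is a finite, printable set.
enum universe = "123456789";

// Complement of `set` within the universe (models '^').
string negate(string set)
{
    string result;
    foreach (char c; universe)
        if (!set.canFind(c))
            result ~= c;
    return result;
}

// Characters of `a` that are not in `b` (models '--').
string subtract(string a, string b)
{
    string result;
    foreach (char c; a)
        if (!b.canFind(c))
            result ~= c;
    return result;
}

// First haystack character that belongs to `set`, or "" if none does.
string firstMatch(string haystack, string set)
{
    foreach (i, char c; haystack)
        if (set.canFind(c))
            return haystack[i .. i + 1];
    return "";
}

void main()
{
    // Reading A (reported behavior), [^[1--[2]]]: subtract first, then negate.
    auto subtractFirst = negate(subtract("1", "2"));
    // Reading B (expected, .NET-like), [[^1]--[2]]: negate first, then subtract.
    auto negateFirst = subtract(negate("1"), "2");

    writeln(subtractFirst, " -> ", firstMatch(universe, subtractFirst)); // 23456789 -> 2
    writeln(negateFirst, " -> ", firstMatch(universe, negateFirst));     // 3456789 -> 3
}
```

The only difference between the two readings is the order in which `negate` and `subtract` are applied, which is exactly the precedence question raised in this report.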
Comment #1 by andrej.mitrovich — 2013-12-18T11:38:32Z
(In reply to comment #0)
> Using .net syntax:
> [^01-[2]]
> 0123456789
>
> It matches "3".

Never mind the leading zero, I meant to use this simpler example:

[^1-[2]]
123456789

It matches "3".
Comment #2 by dmitry.olsh — 2013-12-18T11:56:14Z
(In reply to comment #0)
> I'm not sure whether this is just how ECMAScript does it (since std.regex
> references it), but e.g. .NET does negation on the base class first (The "1"
> class above) and *then* it does subtraction with another class.

ECMAScript doesn't even have it AFAIK ;)

I think you (and .NET) are right - the priority of the unary '^' operator should be higher than that of any binary op.
Comment #3 by andrej.mitrovich — 2013-12-19T04:55:48Z
Is the following sample caused by the same issue?

writeln("abcdefghijklmnopqrstuvwxyz".match("[a-z&&[^aeiuo]]"));

It writes [["a"]], I was expecting the first non-vowel [["b"]]. It returns "b" in Ruby, as for .NET I haven't found the syntax it uses.
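The expected result for the intersection sample can be checked with the same kind of set model. This is only a sketch of the intended semantics, assuming a universe of the lowercase ASCII letters; `alphabet`, `negate` and `intersect` are hypothetical helpers for illustration and say nothing about how std.regex evaluates the pattern internally.

```d
import std.algorithm : canFind;
import std.stdio : writeln;

// Assumption for illustration: the universe is just a-z.
enum alphabet = "abcdefghijklmnopqrstuvwxyz";

// Complement of `set` within the alphabet (models the nested [^...]).
string negate(string set)
{
    string result;
    foreach (char c; alphabet)
        if (!set.canFind(c))
            result ~= c;
    return result;
}

// Characters present in both `a` and `b` (models '&&').
string intersect(string a, string b)
{
    string result;
    foreach (char c; a)
        if (b.canFind(c))
            result ~= c;
    return result;
}

void main()
{
    // [a-z&&[^aeiuo]]: all letters that are also non-vowels.
    auto consonants = intersect(alphabet, negate("aeiuo"));
    writeln(consonants);         // bcdfghjklmnpqrstvwxyz
    writeln(consonants[0 .. 1]); // b
}
```

Under this model the first character of the alphabet that survives the intersection is "b", which agrees with the Ruby result quoted in the comment.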
Comment #4 by dmitry.olsh — 2013-12-19T10:27:35Z
(In reply to comment #1)
> (In reply to comment #0)
> > Using .net syntax:
> > [^01-[2]]
> > 0123456789
> >
> > It matches "3".
>
> Nevermind the leading zero, I meant to use this simpler example:
>
> [^1-[2]]
> 123456789
>
> It matches "3".

Actually, because of the single dash it works as if all is fine...

This one is a good case:

[^1--[2]]
Comment #5 by dmitry.olsh — 2013-12-19T10:31:23Z
(In reply to comment #3)
> Is the following sample caused by the same issue?
>
> writeln("abcdefghijklmnopqrstuvwxyz".match("[a-z&&[^aeiuo]]"));
>
> It writes [["a"]], I was expecting the first non-vowel [["b"]]. It returns "b"
> in Ruby, as for .NET I haven't found the syntax it uses.

From the look of it - an unrelated bug in set intersection.
Better split it off as a new issue.
Comment #6 by andrej.mitrovich — 2013-12-20T00:51:17Z
(In reply to comment #5)
> (In reply to comment #3)
> > Is the following sample caused by the same issue?
> >
> > writeln("abcdefghijklmnopqrstuvwxyz".match("[a-z&&[^aeiuo]]"));
> >
> > It writes [["a"]], I was expecting the first non-vowel [["b"]]. It returns "b"
> > in Ruby, as for .NET I haven't found the syntax it uses.
>
> From the look of it - an unrelated bug in set intersection.
> Better split it off as a new issue.

Filed as Issue 11784.
Comment #7 by dmitry.olsh — 2014-01-10T12:24:42Z
Ruby makes me nervous:

print /[^abc[e-f]&&[ybc]]/.match('~haystack')

Prints '~', meaning that the ^ operator has _lower_ priority than '&&'. I'm surprised, but it's the precedent.

And indeed the following reports an empty set and warnings about '-' without escape, i.e. '--' is not supported...

print /[^1--[2]]/.match("0123456789")

re.rb:2: warning: character class has '-' without escape: /[^2--[1]]/
re.rb:2: empty range in char class: /[^2--[1]]/

> [^1-[2]]
> 123456789
>
> It matches "3".

And .NET is disappointing: [^[2]-1] doesn't match anything. They somehow special-cased only the form [..-[set]] and arbitrary nesting of it.

So we have no good precedents.

My thoughts are to make it a proper operator precedence grammar with priorities:

0 - implicit union (pieces that stand together, evaluated first)
1 - ^ (negation)
2 - &&
3 - --
4 - || (explicit union, evaluated last)
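To see how the priorities proposed above would group a class expression, here is a small precedence-climbing sketch. It is an illustration under stated assumptions, not the std.regex parser: the input is already split into tokens, each operand token stands for an already-joined implicit union (priority 0), the input is assumed to be well-formed, and `priority`, `Grouper` and `group` are hypothetical names introduced for this example.

```d
import std.stdio : writeln;

// Priorities from the list above: a lower number binds tighter.
int priority(string op)
{
    switch (op)
    {
        case "^":  return 1;
        case "&&": return 2;
        case "--": return 3;
        case "||": return 4;
        default:   return -1; // operand, not an operator
    }
}

// Adds explicit parentheses to a token list according to `priority`.
struct Grouper
{
    string[] tokens;
    size_t pos;

    // Parses binary operators whose priority number is at most `limit`.
    string parse(int limit = 4)
    {
        string lhs = unary();
        while (pos < tokens.length)
        {
            immutable p = priority(tokens[pos]);
            if (p < 2 || p > limit)
                break; // not a binary operator, or it binds too loosely here
            immutable op = tokens[pos++];
            // The right-hand side may only absorb tighter operators,
            // which makes operators of equal priority left-associative.
            string rhs = parse(p - 1);
            lhs = "(" ~ lhs ~ " " ~ op ~ " " ~ rhs ~ ")";
        }
        return lhs;
    }

    // Unary '^' applies to the single operand that follows it.
    string unary()
    {
        if (tokens[pos] == "^")
        {
            ++pos;
            return "(^" ~ tokens[pos++] ~ ")";
        }
        return tokens[pos++];
    }
}

string group(string[] tokens)
{
    auto g = Grouper(tokens);
    return g.parse();
}

void main()
{
    writeln(group(["^", "1", "--", "2"]));       // ((^1) -- 2)
    writeln(group(["a", "--", "b", "&&", "c"])); // (a -- (b && c))
    writeln(group(["a", "||", "b", "--", "c"])); // (a || (b -- c))
}
```

With these priorities, [^1--[2]] groups as "negate 1, then subtract 2", which is the behavior requested in comment #0.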
Comment #8 by robert.schadek — 2024-12-01T16:19:37Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/phobos/issues/10021 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB