Bug 3136 – Incorrect and strange behavior of std.regexp.RegExp if using a pattern with optional prefix and suffix longer than 1 char
Status
RESOLVED
Resolution
FIXED
Severity
major
Priority
P2
Component
phobos
Product
D
Version
D2
Platform
x86
OS
Windows
Creation time
2009-07-04T17:37:00Z
Last change time
2015-06-09T01:27:58Z
Keywords
patch, wrong-code
Assigned to
dmitry.olsh
Creator
marcellognani
Comments
Comment #0 by marcellognani — 2009-07-04T17:37:01Z
It seems that std.regexp.RegExp gets confused when I use a pattern with an optional prefix and suffix longer than 1 char.
An expression of the form ([A]{0,2})(C)([D]{0,2}) matches all of "AC", "BC", "CD", "CE", "ACD", "BCE", "ABCDE", "C" (as expected).
An expression of the form ([AB]{0,2})(C)([DE]{0,2}) or ([AB]?[AB]?)(C)([DE]?[DE]?) fails (incorrectly and unexpectedly) in some of the cases above ("CD", "CE" and "C", for example).
Here is the code:
---
import std.regexp;
import std.stdio;
public
{
    static void main()
    {
        RegExp eTest;
        void SetExp(string pattern)
        {
            eTest=new RegExp(pattern,"g");
            std.stdio.writeln("Testing expression ",pattern);
        }
        void TryString(string s)
        {
            std.stdio.writeln("Trying on string\"",s,"\":");
            auto captures=eTest.exec(s);
            if(captures.length)
            {
                std.stdio.writeln("Success!");
                foreach(uint i,string capture;captures)
                    std.stdio.writeln(i,"): \"",capture,"\"");
            }
            else
            {
                std.stdio.writeln("Failure!");
            }
        }
        SetExp(r"([A]{0,2})(C)([D]{0,2})");
        TryString("AC");
        TryString("BC");
        TryString("CD");
        TryString("CE");
        TryString("ACD");
        TryString("BCE");
        TryString("ABCDE");
        TryString("C");
        TryString("F");
        SetExp(r"([AB]{0,2})(C)([DE]{0,2})");
        TryString("AC");
        TryString("BC");
        TryString("CD");
        TryString("CE");
        TryString("ACD");
        TryString("BCE");
        TryString("ABCDE");
        TryString("C");
        TryString("F");
        SetExp(r"([AB]?[AB]?)(C)([DE]?[DE]?)");
        TryString("AC");
        TryString("BC");
        TryString("CD");
        TryString("CE");
        TryString("ACD");
        TryString("BCE");
        TryString("ABCDE");
        TryString("C");
        TryString("F");
    }
}
---
Here is the output:
---
Testing expression ([A]{0,2})(C)([D]{0,2})
Trying on string"AC":
Success!
0): "AC"
1): "A"
2): "C"
3): ""
Trying on string"BC":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"CD":
Success!
0): "CD"
1): ""
2): "C"
3): "D"
Trying on string"CE":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"ACD":
Success!
0): "ACD"
1): "A"
2): "C"
3): "D"
Trying on string"BCE":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"ABCDE":
Success!
0): "CD"
1): ""
2): "C"
3): "D"
Trying on string"C":
Success!
0): "C"
1): ""
2): "C"
3): ""
Trying on string"F":
Failure!
Testing expression ([AB]{0,2})(C)([DE]{0,2})
Trying on string"AC":
Success!
0): "AC"
1): "A"
2): "C"
3): ""
Trying on string"BC":
Success!
0): "BC"
1): "B"
2): "C"
3): ""
Trying on string"CD":
Failure!
Trying on string"CE":
Failure!
Trying on string"ACD":
Success!
0): "ACD"
1): "A"
2): "C"
3): "D"
Trying on string"BCE":
Success!
0): "BCE"
1): "B"
2): "C"
3): "E"
Trying on string"ABCDE":
Success!
0): "ABCDE"
1): "AB"
2): "C"
3): "DE"
Trying on string"C":
Failure!
Trying on string"F":
Failure!
Testing expression ([AB]?[AB]?)(C)([DE]?[DE]?)
Trying on string"AC":
Success!
0): "AC"
1): "A"
2): "C"
3): ""
Trying on string"BC":
Success!
0): "BC"
1): "B"
2): "C"
3): ""
Trying on string"CD":
Failure!
Trying on string"CE":
Failure!
Trying on string"ACD":
Success!
0): "ACD"
1): "A"
2): "C"
3): "D"
Trying on string"BCE":
Success!
0): "BCE"
1): "B"
2): "C"
3): "E"
Trying on string"ABCDE":
Success!
0): "ABCDE"
1): "AB"
2): "C"
3): "DE"
Trying on string"C":
Failure!
Trying on string"F":
Failure!
---
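For comparison, the expected results for the second pattern can be written as assertions. This is only a sketch: it assumes the newer std.regex module (with its matchFirst and regex functions) rather than the std.regexp.RegExp class used above, so it shows the expected behavior and does not reproduce the bug itself.

```d
import std.regex : matchFirst, regex;

void main()
{
    // Same pattern as the second test above: optional prefix and suffix.
    auto re = regex(r"([AB]{0,2})(C)([DE]{0,2})");

    // A string without the optional prefix must still match.
    auto m = matchFirst("CD", re);
    assert(!m.empty);
    assert(m[0] == "CD" && m[1] == "" && m[2] == "C" && m[3] == "D");

    // The other cases reported as failing, plus one full match.
    assert(matchFirst("CE", re)[0] == "CE");
    assert(matchFirst("C", re)[0] == "C");
    assert(matchFirst("ABCDE", re)[0] == "ABCDE");

    // "F" contains no "C", so no match is expected.
    assert(matchFirst("F", re).empty);
}
```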
Kind regards,
Marcello Gnani
Comment #1 by marcellognani — 2009-07-08T12:06:26Z
I had time to investigate further; the problem is caused by an incorrect optimization that Phobos performs on the optional prefix.
The constructor of the RegExp object calls "public void compile(string pattern, string attributes)", which builds a correct internal RegExp program. It then attempts an optimization by calling the "void optimize()" function. In this function, while optimizing the REbit opcode (the opcode that implements the prefix match when the prefix class has more than one letter), the optionality of the prefix is lost, which leads to the incorrect behavior reported.
The simplest patch I came up with is a slight modification of the "int starrchars(Range r, const(ubyte)[] prog)" function (which is called by "optimize"). The current code:
. . .
    case REnm:
    case REnmq:
        // len, n, m, ()
        len = (cast(uint *)&prog[i + 1])[0];
        n = (cast(uint *)&prog[i + 1])[1];
        m = (cast(uint *)&prog[i + 1])[2];
        pop = &prog[i + 1 + uint.sizeof * 3];
        if (!starrchars(r, pop[0 .. len]))
            return 0;
        if (n)
            return 1;
        i += 1 + uint.sizeof * 3 + len;
        break;
. . .
This case should return 0 if the n operand of the REnm opcode is 0 (this changes the line before the break statement); that avoids the insertion of the optionality-killing first filter:
. . .
    case REnm:
    case REnmq:
        // len, n, m, ()
        len = (cast(uint *)&prog[i + 1])[0];
        n = (cast(uint *)&prog[i + 1])[1];
        m = (cast(uint *)&prog[i + 1])[2];
        pop = &prog[i + 1 + uint.sizeof * 3];
        if (!starrchars(r, pop[0 .. len]))
            return 0;
        if (n)
            return 1;
        return 0;
        break;
. . .
I tried it and it works now.
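The effect can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the Phobos code: FirstFilter, buildFilter and passes are hypothetical names, and the model only assumes that the optimizer installs a set of allowed first characters built from the prefix class.

```d
// Toy model of a first-character filter; NOT the Phobos implementation.
struct FirstFilter
{
    bool active;  // false: every start position is tried
    string chars; // allowed first characters when active
}

// Build the filter from the prefix class. minRepeat stands for the n
// operand of the {n,m} quantifier. With patched == false the optionality
// of the prefix is ignored, as in the reported behavior.
FirstFilter buildFilter(string prefixChars, uint minRepeat, bool patched)
{
    if (patched && minRepeat == 0)
        return FirstFilter(false, null); // optional prefix: install no filter
    return FirstFilter(true, prefixChars);
}

// True when the matcher would attempt a match starting at input[0].
bool passes(FirstFilter f, string input)
{
    if (!f.active)
        return true;
    if (input.length == 0)
        return false;
    foreach (char c; f.chars)
        if (c == input[0])
            return true;
    return false;
}

void main()
{
    // Prefix [AB]{0,2}: the class is "AB" and n is 0.
    auto broken = buildFilter("AB", 0, false);
    auto fixed = buildFilter("AB", 0, true);

    assert(passes(broken, "AC"));  // starts with a prefix character
    assert(!passes(broken, "CD")); // rejected before matching even starts
    assert(passes(fixed, "CD"));   // no filter, so the match is attempted
    assert(passes(fixed, "C"));
}
```

In this model the unpatched filter rejects "CD", "CE" and "C" outright, which is consistent with the failures in the output of comment #0.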
Maybe this also fixes some other regexp bugs that are still open.
Best regards,
Marcello Gnani