Bug 4250 – std.regex does not support character sets other than unicode

Status: ASSIGNED
Severity: enhancement
Priority: P4
Component: phobos
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2010-05-29T07:46:59Z
Last change time: 2024-12-01T16:13:24Z
Keywords: patch
Assigned to: Dmitry Olshansky
Creator: Lionello Lunesu
Moved to GitHub: phobos#9886

Attachments

ID | Filename | Summary | Content-Type | Size
647 | charstride.diff | Patch against phobos/std/regex.d in dmd.2.046.zip | application/octet-stream | 1248
648 | charstride.diff | Patch against phobos/std/regex.d in dmd.2.046.zip | text/plain | 1255
649 | test4250.d | Testcase (using GB18030 encoded date with std.regex) | application/octet-stream | 727

Comments

Comment #0 by lio+bugzilla — 2010-05-29T07:46:59Z
Created attachment 647: Patch against phobos/std/regex.d in dmd.2.046.zip
I'm writing an application that works with Chinese text encoded in GBK (http://en.wikipedia.org/wiki/GBK). I could convert all the text to UTF-8 first, before using regex, but it's much faster to leave the text as-is and convert only the regular expression to GBK instead.
I suspect the following opcodes need patching:
1. REanychar uses std.utf.stride;
2. REdchar and REidchar are used when the character in the regex is >= 0x80;
3. REichar and REidchar use std.ctype.toupper (during creation and execution).
Points 1 and 3 are easily solved by providing the user with callback functions. To prevent unnecessary indirection, these can be aliases if (is(__traits(compiles, std.utf.stride(new E[], 0)))).
Attached is a proof-of-concept patch for point 1. If this is OK, I can do the same for points 2 and 3 as well. (Point 2 might not even need a patch; I'm not clear about that yet.)
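For illustration only, here is a minimal sketch of what a user-supplied stride function for point 1 might look like. It is not taken from the attached patch. The names gbkStride and countChars are hypothetical and not part of Phobos, and the sketch assumes plain GBK (one- or two-byte characters), not the four-byte sequences that GB18030 adds.

```d
// Hypothetical helper, not part of Phobos: number of bytes occupied by
// the GBK character that starts at s[i].
size_t gbkStride(const(ubyte)[] s, size_t i) pure nothrow @nogc @safe
{
    // In GBK, bytes 0x00-0x7F are single-byte (ASCII) characters.
    // A lead byte in 0x81-0xFE is followed by exactly one trail byte.
    immutable b = s[i];
    if (b >= 0x81 && b <= 0xFE && i + 1 < s.length)
        return 2;
    return 1; // ASCII, or a truncated/invalid lead byte treated as one unit
}

// Walks a buffer using a stride function passed as an alias parameter,
// so the call is resolved at compile time with no indirection.
size_t countChars(alias stride)(const(ubyte)[] s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i += stride(s, i))
        ++n;
    return n;
}

unittest
{
    // 'A' followed by the two-byte GBK sequence 0xD6 0xD0
    immutable(ubyte)[] text = [0x41, 0xD6, 0xD0];
    assert(gbkStride(text, 0) == 1);
    assert(gbkStride(text, 1) == 2);
    assert(countChars!gbkStride(text) == 2);
}
```

Passing the stride function as an alias template parameter is one way to get the "no unnecessary indirection" property mentioned above; a delegate would work too, at the cost of an indirect call per character.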
Comment #1 by lio+bugzilla — 2010-05-29T17:53:38Z
Created attachment 648 Patch against phobos/std/regex.d in dmd.2.046.zip Fixed the diff.
Comment #2 by lio+bugzilla — 2010-05-29T18:02:11Z
Created attachment 649 Testcase (using GB18030 encoded date with std.regex)
Comment #3 by bugzilla — 2010-05-30T11:02:48Z
It's not designed to do anything but UTF, so marked as an enhancement request.
Comment #4 by dmitry.olsh — 2012-03-12T03:34:56Z
The first straightforward step would be to add an option to skip UTF processing on the assumption that the input is plain ASCII; that covers an important use case. The next move largely depends on std.encoding or whatever replaces it.
Comment #5 by dmitry.olsh — 2012-07-22T08:21:13Z
Comment on attachment 648 Patch against phobos/std/regex.d in dmd.2.046.zip Old regex is gone for good since 2.056.
Comment #6 by robert.schadek — 2024-12-01T16:13:24Z
This issue has been moved to GitHub: https://github.com/dlang/phobos/issues/9886 Do not comment here anymore; nobody will see it.