
Bug 4250 – std.regex does not support character sets other than unicode

Status: ASSIGNED
Severity: enhancement
Priority: P4
Component: phobos
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2010-05-29T07:46:59Z
Last change time: 2024-12-01T16:13:24Z
Keywords: patch
Assigned to: Dmitry Olshansky
Creator: Lionello Lunesu

Attachments

ID	Filename	Summary	Content-Type	Size
647	charstride.diff	Patch against phobos/std/regex.d in dmd.2.046.zip	application/octet-stream	1248
648	charstride.diff	Patch against phobos/std/regex.d in dmd.2.046.zip	text/plain	1255
649	test4250.d	Testcase (using GB18030 encoded date with std.regex)	application/octet-stream	727

Comments

Comment #0 by lio+bugzilla — 2010-05-29T07:46:59Z

Created attachment 647
Patch against phobos/std/regex.d in dmd.2.046.zip

I'm writing an application that works with Chinese text encoded in GBK (http://en.wikipedia.org/wiki/GBK). I could convert all the text to UTF-8 first, before using regex, but it's much faster to leave the text as-is and convert only the regular expression to GBK instead.

I suspect the following opcodes need patching:

1. REanychar uses std.utf.stride;
2. REdchar and REidchar are used when the character in the regex is >= 0x80;
3. REichar and REidchar use std.ctype.toupper (during creation and execution).

Points 1 and 3 are easily solved by providing the user with callback functions. To prevent unnecessary indirection, these can be aliases if (is(__traits(compiles, std.utf.stride(new E[], 0)))).

Attached is a proof-of-concept patch for point 1. If this is OK, I can do the same for points 2 and 3 as well. (Point 2 might not even need a patch; I'm not clear about that yet.)
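For readers following point 1, here is a minimal sketch in D of what a pluggable stride function could look like. It is not taken from the attached patch: `gbkStride` and `countChars` are hypothetical names, and the GBK lead-byte range (0x81 to 0xFE) is an assumption made for illustration. Only `std.utf.stride` is an existing Phobos function.

```d
import std.utf : stride;

/// Hypothetical GBK counterpart to std.utf.stride: returns the number of
/// bytes occupied by the character starting at s[i].
/// Assumption: a byte in 0x81 .. 0xFE is a lead byte followed by one trail byte.
size_t gbkStride(const(ubyte)[] s, size_t i) pure nothrow @nogc @safe
{
    immutable ubyte b = s[i];
    immutable bool isLead = b >= 0x81 && b <= 0xFE;
    // A lead byte at the very end of the input is treated as a single unit.
    return (isLead && i + 1 < s.length) ? 2 : 1;
}

/// Counts characters by repeatedly advancing with the supplied stride
/// function, passed as an alias so no runtime indirection is involved.
size_t countChars(alias strideFn, E)(const(E)[] s)
{
    size_t count = 0;
    for (size_t i = 0; i < s.length; i += strideFn(s, i))
        ++count;
    return count;
}

void main()
{
    // "中文" encoded in GBK (D6 D0 CE C4) followed by ASCII 'a'.
    immutable ubyte[] gbk = [0xD6, 0xD0, 0xCE, 0xC4, 0x61];
    assert(countChars!gbkStride(gbk) == 3);

    // The same text as UTF-8, using the existing std.utf.stride.
    string utf8 = "中文a";
    assert(countChars!stride(utf8) == 3);
}
```

Passing the stride function as an alias parameter mirrors the "aliases instead of callbacks" idea above: the UTF case keeps calling `std.utf.stride` directly, while other encodings supply their own function.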

Comment #1 by lio+bugzilla — 2010-05-29T17:53:38Z

Created attachment 648
Patch against phobos/std/regex.d in dmd.2.046.zip

Fixed the diff.

Comment #2 by lio+bugzilla — 2010-05-29T18:02:11Z

Created attachment 649
Testcase (using GB18030 encoded date with std.regex)

Comment #3 by bugzilla — 2010-05-30T11:02:48Z

It's not designed to do anything but UTF, so marked as an enhancement request.

Comment #4 by dmitry.olsh — 2012-03-12T03:34:56Z

The first straightforward step would be to add an option to skip UTF processing on the assumption that the input is plain ASCII; that covers an important use case. The next move largely depends on std.encoding, or whatever replaces it.

Comment #5 by dmitry.olsh — 2012-07-22T08:21:13Z

Comment on attachment 648
Patch against phobos/std/regex.d in dmd.2.046.zip

Old regex is gone for good since 2.056.

Comment #6 by robert.schadek — 2024-12-01T16:13:24Z

THIS ISSUE HAS BEEN MOVED TO GITHUB: https://github.com/dlang/phobos/issues/9886

DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT.