
Bug 14919 – utf/unicode should only be validated once

Status: NEW
Severity: enhancement
Priority: P4
Component: dmd
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2015-08-14T06:54:25Z
Last change time: 2024-12-13T18:44:12Z
Assigned to: No Owner
Creator: Martin Nowak
See also: https://issues.dlang.org/show_bug.cgi?id=14519

Moved to GitHub: dmd#19028 →

Comments

Comment #0 by code — 2015-08-14T06:54:25Z

Related/Alternative to issue 14519 (see https://issues.dlang.org/show_bug.cgi?id=14519#c24).

When I `readText` a file, a lot of time is already spent on UTF validation. But we don't take advantage of that, and we revalidate UTF in almost every algorithm. The idea from issue 14519 to substitute a replacement character for invalid chars makes the validation a little cheaper (because of the cost of dmd's EH, see issue 12442) but still incurs a high overhead.

I suggest that we distinguish cleanly between unvalidated ubyte[] data and strings, and treat all char[]/wchar[]/dchar[] strings as valid. The compiler already checks string literals, and a few of the string reading functions validate as well. Unfortunately, byLine and readln currently don't validate UTF.

This could be a much more performant approach to correct UTF handling.
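As a rough illustration of the proposed split, the sketch below validates raw bytes exactly once, at the point where they become a `string`. The helper name `toValidatedString` is hypothetical and not part of Phobos; only `std.utf.validate` and `std.utf.UTFException` are existing library names.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: the single place where untrusted bytes become a string.
/// Throws UTFException if `raw` is not well-formed UTF-8.
string toValidatedString(immutable(ubyte)[] raw)
{
    auto text = cast(string) raw; // reinterpret the bytes, no copy
    validate(text);               // the only validation pass
    return text;
}

unittest
{
    import std.exception : assertThrown;

    immutable(ubyte)[] good = [0x68, 0x69]; // "hi"
    immutable(ubyte)[] bad = [0x68, 0xFF];  // 0xFF never occurs in UTF-8

    assert(toValidatedString(good) == "hi");
    assertThrown!UTFException(toValidatedString(bad));
}
```

Under this model, code that receives a `string` would rely on the check having already happened, and anything still typed as ubyte[] would be treated as unvalidated.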

Comment #1 by code — 2015-08-14T06:54:57Z

(In reply to Vladimir Panteleev from comment https://issues.dlang.org/show_bug.cgi?id=14519#c25)
> Although I think this approach is acceptable (as long as the program halts
> regardless of compilation flags, which shouldn't be a problem), I would like
> to note that there are situations in which it is impractical to either
> convert or validate the data. One example is implementations of text-based
> network protocols (e.g. HTTP, NNTP, SMTP). Here, neither converting
> everything to UTF-8 or verifying that it is valid UTF-8 works, because
> text-based protocols often embed raw binary data. The program only needs to
> parse the ASCII text parts, so the ideal solution would be a string handling
> library which never decodes UTF-8 (something D doesn't have).

Such text protocols don't

Comment #2 by code — 2015-08-14T06:58:26Z

Such text protocols don't randomly contain binary data. It's properly delimited, either by text markers or by known offsets. So what you need to do is lazily validate and convert ubyte[] to ASCII/UTF, find the delimiters (could probably be done on ubyte[]), and skip validation for the binary blob. Vice versa for binary protocols that contain strings: first work on the binary data, then validate the extracted strings.
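A minimal sketch of that idea, assuming a made-up message layout in which a text header is separated from a binary payload by a blank line ("\r\n\r\n"). The `Message` struct and `parseMessage` function are hypothetical names for illustration. The delimiter search runs on the raw bytes, and only the header is validated.

```d
import std.algorithm.searching : findSplit;
import std.exception : enforce;
import std.utf : validate;

/// Hypothetical message type: text header plus untouched binary payload.
struct Message
{
    string header;              // validated once, in parseMessage
    immutable(ubyte)[] payload; // raw bytes, never validated
}

Message parseMessage(immutable(ubyte)[] packet)
{
    // Assumed delimiter for this sketch: "\r\n\r\n"
    static immutable ubyte[] delimiter = [0x0D, 0x0A, 0x0D, 0x0A];

    // Search on ubyte[]: no decoding and no validation needed here.
    auto parts = packet.findSplit(delimiter);
    enforce(parts[1].length != 0, "delimiter not found");

    auto header = cast(string) parts[0];
    validate(header); // validate only the text part
    return Message(header, parts[2]);
}

unittest
{
    import std.string : representation;

    immutable(ubyte)[] head = "Content-Length: 2\r\n\r\n".representation;
    immutable(ubyte)[] blob = [0xFF, 0xFE]; // not valid UTF-8, and that is fine
    auto msg = parseMessage(head ~ blob);

    assert(msg.header == "Content-Length: 2");
    assert(msg.payload == blob);
}
```

The payload keeps its ubyte[] type, so no string algorithm would ever try to decode it.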

Comment #3 by code — 2015-08-18T11:18:31Z

The transition could be done in the following order over several releases:

1. `deprecated("use UTFError instead") UTFException` and add `alias UTFError = UTFException`, so UTFError remains an Exception; use UTFError in all validations
2. make UTFError an Error; change all text reading functions (e.g. byLine) to eager validations
3. replace validations and UTFError with asserts
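To make the end state of such a transition concrete, here is a sketch of what an algorithm might look like once validation is only an assertion. The names `isValidUTF` and `countCodePoints` are hypothetical. The `in` contract is checked in normal builds and removed with `-release`, and the body never decodes or throws.

```d
import std.utf : validate, UTFException;

/// Hypothetical predicate wrapping std.utf.validate for use in contracts.
bool isValidUTF(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Counts code points, assuming the caller passes validated UTF-8.
size_t countCodePoints(const(char)[] s)
in (isValidUTF(s), "caller must pass validated UTF-8")
{
    size_t n;
    foreach (char c; s)
    {
        // Every byte that is not a continuation byte (10xxxxxx) starts a code point.
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

unittest
{
    assert(countCodePoints("h\u00E9llo") == 5); // 'é' takes two bytes, one code point
    assert(countCodePoints("") == 0);
}
```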

Comment #4 by robert.schadek — 2024-12-13T18:44:12Z

THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/dmd/issues/19028 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB