Bug 14919 – utf/unicode should only be validated once

Status
NEW
Severity
enhancement
Priority
P4
Component
dmd
Product
D
Version
D2
Platform
All
OS
All
Creation time
2015-08-14T06:54:25Z
Last change time
2024-12-13T18:44:12Z
Assigned to
No Owner
Creator
Martin Nowak
See also
https://issues.dlang.org/show_bug.cgi?id=14519
Moved to GitHub: dmd#19028

Comments

Comment #0 by code — 2015-08-14T06:54:25Z
Related/alternative to issue 14519 (see https://issues.dlang.org/show_bug.cgi?id=14519#c24). When I `readText` a file, a lot of time is already spent on UTF validation. But we don't take advantage of that, and we revalidate UTF in almost every algorithm. The idea from issue 14519 to replace invalid chars with a replacement character makes the validation a little cheaper (because of the cost of dmd's EH, see issue 12442) but still incurs a high overhead. I suggest that we make a clean distinction: ubyte[] is unvalidated data, and all char[]/wchar[]/dchar[] strings are treated as valid. The compiler already checks string literals, and a few of the string reading functions validate as well. Unfortunately, byLine and readln currently don't validate UTF. This could be a much more performant approach to correct UTF handling.
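A minimal sketch of the proposed split, assuming raw input arrives as `immutable(ubyte)[]`. The helper names `validated` and `countSpaces` are hypothetical and not part of Phobos; only `std.utf.validate` and `UTFException` are existing library symbols. Validation happens once at the boundary, and later code works on code units without decoding again.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

/// Hypothetical boundary helper: checks raw bytes once and hands them out as a string.
string validated(immutable(ubyte)[] raw)
{
    auto text = cast(string) raw; // reinterpret the bytes, no copy
    validate(text);               // throws UTFException on malformed UTF-8
    return text;
}

/// Hypothetical algorithm that trusts its input: it walks code units
/// and never decodes or revalidates.
size_t countSpaces(string text)
{
    size_t n;
    foreach (char c; text) // `char` iteration visits code units, not code points
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    immutable(ubyte)[] good = [0x68, 0x69, 0x20, 0xC3, 0xA9]; // "hi é" in UTF-8
    immutable(ubyte)[] bad = [0x68, 0xFF];                    // 0xFF never occurs in UTF-8

    string text = validated(good);
    assert(countSpaces(text) == 1);
    assertThrown!UTFException(validated(bad));
}
```

The cast is the only place where `ubyte[]` becomes `string`, so that is the only place that pays for validation.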
Comment #1 by code — 2015-08-14T06:54:57Z
(In reply to Vladimir Panteleev from comment https://issues.dlang.org/show_bug.cgi?id=14519#c25)
> Although I think this approach is acceptable (as long as the program halts
> regardless of compilation flags, which shouldn't be a problem), I would like
> to note that there are situations in which it is impractical to either
> convert or validate the data. One example is implementations of text-based
> network protocols (e.g. HTTP, NNTP, SMTP). Here, neither converting
> everything to UTF-8 or verifying that it is valid UTF-8 works, because
> text-based protocols often embed raw binary data. The program only needs to
> parse the ASCII text parts, so the ideal solution would be a string handling
> library which never decodes UTF-8 (something D doesn't have).
Such text protocols don't
Comment #2 by code — 2015-08-14T06:58:26Z
Such text protocols don't randomly contain binary data. It's properly delimited, either by text markers or by known offsets. So what you need to do is lazily validate and convert ubyte[] to ASCII/UTF, find the delimiters (which could probably be done on ubyte[]), and skip validation for the binary blob. Vice versa for binary protocols that contain strings: first work on the binary data, and then validate the extracted strings.
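A rough sketch of that approach, assuming a simplified message layout where a blank line (`\r\n\r\n`) separates a text header from a binary payload. The `Message` struct and `parseMessage` function are hypothetical names for illustration. The delimiter search runs on `ubyte[]`, only the header is validated, and the payload stays raw.

```d
import std.algorithm.searching : findSplit;
import std.string : representation;
import std.utf : validate;

/// Hypothetical result type: validated text plus untouched binary data.
struct Message
{
    string header;              // validated UTF-8
    immutable(ubyte)[] payload; // raw bytes, never validated
}

/// Hypothetical parser: splits on the delimiter at the byte level,
/// then validates only the text part.
Message parseMessage(immutable(ubyte)[] raw)
{
    auto parts = raw.findSplit("\r\n\r\n".representation);
    if (parts[1].length == 0)
        throw new Exception("missing header/payload delimiter");

    auto header = cast(string) parts[0];
    validate(header); // throws UTFException if the header is malformed
    return Message(header, parts[2]);
}

void main()
{
    immutable(ubyte)[] blob = [0xFF, 0xD8, 0x00]; // not valid UTF-8, and that is fine
    immutable(ubyte)[] raw =
        "Content-Type: image/jpeg\r\n\r\n".representation ~ blob;

    auto msg = parseMessage(raw);
    assert(msg.header == "Content-Type: image/jpeg");
    assert(msg.payload == blob);
}
```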
Comment #3 by code — 2015-08-18T11:18:31Z
The transition could be done in the following order over several releases:
1. Mark UTFException as `deprecated("use UTFError instead")` and add `alias UTFError = UTFException`, so UTFError remains an Exception. Use UTFError in all validations.
2. Make UTFError an Error. Change all text reading functions (e.g. byLine) to eager validation.
3. Replace validations and UTFError with asserts.
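To illustrate where step 3 ends up, here is a small sketch under the assumption that every string has been validated at the point where it was read. `isValidUTF8` and `codePointCount` are hypothetical helpers, not Phobos functions. The assert documents the precondition and disappears in `-release` builds, so the algorithm itself pays nothing for validation.

```d
import std.utf : validate;

/// Hypothetical predicate wrapping std.utf.validate for use in asserts.
bool isValidUTF8(const(char)[] s) nothrow
{
    try
    {
        validate(s);
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}

/// Hypothetical algorithm that relies on validity: in well-formed UTF-8,
/// every byte that is not a continuation byte (10xxxxxx) starts a code point.
size_t codePointCount(const(char)[] s)
{
    assert(isValidUTF8(s), "caller passed unvalidated text");
    size_t n;
    foreach (char c; s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

void main()
{
    assert(codePointCount("héllo") == 5); // 6 bytes, 5 code points
}
```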
Comment #4 by robert.schadek — 2024-12-13T18:44:12Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/dmd/issues/19028 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB