Consider:
@safe pure nothrow @nogc void bar();

void foo(string s) @safe pure nothrow @nogc
{
    foreach (dchar c; s)
        bar();
}
This fails to compile because foreach over a string decodes it, and the decoding can throw. It also cannot be @nogc, because the throw can allocate.
Changing foreach to return replacementDchar on invalid UTF encodings fixes these problems, and makes it possible to do faster loops.
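For comparison, std.utf.byDchar is today's opt-in form of this behavior; a minimal sketch (the literal bytes are illustrative):

import std.algorithm.searching : count;
import std.utf : byDchar, replacementDchar;

void main()
{
    string s = "abc\xFFdef"; // 0xFF never appears in valid UTF-8
    // byDchar substitutes U+FFFD instead of throwing:
    assert(s.byDchar.count(replacementDchar) == 1);
    // foreach (dchar c; s) over the same string throws UTFException today;
    // the proposal would give foreach this non-throwing behavior.
}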
Comment #1 by dlang-bugzilla — 2015-04-29T01:54:03Z
As discussed in the newsgroup: please don't do this.
Comment #2 by issues.dlang — 2015-04-29T06:13:38Z
(In reply to Vladimir Panteleev from comment #1)
> As discussed in the newsgroup: please don't do this.
If anything, I thought that the consensus led towards making the change, but maybe I didn't pay enough attention to the discussion in question.
Certainly, I'm very much in favor of using replacementDchar on invalid UTF in general and having code explicitly validate Unicode if that's what's desired. There is some risk in making the change, since existing code may not handle the replacement character well, but very little string processing will actually care. And throwing exceptions like we currently do is not only a performance problem; it's hugely annoying when you need to actually process invalid Unicode, which is pretty easy to run into if you're parsing websites or files written by programs that didn't handle Unicode correctly.
Comment #3 by dlang-bugzilla — 2015-04-29T06:15:25Z
> it's hugely annoying when you need to actually process invalid unicode
Yeah, and it'll be a lot more than just "annoying" when you discover too late that your data's been irreversibly corrupted because it wasn't in the correct encoding.
Comment #4 by dlang-bugzilla — 2015-04-29T06:25:42Z
Here's a counter-proposal: when encountering invalid UTF-8, instead of throwing exceptions, throw errors. This will fix the nothrow and performance problems, and will avoid the risk of data corruption. The workaround is to pre-sanitize the input. The impact of breaking existing code is the same as the original proposal.
Comment #5 by dfj1esp02 — 2015-04-29T07:20:23Z
Or provide a global override similar to assertHandler.
Comment #6 by issues.dlang — 2015-04-29T07:36:51Z
(In reply to Vladimir Panteleev from comment #4)
> Here's a counter-proposal: when encountering invalid UTF-8, instead of
> throwing exceptions, throw errors. This will fix the nothrow and performance
> problems, and will avoid the risk of data corruption.
Yikes. That is far worse than throwing Exceptions, since it would kill your program, and an Error indicates a bug in the program, whereas invalid Unicode is merely bad input.
> The workaround is to
> pre-sanitize the input. The impact of breaking existing code is the same as
> the original proposal.
Pre-sanitizing input is exactly what should be done if you care about unicode validation. You validate any strings entering the program from a file, a socket, or from user input, and then you know that you're operating on valid Unicode. But most programs just don't care about how valid the Unicode is, and the fact that throwing is how it's handled is incredibly annoying. It forces validation on all programs whether they need it or not, and it makes it so that string-based code can pretty much never be nothrow. Using the replacement character in the stead of invalid unicode is exactly what it was created for in the first place.
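A sketch of that boundary validation with std.utf.validate (acceptInput is a hypothetical entry-point function):

import std.utf : validate;

// Hypothetical boundary function: validate once, where the data enters
// the program and the error can still be reported to its source.
string acceptInput(ubyte[] raw) // e.g. bytes from a file or socket
{
    auto s = cast(string) raw;
    validate(s); // throws UTFException at the entry point, not later
    return s;
}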
Comment #7 by dlang-bugzilla — 2015-04-29T07:41:14Z
(In reply to Jonathan M Davis from comment #6)
> Yikes. That is far worse than throwing Exceptions, since it would kill your
> program, and it's indicative of a bug in the program rather than bad input.
Yes. The bug is that the string should've been sanitized.
> But most programs just don't care about how valid the
> Unicode is,
Maybe most programs YOU write.
> and the fact that throwing is how it's handled is incredibly
> annoying.
I can see how it can be annoying - when you don't care about your data.
> It forces validation on all programs whether they need it or not,
> and it makes it so that string-based code can pretty much never be nothrow.
Throwing errors is allowed in nothrow code.
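A minimal sketch of this, assuming a hypothetical InvalidUtfError type (nothrow bars Exception but not Error):

import std.utf : decode;

class InvalidUtfError : Error
{
    this() @safe pure nothrow { super("invalid UTF sequence"); }
}

dchar decodeStrict(string s, ref size_t i) nothrow
{
    try
        return decode(s, i);       // may throw UTFException (an Exception)
    catch (Exception)
        throw new InvalidUtfError; // throwing an Error is allowed here
}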
> Using the replacement character in the stead of invalid unicode is exactly
> what it was created for in the first place.
Yes, in circumstances when you don't care about the "invalid" data, which should always be opt-in.
Comment #8 by bearophile_hugs — 2015-04-29T08:01:25Z
(In reply to Walter Bright from comment #0)
> Changing foreach to return replacementDchar on invalid UTF encodings fixes
> these problems, and makes it possible to do faster loops.
Another solution is to deprecate foreach iteration on strings, and require something like "foreach(c; mystring.byCharThrowing)" and similar things.
Comment #9 by issues.dlang — 2015-04-29T08:05:51Z
Most string-based functions work perfectly well with invalid Unicode. Does find care? Does startsWith? Does filter? The replacement character simply won't match what you're looking for. The functions themselves don't care. The replacement character is just another character. They need a way to deal with invalid Unicode, but the replacement character deals with that beautifully.
The concern is whether program input is valid - whether the user manages to type in invalid Unicode due to bad terminal settings, or whether you get junk off a socket, or whether a file has been corrupted. Anything that cares should check that when the data enters the program, so that the error can be reported to whoever or wherever the data is coming from. Having it done via exceptions later on disconnects the reporting of the error from the point where it can actually be handled. What do you do if you read in an XML file and process half of it before you hit invalid Unicode? If the whole file was read into memory, then you may not even have any idea where that string came from, and it's likely far too late to report to the user that they're opening a corrupted file. That validation really needs to be done when the string enters the program - not at some arbitrary point later in the program when the invalid portion happens to be decoded. So, if you insist that all strings be validated, then maybe throwing an Error makes sense, but an Exception sure doesn't. And throwing an Error assumes that you always need to validate the Unicode in strings, which definitely isn't the case when the replacement character is used. So, throwing an Error forces everyone to validate the Unicode in their strings whether they care or not, even though using the replacement character would work, whereas the programs that do care about validating their strings should be doing the validation up front anyway.
So, given that the code that cares about validation needs to be validating up front and therefore doesn't care about the replacement character being used later and that programs that don't care about validating their Unicode input will work just fine with the replacement character, it seems to me that it makes perfect sense to just use the replacement character rather than throwing.
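A sketch of that point, using std.utf.byDchar as a stand-in for the proposed replacement-decoding foreach:

import std.algorithm.searching : canFind, startsWith;
import std.utf : byDchar;

void main()
{
    string s = "caf\xC3"; // truncated two-byte sequence at the end
    assert(s.byDchar.startsWith("caf")); // searching works as usual
    assert(!s.byDchar.canFind('é'));     // the bad tail never matches
}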
Comment #10 by dlang-bugzilla — 2015-04-29T08:30:57Z
OK, I see from your post that you don't see many of the problems with the replacement character. Let me show you some example problematic situations:
1.
Bob wants to update his company's documents to use the new name for his product. He writes a program that does a recursive pattern search & replace in a directory. After testing the program on a few sample files, he is satisfied with the results, and runs the program on his company's document store.
Six months later, long after the documents went out of backup rotation, Sue finds that some important historical documents have been irreversibly corrupted and are now full of Unicode replacement characters encoded as UTF-8. Why? Because these old documents did not use UTF-8, and Bob used D.
2.
Bob is writing a secure server-side software package (let's say, a confidential document store). He is using a std.algorithm-based hashing algorithm to store the passwords securely. At some point, Mary signs up and creates a secure password, which contains entirely Cyrillic letters (let's say, "ЭтоМойПароль").
Not long after, Eve successfully logs into Mary's account with the password "ЯЯЯЯЯЯЯЯЯЯЯЯ". Why? Because the passwords just happened to be sent in some non-UTF-8 encoding, and, since Bob used D, once "normalized" through std.algorithm's replacement-character substitution, all Unicode-only passwords of the same length have the same hash.
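The collision is easy to reproduce with replacement-based decoding; a sketch using std.utf.byDchar (0xFE and 0xFF are never valid in UTF-8, so each byte decodes to one U+FFFD):

import std.algorithm.comparison : equal;
import std.utf : byDchar;

void main()
{
    string a = "\xFE\xFE\xFE"; // two different byte strings...
    string b = "\xFF\xFF\xFF";
    // ...decode to identical runs of U+FFFD, so any hash computed over
    // the decoded characters collides.
    assert(a.byDchar.equal(b.byDchar));
}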
Automatic use of the replacement character will come as a surprise to many people who come from other languages. For example, in Delphi, strings are also the de-facto ubyte[] / void[] type - you can safely read a binary file into a string, perform search and replace, and write it back, knowing that the result will be exactly what you expected.
Furthermore, from your message it appears to me that you've missed the point of my argument:
> What do you do if you read in an XML file and process half of it before you hit invalid Unicode?
You abort! This should not happen. Either the XML file is in an incorrect encoding (which calls into question the integrity of all the data parsed so far - what if it was some 8-bit encoding that only LOOKED like valid UTF-8?) or the program should've sanitized the input first if it really didn't care about data correctness. But this is an XML file, meaning it's very likely machine-generated - if it contains errors, it might indicate a problem somewhere else in the system, which is why it's all the more important to abort and get the user to figure out the true source of the problem. Ignoring the error here reminds me of how PHP never stops on errors by default, or Basic's "ON ERROR RESUME NEXT".
> So, throwing an Error is forcing everyone to validate the Unicode in their strings whether they care or not, and using the replacement character will work, whereas the programs that do care about validating their strings should be doing the validation up front anyway.
Yes, but then there is no way to make sure you're not accidentally corrupting data! Whereas now we at least have a runtime check against invalid UTF-8, after this change we will have no check at all. With no automatic mechanism to ensure that all text is sanitized before it gets into std.algorithm, it becomes impossible to be sure that you're not accidentally corrupting data along the way.
(In reply to bearophile_hugs from comment #8)
> Another solution is to deprecate foreach iteration on strings, and require
> something like "foreach(c; mystring.byCharThrowing)" and similar things.
That's not a solution as I bet it breaks 50% of the programs out there.
Comment #13 by bugzilla — 2015-04-29T10:26:18Z
Vladimir, you bring up good points. I'll try to address them. First off, why do this?
1. much faster
2. string processing can be @nogc and nothrow. If you follow external discussions on the merits of D, the "D is no good because Phobos requires the GC" complaint ALWAYS comes up, and sucks all the energy out of the conversation.
So, on to your points:
1. Replacement only happens when doing a UTF decoding. S+R doesn't have to do conversion, and that's one of the things I want to fix in std.algorithm. The string fixes I've done in std.string avoid decoding as much as possible.
2. Same thing. (Running normalization on passwords? What the hell?)
The replacement char thing was not invented by me; it is commonplace, as users don't like their documents being wholly rejected over one or two bad encodings.
I know that many programs try to guess the encoding of random text they get. Doing this by only reading a few characters, and assuming the rest, is a strange method if one cares about the integrity of the data.
Having to constantly re-sanitize data, at every step in the pipeline, is going to make D programs uncompetitive speed-wise.
Comment #14 by dlang-bugzilla — 2015-04-29T10:35:06Z
(In reply to Walter Bright from comment #13)
> Vladimir, you bring up good points. I'll try to address them. First off, why
> do this?
>
> 1. much faster
If I understand correctly, throwing Error instead of Exception will also solve the performance issues
> 2. string processing can be @nogc and nothrow. If you follow external
> discussions on the merits of D, the "D is no good because Phobos requires
> the GC" ALWAYS comes up, and sucks all the energy out of the conversation.
Ditto, but the @nogc aspect can also be solved with the refcounted exceptions spec, which will fix the problem in general.
> So, on to your points:
>
> 1. Replacement only happens when doing a UTF decoding. S+R doesn't have to
> do conversion, and that's one of the things I want to fix in std.algorithm.
> The string fixes I've done in std.string avoid decoding as much as possible.
Inevitably, it is still very easy to accidentally use something that auto-decodes. There is no way to statically make sure that you don't (except for using a non-string type for text, which is impractical), and with this proposed change, there will be no run-time way to handle this either.
> 2. Same thing. (Running normalization on passwords? What the hell?)
I did not mean Unicode normalization - it was a joke (std.algorithm will "normalize" invalid UTF characters to the replacement character). But since .front on strings autodecodes, feeding a string to any generic range function in std.algorithm will cause auto-decoding (and thus, character substitution).
> The replacement char thing was not invented by me, it is commonplace as
> users don't like their documents being wholly rejected for one or two bad
> encodings.
I know, I agree it's useful, but it needs to be opt-in.
> I know that many programs try to guess the encoding of random text they get.
> Doing this by only reading a few characters, and assuming the rest, is a
> strange method if one cares about the integrity of the data.
I don't see how this is relevant, sorry.
> Having to constantly re-sanitize data, at every step in the pipeline, is
> going to make D programs uncompetitive speed-wise.
I don't understand what you mean by this. You could say that any way to handle invalid UTF can be seen as a way of sanitizing data: there will always be a code path for what to do when invalid UTF is encountered. I would interpret "no sanitization" as not handling invalid UTF in any way (i.e. treating it in an undefined way).
Comment #15 by bugzilla — 2015-04-29T10:56:40Z
(In reply to Vladimir Panteleev from comment #14)
> If I understand correctly, throwing Error instead of Exception will also
> solve the performance issues
It still allocates memory. But it's worth thinking about. Maybe assert()?
> Ditto, but the @nogc aspect can also be solved with the refcounted
> exceptions spec, which will fix the problem in general.
We'll see. That's still a ways off.
> > 2. Same thing. (Running normalization on passwords? What the hell?)
>
> I did not mean Unicode normalization - it was a joke (std.algorithm will
> "normalize" invalid UTF characters to the replacement character). But since
> .front on strings autodecodes, feeding a string to any generic range
> function in std.algorithm will cause auto-decoding (and thus, character
> substitution).
That can be fixed as I suggested.
> > The replacement char thing was not invented by me, it is commonplace as
> > users don't like their documents being wholly rejected for one or two bad
> > encodings.
> I know, I agree it's useful, but it needs to be opt-in.
Global opt-in for foreach is not feasible. However, one can add an algorithm "validate" which throws on invalid UTF, and put that at the start of a pipeline, as in:
text.validate.A.B.C.D;
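std.utf.validate is the eager form of such a check; a sketch of the pattern, with map/toUpper standing in for the A.B.C.D stages:

import std.algorithm.iteration : map;
import std.uni : toUpper;
import std.utf : byDchar, validate;

auto pipeline(string text)
{
    validate(text); // one up-front check; throws UTFException on bad input
    // After validation, byDchar never meets an invalid sequence, so the
    // rest of the chain processes known-good code points.
    return text.byDchar.map!(c => toUpper(c));
}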
> > I know that many programs try to guess the encoding of random text they get.
> > Doing this by only reading a few characters, and assuming the rest, is a
> > strange method if one cares about the integrity of the data.
>
> I don't see how this is relevant, sorry.
You brought up guessing the encoding of XML text by reading the start of it: "what if it was some 8-bit encoding that only LOOKED like valid UTF-8?"
> > Having to constantly re-sanitize data, at every step in the pipeline, is
> > going to make D programs uncompetitive speed-wise.
>
> I don't understand what you mean by this. You could say that any way to
> handle invalid UTF can be seen as a way of sanitizing data: there will
> always be a code path for what to do when invalid UTF is encountered. I
> would interpret "no sanitization" as not handling invalid UTF in any way
> (i.e. treating it in an undefined way).
If you have a pipeline A.B.C.D, then A throws on invalid UTF, and B.C.D are never executed. But if A does not throw, then B.C.D are guaranteed to be getting valid UTF, but they still pay the penalty of the compiler thinking they can allocate memory and throw.
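The penalty is in attribute inference; a minimal illustration (front on a string decodes, so even post-validation stages are typed as throwing):

import std.range.primitives : front;

dchar firstChar(string s) // cannot be nothrow or @nogc:
{
    return s.front; // autodecoding front may throw UTFException
}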
Comment #16 by dlang-bugzilla — 2015-04-29T11:09:02Z
(In reply to Walter Bright from comment #15)
> It still allocates memory. But it's worth thinking about. Maybe assert()?
Sure.
> > I did not mean Unicode normalization - it was a joke (std.algorithm will
> > "normalize" invalid UTF characters to the replacement character). But since
> > .front on strings autodecodes, feeding a string to any generic range
> > function in std.algorithm will cause auto-decoding (and thus, character
> > substitution).
>
> That can be fixed as I suggested.
Sorry, I'm not following. Which suggestion here will fix what in what way?
> Global opt-in for foreach is not feasible.
I agree - some libraries will expect one thing, and others another.
> However, one can add an algorithm
> "validate" which throws on invalid UTF, and put that at the start of a
> pipeline, as in:
>
> text.validate.A.B.C.D;
This is part of a solution. There also needs to be a way to ensure that validate was called, which is the hard part.
> You brought up guessing the encoding of XML text by reading the start of it:
> "what if it was some 8-bit encoding that only LOOKED like valid UTF-8?"
No, that's not what I meant.
UTF-8 and the old 8-bit encodings (ISO 8859-*, Windows-125*) both use the high bit of a byte to represent non-ASCII characters. Consider a program that expects a UTF-8 document but is actually fed one in an 8-bit encoding: it is possible (although unlikely) that text actually in an 8-bit encoding is successfully interpreted as a valid UTF-8 stream. Thus, invalid UTF-8 can indicate a problem with the entire document, and not just with the immediate sequence of bytes.
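A concrete sketch of that masquerading case; the byte values below are taken from the Windows-1251 table:

import std.utf : validate;

void main()
{
    // "Мё" in Windows-1251 is the bytes 0xCC 0xB8, which also happen to
    // be a well-formed UTF-8 sequence for U+0338, a combining overlay.
    string misread = "\xCC\xB8";
    validate(misread); // passes: valid UTF-8, yet entirely the wrong text
}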
> If you have a pipeline A.B.C.D, then A throws on invalid UTF, and B.C.D
> never are executed. But if A does not throw, then B.C.D guaranteed to be
> getting valid UTF, but they still pay the penalty of the compiler thinking
> they can allocate memory and throw.
OK, so you're saying that we can somehow automatically remove the cost of handling invalid UTF-8 if we know that the UTF-8 we're getting is valid? I don't see how this would work, or what noticeable benefit it would provide in practice. Since the cost of removing a code path is negligible, I assume you're talking about exception frames, but I still don't see how this applies. Could you elaborate, or is this improvement a theory for now?
Besides, won't A's output be a range of dchar, so B, C and D will not autodecode with or without this change?
Comment #17 by dlang-bugzilla — 2015-04-29T11:36:37Z
Let's see if I understand the situation correctly... let's say we have a chain:
str.a.b.c
So, str is a UTF-8 string, and a, b and c are range algorithms (they use .front/.popFront and provide .front/.popFront themselves).
If a/b/c don't throw anything themselves, the nothrow attribute will be inferred from the .front/.popFront of the range in front of them (the range they consume), right?
That means that if str.front can throw, c can't be nothrow. But if str.front is nothrow, then c CAN be nothrow.
But what if we do this:
str.forceDecode.a.b.c
forceDecode doesn't use str.front - it reads str directly, code unit by code unit, and inserts replacement characters where it sees errors. This allows a, b and c to be nothrow.
Unless I'm wrong, I think this idea could work for opt-in replacement character substitution. Following the 90/10 law, it should be easy to insert "forceDecode" in the few relevant places as indicated by a profiler.
Does this proposal make sense?
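For reference, std.utf.byDchar later provided essentially this forceDecode; a sketch of the inference effect it enables (assuming the attributes infer through count):

import std.algorithm.searching : count;
import std.utf : byDchar;

// Autodecoding front may throw, so s.count('a') cannot be nothrow; through
// byDchar, which substitutes U+FFFD instead, the attributes should infer:
size_t countA(string s) nothrow @nogc
{
    return s.byDchar.count('a');
}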
Comment #18 by schuetzm — 2015-04-29T11:40:09Z
(In reply to Walter Bright from comment #15)
> If you have a pipeline A.B.C.D, then A throws on invalid UTF, and B.C.D
> never are executed. But if A does not throw, then B.C.D guaranteed to be
> getting valid UTF, but they still pay the penalty of the compiler thinking
> they can allocate memory and throw.
When `assert()` is used, whatever cost there is will of course disappear with `-release`.
And IMO asserting is the right thing to do. Quoting the spec [1]:
"char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format. dchar[] strings are in UTF-32 format."
Note how it says "are in UTF-x format", not "should be". Therefore, a `string` not containing valid UTF-8 is by definition a bug.
Data with other (or unknown) encodings needs to be stored in `ubyte[]`.
[1] http://dlang.org/arrays.html#strings
Comment #19 by dlang-bugzilla — 2015-04-29T11:51:00Z
(In reply to Vladimir Panteleev from comment #16)
> (In reply to Walter Bright from comment #15)
> > It still allocates memory. But it's worth thinking about. Maybe assert()?
>
> Sure.
Wait, now I'm not sure. For some reason I was thinking of assert(false), which always stops execution. But continuing upon encountering invalid UTF-8 in release mode might result in bad outcomes as well.
The problem is that it's impossible to achieve 100% coverage and make sure that all Unicode-handling code in your program also handles invalid UTF-8 gracefully. Thus, an invalid-UTF-8 handling problem might not be caught in testing but might cause an unpleasant situation in release mode (depending on what happens after the assertion does NOT fire).
I don't feel too strongly about this though, I think programs that operate on important data shouldn't run with -release anyway.
Comment #20 by dlang-bugzilla — 2015-04-29T11:53:41Z
(In reply to Marc Schütz from comment #18)
> Data with other (or unknown) encodings needs to be stored in `ubyte[]`.
Have you tried using ubyte[] to process ASCII text? It's horrible, you have to cast at every step, and nothing in std.string works even when it should.
Comment #21 by dfj1esp02 — 2015-04-29T16:00:18Z
(In reply to Vladimir Panteleev from comment #16)
> > Global opt-in for foreach is not feasible.
>
> I agree - some libraries will expect one thing, and others another.
Libraries don't determine which data the program operates on; that depends on the program and its environment. An encoding mismatch has large-scale consequences, too: the program crashes or corrupts data. Libraries don't decide how to behave in such cases; it's a property of the program as a whole. Since they can't decide how to behave, they shouldn't, and thus can't have different expectations on this matter: it's a per-program aspect.
Comment #22 by schuetzm — 2015-04-29T17:41:28Z
(In reply to Vladimir Panteleev from comment #20)
> (In reply to Marc Schütz from comment #18)
> > Data with other (or unknown) encodings needs to be stored in `ubyte[]`.
>
> Have you tried using ubyte[] to process ASCII text? It's horrible, you have
> to cast at every step, and nothing in std.string works even when it should.
For ASCII text, char[] is okay; UTF-8 is a superset of ASCII.
But you're right for other encodings. That's why those need to be converted "at the border": to UTF-8 when reading from a file or stdin, main() args, and env vars, and from UTF-8 to whatever encoding on writing. Internally, text needs to be UTF-x encoded. This is the only sane way to handle different text encodings, IMO.
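std.encoding can do that border conversion; a sketch using its Latin1String type:

import std.encoding : Latin1String, transcode;

void main()
{
    // Convert once at the border: Latin-1 in, UTF-8 string out.
    auto latin1 = cast(Latin1String) "caf\xE9"; // 0xE9 is 'é' in Latin-1
    string utf8;
    transcode(latin1, utf8);
    assert(utf8 == "café");
}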
Comment #23 by code — 2015-04-30T22:13:10Z
(In reply to Vladimir Panteleev from comment #20)
> (In reply to Marc Schütz from comment #18)
> > Data with other (or unknown) encodings needs to be stored in `ubyte[]`.
>
> Have you tried using ubyte[] to process ASCII text? It's horrible, you have
> to cast at every step, and nothing in std.string works even when it should.
No one is suggesting you operate on ubyte[] as a string.
What people are saying is that you should validate a ubyte[] array before converting it to a string. This is, by the way, how readText works. You'd have to cast raw data to string to get strings with invalid UTF.
Comment #24 by code — 2015-04-30T22:40:49Z
If we validate encoding on data entry points such as readText or byLine, then decoding errors should be assertions rather than silent replacements, because it's a programming error to use unvalidated data as string.
Comment #25 by dlang-bugzilla — 2015-05-02T10:40:08Z
(In reply to Martin Nowak from comment #24)
> If we validate encoding on data entry points such as readText or byLine,
> then decoding errors should be assertions rather than silent replacements,
> because it's a programming error to use unvalidated data as string.
Although I think this approach is acceptable (as long as the program halts regardless of compilation flags, which shouldn't be a problem), I would like to note that there are situations in which it is impractical to either convert or validate the data. One example is implementations of text-based network protocols (e.g. HTTP, NNTP, SMTP). Here, neither converting everything to UTF-8 nor verifying that it is valid UTF-8 works, because text-based protocols often embed raw binary data. The program only needs to parse the ASCII text parts, so the ideal solution would be a string-handling library which never decodes UTF-8 (something D doesn't have).
Comment #26 by dlang-bugzilla — 2015-05-02T10:47:18Z
(In reply to Sobirari Muhomori from comment #21)
> Libraries don't determine on which data the program operates, it depends on
> the program and its environment, encoding mismatch has large scale
> consequence too: program crashes or corrupts data, libraries don't decide
> how to behave in such cases, it's a property of the program as a whole.
> Since they can't decide how to behave in such cases, they shouldn't decide
> and thus can't have different expectations on this matter, it's a
> per-program aspect.
No. Almost nothing is a per-program aspect. A program may contain within itself a large number of big components, each functioning more-or-less independently, each of which might have been a single program or even a collection of programs. If something prevents you from designing such a system, this indicates underlying encapsulation flaws. Such global changes of behavior as you are proposing can affect a component which is used by a second component, which is used by a third component, etc. - and something along that line is likely to expect failures to occur in a predictable way.
Comment #27 by dfj1esp02 — 2015-05-05T07:25:40Z
If you want to request definite behavior in a fine-grained manner, that's always possible with configurable decoders; they would bypass the default behavior where necessary.
Comment #28 by code — 2015-05-09T11:52:20Z
(In reply to Vladimir Panteleev from comment #25)
> Here, neither converting everything to UTF-8 or verifying that it is valid UTF-8 works, because
> text-based protocols often embed raw binary data. The program only needs to
> parse the ASCII text parts, so the ideal solution would be a string handling
> library which never decodes UTF-8 (something D doesn't have).
Yes, and you would be better off handling such protocols as ubyte[].
Comment #29 by dlang-bugzilla — 2015-05-16T17:48:33Z
(In reply to Martin Nowak from comment #28)
> Yes, and you would be better off to handle such protocols as ubyte.
What do you mean? Aren't you contradicting yourself from when you wrote:
> No one is suggesting you operate on ubyte[] as string.
?
Comment #30 by code — 2015-07-17T07:01:22Z
(In reply to Vladimir Panteleev from comment #29)
> (In reply to Martin Nowak from comment #28)
> > Yes, and you would be better off to handle such protocols as ubyte.
>
> What do you mean? Aren't you contradicting yourself from when you wrote:
>
> > No one is suggesting you operate on ubyte[] as string.
>
> ?
Well, b/c they contain delimited binary and ASCII data, you'll have to find those delimiters, then validate and cast the ASCII part to a string, and then use std.string functions.
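A sketch of that flow (parseMessage and the delimiter are illustrative, loosely HTTP-shaped):

import std.algorithm.searching : countUntil;
import std.utf : validate;

// Split a message at the header/body delimiter, validate only the text
// part, and leave the binary payload as raw bytes.
void parseMessage(ubyte[] msg)
{
    static immutable ubyte[] delim = ['\r', '\n', '\r', '\n'];
    auto i = msg.countUntil(delim);
    assert(i >= 0, "no header delimiter");
    auto header = cast(string) msg[0 .. i];
    validate(header); // std.string functions are now safe on header
    auto payload = msg[i + delim.length .. $]; // binary part, untouched
}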
Comment #31 by code — 2015-07-17T07:03:55Z
(In reply to Martin Nowak from comment #30)
> Well, b/c they contain delimited binary and ASCII data, you'll have to find
> those delimiters, then validate and cast the ASCII part to a string, and can
> then use std.string functions.
BTW, this is what I already wrote in comment 23. Not sure why you only partially quoted my answer to suggest a contradiction.
Comment #32 by code — 2015-07-17T07:05:24Z
Summary:
We should adopt a new model of Unicode validation.
The current one where every string processing function decodes unicode characters and performs validation causes too much overhead.
A better alternative would be to perform unicode validation once when reading raw data (ubyte[]) and then assume any char[]/wchar[]/dchar[] is a valid unicode string.
Invalid encodings introduced by string processing algorithms are programming bugs and thus do not warrant runtime checks in release builds.
Also see
https://github.com/D-Programming-Language/druntime/pull/1279
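A compact sketch of this model (fromRaw and codePointCount are hypothetical names):

import std.range.primitives : walkLength;
import std.utf : byDchar, validate;

// Boundary: raw bytes become a string through exactly one validation pass.
string fromRaw(ubyte[] raw)
{
    auto s = cast(string) raw;
    validate(s); // throws UTFException here, and only here
    return s;
}

// Interior: decoding assumes validity; a stray U+FFFD now signals a bug
// upstream rather than bad input, and the attributes can infer.
size_t codePointCount(string s) nothrow @nogc
{
    return s.byDchar.walkLength;
}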
Comment #33 by dfj1esp02 — 2015-07-17T14:52:00Z
Removing autodecoding is good, but this issue is about making autodecoding nothrow and @nogc.
Comment #34 by dlang-bugzilla — 2015-07-17T15:04:47Z
(In reply to Martin Nowak from comment #31)
> BTW, this is what I already wrote in comment 23. Not sure why you only
> partially quoted my answer to suggest a contradiction.
Err, well, to be fair, you did not state this clearly in comment 23, which is why I asked for clarification. I was not trying to maliciously nitpick your words; I was just trying to understand your point.
Comment #35 by issues.dlang — 2015-07-17T19:35:46Z
(In reply to Martin Nowak from comment #32)
> Summary:
>
> We should adopt a new model of unicode validations.
> The current one where every string processing function decodes unicode
> characters and performs validation causes too much overhead.
> A better alternative would be to perform unicode validation once when
> reading raw data (ubyte[]) and then assume any char[]/wchar[]/dchar[] is a
> valid unicode string.
> Invalid encodings introduced by string processing algorithms are programming
> bugs and thus do not warrant runtime checks in release builds.
Exactly.
Comment #36 by dlang-bugzilla — 2015-07-17T19:42:10Z
Question, is there any overhead in actually verifying the validity of UTF-8 streams, or is all overhead related to error handling (i.e. inability to be nothrow)?
Comment #37 by jack — 2016-05-18T02:04:41Z
This entire discussion is moot unless you get Andrei on board with a breaking change to a very fundamental part of the language.
Comment #38 by code — 2016-05-20T14:20:03Z
(In reply to Vladimir Panteleev from comment #36)
> Question, is there any overhead in actually verifying the validity of UTF-8
> streams, or is all overhead related to error handling (i.e. inability to be
> nothrow)?
I think it's fairly measurable b/c you need to add lots of additional checks and branches (though highly predictable ones).
While my initial decode implementation https://github.com/MartinNowak/phobos/blob/1b0edb728c/std/utf.d#L577-L651 was transmogrified into 200 lines in the meantime https://github.com/dlang/phobos/blob/acafd848d8/std/utf.d#L1167-L1369, you can still use it to benchmark validation.
I did run a lot of benchmarks when introducing that function, and the code path for decoding just remains slow, even with the throwing code path moved out of normal control flow.
Comment #39 by dlang-bugzilla — 2021-11-07T07:25:07Z
*** Issue 22473 has been marked as a duplicate of this issue. ***
Comment #40 by robert.schadek — 2024-12-07T13:35:18Z