← Back to index | Original Bugzilla link

Bug 17861 – UTF Decode fails with exception

Status: NEW
Severity: enhancement
Priority: P4
Component: druntime
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2017-09-26T20:45:46Z
Last change time: 2024-12-07T13:37:37Z
Assigned to: No Owner
Creator: Etienne

Comments

Comment #0 by etienne — 2017-09-26T20:45:46Z

When decoding UTF code points, the invalid characters should be replaced by ? rather than throwing errors. Most uses for a UTF parser are read-only and this breaks any application for no reason. ie. the `decode` function should not behave like it's doing a unit test on all UTF text. The solution in my program is: https://github.com/dlang/druntime/blob/master/src/rt/util/utf.d#L292 Should be replaced by: return '?'

Comment #1 by schveiguy — 2017-09-26T21:08:10Z

std.utf.decode accepts a parameter to allow for exactly what you ask: import std.utf, std.typecons; void main() { size_t idx = 0; auto x = decode!(Yes.useReplacementDchar)("\xff", idx); assert(x == 0xfffd); } What is the actual use case you are using the druntime version for?

Comment #2 by etienne — 2017-09-26T21:22:16Z

That's not terribly useful as it is hidden behind the whole of std.string, which is what actually deals with UTF8 text.

Comment #3 by schveiguy — 2017-09-26T21:27:23Z

I looked at decode because that is what I thought you were referring to. What is the actual use case? What is the code that causes the exception? It's hard to suggest help when you are simply asking for a significant (and breaking) implementation change.

Comment #4 by etienne — 2017-09-26T21:33:46Z

I was encountering this in the vibe.d parseJsonString function, when decoding some Json string with user-generated data. About a year ago I found this was the simplest way to patch this up but I'll run another test case. I don't remember the exact trace but I see that it does use std.decode without the replacementDchar option, so maybe that would solve it. I remember it being due to phobos calling it though, not sure where. I have an issue with the idea that string parsers can throw on a whim and break the application, because as I said this user generated content can cause a crash that way.

Comment #5 by schveiguy — 2017-09-26T23:01:52Z

(In reply to Etienne from comment #4) > I have > an issue with the idea that string parsers can throw on a whim and break the > application, because as I said this user generated content can cause a crash > that way. This is understandable, but it is going to be difficult to justify a change from the default, as much code may depend on the current default.

Comment #6 by etienne — 2017-09-26T23:08:45Z

I think it might have thrown this exception in the foreach loop of a string due to an invalid code point. I don't see `rt.decode` called anywhere else, not even in phobos. I'm not sure anyone has actually planned for this behavior, because it's so marginal, undocumented and difficult to trigger, so I'd say it might affect mostly some unit tests that could expect it. Where in documentation does it say that foreach char[] can throw on unicode error? https://tour.dlang.org/tour/en/gems/unicode This specialized article doesn't even talk about any kind of exceptions.

Comment #7 by jrdemail2000-dlang — 2017-09-27T01:08:21Z

(Not sure a bug is the correct place to discuss, but...) In many of the apps I work on it's important that the application have control over the error handling behavior when an invalid UTF-8 sequence is encountered. Any time data is received from an external source. When using Phobos, it's often necessary to do this validation on initial input without giving control to Phobos routines. This is not always the most convenient. What would be really useful would be to have some sort of configurable utf-8 error handling setting that could be established for a scope. e.g. When entering a scope, set the invalid character behavior to replace, drop, or throw. The low level routines like std.utf.decode would obey the settings. When exiting the scope the previous error handling setting would be re-established. A somewhat similar idea came up in a forum thread recently with respect to establishing the precision used converting floating point numbers to strings with std.conv.to (https://forum.dlang.org/thread/[email protected]). Don't know how hard something like this would be, perhaps prohibitively hard, but it would be very pragmatic for many production applications.

Comment #8 by issues.dlang — 2017-10-03T19:23:08Z

This has been discussed before. There's a strong argument for making it so that decode uses the replacement character by default (it's even what the Unicode standard says you should do), and all string-based stuff then follows suit, at which point anyone wanting exceptions would need to call decode manually with the template argument indicating that that's what they wanted - which is the opposite of what we have now. And Walter is actually in favor of using the replacement character instead of exceptions and possibly even making the change in spite of the issues, but there have been some folks who have been strongly opposed to that. The problem is twofold: 1. Making the change risks silently breaking a ton of code. 2. Others (Vladimir in particular IIRC) have argued about how negative it is to have the contents of strings silently changed, since there are cases where it would be highly detrimental for that to happen. And on some level, all of this gets wrapped into the auto-decoding debate, because that's the main reason that this is out of the control of the user. front and popFront on strings call decode for you and call it in the way that results in exceptions on invalid UTF instead of using the replacement character. Anyone making the calls manually has the choice. So, I think that the chances are very high that we would go with the replacement character by default rather than exceptions (maybe not even have the exceptions at all) if we were starting from scratch - just like we wouldn't have auto-decoding if we were starting from scratch. But it's highly questionable that we can get away with making the change now due to the ramifications that it will have on existing code. At this point, the situation with decoding code points and not having it throw is in pretty much the same boat as using strings with range-based code and not auto-decoding: you have to use wrappers like byCodeUnit and/or special-case your code on strings. And to avoid the exceptions on bad Unicode, you either have to not be decoding code points, or you need to do so yourself with std.utf.decode. No, that's not ideal, but no one has been able to come up with a reasonable way to change the status quo with any kind of reasonable deprecation process.

Comment #9 by etienne — 2017-10-03T21:37:43Z

The std.utf.decode debate is a rough one and seems to have its workarounds. On the other hand, you have this forgotten code at rt.util.utf where everything is so low-level and there is no option to use a replacement character. This is something that comes up on a foreach loop on a string. I don't know why anyone would expect foreach to throw? It's a little strange. Obviously the whole library should default to a replacement character, but rt.util.utf is a little more urgent because of how much of a surprise it is for a foreach loop to be throwing, whereas you don't really expect it to behave like a function and it can happen in edge cases on release builds where things become very difficult to debug.

Comment #10 by issues.dlang — 2017-10-03T22:02:23Z

If you don't want foreach to throw when decoding, then don't use foreach to decode. Call std.utf.decode yourself rather than having foreach automatically call the druntime version for you. We can't change foreach right now for exactly the same reasons that we can't change front and popFront for strings. If it were decided that the breakage that would result from changing front and popFront to use the replacement character was low enough that we were willing to just change them and let things break, then we'd change foreach. But as long as it's considered too risk to change front and popFront to not throw, it's too risky to change foreach. There is nothing special about foreach that makes it any different from front and popFront for strings in terms of whether we can risk breaking code. I'm sorry that you find the current behavior surprising, but string handling in D in general requires understanding a number of quirks that we'd like to change but can't due to how much existing code would break if we did make the desired changes.

Comment #11 by etienne — 2017-10-03T22:06:46Z

If the current idea is to not fix the bugs due to possible breakage, why have a bug tracker for druntime in the first place? Also, what's the point of having unit tests if you can't rely on them?

Comment #12 by issues.dlang — 2017-10-03T22:34:56Z

(In reply to Etienne from comment #11) > If the current idea is to not fix the bugs due to possible breakage, why > have a bug tracker for druntime in the first place? The current behavior is not a bug. The code is functioning exactly as designed. That design is arguably a bad design, and many of us would like to change it, but changing it would break existing code, so it is unlikely that it will be changed. There simply isn't a good deprecation path that would allow us to go from one behavior to the other - certainly no one has come up with one thus far. > Also, what's the point of having unit tests if you can't rely on them? What unit test are you referring to? Nothing about the current behavior of foreach and decoding code points should make it so that unit tests are unreliable. foreach is completely consistent in what it does. It's just that it's designed to do something that we wouldn't design it to do if we were doing things from scratch. You weren't previously aware that foreach threw when decoding invalid UTF. Now, you are, and you can write your code accordingly. The information about foreach throwing when decoding invalid UTF should be in the spec, but I don't know if it is or not. The spec doesn't always have the information that it should, but this is how foreach was designed and has worked ever since it was made so that it could decode code points. And it's the intended behavior until such time as we can figure out how to move to using the replacement character without breaking code in the process, which unfortunately, may very well be never. Right now, literally, our best option that would involve making the change would be to make the change and warn in the changelog that that's what we're doing, and anyone reading it would then have the opportunity to scour their code to see if they needed to change it as a result. The breakage would be silent and easy to miss even if in many cases, it wouldn't matter. And as such, thus far, that solution has been deemed unacceptable. So, if you know of a way to make it so that foreach can be changed to use the replacement character without silently breaking code, then great. We'd love to hear it. As it stands, this is one of those design decisions that we regret in retrospect but seem to be stuck with.

Comment #13 by etienne — 2017-10-03T22:46:48Z

You have to choose whether it's a bug or a feature. I think everyone is ready to live with that, but if you live up to it and consider it a feature it'll have to be documented. Just a 1 liner somewhere saying "Foreach (string) can throw unicode errors!" That'll be a good solution to this issue, because right now everyone is forced to learn it the hard way. This being said, I don't see Google Chrome crashing every time it sees an invalid code point. I'm not sure anyone would think about catching that on the first try if they were to do an Ajax call. I'm also pretty sure they'd be happy with the code path where it doesn't throw when the invalid code point comes up. If you know of anyone doing software specifically for unicode valiation, maybe they'd need to be warned but that's about it for me. So yeah, just wave it as a feature or squash the bug, but don't stay in between forever.

Comment #14 by jrdemail2000-dlang — 2017-10-03T22:49:30Z

Changing the default behavior for the individual functions would cause backward compatibility issues. Any thoughts on having run-time selectable behavior that would override the defaults? The default behavior could be left unchanged. The two issues that come to mind: - Functions currently nothrow could lose that status if throw is an option. - Performance: Compile-time choices are faster than run-time. The advantage of a run-time selectable behavior is that it would support the need many programs have for an application specific behavior. There is no single default appropriate for all cases.

Comment #15 by issues.dlang — 2017-10-03T23:05:27Z

(In reply to Jon Degenhardt from comment #14) > Changing the default behavior for the individual functions would cause > backward compatibility issues. Any thoughts on having run-time selectable > behavior that would override the defaults? The default behavior could be > left unchanged. > > The two issues that come to mind: > - Functions currently nothrow could lose that status if throw is an option. > - Performance: Compile-time choices are faster than run-time. > > The advantage of a run-time selectable behavior is that it would support the > need many programs have for an application specific behavior. There is no > single default appropriate for all cases. In general, Walter is against having flags that determine the behavior of the language, and that's essentially what you're suggesting, even if it's set at runtime rather than at compile time. The reality of the matter is that as much as the current behavior sucks, it's trivial to work around it by calling decode yourself. So, I really don't see any reason to make it configurable. That would just make it so that you don't know what the code is designed to do when you look at it. I think that it's far better to just be clear on how UTF decoding works in D than to try and make anything at the language level configurable. The standard library already provides the tools necessary to allow the programmer to choose how they want to handle invalid UTF, even if the defaults aren't exactly ideal. (In reply to Etienne from comment #13) > You have to choose whether it's a bug or a feature. I think everyone is > ready to live with that, but if you live up to it and consider it a feature > it'll have to be documented. Just a 1 liner somewhere saying "Foreach > (string) can throw unicode errors!" > > That'll be a good solution to this issue, because right now everyone is > forced to learn it the hard way. > > This being said, I don't see Google Chrome crashing every time it sees an > invalid code point. I'm not sure anyone would think about catching that on > the first try if they were to do an Ajax call. I'm also pretty sure they'd > be happy with the code path where it doesn't throw when the invalid code > point comes up. If you know of anyone doing software specifically for > unicode valiation, maybe they'd need to be warned but that's about it for me. > > So yeah, just wave it as a feature or squash the bug, but don't stay in > between forever. If the spec isn't clear about the fact that decoding invalid UTF with foreach will throw an exception, then the spec needs to be updated accordingly, but the current behavior is very much as designed and not a bug. I have no idea if the spec says anything about invalid UTF or not. I'd have to comb through it to know for sure. But the spec is often missing details that it should have, and sometimes, when it does say something, it's concise enough in what it says that it's easily missed. It wouldn't surprise me at all if it were stated somewhere in there, and you just missed it, and it wouldn't surprise me if it's not there. Regardless, I completely agree that the spec should be clear on the matter.

Comment #16 by robert.schadek — 2024-12-07T13:37:37Z

THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/dmd/issues/17351 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB