Bug 15710 – Replacement for std.utf.validate which does not throw

Status: ASSIGNED
Severity: enhancement
Priority: P4
Component: phobos
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2016-02-21T06:02:59Z
Last change time: 2024-12-01T16:26:04Z
Keywords: bootcamp
Assigned to: No Owner
Creator: Jonathan M Davis
Blocks: 16262

Comments

Comment #0 by issues.dlang — 2016-02-21T06:02:59Z

For whatever reason, std.utf.validate throws when a string is invalid Unicode rather than returning true or false (probably because it was easier at the time it was implemented, since the only way to validate Unicode other than reimplementing decode would have been to catch the exception it threw on failure). It also only works on strings and not arbitrary ranges of characters. So, we need a new function (e.g. isValidUnicode) which validates a range of characters and returns whether it's valid Unicode rather than throwing. Then we can deprecate validate and get that much closer to getting rid of UTFException and all of the extraneous Unicode validation that we have right now.
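
To make the requested call shape concrete, here is a minimal sketch. It assumes a helper named isValidUTF, which is hypothetical and not part of Phobos. It simply wraps the existing std.utf.validate, so it still pays for the exception internally and only accepts strings. It illustrates the boolean-returning interface, not the exception-free, range-based implementation this issue asks for.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper (not in Phobos): reports validity as a bool.
/// Internally it still relies on validate throwing, so it is only a
/// stand-in for a real exception-free implementation.
bool isValidUTF(S)(in S str)
{
    try
    {
        validate(str);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    import std.stdio : writeln;

    // 0xFF can never appear in well-formed UTF-8.
    char[] bad = ['a', cast(char) 0xFF, 'b'];

    writeln(isValidUTF("hello"));
    writeln(isValidUTF(bad));
}
```

With something like this, callers can branch with a plain if instead of a try-catch, which is the usability gain being requested.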

Comment #1 by greensunny12 — 2018-03-31T18:33:05Z

Yeah, it would be really cool if I didn't have to resort to such ugly hacks when I just want to handle invalid UTF.
---
private string toValidUTF(string s)
{
    import std.algorithm.iteration : map;
    import std.range : iota;
    import std.string : representation;
    import std.utf;
    return s.representation.length
        .iota
        .map!(i => s.decode!(UseReplacementDchar.yes)(i))
        .toUTF8;
}

try
{
    outStream.validate;
}
catch (UTFException)
{
    outStream = outStream.toValidUTF;
}
---

Comment #2 by issues.dlang — 2018-03-31T22:08:09Z

Well, having isValidUTF rather than validate would just make it so that you can use an if-else block instead of a try-catch. It wouldn't really clean that code up much. If you really want to clean up what you're doing in that example, you need to use byCodeUnit or byUTF, which use the replacement character. In that case, you wouldn't need to check for valid Unicode. If you want a string out the other side instead of a range of code units or code points, you then just call to!string or toUTF8 on it. e.g.

    auto codeUnits = str.byCodeUnit();

or

    auto dchars = str.byDchar(); // byUTF!dchar

or

    str = str.byCodeUnit().to!string();

Now, if you want a string and don't want to allocate a new string if the string is valid, then you'd need to check whether the string is valid Unicode, but in that case you still don't need anything as complicated as what you wrote for toValidUTF. You'd just need something like

    try
        str.validate();
    catch(UTFException)
        str = str.byCodeUnit().to!string();

and if we had isValidUTF, then you'd have

    if(!str.isValidUTF())
        str = str.byCodeUnit().to!string();

So, while isValidUTF would help, it's mostly just getting rid of an unnecessary exception, which does clean up the code in this case, but not drastically. It's byCodeUnit or byUTF that really clean things up here.

Now, as for adding isValidUTF, I have a PR for it in the PR queue, and Andrei approved the symbol, but he rejected the implementation. He basically wanted me to completely redesign how decode works internally and said that the superficial changes I had made to make it work with isValidUTF were too ugly to live. So, at some point here, I need to go back and figure out how to rework all that again, which is not going to be pretty.
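
For reference, the "only allocate when the input is invalid" pattern described above could be sketched as follows. The function name sanitizeUTF is hypothetical. Note one deliberate difference from the snippets in the comment: this sketch decodes through byUTF!dchar and re-encodes with toUTF8, on the assumption that the replacement character is substituted during decoding and that iterating plain code units would pass invalid bytes through unchanged.

```d
import std.utf : byUTF, toUTF8, UTFException, validate;

/// Hypothetical helper (not in Phobos): returns str itself when it is
/// valid UTF-8, otherwise a new string in which invalid sequences have
/// been substituted while decoding.
string sanitizeUTF(string str)
{
    try
    {
        validate(str);
        return str; // already valid, no allocation
    }
    catch (UTFException)
    {
        // Decode to code points (invalid sequences are replaced),
        // then encode the result back to UTF-8.
        return str.byUTF!dchar.toUTF8;
    }
}

void main()
{
    import std.stdio : writeln;

    // 0xFF can never appear in well-formed UTF-8.
    string bad = ['a', cast(char) 0xFF, 'b'];

    writeln(sanitizeUTF("plain text"));
    writeln(sanitizeUTF(bad));
}
```

If a non-throwing isValidUTF existed, the try-catch would collapse to a single if, which is the modest cleanup the comment describes.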

Comment #3 by robert.schadek — 2024-12-01T16:26:04Z

THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/phobos/issues/10161 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB