Bug 17553 – std.json should not do UTF decoding when encoding JSON
Status: RESOLVED
Resolution: FIXED
Severity: normal
Priority: P1
Component: phobos
Product: D
Version: D2
Platform: x86
OS: All
Creation time: 2017-06-25T17:54:08Z
Last change time: 2018-01-05T13:29:34Z
Assigned to: No Owner
Creator: Andre
Comments
Comment #0 by andre — 2017-06-25T17:54:08Z
It is possible to read a file into a string. JSONValue happily accepts this string, even though it contains binary data.
But as soon as you try to get the data back out using js.toString, an exception is thrown:
core.exception.UnicodeException@src\rt\util\utf.d(292): invalid UTF-8 sequence
import std.json;
import std.file: read;

void main()
{
    string s = cast(string) read(`C:\D\dmd2\windows\bin\dmd.exe`);
    JSONValue js = JSONValue(s);
    string s2 = js.toString; // this line will throw the exception
}
Comment #1 by dlang-bugzilla — 2017-06-26T10:14:26Z
As far as Phobos (and some parts of the language itself) are concerned, D strings are expected to be UTF-encoded, i.e. to contain a valid sequence of UTF code units. Your program bypasses that assumption by using a cast. The normal way to read text data into a string is the readText function, which performs UTF validation: reading a file that does not contain valid UTF with readText results in an exception being thrown.
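To illustrate the difference, here is a minimal sketch (not part of the original report) that runs the same check readText relies on, std.utf.validate, against a string produced by a cast. The helper name isWellFormed and the sample bytes are made up for this example.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: returns true if `s` holds well-formed UTF-8.
bool isWellFormed(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Stand-in for binary file contents; 0xFF never occurs in valid UTF-8.
    immutable(ubyte)[] raw = [0x48, 0x69, 0xFF, 0xFE];

    // The cast only reinterprets the bytes; nothing is validated here.
    string unchecked = cast(string) raw;

    assert(!isWellFormed(unchecked));
    assert(isWellFormed("Hi"));
}
```

A check like this at the point where the bytes enter the program would surface the problem early, rather than later inside js.toString.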
As for JSON encoding - although most JSON transformations concern themselves with just the ASCII part, the JSON standard does forbid encoding Unicode control characters, which may appear in a valid D string but must not appear in a JSON-encoded one. This includes the high control characters (code points 0x80 to 0x9F), so the encoding code must check for these code points when constructing the JSON string. Although they could in theory be special-cased, the most straightforward way to do it is to look at the input string as a range of Unicode code points (dchars), i.e. rely on auto-decoding, which is what the current implementation does.
In any case, JSON strings are certainly not meant to store binary data - even if the example "worked" (for a certain definition of "work"), the resulting JSON document would not be in any particular encoding. Even though the JSON syntax is restricted to ASCII characters, JSON itself is not - it is Unicode-aware, and contains instructions on how to properly encode and decode Unicode characters, so it can't be used for storing arbitrary binary data.
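If binary data does have to travel inside a JSON document, one common workaround is to encode it as ASCII text first. The sketch below uses std.base64 for that; it is only an illustration, and the helper names packBinary and unpackBinary are hypothetical.

```d
import std.base64 : Base64;
import std.json : JSONValue, parseJSON;

/// Hypothetical helper: wraps arbitrary bytes in a JSON string via Base64.
string packBinary(const(ubyte)[] raw)
{
    // Base64 output is plain ASCII, so it is always a valid D string.
    string encoded = Base64.encode(raw).idup;
    return JSONValue(encoded).toString();
}

/// Hypothetical helper: reverses packBinary.
ubyte[] unpackBinary(string jsonText)
{
    string encoded = parseJSON(jsonText).str;
    return Base64.decode(encoded);
}

void main()
{
    // Bytes that are not valid UTF-8 on their own.
    immutable(ubyte)[] raw = [0x00, 0xFF, 0x80, 0xFE];

    string jsonText = packBinary(raw);
    assert(unpackBinary(jsonText) == raw);
}
```

The cost is roughly a one-third size increase, but the resulting document stays valid JSON in a well-defined encoding.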
If you have a specific use case in mind which is in line with the JSON spec and how D deals with Unicode and strings, please reopen; otherwise, there is no actionable defect presented in this issue.
Comment #2 by dlang-bugzilla — 2017-06-26T12:23:30Z
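RFC 7159 only requires escaping the quotation mark, the backslash, and the control characters U+0000 through U+001F. The high control characters (U+0080 to U+009F) don't need escaping, so we can actually avoid auto-decoding when encoding JSON.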
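As a rough sketch of what encoding without auto-decoding could look like (this is not the actual std.json code or the fix that was merged), the function below walks the string one UTF-8 code unit at a time. Everything that needs escaping is ASCII, so bytes at or above 0x80 can be copied through untouched. The name escapeJsonString is hypothetical.

```d
import std.array : appender;

/// Hypothetical helper, not the std.json implementation.
string escapeJsonString(string s)
{
    static immutable hexDigits = "0123456789abcdef";
    auto buf = appender!string();
    buf.put('"');

    // Iterating as `char` visits UTF-8 code units; no decoding takes place.
    foreach (char c; s)
    {
        switch (c)
        {
            case '"':  buf.put(`\"`); break;
            case '\\': buf.put(`\\`); break;
            case '\n': buf.put(`\n`); break;
            case '\r': buf.put(`\r`); break;
            case '\t': buf.put(`\t`); break;
            default:
                if (c < 0x20)
                {
                    // Remaining controls below U+0020 become \u00XX.
                    buf.put(`\u00`);
                    buf.put(hexDigits[c >> 4]);
                    buf.put(hexDigits[c & 0xF]);
                }
                else
                {
                    // Everything else, including bytes >= 0x80, is copied as-is.
                    buf.put(c);
                }
                break;
        }
    }

    buf.put('"');
    return buf.data;
}

void main()
{
    assert(escapeJsonString("a\"b") == `"a\"b"`);
    assert(escapeJsonString("\x01") == `"\u0001"`);
}
```

Note that a function like this never throws on malformed input: invalid UTF-8 in the source string ends up verbatim in the output, so validation, if wanted, has to happen elsewhere.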
Comment #3 by github-bugzilla — 2017-07-03T09:07:47Z