Bug 15949 – Improve readtext handling of byte order mark (BOM)
Status: RESOLVED
Resolution: FIXED
Severity: enhancement
Priority: P1
Component: phobos
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2016-04-21T20:02:45Z
Last change time: 2018-02-14T11:24:57Z
Assigned to: No Owner
Creator: Jesse Phillips
Comments
Comment #0 by Jesse.K.Phillips+D — 2016-04-21T20:02:45Z
Problem:
I've hit this many times on Windows. I try to read in a file with std.file.readText and get: "Syntax error at line 0"
This is because some Microsoft program has decided to insert a UTF-8 Byte Order Mark (BOM) into the beginning of the file (0xEF 0xBB 0xBF). But readText really shouldn't automatically convert a file's content based on the BOM specified.
Suggested fix:
I think readText should validate and skip the BOM. It should check that the BOM is not UTF-16LE (0xFF 0xFE), UTF-16BE (0xFE 0xFF), UTF-32LE (0xFF 0xFE 0x00 0x00), or UTF-32BE (0x00 0x00 0xFE 0xFF). If it is one of those, it should throw an exception stating that the file uses that encoding and will not be converted to a UTF-8 string.
The corresponding std.file.readText!wstring and std.file.readText!dstring should perform equivalent validation. If the byte order can be changed at no cost, then that should be done.
1. https://en.wikipedia.org/wiki/Byte_order_mark
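The check suggested above can be sketched in a language-neutral way. The following Python snippet is only an illustration of the idea, not Phobos code: the names `BOMS`, `detect_bom`, and `read_utf8_text` are hypothetical, and a plain `ValueError` stands in for whatever exception readText would throw. Longer signatures are tested first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```python
# Illustrative sketch only; all names here are hypothetical, not Phobos APIs.

# BOM signatures, longest first: the UTF-32LE BOM (FF FE 00 00) starts
# with the UTF-16LE BOM (FF FE), so it has to be tested before it.
BOMS = [
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]


def detect_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for signature, name in BOMS:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


def read_utf8_text(data):
    """Skip a UTF-8 BOM, reject UTF-16/UTF-32 BOMs, then decode as UTF-8."""
    name, size = detect_bom(data)
    if name not in (None, "UTF-8"):
        raise ValueError(
            "file is %s and will not be converted to a UTF-8 string" % name
        )
    return data[size:].decode("utf-8")
```

With this sketch, `read_utf8_text(b"\xef\xbb\xbfabc")` yields the text without the BOM, while input starting with 0xFF 0xFE is rejected instead of being passed on to a parser.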
Comment #2 by github-bugzilla — 2018-02-14T11:24:56Z
Commits pushed to master at https://github.com/dlang/phobos
https://github.com/dlang/phobos/commit/5d52a81e4dede77fe75eb3215f1b24b898963f26
Fix issue 15949: Make readText check BOMs.
This makes it so that readText checks for a BOM. If there is a BOM, it
is for UTF-8, UTF-16, or UTF-32, and it doesn't match the requested
string type, then a UTFException is thrown. Other encodings are let
through in case they happen to work with the requested string type and
pass UTF validation.
Also, this makes it so that readText checks the alignment of the buffer
against the requested string type and throws a UTFException instead of
letting the cast throw an Error.
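The two checks described in this commit message can be illustrated with a small Python sketch. It is not the Phobos implementation: `EncodingMismatch`, `check_bom`, `check_alignment`, and `validate` are hypothetical names, the requested string type is modelled only by its code-unit size in bytes (1 for string, 2 for wstring, 4 for dstring), and "alignment" is read here as "the byte length is a multiple of the code-unit size", which is an assumption. The commit message does not say how byte order is handled or in which order the checks run, so the sketch ignores the former and picks an arbitrary order for the latter.

```python
# Illustrative sketch only; names and structure are hypothetical, not Phobos code.


class EncodingMismatch(ValueError):
    """Stand-in for the UTFException mentioned in the commit message."""


# (BOM signature, code-unit size in bytes). Longer signatures come first
# because the UTF-32LE BOM begins with the UTF-16LE BOM.
UNICODE_BOMS = [
    (b"\xff\xfe\x00\x00", 4),  # UTF-32LE
    (b"\x00\x00\xfe\xff", 4),  # UTF-32BE
    (b"\xef\xbb\xbf", 1),      # UTF-8
    (b"\xff\xfe", 2),          # UTF-16LE
    (b"\xfe\xff", 2),          # UTF-16BE
]


def check_bom(data, unit_size):
    """Raise if a Unicode BOM is present and implies a different unit size."""
    for signature, bom_unit_size in UNICODE_BOMS:
        if data.startswith(signature):
            if bom_unit_size != unit_size:
                raise EncodingMismatch(
                    "BOM does not match the requested string type"
                )
            return
    # No recognised Unicode BOM: let the data through to later validation.


def check_alignment(data, unit_size):
    """Raise if the buffer cannot be split into whole code units."""
    if len(data) % unit_size != 0:
        raise EncodingMismatch(
            "buffer length is not a multiple of the code-unit size"
        )


def validate(data, unit_size):
    """Run both checks; the order is a choice made for this sketch."""
    check_alignment(data, unit_size)
    check_bom(data, unit_size)
```

For example, `validate(b"\xff\xfeh\x00", 1)` raises because a UTF-16 BOM was found while UTF-8 was requested, and `validate(b"h\x00i", 2)` raises because three bytes cannot hold whole 2-byte code units.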
https://github.com/dlang/phobos/commit/d43925ec6048f49b56c9f4b0cc22ed07999f63a1
Merge pull request #6113 from jmdavis/issue15949
Fix issue 15949: Make readText check BOMs.
merged-on-behalf-of: Vladimir Panteleev