Bug 13512 – Allow non-UTF-8 encoding in shebang line

Status: RESOLVED
Resolution: FIXED
Severity: normal
Priority: P1
Component: dmd
Product: D
Version: D2
Platform: All
OS: Linux
Creation time: 2014-09-20T16:44:52Z
Last change time: 2017-08-16T13:23:09Z
Keywords: pull
Assigned to: No Owner
Creator: Ketmar Dark

Attachments

ID    Filename   Summary       Content-Type              Size
1430  hello1.d   sample code   application/octet-stream  87
1431  z00.patch  proposed fix  text/plain                1015

Comments

Comment #0 by ketmar — 2014-09-20T16:44:52Z
Created attachment 1430
sample code

yes, i have to attach sample. it must be used as-is, not copypasted from bugzilla. yes, i know that shebang is not in utf-8. so what? my whole locale and filesystem is not utf-8, and everything works ok. except dmd, which can't compile valid code with non-utf characters in the place dmd never needs to look into.
Comment #1 by dlang-bugzilla — 2014-09-21T01:15:35Z
OMG, someone still uses KOI-8? This is probably a WONTFIX because it conflicts with the D spec. DMD does not allow invalid UTF-8 even in comments, because Unicode conversion is done as a separate step from the lexer/tokenizer.
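To illustrate the ordering described in this comment, here is a minimal sketch in Python. It is not DMD's code: the function name and the toy comment-dropping "lexer" are assumptions made for the example. It only shows why validating the whole buffer as UTF-8 before tokenizing rejects a stray KOI8-R byte even when that byte sits in a comment the lexer would later discard.

```python
def validate_then_lex(source: bytes) -> list:
    """Toy two-step pipeline: decode the whole file first, then 'lex' it."""
    # Step 1: whole-file UTF-8 validation. This sees every byte, including
    # bytes inside comments, because comments are not recognized yet.
    text = source.decode("utf-8")
    # Step 2: a stand-in lexer that simply drops '//' line comments.
    return [line for line in text.splitlines() if not line.startswith("//")]


ascii_only = b"// plain comment\nvoid main() {}\n"
koi8_comment = "// проба\n".encode("koi8-r") + b"void main() {}\n"

print(validate_then_lex(ascii_only))  # the comment line is dropped

try:
    validate_then_lex(koi8_comment)
except UnicodeDecodeError as err:
    # Validation fails before the lexer gets a chance to skip the comment.
    print("rejected:", err.reason)
```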
Comment #2 by ketmar — 2014-09-21T01:28:33Z
(In reply to Vladimir Panteleev from comment #1)
> OMG, someone still uses KOI-8?

yes, it's me. ;-)

> This is probably a WONTFIX because it conflicts with the D spec. DMD does
> not allow invalid UTF-8 even in comments, because Unicode conversion is done
> as a separate step from the lexer/tokenizer.

actually, it's a very simple patch to lexer (i already did that). but if dmd forbids a perfectly valid shebang… well, i'm sure that D is not in the position to change existing standards. if dmd can't compile valid code… it's not a bug in users' locale, it's THE bug in dmd.

let's redefine POSIX then, POSIX is not cool. let's dictate what usernames are allowed. let's dictate what paths are allowed. and so on. D über alles, fsck standards!
Comment #3 by dlang-bugzilla — 2014-09-21T01:32:39Z
DMD is not going to change existing standards, but it can choose to not follow them. After all, you don't expect to have a working KOI-8 shebang on a UTF-16 source file?

You can work around this issue as follows:

    sudo ln -s /opt/dmd/пробы /opt/dmd/tests

then using #!/opt/dmd/tests/rdmd as your shebang.
Comment #4 by ketmar — 2014-09-21T01:49:04Z
i'm also not expecting correct EBCDIC decoding. but it's not UTF-16 file, and adhering to the standard is easy in this case: just stop validating things that should not be validated. i.e. either kill shebang feature entirely or do it right.

and yes, trying to validate comments drives me mad too. i mean hey, this is comment, just skip it and allow me to write any BS there.

i know how i can workaround this, but i completely refuse to understand why this workaround is necessary in the first place. it's complete nonsense. yes, it's a very minor issue, but i want this bug to be officially fixed or marked as WONTFIX to clarify some of my inner thoughts.
Comment #5 by hsteoh — 2014-09-21T03:21:40Z
Actually, a deeper underlying issue that is being assumed, not just by dmd but by much of druntime/phobos that interfaces with the outside world, is that system-level things like filenames are UTF-8 encoded. While it's perfectly fine to do everything only in Unicode internally in D programs, this ultimately unfounded assumption can cause problems, e.g., if the filesystem uses a non-utf8 encoding, or if the program is (hypothetically) running on an EBCDIC machine, or if the D program has to interface with non-Unicode legacy programs. For example, writeln assumes the target terminal understands utf8, which may not necessarily be true.
Comment #6 by dfj1esp02 — 2014-09-21T11:59:56Z
(In reply to Ketmar Dark from comment #4)
> i'm also not expecting correct EBCDIC decoding. but it's not UTF-16 file,
> and adhering to the standard is easy in this case: just stop validating things
> that should not be validated.

AFAIK, the standard text encoding on posix today is utf-8, so D adheres to this standard.

> i.e. either kill shebang feature entirely or do it right.

Shebang is sort of brittle by design. It works only for text files (which doesn't always hold) and if the text file encoding matches that of your system. If both conditions don't hold, you should find another way, like finding executable by file extension - that works independently of file content.
Comment #7 by ketmar — 2014-09-21T14:49:44Z
(In reply to Sobirari Muhomori from comment #6)
> AFAIK, the standard text encoding on posix today is utf-8

oh, and for what reason we have that strange "locale settings" then? also, can you point me at the exact standard part which tells that text encoding is utf-8 regardless to current locale settings? or the part that tells anything about text encoding for that matter. and no, GNU/Linux is *not* The New Standard Maker.

> Shebang is sort of brittle by design. It works only for text files

WUT?! O_O it works perfectly for *any* type of file. it's completely ok to place binary data after shebang if interpreter can cope with that.

> and if the text file encoding matches that of your system.

and the given example matches. yet dmd refuses to compile my sample. not *run*, but *compile*.

the right shebang support in dmd must be like this: check if the first chars of the file forms shebang, and if they are, then just skipping other chars until '\n'. and skip '\n'. that's all. no validation. no martian logic. just skipping chars.
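The rule proposed in this comment can be sketched as follows. This is an illustrative Python version, not the attached patch and not DMD's lexer; `strip_shebang` and `decode_source` are hypothetical names. It assumes the file is handled as raw bytes, so the shebang line is dropped before any UTF-8 validation touches it.

```python
def strip_shebang(source: bytes) -> bytes:
    """Drop a leading '#!' line (including its newline) without decoding it."""
    if not source.startswith(b"#!"):
        return source
    newline = source.find(b"\n")
    if newline == -1:
        # The whole file is a shebang line; nothing is left to lex.
        return b""
    return source[newline + 1:]


def decode_source(source: bytes) -> str:
    """Validate only the part of the file after the shebang as UTF-8."""
    return strip_shebang(source).decode("utf-8")


# A shebang whose path is KOI8-R encoded (the path from comment #3),
# followed by plain ASCII code.
sample = "#!/opt/dmd/пробы/rdmd\n".encode("koi8-r") + b"void main() {}\n"
print(decode_source(sample))
```

A real lexer would presumably still count the skipped line so that diagnostics report correct line numbers; the sketch leaves that out.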
Comment #8 by ketmar — 2014-09-21T23:42:45Z
Created attachment 1431
proposed fix

just for completeness sake.
Comment #9 by dfj1esp02 — 2014-09-22T11:04:39Z
(In reply to Ketmar Dark from comment #7)
> (In reply to Sobirari Muhomori from comment #6)
> > AFAIK, the standard text encoding on posix today is utf-8
> oh, and for what reason we have that strange "locale settings" then?

AFAIK, locale defines time, number formats and user language for localization. It's orthogonal to text encoding.

> also, can you point me at the exact standard part which tells that text encoding
> is utf-8 regardless to current locale settings?

posix is not very strict with standardization, it only roughly describes what can be done and how. After all, it's not really a standard, but just written down de facto conventions, which established some other way.

> > Shebang is sort of brittle by design. It works only for text files
> WUT?! O_O it works perfectly for *any* type of file.

If it would work perfectly for any type of file, you wouldn't report this problem in the first place as everything would just work.

> it's completely ok to
> place binary data after shebang if interpreter can cope with that.

Binary data formats are not that flexible. And if interpreter is sufficiently smart, it can cope with various text encodings too.

> > and if the text file encoding matches that of your system.
> and the given example matches. yet dmd refuses to compile my sample. not
> *run*, but *compile*.

utf-8 matches koi8 only in ascii range. If you use only ascii, it should work.

> the right shebang support in dmd must be like this: check if the first chars
> of the file forms shebang, and if they are, then just skipping other chars
> until '\n'. and skip '\n'. that's all. no validation. no martian logic. just
> skipping chars.

D source is a text file, and text files have single encoding. Having variable encoding contradicts usual logic of text files.
Comment #10 by ketmar — 2014-09-22T17:25:20Z
(In reply to Sobirari Muhomori from comment #9)
> posix is not very strict with standardization, it only roughly describes
> what can be done and how. After all, it's not really a standard, but just
> written down de facto conventions, which established some other way.

WUT?! sorry, i don't want to speak with trolls. bye.
Comment #11 by andrei — 2014-09-22T17:50:54Z
Not sure what best to do about this. I'd say if #! is detected, the first line should be just scanned through the first \n and ignored. In a way the semantics of the shebang line is determined by the environment. Regular scanning shouldn't be affected.
Comment #12 by ketmar — 2014-09-22T18:34:24Z
(In reply to Andrei Alexandrescu from comment #11)
> Not sure what best to do about this. I'd say if #! is detected, the first
> line should be just scanned through the first \n and ignored. In a way the
> semantics of the shebang line is determined by the environment. Regular
> scanning shouldn't be affected.

my attached patch does right that: it just skips shebang line if it is found and not changing other lexing code. and it mostly consists of deleted lines, so we now have less code to test! ;-)
Comment #13 by andrei — 2014-09-22T22:57:32Z
(In reply to Ketmar Dark from comment #12)
> (In reply to Andrei Alexandrescu from comment #11)
> > Not sure what best to do about this. I'd say if #! is detected, the first
> > line should be just scanned through the first \n and ignored. In a way the
> > semantics of the shebang line is determined by the environment. Regular
> > scanning shouldn't be affected.
> my attached patch does right that: it just skips shebang line if it is found
> and not changing other lexing code. and it mostly consists of deleted lines,
> so we now have less code to test! ;-)

Sounds good. Did you convert it to a pull request?
Comment #14 by ketmar — 2014-09-23T00:50:33Z
(In reply to Andrei Alexandrescu from comment #13)
> Sounds good. Did you convert it to a pull request?

no. i'm not using github, sorry.
Comment #15 by dfj1esp02 — 2014-09-23T11:23:30Z
(In reply to Andrei Alexandrescu from comment #11)
> Not sure what best to do about this. I'd say if #! is detected, the first
> line should be just scanned through the first \n and ignored. In a way the
> semantics of the shebang line is determined by the environment. Regular
> scanning shouldn't be affected.

There were two other requests for full support for legacy encodings. If such support is introduced, it should probably extend to the entire source code. It may be not in a language standard, just a compiler vendor-specific extension. Maybe a compilation option in dmd build script.
Comment #16 by dfj1esp02 — 2015-02-25T15:39:59Z
BTW, java supports shebang like this: the runner extracts the following code, compiles and runs it. If the cached compiled code is newer than the script, the runner just runs the executable. Something like this can be written for D too to support variable text encoding.
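A rough sketch of such a runner's two building blocks, written in Python for illustration only. Everything here is an assumption: `needs_rebuild` and `extract_body` are hypothetical names, the cache is a plain file next to the script, and the script's legacy encoding is passed in explicitly rather than detected. It shows the "cached binary newer than script" check and how the code after the shebang could be re-encoded to UTF-8 before it is handed to a compiler.

```python
import os
import tempfile


def needs_rebuild(script_path: str, cached_path: str) -> bool:
    """Rebuild unless a cached binary exists and is newer than the script."""
    if not os.path.exists(cached_path):
        return True
    return os.path.getmtime(cached_path) <= os.path.getmtime(script_path)


def extract_body(script: bytes, encoding: str) -> bytes:
    """Drop the shebang line, then re-encode the rest as UTF-8."""
    if script.startswith(b"#!"):
        newline = script.find(b"\n")
        script = b"" if newline == -1 else script[newline + 1:]
    return script.decode(encoding).encode("utf-8")


with tempfile.TemporaryDirectory() as workdir:
    script = os.path.join(workdir, "hello.d")
    cached = os.path.join(workdir, "hello.bin")
    with open(script, "wb") as f:
        f.write("#!/opt/dmd/пробы/rdmd\n// проба\nvoid main() {}\n".encode("koi8-r"))

    print(needs_rebuild(script, cached))  # no cached binary yet
    with open(cached, "wb") as f:
        f.write(b"")  # stand-in for real compiler output
    # Force known timestamps so the cache is newer than the script.
    os.utime(script, (1_000_000_000, 1_000_000_000))
    os.utime(cached, (2_000_000_000, 2_000_000_000))
    print(needs_rebuild(script, cached))
    with open(script, "rb") as f:
        print(extract_body(f.read(), "koi8-r").decode("utf-8"))
```

Invoking the compiler and executing the cached binary are deliberately left out, since those details depend on the toolchain.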
Comment #17 by dlang-bugzilla — 2017-07-02T15:13:14Z
Comment #18 by github-bugzilla — 2017-07-03T00:40:08Z
Commits pushed to master at https://github.com/dlang/dmd

https://github.com/dlang/dmd/commit/9f50d033696d686f00527a8b5f8efbb358fc2245
Fix Issue 13512 - Allow non-UTF-8 encoding in shebang line
Adapted from https://issues.dlang.org/attachment.cgi?id=1431&action=diff

https://github.com/dlang/dmd/commit/c25d606e7d8db7ed36218328eb37853c79902f39
Add test case for issue 13512
From https://issues.dlang.org/attachment.cgi?id=1430

https://github.com/dlang/dmd/commit/48d5ef139b4d1aa874a3094bcccd16114c3f3349
Merge pull request #6959 from CyberShadow/pull-20170702-145440
Fix Issue 13512 - Allow non-UTF-8 encoding in shebang line
merged-on-behalf-of: Andrei Alexandrescu
Comment #19 by github-bugzilla — 2017-08-07T13:17:09Z
Commits pushed to newCTFE at https://github.com/dlang/dmd

https://github.com/dlang/dmd/commit/9f50d033696d686f00527a8b5f8efbb358fc2245
Fix Issue 13512 - Allow non-UTF-8 encoding in shebang line

https://github.com/dlang/dmd/commit/c25d606e7d8db7ed36218328eb37853c79902f39
Add test case for issue 13512

https://github.com/dlang/dmd/commit/48d5ef139b4d1aa874a3094bcccd16114c3f3349
Merge pull request #6959 from CyberShadow/pull-20170702-145440
Comment #20 by github-bugzilla — 2017-08-16T13:23:09Z
Commits pushed to stable at https://github.com/dlang/dmd

https://github.com/dlang/dmd/commit/9f50d033696d686f00527a8b5f8efbb358fc2245
Fix Issue 13512 - Allow non-UTF-8 encoding in shebang line

https://github.com/dlang/dmd/commit/c25d606e7d8db7ed36218328eb37853c79902f39
Add test case for issue 13512

https://github.com/dlang/dmd/commit/48d5ef139b4d1aa874a3094bcccd16114c3f3349
Merge pull request #6959 from CyberShadow/pull-20170702-145440
Fix Issue 13512 - Allow non-UTF-8 encoding in shebang line
merged-on-behalf-of: Andrei Alexandrescu