Bug 1985 – import expression should return ubyte[] not string

Status
REOPENED
Severity
normal
Priority
P3
Component
dmd
Product
D
Version
D2
Platform
All
OS
All
Creation time
2008-04-10T20:57:19Z
Last change time
2024-12-13T17:48:24Z
Keywords
spec
Assigned to
No Owner
Creator
Bill Baxter
Moved to GitHub: dmd#17694 →

Comments

Comment #0 by wbaxter — 2008-04-10T20:57:19Z
The compiler does not convert the encoding of the imported files to utf8, so it should not pretend that it knows the contents of the file are utf8. In fact, probably one of the most practically useful applications of the import expression is to import binary files, which it is impossible for the compiler to put in utf8 format. So the conclusion is that import("foo.dat") should evaluate to ubyte[], not char[]. It can be cast to char[] if the developer happens to know that the file is, in fact, text. Currently the situation is reversed -- the data must be cast to ubyte[] if the developer knows it is not, in fact, utf8 text.
Comment #1 by caron800 — 2008-04-11T02:04:27Z
On 11/04/2008, [email protected] <[email protected]> wrote: > > import expression should return ubyte[] not string Shouldn't it return invariant(ubyte)[] ?
Comment #2 by caron800 — 2008-04-11T02:20:39Z
On 11/04/2008, Bill Baxter <[email protected]> wrote: > > Shouldn't it return invariant(ubyte)[] ? > > In D2 probably so, but I filed the bug against D1.028. Makes sense. I suppose in D2 it depends on the answer to the following question. If I write auto a = import("filename"); auto b = import("filename"); do we get two separate copies, or do a and b both point to the same memory? If the latter, it should definitely be invariant(ubyte)[] in D2, though as you say, ubyte[] in D1.
Comment #3 by aarti — 2008-04-11T03:24:48Z
Yes, I also think that implying that imported file is char[] is not best decision. Maybe even imported type should be void[], so that it must be explicitly casted to proper type.
Comment #4 by caron800 — 2008-04-11T03:32:39Z
On 11/04/2008, [email protected] <[email protected]> wrote: > Maybe even imported type should be void[], so that it must be explicitly casted > to proper type. No, it should be ubyte. The reason is that void arrays can contain pointers, and ubyte arrays can't. A void array means that the garbage collector has to scan it, looking for anything that looks like it might be an address, and if it finds such a collection of bits by accident, then something will be marked as "in use", that actually isn't. If the array came from a file, it can't very well have meaningful pointers into RAM, so I agree with the original poster that it should be ubyte[] for D1, and I would add invariant(ubyte)[] for D2.
Comment #5 by aarti — 2008-04-11T06:16:31Z
Answer to comment #4 This isn't (direct) argument against my proposal. It would be if GC scanning void[] is fundamentally "good thing". But in fact I don't think that in this case it is such kind of design decision. (There were already proposals to change this behavior.) But in case of reading external files, the problem is that compiler just *don't know* format of imported file. So IMHO the best thing to do is to reflect this situation in language, and force user to cast content of file to real type. In case there will stay default cast to some type in import it is really difficult to justify which default behavior is better. I agree with Bill that in most GUI application it would be better to have imported array of bytes. But you can also think about DB framework in which external file is used to define schema of database (see: U++ framework written in C++). In this case some text format is much more natural...
Comment #6 by wbaxter — 2008-04-11T07:36:10Z
(In reply to comment #5) > Answer to comment #4 > > This isn't (direct) argument against my proposal. It would be if GC scanning > void[] is fundamentally "good thing". But in fact I don't think that in this > case it is such kind of design decision. (There were already proposals to > change this behavior.) > > But in case of reading external files, the problem is that compiler just *don't > know* format of imported file. So IMHO the best thing to do is to reflect this > situation in language, and force user to cast content of file to real type. > > In case there will stay default cast to some type in import it is really > difficult to justify which default behavior is better. I agree with Bill that > in most GUI application it would be better to have imported array of bytes. But > you can also think about DB framework in which external file is used to define > schema of database (see: U++ framework written in C++). In this case some text > format is much more natural... > Your argument is right on, but ubyte[] *is* the type that means "I don't know what the heck this data is, but it's just data, not pointers". void[] means "I don't know what the heck this data is, it might even be full of pointers". I don't see any way that file that is on disk at *compile time* could contain pointers that are relevant to the program later on at *run time*. So ubyte[] is the proper type.
Comment #7 by aarti — 2008-04-11T09:20:21Z
> Your argument is right on, but ubyte[] *is* the type that means "I don't know > what the heck this data is, but it's just data, not pointers". void[] means "I > don't know what the heck this data is, it might even be full of pointers". > I don't see any way that file that is on disk at *compile time* could contain > pointers that are relevant to the program later on at *run time*. So ubyte[] > is the proper type. I think I understand your way of thinking. But the problem here is different. With void[] you can not do anything, so you have to cast. With ubyte[] you can use data as they are, because they have in fact specific type. But what if file contains array of int's? In such a case you are in exactly same situation as with char[]. Compiler chooses one type to which it casts by default, exactly like currently with char[]. And this choice can be wrong. If it is good or wrong depends only on application. In GUI ubyte[] is more appropriate (e.g. importing icon), but in DB framework string is better (importing db schema). Having written this all I slowly get to conclusion that current situation is not so bad :-) char[] would be usually much more appropriate for lower layers in application than ubyte[] (mostly for loading and compiling domain languages). Alternative with void[] is IMHO theoretically better, but practically you will have even more casts in you code... Tough decision... :-)
Comment #8 by wbaxter — 2008-04-11T15:18:03Z
(In reply to comment #7) > I think I understand your way of thinking. But the problem here is > different. With void[] you can not do anything, so you have to cast. > With ubyte[] you can use data as they are, because they have in fact > specific type. Well, it's a specific type that represents raw memory. But I get your point too. If there were a void_without_pointers type that couldn't be used without casting, then I'd be happy for import("file") to use that. But with void[] getting scanned by the gc for pointers it's a no-go. If you do just one import("big_binary_file") it could add enough false pointers that your app would hold on to big gobs of memory it doesn't need to.
Comment #9 by code — 2012-05-27T03:16:35Z
UTF validation and conversions can be done at compile time, although the interpreter is currently a little slow for this.
Comment #10 by hsteoh — 2018-01-16T20:09:12Z
The compiler has no way to know whether an imported file is even text to begin with. For all you know, the code could be: ```` auto logo = import("logo.png"); // embed image inside executable ```` It's not the compiler's job to decode the contents of arbitrary files. Let import(filename) return ubyte[], and let the user code decide what to do with it.
Comment #11 by dfj1esp02 — 2018-03-06T18:33:34Z
I think it's better to replace import expression with intrinsic, say, importText (like std.file.readText). This will also reduce grammar complexity and remove second and rarely used meaning from the import keyword. Due to how rarely import expression is used it confuses newcomers that are used only to it's usage as module import statement.
Comment #12 by razvan.nitu1305 — 2021-08-03T08:53:14Z
1. The primary purpose of the import keyword is to import other modules. Since its meaning is tied to importing text files, it makes sense to return a string. If you want to read files of arbitrary types, std.file.read should be used. 2. This particular semantic of the import keyword is not widely spread. I, personally, have never seen it in any of the D projects that I contributed to. So, since this is not a widely used feature and the import semantic is tied to text file, I will close this as WONTFIX.
Comment #13 by ag0aep6g — 2021-08-03T10:41:45Z
(In reply to RazvanN from comment #12) > 1. The primary purpose of the import keyword is to import other modules. > Since its meaning is tied to importing text files, it makes sense to return > a string. If you want to read files of arbitrary types, std.file.read should > be used. `import ...;` isn't tied to text files in any meaningful way. It imports a module, which may or may not be stored as text. It doesn't yield a string. std.file.read cannot be used during compilation like `import(...)`. > 2. This particular semantic of the import keyword is not widely spread. I, > personally, have never seen it in any of the D projects that I contributed > to. I don't think you personally using a feature is a good metric for accepting issues.
Comment #14 by aldacron — 2021-08-03T10:50:58Z
This gets at the bytes, does it not? enum bytes = import("foo.png").representation; I don't think we should change the behavior of the import expression at this point when the above achieves the same result.
Comment #15 by ag0aep6g — 2021-08-03T11:38:47Z
(In reply to Mike Parker from comment #14) > This gets at the bytes, does it not? > > enum bytes = import("foo.png").representation; Sure. But it only works because DMD happily creates an invalid string from `import("foo.png")` which is unusable unless reinterpreted. Arguably, it should throw an error instead. > I don't think we should change the behavior of the import expression at this > point when the above achieves the same result. Having a workaround is not a reason to keep the current behavior. The one argument for not addressing the issue as requested is breakage. If the type of `import(...)` is changed to ubyte[], the current uses will have to be updated. But if RazvanN's assertion that the feature isn't being used all that much is correct, then breakage doesn't matter all that much, does it?
Comment #16 by aldacron — 2021-08-03T11:56:47Z
Using 'representation' is not a workaround, anymore than would be converting a `ubyte[]` to a string. No matter what the expression returns, it will need to be converted to the format you need it in if that isn't the one you get. And I have no idea how often it's used in the wild. But it's been the way it is for years and it seems arbitrary to change it now. Given that we have a way to easily get a `ubyte[]` from it, I just don't see that changing its behavior is worth the potential breakage. I do agree that an array of bytes would be more sensible, but I just think that ship has sailed. What would be interesting would be to enhance it such that a type can be specified as part of the expression. Maybe something like `import("foo.png", ubyte[])`.
Comment #17 by ag0aep6g — 2021-08-03T12:56:12Z
(In reply to Mike Parker from comment #16) > Using 'representation' is not a workaround, anymore than would be converting > a `ubyte[]` to a string. No matter what the expression returns, it will need > to be converted to the format you need it in if that isn't the one you get. I don't care what you call it. My "workaround" is your "that's just way to do it". Point is: Having another way to get the desired type doesn't justify `import(...)` returning the other one. This goes both ways. Returning string or ubyte[] has to be justified in other ways. > And I have no idea how often it's used in the wild. But it's been the way it > is for years and it seems arbitrary to change it now. Given that we have a > way to easily get a `ubyte[]` from it, I just don't see that changing its > behavior is worth the potential breakage. Age ("it's been that way for years") is not a reason to not fix an issue. We're still left with breakage as the one reason to not fix this. And that's a vague one. You guys either "have no idea" how widely it is used, or you (RazvanN) believe that it is not used widely. On what basis do you assert that breakage would be too large? As I see it, RazvanN has been looking for old issues that can be closed easily. And now you're working backwards to justify the questionable decision. > I do agree that an array of bytes would be more sensible, but I just think > that ship has sailed. What would be interesting would be to enhance it such > that a type can be specified as part of the expression. Maybe something like > `import("foo.png", ubyte[])`. I can't say I like it, but it looks like a workable compromise. Special-casing `cast(...) import(...)` in the spec could also work.
Comment #18 by razvan.nitu1305 — 2021-08-03T13:17:17Z
(In reply to ag0aep6g from comment #17) > (In reply to Mike Parker from comment #16) > I don't care what you call it. My "workaround" is your "that's just way to > do it". Point is: Having another way to get the desired type doesn't justify > `import(...)` returning the other one. This goes both ways. Returning string > or ubyte[] has to be justified in other ways. > > > > Age ("it's been that way for years") is not a reason to not fix an issue. > This is not a bug, as the spec is perfectly clear on what an import expression returns [1]. It is at most an enhancement report. [1] https://dlang.org/spec/expression.html#import_expressions > We're still left with breakage as the one reason to not fix this. And that's > a vague one. You guys either "have no idea" how widely it is used, or you > (RazvanN) believe that it is not used widely. On what basis do you assert > that breakage would be too large? > I haven't seen a single use of it in 5 years of roaming around D repos. That doesn't mean it is not used at all. It means that it is an obscure feature, for which there are more flexible alternatives. Therefore, changing this amounts to making a low impact change that risks on breaking some code for (as much as I can tell) no benefit at all. > As I see it, RazvanN has been looking for old issues that can be closed > easily. And now you're working backwards to justify the questionable > decision. Adding a fix for this in the compiler is a trivial task. That is not the point here, but rather not to change something that has low impact, has library alternatives and might be used by people as is. > > > > I do agree that an array of bytes would be more sensible, but I just think > > that ship has sailed. What would be interesting would be to enhance it such > > that a type can be specified as part of the expression. Maybe something like > > `import("foo.png", ubyte[])`. > > I can't say I like it, but it looks like a workable compromise. > Special-casing `cast(...) import(...)` in the spec could also work. The spec is perfectly clear that a string literal is returned.
Comment #19 by aldacron — 2021-08-03T13:19:45Z
(In reply to ag0aep6g from comment #17) > Age ("it's been that way for years") is not a reason to not fix an issue. Perhaps if the issue is a bug, e.g., unintended behavior, but that is not the case here. The longer a feature has existed, working as intended, the bigger the chance people will have come to depend on it. > > We're still left with breakage as the one reason to not fix this. And that's > a vague one. You guys either "have no idea" how widely it is used, or you > (RazvanN) believe that it is not used widely. On what basis do you assert > that breakage would be too large? I didn't say it would be large. I said there is a *potential* for breakage, and that's based on the fact that the feature has been around for a very long time. I know that Walter and Atila generally don't want to break code unless there is a strong benefit to doing so. The fact that .representation is available minimizes the benefit. So the onus is on someone who wants this feature badly enough to come up with a benefit that justifies the potential breakage and outweighs the existing alternatives. That goes for changing any existing feature. > > As I see it, RazvanN has been looking for old issues that can be closed > easily. And now you're working backwards to justify the questionable > decision. I'm not "working backwards". I genuinely believe it shouldn't be changed. Regardless, it's ultimately not up to me and Razvan. If Walter and Atila think it's a good idea, that's all that matters. I do think Razvan is justified in closing the issue, though, as I would expect this sort of change to require a DIP, a preview switch, and a deprecation cycle. At the very least, someone should scan the dub repository to get an idea of how much this is used in the wild. And D shops should be queried on their use of it. I've personally used it for code that's not on github (embedding an image), and I had no problem using .representation. Didn't think twice about it. I also would have no problem changing my code if the proposed change were implemented, but I know there are plenty of D users who are generally averse to any sort of breakage. But please, if you feel strongly enough about it, take it to the forums and see what people think. Walter and/or Atila would probably weigh in on such a discussion.
Comment #20 by ag0aep6g — 2021-08-03T13:42:54Z
(In reply to RazvanN from comment #18) > (In reply to ag0aep6g from comment #17) [...] > > Age ("it's been that way for years") is not a reason to not fix an issue. > > > > This is not a bug, as the spec is perfectly clear on what an import > expression returns [1]. It is at most an enhancement report. I didn't write "bug", I wrote "issue". This issue is categorized as a bug, though. You didn't change that. [...] > Adding a fix for this in the compiler is a trivial task. That is not the > point here, but rather not to change something that has low impact, has > library alternatives and might be used by people as is. There is a difference between prioritizing high impact changes and outright refusing to do or even accept low impact changes. The former is fine. The latter is declaring bankruptcy. [...] > > I can't say I like it, but it looks like a workable compromise. > > Special-casing `cast(...) import(...)` in the spec could also work. > > The spec is perfectly clear that a string literal is returned. Yeah, we all know that. But a way to get something else would be nice.
Comment #21 by ag0aep6g — 2021-08-03T14:08:07Z
(In reply to Mike Parker from comment #19) > I didn't say it would be large. I said there is a *potential* for breakage, > and that's based on the fact that the feature has been around for a very > long time. Ok, we're in agreement. Breakage generally gets worse with age. But age itself isn't relevant, breakage is. [...] > But please, if you feel strongly enough about it, take it to the forums and > see what people think. Walter and/or Atila would probably weigh in on such a > discussion. I can accept avoiding breakage as the reason to close this as WONTFIX. Please note that I was the one who brought it up. RazvanN didn't mention it when closing. I don't think either of you made a good case for closing, but I don't care enough about this issue in particular and D in general anymore to invest more energy. Note that I have not reopened the issue.
Comment #22 by hsteoh — 2021-08-03T16:02:47Z
I use this feature from time to time, and I disagree that the primary use is to import modules. That's confusing it with the other use of the `import` keyword which has nothing to do with this feature, which is to read a file at compile-time and process it at compile-time. You cannot use std.file.read at compile-time, BTW. One use case I have is to embed the contents of a file in a D program. (N.B. not to be confused with loading a file at runtime.) Since the compiler cannot possibly tell whether a file contains text or binary data (or, for that matter, non-UTF-8 text data), it should not assume the data can be sanely interpreted as `string`. Therefore, ubyte[] is the appropriate return type, not `string`. (Though admittedly, this can be remedied with a cast to ubyte[]. It's ugly, nonetheless.) Finally, I disagree with closing this bug simply because it's old. Walter himself often states that bugs should not be closed merely because they are old or have not been receiving attention. A bug is a bug, regardless of age or attention, and should be left open as long as the issue still exists. We should not optimizing on metrics in place of actually improving the language / compiler. Unless Walter/Atila has explicitly decided that this issue will not be fixed, it should remain open.
Comment #23 by robert.schadek — 2024-12-13T17:48:24Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/dmd/issues/17694 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB