Comment #0 by heiko.honrath — 2015-11-05T20:25:27Z
Created attachment 1561
ASCII-textfile which throws UTF8 exception
File().byLine().map!toUpper throws UnicodeException@src\rt\util\utf.d(290): invalid UTF-8 sequence
on pure ASCII file
.map!toLower throws at a different location with the same file
If neither toLower nor toUpper are used, the file is read completely.
program generator.d:
module generator;
// 2015-11-05
// dmd2.069
// built with: rdmd --build-only --force -version=WindowsXP -version=Unicode -release -O generator.d
// to reproduce run: generator gentest.txt
import std.stdio, std.algorithm, std.string, std.format;
void main(string[] args) {
int line_number = 1;
auto gen = File(args.length > 1 ? args[1] : "gentest.txt")
.byLine()
.map!chomp
.map!toUpper
//~ .map!toLower
.map!(line => format("Line %4s: %s", line_number++, line));
writefln("%-(%s\n%)", gen);
}
Comment #1 by heiko.honrath — 2015-11-05T20:27:32Z
Created attachment 1562
The D source code generator.d
The D source code file generator.d
Comment #2 by ag0aep6g — 2015-11-05T22:57:08Z
A first reduction:
----
static import std.file;
import std.stdio: File;
import std.string: toUpper;
void main()
{
std.file.write("gentest.txt", "aaaaaaaa\naaaaaaa\n00000000a");
foreach(ln; File("gentest.txt").byLine())
{
toUpper(ln);
}
}
----
Works with 2.068.2.
Comment #3 by heiko.honrath — 2015-11-06T12:22:37Z
Thanks for this reduction!
It also crashes with:
static import std.file;
import std.stdio: File, writeln;
import std.string: toUpper, toLower;
void main()
{
std.file.write("gen.txt", "12345678\n1234567\n12345678a");
foreach(ln; File("gen.txt").byLine())
{
writeln(ln);
writeln("> ", toLower(ln));
}
foreach(ln; File("gen.txt").byLine())
{
writeln(ln);
writeln("> ", toUpper(ln));
}
}
If the last character is lowercase then toUpper crashes.
If it is uppercase then toLower crashes.
Comment #4 by heiko.honrath — 2015-11-06T14:59:57Z
The shortest reductions I could find are with:
std.file.write("gen.txt", "12\n\n1x\n");
or
std.file.write("gen.txt", "1\n\n1x");
So it does not depend on the lacking \n at the end.
Comment #5 by ag0aep6g — 2015-11-06T17:01:11Z
Reduced further:
----
static import std.file;
import std.stdio: File, writeln;
void main()
{
std.file.write("gen.txt", "a\n\naa");
auto file = File("gen.txt");
char[] buffer;
char[] line;
file.readln(buffer, '\n');
line = buffer;
file.readln(line, '\n');
line = buffer;
file.readln(line, '\n');
writeln(line.capacity); /* "0" */
writeln(line[0 .. 1].capacity); /* "255" -- nonsense */
}
----
This reminds me more and more of issue 13856.
(In reply to ag0aep6g from comment #5)
> writeln(line.capacity); /* "0" */
> writeln(line[0 .. 1].capacity); /* "255" -- nonsense */
Not nonsense.
When you are passing in empty buffer, it is allocated to 256 bytes with a capacity of 255.
The second call is given a buffer which readln can see that it has full control over (it has capacity, which means all data after it is unused). Since passing a buffer into readln disowns the data (readln now owns and can resize the used portion down if necessary), the error is continuing to use 'buffer' as the buffer -- buffer contains 2 bytes, the second byte is essentially garbage after assumeSafeAppend is called since it's not part of the capacity. readln is meant to be passed in the same buffer over and over, this is why it uses assumeSafeAppend.
Doing that, the final readln call sees that it has no capacity, so it treats this as an un-extendable buffer. It can't resize the buffer, nor claim that some of the buffer space is now usable (if the full line fits in less bytes). Basically, it treats this like a stack buffer.
On the last read, the line is shorter, but assumeSafeAppend isn't called because it doesn't know where the buffer comes from!
> This reminds me more and more of issue 13856.
That issue was fixed by the addition of the ReadlnAppender.
Also, I'll note that on my test of using dmd 2.069.0 or 2.069.1 on Macos X, I do not get an error with the first reduction, or HeiHon's additional issue. The changes you have in your pull are not windows specific, so I'm not sure if they are required to fix this issue.
Comment #8 by ag0aep6g — 2015-11-12T15:56:59Z
(In reply to Steven Schveighoffer from comment #7)
> Since passing a buffer into readln disowns the data (readln now owns and can
> resize the used portion down if necessary), the error is continuing to use
> 'buffer' as the buffer -- buffer contains 2 bytes, the second byte is
> essentially garbage after assumeSafeAppend is called since it's not part of
> the capacity. readln is meant to be passed in the same buffer over and over,
> this is why it uses assumeSafeAppend.
Sorry, but that's ridiculous. It's unnecessarily unsafe, overly complicated, and not documented. The other readlnImpl variants don't work like that, and it apparently wasn't thought of when byLine was implemented.
If readln wants to take the same buffer over and over, then it should not shrink it, but return a slice of it. But breaking the API is probably not an option now. So we're stuck with either 1) fundamentally flawed, surprising behavior, or 2) more allocations when readln is used naively.
In my opinion, 2 is the way to go.
Also, byLine is nicer anyway, and doesn't have the problem. And if std.io ever gets done, we have an opportunity not to fall in the same trap.
Comment #9 by schveiguy — 2015-11-12T16:25:30Z
The behavior is expected, and actually intentional.
The original behavior was that regardless of current capacity, if a GC buffer was passed, ALL the space was claimed by readln.
In this case, what is passed is not a buffer that ends on an appendable boundary, and so it chooses to do nothing as far as updating the appendability (the right choice).
The reason it was done this way is to allow for a common usage of readln (namely, reading into a buffer over and over again) to continue to have good performance. If we took the "safe" route (and believe me, it was debated and considered), then this code would reallocate every time a larger line was read. When it was originally written, this is what byLine did, but now it's changed.
> The other readlnImpl variants don't work like that
looking at it now, the other variants do exactly as I said above -- they reallocate when the line is longer.
So technically we could move to this, we just have to accept a drop in performance for the code.
Have you tested the performance of your new code vs. the current implementation?
Comment #10 by heiko.honrath — 2015-11-12T16:43:51Z
(In reply to Steven Schveighoffer from comment #9)
> When it was originally written, this is what byLine did, but now it's
> changed.
So probably the change to byLine introduced the problem?
I tested my examples on 2.068.2 and some versions before.
No problem there on Windows.
Also no problem with byLineCopy - neither with 2.068.2 nor 2.069.
Comment #11 by schveiguy — 2015-11-12T16:53:57Z
(In reply to HeiHon from comment #10)
> (In reply to Steven Schveighoffer from comment #9)
> > When it was originally written, this is what byLine did, but now it's
> > changed.
>
> So probably the change to byLine introduced the problem?
No, what I meant by that is that byLine simply called readln over and over with the same buffer. So killing the performance of this mechanism would kill the performance of byLine.
However, now byLine always passes in the same buffer without caring what readln did. So the concern that made this whole thing complicated sort of went away :)
> I tested my examples on 2.068.2 and some versions before.
> No problem there on Windows.
The readln updates were released in 2.069, so this makes sense.
Comment #12 by ag0aep6g — 2015-11-12T16:55:12Z
(In reply to Steven Schveighoffer from comment #9)
> The reason it was done this way is to allow for a common usage of readln
> (namely, reading into a buffer over and over again) to continue to have good
> performance.
I understand that, and I think that goal is not compatible with having sane behavior. In my opinion, calling assumeSafeAppend on a smaller slice is not acceptable when larger slices may exist in the outside world.
> If we took the "safe" route (and believe me, it was debated and
> considered), then this code would reallocate every time a larger line was
> read. When it was originally written, this is what byLine did, but now it's
> changed.
>
> > The other readlnImpl variants don't work like that
>
> looking at it now, the other variants do exactly as I said above -- they
> reallocate when the line is longer.
Which is simply the right thing to do. With the API of readln I think we just have to accept the allocations.
> So technically we could move to this, we just have to accept a drop in
> performance for the code.
>
> Have you tested the performance of your new code vs. the current
> implementation?
No. I'm sure there is going to be a performance drop. I don't think that it matters, though. Correctness beats speed.
It also only affects naive readln calls on Windows. byLine is already cautious about what exactly it gives readln as a buffer, so it shouldn't get slower. And on other OSs readnl already does the excessive allocations.
Comment #13 by schveiguy — 2015-11-12T18:16:41Z
(In reply to ag0aep6g from comment #12)
> It also only affects naive readln calls on Windows. byLine is already
> cautious about what exactly it gives readln as a buffer, so it shouldn't get
> slower. And on other OSs readnl already does the excessive allocations.
I think you're right. It seems that the naive loop is simply bad code, and we shouldn't cater to it. It also makes using a stack buffer as a starting point very unwieldy. Since byLine is no longer affected by such a change, I think we should do it. There is already an example in the code to show how to properly use readln in a loop (added as a result of the previous update).
Thanks
Comment #14 by github-bugzilla — 2015-11-13T15:40:44Z
Commits pushed to stable at https://github.com/D-Programming-Language/phoboshttps://github.com/D-Programming-Language/phobos/commit/994d6b81815bc70ab8507ece7285b49ef5ce6d2d
fix issue 15293
ReadlnAppender tried to claim the capacity of the passed buffer, calling
assumeSafeAppend on the result so that on the next call it has a capacity
again that can be claimed.
The obvious problem with that: readln would stomp over memory that it has
not been given.
There was also a subtler problem with it (which caused issue 15293):
When readln wasn't called with the previous line, but with the original
buffer (byLine does that), then the passed buffer had no capacity, so
ReadlnAppender would not assumeSafeAppend when slicing the new line from
it. But without a new assumeSafeAppend, the last one would still be in
effect, possibly on a sub slice of the new line.
https://github.com/D-Programming-Language/phobos/commit/fc77dbbfa93d126c5dfec7c03cc8939b819c09a9
Merge pull request #3802 from aG0aep6G/15293
fix issue 15293
Comment #15 by github-bugzilla — 2016-01-03T14:13:06Z