Bug 17229 – File.byChunk (ubyte) w/ stdout.lockingTextWriter corrupts utf-8 data (and is very slow)

Status
RESOLVED
Resolution
FIXED
Severity
major
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
All
OS
All
Creation time
2017-02-27T08:40:24Z
Last change time
2018-01-05T13:27:25Z
Keywords
pull
Assigned to
No Owner
Creator
Jon Degenhardt

Comments

Comment #0 by jrdemail2000-dlang — 2017-02-27T08:40:24Z
Using File.byChunk to read and write with stdout.lockingTextWriter is very slow. Dramatically slower (15x) than the same activity with File.byLine. Not clear if there's real connection between File.byChunk and stdout.lockingTextWriter, but for other operations that read and access the data without writing File.byChunk is faster than File.byLine. ---Copy file byChunk code--- auto chunkedStream = filename.File.byChunk(1024*1024); auto stdoutWriter = stdout.lockingTextWriter; chunkedStream.each!(x => put(stdoutWriter, x)); ---Copy file byLine code--- auto chunkedStream = filename.File.byLine(Yes.keepTerminator); auto stdoutWriter = stdout.lockingTextWriter; chunkedStream.each!(x => put(stdoutWriter, x)); The above in a simple main program, copying a 2.7 GB, 14 million file has following times (ldc 1.1 -release -O -boundscheck=off): byLine: 2.09 seconds byChunk: 35.24 seconds A 17x delta. I tried a number of different formulations of the code, it had the same each time. Changing the program to read and access the data without writing changes, things so that byChunk is faster. ---Count 9's byChunk code fragment--- auto chunkedStream = filename.File.byChunk(1024*1024); size_t count = 0; chunkedStream.each!(x => count += x.count('9')); writefln("Found %d '9's", count); ---Count 9's byLine code fragment--- auto chunkedStream = filename.File.byLine(Yes.keepTerminator); size_t count = 0; chunkedStream.each!(x => count += x.count('9')); writefln("Found %d '9's", count); Results for the count 9's program, against the 2.7, 14 million line file: byLine: 8.98 seconds byChunk: 1.64 seconds Different formulations of the above have the same result, including the same formulations in the byChunk documentation. The above suggests that reading with File.byChunk may not problematic by itself, but that the slow writing is somehow connected. ---Full program used for byChunk--- import std.algorithm; import std.range; import std.stdio; void main(string[] cmdArgs) { if (cmdArgs.length < 2) { writeln("synopis: copyfile_bychunk file"); } else { auto filename = cmdArgs[1]; auto chunkedStream = (filename == "-") ? stdin.byChunk(1024*1024) : filename.File.byChunk(1024*1024); auto stdoutWriter = stdout.lockingTextWriter; chunkedStream.each!(x => put(stdoutWriter, x)); } } The other test programs were written similarly. Tests were on OS X.
Comment #1 by jrdemail2000-dlang — 2017-02-27T23:51:20Z
(In reply to Jon Degenhardt from comment #0) > Results for the count 9's program, against the 2.7, 14 million line file: > byLine: 8.98 seconds > byChunk: 1.64 seconds Update: The byLine Count 9s program given above is being affected by autodecode. Changing counting line as follows: from: chunkedStream.each!(x => count += x.count('9')); to: chunkedStream.each!(x => count += (cast(ubyte[])x).count('9')); Gives updated times: byLine: 3.13 byChunk: 1.64 This is more consistent. byChunk is faster than byLine in the Count 9s task (read and access data), by not by 5x. The 15x performance deficit of byChunk relative to byLine on the file copy task remains.
Comment #2 by schveiguy — 2017-02-28T16:45:36Z
The issue is that LockingTextWriter accepts anything but character range. What is happening (I think) is that lockingTextWriter.put(someUbyteArray) is actually putting each ubyte into the stream one at a time after converting each ubyte into a dchar through integer promotion. This can't be the intention. I think instead of the constraint is(ElementType!A : const(dchar)) what we need is something more like: is(Unqual!(ElementType!A) == dchar) This might break code, but I think code that is already broken, such as the example, no?
Comment #3 by jrdemail2000-dlang — 2017-03-01T09:05:19Z
I've confirmed that File.byChunk with lockingTextWriter corrupts utf-8 encoded files. I used the unicode test file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt and the example given with the File.byChunk documentation: // Efficient file copy, 1MB at a time. import std.algorithm, std.stdio; void main() { stdin.byChunk(1024 * 1024).copy(stdout.lockingTextWriter()); } This file copy program corrupts the unicode characters as described in Steven's comment. This is a quite problematic, both because of character corruption and because it is an example in the documentation. The new method, lockingBinaryWriter, copies the file correctly. It is available starting with 2.073.1. lockingBinaryWriter also copies the file quickly, eliminating the performance issue. It is appears from the PR for lockingBinaryWriter (https://github.com/dlang/phobos/pull/2011) that there was discussion of the roles of Binary and Text writer. Regardless of availability of the lockingBinaryWriter, the lockingTextWriter certainly looks broken when used with the ubyte data type. Personally, I think it makes sense for lockingTextWriter to assume ubyte arrays are correctly encoded, or perhaps are utf-8 encoded. This would potentially allow newline translation, something that the lockingBinaryWriter would presumably not do.
Comment #4 by jrdemail2000-dlang — 2017-03-01T09:11:22Z
Changing the subject to reflect the more serious problem, corruption of utf-8 encoded data. As described in one of the comments, this corruption occurs in a file copy example in the documentation.
Comment #5 by dfj1esp02 — 2017-03-01T10:28:40Z
static assert(is(ubyte:dchar)); This assert succeeds. Is it intended? It's why LockingTextWriter accepts bytes event though it's a text writer.
Comment #6 by schveiguy — 2017-03-01T14:25:09Z
(In reply to anonymous4 from comment #5) > static assert(is(ubyte:dchar)); > This assert succeeds. Is it intended? It's why LockingTextWriter accepts > bytes event though it's a text writer. Yes, in the compiler, ubyte and dchar are fundamentally unsigned integer types. You can implicitly convert via integer promotion. Technically, char can integer promote to dchar as well. However, foreach(dchar d; someCharArray) is specialized in the compiler to go through auto decoding. The forums contain volumes of battles on auto-decoding and how the choice has affected D, I don't think we need to rehash it here. What we need to do is make it so obviously wrong code is not accepted for this function. I'll try and get a PR together.
Comment #7 by schveiguy — 2017-03-02T03:33:35Z
Comment #8 by github-bugzilla — 2017-03-17T22:41:48Z
Commits pushed to master at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/9af2d0267f054503dc5f73410b1ab0d0e418f006 Fix issue 17229 - do not use integer promotion to translate ubyte ranges to dchar ranges. https://github.com/dlang/phobos/commit/ca20f0b667fa7f9602eee7ed18c23f56b2f06f34 Add unittest for Bug 17229 https://github.com/dlang/phobos/commit/3568cb38e709d03dcc894358a977c042248aab2d Merge pull request #5229 from schveiguy/fixtextwriter Fix issue 17229 - File.byChunk (ubyte) w/ stdout.lockingTextWriter corrupts utf-8 data (and is very slow) merged-on-behalf-of: Jack Stouffer <[email protected]>
Comment #9 by github-bugzilla — 2017-03-22T12:22:27Z
Commits pushed to stable at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/9af2d0267f054503dc5f73410b1ab0d0e418f006 Fix issue 17229 - do not use integer promotion to translate ubyte ranges https://github.com/dlang/phobos/commit/ca20f0b667fa7f9602eee7ed18c23f56b2f06f34 Add unittest for Bug 17229 https://github.com/dlang/phobos/commit/3568cb38e709d03dcc894358a977c042248aab2d Merge pull request #5229 from schveiguy/fixtextwriter
Comment #10 by github-bugzilla — 2017-08-07T12:26:16Z
Commits pushed to newCTFE at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/9af2d0267f054503dc5f73410b1ab0d0e418f006 Fix issue 17229 - do not use integer promotion to translate ubyte ranges https://github.com/dlang/phobos/commit/ca20f0b667fa7f9602eee7ed18c23f56b2f06f34 Add unittest for Bug 17229 https://github.com/dlang/phobos/commit/3568cb38e709d03dcc894358a977c042248aab2d Merge pull request #5229 from schveiguy/fixtextwriter
Comment #11 by github-bugzilla — 2018-01-05T13:27:25Z
Commits pushed to dmd-cxx at https://github.com/dlang/phobos https://github.com/dlang/phobos/commit/9af2d0267f054503dc5f73410b1ab0d0e418f006 Fix issue 17229 - do not use integer promotion to translate ubyte ranges https://github.com/dlang/phobos/commit/ca20f0b667fa7f9602eee7ed18c23f56b2f06f34 Add unittest for Bug 17229 https://github.com/dlang/phobos/commit/3568cb38e709d03dcc894358a977c042248aab2d Merge pull request #5229 from schveiguy/fixtextwriter