Bug 4474 – Better stdin.byLine()

Status: RESOLVED
Resolution: FIXED
Severity: enhancement
Priority: P2
Component: phobos
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2010-07-16T16:21:00Z
Last change time: 2015-11-03T14:25:59Z
Assigned to: nobody
Creator: bearophile_hugs

Comments

Comment #0 by bearophile_hugs — 2010-07-16T16:21:32Z
This relates to pages 16-17 of The D Programming Language, which explain stdin.byLine() and the possible 'rather hard to find' bugs caused by not duplicating the input data. If I use D to write 20-line scripts, I really don't want to have to remember to dup everything (in D1 code I sometimes end up dupping too much, to be on the safe side). So I suggest a different API for line reading:

- stdin.byLineMutable() (or a similar name, longer than "byLine", that makes it clear it doesn't copy): the current behaviour, which avoids a memory allocation for each line read. This is faster but less safe.
- stdin.byLine(): allocates a new string for each line. This is safer, as in Python (Python also uses heuristics to speed up this method as much as possible, because this is often a very common and performance-critical operation in scripts).

D's general design policy is that unsafe but faster things need to be asked for, and the defaults must be less bug-prone. If I write a small D script I can use byLine(), hoping to avoid some bugs. If profiling later shows it's too slow, I can replace byLine() with the other method and optimize the code carefully, removing some heap allocations.

(An alternative design strategy is to keep just the byLine() method, but give it an optional argument, like stdin.byLine(bool copy=true) or stdin.byLine(bool COPY=true)(), that by default copies the line with a new memory allocation.)
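The "optional argument" alternative above can be sketched as a small wrapper. This is only an illustration under stated assumptions: `linesOf` is a hypothetical helper, not a Phobos function, and it relies on `File.byLineCopy`, which was added to Phobos only later in this issue's history.

```d
import std.range : ElementType;
import std.stdio : File, stdin, writeln;

// Hypothetical helper (not part of Phobos): copies each line by default,
// and only reuses the internal buffer when the caller asks for it.
auto linesOf(bool copy = true)(File f)
{
    static if (copy)
        return f.byLineCopy();   // fresh string for every line
    else
        return f.byLine();       // char[] slice of a reused buffer
}

// The two variants differ in element type, so the type system can tell them apart.
static assert(is(ElementType!(typeof(linesOf(File.init))) == string));
static assert(is(ElementType!(typeof(linesOf!false(File.init))) == char[]));

void main()
{
    string[] kept;
    foreach (line; linesOf(stdin))   // safe default: lines can be stored as-is
        kept ~= line;
    writeln(kept.length, " lines kept");
}
```

With `linesOf!false(stdin)` the loop variable would be a `char[]` that is overwritten on the next iteration, so storing it would require an explicit `.dup` or `.idup`.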
Comment #1 by andrei — 2010-07-17T08:00:52Z
byLine is safe.
Comment #2 by bearophile_hugs — 2010-07-17T08:29:20Z
OK, changed the title to "Better" instead of "Safer".
Comment #3 by bearophile_hugs — 2010-07-17T09:10:49Z
This is a small test program (dmd v2.047):

    import std.string, std.stdio;
    void main() {
        int[string] aa;
        foreach (line; stdin.byLine())
            foreach (word; line.split())
                aa[word]++;
        foreach (word, freq; aa)
            writeln(freq, " ", word);
    }

Running with itself as input data:

    test.exe < test.d

Prints:

    1 eln(fr
    1 q, " ", wo
    1 writeln
    1 }
    1 "
    1 }
    1 }
    1 writeln
    2 wri
    1 wri
    1 ", word); ))
    1 , w
    1 q, " ", word);
    1 eln(fr
    1 q, "
    1 freq,
    1 ",
    1 eln(freq, "
    1 writeln(fr
    1 word);
    1 writeln(freq,
    1 fre
    1 e

This shows that byLine() is bug-prone (unsafe). While this program:

    import std.string, std.stdio;
    void main() {
        int[string] aa;
        foreach (line; stdin.byLine())
            foreach (word; line.split())
                aa[word.dup]++;
        foreach (word, freq; aa)
            writeln(freq, " ", word);
    }

Prints a more correct output:

    1 (word,
    1 std.stdio;
    1 int[string]
    1 }
    1 "
    1 void
    1 import
    3 foreach
    1 main()
    1 aa)
    1 line.split())
    1 stdin.byLine())
    1 (line;
    1 freq;
    1 (word;
    1 ",
    1 std.string,
    1 word);
    1 writeln(freq,
    1 aa[word.dup]++;
    1 aa;
    1 {

It's easy to forget dupping/idupping.
Comment #4 by andrei — 2010-07-17T11:06:02Z
That example is the manifestation of another bug: http://d.puremagic.com/issues/show_bug.cgi?id=2954
Comment #5 by bearophile_hugs — 2010-07-17T11:46:28Z
If you think this bug report is invalid and byLine() is safe (because the type system is enough, being able to tell apart char[] and string), then you can close this bug report.
Comment #6 by bearophile_hugs — 2010-07-24T19:07:33Z
Bug closed because Andrei says byLine() is safe :-)
Comment #7 by bearophile_hugs — 2014-05-21T12:05:08Z
Reopened. byLine can't be renamed byLineMutable, so byLine is the noncopying one and byLineCopy is the copying one. So here the "default" short function is unfortunately the less safe one, against the D Zen: https://github.com/D-Programming-Language/phobos/pull/2077
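For illustration, the word-count test from comment #3 can be written against byLineCopy so that no explicit dup is needed. This is a minimal sketch, not code from the linked pull request: `wordFreq` is a hypothetical helper name, and it is written over any range of strings so it can be tried without piping a file in.

```d
import std.array : split;
import std.range : ElementType, isInputRange;
import std.stdio : stdin, writeln;

// Hypothetical helper: counts whitespace-separated words in a range of lines.
// It requires string elements, so every word it stores as a key is immutable.
int[string] wordFreq(R)(R lines)
    if (isInputRange!R && is(ElementType!R == string))
{
    int[string] aa;
    foreach (line; lines)
        foreach (word; line.split())
            aa[word]++;          // word is a slice of a string that is never overwritten
    return aa;
}

void main()
{
    // byLineCopy yields a freshly allocated string per line, so the keys stay valid.
    foreach (word, freq; wordFreq(stdin.byLineCopy()))
        writeln(freq, " ", word);
}
```

Passing `stdin.byLine()` instead is rejected by the template constraint, because its elements are `char[]` rather than `string`; that is the type-level distinction discussed in comment #5.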