Bug 2093 – string concatenation modifies original

Status
RESOLVED
Resolution
FIXED
Severity
normal
Priority
P2
Component
dmd
Product
D
Version
D2
Platform
x86
OS
Windows
Creation time
2008-05-10T14:30:00Z
Last change time
2015-06-09T01:21:35Z
Keywords
accepts-invalid, spec
Assigned to
nobody
Creator
bartosz
Blocks
2573, 2095

Attachments

IDFilenameSummaryContent-TypeSize
256xml.dTest casetext/plain13577

Comments

Comment #0 by bartosz — 2008-05-10T14:30:04Z
I will attach source code for this example. It's an XML parser. It should produce the following output: c:\D\Work>xml root child color=red Text=foo bar baz Instead it produces this: c:\D\Work>xml root rootd rootd=red Text=rootdar baz The problem is that strings are modified after being copied, when the original is concatenated upon. The problem goes away if I idup strings: _name = name.idup; _value = value.idup; or when I replace a ~= b; with a = a ~ b;
Comment #1 by bartosz — 2008-05-10T14:31:59Z
Created attachment 256 Test case
Comment #2 by smjg — 2008-11-21T19:09:10Z
Welcome to the world of bug reporting. The way to report a bug isn't to attach a 695-line program that contains some functionality somewhere that exhibits the problem. The correct manner is to post a small example that illustrates the problem, typically either by writing a test program from scratch or by simplifying little by little the program in which you found it. If done well, the result will be small enough to post straight into the bug report rather than attaching it. DMD's code coverage analysis is a useful tool for identifying unused parts of a program in order to cut them out, among other things.
Comment #3 by smjg — 2008-11-21T20:40:31Z
I think I've finally managed to figure out what was going on. ---------- import std.stdio; void main() { string s1, s2; s1 ~= "hello"; s2 = s1; writefln(s1); writefln(s2); s1.length = 0; s1 ~= "Hi"; writefln(s1); writefln(s2); } ---------- hello hello Hi Hillo ---------- This is the kind of testcase we like here. Walter is more likely to fix a bug if you make life easier for him by supplying something on which the cause can easily be seen.
Comment #4 by 2korden — 2008-11-22T00:39:26Z
This is a known bug and is a major array design flow. Arrays has no determined owner (the only one who can grow without a reallocation if capacity permits): import std.stdio; void main() { char[] s1, s2; s1.length = 100; // reserve the capacity s1.length = 0; s2 = s1; // both are pointing to an empty string with the capacity of 100 s1 ~= "Hello"; // array is not reallocated, it is grown in-place writefln(s1); writefln(s2); // prints empty string. s2 still points to the same string (which is now "Hello") and carries length of 0 s2 ~= "Hi"; // overwrites s1 writefln(s2); // "Hi" writefln(s1); // "Hillo" } s1 is the array owner and s2 is a slice (even though it really points to the entire array), i.e. it should reallocate and take the ownership of the reallocated array on append, but it doesn't happen. Currently an 'owner' is anyone who has a pointer to array's beginning: char[] s = "hello".dup; char[] s1 = s[0..4]; s1 ~= "!"; assert(s != s1); // fails, both are "hell!", s is overwritten s = "_hello".dup; char[] s2 = s[1..5]; s2 ~= "!"; assert(s != s1); // succeeds, s1 is not changed
Comment #5 by ddparnell — 2008-11-22T03:58:02Z
I thought 'string' types were immutable and thus ... s1.length = 0; should fail as it updates the string (trucates it to zero characters).
Comment #6 by 2korden — 2008-11-22T06:45:16Z
No, string is aliased to invariant(char)[], i.e. an array of invariant characters. You can change its length (usually, decreasing) but not contents.
Comment #7 by smjg — 2008-11-22T08:43:18Z
(In reply to comment #4) > Currently an 'owner' is anyone who has a pointer to array's beginning: > > char[] s = "hello".dup; > char[] s1 = s[0..4]; > s1 ~= "!"; > assert(s != s1); // fails, both are "hell!", s is overwritten A simple char[] is fully mutable, so that doesn't violate any established rule, but whether it's desirable is another matter. With const(char)[] or invariant(char)[], obviously this isn't going to work, so ~= should always reallocate (unless the optimiser can be sure that no other reference to the data can possibly exist). Alternatively, the GC could maintain a note of the actual length of every heap-allocated array. Ownership would be determined by matching in both start pointer and length. When the length is increased, whether by .length or ~=, either update this actual length (if it's the owner that we're extending, IWC all other references to the same data lose ownership) or reallocate the array.
Comment #8 by schveiguy — 2008-11-22T09:44:41Z
Note that this behavior is defined in the spec. See http://www.digitalmars.com/d/2.0/arrays.html#resize "To maximize efficiency, the runtime always tries to resize the array in place to avoid extra copying. It will always do a copy if the new size is larger and the array was not allocated via the new operator or a previous resize operation. This means that if there is an array slice immediately following the array being resized, the resized array could overlap the slice" The fact that it violates invariantness is a side effect that Walter has not yet dealt with. There have been proposals to fix this, two of which I have proposed: 1. As you said, store the requested length along with the block length in the GC. Only appending to an array that ends at the end of the allocated memory will realloc in place. Original proposal: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=63146 Note nobody responded to this one 2. Store the length of the allocated array in the first element of the array. Then modify the meaning of the length member of the array struct to flag whether it is pointing to the beginning of the array or not. Original proposal: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=77437 Some people had questions, but nobody proved the proposal wouldn't work. I don't think Walter is interested in fixing this issue, as it has been a 'feature' for a while, and he never has responded positively to any decent proposals to fix this.
Comment #9 by smjg — 2008-11-22T10:39:21Z
Steven, you have no authority to mark this WONTFIX.
Comment #10 by schveiguy — 2008-11-22T12:42:01Z
Sorry, I was thinking wontfix because the compiler functions as designed. I marked it as an enhancment instead. Removing wrong-code keyword also, as this is intended behavior.
Comment #11 by smjg — 2008-11-22T13:34:37Z
So the problem is that it _always_ leaves the decision to resize in place or reallocate to the runtime. The only way in which this can coexist with the principle of invariant is that it is illegal to increase the length of a const/invariant array. Therefore, going by the current spec, the bug is that DMD accepts the code.
Comment #12 by ddparnell — 2008-11-22T14:52:51Z
It seems to me then that this is a design choice - does the string length belong to the string or to the reference? For slices it must be the reference but for arrays? hmmm... Curently in D, a dynamic array and a slice are indistinguishable and I'm not so sure that should be the case. There are good arguments for the current design and also for the separation of slices and dynamic arrays. Common sense seems to say that if I change the length of a string that therefore every other reference to the same string should also honour the new length, and that this should also have no effect on previously captured slices of the string.
Comment #13 by schveiguy — 2008-11-22T16:31:40Z
(In reply to comment #12) > It seems to me then that this is a design choice - does the string length > belong to the string or to the reference? For slices it must be the reference > but for arrays? hmmm... Curently in D, a dynamic array and a slice are > indistinguishable and I'm not so sure that should be the case. There are good > arguments for the current design and also for the separation of slices and > dynamic arrays. > > Common sense seems to say that if I change the length of a string that > therefore every other reference to the same string should also honour the new > length, and that this should also have no effect on previously captured slices > of the string. Arrays should not be typed differently than slices IMO, they should be able to be passed to the same functions. I think one of the two solutions I proposed would place the 'allocated length' of an array on the heap with the array data, thereby having the length stored in a shared location. Slices should respect this length, and if they cannot see the length, they should be reallocated as a full-blown array.
Comment #14 by dfj1esp02 — 2009-02-19T03:04:09Z
see also bug 2095 comment 6
Comment #15 by schveiguy — 2010-03-16T19:55:38Z
This is fixed by the patch in bug 3637. It is in dmd 2.041. Compiling the attached file results in the desired output.