Bug 13474 – Discard excess precision for float and double (x87)

Status
RESOLVED
Resolution
FIXED
Severity
enhancement
Priority
P1
Component
dmd
Product
D
Version
D2
Platform
All
OS
All
Creation time
2014-09-14T13:22:00Z
Last change time
2017-01-16T23:24:41Z
Keywords
wrong-code
Assigned to
nobody
Creator
ilyayaroshenko
See also
https://issues.dlang.org/show_bug.cgi?id=13485

Comments

Comment #0 by ilyayaroshenko — 2014-09-14T13:22:18Z
KBN summation returns a wrong result when compiled with dmd -m32 -O. With -m64, dmd works correctly. LDC works correctly with both -m32 and -m64.

+++++++++++++++++++++++++++++++++
/** Kahan-Babuška-Neumaier summation algorithm
+++++++++
s := x[1]
c := 0
FOR i := 2 TO n DO
    t := s + x[i]
    IF ABS(s) >= ABS(x[i]) THEN
        c := c + ((s-t)+x[i])
    ELSE
        c := c + ((x[i]-t)+s)
    END IF
    s := t
END DO
s := s + c
+++++++++
*/
F sumKBN(Range, F = Unqual!(ForeachType!Range))(Range r, F s = 0.0)
{
    F c = 0.0;
    foreach (F x; r)
    {
        F t = s + x;
        if (s.fabs >= x.fabs)
        {
            F y = s - t;
            c += y + x;
        }
        else
        {
            F y = x - t;
            c += y + s;
        }
        s = t;
    }
    return s + c;
}

unittest
{
    import std.algorithm : map;
    auto ar = [1, 1e100, 1, -1e100].map!(a => a*10000);
    double r = 20000;
    assert(r == ar.sumKBN); // fails with dmd -m32 -O
}
+++++++++++++++++++++++++++++++++

Git Reference: https://github.com/D-Programming-Language/phobos/pull/2513
Comment #1 by ilyayaroshenko — 2014-09-15T17:39:09Z
//Fails both DMD -m32 and -m64
import core.stdc.tgmath;

double sumKBN(double[] d)
{
    double s = 0.0;
    double c = 0.0;
    foreach (double x; d)
    {
        double t = s + x;
        if (s.fabs >= x.fabs)
            c += (s - t) + x;
        else
            c += (x - t) + s;
        s = t;
    }
    return s + c;
}

unittest
{
    assert(3 == [1, 1e100, 1, -1e100].sumKBN);
}
Comment #2 by ilyayaroshenko — 2014-09-15T17:40:25Z
Comment #3 by ilyayaroshenko — 2014-09-15T17:52:16Z
Please discard the previous 2 comments.

//$ dmd -m32 -O -main -unittest -run test4.d
//[email protected](28): 0
import core.stdc.tgmath;

double sumKBN(Range)(Range r)
{
    double s = 0.0;
    double c = 0.0;
    foreach (double x; r)
    {
        double t = s + x;
        if (s.fabs >= x.fabs)
            c += (s - t) + x;
        else
            c += (x - t) + s;
        s = t;
    }
    return s + c;
}

unittest
{
    import std.algorithm, std.range, std.conv;
    auto ar1 = [1, 1e100, 1, -1e100].map!(a => a*10000).array;
    auto ar2 = [1, 1e100, 1, -1e100].map!(a => a*10000);
    double r = 20000;
    assert(r == ar1.sumKBN);                       // Ok
    assert(r == ar2.sumKBN, ar2.sumKBN.to!string); // fails
}
Comment #4 by ilyayaroshenko — 2014-09-15T20:43:58Z
Comment #5 by yebblies — 2015-02-08T08:05:18Z
Reduced:

double change(double x) { return x * 10000; }

void main()
{
    double c = 0x1.388p+13;
    double s = 0x1.652efdc6018a2p+345;
    double t = s + 10000;
    c += s - t + 10000;
    s = t;
    auto x = change(-1e100);
    double u = s + x;
    c += x - u + s;
    s = u;
    assert(s + c == 20000);
}
Comment #6 by ilyayaroshenko — 2015-02-08T09:12:40Z
F foo(F)(F c, F d)
{
    c += d;
    c += d;
    return c;
}

void test1()
{
    alias F = double;
    enum F d = (cast(F)(2)) ^^ (F.max_exp - 1);
    assert(foo(-d, d) == d);
}
Comment #7 by yebblies — 2015-02-08T09:19:45Z
change(-1e100) calculates its result at real precision, then returns it in ST(0). It is duplicated via the stack (truncating to double), but the full-precision value is then added to 1e104, and due to the extra precision the two no longer cancel exactly, instead resulting in -1.3987216024025249e+0087.

This value causes the later addition of 10000 to be ignored, leading to the wrong result and the assertion failure.

So, is it invalid to return a double in ST(0) without first truncating it to double precision?
Comment #8 by yebblies — 2015-02-08T09:21:18Z
(In reply to Илья Ярошенко from comment #6)
> F foo(F)(F c, F d) {
>     c += d;
>     c += d;
>     return c;
> }
>
> void test1() {
>     alias F = double;
>     enum F d = (cast(F)(2)) ^^ (F.max_exp - 1);
>     assert(foo(-d, d) == d);
> }

That doesn't fail for me on win32 with -O.
Comment #9 by ilyayaroshenko — 2015-02-08T09:27:05Z
(In reply to yebblies from comment #8)
> That doesn't fail for me on win32 with -O.

Can you open https://github.com/D-Programming-Language/dmd/pull/3992 please?
Comment #10 by yebblies — 2015-02-08T10:19:48Z
(In reply to Илья Ярошенко from comment #9)
> Can you open https://github.com/D-Programming-Language/dmd/pull/3992 please?

Why? The results are still visible here:
https://auto-tester.puremagic.com/pull-history.ghtml?projectid=1&repoid=1&pullid=3992

I can reproduce that failure on linux64, so it might be an xmm bug. Not failing on win32 suggests it's a different bug.
Comment #11 by ilyayaroshenko — 2015-02-08T10:25:36Z
OK. This is a DMD optimization bug too: DMD compiles (-d)+(d+d) instead of ((-d)+d)+d.
Comment #12 by yebblies — 2015-02-08T10:44:15Z
(In reply to Илья Ярошенко from comment #11)
> This is DMD optimization bug too.
> DMD compiles (-d)+(d+d) instead of ((-d)+d)+d.

Yes. Is there a bug report for that specific bug?
Comment #13 by ilyayaroshenko — 2015-02-08T10:48:25Z
Comment #14 by ilyayaroshenko — 2015-02-09T12:40:18Z
(In reply to yebblies from comment #7)
> So, is it invalid to return a double in ST0 without first truncating it to
> double precision?

I think so. It looks like LDC and GDC have the truncation.
Comment #15 by yebblies — 2015-02-16T05:02:38Z
I've thought about it some more, and I don't think this is behaving incorrectly given what D specifies about floating point behaviour.

1. Nothing in any ABI I can find specifies that return values must be truncated. (gcc does have a -ffloat-store flag.)

2. D allows floating point intermediates to be at a higher precision than the operand types. This doesn't explicitly include return values, but...

3. D allows inlining, which means even if return values are explicitly excluded from (2), it still can't be relied upon.

So, I conclude that calculating and returning floating point values at a higher precision than specified is allowed in D.

The code where you hit the problem can be fixed by:

1. Fixing the tests to not rely on maximum precision.
2. Using a wrapper on function calls where precision must be forced. (An intrinsic to force precision loss has been discussed in the past.)
3. Lobbying Walter for a language change enforcing this in some way.

I've changed the platform, as this behaviour is possible on other platforms too, even if the current code generator never hits those cases.
Comment #16 by yebblies — 2015-02-16T05:08:16Z
Here is the other bug which is relevant: issue 9937.
Comment #17 by ilyayaroshenko — 2015-02-16T06:23:17Z
(In reply to yebblies from comment #15)

Why have you changed the name of this issue? The bug is not only about return values; truncation should happen during all calculations.

> I've thought about it some more, and I don't think this is behaving
> incorrectly given what D does specify about floating point behaviour.

Yes. D is unique among languages on this question. In other words, D is not a systems language, because it can't compile IEEE 754 code correctly. I am disappointed. All the D contributors tell me that "it is a feature for precise math", but this feature prohibits precise math: https://github.com/D-Programming-Language/phobos/pull/2513.

> The code where you hit the problem can be fixed by:
>
> 1. Fixing the tests to not rely on maximum precision.

The problem is that the value isn't "more precise". The value is wrong, because truncation should be performed during ALL calculations.

> 2. Using a wrapper on function calls where precision must be forced. (An
> intrinsic to force precision loss has been discussed in the past)

It would need to be applied throughout std.numeric.summation ;(

> 3. Lobbying Walter for a language change enforcing this in some way.

Can you help me to do it?
Comment #18 by yebblies — 2015-02-16T06:50:53Z
(In reply to Илья Ярошенко from comment #17)
> Why you have changed the name of this issue?
> Bug is not only for return. Truncation should be during all calculations.

I changed the title because the old one was not accurate. The optimizer is working as designed, even if that design is not what you want. Please change the title if you want to request no extra intermediate precision anywhere; I just kept it to the scope of the original report.

> Yes. D is unique language in this question.
> In other other words D is not a system language because it can't compile
> correctly IEEE 754 standard.

I'm not sure this counts as violating IEEE 754, though I'm not an expert. C has the same problem, so I guess C isn't a system language either by that logic.

> I am disappointed.
> All D contributors explain me that "it is a feature for precise math".
> But this feature prohibits price math:
> https://github.com/D-Programming-Language/phobos/pull/2513.

Maybe. The logic in here does seem sound, although again I'm not an expert:
http://dlang.org/d-floating-point.html
So the idea is that strict double rounding would be a big performance hit, and

> The problem is that value isn't "more price". The value is wrong because
> truncation should be performed during ALL calculation.

Technically the value _is_ more precise. The way it interacts with rounding is what produces the incorrect result. Under the current spec the rounding is not required.

> > 3. Lobbying Walter for a language change enforcing this in some way.
>
> Can you help me to do it?

I don't know enough about the algorithms you're using to be sure either you or Walter is right here. You may want to comment on issue 9937 and see if Don agrees with you.
Comment #19 by ilyayaroshenko — 2015-02-16T06:58:24Z
(In reply to yebblies from comment #18)
> I'm not sure this counts as violating IEEE 754, though I'm not an expert. C
> has the same problem, so I guess C isn't a system language either by that
> logic.

C has this problem only when special compiler flags are enabled. All these algorithms are from Python source code (C) or Wikipedia (C, Pascal).
Comment #20 by yebblies — 2015-02-16T07:05:12Z
(In reply to Илья Ярошенко from comment #19)
> C has this problem only when special compiler flags enabled.
> All this algorithms are from Python source code (C) or Wikipedia(C, Pascal).

C does allow you to specify when excess precision should be lost, but IIUC it still has extra precision enabled for intermediate expressions. So the problem still exists, but is easier to work around.
Comment #21 by code — 2016-09-29T10:30:20Z
(In reply to yebblies from comment #18)
> Maybe. The logic in here does seem sound, although again I'm not an expert.
> http://dlang.org/d-floating-point.html
> So the idea is that strict double rounding would be a big performance hit

That doesn't make too much sense, and we shouldn't adapt the language to an x87 "coprocessor". Using the old FPU nowadays is itself a performance hit and should be avoided unless you absolutely need the extra 16 bits of precision. As C gets away with its default behavior, I don't think we need to make -ffast-math our default.
Comment #22 by bugzilla — 2016-11-07T09:32:38Z
This boils down to the following code:

double foo(double x, double t, double s, double c)
{
    double y = x - t;
    c += y + s;
    return s + c;
}

The body of which, when optimized, looks like:

return s + (c + (x - t) + s);

Or, in x87 instructions:

        fld     qword ptr 01Ch[ESP]
        fld     qword ptr 0Ch[ESP]
        fxch    ST(1)
        fsub    qword ptr 014h[ESP]
        fadd    qword ptr 0Ch[ESP]
        fadd    qword ptr 4[ESP]
        fstp    qword ptr 4[ESP]
        fadd    qword ptr 4[ESP]
        ret     020h

The algorithm relies on rounding to double precision of the (x-t) calculation. The only way to get the x87 to do that is to actually assign it to memory. But the compiler optimizes away the assignment to memory, because it is substantially slower.

The 64 bit code does not have this problem, because the code gen looks like:

        push    RBP
        mov     RBP,RSP
        movsd   XMM4,XMM0
        movsd   XMM5,XMM1
        subsd   XMM3,XMM2
        addsd   XMM3,XMM5
        addsd   XMM4,XMM3
        movsd   XMM0,XMM5
        addsd   XMM0,XMM4
        pop     RBP
        ret

It's doing the same optimization, but the result is rounded to double because the XMM registers are doubles.

Note that the following targets generate x87 code, not XMM code: Win32, Linux32, FreeBSD32, because it is not guaranteed that the target has XMM registers. I suspect we don't really care about the floating point performance on those targets, but we do care that the code gives expected results.

So I propose that the fix is to disable optimizing away the assignment to y for x87 code gen targets.
Comment #23 by yebblies — 2016-11-07T10:31:10Z
(In reply to Walter Bright from comment #22)
> So I propose that the fix is to disable optimizing away the assignment to y
> for x87 code gen targets.

Are you suggesting disabling that optimization always, or allowing the programmer to specify that that particular assignment shouldn't be optimized?

If the latter, I would rather stop supporting targets without xmm regs than stop producing fast code on Win32 etc.

If the former, I think that's well covered by adding an intrinsic, so the code becomes:

double foo(double x, double t, double s, double c)
{
    double y = __builtin_that_forces_rounding_to_double(x - t);
    c += y + s;
    return s + c;
}

And this seems like something that could be handled fairly easily in the dmd backend. I think this covers all the cases where rounding must be required.
Comment #24 by ilyayaroshenko — 2016-11-07T10:44:05Z
(In reply to yebblies from comment #23)
> I would rather stop supporting targets without xmm regs than
> stop producing fast code on Win32 etc.

+1
Comment #25 by bugzilla — 2016-11-07T11:01:20Z
Comment #26 by bugzilla — 2016-11-07T11:05:36Z
> I think that's well covered by adding an intrinsic,

I produced a PR for that in druntime. Nobody liked it, and it languishes unpulled.

https://github.com/dlang/druntime/pull/1621
Comment #27 by bugzilla — 2016-11-07T11:12:27Z
> stop supporting targets without xmm regs

A couple of problems with this:

1. It is unknown what 32-bit x86 CPUs are used for embedded systems. I dislike adding more codegen switches, because every switch doubles the time it takes to run the test suite, and few developers set them correctly. (Who ever sets that blizzard of switches gcc has correctly?)

2. It's not a simple matter of turning it on, even though dmd generates XMM code for OSX 32-bit. The trouble is in getting the stack aligned to 16 bytes. The Linux way of doing that is different from OSX, so there's some significant dev work to do to match it.

I believe that making faster 64-bit code should have priority over making faster 32-bit code, based on the idea that users who feel the need for speed are going to be using -m64.
Comment #28 by yebblies — 2016-11-07T13:08:58Z
(In reply to Walter Bright from comment #26)
> I produced a PR request for that in druntime. Nobody liked it, and it
> languishes unpulled.
>
> https://github.com/dlang/druntime/pull/1621

I must have missed that. I'm happy to review/merge dmd changes related to that. I'm worried the other approach will just cause a performance issue that's impossible to work around.

> 1. It is unknown what 32 bit x86 CPUs are used for embedded systems. I
> dislike adding more codegen switches, because every switch doubles the time
> it takes to run the test suite, and few developers set them correctly. (Who
> ever sets that blizzard of switches gcc has correctly?)

Who is using dmd on an embedded system? Why? Embedded system users are exactly the people who are setting gcc switches correctly.

> 2. It's not a simple matter of turning it on, even though dmd generates XMM
> code for OSX 32 bit. The trouble is in getting the stack aligned to 16
> bytes. The Linux way of doing that is different from OSX, so there's some
> significant dev work to do to match it.

Yeah, I know. Then again, wouldn't using unaligned loads/stores still be faster than using the x87? Last I checked, it was... not fast.

> I believe that making faster 64 bit code should have priority over making
> faster 32 bit code, based on the idea that users who feel the need for
> speed are going to be using -m64.

It's much easier to switch over to m64 on linux, which is why I'm still using m32 on windows. One day...

Can you put together a dmd PR to go with druntime 1621? I'm guessing it's pretty easy, since a new OPER will default to not being optimized?
Comment #29 by bugzilla — 2016-11-07T22:33:58Z
(In reply to yebblies from comment #28)
> I must have missed that. I'm happy to review/merge dmd changes related to
> that.

No dmd changes are necessary; it will work as is. I designed it as an intrinsic in case some future code gen scheme would cause it to not work.

> I'm worried the other approach will just cause a performance issue
> that's impossible to work around.

It could be worked around by using reals as temporaries instead of doubles, but few programmers have that level of understanding of how floating point works.

> Who is using dmd on an embedded system? Why?

I don't know. I'm reluctant to just break all their code because I am ignorant of them.

> Embedded system users are
> exactly the people who are setting gcc switches correctly.

In my experience with embedded systems developers, they aren't any more sophisticated with detailed feature switch settings than any other systems code developer. Most just copy the switch settings from project to project, in the process losing any information about why those settings were set to begin with.

> Then again, wouldn't using unaligned loads/stores still be
> faster than using the x87? Last I checked, it was... not fast.

I don't know. I also overlooked another point - the 32-bit ABI still uses the ST0 register for floating point returns. Not sure what gcc does about that. Anyhow, there is clearly some not insignificant engineering work to be done for that. It's a question of whether it is worth it.

> Can you put together a dmd PR to go with druntime 1621? I'm guessing it's
> pretty easy, since a new OPER will default to not being optimized?

As I mentioned, no dmd changes are currently necessary.

One last point. Yes, the intrinsics will work and will be a more efficient solution. The problem, though, is that people will port code that works, or will type in code from a book, and then it will not work, and they will blame D. Not many will know just where the intrinsics need to be inserted. How many here realized the store to t was the problem?
Comment #30 by github-bugzilla — 2016-11-10T01:02:19Z
Commits pushed to master at https://github.com/dlang/dmd

https://github.com/dlang/dmd/commit/6db2246e97c790e0988f024ccb25d0fb090d609a
fix Issue 13474 - Discard excess precision for float and double (x87)

https://github.com/dlang/dmd/commit/b9d6be259e2e54c66d8361675b65f717dd5e3fc4
Merge pull request #6247 from WalterBright/fix13474
fix Issue 13474 - Discard excess precision for float and double (x87)
Comment #31 by github-bugzilla — 2016-12-27T14:40:58Z
Commits pushed to scope at https://github.com/dlang/dmd

https://github.com/dlang/dmd/commit/6db2246e97c790e0988f024ccb25d0fb090d609a
fix Issue 13474 - Discard excess precision for float and double (x87)

https://github.com/dlang/dmd/commit/b9d6be259e2e54c66d8361675b65f717dd5e3fc4
Merge pull request #6247 from WalterBright/fix13474
Comment #32 by github-bugzilla — 2017-01-16T23:24:41Z
Commits pushed to newCTFE at https://github.com/dlang/dmd

https://github.com/dlang/dmd/commit/6db2246e97c790e0988f024ccb25d0fb090d609a
fix Issue 13474 - Discard excess precision for float and double (x87)

https://github.com/dlang/dmd/commit/b9d6be259e2e54c66d8361675b65f717dd5e3fc4
Merge pull request #6247 from WalterBright/fix13474