Take this D program
------
import std.conv;
import std.datetime;
import std.stdio;

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    try
        throw new Exception("blah");
    catch(Exception e)
    {
        auto diff = to!Duration(sw.peek());
        writeln(diff.total!"hnsecs"());
        writeln(diff);
    }
}
------
and this Java program
------
class j
{
    public static final void main(String[] args)
    {
        long before = System.nanoTime();
        try
        {
            throw new Exception("blah");
        }
        catch(Exception e)
        {
            long diff = System.nanoTime() - before;
            long totalHNSecs = diff / 100;
            long msecs = diff / 1000000;
            diff -= msecs * 1000000;
            long usecs = diff / 1000;
            diff -= usecs * 1000;
            long hnsecs = diff / 100;
            System.out.println(totalHNSecs);
            System.out.printf("%sms %sus %shnsecs\n", msecs, usecs, hnsecs);
        }
    }
}
------
They are roughly equivalent, but their performance is drastically different (and Java wins by a _long_ shot). These are typical outputs for the D version
13962
1 ms, 396 μs, and 2 hnsecs
13658
1 ms, 365 μs, and 8 hnsecs
13745
1 ms, 374 μs, and 5 hnsecs
whereas these are typical outputs for the Java version
126
0ms 12us 6hnsecs
126
0ms 12us 6hnsecs
140
0ms 14us 0hnsecs
It varies a bit from execution to execution (particularly for D), but in general, I'm seeing D take 90x - 110x longer than Java. Oddly, once in a blue moon, D seems to take more like 20x longer, but even that's a huge difference (and not in D's favor). In general, Java shouldn't be beating D on performance, since D is a systems language, and a factor of 100x is pathetic.
Granted, code should not be constantly throwing and catching exceptions, but this sort of bad performance can make it so that exceptions have far more of a negative impact than they should. The place where this is particularly harmful is unit tests. Good unit tests verify that functions act appropriately when given bad input, which generally means that the functions throw exceptions and the tests catch them, and right now, if you do much of that, unit tests start taking a very long time to execute - unacceptably long. That forces you to test code less thoroughly just to keep the tests running in a reasonable amount of time, which leads to worse code quality.
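(As an aside, to make the unit-test scenario concrete: here is a minimal sketch, not from the original report, of the kind of test that pays this cost on every bad-input check. It uses std.exception.assertThrown, which wraps exactly this throw-and-catch pattern; parsePositive is a made-up example function.)
------
import std.exception : assertThrown;

// Hypothetical function under test (made up for illustration).
int parsePositive(string s)
{
    import std.conv : to;
    auto n = to!int(s);
    if(n <= 0)
        throw new Exception("expected a positive number");
    return n;
}

unittest
{
    // Each of these checks deliberately triggers a throw/catch,
    // so the per-exception cost is paid on every bad-input test.
    assertThrown(parsePositive("0"));
    assertThrown(parsePositive("-5"));
    assertThrown(parsePositive("not a number"));
}
------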
So, I think that it's clear that we need to take steps to improve the performance of throwing and catching exceptions. As I understand it, the stacktrace is currently put into string format when the exception is constructed, which is unnecessary overhead if the stack trace never gets printed, so that's one improvement that could be made. But I have no idea how the internals of any of this currently work, so I don't know what other improvements could be made. Regardless, we need to take steps to significantly improve the situation.
Note that these tests were done on Linux. I don't know how the times compare on Windows, though I expect that the situation is similar there.
Comment #1 by bearophile_hugs — 2013-02-24T16:35:02Z
> So, I think that it's clear that we need to take steps to improve the
> performance of throwing and catching exceptions.
I agree.
My timings on Windows 32 bit:
D (latest DMD, -O -release -inline -noboundscheck):
1542590
154 ms and 259 us
69839
6 ms, 983 us, and 9 hnsecs
69741
6 ms, 974 us, and 1 hnsec
70468
7 ms, 46 us, and 8 hnsecs
70793
7 ms, 79 us, and 3 hnsecs
70539
7 ms, 53 us, and 9 hnsecs
69626
6 ms, 962 us, and 6 hnsecs
Java (version 1.7.0_13), same system:
260
0ms 26us 0hnsecs
243
0ms 24us 3hnsecs
241
0ms 24us 1hnsecs
241
0ms 24us 1hnsecs
235
0ms 23us 5hnsecs
231
0ms 23us 1hnsecs
250
0ms 25us 0hnsecs
Comment #2 by bearophile_hugs — 2013-02-24T16:54:50Z
Created attachment 1194
An old exception benchmark
The attachment contains an old exception benchmark I have kept; it compares the same code in Java, C++, D, Python, and C. On my system even Python is quite a bit faster than D, not just Java.
The C code uses longjmp/setjmp, so it's not directly comparable, but it's useful as a baseline data point.
Comment #3 by clugdbug — 2013-02-25T03:49:32Z
Exception handling is not intrinsically very slow. Basically there are two function calls per stack frame, each of which only executes a handful of instructions. All of the data that is accessed lies on the stack, so it should be in cache. On Windows, there are a few system calls during stack unwinding, but they are also extremely quick.
But even so, I'm a bit suspicious of that Java code. Is it really throwing an exception? If it contains any exception handling optimisations at all, it would optimize the throw/catch away.
Comment #4 by bearophile_hugs — 2013-02-25T04:17:11Z
(In reply to comment #3)
> I'm a bit suspicious of that Java code. Is it really throwing an
> exception? If it contains any exception handling optimisations at all, it would
> optimize the throw/catch away.
Maybe the attached benchmark doesn't have this problem.
Comment #5 by issues.dlang — 2013-02-25T10:34:29Z
> But even so, I'm a bit suspicious of that Java code. Is it really throwing an
> exception? If it contains any exception handling optimisations at all, it
> would optimize the throw/catch away.
Hmm. A valid point. I hadn't thought of that. However, even just adding a line that prints the exception makes it take only about 250 - 350 microseconds (which includes whatever cost there is to actually printing) on a machine that takes around 15 microseconds with the code in the bug. And if I save the exception's string before the end of the timing code and print the saved string afterwards (instead of the exception itself), it takes about 35 microseconds. Wrapping the throw in another function doesn't seem to have much effect on performance either.
So, it may be that there's some optimizing going on with regards to the exception in Java, but even when the exception is actually used (which I would think would force it not to be optimized away), the D implementation is still taking orders of magnitude longer than the Java implementation.
Comment #6 by clugdbug — 2013-02-27T01:04:08Z
Java does indeed do a lot of optimization of exceptions.
http://www.javaspecialists.eu/archive/Issue187.html
D does none. It's possible that this is what you're seeing. I can't yet tell if there is a bug here or not.
Comment #7 by issues.dlang — 2013-02-27T01:12:16Z
> D does none. It's possible that this is what you're seeing. I can't yet tell
> if there is a bug here or not.
There's a bug in the sense that D's exception performance sucks. If Java is doing enough fancy optimizations, it may be that we can't match its performance (especially if it's taking advantage of JVM machinery to do that), but unless there is a fundamental reason why D's exceptions cannot be made faster, they should be. As it stands, it's ludicrously expensive even just to write unit tests which verify that a function throws when it's supposed to, because the cost of those exceptions quickly dwarfs everything else going on in the tests, making them take orders of magnitude longer.
Comment #8 by bearophile_hugs — 2013-12-22T04:11:58Z
The situation has improved; in my benchmarks D exceptions are now only about 2-2.5 times slower than Java exceptions.
Further improvements in the GC, or the optimizations Don linked here (which I was not able to reproduce in the latest JVM), could bring the two even closer to parity:
http://www.javaspecialists.eu/archive/Issue187.html
Comment #9 by hsteoh — 2013-12-24T11:58:33Z
(In reply to comment #8)
> The situation has improved; in my benchmarks D exceptions are now only about
> 2-2.5 times slower than Java exceptions.
>
> Further improvements in the GC, or the optimizations Don linked here (which I
> was not able to reproduce in the latest JVM), could bring the two even closer
> to parity:
> http://www.javaspecialists.eu/archive/Issue187.html
According to that article, one of the main causes of slow exceptions is the construction of the stack trace. I'm wondering, in D, if it's possible to construct the stack trace *lazily*?
I.e., when we first create the Exception object, we do not immediately walk the stack and generate trace entries for it; instead, we let the stack unwinder incrementally create the trace as it unwinds the stack (since it's basically walking up the stack anyway). When it reaches a catch block, of course, the stack trace is still incomplete, since it has only been constructed up to that point. But we still don't have to walk the rest of the stack at this point: we can keep the partial stack trace in an incomplete state, and only if the catch block tries to read the stack trace do we walk the rest of the stack to create the entire trace. If the Exception gets rethrown, we let the unwinder continue to add more entries to the partial trace. If the Exception gets ignored, we discard the partial trace -- and we have saved the cost of walking the rest of the stack.
Furthermore, we can keep the partial stack trace to a minimal representation, e.g., as a list of function return addresses in binary, and only format it nicely (lookup symbols, etc.) when the catch block actually asks for it. Then this will also save on the cost of formatting the stack trace until it's actually needed, or if the catch block just discards the Exception without ever looking at the trace, then we've saved the cost of formatting it.
As for returning the same exception instance if the rate of throws is too high (as described by the linked article): this may be more complicated to implement, since you'll have to change the meaning of 'new' when applied to an exception class. Currently in D, we first 'new' the Exception object (which may be a derived class, so it's not trivial to check whether it's an instance of Exception), then we throw it. I've written code before where the Exception object is created by a separate function, and thrown elsewhere. It would require rather major language changes to be able to return the same Exception instance if the rate of throws is too high.
Whereas if we construct the stack trace lazily, we can save on the cost of *all* Exceptions (most catch blocks don't bother reading the stack trace, for example) with minimal/no language changes.
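To make the lazy-trace idea concrete, here is a rough, hypothetical sketch (this is not druntime's actual implementation, and the LazyTrace name is made up): capture only raw return addresses eagerly via glibc's backtrace, and defer backtrace_symbols and formatting until somebody actually reads the trace.
------
import core.stdc.stdlib : free;
import core.stdc.string : strlen;

// glibc functions, declared directly; assumed available on Linux.
extern(C) int backtrace(void** buffer, int size);
extern(C) char** backtrace_symbols(const(void*)* buffer, int size);

class LazyTrace
{
    private void*[64] addrs;   // raw return addresses: cheap to capture at throw time
    private int depth;
    private string[] resolved; // filled in only if somebody asks for the trace

    this()
    {
        // Capturing the addresses is the only work done eagerly.
        depth = backtrace(addrs.ptr, cast(int) addrs.length);
    }

    // Symbol lookup and formatting happen here, not in the constructor, so an
    // exception whose trace is never printed never pays for them.
    string[] symbols()
    {
        if(resolved is null && depth > 0)
        {
            char** syms = backtrace_symbols(addrs.ptr, depth);
            scope(exit) free(syms);
            foreach(i; 0 .. depth)
                resolved ~= syms[i][0 .. strlen(syms[i])].idup;
        }
        return resolved;
    }
}
------
A trace object built along these lines would only pay for backtrace() at throw time and could defer symbol resolution to toString/opApply, which is essentially the direction the later comments take.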
Comment #10 by bearophile_hugs — 2013-12-24T17:18:58Z
(In reply to comment #9)
> As for returning the same exception instance if the rate of throws is too high
I have run the little programs in the linked article, and I have not seen the same exception instance being returned with a modern JVM (some years have passed since that article). So I am not sure such an optimization is still present.
Comment #11 by sean — 2014-02-07T11:00:14Z
I think traces could probably be done lazily. A pointer to the stack top is available on throw, and all the code executed from that point should be above that, so the stack should be preserved. It's worth a shot.
Comment #12 by hsteoh — 2014-02-07T11:24:00Z
(In reply to comment #11)
> I think traces could probably be done lazily. A pointer to the stack top is
> available on throw, and all the code executed from that point should be above
> that, so the stack should be preserved. It's worth a shot.
But wouldn't that be trashed by the stack unwinding? I mean, if a catch block makes function calls before looking at the trace, wouldn't that scramble what was there?
Comment #13 by sean — 2014-02-07T16:06:57Z
(In reply to comment #12)
>
> But wouldn't that be trashed by the stack unwinding? I mean, if a catch block
> makes function calls before looking at the trace, wouldn't that scramble what
> was there?
Yeah. I don't think it would be possible to delay tracing until the trace is actually requested. The best you could do is delay tracing until entering a catch block, but that's really no different from tracing on throw.
Comment #14 by hsteoh — 2014-02-07T16:50:27Z
True, but at least we can lazily construct it as we unwind, so if the stack is 20 function calls deep and the catch block is only 5 functions up, then we defer constructing 15 levels of call stack until the user asks for the trace. Also, constructing the string representation of the trace can be deferred until then, too, so expensive things like symbol lookups are avoided if nobody ever looks at the trace.
Comment #15 by destructionator — 2014-02-07T17:02:32Z
I just played with this a little. Regular setup: ~270ms for the exception.
Add Runtime.traceHandler = null;, then run: down to about ~7ms per exception.
Then if I also allocate the throwables on a separate buffer (http://arsdnet.net/dcode/except.d), it goes down to about ~4ms for a single exception. (I expect this might make a bigger difference on really long-running things.)
But it is clear that the trace handler costs far more than the throwable allocation.
Maybe we shouldn't do the stack trace on non-debug builds. Or at least publicize the traceHandler line so people can opt out of it.
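(For reference, the opt-out mentioned above is just the druntime trace-handler hook; here is a minimal sketch using the same era StopWatch code as the original report, with the timing code being mine rather than the comment's:)
------
import core.runtime : Runtime;
import std.conv : to;
import std.datetime;
import std.stdio;

void main()
{
    // Disable stack-trace generation entirely: exceptions are still thrown
    // and caught normally, they just carry no backtrace.
    Runtime.traceHandler = null;

    auto sw = StopWatch(AutoStart.yes);
    try
        throw new Exception("blah");
    catch(Exception e)
    {
        auto diff = to!Duration(sw.peek());
        writeln(diff.total!"usecs"(), " us");
    }
}
------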
As for doing it lazily, that would be a good idea. I'm not sure if we can, but one possibility might be to copy the call stack but NOT call backtrace_symbols until the last possible moment. Replacing that function with a stub that just returns null brings the time to 12 ms (all other things unchanged). That's a significant improvement; 20x faster than normal.
But am I right that backtrace_symbols just needs that call stack data? ... which is a static array right there. Do we have to call backtrace_symbols in the constructor?
Hmmm, apparently yes: I tried moving it and got a huge speed improvement (~10ms)... but it crashed on an assertion failure after printing the stack trace... though I think this is because of demangling rather than any intrinsic problem.
I'm still looking at it, but I think this might be fixable and could give a significant improvement.
Comment #16 by destructionator — 2014-02-07T17:13:03Z
I got it kinda working, though I included one bit of a hack. I'm not sure if this is right; I don't fully understand why that assertion is there in the first place, but I figured that if it was going to trigger, we could just return without demangling. It seems to give the same output as the regular handler.
Relevant code is at the bottom:
http://arsdnet.net/dcode/except2.d
I copy/pasted the trace handler from druntime, then hacked it up so that backtrace_symbols is called in opApply rather than in the ctor. This also meant moving the symbol list from a local variable to a class member and freeing the memory with a local scope guard instead of the dtor. Then I hacked around the failing assert (line 291).
Chained exceptions still work. Demangling still works. Backtraces are still printed on toString. We could perhaps cache the result of the symbol getter, btw; I didn't here, primarily because opApply is a const method, so that would mean casting away const (the whole logical-const caching issue lol). But who calls toString more than once on an exception anyway? So meh, I don't think it matters. Besides, I'd rather that be slow than the typical throw+catch pair.
Also, I'm on Linux, so idk if this will be better on other systems, but it looks to me like the other OSes share this code, so maybe it will rock there too.
But yeah, I think lazy backtrace is a win. Test plz.
Comment #17 by hsteoh — 2014-02-07T17:26:18Z
Whoa. If this works out well, D exceptions could finally beat Java's!
Comment #18 by issues.dlang — 2014-02-07T17:36:50Z
> Maybe we shouldn't do the stack trace on non-debug builds.
Bad idea IMHO. Exceptions and their stack traces can still be logged in release builds, which can be useful when a customer has an issue.
The results of your experimentation sound promising though.
> Whoa. If this works out well, D exceptions could finally beat Java's!
That would definitely be cool. Having fast exceptions would be a huge boon IMHO, since code written with exceptions is generally much cleaner, and if we can use exceptions even in performance-critical code, then we can have cleaner code that still performs well. There will probably always be code that needs to avoid exceptions for one reason or another, but if we can minimize their negative impact, that would be fantastic. Right now, it's pretty bad.
Comment #19 by destructionator — 2014-02-07T19:22:45Z
Comment #20 by destructionator — 2014-02-07T20:10:53Z
My pull request there is Posix-only. A win, but not complete without looking at Windows too.
And I'm not sure I understand the Windows stack trace code. There must be something lazily initialized in there, since the first exception handled is significantly slower than any subsequent ones, with or without the trace handler. At first I thought it was loading the debug help library, but that seems to be done in a static module constructor. Maybe it is deferred anyway, idk.
But anywho, with the trace handler turned on, the second exception clocks in at about 250 us. With the trace handler turned off, the second exception comes in at about 30 us, the same ballpark as Linux after the change. So we are still looking at a huge potential improvement here. But I'm not sure where, and my Windows dev environment sucks. Maybe we can move the call to m_trace = trace out into opApply too, just storing the CONTEXT*. That could work, but then again, the trace method says it returns a thing to be used by resolve later on, so maybe it is important that it be done here.
Idk, gotta experiment with it, and I'm out of time tonight. At least the Linux change seems to be a big success so far.
BTW, I can't help but wonder whether we could shave a few more microseconds off by getting the trace info objects from a small reusable pool instead of doing `new` each time we throw an exception. I think we'd get a small, cost-free (from the user's perspective) win there too.
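(The pooling idea is only a suggestion at this point; purely as a hypothetical illustration, with none of these names existing in druntime, a tiny thread-local free list for trace buffers might look like this:)
------
// Hypothetical sketch: reuse trace buffers instead of GC-allocating a fresh
// one on every throw. Module-level variables are thread-local in D.
struct TraceBuffer
{
    void*[64] addrs;   // raw return addresses
    int depth;
    TraceBuffer* next;
}

private TraceBuffer* freeList;

TraceBuffer* acquireTraceBuffer()
{
    if(auto buf = freeList)
    {
        freeList = buf.next;
        buf.depth = 0;
        return buf;
    }
    return new TraceBuffer;   // fall back to allocation when the pool is empty
}

void releaseTraceBuffer(TraceBuffer* buf)
{
    buf.next = freeList;
    freeList = buf;
}
------
The throw path would acquire a buffer and run the backtrace into addrs; the runtime would release it once the exception has been handled or collected.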
Comment #21 by bearophile_hugs — 2014-02-08T14:56:55Z
Have you tried to run the D&Java benchmark attached to this Issue 9584?
Comment #22 by bearophile_hugs — 2014-02-08T15:00:14Z
(In reply to comment #21)
> Have you tried to run the D&Java benchmark attached to this Issue 9584 ?
Sorry, please ignore this question. I have just read pull 717 and I see that you have already done everything.
Comment #23 by destructionator — 2014-02-08T15:23:19Z
(In reply to comment #21)
> Have you tried to run the D&Java benchmark attached to this Issue 9584 ?
I didn't run those specifically; I just did the try/catch Jonathan wrote. Let me try them... the D one is much improved.
Before:
$ time ./except_d 5000
Exceptions: HI=2500 / LO=2500
real 0m0.840s
After:
$ time ./except_d 5000
Exceptions: HI=2500 / LO=2500
real 0m0.010s
Almost 100x better in this test. This is on Linux. druntime on git already has a similar optimization for Windows, so a huge improvement should be seen there on the next dmd release too.
To be fairer, let's do 500,000 iterations with Java to hide the JVM startup cost (note my Java install is old: JDK 1.6.0):
$ time java except 500000
Exceptions: HI=250000 / LO=250000
real 0m0.666s
And D for 500,000 iterations:
$ time ./except_d 500000
Exceptions: HI=250000 / LO=250000
real 0m0.758s
So we are a little bit slower for a large number of iterations, but it is pretty close now. (BTW, compiling with -O -inline etc. makes no significant difference.) Certainly a lot better than before.
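(The attached except_d/except benchmark itself isn't reproduced in this thread; purely as an illustration of the pattern being timed — a loop that alternates two throw sites and counts how often each is caught — a D version might look roughly like this, with all names being guesses rather than the attachment's actual code:)
------
import std.conv : to;
import std.stdio : writefln;

void thrower(int i)
{
    // Alternate between two different exception messages.
    if(i % 2 == 0)
        throw new Exception("HI");
    else
        throw new Exception("LO");
}

void main(string[] args)
{
    immutable n = args.length > 1 ? to!int(args[1]) : 5000;
    int hi, lo;
    foreach(i; 0 .. n)
    {
        try
            thrower(i);
        catch(Exception e)
        {
            if(e.msg == "HI") ++hi;
            else ++lo;
        }
    }
    writefln("Exceptions: HI=%s / LO=%s", hi, lo);
}
------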
Comment #24 by destructionator — 2014-02-20T13:59:42Z
The pull request with my new code for Linux was just merged, and Windows already had improved code.
Jonathan, can you confirm that the new version is better on your computer too, and close the bug if so?
Comment #25 by issues.dlang — 2014-03-06T23:33:28Z
Yay! We're now actually beating Java. With the example, I'm seeing around 12 or 13 microseconds with Java (same as before), and I'm now only seeing 7 or 8 microseconds with D. So, we're taking about 3/4 of the time that Java is. With 2.064, I'm seeing around 260 to 320 microseconds, so that's a considerable improvement.
It's also quite a bit better than the numbers that I initially reported, but I have updated my computer since then, so I don't know how much of the difference is due to my computer and how much is due to other improvements made to the code since the bug was reported. And actually, the fact that I'm seeing the same times for Java as before, even though I have a faster machine now, is rather curious. I have no clue why that would be.
Regardless, this is a huge improvement for us. Thanks!
Comment #26 by hsteoh — 2014-03-07T17:06:46Z
Hooray! Now I can feel comfortable about using exceptions in my code! Well, technically speaking, it shouldn't really matter that much, since exceptions are, well, exceptional, so they should never happen where performance matters. But still, it feels good to one-up Java. :P