Bug 18388 – std.experimental.logger slow performance

Status
RESOLVED
Resolution
LATER
Severity
normal
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86_64
OS
Linux
Creation time
2018-02-07T08:16:58Z
Last change time
2018-12-18T23:19:05Z
Assigned to
No Owner
Creator
Arun Chandrasekaran

Comments

Comment #0 by aruncxy — 2018-02-07T08:16:58Z
std.experimental.logger is very slow when compared to some implementation in C++ (spdlog) From my initial observations, the max CPU usage per thread doesn't exceed 30% on average. 07-02-2018 00:09:00 vaalaham ~/code/cpp/spdlog/bench $ make >/dev/null 2>&1 && time ./spdlog-bench-mt 8 real 0m0.562s user 0m1.480s sys 0m1.620s 07-02-2018 00:09:05 vaalaham ~/code/cpp/spdlog/bench $ 07-02-2018 00:09:26 vaalaham ~/code/d/std-log-benchmark $ time ./std-log-benchmark 8 1000000 real 0m5.617s user 0m4.048s sys 0m3.674s 07-02-2018 00:09:34 vaalaham ~/code/d/std-log-benchmark $ import core.atomic; import core.thread; import std.parallelism; import std.experimental.logger; import std.stdio; import std.conv : to; shared int msgCounter = 0; shared int maxCount = 1_000_000; __gshared FileLogger _logger; void main(string[] args) { if (args.length != 3) { writefln("Usage: %s <thread-count> <loop-count>", args[0]); return; } _logger = new FileLogger("/dev/null", LogLevel.trace); const threadCount = to!int(args[1]); maxCount = to!int(args[2]) - threadCount; auto pool = new TaskPool(threadCount); foreach (tid; 0 .. threadCount) pool.put(task!logMe(tid + 1)); pool.finish; } void logMe(const int tid) { while (true) { _logger.tracef("From T-%s log message #%s: This is some text for your pleasure", tid, atomicOp!"+="(msgCounter, 1)); if (atomicLoad(msgCounter) > maxCount) break; } }
Comment #1 by aruncxy — 2018-02-07T08:23:07Z
Platform: Ubuntu 17.10 - 64 bit. Compiler: LDC - the LLVM D compiler (1.7.0): based on DMD v2.077.1 and LLVM 5.0.1 built with LDC - the LLVM D compiler (1.7.0) Default target: x86_64-unknown-linux-gnu Host CPU: skylake http://dlang.org - http://wiki.dlang.org/LDC Registered Targets: aarch64 - AArch64 (little endian) aarch64_be - AArch64 (big endian) arm - ARM arm64 - ARM64 (little endian) armeb - ARM (big endian) nvptx - NVIDIA PTX 32-bit nvptx64 - NVIDIA PTX 64-bit ppc32 - PowerPC 32 ppc64 - PowerPC 64 ppc64le - PowerPC 64 LE thumb - Thumb thumbeb - Thumb (big endian) x86 - 32-bit X86: Pentium-Pro and above x86-64 - 64-bit X86: EM64T and AMD64
Comment #2 by dfj1esp02 — 2018-02-08T09:51:44Z
AFAIK, default logger implementations are simplistic and are not meant to be very fast. Fast logging requires a bit of design and is supposed to be 3rd party library, while std.experimental.logger provides a way to integrate it.
Comment #3 by aruncxy — 2018-02-08T16:54:17Z
(In reply to anonymous4 from comment #2) > AFAIK, default logger implementations are simplistic and are not meant to be > very fast. Fast logging requires a bit of design and is supposed to be 3rd > party library, while std.experimental.logger provides a way to integrate it. Thanks, but that's weird. If that's the case, why would anyone want to use std.experimental.logger after all? Everyone will default to using their own logger implementations not derived out of std.experimental.logger, knowing that it is *slow*. I'm pretty sure that's not the goal of Phobos. May be something was overlooked in the design or implementation that causes this slowness?
Comment #4 by aruncxy — 2018-09-05T04:11:10Z
On the same box with ldc 1.12.0-beta1, I see that the performance has improved by around 20%. 04-09-2018 21:06:15 vaalaham ~/code/d/std-log-benchmark $ time ./std-log-benchmark 8 1000000 real 0m3.958s user 0m3.457s sys 0m2.840s 04-09-2018 21:08:47 vaalaham ~/code/d/std-log-benchmark $
Comment #5 by rburners — 2018-09-05T07:33:23Z
The performance problem is not with std.experimental.logger, it is with the benchmark. The amount thread contention is killing all performance. The benchmark becomes single threaded around three times for each line printed. 2x atomic ops on msgCounter, 1x writing the line to file.
Comment #6 by aruncxy — 2018-09-06T19:06:20Z
> The benchmark becomes single threaded around three times for each line printed. How is it a benchmark issue? I agree that three API for a single log line is expensive. Isn't that a problem with std.experimental.logger? I also see that a flush after every line is causing the slowdown as well. Probably do a flush only when the message level is fatal/error/warn and also during termination?
Comment #7 by rburners — 2018-09-06T20:14:05Z
I didn't say anything about api calls. That is not the problem. The problem with the benchmark is that the threads share memory. That means each write will, given you tested on a multicore cpu, invalidates some of the CPUs caches. That means the program has to go to RAM, and that is slow, really slow. I bet you a drink at next years DConf, that if you check perf you will find that the CPU is waiting for data from RAM most of the execution time. You need to flush after each log call. If there is a log buffer and the program crashes, you might not see the log line that indicates the problem.
Comment #8 by aruncxy — 2018-09-06T22:09:24Z
spdlog (C++) does exactly the same thing and it is much faster as I shared the stats earlier. It is apples to apples, IMO. So the contention is same in both C++ and D versions. In fact, D does much better with atomic operations. Disabling the log statements with 8 threads and 100000000 iterations, C++ version takes real 0m3.167s user 0m25.215s sys 0m0.004s whereas D version takes real 0m2.527s user 0m20.124s sys 0m0.000s You can compare the asm at https://godbolt.org/z/JjfTyw
Comment #9 by aruncxy — 2018-09-06T22:29:20Z
> You need to flush after each log call. If there is a log buffer and the program crashes, you might not see the log line that indicates the problem That's right, but may be worth doing on a configuration basis (async mode)? Also I'm not sure how formattedWrite is implemented. As I see it writes piece by piece into the LockingWriter. The log string could probably be constructed first, followed by a single write into the sink. This will avoid multiple chunked writes.
Comment #10 by aruncxy — 2018-09-07T02:07:10Z
s/_logger.tracef/printf/ 06-09-2018 18:25:51 vaalaham ~/code/d/std-log-benchmark $ time ./std-log-benchmark 8 1000000 > /dev/null real 0m0.495s user 0m1.254s sys 0m2.100s 06-09-2018 19:02:22 vaalaham ~/code/d/std-log-benchmark $ So the slowdown is definitely related to phobos.
Comment #11 by aruncxy — 2018-09-07T03:09:14Z
s/_logger.tracef/writeln 06-09-2018 20:05:27 vaalaham ~/code/d/std-log-benchmark $ time ./std-log-benchmark 8 1000000 | tail real 0m0.975s user 0m2.067s sys 0m2.418s 06-09-2018 20:05:29 vaalaham ~/code/d/std-log-benchmark I can infer that std.format is probably a little slower than fmtlib (C++) https://github.com/fmtlib/fmt But I need to test more to confirm this claim. But for sure std.experimental.logger is slow.
Comment #12 by rburners — 2018-09-07T09:52:01Z
Building the string in memory will put quite some pressure on the GC. Using a stack based array is limiting and not @safe. formattedWrite can do more than printf. I stayed away from allowing configurations, because it is really hard to know what should be a config option and what not. So the idea is to inherit and change in code what need to be different.
Comment #13 by aruncxy — 2018-12-18T23:19:05Z
Closing as of now as per the last comment - we can reopen in future if required.