Bug 1448 – UTF-8 output to console is seriously broken

Status: REOPENED
Severity: normal
Priority: P3
Component: dmd
Product: D
Version: D2
Platform: x86
OS: Windows
Creation time: 2007-08-28T22:51:06Z
Last change time: 2024-12-13T17:47:58Z
Assigned to: No Owner
Creator: Alexander Solovey
See also: https://issues.dlang.org/show_bug.cgi?id=2742, https://issues.dlang.org/show_bug.cgi?id=12990, https://issues.dlang.org/show_bug.cgi?id=15845, https://issues.dlang.org/show_bug.cgi?id=15761, https://issues.dlang.org/show_bug.cgi?id=24001
Moved to GitHub: dmd#17629

Attachments

ID | Filename | Summary | Content-Type | Size
172 | utf8_bug1.c | Small test case for the same problem in DMC | application/octet-stream | 336

Comments

Comment #0 by a.solovey — 2007-08-28T22:51:06Z
If the Windows console code page is set to 65001 (UTF-8) and a program outputs non-ASCII characters in UTF-8 encoding, there is no more output after the first newline following an accented character. I believe the problem is in the underlying DMC stdio, but it is more disturbing with D, as D has good Unicode support and is very convenient for working with international texts. This problem has been reported in the newsgroup several times before; see for example http://www.digitalmars.com/d/archives/digitalmars/D/announce/openquran_v0.21_8492.html

Here is the code to illustrate the problem:

////////
import std.c.stdio;
import std.c.windows.windows;

extern(Windows) export BOOL SetConsoleOutputCP( UINT );

void main()
{
    SetConsoleOutputCP( 65001 ); // or use "chcp 65001" instead
    // Codepoint 00e9 is "Latin small letter e with acute"
    puts( "Output utf-8 accented char \u00e9\n... and the rest is cut off!\n" );
}
/////////

If you run it, "... and the rest is cut off!" won't be displayed. Do not forget to set the console font to Lucida Console before trying this.
Comment #1 by a.solovey — 2007-08-28T22:52:24Z
Created attachment 172 Small test case for the same problem in DMC
Comment #2 by smjg — 2007-08-29T13:03:13Z
The problem doesn't show if I use the Windows API (either WriteConsole or WriteFile) to output. So the bug must be somewhere in DM's stdio implementation.
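A minimal sketch of the bypass described in this comment, assuming druntime's `core.sys.windows` bindings (which postdate the original report). The helper name `writeStderrRaw` is hypothetical and not part of any library; it hands the UTF-8 bytes straight to `WriteFile` on the standard error handle so that the C runtime's stdio layer is never involved. The non-Windows branch exists only so the sketch compiles elsewhere.

```d
version (Windows)
{
    import core.sys.windows.windows;

    /// Hypothetical helper: writes raw UTF-8 bytes to the standard error
    /// handle via WriteFile, bypassing C stdio. Returns true on a full write.
    bool writeStderrRaw(const(char)[] text)
    {
        HANDLE h = GetStdHandle(STD_ERROR_HANDLE);
        if (h is null || h == INVALID_HANDLE_VALUE)
            return false;
        DWORD written;
        return WriteFile(h, text.ptr, cast(DWORD) text.length, &written, null) != 0
            && written == text.length;
    }
}
else
{
    import core.stdc.stdio : fwrite, stderr;

    /// Fallback so the sketch builds on other systems; it uses plain stdio.
    bool writeStderrRaw(const(char)[] text)
    {
        return fwrite(text.ptr, 1, text.length, stderr) == text.length;
    }
}

void main()
{
    version (Windows)
    {
        // Query the old code page first; SetConsoleOutputCP only returns a BOOL.
        const UINT oldCP = GetConsoleOutputCP();
        SetConsoleOutputCP(CP_UTF8);
        scope (exit) if (oldCP != 0) SetConsoleOutputCP(oldCP);
    }

    writeStderrRaw("Output utf-8 accented char \u00e9\n... and the rest follows\n");
}
```

Whether the accented character renders correctly still depends on the console font and Windows version; the sketch only avoids the stdio code path that this report is about.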
Comment #3 by bugzilla — 2007-09-28T22:15:07Z
Fixed dmd 1.021 and 2.004
Comment #4 by mk — 2007-10-29T11:02:51Z
The problem was NOT fixed for stderr (DMD 1.022)
Comment #5 by mk — 2007-10-29T11:04:25Z
*** Bug 1608 has been marked as a duplicate of this bug. ***
Comment #6 by mk — 2008-09-03T10:57:24Z
I hope this gets fixed one day. Here is an updated example, where it still doesn't work (for stderr, stdout is ok) as of DMD 1.035:

import std.c.stdio;
import std.c.windows.windows;

extern(Windows) export BOOL SetConsoleOutputCP( UINT );

void main()
{
    SetConsoleOutputCP( 65001 ); // or use "chcp 65001" instead
    // Codepoint 00e9 is "Latin small letter e with acute"
    fputs("Output utf-8 accented char \u00e9\n... and the rest is OK\n", stdout);
    fputs("Output utf-8 accented char \u00e9\n... and the rest is cut off!\n", stderr);
    fputs("STDOUT.\n", stdout);
    fputs("STDERR.\n", stderr);
}
Comment #7 by kevin — 2012-02-07T22:48:48Z
Sort of works for me. The text doesn't get cut off, but the unicode characters don't get displayed either.

C:\Users\Kevin\Documents\D Projects\ConsoleApp1\ConsoleApp1\bin>ConsoleApp1.exe
Output utf-8 accented char é
... and the rest is OK
Output utf-8 accented char ��
... and the rest is cut off!
STDOUT.
STDERR.

C:\Users\Kevin\Documents\D Projects\ConsoleApp1\ConsoleApp1\bin>
Comment #8 by mk — 2013-03-19T18:21:18Z
Status update as of DMD 2.062 (Win XP 32 bit): still the same error for the above-mentioned example. However, when it is modified to use write instead of fputs:

import std.stdio;
import std.c.windows.windows;

extern(Windows) BOOL SetConsoleOutputCP( UINT );

void main()
{
    SetConsoleOutputCP( 65001 ); // or use "chcp 65001" instead
    stderr.write("STDERR:Output utf-8 accented char \u00e9\n... and the rest is cut off!\n");
    stderr.write("end_STDERR.\n");
}

I get this error:

STDERR:Output utf-8 accented char é
... and the rest is cut off!
std.exception.ErrnoException@D:\PROGRAMS\DMD2\WINDOWS\BIN\..\..\src\phobos\std\stdio.d(1264): (No error)
----------------
0x0040D874
0x0040D6FF
0x00402218
0x00402189
0x00402121
0x00402030
0x0040354E
0x00403151
0x00402388
0x7C81776F in RegisterWaitForInputIdle
----------------

So if anybody has a clue what's going on there...
Comment #9 by ben — 2013-08-07T00:55:43Z
I can confirm this issue. When enumerating a directory (via dirEntries()) containing a file with a character in the CP850/CP1252 space (e.g. "säb"), the output depends on the codepage setting:

chcp 1252 => output is "säb" (Unicode encoding for "ä")
chcp 65001 => output is "säbstd.exception.ErrnoException@D:\tools\d\bin\..\src\phobos\std\stdio.d(1352): (No error)"

In both cases, cmd's dir, for example, shows the correct results. The correct results are also shown when using C with printf(), although that is not really comparable. I tried this in cmd, console2, and conemu; all show the same results. It'd really be nice if this bug got fixed...
Comment #10 by ben — 2013-08-07T00:58:06Z
Addendum: Windows 7 64-bit, dmd v2.063.2. Sorry.
Comment #11 by mk — 2014-02-24T17:18:25Z
Hallelujah, this (comment 8) seems fixed, finally. Can anybody confirm? Works for me on Windows XP 32 bit, dmd 2.065.0. Beware, fputs still doesn't work. I think it's a C library problem.
Comment #12 by sum.proxy — 2014-10-25T09:26:49Z
The issue still exists in DMD32 D Compiler v2.065, Windows 7

==============
Code:
==============
import std.stdio;
import std.c.windows.windows;

extern(Windows) BOOL SetConsoleOutputCP( UINT );

void main()
{
    SetConsoleOutputCP( 65001 ); // or use "chcp 65001" instead
    stderr.write("STDERR:Output utf-8 accented char \u00e9\n... and the rest is cut off!\n");
    stderr.write("end_STDERR.\n");
}
==============
Output:
==============
STDERR:Output utf-8 accented char é
... and the rest is cut off!
==============

end_STDERR.\n is not written
Comment #13 by mk — 2016-02-09T21:07:53Z
Final note, as this is unlikely to be fixed: use -m32mscoff and Microsoft VS linker.
Comment #14 by mk — 2016-11-30T11:14:40Z
Partial fix or workaround in druntime for unhandled exceptions: https://github.com/dlang/druntime/pull/1687
Comment #15 by kinke — 2019-06-13T18:33:49Z
Still an issue, but apparently restricted to stderr (and independent from DigitalMars/MS runtime):

```
import core.stdc.stdio;
import core.sys.windows.wincon, core.sys.windows.winnls;

void main()
{
    // SetConsoleOutputCP returns a BOOL, so query the old code page first
    const oldCP = GetConsoleOutputCP();
    SetConsoleOutputCP(CP_UTF8);
    scope(exit) SetConsoleOutputCP(oldCP);

    fprintf(stdout, "HellöѬ LDC\n");
    fflush(stdout);
    fprintf(stderr, "HellöѬ LDC\n");
    fflush(stderr);
}
```

=>

```
HellöѬ LDC
Hell
```

Tested with DMD 2.086.0 (-m32, -m32mscoff, -m64) and LDC on Win10.
Comment #16 by kinke — 2019-06-15T09:47:31Z
Update: it's working with Win10 v1903 (with the exact same binary that didn't work with v1803). According to Rainer Schütze, it's working since v1809. See https://devblogs.microsoft.com/commandline/windows-command-line-unicode-and-utf-8-output-text-buffer/.
Comment #17 by razvan.nitu1305 — 2019-08-12T12:05:57Z
(In reply to kinke from comment #16)
> Update: it's working with Win10 v1903 (with the exact same binary that
> didn't work with v1803). According to Rainer Schütze, it's working since
> v1809. See
> https://devblogs.microsoft.com/commandline/windows-command-line-unicode-and-utf-8-output-text-buffer/.

So is this issue fixed? I don't have a Windows machine to test it. Should we close this?
Comment #18 by kinke — 2019-08-12T13:03:28Z
This isn't solved, but would now be solvable with recent Windows versions.

There are 2 things about this:
* DMD outputs a mix of UTF-8 and strings in the current codepage, AFAIK without setting any console codepage, so DMD output on Windows can be garbage. LDC v1.17 fixes this for LDC.
* User programs writing UTF-8 strings to the console suffer from the same issue. This *could* be worked around by setting the console codepage in druntime's _d_run_main and resetting it to the original one before termination.
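The second point can be approximated in user code today. The following is only a sketch of the idea, not what druntime's `_d_run_main` does: it uses module constructors and destructors as a stand-in for runtime startup and shutdown, and the variable name `savedOutputCP` is hypothetical. It assumes a Windows version on which UTF-8 console output works (v1809 or later, per comment 16).

```d
version (Windows)
{
    import core.sys.windows.windows;

    // Hypothetical name: the console output code page in effect at startup.
    private __gshared UINT savedOutputCP;

    shared static this()
    {
        // GetConsoleOutputCP returns 0 on failure, e.g. when no console is attached.
        savedOutputCP = GetConsoleOutputCP();
        if (savedOutputCP != 0)
            SetConsoleOutputCP(CP_UTF8);
    }

    shared static ~this()
    {
        // Restore the original code page only if the startup query succeeded.
        if (savedOutputCP != 0)
            SetConsoleOutputCP(savedOutputCP);
    }
}

void main()
{
    import std.stdio : stderr, stdout;

    stdout.writeln("stdout: Hell\u00f6");
    stderr.writeln("stderr: Hell\u00f6");
}
```

One caveat of this design: module destructors do not run if the process is killed or crashes, so the console may be left on code page 65001 in those cases.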
Comment #19 by razvan.nitu1305 — 2019-10-24T09:32:25Z
(In reply to kinke from comment #18)
> This isn't solved, but would now be solvable with recent Windows versions.
>
> There are 2 things about this:
> * DMD outputs a mix of UTF-8 and strings in the current codepage, AFAIK
> without setting any console codepage, so DMD output on Windows can be
> garbage. LDC v1.17 fixes this for LDC.

How does LDC solve the problem?

> * User programs writing UTF-8 strings to the console suffer from the same
> issue. This *could* be worked around by setting the console codepage in
> druntime's _d_run_main and resetting it to the original one before
> termination.
Comment #20 by robert.schadek — 2024-12-13T17:47:58Z
THIS ISSUE HAS BEEN MOVED TO GITHUB

https://github.com/dlang/dmd/issues/17629

DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB