Bug 12990 – utf8 string not read/written to windows console

Status: NEW
Severity: normal
Priority: P3
Component: phobos
Product: D
Version: D2
Platform: All
OS: Windows
Creation time: 2014-06-25T13:48:23Z
Last change time: 2024-12-01T16:21:32Z
Assigned to: No Owner
Creator: Sum Proxy
See also: https://issues.dlang.org/show_bug.cgi?id=1448, https://issues.dlang.org/show_bug.cgi?id=24001
Moved to GitHub: phobos#9636

Comments

Comment #0 by sum.proxy — 2014-06-25T13:48:23Z
The following code should echo a Unicode (specifically Cyrillic) string to a Windows console with the code page set to 65001, but the string comes out empty:

    import std.stdio;

    void main()
    {
        string s = stdin.readln();
        write(s);
    }

The same code works correctly when run under the Windows debugger windbg.exe, so hopefully it will be an easy fix.
Comment #1 by dfj1esp02 — 2014-06-27T14:59:07Z
This bug is probably better to split. It either read an invalid utf-8 string, or couldn't write a valid utf-8 string.
Comment #2 by dfj1esp02 — 2014-06-27T15:02:18Z
    import std.stdio, std.utf;

    void main()
    {
        string s = stdin.readln();
        validate(s);
        write(s);
    }

Check if validation passes.
Comment #3 by sum.proxy — 2014-06-27T15:27:51Z
I still see no output in the regular console (and no indication of an exception either). However, when I run it under windbg.exe it throws some exception (I can't tell which one exactly, as I couldn't figure out how to load debug symbols). It looks like a write problem to me.
Comment #4 by dfj1esp02 — 2014-06-27T18:56:55Z
Then try write(cast(ubyte[])s);
Comment #5 by sum.proxy — 2014-06-28T07:12:22Z
This time it returned an empty array ([]). Thanks.
Comment #6 by sum.proxy — 2014-07-03T08:02:07Z
I also tried it on a 32-bit windows system and the behavior is the same - no output.
Comment #7 by dfj1esp02 — 2014-07-07T09:00:29Z
An empty array means no input rather than no output. Did it wait for the input? Do you compile it for console or GUI subsystem?

    echo 000 | yourprogram.exe

Does this work?
Comment #8 by sum.proxy — 2014-07-07T09:38:18Z
Yes, it does wait for the input, but the output is empty. It's a console application and sending the input through pipe seems to work correctly.
Comment #9 by sum.proxy — 2014-08-15T11:34:25Z
Sorry, any feedback on this one?
Comment #10 by dlang-bugzilla — 2014-10-25T02:10:16Z
Try calling SetConsoleCP(65001) and SetConsoleOutputCP(65001).
Comment #11 by sum.proxy — 2014-10-25T10:41:22Z
I tried the new version of the compiler with the issue you referred to, but alas - no luck. Please see https://issues.dlang.org/show_bug.cgi?id=1448#c12 SetConsoleCP(65001) and SetConsoleOutputCP(65001) didn't help either. Thanks.
Comment #12 by dlang-bugzilla — 2014-10-25T13:51:34Z
Indeed. Happens with both DMC and MSVC runtime.
Comment #13 by dlang-bugzilla — 2014-10-25T13:53:32Z
"scanf" misbehaves in the same way. Not a D bug, I think.
Comment #14 by sum.proxy — 2014-10-25T14:35:30Z
Do you think it is necessary to report the issue elsewhere, or will the people in charge of https://issues.dlang.org/show_bug.cgi?id=1448 do it?
Comment #15 by dlang-bugzilla — 2014-10-25T14:42:32Z
Report it where? To Microsoft? Figuring out why scanf is failing would probably be the next step to resolving this.
Comment #16 by sum.proxy — 2014-10-25T14:50:12Z
Are you referring to C's scanf? Is it consistently reproducible in a small chunk of C code?
Comment #17 by dlang-bugzilla — 2014-10-25T15:01:25Z
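Yep:
/////////// test.c ///////////
#include <stdio.h>
#include <string.h>
#include <windows.h>

int main(void)
{
    char buf[1024] = "";   /* stays empty if scanf reads nothing */
    SetConsoleCP(65001);
    SetConsoleOutputCP(65001);
    scanf("%1023s", buf);
    printf("%d", (int)strlen(buf));
    return 0;
}
//////////////////////////////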
Comment #18 by sum.proxy — 2014-10-25T20:25:45Z
From what I know, this program will work incorrectly for any non-ASCII Unicode input, which I have confirmed through simple tests. scanf and strlen rely on '\0' to indicate string termination, but I don't think this goes well with Unicode strings. I believe the right way to do something similar (without buffer length) is this:

    #include <stdio.h>
    #include <fcntl.h>
    #include <io.h>

    int main( void )
    {
        wchar_t buf[1024];
        _setmode( _fileno( stdin ), _O_U16TEXT );
        _setmode( _fileno( stdout ), _O_U16TEXT );
        wscanf( L"%ls", buf );
        wprintf( L"%s", buf );
    }

For further info please refer to http://www.siao2.com/2008/03/18/8306597.aspx and http://msdn.microsoft.com/en-us/library/tw4k6df8%28v=vs.120%29.aspx
HTH, Thanks.
Comment #19 by dlang-bugzilla — 2014-10-26T00:35:23Z
(In reply to Sum Proxy from comment #18)
> scanf and strlen rely on '\0' to indicate string termination, but I don't
> think this goes well with unicode strings.

Not true. At least, not true with UTF-8, which is what we set the CP to.

> I believe the right way to do something similar (without buffer length) is
> this:

I would not say that's the "right" way. That's the way to read wchar_t text, but we need UTF-8 text.
Comment #20 by sum.proxy — 2014-10-28T11:32:14Z
I believe the problem is that the default internal representation of Unicode in Windows is UTF-16, which implies that some sort of conversion would be necessary here. I haven't found a way to do it right yet.
Comment #21 by sum.proxy — 2014-10-28T12:28:55Z
Or perhaps "the right" way would be to stick to UTF-16, since it's the default for Unicode in Windows.
Comment #22 by sum.proxy — 2014-10-28T12:53:37Z
This actually works on my system:
///////////// test.d //////////////
import std.stdio;
import std.c.windows.windows;

extern(Windows) BOOL SetConsoleCP( UINT );

void main()
{
    SetConsoleCP(1200);
    string s = stdin.readln();
    write(s);
}
///////////////////////////////////
Comment #23 by bugzilla — 2023-06-22T08:10:47Z
(In reply to Sum Proxy from comment #22)
> SetConsoleCP(1200);

1200 | utf-16 | Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications
https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
Comment #24 by robert.schadek — 2024-12-01T16:21:32Z
THIS ISSUE HAS BEEN MOVED TO GITHUB https://github.com/dlang/phobos/issues/9636 DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB