Bug 3193 – Support Windows-1251 as a source encoding

Status: RESOLVED
Resolution: WONTFIX
Severity: enhancement
Priority: P2
Component: dmd
Product: D
Version: D2
Platform: x86
OS: Windows
Creation time: 2009-07-20T04:01:00Z
Last change time: 2015-06-09T05:14:59Z
Assigned to: nobody
Creator: ok96

Attachments

ID  | Filename                       | Summary                                                                               | Content-Type | Size
428 | russianstdout.png              | This screenshot is from Chris Miller                                                  | image/png    | 38560
429 | privyet.png                    | Correct Russian output                                                                | image/png    | 12300
431 | D Windows Unicode text.jpg     | D Windows Unicode text - edit in Notepad and compile result in Console                | image/jpeg   | 53737
432 | D UTF-8 text.jpg               | D UTF-8 text - edit in Notepad and compile result in Console                          | image/jpeg   | 57257
433 | D ANSI text.jpg                | D ANSI text with Russian letters - edit in Notepad and compile result in Console      | image/jpeg   | 62817

Comments

Comment #0 by ok96 — 2009-07-20T04:01:54Z
If you compile the hello.d example with Russian Windows-1251 characters in this line:

    printf("Привет, D!\n");

dmd.exe reports an error:

    D:\Apps\Prog_D\dmd\samples\d>dmd hello.d
    hello.d(5): invalid UTF-8 sequence
    hello.d(5): invalid UTF-8 sequence
    hello.d(5): invalid UTF-8 sequence
    hello.d(5): invalid UTF-8 sequence
    hello.d(5): invalid UTF-8 sequence
    hello.d(5): invalid UTF-8 sequence

If you save hello.d in UTF-8, dmd.exe still compiles it wrong (see http link).
Comment #1 by jarrett.billingsley — 2009-07-20T06:09:18Z
The compiler does not understand Windows-1251, so this is according to spec. However, you say the compiler compiles it wrong if it's in UTF-8; where's the link?
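To illustrate why a Windows-1251 file trips a UTF-8 decoder, here is a minimal sketch in Python (chosen only for illustration; it says nothing about how dmd is implemented). It encodes the string from comment #0 both ways and checks each byte sequence against a strict UTF-8 decoder. The helper name `is_valid_utf8` is hypothetical.

```python
# Illustration only: Windows-1251 bytes for Cyrillic text are not valid UTF-8.
TEXT = "Привет, D!\n"

cp1251_bytes = TEXT.encode("cp1251")   # one byte per Cyrillic letter
utf8_bytes = TEXT.encode("utf-8")      # two bytes per Cyrillic letter


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Show the six bytes that encode the letters of "Привет" in Windows-1251.
    print(cp1251_bytes[:6].hex(" "))
    print("cp1251 bytes valid as UTF-8:", is_valid_utf8(cp1251_bytes))
    print("utf-8 bytes valid as UTF-8: ", is_valid_utf8(utf8_bytes))
```

Each of the six Cyrillic letters becomes a single byte in the range 0xC0 to 0xFF, which UTF-8 either reserves for the start of a multi-byte sequence or does not allow at all. That may be why the error is reported six times for the one line, although the thread does not confirm it.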
Comment #2 by ok96 — 2009-07-20T06:13:25Z
Created attachment 428 This screenshot is from Chris Miller
Comment #3 by jarrett.billingsley — 2009-07-20T07:52:38Z
Sorry, this is invalid. To solve this, you have to do the following: 1) Set cmd.exe's font to Lucida Console. 2) Execute 'chcp 65001'. Then run your program.
Comment #4 by jarrett.billingsley — 2009-07-20T07:53:22Z
Created attachment 429 Correct Russian output Here's an image that shows it working properly.
Comment #5 by ok96 — 2009-07-20T22:28:13Z
But Jarrett, almost everybody who codes in Russian needs the Windows-1251 codepage by default. If we need to compile a small program and we don't have a robust IDE, we use notepad.exe (or something like it), which saves Russian text in Windows-1251. And nobody will change the default font in "Command Prompt" to Lucida Console only for my small program, I swear! All the other compilers (Pascal, C, C++) understand that Russian text in Windows is in Windows-1251! Currently I don't have any good editor for D where I can normally edit Russian text in UTF-8. Entice Designer has a bug confirmed by Chris Miller: you cannot enter Russian text, only copy and paste it. Therefore, if you build a D compiler for the Win32 platform, you have to make it work with widely used regional codepages. Because the entire world is not English only and not entirely UTF-8!
Comment #6 by jarrett.billingsley — 2009-07-20T23:11:34Z
What you're basically asking for is an enhancement. I'm sorry, but that's the way it works.
Comment #7 by smjg — 2009-07-21T14:43:27Z
Why not make this enhancement request "Write a decent, free, Unicode-compatible code editor that syntax-highlights D properly"?
Comment #8 by jarrett.billingsley — 2009-07-21T17:07:58Z
(In reply to comment #7)
> Why not make this enhancement request "Write a decent, free, Unicode-compatible
> code editor that syntax-highlights D properly"?

Why not be a sarcastic ass _all the time_?
Comment #9 by ok96 — 2009-07-22T01:19:42Z
Dear friends, D has really good ideas behind its face, but Unicode support (UTF-16) in the compiler instead of old UTF-8 is a "MUST HAVE" feature. It's a great need in non-Latin languages. The Windows-1251 codepage is #1 for Russian Windows programmers. Otherwise the D compiler will stay an experiment forever.
Comment #10 by smjg — 2009-07-22T01:47:17Z
(In reply to comment #9)
> Dear friends, D has really good ideas behind its face, but Unicode support
> (UTF-16) in the compiler instead of old UTF-8 is a "MUST HAVE" feature.

DMD already supports UTF-16. Even UTF-32. Why do you want UTF-8 support removed?

> It's a great need in non-Latin languages. The Windows-1251 codepage is #1
> for Russian Windows programmers. Otherwise the D compiler will stay an
> experiment forever.

How would supporting codepages work anyway? Would they be converted to UTF-8 at compile time? In this case, D would need some form of character encoding declaration. Or would they be left as they are, and be rejected only in wchar, wchar[], dchar and dchar[] literals? What about all the D features and APIs that rely on char[] being UTF-8?

Seriously, if you're going to code in D and need to use non-ASCII characters, it goes without saying that you should have a Unicode-compatible editor. The lack of good D editors may be a real issue at the moment, but AISI it makes little sense to try to work around it. No programming language is born with high-quality development tools. People need to write them. (That said, there have been a few dedicated D IDE projects. What's the highest stage of development any of them is at?)
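The "converted to UTF-8" option can also be done outside the compiler, as a step before compiling. The following is a minimal sketch in Python, assuming the author knows and states the source code page. The function names `to_utf8` and `convert_file` are hypothetical, and nothing here is a dmd feature.

```python
# Sketch of a pre-compile step: re-encode a code-page source file as UTF-8.
# The source encoding cannot be guessed reliably, so the caller must supply it.
from pathlib import Path


def to_utf8(data: bytes, source_encoding: str = "cp1251") -> bytes:
    """Decode bytes from the declared code page and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str = "cp1251") -> None:
    """Write a UTF-8 copy of src to dst (hypothetical helper)."""
    dst.write_bytes(to_utf8(src.read_bytes(), source_encoding))


if __name__ == "__main__":
    # The line from comment #0, as an editor saving in Windows-1251 would store it.
    original = 'printf("Привет, D!\\n");'.encode("cp1251")
    converted = to_utf8(original)
    print(converted.decode("utf-8"))
```

The `source_encoding` parameter plays the role of the "character encoding declaration" mentioned above: without it, the bytes alone do not say which code page they came from.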
Comment #11 by ok96 — 2009-07-22T01:57:48Z
Stewart, Windows compilers SHOULD understand and correctly convert regional characters for console and dialogs (from resource files). The simplest test for the compiler in Windows is to enter text in notepad.exe in regional language and try to compile the file. MS VCPP compiler, BCC compiler and any other C++ compiler do it. And if DMD supports UTF-16 then how to make it work with UTF-16 Russian text entered in the simplest Notepad editor?
Comment #12 by bugzilla — 2009-07-22T02:32:23Z
> And if DMD supports UTF-16 then how to make it work with UTF-16 Russian text
> entered in the simplest Notepad editor?

DMD will automatically detect and work correctly with UTF-16 and UTF-32 encoded source files. The logic to do this is in module.c of the compiler source code. If it does not work with a particular UTF-16 encoded file, please attach that file to this bug report.

Note that UTF-16 encoded files are not encoded using a code page. If a source file is encoded with a particular code page, there is no way for the compiler to automatically detect it. C compilers often have a command line flag which is used to tell it what code page to use. Using code pages, therefore, makes your source code completely non-portable which is one of the reasons why D uses Unicode instead.
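The difference between detectable and undetectable encodings can be sketched as follows. This is an illustration in Python, not the logic in module.c, which may work differently. It assumes the file starts with a byte order mark (BOM), and the function name `detect_by_bom` is hypothetical.

```python
# Illustration: Unicode encodings can announce themselves with a BOM,
# while single-byte code pages carry no marker at all.
import codecs
from typing import Optional

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_by_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    utf16_file = codecs.BOM_UTF16_LE + "Привет".encode("utf-16-le")
    cp1251_file = "Привет".encode("cp1251")

    print(detect_by_bom(utf16_file))
    print(detect_by_bom(cp1251_file))

    # The same code-page bytes decode without error under two code pages,
    # giving different text, so the bytes alone cannot settle which is meant.
    print(cp1251_file.decode("cp1251"))
    print(cp1251_file.decode("cp1252"))
```

The last two lines show why a command line flag (or some other declaration) is needed for code pages: both decodings succeed, and only one of them is what the author typed.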
Comment #13 by ok96 — 2009-07-22T03:42:38Z
Created attachment 431 D Windows Unicode text - edit in Notepad and compile result in Console
Comment #14 by ok96 — 2009-07-22T03:43:10Z
Created attachment 432 D UTF-8 text - edit in Notepad and compile result in Console
Comment #15 by ok96 — 2009-07-22T03:44:11Z
Created attachment 433 D ANSI text with Russian letters - edit in Notepad and compile result in Console
Comment #16 by ok96 — 2009-07-22T03:51:47Z
Dear Walter, please take a close look at my last 3 attachments having "edit in Notepad and compile result in Console" text in their descriptions. Note that all Russians have the 866 codepage by default in the Windows Command Prompt. Nobody will switch 866 to any other codepage for a console application.
Comment #17 by smjg — 2009-07-22T04:32:31Z
(In reply to comment #16)
> Dear Walter,
> Please take a close look at my last 3 attachments having "edit in Notepad and
> compile result in Console" text in descriptions.

Going by your screenshots and their descriptions, DMD is behaving correctly.

> Note that all Russians have 866 codepage by default in Windows Command Prompt.

You mean it's hard-coded for each language's edition of Windows? That's something else that ought to change.

> Nobody will be switching 866 to any other codepage for console application.

Console output is an entirely separate issue from source encoding. I feel that D's stdio ought to support codepages, but it doesn't (aside from the fact that printf isn't part of D's stdio). Meanwhile, please check out my utility library http://pr.stewartsplace.org.uk/d/sutil/
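To make the separation between source encoding and console output concrete, here is a small Python sketch. It models the console as something that decodes whatever bytes the program writes using its own code page (866 in the case discussed here). The function name `as_seen_by_console` is hypothetical and the model is a simplification of what a real Windows console does.

```python
# Illustration: the console interprets output bytes in its own code page,
# regardless of how the source file was encoded.
TEXT = "Привет"


def as_seen_by_console(output_bytes: bytes, console_codepage: str = "cp866") -> str:
    """Decode program output the way a console using console_codepage would."""
    return output_bytes.decode(console_codepage, errors="replace")


if __name__ == "__main__":
    # Program writes the UTF-8 bytes of the literal unchanged: garbled.
    print(as_seen_by_console(TEXT.encode("utf-8")))
    # Program writes Windows-1251 bytes: also garbled under code page 866.
    print(as_seen_by_console(TEXT.encode("cp1251")))
    # Program converts to the console's code page before writing: readable.
    print(as_seen_by_console(TEXT.encode("cp866")))
```

Under this model only the last case displays correctly, which is the conversion a code-page-aware stdio layer would have to perform.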
Comment #18 by andrei — 2009-07-24T18:10:55Z
I think support for codepages and other character types could be implemented in a library. That was the ambitious purpose behind std.encoding. Yet another great project for someone interested.
Comment #19 by dfj1esp02 — 2009-07-27T02:39:21Z
As to console output, it's a duplicate of (runtime) bug 2742 or bug 1448. Tango and the C API work correctly, Phobos doesn't. As to cp1251, this ice-age technology is definitely not the way to go; Unicode is the future. No, it's the present. Windows works in Unicode and you should use it. As to conversion of source from the ANSI to the OEM codepage, it's a valid RFE, but hardly anyone will implement it. You can try it yourself.
Comment #20 by smjg — 2009-07-27T03:34:29Z
(In reply to comment #19)
> As to console output, it's a duplicate of (runtime) bug 2742 or bug 1448.

This is getting OT for this bug report, but what this conversation has drifted into is related to 2742. 1448 is a separate issue.

> As to conversion of source from the ANSI to the OEM codepage, it's a valid
> RFE, but hardly anyone will implement it. You can try it yourself.

I already have. See comment 17.
Comment #21 by dfj1esp02 — 2009-07-28T02:37:01Z
Hmm... your library is just an API, it has nothing to do with source encoding and as far as I see it accepts utf8 text, not ANSI.
Comment #22 by smjg — 2009-07-28T03:30:01Z
(In reply to comment #21)
> Hmm... your library is just an API, it has nothing to do with source encoding

As has a lot of the discussion here from comment 13 onwards. Maybe, to avoid confusion, we should continue this conversation at bug 2742. Or perhaps even better, on the newsgroup.

> and as far as I see it accepts utf8 text, not ANSI.

Not quite. It communicates with the console in the console codepage. Application code communicates with it in UTF-8.
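The arrangement described here, UTF-8 on the application side and the console code page on the other, can be sketched as a thin writer. This is a Python illustration of the general idea only; it is not the sutil API, and the class name `CodepageWriter` and its parameters are hypothetical.

```python
# Sketch: accept UTF-8 from the application, emit console-code-page bytes.
import io
from typing import BinaryIO


class CodepageWriter:
    """Wrap a binary stream and transcode UTF-8 input to a console code page."""

    def __init__(self, raw: BinaryIO, codepage: str = "cp866") -> None:
        self._raw = raw
        self._codepage = codepage

    def write(self, utf8_data: bytes) -> int:
        """Transcode one chunk of complete UTF-8 text and write it out."""
        text = utf8_data.decode("utf-8")
        # Characters the code page cannot represent are written as "?".
        encoded = text.encode(self._codepage, errors="replace")
        return self._raw.write(encoded)


if __name__ == "__main__":
    sink = io.BytesIO()               # stands in for the console
    out = CodepageWriter(sink, "cp866")
    out.write("Привет, D!\n".encode("utf-8"))
    # Decode with the console's code page to see what would be displayed.
    print(sink.getvalue().decode("cp866"), end="")
```

The application never needs to know the console code page; only the wrapper does, which is why the source file can stay in UTF-8.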
Comment #23 by chalucha — 2015-04-09T18:32:53Z
I came across this and think that it can be closed already. Unicode source files work, and I don't think other encodings for source files are required anymore. Console output is another story, discussed elsewhere.