Bug 10668 – Unicode characters, when taken from strings (as char), are not printed correctly

Status: RESOLVED
Resolution: INVALID
Severity: normal
Priority: P2
Component: phobos
Product: D
Version: D2
Platform: x86_64
OS: Mac OS X
Creation time: 2013-07-19T01:55:00Z
Last change time: 2013-07-19T08:41:49Z
Assigned to: nobody
Creator: MATTCA

Attachments

ID    Filename   Summary                                        Content-Type              Size
1234  example.d  A small program which demonstrates the issue.  application/octet-stream  187

Comments

Comment #0 by MATTCA — 2013-07-19T01:55:53Z
Created attachment 1234
A small program which demonstrates the issue.

When obtaining a char from within a string of non-ASCII characters (in this example, the pound sign '£'), the resulting char is not printed correctly to the console (via std.stdio.writeln). Instead, the '?' symbol is printed. However, when printing the entire string, the '£' is printed correctly.
Comment #1 by MATTCA — 2013-07-19T01:58:27Z
The content of the attachment, just in case:

module main;

import std.stdio;

void main(string[] args) {
    string s = "£££";
    writeln(s); // Output: £££

    char c = s[0];
    writeln(c); // Output: ?

    writeln(s[0]); // Output: ?
}
Comment #2 by monarchdodra — 2013-07-19T02:41:47Z
Well... what did you think it was going to print? you have a utf-8 sequence. char c = s[0]; will extract the first code*point* of your unicode. You want the first code*unit*.

http://www.fileformat.info/info/unicode/char/a3/index.htm
EG: £ is the code point U+00A3.
In UTF-8 it is represented by the sequence: [0xC2, 0xA3]

When you write "char c = s[0];", you are extracting the first code unit, which is 0xC2. When you pass this to writeln, what happens will mostly depend on your locale/codepage. If it is set to UTF-8 (CP65001 on Windows), then it will print the "unknown character", since you passed an incomplete sequence.

The correct code you want is:
dchar c = s.front;

(remember to import std.array to get front).

Another alternative is to simply work from the ground up with dstrings.

module main;

import std.stdio;

void main(string[] args) {
    dstring s = "£££";
    writeln(s); // Output: £££

    dchar c = s[0];
    writeln(c); // Output: £

    writeln(s[0]); // Output: £
}

Do you have access to "The D Programming Language"? It has the best introduction to Unicode/UTF I've read.
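
For illustration only, the difference between indexing and front can be sketched as below. This is a minimal example assuming a reasonably current Phobos, where front for strings lives in std.range.primitives (importing std.array also provides it). The helper names firstCodeUnit and firstCodePoint are hypothetical and do not come from the bug report.

```d
import std.range.primitives : front;
import std.stdio : writefln, writeln;

/// Hypothetical helper: the first UTF-8 code unit of `s`.
/// This is a single byte and may be only part of a character.
char firstCodeUnit(string s)
{
    return s[0];
}

/// Hypothetical helper: the first code point of `s`,
/// decoded from as many code units as it needs.
dchar firstCodePoint(string s)
{
    return s.front;
}

void main()
{
    string s = "£££";

    // '£' (U+00A3) is encoded in UTF-8 as the two code units 0xC2 0xA3,
    // so indexing yields only the lead byte 0xC2.
    writefln("first code unit:  0x%02X", cast(ubyte) firstCodeUnit(s));
    writefln("first code point: U+%04X", cast(uint) firstCodePoint(s));

    // A dchar holds a complete code point, so writeln can encode it fully.
    writeln(firstCodePoint(s));

    // length counts code units: three characters, two bytes each.
    writeln(s.length);
}
```

Printing the lone code unit directly, as in the original report, hands the terminal an incomplete UTF-8 sequence, which is why what appears depends on the locale.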
Comment #3 by nilsbossung — 2013-07-19T07:59:08Z
(In reply to comment #2)
> Well... what did you think it was going to print? you have a utf-8 sequence.
> char c = s[0]; will extract the first code*point*

You mean code*unit*.

> of your unicode. You want the first code*unit*.

code*point*
Comment #4 by MATTCA — 2013-07-19T08:24:57Z
(In reply to comment #2)
> Well... what did you think it was going to print? you have a utf-8 sequence.
> char c = s[0]; will extract the first code*point* of your unicode. You want the
> first code*unit*.
>
> http://www.fileformat.info/info/unicode/char/a3/index.htm
> EG: £ is the code point U+00A3.
> In UTF-8 it is represented by the sequence: [0xC2, 0xA3]
>
> When you write "char c = s[0];", you are extracting the first code unit, which
> is 0xC2. When you pass this to writeln, what happens will mostly depend
> on your locale/codepage. If it is set to UTF-8 (CP65001 on Windows), then it will
> print the "unknown character", since you passed an incomplete sequence.
>
> The correct code you want is:
> dchar c = s.front;
>
> (remember to import std.array to get front).
>
> Another alternative is to simply work from the ground up with dstrings.
>
> module main;
>
> import std.stdio;
>
> void main(string[] args) {
>     dstring s = "£££";
>     writeln(s); // Output: £££
>
>     dchar c = s[0];
>     writeln(c); // Output: £
>
>     writeln(s[0]); // Output: £
> }
>
> Do you have access to "The D Programming Language"? It has the best
> introduction to Unicode/UTF I've read.

Thanks for the response! Yeah, I converted my project to use dstrings on the off chance it worked after posting, and lo and behold, this is the fix, it seems.

I plan on eventually getting the book, although I've read some bad reviews regarding the e-book/Kindle version, so I'm having to wait a little longer to get a hard copy.
Comment #5 by monarchdodra — 2013-07-19T08:28:23Z
(In reply to comment #3)
> (In reply to comment #2)
> > Well... what did you think it was going to print? you have a utf-8 sequence.
> > char c = s[0]; will extract the first code*point*
>
> You mean code*unit*.
>
> > of your unicode. You want the first code*unit*.
>
> code*point*

Oops. Massive face-palm. Thank you for correcting me.
Comment #6 by monarchdodra — 2013-07-19T08:41:49Z
(In reply to comment #4)
> Thanks for the response! Yeah, I converted my project to use dstrings on the
> off chance it worked after posting, and lo and behold, this is the fix, it seems.
>
> I plan on eventually getting the book, although I've read some bad reviews
> regarding the e-book/Kindle version, so I'm having to wait a little longer to
> get a hard copy.

I'd recommend trying to get your project to work with "normal UTF-8" strings. They're the norm in D, and you'll have to get around to understanding how they work sooner or later.

To make it *really* simple, a UTF-8 string should be handled like a bidirectional range of dchars. You can ask for front/back, popFront/popBack, and empty. Stick to only these primitives, and your code is *guaranteed* to work.

All the other primitives (length, index, slice), while *present*, require much more knowledge of what is going on, and should be used only when you *know* what you are doing. As a matter of fact, if you ask whether a string supports, say, length, with "hasLength!string", it will say "false".
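
As a rough sketch of the advice above, the following walks a UTF-8 string using only the range primitives front/back, popFront and empty. It assumes a reasonably current Phobos, where these live in std.range.primitives. The helper countCodePoints is a hypothetical name introduced here for illustration and is not part of Phobos or of the original report.

```d
import std.range.primitives : back, empty, front, hasLength, popFront;
import std.stdio : writeln;

/// Hypothetical helper: counts code points using only range primitives.
size_t countCodePoints(string s)
{
    size_t n = 0;
    while (!s.empty)
    {
        // popFront skips one whole code point,
        // however many code units it spans.
        s.popFront();
        ++n;
    }
    return n;
}

void main()
{
    string s = "£5!";

    writeln(s.front); // first code point, decoded as a dchar
    writeln(s.back);  // last code point, decoded as a dchar

    writeln(countCodePoints(s)); // code points seen by the range interface
    writeln(s.length);           // UTF-8 code units, not characters

    // The range traits deliberately report that string has no length,
    // even though the built-in array property exists.
    static assert(!hasLength!string);
}
```

For the string "£5!" the two counts differ (three code points, four code units), which is the mismatch that makes length, indexing and slicing risky on UTF-8 strings unless the code is written with code units in mind.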