Bug 7054 – format() aligns using code units instead of graphemes

Status: RESOLVED
Resolution: FIXED
Severity: normal
Priority: P4
Component: phobos
Product: D
Version: D2
Platform: All
OS: All
Creation time: 2011-12-02T11:00:52Z
Last change time: 2020-03-21T03:56:37Z
Keywords: bootcamp, pull
Assigned to: No Owner
Creator: Brad L
Depends on: 13348

Comments

Comment #0 by brad.lanam.comp — 2011-12-02T11:00:52Z

Using a width specifier in conjunction with a utf string fails. It would be nice to have a %S modifier that handles this situation.

123456789012345
   Größe: incorrect
     Größe: correct
Größe   : incorrect
Größe     : correct
Grö      : incorrect
Größ      : correct

import std.stdio;
import std.string;
import std.utf : count;

void main (string[] args)
{
    string t = "Größe";
    size_t width = 10;
    auto len = t.length;
    auto utflen = std.utf.count (t);
    auto tlen = width + len - utflen;
    writefln ("123456789012345");
    writefln ("%10s: incorrect", t);
    writefln ("%*s: correct", tlen, t);
    writefln ("%-10s: incorrect", t);
    writefln ("%-*s: correct", tlen, t);
    writefln ("%-10.4s: incorrect", t);
    auto fmt = format ("%%-%d.%ds: correct", tlen, tlen - 6);
    writefln (fmt, t);
}

Comment #1 by smjg — 2011-12-04T18:08:04Z

(In reply to comment #0)
> Using a width specifier in conjunction with a utf string fails. It would be
> nice to have a %S modifier that handles this situation.

We shouldn't need a different format specifier for this. It's a matter of what "Width" means in the format string spec, not one of different kinds of string format.

http://www.d-programming-language.org/phobos/std_format.html
"Specifies the minimum field width. If the width is a *, the next argument, which must be of type int, is taken as the width. If the width is negative, it is as if the - was given as a Flags character."

While it doesn't specify the units the width is measured in, it seems a reasonable assumption that it is intended to be characters, rather than bytes. Indeed, 1.071's version gets it right, so clearly the bug is in the D2 line's implementation.
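To make the bytes-versus-characters distinction concrete, here is a minimal sketch that pads a string by code points without relying on the formatter's width handling at all. `padLeftByCodePoints` is a hypothetical name used only for illustration; it is not a Phobos function. `std.utf.count` and `std.array.replicate` are existing Phobos functions.

```d
import std.array : replicate;
import std.utf : count;

/// Hypothetical helper (not part of Phobos): right-aligns `s` in a field
/// that is `width` code points wide.
string padLeftByCodePoints(string s, size_t width)
{
    immutable n = count(s); // number of code points, not UTF-8 code units
    return n >= width ? s : replicate(" ", width - n) ~ s;
}

void main()
{
    string t = "Größe";
    assert(t.length == 7);  // 'ö' and 'ß' take two UTF-8 code units each
    assert(count(t) == 5);  // five code points
    assert(padLeftByCodePoints(t, 10) == "     Größe");
}
```

Counting code points is enough for the string in this report, but it is only one possible reading of "width"; it does not account for combining marks or wide characters.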

Comment #2 by bugzilla — 2012-01-24T01:50:59Z

This is a bug in Phobos, not the compiler or spec.

Comment #3 by hsteoh — 2014-08-21T04:49:12Z

Tried to fix this today, unfortunately it's blocked by std.uni.byGrapheme being impure, which causes a ripple of impurity down the call chain causing several unittest compile errors and CTFE errors.

Comment #4 by dmitry.olsh — 2014-08-29T21:19:21Z

(In reply to hsteoh from comment #3)
> Tried to fix this today, unfortunately it's blocked by std.uni.byGrapheme
> being impure, which causes a ripple of impurity down the call chain causing
> several unittest compile errors and CTFE errors.

Why should it call byGrapheme? Doesn't seem likely that we are doing grapheme clustering only to output some damn text.

Comment #5 by hsteoh — 2014-08-29T21:31:59Z

Because grapheme clustering is the only sane way to handle output to a field of fixed length. For example, writefln("%5s", "a\u0301") should treat "a\u0301" as occupying only a single position in the 5-position wide output field. Any other solution would introduce further problems, e.g. if we count code points instead, then the width field in the format string would be basically useless (the caller will have to manually count output positions -- with byGrapheme -- and adjust the width accordingly). Furthermore, it would introduce more special cases (precomposed characters will format differently from base char + combining diacritic; non-spacing characters will consume field width but occupy no space in the actual output, etc.).
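The difference between the three ways of counting can be checked directly. The sketch below compares the decomposed and precomposed spellings of "á" using `std.utf.count` for code points and `std.uni.byGrapheme` for graphemes. It only illustrates the counting argument above; it is not the code that was eventually merged.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    string decomposed  = "a\u0301"; // 'a' followed by U+0301 COMBINING ACUTE ACCENT
    string precomposed = "\u00E1";  // U+00E1 LATIN SMALL LETTER A WITH ACUTE

    // Both spellings render as one character, but only the grapheme count agrees.
    assert(decomposed.length == 3);                 // UTF-8 code units
    assert(count(decomposed) == 2);                 // code points
    assert(decomposed.byGrapheme.walkLength == 1);  // graphemes

    assert(precomposed.length == 2);
    assert(count(precomposed) == 1);
    assert(precomposed.byGrapheme.walkLength == 1);
}
```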

Comment #6 by Marco.Leise — 2016-02-09T18:39:10Z

Graphemes work until you meet full-width characters.
Ｇｒａｐｈｅｍｅｓｗｏｒｋｕｎｔｉｌｙｏｕｍｅｅｔｆｕｌｌ－ｗｉｄｔｈｃｈａｒａｃｔｅｒｓ．

From Wikipedia: "With fixed-width fonts, a halfwidth character occupies half the width of a fullwidth character, hence the name."
https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

We need UTF decoding, grapheme clustering, character categorizing, super-cow-power width specifiers in our writeln.

Comment #7 by hsteoh — 2016-02-12T01:02:59Z

Argh. Welcome to Unicode, where exceptions *are* the norm, and no simple algorithm is simple in practice. And this is a double-argh, because when it comes to double-width characters, whether or not the output will even *look* right depends on what kind of terminal you're using, and how it handles double-width characters. Older terminals may not recognize double-width characters, and such characters may end up formatted as if they were single-width. (But then again, such terminals will already make a big unreadable mess of double-width characters anyway, so perhaps it's not so important to cater to them.) But once you start down this slippery slope, the next thing that will come up is making `writefln` support right-to-left text, then vertical text, etc., and before you know it, we'll be reinventing libpango except poorly (and for a text terminal where it's questionable whether such things are even relevant anymore).

Comment #8 by smjg — 2016-02-12T13:23:38Z

(In reply to Marco Leise from comment #6)
> https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

So "halfwidth" means the width of a character cell, and "fullwidth" means double that width. Seems counter-intuitive. I would have expected them to be something like "singlewidth" and "doublewidth" respectively.

So there are a few different units at work here:
- code units
- codepoints
- graphemes
- width units

A further complication is whether formattedWrite should be geared towards text terminals, writing data to a text file designed for human reading, writing data to a text file that follows a rigid format for machine processing or what.

So it looks like there's no simple solution. But in 99% of cases, using code units (as it does at the moment) is bound to be wrong.

Comment #9 by Marco.Leise — 2016-02-15T12:07:58Z

I always regarded it as merely a means to print stuff with a non-proportional font for humans to read that extends to text files. The match up of bytes and visual characters in the early days of printf is only a historical coincidence. Most terminals - like programming languages and GUI toolkits - have to adapt to the Unicode reality and I believe it is safe to assume that when someone calls writefln or format with full-width symbols they use a terminal that can handle them. The popular VTE library used by many recent Linux terminal emulators works great for example.

That said, printf is no better, and we could just claim that the width is meant to mean bytes or ASCII characters and you are supposed to use writefln only for English text in debugging output and not user interaction. std.stdio never cared about the user locale anyways. For all we know the output terminal might expect KOI-8 (Cyrillic) or some Indian script. In Java for example you are supposed to use an encoding wrapper if your stdout goes to a terminal, IIRC. But as Unicode is kind of ubiquitous now, we might as well say that Dlang only works on Unicode enabled systems.

Sorry for the derail ... :)

Comment #10 by hsteoh — 2016-02-18T00:22:58Z

Even if we concede that modern terminals ought to be Unicode-aware (if not fully supporting Unicode), there is still the slippery slope of how to print bidirectional text, vertical text, scripts that require glyph mutation, etc.. Where does one draw the line as to what writefln ought/ought not handle?

Comment #11 by smjg — 2016-02-18T13:30:59Z

What would supporting bidirectional text entail, exactly? It seems to me it's the job of the terminal to render characters in the correct order....

Comment #12 by Marco.Leise — 2016-02-24T14:38:34Z

(In reply to hsteoh from comment #10)
> Even if we concede that modern terminals ought to be Unicode-aware (if not
> fully supporting Unicode), there is still the slippery slope of how to print
> bidirectional text, vertical text, scripts that require glyph mutation,
> etc.. Where does one draw the line as to what writefln ought/ought not
> handle?

I tend to think like Stewart. If I was using a script other than Latin, Cyrillic and similarly simple scripts I would most likely expect writefln's output on a terminal to look like when I print a text file of the same script to the terminal. Mixing vertical and horizontal text on a terminal is painfully hard and my expectation is that there is at most an option to render either horizontally or vertically (transposed). In that case "minimal width" would become "minimal height" and we are out of trouble.

What exactly do you mean by glyph mutation? In most cases it is probably a task for the text layout engine the terminal uses. In other cases the user of writefln should be aware of how their script will display on a terminal and prepare their text accordingly before printing. There is no simple way to make plurals work in all languages either:
http://localization-guide.readthedocs.org/en/latest/l10n/pluralforms.html
Is that comparable to what you had in mind?

Comment #13 by hsteoh — 2018-01-08T15:52:49Z

*** Issue 18205 has been marked as a duplicate of this issue. ***

Comment #14 by b2.temp — 2018-01-09T12:19:48Z

pull: https://github.com/dlang/phobos/pull/6008

Comment #15 by github-bugzilla — 2018-01-13T08:37:05Z

Commits pushed to master at https://github.com/dlang/phobos

https://github.com/dlang/phobos/commit/f9058bce6155b7b153c86fbeff06ba5b9ade5335
fix issue 7054 - format() aligns using code units instead of graphemes

https://github.com/dlang/phobos/commit/2c0adf01bb9c2337841b4b248f31c3f4772030db
Merge pull request #6008 from BBasile/issue-18205

fix issue 7054 - format() aligns using code units instead of graphemes

Comment #16 by hsteoh — 2018-01-13T13:48:50Z

There still remains the following cases that need to be handled:
- Zero-width characters such as U+200B should not add to the width of the string;
- Wide / Full-width characters as defined by Unicode TR11 (EastAsianWidth.txt) should occupy 2 spaces per character, as this is what is done in many monospace terminal applications;
- Hangul Jamo syllables, while correctly segmented as single graphemes by graphemeStride, are designated as wide characters, and thus should occupy 2 spaces per grapheme (note that there can be multiple dchars per Jamo grapheme).
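A rough sketch of what counting "width units" per grapheme could look like is given below. Both `cellWidth` and `displayWidth` are hypothetical names, the range table is deliberately tiny and is not a substitute for the full EastAsianWidth.txt data, and taking the width from a grapheme's first code point is an assumption of this sketch rather than anything Phobos is known to do.

```d
import std.uni : byGrapheme;

/// Hypothetical, deliberately incomplete lookup of terminal cells per code point.
/// A real implementation would consult the complete EastAsianWidth.txt data.
size_t cellWidth(dchar c)
{
    if (c == 0x200B) return 0;                 // ZERO WIDTH SPACE
    if (c >= 0x1100 && c <= 0x115F) return 2;  // Hangul Jamo leading consonants
    if (c >= 0x4E00 && c <= 0x9FFF) return 2;  // CJK Unified Ideographs
    if (c >= 0xAC00 && c <= 0xD7A3) return 2;  // precomposed Hangul syllables
    if (c >= 0xFF01 && c <= 0xFF60) return 2;  // fullwidth forms
    return 1;                                  // everything else: one cell
}

/// Hypothetical helper: sums cell widths with one entry per grapheme, using the
/// grapheme's first code point to decide its width (an assumption of this sketch).
size_t displayWidth(string s)
{
    size_t total = 0;
    foreach (g; s.byGrapheme)
        total += cellWidth(g[0]);
    return total;
}

void main()
{
    assert(displayWidth("Größe") == 5);    // plain single-cell characters
    assert(displayWidth("a\u0301") == 1);  // combining accent adds no cell
    assert(displayWidth("a\u200Bb") == 2); // zero-width space adds no cell
    assert(displayWidth("Ｇｒ") == 4);     // two fullwidth letters, two cells each
}
```

Whether the result lines up on screen still depends on the terminal, as discussed in the earlier comments.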

Comment #17 by hsteoh — 2018-01-13T16:44:21Z

Related: issue #17810.

Comment #18 by greensunny12 — 2018-02-19T06:58:16Z

> There still remains the following cases that need to be handled:

I opened respective issues for these, s.t. it's easier to track progress and that this issue appears on the changelog (after all it has been fixed for most use cases and the one in this bug report). Also it's generally good to inform people of changes that have occurred in Phobos because of PR (things are moving forward) and quick reference in case they run into regressions.

https://issues.dlang.org/show_bug.cgi?id=18465
https://issues.dlang.org/show_bug.cgi?id=18466
https://issues.dlang.org/show_bug.cgi?id=18467