Bug 18241 – Missing characters from std.uni.unicode.Default_Ignorable_Code_Point

Status: RESOLVED
Resolution: INVALID
Severity: normal
Priority: P1
Component: phobos
Product: D
Version: D2
Platform: x86_64
OS: Linux
Creation time: 2018-01-15T23:50:32Z
Last change time: 2018-01-16T00:13:30Z
Assigned to: No Owner
Creator: hsteoh

Comments

Comment #0 by hsteoh — 2018-01-15T23:50:32Z

The set returned by unicode.Default_Ignorable_Code_Point is missing some characters listed in:

http://www.unicode.org/L2/L2002/02368-default-ignorable.pdf

where Default_Ignorable_Code_Point is defined as:

Other_Default_Ignorable_Code_Point + (Cf + Cc + Cs - White_Space)

While characters in Other_Default_Ignorable_Code_Point seem to be included correctly, two characters in Cf appear to be missing from the set:

- U+06DD
- U+070F

Furthermore, characters in (Cc - White_Space) are also missing:

- U+0000 to U+0008
- U+000E to U+001F

(See also: PR #5, referencing the Unicode Standard section 5.22.)

Not sure if this is because these missing characters were added in a later Unicode standard than was originally implemented in std.uni.

Comment #1 by hsteoh — 2018-01-15T23:59:38Z

Actually, strike U+06DD and U+070F from the list. These are explicitly specified by TR 44 as being exceptional cases that should NOT be ignored, even though they belong to the Cf category.

Comment #2 by hsteoh — 2018-01-16T00:13:30Z

Actually, never mind this bug. The file at http://www.unicode.org/L2/L2002/02368-default-ignorable.pdf is outdated; std.uni actually does obey the latest standard as given in ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt.
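
For anyone who wants to spot-check the set against DerivedCoreProperties.txt, a minimal sketch along these lines should work. Note that isDefaultIgnorable is a hypothetical helper written for this illustration, not part of Phobos, and U+00AD (SOFT HYPHEN) is included only as a comparison point that DerivedCoreProperties.txt lists as default-ignorable.

```d
import std.stdio : writefln;
import std.uni : unicode;

/// Hypothetical helper (not part of Phobos): reports whether `c`
/// is in the set that std.uni exposes for the property.
bool isDefaultIgnorable(dchar c)
{
    // unicode.Default_Ignorable_Code_Point yields a CodepointSet;
    // indexing it with a code point tests membership.
    return unicode.Default_Ignorable_Code_Point[c];
}

void main()
{
    // Code points mentioned in this report, plus U+00AD for comparison.
    foreach (cp; [0x0000, 0x0008, 0x000E, 0x001F, 0x06DD, 0x070F, 0x00AD])
        writefln("U+%04X: %s", cp, isDefaultIgnorable(cast(dchar) cp));
}
```

The printed results can then be compared line by line with the Default_Ignorable_Code_Point entries in DerivedCoreProperties.txt for the Unicode version that the installed Phobos tables were generated from.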