Bug 18241 – Missing characters from std.uni.unicode.Default_Ignorable_Code_Point

Status
RESOLVED
Resolution
INVALID
Severity
normal
Priority
P1
Component
phobos
Product
D
Version
D2
Platform
x86_64
OS
Linux
Creation time
2018-01-15T23:50:32Z
Last change time
2018-01-16T00:13:30Z
Assigned to
No Owner
Creator
hsteoh

Comments

Comment #0 by hsteoh — 2018-01-15T23:50:32Z
The set returned by unicode.Default_Ignorable_Code_Point is missing some characters listed in:

http://www.unicode.org/L2/L2002/02368-default-ignorable.pdf

where Default_Ignorable_Code_Point is defined as:

Other_Default_Ignorable_Code_Point + (Cf + Cc + Cs - White_Space)

While characters in Other_Default_Ignorable_Code_Point seem to be included correctly, two characters in Cf appear to be missing from the set:
- U+06DD
- U+070F

Furthermore, characters in (Cc - White_Space) are also missing:
- U+0000 to U+0008
- U+000E to U+001F

(See also: PR #5, referencing the Unicode Standard section 5.22.)

Not sure if this is because these missing characters were added in a later Unicode standard than was originally implemented in std.uni.
Comment #1 by hsteoh — 2018-01-15T23:59:38Z
Actually, strike U+06DD and U+070F from the list. These are explicitly specified by TR 44 as being exceptional cases that should NOT be ignored, even though they belong to the Cf category.
Comment #2 by hsteoh — 2018-01-16T00:13:30Z
Actually, never mind this bug. The file at http://www.unicode.org/L2/L2002/02368-default-ignorable.pdf is outdated; std.uni actually does obey the latest standard as given in ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt.
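
For anyone who wants to reproduce the check, here is a minimal sketch that queries the set for the code points mentioned in this report. The helper name isDefaultIgnorable and the choice of U+00AD (soft hyphen) as a contrasting sample are my own assumptions for illustration, not part of Phobos or the original report; results may differ depending on the Unicode version your Phobos tables were generated from.

```d
import std.stdio : writefln;
import std.uni : unicode;

/// Hypothetical helper (not part of Phobos): tests membership in the
/// set that std.uni exposes as unicode.Default_Ignorable_Code_Point.
bool isDefaultIgnorable(dchar c)
{
    return unicode.Default_Ignorable_Code_Point[c];
}

void main()
{
    // Code points discussed in this report, plus U+00AD (soft hyphen),
    // added here only as a contrasting sample.
    immutable dchar[] samples = [
        '\u0000', '\u0008', '\u000E', '\u001F',
        '\u06DD', '\u070F', '\u00AD',
    ];

    foreach (c; samples)
        writefln("U+%04X: %s", cast(uint) c, isDefaultIgnorable(c));
}
```

If the control characters and the two Cf exceptions print false while U+00AD prints true, that matches the conclusion above that std.uni follows DerivedCoreProperties.txt rather than the older L2/02-368 formula.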