Bug 3465 – isIdeographic can be wrong in std.xml

Comment #0 by y0uf00bar — 2009-11-01T21:51:25Z

The std.xml function isIdeographic failed my parser on one of the xml conformance tests for the character 0x4E00.

// As implemented in XML Piece Parser Project, http://source.miryn.org/
// but I took it from std.xml

//WRONG in std.xml
//invariant IdeographicTable=[0x4E00,0x9FA5,0x3007,0x3007,0x3021,0x3029];

//RIGHT, because for lookup function,
// the table data range pairs should be ordered!
dchar[] IdeographicTable=[0x3007,0x3007,0x3021,0x3029,0x4E00,0x9FA5];

// PERFORMANCE SUGGESTION
// also lookup is best done for tables that are larger
// for smaller tables, like this one, or character,
// surely a hard coded search will be faster
// Surely not much more code, is generated for this.
// and faster, since no function call to lookup, and no array slices used.

bool isIdeographic(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c >= 0x3007 && c <= 0x3029)
        return true;
    if (c >= 0x4E00 && c <= 0x9FA5)
        return true;
    return false;
}

// Only suggestion here..
// isChar has to be called for every single character in the document, and
// it must be worth a bit of optimisation,
// especially for common cases.

/**
 * Returns true if the character is a character according to the XML standard
 * Character references must refer to one of these.
 * Any unicode character, excluding surrogate blocks FFFE and FFFF.
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 * Avoid [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
 * Standards: $(LINK2 http://www.w3.org/TR/1998/REC-xml-19980210, XML 1.0)
 *
 * Params:
 * c = the character to be tested
 * The standard ASCII case gets at most 3 value comparisons.
 */
bool isChar(dchar c)
{
    if (c <= 0xD7FF)
    {
        if (c >= 0x20)
        {
            if (c >= 0x7F)
            {
                if (c <= 0x84)
                    return false;
                if (c >= 0x86)
                {
                    if (c <= 0x9F)
                        return false;
                }
            }
            return true;
        }
        switch(c)
        {
        case 0xA:
        case 0x9:
        case 0xD:
            return true;
        default:
            return false;
        }
    }
    else if (c >= 0xE000)
    {
        if (c < 0xFFFE)
        {
            if (c >= 0xFDD0 && c <= 0xFDEF)
                return false;
            return true;
        }
        if (c >= 0x10000)
        {
            if (c <= 0x10FFFF)
            {
                /* some conformance tests have the 0x10FFFF
                if ((c & 0xFFFE) == 0xFFFE)
                {
                    return false;
                }
                */
                return true;
            }
        }
    }
    return false;
}

// Most digits are expected to be ASCII ones
bool isDigit(dchar c)
{
    if (c <= 0x0039 && c >= 0x0030)
        return true;
    else
        return lookup(DigitTable,c);
}
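
To illustrate why the order of the range pairs matters, here is a minimal sketch. It assumes that lookup performs a binary search over a flat array of inclusive [low, high] pairs. The name lookupSketch is a hypothetical stand-in written for this illustration, not the actual std.xml code. A binary search discards half of the table at each step, so it can only find a range if the pairs are in ascending order.

```d
import std.stdio : writeln;

// Hypothetical helper (an assumption, not the std.xml source): binary search
// over a flat array of inclusive [low, high] pairs. The search is only
// correct if the pairs are sorted in ascending order.
bool lookupSketch(const(int)[] table, int c)
{
    while (table.length != 0)
    {
        // Index of the middle pair, rounded down to an even position.
        immutable m = (table.length >> 1) & ~cast(size_t) 1;
        if (c < table[m])
            table = table[0 .. m];      // continue in the lower half
        else if (c > table[m + 1])
            table = table[m + 2 .. $];  // continue in the upper half
        else
            return true;                // c lies inside this pair
    }
    return false;
}

void main()
{
    // Pair order reported as wrong in this bug.
    immutable int[] unsorted = [0x4E00, 0x9FA5, 0x3007, 0x3007, 0x3021, 0x3029];
    // Ascending pair order, as proposed in the report.
    immutable int[] sorted = [0x3007, 0x3007, 0x3021, 0x3029, 0x4E00, 0x9FA5];

    writeln(lookupSketch(unsorted, 0x4E00)); // prints false
    writeln(lookupSketch(sorted, 0x4E00));   // prints true
}
```

With the unsorted table the search first inspects the middle pair (0x3007, 0x3007), decides that 0x4E00 must lie further right, and never looks at the first pair again, which matches the failure described above.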

Comment #1 by y0uf00bar — 2009-11-01T21:58:11Z

// A check on my code indicates afternoon doziness, so here is the better version
bool isIdeographic(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c <= 0x3029 && c >= 0x3021)
        return true;
    if (c <= 0x9FA5 && c >= 0x4E00)
        return true;
    return false;
}

Comment #2 by rsinfu — 2010-05-23T21:36:54Z

Fixed in svn r1552. Thanks for your contribution! Excuse me: I removed certain parts of your code from the actual commit. The contributed code took newer Unicode standards into account. I like new things, but to support XML 1.0 we have to stick to Unicode 2.0.
