Removing Invalid Control Characters from XML


The Problem
Today our application reported the following error while importing XML data into Solr:
<str name="msg">Illegal character ((CTRL-CHAR, code 22)) at [row,col {unknown-source}]: [1,115]</str>

Using Eclipse remote debugging, I captured the XML data the client had sent:
<field name="lnks">\MB\abc def....</field>
The cause is an invalid control character in the XML data: code 22 (0x16) is the "synchronous idle" (SYN) character.

Wrapping the field value in a CDATA section won't help in this case. From http://msdn.microsoft.com/en-us/library/ms256076.aspx:
Content within CDATA sections must be within the range of characters permitted for XML content; control characters and compatibility characters cannot be escaped this way. In addition, the sequence ]]> cannot appear within a CDATA section because this sequence signals the end of the section. This means that CDATA sections cannot be nested. The sequence also appears in some scripts. Within scripts, it is usually possible to substitute] ]> for ]]>.
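To see this for yourself, here is a minimal sketch that feeds three small documents to the JDK's built-in XML parser. The class name CdataControlCharDemo and the helper parses are made up for illustration, and the parser is the JDK default rather than the one Solr uses, so the exact error text will differ from the one above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class CdataControlCharDemo {

    /** Returns true if the JDK's default XML parser accepts the document. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // the document is not well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        char syn = 0x16; // "synchronous idle", the character from the Solr error
        System.out.println(parses("<field>abc def</field>"));
        System.out.println(parses("<field>abc" + syn + "def</field>"));
        System.out.println(parses("<field><![CDATA[abc" + syn + "def]]></field>"));
    }
}
```

Only the first document should be accepted. The CDATA version is rejected just like the plain one.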

The Fix
To fix this, we have to remove these control characters from the XML data, or replace them with another character such as a space or a dash.
This is easy to do on the client side: write a function that removes or replaces the control characters, or reuse an existing library such as Guava (in Guava 19 and later the same matcher is available as CharMatcher.javaIsoControl()):
str = CharMatcher.JAVA_ISO_CONTROL.removeFrom(str);
or with a plain regular expression:
str = str.replaceAll("\\p{Cntrl}", "");
or, to keep tab, line feed, and carriage return, which are valid in XML:
str = str.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
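As a small sketch of the regex approach, the helper below removes or replaces control characters while keeping tab, line feed, and carriage return. The class and method names (ControlCharStripper, strip, replaceWithSpace) are hypothetical. Two details are easy to get wrong: inside a Java character class, a ^ that is not the first character is just a literal caret, so exclusions need the && intersection syntax. Also, \p{Cntrl} is ASCII-only by default, so it covers U+0000 to U+001F and U+007F but not the C1 range that Guava's matcher also removes.

```java
import java.util.regex.Pattern;

final class ControlCharStripper {

    // ASCII control characters, minus tab, line feed, and carriage return.
    private static final Pattern UNSAFE =
            Pattern.compile("[\\p{Cntrl}&&[^\\t\\n\\r]]");

    /** Removes the unsafe control characters. */
    static String strip(String s) {
        return UNSAFE.matcher(s).replaceAll("");
    }

    /** Replaces each unsafe control character with a single space. */
    static String replaceWithSpace(String s) {
        return UNSAFE.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String dirty = "abc" + (char) 0x16 + " def";
        System.out.println(strip(dirty)); // prints "abc def"
    }
}
```

Compiling the pattern once avoids recompiling the regex on every call, which matters if the client sanitizes every field before sending it to Solr.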

Explanation

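CharMatcher.JAVA_ISO_CONTROL matches characters in the ranges '\u0000' - '\u001F' and '\u007F' - '\u009F'. They are invisible and generally carry no meaning in text content. Most of them are invalid or discouraged in XML; the exceptions are tab, line feed, and carriage return, which XML allows.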

The same definition appears in the Javadoc of Character.isISOControl:
Determines if the referenced character (Unicode code point) is an ISO control character. A character is considered to be an ISO control character if its code is in the range '\u0000' through '\u001F' or in the range '\u007F' through '\u009F'.

The code point U+0000, assigned to the null control character, is the only character encoded in Unicode and ISO/IEC 10646 that is always invalid in any XML 1.0 and 1.1 document.
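The following ranges are restricted in XML 1.1 (they may appear only as character references) and are highly discouraged:
U+0001–U+0008, U+000B–U+000C, U+000E–U+001F: most (not all) C0 control characters. XML 1.0 forbids these outright.
U+007F–U+0084, U+0086–U+009F: DEL plus all but one C1 control character (the exception is U+0085, next line). XML 1.0 allows these but discourages them.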
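If you would rather filter by what XML 1.0 actually permits than by "is it a control character", the sketch below encodes the Char production from the XML 1.0 specification directly. The names Xml10Chars, isAllowed, and keepAllowed are made up for this example. Unlike the ISO control matcher, it keeps U+007F–U+009F (discouraged but legal in XML 1.0), and it also drops unpaired surrogates and U+FFFE/U+FFFF.

```java
final class Xml10Chars {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point XML 1.0 forbids removed. */
    static String keepAllowed(String s) {
        StringBuilder out = new StringBuilder(s.length());
        // codePoints() yields full code points, so surrogate pairs stay intact.
        s.codePoints().filter(Xml10Chars::isAllowed).forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "abc" + (char) 0x16 + " def";
        System.out.println(keepAllowed(dirty)); // prints "abc def"
    }
}
```

Working on code points rather than chars is a deliberate choice here: a char-by-char filter would treat each half of a valid surrogate pair as forbidden and strip characters outside the Basic Multilingual Plane.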

Resources
http://msdn.microsoft.com/en-us/library/ms256076.aspx
http://stackoverflow.com/questions/14028716/how-to-remove-control-characters-from-java-string
http://unicode-table.com/en/
