Programmer: Lifelong Learning: Removing Invalid Control Characters from XML

The Problem
Today our application reported the following error when importing XML data into Solr:
<str name="msg">Illegal character ((CTRL-CHAR, code 22)) at [row,col {unknown-source}]: [1,115]</str>

Using Eclipse remote debugging, I captured the XML data the client sent:
<field name="lnks">\MB\abc def....</field>
The cause is clear: the XML data contains an invalid control character, "synchronous idle" (SYN, code 22, U+0016).

Wrapping the field value in a CDATA section won't help in this case. From http://msdn.microsoft.com/en-us/library/ms256076.aspx:
Content within CDATA sections must be within the range of characters permitted for XML content; control characters and compatibility characters cannot be escaped this way. In addition, the sequence ]]> cannot appear within a CDATA section because this sequence signals the end of the section. This means that CDATA sections cannot be nested. The sequence also appears in some scripts. Within scripts, it is usually possible to substitute] ]> for ]]>.

The Fix
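To fix this, we have to remove these control characters from the XML data, or replace them with another character such as a space or a dash.
We can do this easily on the client side: write a function that replaces or removes these control characters, or reuse an existing library such as Guava (newer Guava versions expose the same matcher as CharMatcher.javaIsoControl()):
str = CharMatcher.JAVA_ISO_CONTROL.removeFrom(str);
Or with a plain Java regex, which removes every ASCII control character, including tab, CR and LF:
str = str.replaceAll("\\p{Cntrl}", "");
Or, to keep tab, CR and LF (note the && intersection; a ^ in the middle of a character class is a literal caret, not a negation):
str = str.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]+", "");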
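The regex approach can be wrapped in a small helper. This is a minimal sketch, not a definitive implementation: the class name ControlCharStripper and the method strip are made up for illustration. It assumes we want to keep tab, line feed and carriage return, which are legal in XML, and it relies on Java's default \p{Cntrl} class, which covers only the ASCII controls (U+0000 to U+001F and U+007F), not the C1 range.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes ASCII control characters except tab, LF and CR.
final class ControlCharStripper {

    // "&&[^...]" is Java's character-class intersection: it keeps only those
    // control characters that are NOT tab, line feed or carriage return.
    private static final Pattern UNWANTED_CONTROLS =
            Pattern.compile("[\\p{Cntrl}&&[^\r\n\t]]");

    private ControlCharStripper() {
    }

    static String strip(String input) {
        if (input == null) {
            return null;
        }
        return UNWANTED_CONTROLS.matcher(input).replaceAll("");
    }
}
```

Calling ControlCharStripper.strip(fieldValue) before building the Solr document would drop the SYN character from the failing field while leaving ordinary whitespace and any literal ^ characters untouched.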

Explanation

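CharMatcher.JAVA_ISO_CONTROL matches characters in the ranges '\u0000' - '\u001F' and '\u007F' - '\u009F'. They are invisible and carry no meaning in ordinary text processing. Most of the C0 controls (everything below '\u0020' except tab, line feed and carriage return) are invalid in XML 1.0. Because the matcher also covers tab, line feed and carriage return, removing every ISO control character also removes tabs and line breaks.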

http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/base/CharMatcher.html

http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html?is-external=true#isISOControl(char)

Determines if the referenced character (Unicode code point) is an ISO control character. A character is considered to be an ISO control character if its code is in the range '\u0000' through '\u001F' or in the range '\u007F' through '\u009F'.

http://en.wikipedia.org/wiki/Valid_characters_in_XML

The code point U+0000, assigned to the null control character, is the only character encoded in Unicode and ISO/IEC 10646 that is always invalid in any XML 1.0 and 1.1 document.

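In XML 1.1, the following ranges are restricted characters (the RestrictedChar production) and may appear only as character references:

U+0001–U+0008, U+000B–U+000C, U+000E–U+001F : this includes most (not all) C0 control characters

U+007F–U+0084, U+0086–U+009F : this includes DEL (U+007F) and all but one C1 control (U+0085, next line, is excluded)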
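For documents sent as XML 1.0, a stricter option is to check every code point against the Char production of the XML 1.0 specification (#x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF) instead of relying on a control-character class. The sketch below assumes that target; the class name Xml10Chars and its methods are hypothetical names chosen for illustration.

```java
// Hypothetical helper: validates and sanitizes text against the XML 1.0 Char production.
final class Xml10Chars {

    private Xml10Chars() {
    }

    // True if the code point may appear literally in an XML 1.0 document.
    static boolean isValid(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Replaces every code point that XML 1.0 does not allow with the given character.
    static String replaceInvalid(String input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // Iterate by code point so surrogate pairs are handled as one unit;
            // an unpaired surrogate comes back as itself and is rejected.
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (isValid(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(replacement);
            }
        }
        return out.toString();
    }
}
```

Replacing with a space rather than deleting keeps words from running together, which can matter for a field that Solr tokenizes. This check accepts U+007F through U+009F, since XML 1.0 permits them even though their use is discouraged.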

Resources
http://msdn.microsoft.com/en-us/library/ms256076.aspx
http://stackoverflow.com/questions/14028716/how-to-remove-control-characters-from-java-string
http://unicode-table.com/en/
