Solr: Escaping Special Characters When Importing Data


We are importing XML (or CSV) data via a curl GET request. To make this work, we need to escape two kinds of special characters: XML special characters and URL special characters.

First, we escape the XML special characters & < > " ' as &amp; &lt; &gt; &quot; &apos;. In code, we can use org.apache.commons.lang.StringEscapeUtils.escapeXml(String).
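To make the XML step concrete without pulling in commons-lang, here is a minimal JDK-only sketch. The class XmlEscapeSketch and the method escapeXmlBasic are hypothetical names used only for illustration. The method handles just the five characters listed above and is not a replacement for StringEscapeUtils.escapeXml.

```java
public class XmlEscapeSketch {

    // Hypothetical stand-in for StringEscapeUtils.escapeXml:
    // replaces only the five XML special characters with their entities.
    public static String escapeXmlBasic(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: xml special: &amp; &lt; &gt; &quot; &apos;
        System.out.println(escapeXmlBasic("xml special: & < > \" '"));
    }
}
```

Because the loop visits each input character once, an ampersand in the input is escaped exactly once, and the ampersands in the generated entities are never escaped again.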

Then we use java.net.URLEncoder.encode(String, String) to escape URL special characters, especially $ & + , / : ; = ? @.
URLEncoder.encode also converts a line break (\r\n) to %0D%0A.
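The URL step on its own can be tried with the small sketch below. UrlEncodeSketch and encodeUtf8 are hypothetical names. The wrapper only turns the checked exception into an unchecked one, since UTF-8 is always available on the JVM.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlEncodeSketch {

    // Hypothetical convenience wrapper around URLEncoder.encode(String, String).
    public static String encodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints: url+special%3A+%24+%26+%2B+%2C+%2F+%3A+%3B+%3D+%3F+%40
        System.out.println(encodeUtf8("url special: $ & + , / : ; = ? @"));
        // prints: line1%0D%0Aline2
        System.out.println(encodeUtf8("line1\r\nline2"));
    }
}
```

Note that URLEncoder produces form encoding, so a space becomes + rather than %20.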

For example, suppose the field content is the following two lines of data:
xml special: & < > " '
url special: $ & + , / : ; = ? @

The curl GET request to import the data would look like this:
http://localhost:8080/solr/update?stream.body=<add><doc><field name="id">id1</field><field name="content">xml+special%3A+%26amp%3B+%26lt%3B+%26gt%3B+%26quot%3B+%26apos%3B%0D%0Aurl+special%3A+%24+%26amp%3B+%2B+%2C+%2F+%3A+%3B+%3D+%3F+%40</field></doc></add>&commit=true
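Code to convert the XML field data:
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import org.apache.commons.lang.StringEscapeUtils;

// Escape XML special characters first, then URL-encode the escaped text.
private String escapeXmlEncodeUrl(String str)
  throws UnsupportedEncodingException {
 return URLEncoder.encode(StringEscapeUtils.escapeXml(str), "UTF-8");
}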
From org.apache.solr.client.solrj.util.ClientUtils.escapeQueryChars, we can see that the following special characters must be escaped (prefixed with \) in a query string: \, +, -, !, (, ), :, ^, [, ], ", {, }, ~, *, ?, |, &, ;, /, and whitespace.
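The query-string escaping can be sketched in plain Java as below. QueryEscapeSketch, SPECIAL and escapeQueryCharsSketch are hypothetical names. The character set in SPECIAL is copied by hand (including the double quote) and may differ between Solr versions, so in real code it is safer to call ClientUtils.escapeQueryChars itself.

```java
public class QueryEscapeSketch {

    // Assumed set of query special characters; whitespace is checked separately.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    // Hypothetical re-implementation: prefixes each special character
    // and each whitespace character with a backslash.
    public static String escapeQueryCharsSketch(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: title\:\(a\+b\)
        System.out.println(escapeQueryCharsSketch("title:(a+b)"));
    }
}
```

This escaping applies to the query text itself. If the query is then sent in a URL, it still needs to be URL-encoded afterwards.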
Resources
Online XML Escape
Online URL Encoder/Decoder
RFC 1738: Uniform Resource Locators (URL) specification
http://www.xmlnews.org/docs/xml-basics.html
