Using the Eclipse Display View While Debugging to Fix a Real Problem


Today a colleague on another team told me that importing a CSV file into a Solr server failed with the exception shown below.

We knew the cause was an invalid (badly formatted) character somewhere in that line, but the line was too long, and the error log alone didn't make it easy to tell which characters caused the problem.

SEVERE: org.apache.solr.common.SolrException: CSVLoader: input=file:/sample.txt,can't read line: 12450
        values={NO LINES AVAILABLE}
        at org.apache.solr.handler.loader.CSVLoaderBase.input_err(CSVLoaderBase.java:320)
        at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:359)
Caused by: java.io.IOException: (line 12450) invalid char between encapsulated token end delimiter.
        at org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(CSVParser.java:481)
        at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:356)
So I enabled remote debugging, added a breakpoint at org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(CSVParser.java:481), and re-posted the CSV file with only the header and that line.
private Token encapsulatedTokenLexer(Token tkn, int c) throws IOException {
    // save current line
    int startLineNumber = getLineNumber();
    for (;;) {
      c = in.read();
      if (c == '\\' && strategy.getUnicodeEscapeInterpretation()
          && in.lookAhead() == 'u') {
        tkn.content.append((char) unicodeEscapeLexer(c));
      } else if (c == strategy.getEscape()) {
        tkn.content.append((char)readEscape(c));
      } else if (c == strategy.getEncapsulator()) {
  ...
      } else if (isEndOfFile(c)) {
        // error condition (end of file before end of token)
        throw new IOException( // add a breakpoint here.
                "(startline " + startLineNumber + ")"
                        + "eof reached before encapsulated token finished"
        );
      } 
   ...
    }
  }
At the breakpoint I can see that the value of the character c is 'a', but that is not enough.
I could step through the method until it throws the exception, but the line has so many characters that there's no telling how many steps that would take.

Fortunately, while paused at a breakpoint, we can execute arbitrary code in Eclipse's Display view.

So we enter the following in the Display view:
return "" + String.valueOf((char)c) +  
    String.valueOf((char)in.read()) +  String.valueOf((char)in.read());
Then we select the code and click "Display Result of Evaluating Selected Text". The output is:
(java.lang.String) an,
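
One caveat: every in.read() evaluated in the Display view really consumes a character from the reader, so the parser won't see those characters if we resume afterwards. Below is a minimal sketch of a peek that doesn't consume anything, assuming the reader supports mark/reset the way java.io.BufferedReader does. The PeekAhead class and the peek method are hypothetical names for illustration, not part of Solr.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class PeekAhead {
    /** Returns up to n upcoming characters and then rewinds the reader. */
    public static String peek(BufferedReader in, int n) throws IOException {
        in.mark(n);                      // remember the current position
        char[] buf = new char[n];
        int read = in.read(buf, 0, n);   // may be fewer than n near end of input
        in.reset();                      // rewind so later reads see the same input
        return read <= 0 ? "" : new String(buf, 0, read);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input; the real reader would be the one the parser holds.
        BufferedReader in = new BufferedReader(new StringReader("an,xxxx\"@"));
        System.out.println(peek(in, 3));       // the next three characters
        System.out.println((char) in.read());  // still the first character
    }
}
```

In the Display view the same idea would be the three statements mark, read, reset pasted one after another, again assuming the reader at hand supports them.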

Now search for "an," in the CSV file, and there it is:
|  | "an,xxxx"@

Now the reason is obvious: the value is part of the from field, and it contains double quotes that are not escaped.
According to the CSV standard, if double quotes are used to enclose fields, then a double quote appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"
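
The rule is easy to apply in code. Here is a minimal sketch of quoting one field: double every embedded double quote, then wrap the whole value in double quotes. CsvQuote and quote are hypothetical names, not taken from Solr or from the code that produced this file.

```java
public class CsvQuote {
    /** Wraps a field in double quotes and doubles any double quote inside it. */
    public static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // Rebuilds the example line from the raw values aaa, b"bb, ccc.
        String line = String.join(",", quote("aaa"), quote("b\"bb"), quote("ccc"));
        System.out.println(line);
    }
}
```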

For more info, please read Import CSV that Contains Double-Quotes into Solr

The fix is simple: change the data to |  | ""an,xxxx""@ and the import works.
-- Next we need to fix the code that generates the CSV, but that's off topic.
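
To catch lines like this before posting a file to Solr, a rough pre-check can help. The sketch below is a simplified scan, not Solr's parser: it assumes double quote as the encapsulator, takes the delimiter as a parameter, and returns the index of the first character that follows a closing quote without being a delimiter. QuoteChecker and firstInvalidIndex are hypothetical names.

```java
public class QuoteChecker {
    /**
     * Returns the index of the first character that follows a closing quote
     * but is neither a space nor the delimiter, or -1 if none is found.
     */
    public static int firstInvalidIndex(String line, char delimiter) {
        boolean inQuotes = false;
        boolean fieldStart = true;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (inQuotes) {
                if (ch != '"') continue;
                if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    i++;                       // doubled quote: an escaped quote
                    continue;
                }
                inQuotes = false;              // closing quote of the field
                int j = i + 1;
                while (j < line.length() && line.charAt(j) == ' ') j++;
                if (j < line.length() && line.charAt(j) != delimiter) return j;
                i = j - 1;                     // continue at the delimiter
            } else if (ch == delimiter) {
                fieldStart = true;
            } else if (ch == '"' && fieldStart) {
                inQuotes = true;               // opening quote of a field
                fieldStart = false;
            } else if (ch != ' ') {
                fieldStart = false;            // unquoted field content
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("\"aaa\",\"b\"bb\",\"ccc\"", ','));
        System.out.println(firstInvalidIndex("\"aaa\",\"b\"\"bb\",\"ccc\"", ','));
    }
}
```

The first call reports the position right after the unescaped quote, and the second line passes because the quote is doubled.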

The window in Eclipse looks like below:

