Practical Guide to Using JMX Platform MBeans

JMX allows us to monitor local or remote Java applications. We can use it to inspect memory and thread usage, generate heap dumps, and more.

Local Monitoring
We can call one of the ManagementFactory.getXxxMXBean() methods to get a platform MXBean for the Java virtual machine we are running in.
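For example, a minimal local-monitoring sketch that reads heap usage and the live thread count of the current JVM (class and method names here are just illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class LocalJmxDemo {
  // current heap usage in bytes, via the platform MemoryMXBean
  public static long usedHeapBytes() {
    MemoryMXBean memoryMXBean = ManagementFactory.getMemoryMXBean();
    return memoryMXBean.getHeapMemoryUsage().getUsed();
  }

  // number of live threads, via the platform ThreadMXBean
  public static int liveThreadCount() {
    ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
    return threadMXBean.getThreadCount();
  }

  public static void main(String[] args) {
    System.out.println("Used heap: " + usedHeapBytes() + " bytes");
    System.out.println("Live threads: " + liveThreadCount());
  }
}
```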

Remote Monitoring: Using an MXBean Proxy
First, the remote application must enable monitoring and management from remote systems. Password authentication and SSL are enabled by default; we can set system properties to disable them. We can add the following VM arguments when starting the remote application:
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=9999 

Then we can create an MXBean proxy to access the remote MBeanServer.
JMXServiceURL jmxUrl = new JMXServiceURL(
    "service:jmx:rmi:///jndi/rmi://remoteServer:9999/jmxrmi");
JMXConnector jmxConn = JMXConnectorFactory.connect(jmxUrl);
MBeanServerConnection mbsc = jmxConn.getMBeanServerConnection();
MemoryMXBean mbean = ManagementFactory.newPlatformMXBeanProxy(mbsc,
    ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
Using RuntimeMXBean
RuntimeMXBean can be used to get the VM name, version, and vendor name, the application class path, input arguments, system properties, and the up time and start time of the application.

One practical example is to get the process id of the application.

public static long getPID() {
    RuntimeMXBean runtimeMXBean = ManagementFactory.getRuntimeMXBean();
    // on HotSpot the runtime name is typically of the form pid@hostname
    String processName = runtimeMXBean.getName();
    return Long.parseLong(processName.split("@")[0]);
  }
Using ThreadMXBean
ThreadMXBean can be used to get the application's thread information, such as the current live thread count, the peak thread count, the total number of threads started, all live thread ids, the ThreadInfo of a given thread, and the CPU and user time of a thread.

One practical example is to use ThreadMXBean to check whether a thread from a third-party library is running, so that we can act accordingly.
public static boolean isThreadRunning(String threadName) {
    ThreadMXBean threadMX = ManagementFactory.getThreadMXBean();
    long[] tids = threadMX.getAllThreadIds();
    ThreadInfo[] tinfos = threadMX.getThreadInfo(tids, Integer.MAX_VALUE);
    for (ThreadInfo ti : tinfos) {
      // getThreadInfo returns null entries for threads that have died
      if (ti != null && ti.getThreadName().startsWith(threadName)) {
        return true;
      }
    }
    return false;
  }
Another example is to check whether there are deadlocked threads in the system.
public static void detectDeadlock() {
    ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
    long[] threadIds = threadBean.findDeadlockedThreads();
    int deadlockedThreads = threadIds != null ? threadIds.length : 0;
    System.out.println("Number of deadlocked threads: " + deadlockedThreads);

    if (threadIds != null) {
      ThreadInfo[] infos = threadBean.getThreadInfo(threadIds);
      for (ThreadInfo info : infos) {
        System.out.println("deadlocked thread: " + info);
      }
    }
  }
Be sure to use the findDeadlockedThreads method on JDK 6 and newer, not findMonitorDeadlockedThreads: the former also detects deadlocks involving java.util.concurrent ownable synchronizers.

Solr 3.x threaddump.jsp uses ThreadMXBean to dump all threads.
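A full dump of that kind can be sketched with ThreadMXBean.dumpAllThreads, a minimal version of what threaddump.jsp does (the class name here is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpDemo {
  // dump names, states and (truncated) stack traces of all live threads
  public static String dumpAllThreads() {
    ThreadMXBean threadMX = ManagementFactory.getThreadMXBean();
    StringBuilder sb = new StringBuilder();
    // dumpAllThreads(lockedMonitors, lockedSynchronizers) exists since JDK 6
    for (ThreadInfo info : threadMX.dumpAllThreads(false, false)) {
      sb.append(info.toString());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(dumpAllThreads());
  }
}
```

Note that ThreadInfo.toString() truncates long stack traces; for a complete trace, iterate over getStackTrace() yourself.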
Using HotSpotDiagnosticMXBean
If we are using the Sun/Oracle JDK, we can use com.sun.management.HotSpotDiagnosticMXBean to, for example, generate a heap dump. This is useful when analyzing a memory leak: we can generate a heap dump every X hours automatically, then compare the memory consumption over time.
public void generateHeapDump(String fileName, boolean live)
      throws IOException {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
        server, "com.sun.management:type=HotSpotDiagnostic",
        HotSpotDiagnosticMXBean.class);
    bean.dumpHeap(fileName, live);
  }
Using MemoryMXBean and MemoryPoolMXBean to Monitor Memory Usage
memoryMxBean.getHeapMemoryUsage().getUsed()      <=> runtime.totalMemory() - runtime.freeMemory()
memoryMxBean.getHeapMemoryUsage().getCommitted() <=> runtime.totalMemory()
memoryMxBean.getHeapMemoryUsage().getMax()       <=> runtime.maxMemory()
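A quick sanity check of these equivalences (the two sides will not be exactly equal, since allocations can happen between the calls; the class name is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapUsageDemo {
  public static MemoryUsage heapUsage() {
    return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
  }

  public static void main(String[] args) {
    MemoryUsage heap = heapUsage();
    Runtime runtime = Runtime.getRuntime();
    // each MemoryMXBean value next to its Runtime counterpart
    System.out.println("used:      " + heap.getUsed()
        + " ~ " + (runtime.totalMemory() - runtime.freeMemory()));
    System.out.println("committed: " + heap.getCommitted()
        + " ~ " + runtime.totalMemory());
    System.out.println("max:       " + heap.getMax()
        + " ~ " + runtime.maxMemory());
  }
}
```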
JDK7: Use com.sun.management.OperatingSystemMXBean to Monitor System and Process CPU Load
Java 7 provides new methods that finally allow Java developers to monitor the overall system CPU load and the CPU load of the current process.
// note: this is the com.sun.management variant of OperatingSystemMXBean,
// not the java.lang.management one, which lacks these methods
com.sun.management.OperatingSystemMXBean osBean = ManagementFactory
    .getPlatformMXBean(com.sun.management.OperatingSystemMXBean.class);
osBean.getProcessCpuLoad();
osBean.getSystemCpuLoad();
Other MXBeans
OperatingSystemMXBean to get operating system info, such as the name, architecture, number of processors, and system load average.
ClassLoadingMXBean to check loaded classes. 
PlatformLoggingMXBean.
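For instance, a quick sketch reading a couple of these beans (the class name is illustrative):

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class OtherMXBeansDemo {
  // number of classes currently loaded in the JVM
  public static int loadedClassCount() {
    ClassLoadingMXBean classBean = ManagementFactory.getClassLoadingMXBean();
    return classBean.getLoadedClassCount();
  }

  // short OS summary: name, architecture, processor count
  public static String osSummary() {
    OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();
    return osBean.getName() + " " + osBean.getArch()
        + ", processors: " + osBean.getAvailableProcessors();
  }

  public static void main(String[] args) {
    System.out.println(osSummary());
    System.out.println("Loaded classes: " + loadedClassCount());
  }
}
```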

Tomcat's Diagnostics.java is a good guide to using platform MBeans.
The jmxstat project on GitHub provides a command-line tool that connects to a remote JMX server and executes operations at a regular interval.
Resources
Monitoring and Management Using JMX Technology
Tomcat Diagnostics.java
http://svn.apache.org/repos/asf/tomcat/trunk/java/org/apache/tomcat/util/Diagnostics.java
Deadlock Detection in Java
Remote Monitoring Heap Memory Usage Using JMX RMI

Simple Cache Implementation in Java

LRU Cache: Using LinkedHashMap
Java's LinkedHashMap and LinkedHashSet maintain insertion or access order. By default it is insertion order; we can change this using the constructor LinkedHashMap(int initialCapacity, float loadFactor, boolean accessOrder).
Setting accessOrder to true makes LinkedHashMap order its entries by access: the most recently accessed entry moves to the end.
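A small illustration of the difference (the class name is illustrative): with accessOrder=true, a get() reorders the keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AccessOrderDemo {
  public static String keysAfterAccess() {
    // accessOrder=true: iteration order is least-recently-accessed first
    Map<String, Integer> map = new LinkedHashMap<String, Integer>(16, 0.75f, true);
    map.put("a", 1);
    map.put("b", 2);
    map.put("c", 3);
    map.get("a"); // touching "a" moves it to the end
    return map.keySet().toString();
  }

  public static void main(String[] args) {
    System.out.println(keysAfterAccess()); // prints [b, c, a]
  }
}
```

With the default insertion order, the same sequence of calls would leave the keys as [a, b, c].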

LinkedHashMap maintains an extra doubly linked list to keep the entries in order: Entry<K,V> header. The header is a sentinel element; its after pointer points to the eldest entry.

Overriding LinkedHashMap.removeEldestEntry(Map.Entry<K,V> eldest) allows us to specify when the eldest entry should be removed.
class LRUCacheMap<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;
  private int capacity;
  public LRUCacheMap(int capacity) {
    // accessOrder=true so the eldest entry is the least recently used one
    super(16, 0.75f, true);
    this.capacity = capacity;
  }
  @Override
  protected boolean removeEldestEntry(java.util.Map.Entry<K, V> eldest) {
    return size() > capacity;
  }
}

LRU Cache: Using LinkedHashSet
1. Override the add method of LinkedHashSet
When implementing a hash-based LRU cache, we can extend LinkedHashSet and override its add method: when the current size is equal to or larger than MAX_SIZE, use the iterator to remove the first (eldest) element, then add the new item.
class LRUCacheSet<E> extends LinkedHashSet<E> {
  private static final long serialVersionUID = 1L;
  private int capacity;
  public LRUCacheSet(int capacity) {
    this.capacity = capacity;
  }
  @Override
  public boolean add(E e) {
    if (size() >= capacity) {
      // here, we can do anything.
      // 1. LRU cache, delete the eldest one(the first one) then add the new
      // item.
      Iterator<E> it = this.iterator();
      it.next();
      it.remove();

      // 2. We can do nothing, just return false: this will discard the new
      // item.
      // return false;
    }
    return super.add(e);
  }
}
2. Using LinkedHashMap to implement LRUCacheSet via Collections.newSetFromMap
HashSet actually uses a HashMap as its backing store. Collections.newSetFromMap (introduced in JDK 6) allows us to construct a set backed by a specified map. The resulting set displays the same ordering, concurrency, and performance characteristics as the backing map.

So we can use the previous LRUCacheMap as the backing map.
lruCache = Collections.newSetFromMap(new LRUCacheMap<Integer, Boolean>(
    MAX_SIZE));
for (int i = 0; i < 10; i++) {
  lruCache.add(i);
}
Assert.assertArrayEquals(new Integer[] { 5, 6, 7, 8, 9 },
      lruCache.toArray());

More about Collections.newSetFromMap
There are several Map implementations that have no corresponding Set implementation, such as ConcurrentHashMap, WeakHashMap, and IdentityHashMap.
Now we can easily create set instances that behave like ConcurrentHashMap, ConcurrentSkipListMap, or WeakHashMap:
Set<Object> concurrenthashset = Collections.newSetFromMap(new ConcurrentHashMap<Object, Boolean>());
Set<Object> weakHashSet = Collections.newSetFromMap(new WeakHashMap<Object, Boolean>());

Using Guava CacheBuilder
Google Guava is a very useful library, and most likely it is already included in our project. Guava provides a CacheBuilder that allows us to build a custom cache with different combinations of features.
LoadingCache<Key, Graph> graphs = CacheBuilder.newBuilder()
    .maximumSize(1000)
    .expireAfterWrite(10, TimeUnit.MINUTES)
    .removalListener(MY_LISTENER)
    .build(new CacheLoader<Key, Graph>() {
      public Graph load(Key key) throws AnyException {
        return createExpensiveGraph(key);
      }
    });
Resources
LinkedHashMap's hidden (?) features
Handy But Hidden: Collections.newSetFromMap()
Guava CacheBuilder

Using HTML Parser Jsoup and Regular Expressions to Get Text between Two Tags

The Task
In this article, we are going to use jsoup to parse HTML pages to get all TOC (table of contents) anchor links, and use regular expressions to get the text content of each anchor link.

The Solution
Jsoup is a Java HTML parser; its jQuery-like selector syntax makes it very easy to extract content from an HTML page.

Normally a site follows some convention about where it puts the TOC anchor links: from this we can compose a CSS selector to select all the anchor links. We will take the Java_Development_Kit Wikipedia page as an example.

Use Jsoup to Get All Anchor Links
To experiment with CSS selectors, we can open the Chrome Developer Tools and, in the Console tab, use document.querySelectorAll("CSS_SELECTOR_HERE") to test them.

Our final css selector would be:
div[id=toc]>ul>li a[href^='#']:not([href='#'])
In the id=toc div, take its direct child ul element, then its direct child li elements, and find all links whose href attribute starts with # (meaning it points to an anchor) but is not just '#'.

The Code
One caveat: Jsoup's selector parser doesn't accept the ' or " quotes around attribute values, so the selector above would match nothing.
The final css selector for Jsoup is: div[id=toc] ul>li a[href^=#]:not([href=#])
Document doc = Jsoup.connect(url).get();
Element rootElement = doc.select(PATTERN_BODY_ROOT).first();
Set<String> anchors = new LinkedHashSet<String>();
Elements elements = rootElement.select(TOC_ANCHOR);
if (!elements.isEmpty()) {
  for (Element element : elements) {
    String href = element.attr("href");
    anchors.add(href.substring(1));
  }
}
Using Regular Expression and Jsoup to Get Text of each Anchor
First, a definition of the content of an anchor in our case: it is all the content between the current anchor and the next anchor.

The regular expression to get all HTML content between the anchor JDK_contents and the anchor Ambiguity_between_a_JDK_and_an_SDK looks like this:
<span[^>]*\s*(?:"|')?JDK_contents(?:'|")?[^>]*>([^<]*)</span>(.*)(<span[^>]*\s*(?:"|')?Ambiguity_between_a_JDK_and_an_SDK(?:'|")?[^>]*>[^<]*</span>.*)
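To see the shape of this pattern on a tiny made-up snippet (pure java.util.regex, no Jsoup; the anchor ids A and B and the class name are placeholders, not the article's real anchors):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorRegexDemo {
  // same shape as the pattern above, with anchor ids A and B for brevity:
  // group 1 = anchor text, group 2 = content between anchors, group 3 = rest
  static final Pattern BETWEEN = Pattern.compile(
      "<span[^>]*\\s*(?:\"|')?A(?:'|\")?[^>]*>([^<]*)</span>"
          + "(.*)"
          + "(<span[^>]*\\s*(?:\"|')?B(?:'|\")?[^>]*>[^<]*</span>.*)",
      Pattern.DOTALL);

  public static String contentBetween(String html) {
    Matcher m = BETWEEN.matcher(html);
    return m.find() ? m.group(2) : null;
  }

  public static void main(String[] args) {
    String html = "<span id=\"A\">Alpha</span><p>one</p>"
        + "<span id=\"B\">Beta</span><p>rest</p>";
    System.out.println(contentBetween(html)); // prints <p>one</p>
  }
}
```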

In another post, we will introduce how to use the tool RegexBuddy to compose and test this regular expression and to improve its performance.

After getting the HTML content, we call Jsoup.parse(html).text() to get the combined text.

The Code
public String getContentBetweenAnchor(StringBuilder remaining,
    String anchor1, String anchor2, String anchorElement,
    String anchorAttribute) throws IOException {
  StringBuilder sb = new StringBuilder();
  // the first group is the anchor text
  sb.append(matchAnchorRegexStr(anchor1, anchorElement, true))
      // the second group is the text between these 2 anchors
      .append("(.*)")
      // the third group is the remaining text
      .append("(").append(matchAnchorRegexStr(anchor2, anchorElement, false))
      .append(".*)");

  System.out.println(sb);
  Matcher matcher = Pattern.compile(sb.toString(),
      Pattern.DOTALL | Pattern.MULTILINE).matcher(remaining);
  String matchedText = "";
  if (matcher.find()) {
    String anchorText = Jsoup.parse(matcher.group(1)).text();
    matchedText = anchorText + " " + Jsoup.parse(matcher.group(2)).text();
    String newRemaining = matcher.group(3);
    remaining.setLength(0);
    remaining.append(newRemaining);
  }
  return matchedText;
}

The Complete Code
package org.codeexample.lifelongprogrammer.anchorlinks;

import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

import com.google.common.base.Stopwatch;

public class JsoupExample {
  private static final String TOC_ANCHOR = "div[id=toc] ul>li a[href^=#]:not([href=#])";
  private static final String PLAIN_ANCHOR_A_TAG = "a[href^=#]:not([href=#])";

  private static final int MAX_ANCHOR_LINKS = 5;
  // only <div id="bodyContent"> section
  private static final String PATTERN_BODY_ROOT = "div[id=bodyContent]";

  public Map<String, String> parseHTML(String url) throws IOException {
    Map<String, String> anchorContents = new LinkedHashMap<String, String>();

    Document doc = Jsoup.connect(url).get();
    Element rootElement = doc.select(PATTERN_BODY_ROOT).first();
    if (rootElement == null)
      return anchorContents;
    Set<String> anchors = getAnchors(rootElement);
    if (anchors.isEmpty())
      return anchorContents;
    StringBuilder remaining = new StringBuilder(rootElement.toString());

    Iterator<String> it = anchors.iterator();
    String current = it.next();
    while (it.hasNext() && remaining.length() > 0) {
      String next = it.next();
      anchorContents.put(current,
          getContentBetweenAnchor(remaining, current, next, "span", "id"));
      current = next;
    }
    // last one
    String lastTxt = Jsoup.parse(remaining.toString()).text();
    if (StringUtils.isNotBlank(lastTxt)) {
      anchorContents.put(current, lastTxt);
    }
    return anchorContents;
  }

  public Set<String> getAnchors(Element rootElement) {
    Set<String> anchors = new LinkedHashSet<String>() {
      private static final long serialVersionUID = 1L;

      @Override
      public boolean add(String e) {
        if (size() >= MAX_ANCHOR_LINKS)
          return false;
        return super.add(e);
      }
    };
    getAnchorsImpl(rootElement, TOC_ANCHOR, anchors);
    if (anchors.isEmpty()) {
      // no TOC anchor found, fall back to plain anchor links
      getAnchorsImpl(rootElement, PLAIN_ANCHOR_A_TAG, anchors);
    }
    return anchors;
  }

  public void getAnchorsImpl(Element rootElement, String anchorPattern,
      Set<String> anchors) {
    Elements elements = rootElement.select(anchorPattern);
    if (!elements.isEmpty()) {
      for (Element element : elements) {
        String href = element.attr("href");
        anchors.add(href.substring(1));
      }
    }
  }

  public String getContentBetweenAnchor(StringBuilder remaining,
      String anchor1, String anchor2, String anchorElement,
      String anchorAttribute) throws IOException {
    StringBuilder sb = new StringBuilder();
    // the first group is the anchor text
    sb.append(matchAnchorRegexStr(anchor1, anchorElement, true))
        // the second group is the text between these 2 anchors
        .append("(.*)")
        // the third group is the remaining text
        .append("(").append(matchAnchorRegexStr(anchor2, anchorElement, false))
        .append(".*)");

    System.out.println(sb);
    Matcher matcher = Pattern.compile(sb.toString(),
        Pattern.DOTALL | Pattern.MULTILINE).matcher(remaining);
    String matchedText = "";
    if (matcher.find()) {
      String anchorText = Jsoup.parse(matcher.group(1)).text();
      matchedText = anchorText + " " + Jsoup.parse(matcher.group(2)).text();
      String newRemaining = matcher.group(3);
      remaining.setLength(0);
      remaining.append(newRemaining);
    }
    return matchedText;
  }

  public String matchAnchorRegexStr(String anchor1, String anchorElement,
      boolean captureAnchorText) {
    StringBuilder sb = new StringBuilder().append("<").append(anchorElement)
        .append("[^>]*").append("\\s*").append("(?:\"|')?").append(anchor1)
        .append("(?:'|\")?[^>]*>");
    if (captureAnchorText) {
      sb.append("([^<]*)");
    } else {
      sb.append("[^<]*");
    }
    return sb.append("</").append(anchorElement).append(">").toString();
  }

  @Test
  public void testWiki() throws IOException {
    Stopwatch stopwatch = Stopwatch.createStarted();
    String url = "http://en.wikipedia.org/wiki/Java_Development_Kit";
    Map<String, String> anchorContents = parseHTML(url);
    System.out.println(anchorContents);
    System.out.println("Took " + stopwatch.elapsed(TimeUnit.MILLISECONDS));
    stopwatch.stop();
  }  
}

Resources
Comparison of HTML parsers
jsoup
CSS Selector Reference

Using Object Pool Design Pattern to Reduce Garbage Collection in Solr

Object Pool is a commonly used design pattern in Android. The most famous example is the android.os.Message class.

The same pattern can be used in Solr. When Solr runs on a machine with limited resources, or on a client machine, it is important to reduce resource usage as much as possible.

In a Solr application, we usually have to create thousands or millions of SolrQueryRequest, SolrQueryResponse, and UpdateCommand objects.

As described in Solr: Export Large(Millions) Data to a CSV File, we may create a lot of SolrQueryRequest objects in a short time. Also, as described in Solr RefCounted: Don't forget to close SolrQueryRequest or decref solrCore.getSearcher, it's important to close each SolrQueryRequest, otherwise it leaks resources (SolrIndexSearcher).

So this is a good place to use the object pool pattern to reuse SolrQueryRequest instances and to encapsulate their creation and close operations.
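The pattern itself is small. Here is a generic, self-contained sketch of the same free-list idea used below (SimplePool and Factory are hypothetical names, not Solr classes):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Generic object pool: obtain() reuses a free instance when one exists,
// recycle() returns an instance to the pool for later reuse.
public class SimplePool<T> {
  public interface Factory<T> {
    T create();
  }

  private final Deque<T> free = new ArrayDeque<T>();
  private final Factory<T> factory;

  public SimplePool(Factory<T> factory) {
    this.factory = factory;
  }

  public synchronized T obtain() {
    T t = free.pollFirst();
    return t != null ? t : factory.create();
  }

  public synchronized void recycle(T t) {
    free.addFirst(t);
  }
}
```

A production pool would usually also cap the pool size and reset object state on recycle, which the Solr-specific class below handles via close().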

PooledLocalSolrQueryRequest
How to Use PooledLocalSolrQueryRequest
1. Call one of the obtain methods in PooledLocalSolrQueryRequest (obtain(solrCore), obtain(solrCore, solrParams), obtain(solrCore, solrParams, streams)) to get a SolrQueryRequest object: it reuses an existing free instance if one exists, or creates a new one otherwise.

2. Call pooledRequest.recycle() when the pooledRequest instance is no longer needed.

3. Call PooledLocalSolrQueryRequest.closeAll() to clean all cached free PooledLocalSolrQueryRequest instances.
The Code
package org.codeexample.lifelongprogrammer.solr.util;

public class PooledLocalSolrQueryRequest extends LocalSolrQueryRequest {
  protected static final Logger logger = LoggerFactory
      .getLogger(PooledLocalSolrQueryRequest.class);
  
  /**
   * Private constructor, only allow new objects from obtain()
   */
  private PooledLocalSolrQueryRequest(SolrCore core,
      ModifiableSolrParams modifiableSolrParams) {
    super(core, modifiableSolrParams);
  }
  
  // Reference to next object in the pool
  private PooledLocalSolrQueryRequest next;
  
  private static final Object sPoolSync = new Object();
  private static PooledLocalSolrQueryRequest firstInstance;
  
  // private static int sPoolSize = 0;
  // private static final int MAX_POOL_SIZE = 50;
  
  /**
   * Return a pooled LocalSolrQueryRequest; the caller has to set SolrParams,
   * and ContentStream if needed.
   */
  public static PooledLocalSolrQueryRequest obtain(SolrCore core) {
    synchronized (sPoolSync) {
      if (firstInstance != null) {
        PooledLocalSolrQueryRequest m = firstInstance;
        firstInstance = m.next;
        m.next = null;
        // sPoolSize--;
        return m;
      }
    }
    return new PooledLocalSolrQueryRequest(core, new ModifiableSolrParams());
  }
  
  public static PooledLocalSolrQueryRequest obtain(SolrCore core,
      ModifiableSolrParams newParams) {
    PooledLocalSolrQueryRequest request = obtain(core);
    request.setParams(newParams);
    return request;
  }
  
  public static PooledLocalSolrQueryRequest obtain(SolrCore core,
      ModifiableSolrParams newParams, Iterable<ContentStream> newStream) {
    PooledLocalSolrQueryRequest request = obtain(core);
    request.setParams(newParams);
    request.setContentStreams(newStream);
    return request;
  }
  
  /**
   * Recycle this object. You must release all references to this instance after
   * calling this method.
   */
  public void recycle() {
    
    this.close();
    synchronized (sPoolSync) {
      // if (sPoolSize < MAX_POOL_SIZE) {
      next = firstInstance;
      firstInstance = this;
      // sPoolSize++;
      // }
    }
  }
  
  /**
   * Close all pooled SolrQueryRequest instances.
   */
  public static void closeAll() {
    synchronized (sPoolSync) {
      while (firstInstance != null) {
        // unlink before closing so we don't lose the rest of the list
        PooledLocalSolrQueryRequest current = firstInstance;
        firstInstance = current.next;
        current.next = null;
        current.close();
      }
    }
  }
}
Resources
Recycling objects in Android with an Object Pool to avoid garbage collection.
Object Pool
android.os.Message class
Solr: Export Large(Millions) Data to a CSV File
Solr RefCounted: Don't forget to close SolrQueryRequest or decref solrCore.getSearcher

Solr: Export Large(Millions) Data to a CSV File

The Problem
The task is to dump all data with an access time more than 5 years old to a CSV file, from several Solr servers deployed on virtual machines with limited resources: only 4 GB of memory.
There are 2.8 million items that match the query.

First Approach that doesn't work
The first approach is to get 1000 rows at a time, repeating until all data is fetched: the query looks like start=X&rows=1000. This seems easy and should just work.
But when I ran it, all Solr servers froze, and the job still hadn't finished after 3 hours.

This is caused by a long-standing problem in Solr: deep pagination.
Simply put, when we try to get rows 1,000,000 to 1,001,000, Solr has to load 1,001,000 sorted documents from the index, then return the last 1000.
The problem gets even worse in SolrCloud or distributed search mode, as every shard has to sort 1,001,000 documents and send all of them to one destination Solr server, which then iterates over all of that data to pick the final 1000 rows.

Luckily, this problem is fixed by SOLR-5463 in the coming Solr 4.7.
Please read more detail at Solr Deep Pagination Problem Fixed in Solr-5463.

Final Solution
I checked out the latest branch_4x from Solr SVN, then ran the following commands to build a new solr.war:
ant get-maven-poms
cd maven-build
mvn -DskipTests install
Then I defined a custom Solr request handler, CSVExportToFileHandler. It supports all Solr query params, such as q, fq, rows, and shards; the one limitation is that start must be 0. It also supports all parameters of CSVResponseWriter, such as csv.header, csv.separator, and csv.null.
Main ideas
1. Use the cursor feature introduced in SOLR-5463.
Add cursorMark=* to the first query, parse nextCursorMark from the response, and use it as the cursorMark value in the subsequent request.

2. Execute the query and the disk write in parallel.
To finish the dump faster, I execute the query and the disk-write operation in parallel. The disk write runs in a separate thread created with Executors.newSingleThreadExecutor().
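The idea can be sketched with a plain single-thread executor; here a StringWriter stands in for the CSV file, and the hypothetical "page-N" chunks stand in for each query's CSV response (all names illustrative):

```java
import java.io.StringWriter;
import java.io.Writer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelWriteDemo {
  public static String run() throws Exception {
    final Writer writer = new StringWriter();
    // a single-thread executor serializes the writes, so chunks stay in order
    ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
    for (int page = 0; page < 3; page++) {
      final String chunk = "page-" + page + "\n"; // one query's CSV chunk
      ioExecutor.submit(new Runnable() {
        public void run() {
          try {
            writer.write(chunk); // the "main" thread is free to run the next query
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        }
      });
    }
    // wait for all pending writes before reading the result
    ioExecutor.shutdown();
    ioExecutor.awaitTermination(60, TimeUnit.SECONDS);
    return writer.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.print(run());
  }
}
```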

The file will be dumped into a data folder on the Solr server.

The query looks like below:
http://solr1:8080/solr/exportcsvfile?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc &shards=solr1:8080/solr,solr2:8080/solr,solr3:8080/solr&overwrite=true&parallel=true&fileName=over5yearsold.csv
The response of dump operation looks like below:
<response>
  <long name="readRows">2821608</long>
  <bool name="success">true</bool>
  <str name="msg">File saved to: C:\lifelongprogrammer\solr\data-folder\over5yearsold.csv</str>
  <long name="timeTaken">2092575</long>
</response>
The configuration in SolrConfig.xml:
<requestHandler name="/exportcsvfile" class="org.codeexample.lifelongprogrammer.solr.handler.CSVExportToFileHandler">
  <lst name="defaults">
    <str name="dataFolder">data-folder</str>
    <bool name="parallel">false</bool>
    <double name="freeMemoryRatioLimit">0.2</double>
  </lst>
</requestHandler>
The Code
Please read Using Object Pool Design Pattern to Reduce Garbage Collection in Solr for the source code of PooledLocalSolrQueryRequest.
package org.codeexample.lifelongprogrammer.solr.handler;
public class CSVExportToFileHandler extends RequestHandlerBase {
  protected static final Logger logger = LoggerFactory
      .getLogger(CSVExportToFileHandler.class);
  private static final Charset UTF8 = Charset.forName("UTF-8");
  
  private static final int ROWS_MAX_ONE_TIME = 1000;
  
  private static final String PARAM_DATA_FOLDER = "dataFolder",
      PARAM_FILE_NAME = "fileName", PARAM_OVERWRITE = "overwrite",
      PARAM_IO_PARALLEL = "parallel",
      PARAM_FREE_MEMORY_RATIO_LIMIT = "freeMemoryRatioLimit";
  private String dataFolder;
  private boolean defaultIOParallel;
  private Double freeMemoryRatioLimit = null;
  
  @SuppressWarnings("rawtypes")
  @Override
  public void init(NamedList args) {
    super.init(args);
    
    if (defaults != null) {
      dataFolder = defaults.get(PARAM_DATA_FOLDER);
      if (StringUtils.isBlank(dataFolder)) {
        throw new IllegalArgumentException("No dataFolder is set!");
      }
      String str = defaults.get(PARAM_IO_PARALLEL);
      if (StringUtils.isNotBlank(str)) {
        defaultIOParallel = Boolean.parseBoolean(str);
      }
      str = defaults.get(PARAM_FREE_MEMORY_RATIO_LIMIT);
      if (StringUtils.isNotBlank(str)) {
        freeMemoryRatioLimit = Double.parseDouble(str);
      }
    }
  }
  
  @Override
  public void handleRequestBody(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp) throws Exception {
    Stopwatch stopwatch = new Stopwatch().start();
    
    SolrParams params = oldReq.getParams();
    boolean success = true;
    File destFile = getDestFile(oldReq, params);
    Writer writer = null;
    
    try {
      FileOutputStream fos = new FileOutputStream(destFile);
      writer = new OutputStreamWriter(fos, UTF8);
      Integer rows = params.getInt(CommonParams.ROWS);
      if (rows != null && rows < ROWS_MAX_ONE_TIME) {
        ModifiableSolrParams newParams = new ModifiableSolrParams(
            oldReq.getParams());
        newParams.set(CommonParams.WT, "csv");
        exeAndwriteRspToFie(
            PooledLocalSolrQueryRequest.obtain(oldReq.getCore(), newParams),
            writer, false, null);
      } else {
        long readRows = exeAndwriteRspToFileMultileSteps(oldReq, rows, writer);
        oldRsp.add("readRows", readRows);
      }
      oldRsp.add("success", success);
      oldRsp.add("msg", "File saved to: " + destFile);
    } catch (Exception e) {
      success = false;
      oldRsp.add("success", success);
      oldRsp.setException(e);
      logger.error("Exception happened when handles " + params, e);
    } finally {
      PooledLocalSolrQueryRequest.closeAll();
      long timeTaken = stopwatch.elapsed(TimeUnit.MILLISECONDS);
      logger.info("Import done, success: " + success + ", timeTaken: "
          + timeTaken + ", file saved to: " + destFile);
      oldRsp.add("timeTaken", timeTaken);
      IOUtils.closeQuietly(writer);
    }
  }
  
  @SuppressWarnings("unchecked")
  public long exeAndwriteRspToFileMultileSteps(SolrQueryRequest oldReq,
      Integer rows, Writer writer) throws FileNotFoundException, IOException,
      InterruptedException {
    SolrParams oldParams = oldReq.getParams();
    int start = 0;
    if (oldParams.getInt(CommonParams.START) != null) {
      start = oldParams.getInt(CommonParams.START);
    }
    if (start != 0) {
      throw new IllegalArgumentException(
          "Start must be 0, as Cursor functionality requires start=0");
    }
    // if rows is set, not null, track read rows
    int maxRows = -1;
    boolean trackReadRows = false;
    if (oldParams.getInt(CommonParams.ROWS) != null) {
      maxRows = oldParams.getInt(CommonParams.ROWS);
      trackReadRows = true;
    }
    long readRows = 0;
    boolean ioParallel = oldParams
        .getBool(PARAM_IO_PARALLEL, defaultIOParallel);
    ExecutorService ioExecutor = null;
    if (ioParallel) {
      ioExecutor = Executors.newSingleThreadExecutor();
    }
    // only check queryParallel when ioParallel is true
    Long numFound = null;
    boolean firstOp = true;
    int logTimes = 0;
    String cursorMark = "*";
    try {
      while (true) {
        ModifiableSolrParams newParams = newParams(oldParams, maxRows,
            trackReadRows, readRows, firstOp, cursorMark);
        firstOp = false;
        
        SolrQueryResponse newRsp = exeAndwriteRspToFie(
            PooledLocalSolrQueryRequest.obtain(oldReq.getCore(), newParams),
            writer, ioParallel, ioExecutor);
        
        NamedList<Object> valuesNL = newRsp.getValues();
        String nextCursorMark = (String) valuesNL.get("nextCursorMark");
        logger.info("New nextCursorMark: " + nextCursorMark);
        
        if (StringUtils.equals(nextCursorMark, cursorMark)) {
          // if same, means there is no data to read.
          break;
        }
        cursorMark = nextCursorMark;
        
        Object rspObj = (Object) valuesNL.get("response");
        if (rspObj == null) {
          throw new RuntimeException("response is null, " + valuesNL);
        }
        if (rspObj instanceof ResultContext) {
          ResultContext rc = (ResultContext) rspObj;
          if (numFound == null) {
            numFound = (long) rc.docs.matches();
            logger.info("numFound: " + numFound);
            if (maxRows == -1) {
              maxRows = numFound.intValue();
            }
          }
          readRows += rc.docs.size();
        } else if (rspObj instanceof SolrDocumentList) {
          SolrDocumentList docList = (SolrDocumentList) rspObj;
          if (numFound == null) {
            numFound = docList.getNumFound();
            logger.info("numFound: " + numFound);
            if (maxRows == -1) {
              maxRows = numFound.intValue();
            }
          }
          readRows += docList.size();
        } else {
          throw new RuntimeException("Unkown response type: "
              + rspObj.getClass());
        }
        
        if (readRows == maxRows) {
          break;
        } else if (readRows > maxRows) {
          throw new RuntimeException("Should not happen, want to get "
              + maxRows + ", but we have read " + readRows);
        }
        // log progress roughly 10 times during the export
        if (readRows > maxRows / 10 * logTimes) {
          logTimes++;
          logger.info("Handled " + readRows + " of " + maxRows + " rows");
        }
      }
      return readRows;
    } finally {
      if (ioParallel) {
        // wait for the IO thread to complete
        ioExecutor.shutdown();
        ioExecutor.awaitTermination(Integer.MAX_VALUE, TimeUnit.SECONDS);
      }
    }
  }
  
  public ModifiableSolrParams newParams(SolrParams oldParams, int maxRows,
      boolean trackReadRows, long readRows, boolean firstOp, String cursorMark) {
    ModifiableSolrParams newParams = new ModifiableSolrParams(oldParams);
    newParams.set(CommonParams.START, 0);
    if (!trackReadRows) {
      newParams.set(CommonParams.ROWS, ROWS_MAX_ONE_TIME);
    } else {
      if (maxRows == -1) {
        // for the first time
        newParams.set(CommonParams.ROWS, ROWS_MAX_ONE_TIME);
      } else {
        if (maxRows - readRows > ROWS_MAX_ONE_TIME) {
          newParams.set(CommonParams.ROWS, ROWS_MAX_ONE_TIME);
        } else {
          newParams.set(CommonParams.ROWS, "" + (maxRows - readRows));
        }
      }
    }
    newParams.set(CommonParams.WT, "csv");
    newParams.set("cursorMark", cursorMark);
    
    if (!firstOp) {
      // don't output the csv header again
      newParams.set("csv.header", false);
    }
    return newParams;
  }
  
  /**
   * IO operation is expensive, so we want to do the query and the IO write in
   * parallel.
   * 
   * @param parallel
   *          whether to execute query and IO in parallel; if true, it may take
   *          more memory.
   */
  public SolrQueryResponse exeAndwriteRspToFie(
      final PooledLocalSolrQueryRequest req, final Writer writer,
      boolean parallel, ExecutorService executor) throws FileNotFoundException,
      IOException {
    final SolrCore core = req.getCore();
    SolrRequestHandler handler = core.getRequestHandler("/select");
    final SolrQueryResponse newRsp = new SolrQueryResponse();
    handler.handleRequest(req, newRsp);
    Exception ex = newRsp.getException();
    if (ex != null) {
      throw new RuntimeException("Exception happened when handling: " + req, ex);
    }
    // Also consider memory usage
    if (parallel) {
      if (freeMemoryRatioLimit != null) {
        double curFreeMemoryRatio = getFreeMemoryRatio();
        parallel = curFreeMemoryRatio > freeMemoryRatioLimit;
        if (logger.isDebugEnabled() && !parallel) {
          logger.debug("curFreeMemoryRatio: " + curFreeMemoryRatio
              + " is less than " + freeMemoryRatioLimit);
        }
      }
    }
    if (parallel) {
      executor.submit(new Runnable() {
        @Override
        public void run() {
          try {
            writeRspToFile(req, writer, core, newRsp);
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }
      });
    } else {
      writeRspToFile(req, writer, core, newRsp);
    }
    return newRsp;
  }
  
  public static double getFreeMemoryRatio() {
    Runtime runtime = Runtime.getRuntime();
    long max = runtime.maxMemory();
    long current = runtime.totalMemory();
    long free = runtime.freeMemory();
    long available = max - current + free;
    return (double) (available) / (double) max;
  }
  
  public void writeRspToFile(PooledLocalSolrQueryRequest req, Writer writer,
      SolrCore core, SolrQueryResponse newRsp) throws IOException {
    QueryResponseWriter responseWriter = core.getQueryResponseWriter("csv");
    responseWriter.write(writer, req, newRsp);
    req.recycle();
  }
  
  public File getDestFile(SolrQueryRequest req, SolrParams params)
      throws SolrServerException {
    String fileName = params.get(PARAM_FILE_NAME);
    if (StringUtils.isBlank(fileName)) {
      throw new IllegalArgumentException("No fileName is set!");
    }
    
    if (!new File(dataFolder).isAbsolute()) {
      dataFolder = SolrResourceLoader.normalizeDir(req.getCore()
          .getCoreDescriptor().getCoreContainer().getSolrHome()
          + dataFolder);
    }
    if (!new File(dataFolder).exists()) {
      boolean created = new File(dataFolder).mkdir();
      if (!created) throw new SolrServerException("Unable to create folder: "
          + dataFolder);
    }
    
    File destFile = new File(dataFolder, fileName);
    if (destFile.exists()) {
      boolean overwrite = params.getBool(PARAM_OVERWRITE, false);
      if (overwrite) {
        boolean deleted = destFile.delete();
      if (!deleted) throw new RuntimeException("Failed to delete old file: "
          + destFile);
      } else {
        throw new IllegalArgumentException("File: " + destFile
            + " already exists.");
      }
    }
    return destFile;
  } 
}

Solr Deep Pagination Problem Fixed in Solr-5463

The Problem
We need to iterate over and dump millions of documents into a file. The data is spread across multiple Solr servers on multiple virtual machines, each with just 4GB of memory.
When I tried to run the dump task, these VMs completely froze for hours, with memory usage above 98%.

The problem is caused by a long-standing issue in Solr:
When we try to get documents 1,000,000 to 1,001,000, Solr has to load 1,001,000 sorted documents from the index, then return only the last 1,000.

In the case of SolrCloud, the problem gets even worse: every shard has to sort 1,001,000 documents and send all of them to one destination Solr server, which then iterates over all of them to pick the final 1,000.

In older releases, developers found some workarounds, such as the one described at Solr Deep Pagination.

Solution: SOLR-5463 (LUCENE-3514)
The upcoming Solr 4.7 solves this problem via SOLR-5463.

The basic idea is that:
Get the first 1000 rows:
http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc &shards=solr1:8080/solr/cvcorefla4,solr2:8080/solr/cvcorefla4,solr3:8080/solr/cvcorefla4&overwrite=true&sort=accesstime desc,id asc&rows=1000&start=0&cursorMark=*


Parse the response to get the value of nextCursorMark:
<response>
<str name="nextCursorMark">AoJ42tmu/Z4CKTQxMDMyMzEwMw==</str>
</response>
Then to get the next 1000 rows: [1001-2000]
http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc &shards=solr1:8080/solr/cvcorefla4,solr2:8080/solr/cvcorefla4,solr3:8080/solr/cvcorefla4&overwrite=true&sort=accesstime desc,id asc&rows=1000&start=0&cursorMark=AoJ42tmu%2FZ4CKTQxMDMyMzEwMw%3D%3D

Repeat until the nextCursorMark value stops changing, or you have collected as many docs as you need.

Basic Usage from Solr-5463
start must be "0" in all requests when using cursorMark
sort can be anything, but must include the uniqueKey field (as a tie breaker)
"N" can be any number of rows you want per page
"*" denotes you want a cursor starting at the beginning mark
Replace the "*" value in your initial request with the nextCursorMark value from the response in each subsequent request
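Putting these rules together, the fetch loop is: query with cursorMark, read nextCursorMark from the response, and repeat until the mark stops changing. Below is a minimal runnable sketch of just that loop logic; fetchPage is a hypothetical stand-in for the real Solr request, simulated here with a fixed list of cursor marks:

```java
import java.util.Arrays;
import java.util.List;

public class CursorLoopDemo {
    // Simulated responses: each entry is the nextCursorMark one page would return.
    // A real fetchPage would issue the query with start=0&cursorMark=<mark>
    // and consume that page's documents.
    static final List<String> MARKS = Arrays.asList("AoJ1", "AoJ2", "AoJ2");

    static String fetchPage(String cursorMark, int page) {
        return MARKS.get(page); // nextCursorMark parsed from the response
    }

    // Fetch pages until nextCursorMark stops changing.
    static int fetchAll() {
        String cursorMark = "*"; // "*" = start at the beginning mark
        int pages = 0;
        while (true) {
            String next = fetchPage(cursorMark, pages++);
            if (next.equals(cursorMark)) break; // cursor unchanged: no more results
            cursorMark = next; // use nextCursorMark in the next request
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println("pages fetched: " + fetchAll()); // prints "pages fetched: 3"
    }
}
```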

Resources
Solr-5463: Provide cursor/token based "searchAfter" support that works with arbitrary sorting (ie: "deep paging")
LUCENE-3514: deep paging with Sort
Coming Soon to Solr: Efficient Cursor Based Iteration of Large Result Sets
Deep Paging Problem

Viewing Android Source Code from Android Studio

The Problem
When learning Android development, we may want to check Android source code once in a while.

When we click on a method of Android class, we may not see its source code like below: 
android.os.Message.recycle
public void recycle() { /* compiled code */ }
The Solution
Install Android SDK Source Code in SDK Manager
Go to Tools -> Android -> SDK Manager.

In the Android SDK Manager panel, go to the Android SDK Version: 4.4.2(API 19) in this case, select "Source for Android SDK", this will download Android source code to android-studio\sdk\sources\android-19.

We can also select "Samples for SDK" and "Documentation for Android SDK". Then click "Install packages...", accept the license in the next dialog, and click Install.

Wait until the install operation completes.

Attach Android SDK Source Code to Sourcepath
Go to File -> Other Settings -> Default Project Structure...

Select SDKs; in the middle panel, select the Android version (API 19 in this case); in the right panel, click the "Sourcepath" tab, click the + button, then browse to your SDK path: C:\Program Files\Android\android-studio\sdk\sources\android-19, then click OK.
Verify
Now use Ctrl+N to open the class android.os.Message, press Ctrl+F12, type recycle, and press Enter; this navigates to the recycle method.

Now you can see the source code and Javadoc of the Android classes.
/**
* Return a Message instance to the global pool.  You MUST NOT touch
* the Message after calling this function -- it has effectively been
* freed.
*/
public void recycle() {
    clearForRecycle();
    synchronized (sPoolSync) {
        if (sPoolSize < MAX_POOL_SIZE) {
            next = sPool;
            sPool = this;
            sPoolSize++;
        }
    }
}

Windows Batch Tricks: Use Delayed Expansion

One tricky thing in Windows batch is how it substitutes variable values: variable expansion occurs when the command is read. This is especially confusing in if, for and multi-statement blocks.

If you run the following script, the output would be: VAR is still: first here: first
set VAR=first
if "%VAR%" == "first" (
set VAR=second
echo VAR is still: first here: %VAR%
if "%VAR%" == "second" @echo You will never see this
)
The batch processor treats the whole if block as one command; it expands the variables once and only once, before it executes the if block, so what actually runs is:
if "first" == "first" (
set VAR=second
echo VAR is still: first here: first
if "first" == "second" @echo You will never see this
)
EnableDelayedExpansion
The rescue is delayed expansion, which causes variables to be expanded at execution time rather than at parse time. It is enabled by setlocal EnableDelayedExpansion; then use !variable_name! to tell the batch processor to expand the variable's value at runtime.
Practical Example: Get jvm.dll path
When using Apache Procrun to wrap a Java application as a Windows service, as described in the article Windows BAT: Using Apache Procrun to Install Java Application As Windows Service, we try to get the jvm.dll path from the environment variables CV_JAVA_HOME and JAVA_HOME, and set its value to PR_JVM.
The final version looks like this:

@echo off 
setlocal enabledelayedexpansion
set PR_JVM=auto
IF "%PR_JVM%" == "auto" (
  set NEW_JAVA_HOME=%CV_JAVA_HOME%
  if "!NEW_JAVA_HOME!" == "" (
    set NEW_JAVA_HOME=%JAVA_HOME%
  )
  echo Using NEW_JAVA_HOME: !NEW_JAVA_HOME!
  IF not "!NEW_JAVA_HOME!" == "" (
    echo !NEW_JAVA_HOME! not empty, try to use it.
    for /F "delims=" %%i in ('dir "!NEW_JAVA_HOME!" /B /S /a-d ^| findstr jvm.dll') do (
      echo found "%%i"
      set "NEW_PR_JVM=%%i"
    )
    IF exist "!NEW_PR_JVM!" (
      SET "PR_JVM=!NEW_PR_JVM!"
      echo NEW_PR_JVM: !NEW_PR_JVM!
    )
  )
)
echo final PR_JVM: %PR_JVM%
"%PRUNSRV%" "//US//%SERVICE_NAME%" --DisplayName "%PR_DESPLAYNAME%" --Description "%PR_DESCRIPTION%" --StdOutput auto --StdError auto ^
--Classpath="%MYCLASSPATH%" --Jvm="%PR_JVM%" --JvmOptions="%PR_JAVA_OPTIONS%" --StartPath "%APP_HOME%" --Startup=auto ^
--StartMode=jvm --StartClass=StartClass --StartParams="%START_PARAMS%" ^
--StopMode=jvm  --StopClass=StopClass  --StopParams="%STOP_PARAMS%"

ENDLOCAL

The output looks like:
Using NEW_JAVA_HOME: C:\Program Files\Java\jdk1.6.0_38
C:\Program Files\Java\jdk1.6.0_38 not empty, try to use it.
found "C:\Program Files\Java\jdk1.6.0_38\jre\bin\server\jvm.dll"
NEW_PR_JVM: C:\Program Files\Java\jdk1.6.0_38\jre\bin\server\jvm.dll
final PR_JVM: C:\Program Files\Java\jdk1.6.0_38\jre\bin\server\jvm.dll

The original version is below; it doesn't work because of how the batch processor expands variable values:
@echo off 
setlocal
set PR_JVM=auto
rem set CV_JAVA_HOME=
IF "%PR_JVM%" == "auto" (
  set NEW_JAVA_HOME=%CV_JAVA_HOME%
  if "%CV_JAVA_HOME%" == "" (
    set NEW_JAVA_HOME=%JAVA_HOME%
  )
  echo Using NEW_JAVA_HOME: %NEW_JAVA_HOME%
  IF not "%CV_JAVA_HOME%" == "" (
    echo %CV_JAVA_HOME% not empty, try to use it.
    for /F "delims=" %%i in ('dir "%CV_JAVA_HOME%" /B /S /a-d ^| findstr jvm.dll') do (
      echo found "%%i"
      set "NEW_PR_JVM=%%i"
    )
    IF exist "%NEW_PR_JVM%" (
      SET "PR_JVM=%NEW_PR_JVM%"
      echo NEW_PR_JVM: %NEW_PR_JVM%
    )
  )
)
echo final PR_JVM: %PR_JVM%
ENDLOCAL

The output looks like:
Using NEW_JAVA_HOME:
C:\Program Files\Java\jre7 not empty, try to use it.
found "C:\Program Files\Java\jre7\bin\server\jvm.dll"
final PR_JVM: auto

Resources
EnableDelayedExpansion
Environment variable expansion occurs when the command is read
Batch file :How to set a variable inside a loop for /F
http://stackoverflow.com/questions/691047/batch-file-variables-initialized-in-a-for-loop

Using Apache Commons Lang ExceptionUtils

Apache Commons Lang ExceptionUtils provides utility methods to manipulate and examine Throwables, such as getRootCause, getStackTrace, and getFullStackTrace.

The following introduces some common uses of ExceptionUtils.

Check the Root Cause
Sometimes we need to check the root cause of an exception, or whether its stack trace contains one specific exception.
For example, when Lucene commits, it may fail due to an OutOfMemoryError, but the returned exception may be an IllegalStateException like below. We need to check whether an OutOfMemoryError happened.
auto commit error...:java.lang.IllegalStateException: this writer hit 
an OutOfMemoryError; cannot commit 
        at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2650) 
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2804) 

We can use code like below:
Throwable th = ExceptionUtils.getRootCause(e);
boolean causedByOOM = th instanceof OutOfMemoryError;

Or, if we want to check whether one specific exception appears anywhere in the exception chain, use code like below:
String fullTrace = ExceptionUtils.getFullStackTrace(th);
boolean isOOM = fullTrace.contains("OutOfMemoryError");
Or:
java.io.Writer result = new java.io.StringWriter();
java.io.PrintWriter printWriter = new java.io.PrintWriter(result);
e.printStackTrace(printWriter);
String fullTrace = result.toString();
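For reference, getRootCause essentially walks the getCause() chain to its end. A minimal plain-JDK sketch of the same root-cause check, using a simulated exception chain modeled on the Lucene example above:

```java
public class RootCauseDemo {
    // Walk the cause chain to its end, guarding against self-referential causes
    // (this mirrors what ExceptionUtils.getRootCause does).
    static Throwable getRootCause(Throwable t) {
        Throwable root = t;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        return root;
    }

    public static void main(String[] args) {
        // Simulate Lucene wrapping an OutOfMemoryError in an IllegalStateException
        Throwable oom = new OutOfMemoryError("simulated");
        Throwable wrapped = new IllegalStateException(
            "this writer hit an OutOfMemoryError; cannot commit", oom);
        System.out.println(getRootCause(wrapped) instanceof OutOfMemoryError); // prints true
    }
}
```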
Check Caller or Code Branch
We may want to log some debug message when methodA is called by ClassX.MethodY, either directly or indirectly. We can use code like below:
if (logger.isDebugEnabled()) {
  String fullStack = ExceptionUtils.getFullStackTrace(new Throwable());
  boolean isXY = fullStack.contains("ClassX.MethodY");
  if (isXY) {
    logger.debug("some debug message");
  }
}
Resource
Exception Utils Javadoc

C# Parsing is Regional(Culture) Sensitive

The Problem
Our C# application sends a query to a Solr server and parses the response. It worked fine until recently, when one of our customers hit an exception:
Exception Message [Input string was not in a correct format.]
System.Number.ParseDouble(String value, NumberStyles options, NumberFormatInfo numfmt)
at System.Double.Parse(String s, NumberStyles style)

The Root Cause
Checking the language settings, we found the server machine was using Finnish (fi-FI).
The Solr response always uses the en-US locale format for numbers; it uses the "." period as the decimal separator:
<double name="min">10.01</double>

But fi-FI [Finnish (Finland)] uses the "," comma as the decimal separator.

This causes double.Parse to fail on a string like "10.01" and throw the exception:
Input string was not in a correct format.

The fix is easy: use the en-US culture when calling double.Parse: double.Parse("10.01", new CultureInfo("en-US")); or set en-US as the current culture of the current thread:
Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");

The workaround, until we release the fix, is to change the region and language settings to en-US for the user who starts the application.

Test Code
using System.Globalization;
using System.Threading;
// Add it via "add reference"
using Microsoft.VisualStudio.TestTools.UnitTesting;

static void Main(string[] args)
{
    CultureInfo originalCulture = Thread.CurrentThread.CurrentCulture;
    Console.WriteLine("Original Culture: " + originalCulture);
    CultureInfo enUS = CultureInfo.CreateSpecificCulture("en-US");

    CultureInfo fiFi = CultureInfo.CreateSpecificCulture("fi-FI");
    testParse(enUS, fiFi);
    testToString(enUS, fiFi);
    
    Console.ReadLine();
}
private static void testToString(CultureInfo enUS, CultureInfo fiFi)
{
    Thread.CurrentThread.CurrentCulture = fiFi;
    double d = 0.2;
    // output fi-FI format: 0,2
    String str = d.ToString();
    Console.WriteLine("fi-FI: " + d.ToString());
    Assert.AreEqual("0,2", str);

    str = d.ToString(enUS);
    // output en-US format: 0.2
    Console.WriteLine("fi-FI: " + str);
    Assert.AreEqual("0.2", str);
}
private static void testParse(CultureInfo enUS, CultureInfo fiFi)
{
    Thread.CurrentThread.CurrentCulture = fiFi;
    // print 10,01
    Console.WriteLine(10.01);
    double d1 = double.Parse("10,01");

    // use en-US to parse, this works fine.
    double d2 = double.Parse("10.01", enUS);
    try
    {
        // the following code throws an exception: Input string was not in a correct format.
        double d3 = double.Parse("10.01");
        Assert.Fail("should fail");
    }
    catch (Exception e)
    {
        Console.WriteLine("expected ex: " + e);
    }
    // change culture to en-US
    Thread.CurrentThread.CurrentCulture = enUS;
    double d4 = double.Parse("10.01");
}

Update:
Today we met a similar error, but this time when C# sends a double string to a Java application: the region was de-AT. C# sent "0,2" to a Java servlet, which was unable to parse it.

The solution is to convert to en-US: d.ToString(enUS)
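The same locale sensitivity exists on the Java side: Double.parseDouble is locale-independent and only accepts the "." decimal separator, while NumberFormat with an explicit Locale can read the regional format. A small sketch using the de-AT value "0,2" from the case above:

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class LocaleParseDemo {
    public static void main(String[] args) throws ParseException {
        // "0,2" is valid in the de-AT format: comma is the decimal separator
        NumberFormat deAt = NumberFormat.getInstance(new Locale("de", "AT"));
        System.out.println(deAt.parse("0,2").doubleValue()); // prints 0.2

        // Double.parseDouble only accepts "." as the decimal separator
        System.out.println(Double.parseDouble("0.2")); // prints 0.2
        try {
            Double.parseDouble("0,2"); // throws NumberFormatException
        } catch (NumberFormatException e) {
            System.out.println("expected: " + e.getMessage());
        }
    }
}
```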

Lesson Learned
When troubleshooting this problem in the customer's environment, I noticed that some characters I typed were converted to other characters; at that point I should have realized the customer was using a different language setting.

How to check Windows Region and Language Setting
Checking the Regional and Language Settings

Resources:
Java Locale “Finnish (Finland)” (fi-FI)
Using Language Identifiers (RFC 3066) - Region Code List
