UIMA: Run Custom Regex Dynamically


The Problem
Extend the UIMA Regular Expression Annotator so that users can run custom regexes dynamically.

The Regular Expression Annotator lets us easily define entity names (such as credit card or email) and the regexes that extract those entities.

But we can never define every useful entity up front, so it is good to let customers add their own entity names and regexes and have the UIMA Regular Expression Annotator run them dynamically.

We could create and deploy a new annotator, but we decided to simply extend the UIMA RegExAnnotator.

How it Works
Client Side
We create one type, org.apache.uima.input.dynamicregex, with two features: types and regexes.
In our HTTP interface, the client specifies the entity names and their regexes:
host:port/nlp?text=abcxxdef&customTypes=mytype1,mytype2&customRegexes=abc.*,def.*
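
Regexes often contain characters that have a special meaning in a URL (for example + or &), so a client would normally percent-encode the parameter values before sending them. A minimal sketch, assuming the server decodes standard form encoding; the class and method names (CustomRegexUrl, enc, build) are hypothetical and not part of the original service:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

class CustomRegexUrl {
    // Percent-encode one query value (the Charset overload needs Java 10+).
    static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Build the /nlp request; the comma-joined lists mirror the URL format above.
    static String build(String hostPort, String text,
            List<String> customTypes, List<String> customRegexes) {
        if (customTypes.size() != customRegexes.size()) {
            throw new IllegalArgumentException(
                "customTypes and customRegexes must have the same size");
        }
        return hostPort + "/nlp?text=" + enc(text)
            + "&customTypes=" + enc(String.join(",", customTypes))
            + "&customRegexes=" + enc(String.join(",", customRegexes));
    }
}
```

With this sketch a regex such as \d+ is sent as %5Cd%2B, so the + is not turned into a space on the server side.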

The client then adds feature structures of that type to the CAS, setting org.apache.uima.input.dynamicregex:types=mytype1,mytype2 and org.apache.uima.input.dynamicregex:regexes=abc.*,def.*:
public void addCustomRegex(List<String> customTypes,
    List<String> customRegexes, CAS cas) {
  if (customTypes != null && customRegexes != null) {
    if (customTypes.size() != customRegexes.size()) {
      throw new IllegalArgumentException(
          "Size doesn't match: customTypes size: "
              + customTypes.size() + ", customRegexes size: "
              + customRegexes.size());
    }
    TypeSystem ts = cas.getTypeSystem();
    Feature ft = ts
        .getFeatureByFullName("org.apache.uima.input.dynamicregex:types");
    Type type = ts.getType("org.apache.uima.input.dynamicregex");

    if (type != null) {
      // Only add the feature structure if the remote annotator or PEAR
      // declares the type org.apache.uima.input.dynamicregex;
      // otherwise do nothing.
      FeatureStructure fs = cas.createFS(type);
      fs.setStringValue(ft, joiner.join(customTypes));
      cas.addFsToIndexes(fs);
    }

    ft = ts.getFeatureByFullName("org.apache.uima.input.dynamicregex:regexes");
    type = ts.getType("org.apache.uima.input.dynamicregex");

    if (type != null) {
      // Only add the feature structure if the remote annotator or PEAR
      // declares the type org.apache.uima.input.dynamicregex;
      // otherwise do nothing.
      FeatureStructure fs = cas.createFS(type);
      fs.setStringValue(ft, joiner.join(customRegexes));
      cas.addFsToIndexes(fs);
    }
  }
}
public Result process(String text, String lang, List<String> uimaTypes,
    List<String> customTypes, List<String> customRegexes,
    Long waitMillseconds) throws Exception {
  CAS cas = this.ae.getCAS();
  String casId;
  try {
    cas.setDocumentText(text);
    cas.setDocumentLanguage(lang);
    TypeSystem ts = cas.getTypeSystem();
    Feature ft = ts.getFeatureByFullName(UIMA_ENTITIES_FS);
    Type type = ts.getType(UIMA_ENTITIES);
    if (type != null) {
      // if remote annotator or pear supports type
      // org.apache.uima.entities:entities, add it to indexes,
      // otherwise do nothing.
      FeatureStructure fs = cas.createFS(type);
      fs.setStringValue(ft, joiner.join(uimaTypes));
      cas.addFsToIndexes(fs);
    }
    addCustomRegex(customTypes, customRegexes, cas);
    casId = this.ae.sendCAS(cas);
  } catch (ResourceProcessException e) {
    // http://t17251.apache-uima-general.apachetalk.us/uima-as-client-is-blocking-t17251.html
    cas.release();
    logger.error("Exception thrown when process cas " + cas, e);
    throw e;
  }
  Result rst = this.listener.waitFinished(casId, waitMillseconds);
  return rst;
}
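
One caveat of the comma-joined encoding used by addCustomRegex: a regex that itself contains a comma, such as \d{1,3}, is split into extra pieces on the annotator side and trips the size check. A minimal sketch of one way around this, assuming both sides agree on a delimiter that cannot occur in the regexes; the newline delimiter and the RegexListCodec name are hypothetical choices, not what the original code does:

```java
import java.util.Arrays;
import java.util.List;

class RegexListCodec {
    // Hypothetical delimiter; any character that never occurs in the regexes would do.
    static final String DELIM = "\n";

    // Join the regexes, rejecting any that contain the delimiter.
    static String join(List<String> regexes) {
        for (String regex : regexes) {
            if (regex.contains(DELIM)) {
                throw new IllegalArgumentException(
                    "Regex must not contain the delimiter: " + regex);
            }
        }
        return String.join(DELIM, regexes);
    }

    // Split the joined value back; limit -1 keeps trailing empty entries.
    static List<String> split(String joined) {
        return Arrays.asList(joined.split(DELIM, -1));
    }
}
```

The same delimiter would have to be used when the annotator splits the types and regexes features.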
Define Feature Structures in RegExAnnotator.xml
org.apache.uima.input.dynamicregex is the input parameter: the client sets values for its two features, types and regexes. org.apache.uima.output.dynamicregex is the output type.
<typeDescription>
  <name>org.apache.uima.input.dynamicregex</name>
  <description />
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>types</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>            
    <featureDescription>
      <name>regexes</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>            
  </features>          
</typeDescription>
<!-- output params -->
<typeDescription>
  <name>org.apache.uima.output.dynamicregex</name>
  <description />
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>type</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>

Run Custom Regex and Return Extracted Entities in RegExAnnotator
Next, in the RegExAnnotator.process method, we read the input types and regexes, run the custom regexes, and add the entities found to the CAS indexes.
public void process(CAS cas) throws AnalysisEngineProcessException {
  processCustomRegex(cas);
  //... omitted
}
private void processCustomRegex(CAS cas) {
  TypeSystem ts = cas.getTypeSystem();
  Type dyInputType = ts.getType("org.apache.uima.input.dynamicregex");
  org.apache.uima.cas.Feature dyInputTypesFt = ts
      .getFeatureByFullName("org.apache.uima.input.dynamicregex:types");
  org.apache.uima.cas.Feature dyInputRegexesFt = ts
      .getFeatureByFullName("org.apache.uima.input.dynamicregex:regexes");
  String dyTypes = null, dyRegexes = null;
  FSIterator<?> dyIt = cas.getAnnotationIndex(dyInputType).iterator();

  AnnotationFS dyInputTypesFs = null, dyInputRegexesFs = null;
  while (dyIt.hasNext()) {
    // TODO this is kind of weird
    AnnotationFS afs = (AnnotationFS) dyIt.next();
    if (afs.getStringValue(dyInputTypesFt) != null) {
      dyTypes = afs.getStringValue(dyInputTypesFt);
      dyInputTypesFs = afs;
    }
    if (afs.getStringValue(dyInputRegexesFt) != null) {
      dyRegexes = afs.getStringValue(dyInputRegexesFt);
      dyInputRegexesFs = afs;
    }
  }
  if (dyInputTypesFs != null) {
    cas.removeFsFromIndexes(dyInputTypesFs);
  }
  if (dyInputRegexesFs != null) {
    cas.removeFsFromIndexes(dyInputRegexesFs);
  }
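  // No custom types or regexes were supplied with this CAS: nothing to do.
  if (dyTypes == null || dyRegexes == null) {
    return;
  }
  String[] dyTypesArr = dyTypes.split(","), dyRegexesArr = dyRegexes
      .split(",");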
  if (dyTypesArr.length != dyRegexesArr.length) {
    throw new IllegalArgumentException(
        "Size of custom regex doesn't match. types: "
            + dyTypesArr.length + ",  regexes: "
            + dyRegexesArr.length);
  }
  if (dyTypesArr.length == 0)
    return;
  logger.log(Level.FINE, "User specifies custom regex: type: " + dyTypes
      + ", regexes: " + dyRegexes);
  String docText = cas.getDocumentText();
  Type dyOutputType = ts.getType("org.apache.uima.output.dynamicregex");
  org.apache.uima.cas.Feature dyOutputTypeFt = ts
      .getFeatureByFullName("org.apache.uima.output.dynamicregex:type");
  FSIndexRepository indexRepository = cas.getIndexRepository();
  for (int i = 0; i < dyTypesArr.length; i++) {
    Pattern pattern = Pattern.compile(dyRegexesArr[i]);
    Integer captureGroupPos = getNamedGrpupPosition(pattern, "capture");
    Matcher matcher = pattern.matcher(docText);

    while (matcher.find()) {
      AnnotationFS dyAnnFS;
      // if named group capture exists
      if (captureGroupPos != null) {
        dyAnnFS = cas.createAnnotation(dyOutputType,
            matcher.start(captureGroupPos),
            matcher.end(captureGroupPos));
      } else {
        dyAnnFS = cas.createAnnotation(dyOutputType,
            matcher.start(), matcher.end());
      }
      dyAnnFS.setStringValue(dyOutputTypeFt, dyTypesArr[i]);
      indexRepository.addFS(dyAnnFS);
    }
  }
}
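/**
 * Use reflection to call the private Pattern.namedGroups() method in JDK7.
 * Returns the index of the named group, or null if the pattern has no
 * group with that name.
 */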
@SuppressWarnings("unchecked")
private Integer getNamedGrpupPosition(Pattern pattern, String namedGroup) {
  try {
    Method namedGroupsMethod = Pattern.class.getDeclaredMethod(
        "namedGroups", null);
    namedGroupsMethod.setAccessible(true);

    Map<String, Integer> namedGroups = (Map<String, Integer>) namedGroupsMethod
        .invoke(pattern, null);
    return namedGroups.get(namedGroup);
  } catch (Exception e) {
    throw new RuntimeException(e);
  }
}
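
The matching loop can also be sketched without UIMA and without reflection. The version below is only an illustration, assuming Java 8 or later, where Matcher.start(String) and Matcher.end(String) throw IllegalArgumentException when the pattern has no group with the given name. DynamicRegexSketch, Span, offsets and run are hypothetical names; in the annotator each Span would instead become an org.apache.uima.output.dynamicregex annotation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DynamicRegexSketch {
    // One extracted entity: its custom type name and character offsets.
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + begin + "," + end + ")";
        }
    }

    // Offsets of the named group if it exists and matched, else of the whole match.
    static int[] offsets(Matcher m, String group) {
        try {
            int start = m.start(group);
            if (start >= 0) {
                return new int[] { start, m.end(group) };
            }
        } catch (IllegalArgumentException e) {
            // The pattern has no group with that name; use the whole match.
        }
        return new int[] { m.start(), m.end() };
    }

    // Run each regex over the text and tag every match with its type name.
    static List<Span> run(String text, String[] types, String[] regexes) {
        if (types.length != regexes.length) {
            throw new IllegalArgumentException("types and regexes must have the same length");
        }
        List<Span> spans = new ArrayList<>();
        for (int i = 0; i < types.length; i++) {
            Matcher m = Pattern.compile(regexes[i]).matcher(text);
            while (m.find()) {
                int[] o = offsets(m, "capture");
                spans.add(new Span(types[i], o[0], o[1]));
            }
        }
        return spans;
    }
}
```

For the request shown earlier (text abcxxdef, regexes abc.* and def.*), this yields mytype1[0,8) and mytype2[5,8). A regex such as id=(?<capture>\d+) reports only the digits, not the id= prefix.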
References
UIMA References
Apache UIMA Regular Expression Annotator Documentation
