Natural Language Processing (NLP) is one of the most important frontiers in software. Since the advent of digital computing, the fundamental idea—how to effectively consume and generate human language—has been the subject of continuous effort, and that work continues today, with machine learning at the forefront. This article is a hands-on introduction to Apache OpenNLP, a Java-based machine learning project that provides primitives such as chunking and lemmatization, both essential for building NLP-enabled systems.
What is Apache OpenNLP?
Machine learning natural language processing systems like Apache OpenNLP typically consist of three parts:
- Learning from a corpus, which is a set of text data (plural: corpora)
- A model generated from the corpus
- Using the model to perform tasks on the target text
To simplify things, OpenNLP provides pre-trained models for many common use cases. For simpler scenarios, you can just download an existing model and apply it to the task at hand; for more sophisticated requirements, you may need to train your own.
Detecting Language with OpenNLP
Let’s build a basic application that we can use to see how OpenNLP works. We can scaffold the project with a Maven archetype, as shown in Listing 1.
Listing 1. Creating a New Project
~/apache-maven-3.8.6/bin/mvn archetype:generate -DgroupId=com.infoworld -DartifactId=opennlp -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false
This archetype lays out a new Java project. Next, add the Apache OpenNLP dependency to the pom.xml in the root directory of the project, as shown in Listing 2. (You can use whatever version of the opennlp-tools dependency is most recent.)
Listing 2. OpenNLP Maven Dependency
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>2.0.0</version>
</dependency>
To make it easier to execute the program, also add the following entry to the <build><plugins> section of the pom.xml file:
Listing 3. Main Class Execution Goal in Maven POM
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <version>3.0.0</version>
  <configuration>
    <mainClass>com.infoworld.App</mainClass>
  </configuration>
</plugin>
Now, run the program with the command mvn compile exec:java. (You’ll need Maven and a JDK installed for this command to work.) Running it now will give you the familiar “Hello World!” output.
Downloading and Setting Up the Language Detection Model
Now we’re ready to use OpenNLP to detect the language in our sample program. The first step is to download a language detection model, so grab the latest language detector component from the OpenNLP model download page. At the time of writing, the current version is langdetect-183.bin. To make the model easy to load, navigate to the Maven project and create a new directory at /opennlp/src/main/resources, then copy the langdetect-*.bin file into it. Now, modify the existing /opennlp/src/main/java/com/infoworld/App.java file to match Listing 4.
Listing 4. App.java
package com.infoworld;
import java.util.Arrays;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.langdetect.LanguageDetectorModel;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.Language;
public class App {
  public static void main(String[] args) {
    System.out.println("Hello World!");
    App app = new App();
    try {
      app.nlp();
    } catch (IOException ioe) {
      System.err.println("Problem: " + ioe);
    }
  }
  public void nlp() throws IOException {
    InputStream is = this.getClass().getClassLoader().getResourceAsStream("langdetect-183.bin"); // 1
    LanguageDetectorModel langModel = new LanguageDetectorModel(is); // 2
    String input = "This is a test. This is only a test. Do not pass go. Do not collect $200. When in the course of human history."; // 3
    LanguageDetector langDetect = new LanguageDetectorME(langModel); // 4
    Language langGuess = langDetect.predictLanguage(input); // 5

    System.out.println("Language best guess: " + langGuess.getLang());

    Language[] languages = langDetect.predictLanguages(input);
    System.out.println("Languages: " + Arrays.toString(languages));
  }
}
Now you can run the program with the command mvn compile exec:java. When you do, you should see output similar to what’s shown in Listing 5.
Listing 5. Language Detection Run 1
Language best guess: eng
Languages: [eng (0.09568318011427969), tgl (0.027236092538322446), cym (0.02607472496029117), war (0.023722424236917564)...
In this example, the “ME” in LanguageDetectorME stands for maximum entropy. Maximum entropy is a statistical concept used in natural language processing: among all the probability distributions consistent with the training data, the model chooses the one with the highest entropy, that is, the one that assumes the least beyond what the data supports.
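If you want the numeric confidence behind the best guess, the Language object carries it alongside the language code. Here is a minimal sketch, reusing the langDetect and input variables from Listing 4:

// Reuses langDetect and input from Listing 4
Language best = langDetect.predictLanguage(input);
// Language exposes the detector's confidence score alongside the language code
System.out.printf("%s (confidence %.4f)%n", best.getLang(), best.getConfidence());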
Evaluating Results
After running the program, you will see that the OpenNLP language detector accurately guessed that the language of the text in the sample program is English. We also output some probabilities derived from the language detection algorithm. After English, it guessed that the language might be Tagalog, Welsh, or War-Jaintia. In defense of the detector, the language sample is small. Correctly identifying the language from just a few sentences without additional context is quite impressive. Before we proceed, let’s review Listing 4. The process is very straightforward. Here’s how each commented line works:
1. Open the langdetect-183.bin file as an input stream.
2. Use the input stream to parameterize instantiation of the LanguageDetectorModel.
3. Create a string to use as input.
4. Create a language detector object, using the LanguageDetectorModel from line 2.
5. Run the langDetect.predictLanguage() method on the input from line 3.
Testing Probabilities
If we add more English text to the string and run it again, the probability assigned to eng should increase. Let’s try it by pasting the contents of the Declaration of Independence into a new file in our project directory: /src/main/resources/declaration.txt. We’ll load and process that file as shown in Listing 6, replacing the inline string:
Listing 6. Loading Declaration Text
String input = new String(this.getClass().getClassLoader().getResourceAsStream("declaration.txt").readAllBytes());
If you run it, you will see that English is still the detected language.
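One housekeeping note: for brevity, the snippets in this article never close the model streams they open. In production code, you would typically wrap them in try-with-resources. Here is a minimal sketch of the same model load from Listing 4 with the stream managed properly:

// try-with-resources closes the stream even if model loading throws
try (InputStream is = this.getClass().getClassLoader().getResourceAsStream("langdetect-183.bin")) {
  LanguageDetectorModel langModel = new LanguageDetectorModel(is);
  // ...use the model as in Listing 4
}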
Using OpenNLP to Detect Sentences
You have already seen the language detection model in action. Now, let’s try a model for detecting sentences. To start, return to the OpenNLP model download page and add the latest English sentence-detection component to the project’s /resources directory. Note that knowing the language of the text is a prerequisite for detecting sentences. We’ll follow a pattern similar to the one we used for language detection: load the model file (in my case, opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin) and use it to instantiate a sentence detector, then run the detector on the input file. You can see the new code (and its imports) in Listing 7; the rest of the code remains unchanged.
Listing 7. Detecting Sentences
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
//...
InputStream modelFile = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin");
SentenceModel sentModel = new SentenceModel(modelFile);
SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentModel);
String[] sentences = sentenceDetector.sentDetect(input);
System.out.println("Sentences: " + sentences.length + " first line: " + sentences[2]);
Running this file will produce output as shown in Listing 8.
Listing 8. Output of Sentence Detector
Sentences: 41 first line: In Congress, July 4, 1776
The unanimous Declaration of the thirteen united States of America, When in the Course of human events, ...
Note that the sentence detector found 41 sentences, which sounds about right. Note, too, that this detector model is quite simple: it mostly looks for periods and spaces to find break points, with no grammatical logic. That is why we use index 2 on the sentences array to get the actual preamble: the title text is merged into the first two sentences. (The founding document is notoriously inconsistent with punctuation, and the sentence detector makes no attempt to treat “When in the Course...” as a new sentence.)
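If you need character offsets rather than the sentence strings themselves, SentenceDetectorME also provides sentPosDetect(), which returns Span objects pointing into the original text. A minimal sketch, reusing the sentenceDetector and input variables from Listing 7:

import opennlp.tools.util.Span;
//...
Span[] sentenceSpans = sentenceDetector.sentPosDetect(input);
// Each Span holds the start and end character offsets of one sentence
System.out.println("First sentence: characters " + sentenceSpans[0].getStart() + " to " + sentenceSpans[0].getEnd());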
Using OpenNLP for Tokenization
After breaking the document into sentences, tokenization is the next level of granularity. Tokenization is the process of breaking a document down into words and punctuation. We can use the code shown in Listing 9:
Listing 9. Tokenization
import opennlp.tools.tokenize.SimpleTokenizer;
//...
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = tokenizer.tokenize(input);
System.out.println("tokens: " + tokens.length + " : " + tokens[73] + " " + tokens[74] + " " + tokens[75]);
This will produce output as shown in Listing 10.
Listing 10. Tokenizer Output
tokens: 1704 : human events ,
So the model breaks the document into 1,704 tokens. We can access the token array directly: the words “human” and “events” and the comma that follows them each occupy one element.
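SimpleTokenizer is rule-based, splitting on character classes. OpenNLP also ships a trainable, probabilistic tokenizer, TokenizerME, which works like the other model-backed tools. The sketch below assumes you have downloaded an English token model from the model download page; the file name shown is my guess at the current one, so check the page for the exact name:

import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.tokenize.TokenizerME;
//...
// Model file name is an assumption; substitute the token model you downloaded
InputStream tokenIS = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin");
TokenizerModel tokenModel = new TokenizerModel(tokenIS);
TokenizerME meTokenizer = new TokenizerME(tokenModel);
String[] meTokens = meTokenizer.tokenize(input);
System.out.println("ME tokens: " + meTokens.length);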
Using OpenNLP for Name Finding
Next, we’ll get the English “name finder” model, called en-ner-person.bin. This model is located on the SourceForge model download page. After obtaining the model, place it in the project’s resources directory and use it to find names in the document, as shown in Listing 11.
Listing 11. Using OpenNLP for Name Finding
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.util.Span;
//...
InputStream nameFinderFile = this.getClass().getClassLoader().getResourceAsStream("en-ner-person.bin");
TokenNameFinderModel nameFinderModel = new TokenNameFinderModel(nameFinderFile);
NameFinderME nameFinder = new NameFinderME(nameFinderModel);
Span[] names = nameFinder.find(tokens);
System.out.println("names: " + names.length);
for (Span nameSpan : names) {
  // getStart() is inclusive and getEnd() is exclusive, so these are the first and last tokens of the name
  System.out.println("name: " + nameSpan + " : " + tokens[nameSpan.getStart()] + " " + tokens[nameSpan.getEnd() - 1]);
}
In Listing 11, we load the model and use it to instantiate a NameFinderME object. We then use that object to get an array of names, modeled as Span objects. A span has a start and an end, telling us where the detector thinks the name begins and ends in the token set (the start index is inclusive, the end index exclusive). Note that the name finder expects an already tokenized array of strings.
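If you want the full text of each name rather than its boundary tokens, the Span class includes a convenience method for that. A minimal sketch, reusing the names and tokens arrays from Listing 11:

// Span.spansToStrings maps each span back onto the token array
String[] nameStrings = Span.spansToStrings(names, tokens);
for (String name : nameStrings) {
  System.out.println("name: " + name);
}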
Using OpenNLP for Part-of-Speech Tagging
OpenNLP allows us to tag parts of speech (POS) based on the tokenized strings. Listing 12 is an example of POS tagging.
Listing 12. Part-of-Speech Tagging
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
//...
InputStream posIS = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-pos-1.0-1.9.3.bin");
POSModel posModel = new POSModel(posIS);
POSTaggerME posTagger = new POSTaggerME(posModel);
String[] tags = posTagger.tag(tokens);
System.out.println("tags: " + tags.length);
for (int i = 0; i < 15; i++) {
  System.out.println(tokens[i] + " = " + tags[i]);
}
The process is the same as with the other models: load the model file into its model class, then use it on the token array. The output looks like Listing 13.
Listing 13. Part-of-Speech Output
tags: 1704
Declaration = NOUN
of = ADP
Independence = NOUN
: = PUNCT
A = DET
Transcription = NOUN
Print = VERB
This = DET
Page = NOUN
Note = NOUN
: = PUNCT
The = DET
following = VERB
text = NOUN
is = AUX
Unlike the name finder model, the part-of-speech tagger performs quite well. It correctly identifies several different parts of speech. Examples in Listing 13 include NOUN, ADP (for adposition), and PUNCT (for punctuation).
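Because the tags array lines up element-for-element with the tokens array, the POS output can feed directly into OpenNLP’s lemmatizer, mentioned in the conclusion below. Here is a minimal sketch; it assumes you have downloaded an English lemmatizer model from the model download page (the file name shown is my guess, so check the page for the exact name):

import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.lemmatizer.LemmatizerME;
//...
// Model file name is an assumption; substitute the lemmatizer model you downloaded
InputStream lemmaIS = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-lemmas-1.0-1.9.3.bin");
LemmatizerModel lemmaModel = new LemmatizerModel(lemmaIS);
LemmatizerME lemmatizer = new LemmatizerME(lemmaModel);
// lemmatize() takes the tokens and their POS tags, and returns one lemma per token
String[] lemmas = lemmatizer.lemmatize(tokens, tags);
for (int i = 0; i < 15; i++) {
  System.out.println(tokens[i] + " -> " + lemmas[i]);
}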
Conclusion
In this article, you learned how to add Apache OpenNLP to a Java project and use pre-built models for natural language processing. In some cases, you may need to develop your own models, but pre-existing models often solve the problem. Besides the models demonstrated here, OpenNLP also includes features like document classifiers, lemmatizers (which break words down to their roots), chunkers, and parsers. All of these are fundamental elements of natural language processing systems and are freely available through OpenNLP.