Natural Language Processing (NLP) is one of the most important frontiers in software. Since the advent of digital computing, the fundamental idea—how to effectively consume and generate human language—has been the subject of continuous effort, and that work continues today, with machine learning at the forefront. This article is a hands-on introduction to Apache OpenNLP, a Java-based machine learning project that provides primitives such as chunking and lemmatization, both essential for building NLP-enabled systems.
What is Apache OpenNLP?
Machine learning natural language processing systems like Apache OpenNLP typically consist of three parts:
- Learning from a corpus, which is a set of text data (plural: corpora)
- A model generated from the corpus
- Using the model to perform tasks on the target text
To simplify things, OpenNLP provides pre-trained models for many common use cases. For simpler scenarios, you can just download an existing model and apply it to the task at hand; for more sophisticated requirements, you may need to train your own.
Detecting Language with OpenNLP
Let’s build a basic application that we can use to see how OpenNLP works. We can scaffold the project with a Maven archetype, as shown in Listing 1.
Listing 1. Creating a New Project
~/apache-maven-3.8.6/bin/mvn archetype:generate -DgroupId=com.infoworld -DartifactId=opennlp -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false
This archetype lays out a new Java project. Next, add the Apache OpenNLP dependency to the pom.xml in the root directory of the project, as shown in Listing 2. (You can use whatever version of the opennlp-tools dependency is most recent.)
Listing 2. OpenNLP Maven Dependency
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>2.0.0</version>
</dependency>
To make it easier to execute the program, also add the following entry to the <build><plugins> section of the pom.xml file:
Listing 3. Main Class Execution Goal in Maven POM
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <version>3.0.0</version>
  <configuration>
    <mainClass>com.infoworld.App</mainClass>
  </configuration>
</plugin>
Now, run the program with the command mvn compile exec:java. (You’ll need Maven and a JDK installed for this command to work.) Running it now will give you the familiar “Hello World!” output.
Downloading and Setting Up the Language Detection Model
Now we’re ready to use OpenNLP to detect the language in our sample program. The first step is to download a language detection model, so grab the latest language detector component from the OpenNLP model download page. At the time of writing, the current version is langdetect-183.bin. To make the model easy to load, navigate to the Maven project and create a new directory at /opennlp/src/main/resources, then copy the langdetect-*.bin file into it. Now, modify the existing /opennlp/src/main/java/com/infoworld/App.java file to match Listing 4.
Listing 4. App.java
package com.infoworld;
import java.util.Arrays;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.langdetect.LanguageDetectorModel;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.Language;
public class App {
  public static void main(String[] args) {
    System.out.println("Hello World!");
    App app = new App();
    try {
      app.nlp();
    } catch (IOException ioe) {
      System.err.println("Problem: " + ioe);
    }
  }
  public void nlp() throws IOException {
    InputStream is = this.getClass().getClassLoader().getResourceAsStream("langdetect-183.bin"); // 1
    LanguageDetectorModel langModel = new LanguageDetectorModel(is); // 2
    String input = "This is a test. This is only a test. Do not pass go. Do not collect $200. When in the course of human history."; // 3
    LanguageDetector langDetect = new LanguageDetectorME(langModel); // 4
    Language langGuess = langDetect.predictLanguage(input); // 5

    System.out.println("Language best guess: " + langGuess.getLang());

    Language[] languages = langDetect.predictLanguages(input);
    System.out.println("Languages: " + Arrays.toString(languages));
  }
}
Now you can run the program with the command mvn compile exec:java. When you do, you should see output similar to what’s shown in Listing 5.
Listing 5. Language Detection Run 1
Language best guess: eng
Languages: [eng (0.09568318011427969), tgl (0.027236092538322446), cym (0.02607472496029117), war (0.023722424236917564)...
In this example, the “ME” in LanguageDetectorME stands for maximum entropy. Maximum entropy is a statistical concept used in natural language processing: among all the probability distributions consistent with the training data, the model chooses the one with the highest entropy, that is, the one that assumes the least beyond what the data supports.
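If you want the numeric confidence behind the best guess, the Language object carries it alongside the language code. Here is a minimal sketch, reusing the langDetect and input variables from Listing 4:

// Reuses langDetect and input from Listing 4
Language best = langDetect.predictLanguage(input);
// Language exposes the detector's confidence score alongside the language code
System.out.printf("%s (confidence %.4f)%n", best.getLang(), best.getConfidence());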
Evaluating Results
After running the program, you will see that the OpenNLP language detector accurately guessed that the language of the text in the sample program is English. We also output some probabilities derived from the language detection algorithm. After English, it guessed that the language might be Tagalog, Welsh, or War-Jaintia. In defense of the detector, the language sample is small. Correctly identifying the language from just a few sentences without additional context is quite impressive. Before we proceed, let’s review Listing 4. The process is very straightforward. Here’s how each commented line works:
1. Open the langdetect-183.bin file as an input stream.
2. Use the input stream to parameterize instantiation of the LanguageDetectorModel.
3. Create a string to use as input.
4. Create a language detector object, using the LanguageDetectorModel from line 2.
5. Run the langDetect.predictLanguage() method on the input from line 3.
Testing Probabilities
If we add more English text to the string and run it again, the probability assigned to eng should increase. Let’s try it by pasting the contents of the Declaration of Independence into a new file in our project directory: /src/main/resources/declaration.txt. We’ll load and process that file as shown in Listing 6, replacing the inline string:
Listing 6. Loading Declaration Text
String input = new String(this.getClass().getClassLoader().getResourceAsStream("declaration.txt").readAllBytes());
If you run it, you will see that English is still the detected language.
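One housekeeping note: for brevity, the snippets in this article never close the model streams they open. In production code, you would typically wrap them in try-with-resources. Here is a minimal sketch of the same model load from Listing 4 with the stream managed properly:

// try-with-resources closes the stream even if model loading throws
try (InputStream is = this.getClass().getClassLoader().getResourceAsStream("langdetect-183.bin")) {
  LanguageDetectorModel langModel = new LanguageDetectorModel(is);
  // ...use the model as in Listing 4
}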
Using OpenNLP to Detect Sentences
You have already seen the language detection model in action. Now, let’s try a model for detecting sentences. To start, return to the OpenNLP model download page and add the latest English sentence-detection component to the project’s /resources directory. Note that knowing the language of the text is a prerequisite for detecting sentences. We’ll follow a pattern similar to the one we used for language detection: load the model file (in my case, opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin) and use it to instantiate a sentence detector, then run the detector on the input file. You can see the new code (and its imports) in Listing 7; the rest of the code remains unchanged.
Listing 7. Detecting Sentences
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
//...
InputStream modelFile = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin");
SentenceModel sentModel = new SentenceModel(modelFile);
SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentModel);
String[] sentences = sentenceDetector.sentDetect(input);
System.out.println("Sentences: " + sentences.length + " first line: " + sentences[2]);
Running this file will produce output as shown in Listing 8.
Listing 8. Output of Sentence Detector
Sentences: 41 first line: In Congress, July 4, 1776
The unanimous Declaration of the thirteen united States of America, When in the Course of human events, ...
Note that the sentence detector found 41 sentences, which sounds about right. Note, too, that this detector model is quite simple: it mostly looks for periods and spaces to find break points, with no grammatical logic. That is why we use index 2 on the sentences array to get the actual preamble: the title text is merged into the first two sentences. (The founding document is notoriously inconsistent with punctuation, and the sentence detector makes no attempt to treat “When in the Course...” as a new sentence.)
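If you need character offsets rather than the sentence strings themselves, SentenceDetectorME also provides sentPosDetect(), which returns Span objects pointing into the original text. A minimal sketch, reusing the sentenceDetector and input variables from Listing 7:

import opennlp.tools.util.Span;
//...
Span[] sentenceSpans = sentenceDetector.sentPosDetect(input);
// Each Span holds the start and end character offsets of one sentence
System.out.println("First sentence: characters " + sentenceSpans[0].getStart() + " to " + sentenceSpans[0].getEnd());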
Using OpenNLP for Tokenization
After breaking the document into sentences, tokenization is the next level of granularity. Tokenization is the process of breaking a document down into words and punctuation. We can use the code shown in Listing 9:
Listing 9. Tokenization
import opennlp.tools.tokenize.SimpleTokenizer;
//...
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = tokenizer.tokenize(input);
System.out.println("tokens: " + tokens.length + " : " + tokens[73] + " " + tokens[74] + " " + tokens[75]);
This will produce output as shown in Listing 10.
Listing 10. Tokenizer Output
tokens: 1704 : human events ,
So the model breaks the document into 1,704 tokens. We can access the token array directly: the words “human” and “events” and the comma that follows them each occupy one element.
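SimpleTokenizer is rule-based, splitting on character classes. OpenNLP also ships a trainable, probabilistic tokenizer, TokenizerME, which works like the other model-backed tools. The sketch below assumes you have downloaded an English token model from the model download page; the file name shown is my guess at the current one, so check the page for the exact name:

import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.tokenize.TokenizerME;
//...
// Model file name is an assumption; substitute the token model you downloaded
InputStream tokenIS = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin");
TokenizerModel tokenModel = new TokenizerModel(tokenIS);
TokenizerME meTokenizer = new TokenizerME(tokenModel);
String[] meTokens = meTokenizer.tokenize(input);
System.out.println("ME tokens: " + meTokens.length);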
Using OpenNLP for Name Finding
Next, we’ll get the English “name finder” model, called en-ner-person.bin. This model is located on the SourceForge model download page. After obtaining the model, place it in the project’s resources directory and use it to find names in the document, as shown in Listing 11.
Listing 11. Using OpenNLP for Name Finding
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.util.Span;
//...
InputStream nameFinderFile = this.getClass().getClassLoader().getResourceAsStream("en-ner-person.bin");
TokenNameFinderModel nameFinderModel = new TokenNameFinderModel(nameFinderFile);
NameFinderME nameFinder = new NameFinderME(nameFinderModel);
Span[] names = nameFinder.find(tokens);
System.out.println("names: " + names.length);
for (Span nameSpan : names) {
  // getStart() is inclusive and getEnd() is exclusive, so these are the first and last tokens of the name
  System.out.println("name: " + nameSpan + " : " + tokens[nameSpan.getStart()] + " " + tokens[nameSpan.getEnd() - 1]);
}
In Listing 11, we load the model and use it to instantiate a NameFinderME object. We then use that object to get an array of names, modeled as Span objects. A span has a start and an end, telling us where the detector thinks the name begins and ends in the token set (the start index is inclusive, the end index exclusive). Note that the name finder expects an already tokenized array of strings.
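If you want the full text of each name rather than its boundary tokens, the Span class includes a convenience method for that. A minimal sketch, reusing the names and tokens arrays from Listing 11:

// Span.spansToStrings maps each span back onto the token array
String[] nameStrings = Span.spansToStrings(names, tokens);
for (String name : nameStrings) {
  System.out.println("name: " + name);
}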
Using OpenNLP for Part-of-Speech Tagging
OpenNLP allows us to tag parts of speech (POS) based on the tokenized strings. Listing 12 is an example of POS tagging.
Listing 12. Part-of-Speech Tagging
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
//...
InputStream posIS = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-pos-1.0-1.9.3.bin");
POSModel posModel = new POSModel(posIS);
POSTaggerME posTagger = new POSTaggerME(posModel);
String[] tags = posTagger.tag(tokens);
System.out.println("tags: " + tags.length);
for (int i = 0; i < 15; i++) {
  System.out.println(tokens[i] + " = " + tags[i]);
}
The process is the same as with the other models: load the model file into its model class, then use it on the token array. The output looks like Listing 13.
Listing 13. Part-of-Speech Output
tags: 1704
Declaration = NOUN
of = ADP
Independence = NOUN
: = PUNCT
A = DET
Transcription = NOUN
Print = VERB
This = DET
Page = NOUN
Note = NOUN
: = PUNCT
The = DET
following = VERB
text = NOUN
is = AUX
Unlike the name finder model, the part-of-speech tagger performs quite well. It correctly identifies several different parts of speech. Examples in Listing 13 include NOUN, ADP (for adposition), and PUNCT (for punctuation).
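Because the tags array lines up element-for-element with the tokens array, the POS output can feed directly into OpenNLP’s lemmatizer, mentioned in the conclusion below. Here is a minimal sketch; it assumes you have downloaded an English lemmatizer model from the model download page (the file name shown is my guess, so check the page for the exact name):

import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.lemmatizer.LemmatizerME;
//...
// Model file name is an assumption; substitute the lemmatizer model you downloaded
InputStream lemmaIS = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-lemmas-1.0-1.9.3.bin");
LemmatizerModel lemmaModel = new LemmatizerModel(lemmaIS);
LemmatizerME lemmatizer = new LemmatizerME(lemmaModel);
// lemmatize() takes the tokens and their POS tags, and returns one lemma per token
String[] lemmas = lemmatizer.lemmatize(tokens, tags);
for (int i = 0; i < 15; i++) {
  System.out.println(tokens[i] + " -> " + lemmas[i]);
}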
Conclusion
In this article, you learned how to add Apache OpenNLP to a Java project and use pre-built models for natural language processing. In some cases, you may need to develop your own models, but pre-existing models often solve the problem. Besides the models demonstrated here, OpenNLP also includes features like document classifiers, lemmatizers (which break words down to their roots), chunkers, and parsers. All of these are fundamental elements of natural language processing systems and are freely available through OpenNLP.