I gave a talk on “Text Geolocation and (Text) Dating” last week at Johns Hopkins University’s Center for Language and Speech Processing. It covers some of our recent work at UT Austin on automatically geolocating and time-stamping documents. It also includes some initial thoughts on how this can help us move toward quasi-grounded models of word meaning, a research area that Katrin Erk and I have recently begun to work on. Here’s the abstract:

It used to be that computational linguists had to collaborate with roboticists in order to work on grounding language in the real world. However, since the advent of the internet, and particularly in the last decade, the world has been brought within digital scope. People’s social and business interactions are increasingly mediated through a medium that is dominated by text. They tweet from places, express their opinions openly, give descriptions of photos, and generally reveal a great deal about themselves in doing so, including their location, gender, age, social status, relationships and more. In this talk, I’ll discuss work on geolocation and dating of texts, that is, identifying sets of latitude-longitude pairs and time periods that a document is about or related to. These applications and the models developed for them set the stage for deeper investigations into computational models of word meaning that go beyond standard word vectors and into augmented multi-component representations that include dimensions connected to the real world via geographic and temporal values and beyond.

CLSP records all their seminars, and the video for my talk has been posted.

Jason Baldridge – “Text Geolocation and Dating: Light-Weight Language Grounding” from Chris Callison-Burch on Vimeo.

Unfortunately, the slides are not visible in the talk, so I’ve posted them here: slides for Text Geolocation and Dating.

The code that has been used for most of the research discussed is open source, and available on Github: Textgrounder. You can see related publications on my papers page.

Errata: I believe I say at one point that Alessandro Moschitti had created the TweetToLife application that I used to get temporal distributions on Twitter over the day, but it was actually Marco Baroni I had in mind (joint work with Amaç Herdağdelen). Getting my Italians mixed up! (I had just read a nice paper by Moschitti, so my prior was off…)

Topics: natural language processing, OpenNLP, SBT, Maven, resources, sentence detection, tokenization, part-of-speech tagging

Introduction

Natural language processing involves a wide range of methods and tasks. However, we usually begin with some raw text and first demarcate the sentences and the tokens. We then go on to further levels of processing, such as predicting part-of-speech tags, syntactic chunks, named entities, syntactic structures, and more.

This tutorial has two goals. First, it shows how to use the OpenNLP Tools as an API for doing sentence detection, tokenization, and part-of-speech tagging. Second, it shows how to add new dependencies and resources to a system like Scalabha and then use them to add new functionality. As prerequisites, see previous tutorials on getting used to working with the SBT build system of Scalabha and adding new code to the existing build system. To see the other tutorials in this series, check out the list on the links page of my Applied Text Analysis course. Of particular relevance is the one on SBT, Scalabha, packages, and build systems.

To do this tutorial, you should be working with Scalabha version 0.2.3. By the end, you should have recreated version 0.2.4, allowing you to check your progress if you run into any problems.

Adding OpenNLP Tools as a dependency

To use OpenNLP’s API, we need to have access to its jar (Java ARchive) files such that our code can compile using classes from the API and then later be executed. It is important at this point to distinguish between explicitly putting a jar file in your build system versus making it available as a managed dependency. To see some explicitly added (unmanaged) dependencies in Scalabha, look at $SCALABHA_DIR/lib.

$ ls $SCALABHA_DIR/lib
Jama-1.0.2.jar          pca_transform-0.7.2.jar
crunch-0.2.0.jar        scrunch-0.1.0.jar

These have been added to the Scalabha repository and are available even before you do any compilation. You can even see them listed in the Scalabha repository on Github.

In contrast, there are many managed dependencies. When you first download Scalabha, you won’t see them, but once you compile, you can look in the lib_managed directory and will find that it has been populated:

$ ls $SCALABHA_DIR/lib_managed
bundles jars    poms

You can go looking into the jars sub-directory to see some of the jars that have been brought in.

To see where these came from, look in the file $SCALABHA_DIR/build.sbt, which declares much of the information that the SBT program needs in order to build the Scalabha system. The dependencies are given in the following declaration.

libraryDependencies ++= Seq(
  "org.apache.opennlp" % "opennlp-maxent" % "3.0.2-incubating",
  "org.clapper" %% "argot" % "0.3.8",
  "org.apache.commons" % "commons-lang3" % "3.0.1",
  "commons-logging" % "commons-logging" % "1.1.1",
  "log4j" % "log4j" % "1.2.16",
  "org.scalatest" % "scalatest_2.9.0" % "1.6.1" % "test",
  "junit" % "junit" % "4.10" % "test",
  "com.novocode" % "junit-interface" % "0.6" % "test->default") //switch to ScalaTest at some point...

Notice that the OpenNLP Maxent toolkit is in there (along with others), but not the OpenNLP Tools. The Maxent toolkit is used by the OpenNLP Tools (and is part of the same software group/effort), but it can be used independently of it. For example, it is used for the classification homework for the Applied Text Analysis class I’m teaching this semester, which is in fact why the dependency is already in Scalabha v0.2.3.

So, how does one know to write the following to get the OpenNLP Maxent Toolkit as a dependency?

"org.apache.opennlp" % "opennlp-maxent" % "3.0.2-incubating",

I’m not going to go into lots of detail on this, but basically this is what is known as a Maven dependency. On the OpenNLP home page, there is a page for the OpenNLP Maven dependency. Look on that page for where it defines the OpenNLP Maxent dependency, repeated here.

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-maxent</artifactId>
  <version>3.0.2-incubating</version>
</dependency>

The group ID indicates the organization that is responsible for the artifact (e.g. a given organization can have many different systems that it develops and deploys in this manner). The artifact ID is the name of that particular artifact to distinguish it from others by the same organization, and the version is obviously the particular version number of that artifact. (This makes it possible to use older versions as and when needed.)

The XML above is what one needs if one is using the Maven build system, which many Java projects use. SBT is compatible with such dependencies, but the terser format given above is used instead of XML.
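For reference, here is how the two notations line up. This is just a schematic sketch of the correspondence: the string before the first % is the group ID, the next is the artifact ID, and the last is the version. A double %% (as used for argot above) additionally tells SBT to append the Scala version to the artifact ID when resolving it.

// A Maven <groupId>/<artifactId>/<version> triple is written in SBT as
//   "groupId" % "artifactId" % "version"
libraryDependencies += "org.apache.opennlp" % "opennlp-maxent" % "3.0.2-incubating"

// %% appends the Scala version, so this resolves to an artifact named
// something like "argot_2.9.1" rather than just "argot".
libraryDependencies += "org.clapper" %% "argot" % "0.3.8"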

We now want to add the OpenNLP Tools as a dependency. From the OpenNLP dependencies page we see that it is declared this way in XML.

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.5.2-incubating</version>
</dependency>

That means we just need to add the following line to build.sbt in the libraryDependencies declaration.

"org.apache.opennlp" % "opennlp-tools" % "1.5.2-incubating",

And, we can remove the Maxent declaration because OpenNLP Tools depends on it (though it isn’t necessarily a problem if it stays in Scalabha’s build.sbt). The library dependencies should now look as follows.

libraryDependencies ++= Seq(
  "org.apache.opennlp" % "opennlp-tools" % "1.5.2-incubating",
  "org.clapper" %% "argot" % "0.3.8",
  "org.apache.commons" % "commons-lang3" % "3.0.1",
  "commons-logging" % "commons-logging" % "1.1.1",
  "log4j" % "log4j" % "1.2.16",
  "org.scalatest" % "scalatest_2.9.0" % "1.6.1" % "test",
  "junit" % "junit" % "4.10" % "test",
  "com.novocode" % "junit-interface" % "0.6" % "test->default") //switch to ScalaTest at some point...

The next time you run scalabha build, SBT will read the new dependency declaration and retrieve the dependency. At this point, you might say “What?” How is that sufficient to get the required jars? Here’s how, briefly and at a high level. The OpenNLP artifacts are available on the Maven2 site, and SBT already knows to look there. Put simply, it knows to check this site:

http://repo1.maven.org/maven2

And given that the organization is org.apache.opennlp it knows to then look in this directory:

http://repo1.maven.org/maven2/org/apache/opennlp/

Given that we want the opennlp-tools artifact, it looks here:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/

And finally, given that we want the version 1.5.2-incubating, it looks here:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/

In that directory are all the files that SBT needs to pull down to your local machine, plus information about any dependencies of OpenNLP Tools that it needs to grab. Here is the main jar:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/opennlp-tools-1.5.2-incubating.jar

And here is the POM (“Project Object Model”) for OpenNLP Tools:

http://repo1.maven.org/maven2/org/apache/opennlp/opennlp-tools/1.5.2-incubating/opennlp-tools-1.5.2-incubating.pom

Notice that it includes a reference to OpenNLP Maxent in it, which is why Scalabha’s build.sbt no longer needs to include it explicitly. In fact, it is better to not have it in Scalabha’s build.sbt so that we ensure that the version used by OpenNLP Tools is the one we are using (which matters when we update to, say, a later version of OpenNLP Tools).

In many cases, such artifacts are not hosted at repo1.maven.org. In such cases, you must add a “resolver” that points to another site that contains artifacts. This is done by adding to the resolvers declaration, which is shown here for Scalabha v0.2.3.

resolvers ++= Seq(
  "Cloudera Hadoop Releases" at "https://repository.cloudera.com/content/repositories/releases/",
  "Thrift location" at "http://people.apache.org/~rawson/repo/"
)

So, when dependencies are declared, SBT will also search through those locations, in addition to its defaults, to find them and pull them down to your machine. As it turns out, OpenNLP has a dependency on the Java WordNet Library, which is hosted on a non-standard Maven repository (associated with OpenNLP’s old development site on Sourceforge). You should update the resolvers declaration in build.sbt to the following:

resolvers ++= Seq(
  "Cloudera Hadoop Releases" at "https://repository.cloudera.com/content/repositories/releases/",
  "Thrift location" at "http://people.apache.org/~rawson/repo/",
  "opennlp sourceforge repo" at "http://opennlp.sourceforge.net/maven2"
)

That was a lot of description, but note that it was a simple change to build.sbt and now we can use the OpenNLP Tools API.

Tip: if you already had SBT running (e.g. via scalabha build), then you must use the reload command at the SBT prompt after you change build.sbt in order for SBT to know about the changes.

What do you do if the library you want to use isn’t available as a Maven artifact? In that case, you need to put the jar (or jars) for that library, plus any jars it depends on, into the $SCALABHA_DIR/lib directory. Then SBT will see that they are there and add them to your classpath, enabling you to use them just as if they were a managed dependency. The downside is that you must put it there explicitly, which means a bit more hassle when you want to update to later versions, and a fair amount more hassle if that library has lots of dependencies that you also need to manage.

Obtaining and installing the OpenNLP sentence detector model

Now on to the processing of language. Sentence detection simply refers to the basic process of taking a text and identifying the character positions that indicate sentence breaks. As a running example, we’ll use the first several sentences from the Penn Treebank.

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

Note that the “.” character is not a reliable indicator of the end of a sentence. While one can build a regular-expression-based sentence detector, machine-learned models are typically used to figure this out, trained on a reasonable number of example sentences identified as such by a human.
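To see why a simple rule falls short, here is a throwaway sketch (not part of Scalabha or OpenNLP) of a naive regex splitter. On the running example it wrongly breaks after “Mr.” and “Nov.”, which is exactly the kind of error a trained model learns to avoid.

object NaiveSentenceSplitter {

  // Break after '.', '?', or '!' whenever it is followed by whitespace.
  def split(text: String): Array[String] =
    text.split("""(?<=[.?!])\s+""")

  def main(args: Array[String]) {
    val example = "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group."
    // Incorrectly yields two "sentences": "Mr." and the remainder.
    split(example).foreach(println)
  }
}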

Roughly and somewhat crudely speaking, a machine learned model is a set of features that are associated with real-valued weights which have been determined from some training material. Once these weights have been learned, the model can be saved and reused (e.g. see the classification homework for Applied Text Analysis).
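To make that slightly more concrete, here is a purely schematic sketch; the feature names and weights below are made up and have nothing to do with OpenNLP’s actual feature set. The point is just that a model is essentially a map from features to learned weights, and scoring an example amounts to summing the weights of the features that fire on it (a real maxent model then normalizes such scores into probabilities over the possible outcomes).

// Hypothetical weights for deciding whether a "." ends a sentence.
val weights = Map(
  "previous-token=Mr." -> -2.3,
  "next-token-capitalized" -> -0.4,
  "next-token-starts-paragraph" -> 1.1
)

// Score = sum of the weights of the features active for this example.
def score(activeFeatures: Seq[String]): Double =
  activeFeatures.map(f => weights.getOrElse(f, 0.0)).sum

score(Seq("previous-token=Mr.", "next-token-capitalized"))  // -2.7: probably not a boundary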

OpenNLP has pretrained models available for several NLP tasks, including sentence detection. Note also that there is an effort I’m heading to make it possible to distribute and, where possible, rebuild models — see the OpenNLP Models Github repository.

We want to do English sentence detection, so the model we need right now is the en | Sentence Detector. Rather than putting it in some random place on your computer, we’ll add it as part of the Scalabha build system and exploit this to simplify the loading of models (more on this later). Recall that the $SCALABHA_DIR/src/main/scala directory is where the actual code of Scalabha is kept (and is also where you can add additional code to do your own tasks, as covered in the previous tutorials). If you look at the $SCALABHA_DIR/src/main directory, you’ll see an additional resources directory. Go there and list the directory contents:

$ cd $SCALABHA_DIR/src/main/resources
$ ls
log4j.properties

All that is there now is a properties file that defines default logging behavior (which is a good way to output debugging information, e.g. as it is done in the opennlp.scalabha.cluster package used in the clustering homework of Applied Text Analysis). What is very nice about the resources directory is that any files in it are accessible in the classpath of the application we are building. That won’t make total sense right away, but it will be clear as we go along — the end result is that it simplifies a number of things a great deal, so bear with me.

What we are going to do now is place the sentence detector model in a subdirectory of resources that will give us access to it, and also organize things for future additions (wrt languages and systems). So, do the following:

$ cd $SCALABHA_DIR/src/main/resources
$ mkdir -p lang/eng/opennlp
$ cd lang/eng/opennlp/
$ wget http://opennlp.sourceforge.net/models-1.5/en-sent.bin
--2012-04-10 12:24:42--  http://opennlp.sourceforge.net/models-1.5/en-sent.bin
Resolving opennlp.sourceforge.net... 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98533 (96K) [application/octet-stream]
Saving to: `en-sent.bin'

100%[======================================>] 98,533       411K/s   in 0.2s

2012-04-10 12:24:43 (411 KB/s) - `en-sent.bin' saved [98533/98533]

Note: the last command uses the program wget, which may not be available on your machine. If that is the case, you can download en-sent.bin in your browser (using the link given after wget above) and move it to the directory $SCALABHA_DIR/src/main/resources/lang/eng/opennlp. (Better yet, install wget since it is so useful…)

Status check: you should now see en-sent.bin when you do the following:

$ ls $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
en-sent.bin

Using the sentence detector

Let’s now use the model! That requires creating an example application that will read in the model, construct a sentence detector object from it, and then apply it to some example text. Do the following:

$ touch $SCALABHA_DIR/src/main/scala/opennlp/scalabha/tag/OpenNlpTagger.scala

This creates an empty file at that location, which you should now open in a text editor. Add the following Scala code (explained below) to that file:

package opennlp.scalabha.tag

object OpenNlpTagger {
  import opennlp.tools.sentdetect.SentenceDetectorME
  import opennlp.tools.sentdetect.SentenceModel

  lazy val sentenceDetector =
    new SentenceDetectorME(
      new SentenceModel(
        this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

  def main (args: Array[String]) {
    val test = io.Source.fromFile(args(0)).mkString
    sentenceDetector.sentDetect(test).foreach(println)
  }

}

Here are the relevant bits of explanation needed to understand what is going on. We need to import the SentenceDetectorME and SentenceModel classes (you should verify that you can find them in the OpenNLP API). The former is a class for sentence detectors that are based on trained maximum entropy models, and the latter is for holding such models. We then must create our sentence detector. This is where we get the advantage of having put it into the resources directory of Scalabha. We obtain it by getting the Class of the object (via this.getClass) and then using the getResourceAsStream method of the Class class. That’s a bit meta, but it boils down to enabling you to just follow this recipe for getting the resource. The return value of getResourceAsStream is an InputStream, which is what is needed to construct a SentenceModel.

Once we have a SentenceModel, it can be used to create a SentenceDetectorME. Note that the sentenceDetector object is declared as a lazy val. By doing this, the model is only loaded when we first need it. For a small program like this one, that doesn’t matter much, but in a larger system with many components, using lazy vals allows the application to start up much more quickly and then load things like models on demand. (You’ll actually see a nice, concrete example of this by the end of the tutorial.) In general, using lazy vals is a good idea.
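If lazy vals are new to you, here is a tiny standalone sketch (independent of OpenNLP) showing the deferred initialization; the “loading” message only appears when the value is first touched.

object LazyValDemo {

  // The right-hand side does not run until `expensive` is first accessed.
  lazy val expensive: String = {
    println("loading the model...")
    "model contents"
  }

  def main(args: Array[String]) {
    println("application started")  // printed immediately
    println(expensive)              // triggers "loading the model...", then prints the value
    println(expensive)              // already initialized; no second load
  }
}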

We then just need to get some text and use the sentence detector. The application gets a file name from the command line and then reads in its contents. The sentence detector has a method sentDetect (see the API) that takes a String and returns an Array[String], where each element of the Array is a sentence. So, we run sentDetect on the input text and then print out each sentence.

Once you have added the above code to OpenNlpTagger.scala, you should compile in SBT (I recommend using ~compile so that it compiles every time you make a change). Then, do the following:

$ cd /tmp
$ echo "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate." > vinken.txt
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

So, the model does perfectly on these sentences (but don’t expect it to do quite so well on other domains, such as Twitter). We are now ready to do the next step of splitting up the characters in each sentence into tokens.

Tokenizing

Once we have identified the sentences, we need to tokenize them to turn each one into a sequence of tokens, where each token is a symbol or word (conforming to some predefined notion of what counts as a “word”). For example, the tokens for the first sentence of the running example are the following, with tokens separated by spaces:

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

Most NLP tools then build on these units.

To enable tokenization, we must first make the English tokenizer available as a resource in Scalabha.

$ cd $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
$ wget http://opennlp.sourceforge.net/models-1.5/en-token.bin
--2012-04-10 14:21:14--  http://opennlp.sourceforge.net/models-1.5/en-token.bin
Resolving opennlp.sourceforge.net... 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 439890 (430K) [application/octet-stream]
Saving to: `en-token.bin'

100%[========================================================================>] 439,890      592K/s   in 0.7s

2012-04-10 14:21:16 (592 KB/s) - `en-token.bin' saved [439890/439890]

Then, change OpenNlpTagger.scala to have the following contents.

package opennlp.scalabha.tag

object OpenNlpTagger {
  import opennlp.tools.sentdetect.SentenceDetectorME
  import opennlp.tools.sentdetect.SentenceModel
  import opennlp.tools.tokenize.TokenizerME
  import opennlp.tools.tokenize.TokenizerModel

  lazy val sentenceDetector =
    new SentenceDetectorME(
      new SentenceModel(
        this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

  lazy val tokenizer =
    new TokenizerME(
      new TokenizerModel(
        this.getClass.getResourceAsStream("/lang/eng/opennlp/en-token.bin")))

  def main (args: Array[String]) {
    val test = io.Source.fromFile(args(0)).mkString
    val sentences = sentenceDetector.sentDetect(test)
    val tokenizedSentences = sentences.map(tokenizer.tokenize(_))
    tokenizedSentences.foreach(tokens => println(tokens.mkString(" ")))
  }

}

The process is very similar to what was done for the sentence detector. The only difference is that we now use the tokenizer’s tokenize method on each sentence. This method returns an Array[String], where each element is a token. We thus map the Array[String] of sentences to the Array[Array[String]] of tokenizedSentences. Simple!

Make sure to test that everything is working.

$ cd /tmp
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .

Now that we have these tokens, the input is ready for part-of-speech tagging.

Part-of-speech tagging

Part-of-speech (POS) tagging involves identifying whether each token is a noun, verb, determiner, and so on. Some part-of-speech tag sets have more detail, such as NN for a singular noun and NNS for a plural one. See the previous tutorial on iteration for more details and pointers.

The OpenNLP POS tagger is trained on the Penn Treebank, so it uses that tagset. As with the other models, we must download it and place it in the resources directory.

$ cd $SCALABHA_DIR/src/main/resources/lang/eng/opennlp
$ wget http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
--2012-04-10 14:31:33--  http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
Resolving opennlp.sourceforge.net... 216.34.181.96
Connecting to opennlp.sourceforge.net|216.34.181.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5696197 (5.4M) [application/octet-stream]
Saving to: `en-pos-maxent.bin'

100%[========================================================================>] 5,696,197    671K/s   in 8.2s

2012-04-10 14:31:42 (681 KB/s) - `en-pos-maxent.bin' saved [5696197/5696197]

Then, update OpenNlpTagger.scala to have the following contents, which produce some additional output beyond what you saw previously.

package opennlp.scalabha.tag

object OpenNlpTagger {

  import opennlp.tools.sentdetect.SentenceDetectorME
  import opennlp.tools.sentdetect.SentenceModel
  import opennlp.tools.tokenize.TokenizerME
  import opennlp.tools.tokenize.TokenizerModel
  import opennlp.tools.postag.POSTaggerME
  import opennlp.tools.postag.POSModel

  lazy val sentenceDetector =
    new SentenceDetectorME(
      new SentenceModel(
        this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

  lazy val tokenizer =
    new TokenizerME(
      new TokenizerModel(
        this.getClass.getResourceAsStream("/lang/eng/opennlp/en-token.bin")))

  lazy val tagger =
    new POSTaggerME(
      new POSModel(
        this.getClass.getResourceAsStream("/lang/eng/opennlp/en-pos-maxent.bin")))

def main (args: Array[String]) {

  val test = io.Source.fromFile(args(0)).mkString

  println("\n*********************")
  println("Showing sentences.")
  println("*********************")
  val sentences = sentenceDetector.sentDetect(test)
  sentences.foreach(println)

  println("\n*********************")
  println("Showing tokens.")
  println("*********************")
  val tokenizedSentences = sentences.map(tokenizer.tokenize(_))
  tokenizedSentences.foreach(tokens => println(tokens.mkString(" ")))

  println("\n*********************")
  println("Showing POS.")
  println("*********************")
  val postaggedSentences = tokenizedSentences.map(tagger.tag(_))
  postaggedSentences.foreach(postags => println(postags.mkString(" ")))

  println("\n*********************")
  println("Zipping tokens and tags.")
  println("*********************")
  val tokposSentences =
    tokenizedSentences.zip(postaggedSentences).map { case(tokens, postags) =>
      tokens.zip(postags).map { case(tok,pos) => tok + "/" + pos }
    }
  tokposSentences.foreach(tokposSentence => println(tokposSentence.mkString(" ")))

  }

}

Everything is as before, so it should be pretty much self-explanatory. Just note that the tagger’s tag method takes a token sequence (Array[String], written as String[] in OpenNLP’s Javadoc) as its input and it returns an Array[String] of the tags for each token. Thus, when we output the postaggedSentences in the “Showing POS” part, it prints only the tags. We can then bring the tokens and their corresponding tags together by zipping the tokenizedSentences with the postaggedSentences and then zipping the word and POS tokens in each sentence together, as shown in the “Zipping tokens and tags” portion.
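If the zip operations are unfamiliar, here is what they do on a tiny made-up fragment; the values in the comments are what a Scala REPL would display.

val tokens = Array("the", "board")
val postags = Array("DT", "NN")

// Pair up corresponding elements of the two arrays.
tokens.zip(postags)
// Array((the,DT), (board,NN))

// Then join each (token, tag) pair into a single "token/tag" string.
tokens.zip(postags).map { case (tok, pos) => tok + "/" + pos }
// Array(the/DT, board/NN)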

When this is run, you should get the following output.

$ cd /tmp
$ scalabha run opennlp.scalabha.tag.OpenNlpTagger vinken.txt

*********************
Showing sentences.
*********************
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

*********************
Showing tokens.
*********************
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .

*********************
Showing POS.
*********************
NNP NNP , CD NNS JJ , MD VB DT NN IN DT JJ NN NNP CD .
NNP NNP VBZ NN IN NNP NNP , DT JJ NN NN .
NNP NNP , CD NNS JJ CC JJ NN IN NNP NNP NNP NNP , VBD VBN DT NN IN DT JJ JJ NN .

*********************
Zipping tokens and tags.
*********************
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, the/DT Dutch/JJ publishing/NN group/NN ./.
Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC former/JJ chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP ,/, was/VBD named/VBN a/DT director/NN of/IN this/DT British/JJ industrial/JJ conglomerate/NN ./.

Note: You’ll probably notice a pause just after it says “Showing POS” — that is because the tagger is defined as a lazy val, so the model is loaded at that time since it is the first point where it is needed. Try removing “lazy” from the declarations of sentenceDetector, tokenizer, and tagger, recompiling and then running it again — you’ll now see that the pause before anything is done is greater, but that once it starts processing everything goes very quickly. That’s a fairly good way of seeing part of why lazy values are quite handy.

And that’s it. To see the output on a longer example, you can run it on any text you like, e.g. the ones in Scalabha’s data directory, like the Federalist Papers:

$ scalabha run opennlp.scalabha.tag.OpenNlpTagger $SCALABHA_DIR/data/cluster/federalist/federalist.txt

Now, as an exercise, turn the standalone application, defined as the object OpenNlpTagger, into a class, OpenNlpTagger, that takes a raw text as input (not via the command line, but as an argument to a method) and returns a List[List[(String,String)]] containing the sentences, where each sentence is a sequence of (token, tag) pairs. For example, after running it on the Vinken text, you should produce the following (one possible skeleton is sketched after the expected output).

List(List((Pierre,NNP), (Vinken,NNP), (,,,), (61,CD), (years,NNS), (old,JJ), (,,,), (will,MD), (join,VB), (the,DT), (board,NN), (as,IN), (a,DT), (nonexecutive,JJ), (director,NN), (Nov.,NNP), (29,CD), (.,.)), List((Mr.,NNP), (Vinken,NNP), (is,VBZ), (chairman,NN), (of,IN), (Elsevier,NNP), (N.V.,NNP), (,,,), (the,DT), (Dutch,JJ), (publishing,NN), (group,NN), (.,.)), List((Rudolph,NNP), (Agnew,NNP), (,,,), (55,CD), (years,NNS), (old,JJ), (and,CC), (former,JJ), (chairman,NN), (of,IN), (Consolidated,NNP), (Gold,NNP), (Fields,NNP), (PLC,NNP), (,,,), (was,VBD), (named,VBN), (a,DT), (director,NN), (of,IN), (this,DT), (British,JJ), (industrial,JJ), (conglomerate,NN), (.,.)))
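If you would rather work this out yourself, skip past this sketch. Otherwise, here is one possible shape for the solution; the method name tag is just a choice here, nothing in Scalabha or OpenNLP requires it.

class OpenNlpTagger {

  import opennlp.tools.sentdetect.{SentenceDetectorME, SentenceModel}
  import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}
  import opennlp.tools.postag.{POSTaggerME, POSModel}

  // Models are loaded lazily, exactly as in the object version above.
  lazy val sentenceDetector =
    new SentenceDetectorME(new SentenceModel(
      this.getClass.getResourceAsStream("/lang/eng/opennlp/en-sent.bin")))

  lazy val tokenizer =
    new TokenizerME(new TokenizerModel(
      this.getClass.getResourceAsStream("/lang/eng/opennlp/en-token.bin")))

  lazy val tagger =
    new POSTaggerME(new POSModel(
      this.getClass.getResourceAsStream("/lang/eng/opennlp/en-pos-maxent.bin")))

  // Detect sentences, tokenize each one, tag the tokens, and pair each
  // token with its tag.
  def tag(text: String): List[List[(String, String)]] =
    sentenceDetector.sentDetect(text).toList.map { sentence =>
      val tokens = tokenizer.tokenize(sentence)
      tokens.zip(tagger.tag(tokens)).toList
    }

  // Usage (e.g. from some other object's main method):
  //   val nlpTagger = new OpenNlpTagger
  //   val tagged: List[List[(String, String)]] = nlpTagger.tag(rawText)
}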

Spans

You may notice that the sentence detector and tokenizer APIs both include methods that return Array[Span] (note: Span[] in OpenNLP’s Javadoc). These are preferable in many contexts since they don’t lose information from the original text, unlike the methods we used above, which turned the original text into sequences of substrings. Spans just record the character offsets at which the sentences start and end, or at which tokens start and end. This is quite handy for further processing and is what is generally used in non-trivial applications. But, for many cases, the methods that return Array[String] will be just fine and require learning a bit less.
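For illustration, here is a minimal sketch of the span-based variant, assuming the sentenceDetector and tokenizer values defined earlier are in scope; sentPosDetect and tokenizePos return character offsets rather than strings, and the offsets let you recover the exact surface text.

import opennlp.tools.util.Span

val text = io.Source.fromFile("vinken.txt").mkString

// Sentence spans are offsets into the raw text; token spans are offsets
// into the sentence they were found in.
val sentenceSpans: Array[Span] = sentenceDetector.sentPosDetect(text)
sentenceSpans.foreach { sspan =>
  val sentence = text.substring(sspan.getStart, sspan.getEnd)
  val tokenSpans: Array[Span] = tokenizer.tokenizePos(sentence)
  println(tokenSpans.map(t => sentence.substring(t.getStart, t.getEnd)).mkString(" "))
}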

Conclusion

This tutorial has taken you from a version of Scalabha that does not have the OpenNLP Tools API available to a version that does, along with several pretrained models and an example application that uses the API for part-of-speech tagging. You can of course follow similar recipes for bringing in other libraries and using them in your code, so this setup gives you a lot of power and is easy to use once you’ve done it a few times. If you have any trouble, or want to check your work against a definitely working version, get Scalabha v0.2.4, which differs from v0.2.3 primarily in the changes described in this tutorial.

A final note: you may be wondering what the heck OpenNLP is, given that Scalabha’s package names start with opennlp.scalabha, but we were adding the OpenNLP Tools as a dependency. Basically, Gann Bierner and I started OpenNLP in 1999, and part of the goal was to provide a high-level organizational domain name so that we could ensure uniqueness in package names. So, we have opennlp.tools, opennlp.maxent, opennlp.scalabha, and there are others. These are thus clearly different, in terms of their fully qualified names, from foo.tools, foo.maxent, and so on. So, when I started Scalabha, I used opennlp.scalabha (though in all likelihood, no one else would pick scalabha as the top level of a package name). Nonetheless, when one speaks of OpenNLP generally, it usually refers to the OpenNLP Tools, the first of the projects in the OpenNLP “family”.

Copyright 2012 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to http://www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.